Gemma Boleda

resources

license

The data linked from this page is made available under the Creative Commons BY-SA (Attribution-ShareAlike) license unless specified otherwise in the documentation. By downloading the data, you acknowledge the terms and conditions of the license. If you use the resources, please cite the papers indicated in the respective pieces of documentation.

data

ManyNames (Silberer, Zarrieß, Westera, Boleda, COLING 2020; Silberer, Zarrieß, Boleda, LREC 2020). 25,000 objects in images, each associated with 36 human-provided names. Download it from here.
Instantiation dataset (updated) and new dataset of category relatedness (Westera, Gupta, Boleda, Padó, Cognitive Science 2021). In comparison to the EACL 2017 version, the instantiation dataset includes a new category of challenging confounders (but the categories are restricted to those with at least 5 instances). The relatedness dataset covers 981 pairs of categories (like BISHOP-EPISTLE): 50 pairs of categories from each of the sparser domains (Artifact, Act, Other, Communication) and 300 pairs from each of the more populated domains (Object, Location, Person) among the domains covered in the paper. Both the data and the code for the paper's results are available here.
Instantiation / hypernymy datasets (Boleda, Gupta, Padó, EACL 2017). The instantiation dataset, encoding category membership (e.g. Mumbai - CITY), contains 577 categories and 4750 entities. Download them from here. An updated version of the instantiation dataset is also available (see the entry above).
The LAMBADA dataset (Paperno, Kruszewski, Lazaridou, Pham, Bernardi, Pezzelle, Baroni, Boleda, and Fernández, ACL 2016) for word prediction requiring a broad discourse context. Download it from here.
Intensional / non-intensional adjective modification dataset (Boleda, Baroni, Pham, McNally, IWCS 2013). Download it from here.
Regular polysemy evaluation dataset (Boleda, Padó, Utt, *SEM 2012). Download it from here.
Intersective, subsective, and intensional adjective-noun phrases (Boleda, Vecchi, Cornudella, McNally, EMNLP 2012). Download it from here.
Datasets for color terms (Bruni, Boleda, Baroni, Tran, ACL 2012). Download them from here.
The Database of Catalan Adjectives (DCA) (Sanromà and Boleda, LREC 2012). The DCA consists of 2,296 Catalan adjective lemmata enriched with morphological, syntactic and semantic information. Download it from here.

corpora

Wikicorpus (Reese, Boleda, Cuadros, Padró, Rigau, LREC 2012). The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words. For more information and download options, visit the project's page.
CUCWeb (Boleda, Bott, Castillo, Meza, Badia, López, EACL WaC workshop 2006). CUCWeb is a 166-million-word corpus for Catalan built by crawling the Web. If you are interested in obtaining it, get in touch with me.

tools

POS-Tagger for Old Spanish (Sánchez-Marco, Boleda, Padró, LaTeCH 2011). Part of the open source suite of language analyzers FreeLing.
CatCG (Alsina, Badia, Boleda, Bott, Gil, Quixal, Valentín, LREC 2002). CatCG is a Constraint Grammar tagger and shallow parser for Catalan. Documentation (in Catalan) here. If you are looking for a freely available NLP tool to process Catalan, consider FreeLing.