The data linked from this page is made available under the
Creative Commons BY-SA (Attribution-ShareAlike) license
unless specified otherwise in the documentation. By downloading the data, you
acknowledge the terms and conditions of the license. If you use
the resources, please cite the papers indicated in the respective pieces of documentation.
ManyNames (Silberer, Zarrieß, Westera, Boleda, COLING 2020; Silberer, Zarrieß, Boleda, LREC 2020). 25,000 objects in images, each associated with 36 human-provided names. Download it from here.
Instantiation dataset (updated) and new dataset of category relatedness (Westera, M., A. Gupta, G. Boleda, S. Padó, Cognitive Science 2021). Compared to the EACL 2017 version, the instantiation dataset includes a new category of challenging confounders (but the categories are restricted to those with at least 5 instances). The relatedness dataset covers 981 pairs of categories (like BISHOP-EPISTLE): 50 pairs of categories from each of the sparser domains (Artifact, Act, Other, Communication) and 300 pairs from each of the more populated domains (Object, Location, Person) among the domains covered in the paper. Both the data and the code for the paper's results are available here.
Instantiation / hypernymy datasets (Boleda, Gupta, Padó, EACL 2017). The instantiation dataset, encoding category membership (e.g. Mumbai - CITY), contains 577 categories and 4750 entities. Download them from here. For the instantiation dataset, see also the updated version (entry above).
The LAMBADA dataset (Paperno, Kruszewski, Lazaridou, Pham, Bernardi, Pezzelle, Baroni, Boleda, and Fernández, ACL 2016) for word prediction requiring a broad discourse context. Download it from here.
Intensional / non-intensional modification dataset (Boleda, Baroni, Pham, McNally, IWCS 2013). Download it from here.
Regular polysemy evaluation dataset
(Boleda, Padó, Utt, *SEM 2012). Download it from here.
Intersective, subsective, and intensional adjective-noun phrases (Boleda, Vecchi, Cornudella, McNally, EMNLP
2012). Download it from here.
Datasets for color terms (Bruni, Boleda, Baroni, Tran, ACL 2012). Download them from here.
The Database of Catalan Adjectives (DCA) (Sanromà and Boleda, LREC 2012). The DCA consists of 2,296 Catalan adjective lemmata enriched with morphological, syntactic and semantic information.
Download it from
Wikicorpus (Reese, Boleda, Cuadros, Padró, Rigau, LREC 2012). The Wikicorpus is a trilingual corpus (Catalan, Spanish, English) that contains large portions of the Wikipedia (based on a 2006 dump) and has been automatically enriched with linguistic information. In its present version, it contains over 750 million words. For more information and download options, visit the project's page.
Available via a web interface here
(select corpus "Wikipèdia en català/castellà/anglès").
CUCWeb (Boleda, Bott, Castillo, Meza, Badia, López, EACL WaC workshop 2006). CUCWeb is a 166-million-word
corpus for Catalan built by crawling the Web.
If you are interested in obtaining it, get in touch with me.
Available via a web interface here (select corpus "Cátedra Telefónica corpus of the use of Catalan on the Web").
POS-Tagger for Old Spanish (Sánchez-Marco, Boleda, Padró, LaTeCH 2011). Part of the open source suite of language analyzers FreeLing.
CatCG (Alsina, Badia, Boleda, Bott, Gil, Quixal, Valentín, LREC 2002). CatCG is a Constraint Grammar tagger and shallow parser for Catalan. If you are looking for a freely available NLP tool to process Catalan, consider FreeLing.