The data linked from this page is made available under the
Creative Commons BY-SA (Attribution-ShareAlike) license
unless specified otherwise in the documentation. By downloading the data, you
acknowledge the terms and conditions of the license. If you use
the resources, please cite the papers indicated in the respective documentation.
The LAMBADA dataset (Paperno, Kruszewski, Lazaridou, Pham, Bernardi, Pezzelle, Baroni, Boleda, and Fernández, ACL 2016) for word prediction requiring a broad discourse context. Download it from its webpage.
Intersective, subsective, and intensional adjective-noun phrases (Boleda, Vecchi, Cornudella, McNally, EMNLP
2012). Download it here.
Datasets for color terms (Bruni, Boleda, Baroni, Tran, ACL 2012). Download them here.
The Database of Catalan Adjectives (DCA) (Sanromà and Boleda, LREC 2012). The DCA consists of 2,296 Catalan adjective lemmata enriched with morphological, syntactic and semantic information.
Download it from
its page at the ACL data and code repository.
Wikicorpus (Reese, Boleda, Cuadros, Padró, Rigau, LREC 2012). The Wikicorpus is a trilingual corpus (Catalan, Spanish, English)
that contains large portions of Wikipedia (based on a 2006 dump) and
has been automatically enriched with linguistic information. In its
present version, it contains over 750 million words.
For more information and download options, visit the project's page. It is also available via a web interface here.
CUCWeb (Boleda, Bott, Castillo, Meza, Badia, López, EACL WaC workshop 2006). CUCWeb is a 166-million-word
corpus for Catalan built by
crawling the Web.
If you are interested in obtaining it, get in touch with me.
POS-Tagger for Old Spanish (Sánchez-Marco, Boleda, Padró, LaTeCH 2011). Part of the open source suite of language analyzers FreeLing.
CatCG (Alsina, Badia, Boleda, Bott, Gil, Quixal, Valentín, LREC 2002). CatCG is a Constraint Grammar tagger and shallow parser for Catalan. If you are looking for a freely available NLP tool to process Catalan, consider FreeLing.