Presented at: The Sixth International Language Resources and Evaluation Conference (LREC2008)
Webpage: http://www.lrec-conf.org/proceedings/lrec2008/pdf/89_paper.pdf
The JOS morphosyntactic resources for Slovene consist of the specifications, a lexicon, and two corpora: jos100k, a 100,000-word balanced monolingual sampled corpus annotated with hand-validated morphosyntactic descriptions (MSDs) and lemmas, and jos1M, a 1-million-word partially hand-validated corpus. Both corpora were sampled from the 600M-word Slovene reference corpus FidaPLUS. The JOS resources have a standardised encoding, with MULTEXT-East-type morphosyntactic specifications and corpora encoded according to the Text Encoding Initiative Guidelines P5. The JOS resources are available as a dataset for research under a Creative Commons licence and are meant to facilitate the development of human language technologies (HLT) for Slovene.
Keywords: Corpus (creation, annotation, etc.), Standards for LRs, Tagging, Linguistics
Resource URI on the dog food server: http://data.semanticweb.org/conference/lrec/2008/papers/89