146 research outputs found
TamilTB: An Effort Towards Building a Dependency Treebank for Tamil
Annotated corpora such as treebanks are important for the development of
parsers and language applications, as well as for understanding the language itself.
Only a few languages possess such resources. In this paper, we describe
our effort to syntactically annotate a small corpus (600 sentences) of the Tamil
language. Our annotation is similar to the Prague Dependency Treebank (PDT 2.0)
and consists of two levels or layers: (i) the morphological layer (m-layer) and (ii) the
analytical layer (a-layer). For both layers, we introduce annotation schemes, i.e., positional
tagging for the m-layer and dependency relations (and how dependency structures
should be drawn) for the a-layer. Finally, we evaluate our corpus on tagging and
parsing tasks using well-known taggers and parsers, and discuss some general issues
in annotation for the Tamil language.
Replication issues in syntax-based aspect extraction for opinion mining
Reproducing experiments is an important instrument for validating previous work
and building upon existing approaches. It has been tackled numerous times in
different areas of science. In this paper, we present an empirical
replicability study of three well-known algorithms for syntax-centric
aspect-based opinion mining. We show that reproducing results remains a
difficult endeavor, mainly due to the lack of details regarding preprocessing
and parameter settings, as well as the absence of available
implementations that would clarify these details. We consider these to be important
threats to the validity of research in the field, especially when compared to
other NLP problems where public datasets and code availability are critical
validity components. We conclude by encouraging code-based research, which we
believe has a key role in helping researchers better understand the
state of the art and generate continuous advances.
Comment: Accepted at the EACL 2017 SR
Yet Another Ranking Function for Automatic Multiword Term Extraction
Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from domain-specific text. The first measure combines linguistic and statistical information. The second is graph-based, allowing assessment of the importance of a multiword term within a domain. Existing measures often address some of the problems related to term extraction (but not all of them), e.g., noise, silence, low frequency, large corpora, and the complexity of the multiword term extraction process. Instead, we focus on managing the entire set of problems, e.g., detecting rare terms and overcoming the low-frequency issue. We show that the two proposed measures outperform precision results previously reported for automatic multiword term extraction by comparing them with state-of-the-art reference measures.
Chapter Bibliography
authored support system; contextual machine translation; controlled document authoring; controlled language; document structure; terminology management; translation technology; usability evaluation
Building Web Corpora for Minority Languages
Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how and at which stage to make sure the texts gathered are in the desired language. In the "Finno-Ugric Languages and the Internet" (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system in order to identify the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages. The strategy is based on the experiences we gained during the Suki project.
Peer reviewed
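The abstract does not specify the internals of the language identification system used during crawling. As a rough, hedged illustration of the general idea of filtering crawled text by language, the classic character-trigram out-of-place ranking (Cavnar and Trenkle style) can be sketched; the function names, the 200-trigram profile size, and the sample reference texts below are illustrative assumptions, not the Suki project's implementation:

```python
from collections import Counter

def trigram_profile(text, size=200):
    """Ranked list of the most frequent character trigrams in a text."""
    text = " ".join(text.lower().split())  # normalize whitespace
    grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
    return [g for g, _ in grams.most_common(size)]

def out_of_place_distance(profile, reference):
    """Sum of rank differences; unseen trigrams get a maximal penalty."""
    penalty = len(reference)
    ranks = {g: r for r, g in enumerate(reference)}
    return sum(ranks.get(g, penalty) for g in profile)

def identify(text, references):
    """Pick the reference language whose trigram profile is closest."""
    profile = trigram_profile(text)
    return min(references,
               key=lambda lang: out_of_place_distance(profile, references[lang]))
```

In a crawler, a filter of this kind would be applied to each downloaded page, keeping only pages whose identified language matches the target minority language; real systems use far larger training profiles and handle multilingual pages separately.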