
    TamilTB: An Effort Towards Building a Dependency Treebank for Tamil

    Annotated corpora such as treebanks are important for the development of parsers and language applications, as well as for understanding the language itself. Only very few languages possess these scarce resources. In this paper, we describe our effort to syntactically annotate a small corpus (600 sentences) of Tamil. Our annotation follows the Prague Dependency Treebank (PDT 2.0) and consists of two layers: (i) the morphological layer (m-layer) and (ii) the analytical layer (a-layer). For both layers, we introduce annotation schemes: positional tagging for the m-layer, and dependency relations (together with guidelines for how dependency structures should be drawn) for the a-layer. Finally, we evaluate our corpus on tagging and parsing tasks using well-known taggers and parsers, and discuss some general issues in annotation for Tamil.
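    The positional tagging mentioned for the m-layer can be illustrated with a small sketch: in PDT-style positional tagsets, each character position of a fixed-width tag encodes one morphological category. The field inventory below is invented for illustration and is not the actual TamilTB tagset.

    ```python
    # Sketch: decoding a PDT-style positional morphological tag into named fields.
    # The six fields below are illustrative placeholders, not the TamilTB scheme.
    FIELDS = ["pos", "subpos", "case", "number", "person", "tense"]

    def decode_tag(tag):
        """Map each character of a fixed-width positional tag to its field.

        By convention, '-' in a position means the category does not apply.
        """
        if len(tag) != len(FIELDS):
            raise ValueError(f"expected a {len(FIELDS)}-character tag, got {tag!r}")
        return {field: ch for field, ch in zip(FIELDS, tag)}

    # Example: a hypothetical noun tag with accusative case, singular number.
    print(decode_tag("NNAS3-"))
    ```

    The appeal of positional tags is that taggers can treat them as atomic labels while downstream tools can still recover individual categories by position.
    
    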

    Replication issues in syntax-based aspect extraction for opinion mining

    Reproducing experiments is an important instrument for validating previous work and building upon existing approaches, and it has been tackled numerous times across different areas of science. In this paper, we present an empirical replicability study of three well-known algorithms for syntax-centric aspect-based opinion mining. We show that reproducing results remains a difficult endeavor, mainly due to the lack of detail regarding preprocessing and parameter settings, as well as the absence of available implementations that would clarify these details. We consider these to be important threats to the validity of research in the field, especially when compared to other NLP problems where public datasets and code availability are critical validity components. We conclude by encouraging code-based research, which we believe has a key role in helping researchers better understand the state of the art and generate continuous advances.

    Comment: Accepted in the EACL 2017 SR

    Yet Another Ranking Function for Automatic Multiword Term Extraction

    Term extraction is an essential task in domain knowledge acquisition. We propose two new measures to extract multiword terms from domain-specific text. The first measure combines linguistic and statistical information; the second is graph-based, allowing assessment of the importance of a multiword term within a domain. Existing measures often address some (but not all) of the problems related to term extraction, e.g., noise, silence, low frequency, large corpora, and the complexity of the multiword term extraction process. Instead, we focus on managing the entire set of problems, e.g., detecting rare terms and overcoming the low-frequency issue. We show that the two proposed measures outperform the precision results previously reported for automatic multiword term extraction by comparing them with state-of-the-art reference measures.
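    As a rough illustration of the graph-based idea (not the paper's actual measure), candidate multiword terms can be ranked by linking terms that share a component word and running a few PageRank-style iterations over the resulting graph; the edge definition, damping factor, and iteration count below are illustrative assumptions.

    ```python
    # Sketch of graph-based ranking of candidate multiword terms.
    # Nodes are candidate terms; edges link terms sharing a component word;
    # importance is estimated with a few PageRank iterations.
    def rank_terms(candidates, damping=0.85, iterations=20):
        words = {t: set(t.split()) for t in candidates}
        neighbors = {
            t: [u for u in candidates if u != t and words[t] & words[u]]
            for t in candidates
        }
        score = {t: 1.0 / len(candidates) for t in candidates}
        for _ in range(iterations):
            new = {}
            for t in candidates:
                incoming = sum(
                    score[u] / len(neighbors[u]) for u in neighbors[t]
                )
                new[t] = (1 - damping) / len(candidates) + damping * incoming
            score = new
        return sorted(candidates, key=score.get, reverse=True)

    terms = ["machine translation", "translation memory", "neural network",
             "machine learning", "translation quality"]
    print(rank_terms(terms))
    ```

    In this toy example, "machine translation" ranks first because it is connected to the most other candidates, while the isolated "neural network" ranks last; a real measure would weight edges by corpus statistics rather than mere word sharing.
    
    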

    Chapter Bibliography

    authored support system; contextual machine translation; controlled document authoring; controlled language; document structure; terminology management; translation technology; usability evaluation

    Building Web Corpora for Minority Languages

    Web corpora creation for minority languages that do not have their own top-level Internet domain is no trivial matter. Web pages in such minority languages often contain text and links to pages in the dominant language of the country. When building corpora in specific languages, one has to decide how, and at which stage, to ensure that the texts gathered are in the desired language. In the "Finno-Ugric Languages and the Internet" (Suki) project, we created web corpora for Uralic minority languages using web crawling combined with a language identification system that identified the language while crawling. In addition, we used language set identification and crowdsourcing before making sentence corpora out of the downloaded texts. In this article, we describe a strategy for collecting textual material from the Internet for minority languages, based on the experience we gained during the Suki project.

    Peer reviewed
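    Crawl-time language identification of the kind described above can be sketched with a toy character-trigram classifier in the style of Cavnar and Trenkle's out-of-place measure; the Suki project used its own dedicated identifier, so the profiles and seed texts below are simplified stand-ins.

    ```python
    # Sketch: filter crawled pages by language using character-trigram profiles.
    from collections import Counter

    def trigram_profile(text, top=300):
        """Return the most frequent character trigrams of a text, ranked."""
        text = " ".join(text.lower().split())
        grams = Counter(text[i:i + 3] for i in range(len(text) - 2))
        return [g for g, _ in grams.most_common(top)]

    def out_of_place(profile, reference):
        """Cavnar-Trenkle distance: sum of rank displacements (lower = closer)."""
        rank = {g: i for i, g in enumerate(reference)}
        return sum(rank.get(g, len(reference)) for g in profile)

    def identify(text, reference_profiles):
        """Pick the reference language whose profile is closest to the text."""
        profile = trigram_profile(text)
        return min(reference_profiles,
                   key=lambda lang: out_of_place(profile, reference_profiles[lang]))

    # Tiny seed texts stand in for real training corpora.
    seeds = {
        "en": "the quick brown fox jumps over the lazy dog and the cat sits on the mat",
        "fi": "kissa istuu pöydällä ja koira juoksee pihalla kun aurinko paistaa",
    }
    profiles = {lang: trigram_profile(text) for lang, text in seeds.items()}
    print(identify("the dog and the fox sat on the mat", profiles))
    ```

    In a crawler, such a check would run on each downloaded page so that pages in the dominant language of the country can be discarded (or queued separately) before they pollute the minority-language corpus.
    
    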