Sentences and Documents in Native Language Identification
Starting from a wide set of linguistic features, we present the first in-depth feature analysis in two different Native Language Identification (NLI) scenarios. We compare the results obtained in a traditional NLI document classification task and in a newly introduced sentence classification task, investigating the different roles played by the considered features. Finally, we study the impact of a set of selected features extracted from the sentence classifier in document classification.
Towards using web-crawled data for domain adaptation in statistical machine translation
This paper reports on ongoing work focused on domain adaptation of statistical machine translation using domain-specific data obtained by domain-focused web crawling. We present a strategy for crawling monolingual and parallel data and their exploitation for testing, language modelling, and system tuning in a phrase-based machine translation framework. The proposed approach is evaluated on the domains of Natural Environment and Labour Legislation and two language pairs: English–French and English–Greek.
Complex Word Identification: Challenges in Data Annotation and System Performance
This paper revisits the problem of complex word identification (CWI)
following up the SemEval CWI shared task. We use ensemble classifiers to
investigate how well computational methods can discriminate between complex and
non-complex words. Furthermore, we analyze the classification performance to
understand what makes lexical complexity challenging. Our findings show that
most systems performed poorly on the SemEval CWI dataset, and one of the
reasons for that is the way in which human annotation was performed.
Comment: Proceedings of the 4th Workshop on NLP Techniques for Educational Applications (NLPTEA 2017)