19,545 research outputs found
Web 2.0, language resources and standards to automatically build a multilingual named entity lexicon
This paper proposes to advance in the current state-of-the-art of automatic Language Resource (LR) building by taking into consideration three elements: (i) the knowledge available in existing LRs, (ii) the vast amount of information available from the collaborative paradigm that has emerged from the Web 2.0 and (iii) the use of standards to improve interoperability. We present a case study in which a set of LRs for diļ¬erent languages (WordNet for English and Spanish and Parole-Simple-Clips for Italian) are
extended with Named Entities (NE) by exploiting Wikipedia and the aforementioned LRs. The practical result is a multilingual NE lexicon connected to these LRs and to two ontologies: SUMO and SIMPLE. Furthermore, the paper addresses an important problem which aļ¬ects the Computational Linguistics area in the present, interoperability, by making use of the ISO LMF standard to encode this lexicon. The diļ¬erent steps of the procedure (mapping, disambiguation, extraction, NE identiļ¬cation and postprocessing) are comprehensively explained and evaluated. The resulting resource contains 974,567, 137,583 and 125,806 NEs for English, Spanish and Italian respectively. Finally, in order to check the usefulness of the constructed resource, we apply it into a state-of-the-art Question Answering system and evaluate its impact; the NE lexicon improves the systemās accuracy by 28.1%. Compared to previous approaches to build NE repositories, the current proposal represents a step forward in terms of automation, language independence, amount of NEs acquired and richness of the information represented
Multilingual search for cultural heritage archives via combining multiple translation resources
The linguistic features of material in Cultural Heritage (CH) archives may be in various languages requiring a facility for effective multilingual search. The specialised
language often associated with CH content introduces problems for automatic translation to support search applications. The MultiMatch project is focused on enabling
users to interact with CH content across different media types and languages. We present results from a MultiMatch study exploring various translation techniques for
the CH domain. Our experiments examine translation techniques for the English language CLEF 2006 Cross-Language
Speech Retrieval (CL-SR) task using Spanish, French and German queries. Results compare effectiveness of our query
translation against a monolingual baseline and show improvement when combining a domain-specific translation lexicon with a standard machine translation system
Simple data-driven context-sensitive lemmatization
Lemmatization for languages with rich inflectional morphology is one of the basic, indispensable steps in a language processing pipeline. In this paper we present a simple data-driven context-sensitive approach to lemmatizating word forms in running text. We treat lemmatization as a classification task for Machine Learning, and automatically induce class labels. We achieve this by computing a Shortest Edit Script (SES) between reversed input and output strings. A SES describes the transformations that have to be applied to the
input string (word form) in order to convert it to the output string (lemma). Our approach shows competitive performance on a range of typologically different languages
External Lexical Information for Multilingual Part-of-Speech Tagging
Morphosyntactic lexicons and word vector representations have both proven
useful for improving the accuracy of statistical part-of-speech taggers. Here
we compare the performances of four systems on datasets covering 16 languages,
two of these systems being feature-based (MEMMs and CRFs) and two of them being
neural-based (bi-LSTMs). We show that, on average, all four approaches perform
similarly and reach state-of-the-art results. Yet better performances are
obtained with our feature-based models on lexically richer datasets (e.g. for
morphologically rich languages), whereas neural-based results are higher on
datasets with less lexical variability (e.g. for English). These conclusions
hold in particular for the MEMM models relying on our system MElt, which
benefited from newly designed features. This shows that, under certain
conditions, feature-based approaches enriched with morphosyntactic lexicons are
competitive with respect to neural methods
- ā¦