
    Concept and entity grounding using indirect supervision

    Extracting and disambiguating entities and concepts is a crucial step toward understanding natural language text. In this thesis, we consider the problem of grounding concepts and entities mentioned in text to one or more knowledge bases (KBs). A well-studied scenario of this problem is the one in which documents are given in English and the goal is to identify concept and entity mentions and to find the corresponding entries the mentions refer to in Wikipedia. We extend this problem in two directions: first, we study identifying and grounding entities written in any language to the English Wikipedia; second, we investigate using multiple KBs which do not contain the rich textual and structural information Wikipedia does. These more involved settings pose additional challenges beyond those addressed in the standard English Wikification problem. Key among them is that no supervision is available to facilitate training machine learning models. The first extension, cross-lingual Wikification, introduces problems such as recognizing multilingual named entities mentioned in text, translating non-English names into English, and computing word similarity across languages. Since it is impossible to acquire manually annotated examples for all languages, building models for all languages in Wikipedia requires exploring indirect or incidental supervision signals which already exist in Wikipedia. For the second setting, we need to deal with the fact that most KBs do not contain the rich information Wikipedia has; consequently, the main supervision signal used to train Wikification rankers no longer exists. In this thesis, we show that supervision signals can be obtained by carefully examining the redundancy and relations between multiple KBs. By developing algorithms and models which harvest these incidental signals, we can achieve better performance on these tasks.
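
    To make the ranking setup concrete, the sketch below shows one common way a Wikification ranker can combine an anchor-text prior harvested from Wikipedia with contextual overlap. The ANCHOR_INDEX, the score weighting, and the toy mentions are illustrative assumptions, not the thesis's actual model.

```python
# Minimal sketch: candidate generation and ranking for Wikification using
# anchor-text statistics. ANCHOR_INDEX is a toy stand-in for counts of how
# often a surface string links to each English Wikipedia title.
ANCHOR_INDEX = {
    "Jordan": {"Michael_Jordan": 120, "Jordan": 300, "Jordan_River": 40},
}

def candidates(mention):
    """Candidate titles with prior P(title | mention) from anchor counts."""
    counts = ANCHOR_INDEX.get(mention, {})
    total = sum(counts.values()) or 1
    return {title: c / total for title, c in counts.items()}

def rank(mention, context_tokens, title_tokens):
    """Combine the anchor prior with a crude bag-of-words overlap between
    the mention's context and each candidate title's associated tokens."""
    scored = []
    for title, prior in candidates(mention).items():
        overlap = len(set(context_tokens) & title_tokens.get(title, set()))
        scored.append((prior + 0.3 * overlap, title))
    return max(scored, default=(0.0, None))[1]

# Toy usage: disambiguate "Jordan" in a basketball context.
ctx = ["scored", "points", "bulls", "basketball"]
ttl = {"Michael_Jordan": {"basketball", "bulls", "nba"},
       "Jordan": {"country", "amman"},
       "Jordan_River": {"river", "israel"}}
print(rank("Jordan", ctx, ttl))  # -> Michael_Jordan
```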

    Feature-based method for document alignment in comparable news corpora

    In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It contributes 4.1% and 8% to performance improvement over Pearson’s correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of 23.2% and 15.3% on the two sets of bilingual corpora, compared with a prior information retrieval-based method.
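
    As an illustration of the DFT-based term frequency distribution feature, the sketch below compares a term's day-by-day frequency series across the two sides of a comparable corpus, both by Pearson correlation and by the similarity of DFT magnitude spectra. The toy counts and the cosine combination are assumptions, not the paper's exact formulation.

```python
import numpy as np

def dft_feature(freq_series):
    """Magnitude spectrum of a term's day-by-day frequency series."""
    return np.abs(np.fft.rfft(np.asarray(freq_series, dtype=float)))

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

# Toy frequency series for a shared term (e.g. a translated name) across the
# English and Chinese sides of a comparable corpus, one count per day.
en_days = [0, 3, 7, 2, 0, 0, 1]
zh_days = [1, 2, 8, 3, 0, 1, 0]

pearson = float(np.corrcoef(en_days, zh_days)[0, 1])          # baseline feature
dft_sim = cosine(dft_feature(en_days), dft_feature(zh_days))  # DFT-based feature
print(f"Pearson={pearson:.3f}  DFT-similarity={dft_sim:.3f}")
```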

    Bi-text: systemic approach and textometry

    Following a lexicogrammar approach (Gledhill, 2011), this research explores characteristic attractions of significantly over-represented linguistic patterns in corresponding text zones (corpus parts) to reveal translation correspondences. The results show that translation mapping is achieved when automatically discovered lexical correspondences are used as anchor points to explore the functional equivalence of related linguistic features. In textometric studies, this process, based on contrastive analysis of selected text zones, is known as resonance. It relies upon the computation of characteristic elements (Lebart et al., 1998) and can be propagated across multiple annotation layers. The research findings suggest that combining a systemic functional approach with textometric analysis offers new perspectives for context-based comparable text processing. The methodology followed in this work builds on the principles of textometric exploration of multilingual corpora developed in earlier work (Fleury, Zimina, 2008). The principle of systemic amplification of textual resonance (Salem, 2004) is mobilised in order to take several levels of linguistic analysis into account. The "signal" sent to initiate the corpus exploration (induction) is primed on the basis of theoretical principles borrowed from Systemic Functional Linguistics (SFL) (Banks, 2005; Gledhill, 2011). The approach is presented through a series of textometric explorations of the BBC_Lenta.RU corpus. This media corpus comprises 2,345 texts corresponding to news feeds published by the British broadcaster BBC and their Russian translation-adaptations published by Lenta.ru, a Moscow-based news site (Klementiev, Roth, 2006). The original English texts contain more than one million occurrences, while their Russian translation-adaptations are, on average, half as long (fewer than 500,000 occurrences). The informative character of the BBC_Lenta.RU media texts has several specificities that emerge from the study of the lexico-grammatical patterns typical of this text type. The characteristic features of these patterns provide cues for aligning the corresponding textual structures. The textometric approach to multilingual corpora makes it possible to trigger an automatic inventory of these cues by taking into account multiple annotations of text units, formalised as textometric frames ("Trames textométriques", Fleury, 2013). These annotations reflect several states of the corpus (lemmatisation, automatic part-of-speech tagging, syntactic dependency relations, semantic analysis, etc.). The multiple annotations of textual units are generated and managed with the Le Trameur software (http://www.tal.univ-paris3.fr/trameur). The characteristic elements of multilingual discourse, computed over several annotation layers, are mobilised for text alignment purposes. This approach advances work on modelling a computerised bi-text.
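
    The characteristic-elements computation (Lebart et al., 1998) mentioned above can be sketched as a hypergeometric over-representation test. The function and the counts below are an illustrative simplification, not the Le Trameur implementation, and the example word and figures are invented.

```python
from scipy.stats import hypergeom

def specificity(k, n_zone, K, N):
    """
    k      -- occurrences of the word in the text zone
    n_zone -- total tokens in the zone
    K      -- occurrences of the word in the whole corpus
    N      -- total tokens in the whole corpus
    Returns P(X >= k) under the hypergeometric model; small values mean the
    word is characteristic (over-represented) in that zone.
    """
    return hypergeom.sf(k - 1, N, K, n_zone)

# Toy example: a word appears 12 times in a 10,000-token zone
# but only 20 times in the 200,000-token corpus.
p = specificity(k=12, n_zone=10_000, K=20, N=200_000)
print(f"over-representation p-value: {p:.2e}")
```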

    Translation Alignment Applied to Historical Languages: methods, evaluation, applications, and visualization

    Translation alignment is an essential task in Digital Humanities and Natural Language Processing; it aims to link words and phrases in the source text with their translation equivalents in the translation. In addition to its importance in teaching and learning historical languages, translation alignment builds bridges between ancient and modern languages through which various linguistic annotations can be transferred. This thesis focuses on word-level translation alignment applied to historical languages in general and Ancient Greek and Latin in particular. As the title indicates, the thesis addresses four interdisciplinary aspects of translation alignment. The starting point was developing Ugarit, an interactive annotation tool for performing manual alignment, with the aim of gathering training data for an automatic alignment model. This effort resulted in more than 190k accurate translation pairs that I later used for supervised training. Ugarit has been used by many researchers and scholars, including in the classroom at several institutions for teaching and learning ancient languages, which has resulted in a large, diverse, crowd-sourced aligned parallel corpus. This corpus allowed us to conduct experiments and qualitative analyses to detect recurring patterns in annotators' alignment practice and in the generated translation pairs. Further, I employed recent advances in NLP and language modeling to develop an automatic alignment model for historical low-resource languages, experimenting with various training objectives and proposing a training strategy for historical languages that combines supervised and unsupervised training with mono- and multilingual texts. I then integrated this alignment model into other development workflows to project cross-lingual annotations and induce bilingual dictionaries from parallel corpora. Evaluation is essential to assess the quality of any model; to ensure best practice, I reviewed the current evaluation procedure, identified its limitations, and proposed two new evaluation metrics. Moreover, I introduced a visual analytics framework to explore and inspect alignment gold-standard datasets and to support quantitative and qualitative evaluation of translation alignment models. In addition, I designed and implemented visual analytics tools and reading environments for parallel texts and proposed various visualization approaches to support different alignment-related tasks, employing the latest advances in information visualization and best practices. Overall, this thesis presents a comprehensive study that includes manual and automatic alignment techniques, evaluation methods and visual analytics tools aimed at advancing the field of translation alignment for historical languages.
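
    In the spirit of the embedding-based alignment described above, the sketch below aligns source and target tokens whose contextual embeddings are mutual nearest neighbours. The random vectors stand in for real encoder outputs, and the mutual-argmax rule is an assumed simplification, not the thesis's trained model.

```python
import numpy as np

def align(src_vecs, tgt_vecs):
    """src_vecs: (m, d), tgt_vecs: (n, d) L2-normalised embeddings.
    Returns (i, j) pairs where i and j are each other's best match."""
    sim = src_vecs @ tgt_vecs.T       # cosine similarity matrix
    best_tgt = sim.argmax(axis=1)     # best target for each source token
    best_src = sim.argmax(axis=0)     # best source for each target token
    return [(i, j) for i, j in enumerate(best_tgt) if best_src[j] == i]

# Toy usage with random unit vectors standing in for real model outputs
# (e.g. a multilingual encoder adapted to historical languages).
rng = np.random.default_rng(0)
src = rng.normal(size=(4, 8)); src /= np.linalg.norm(src, axis=1, keepdims=True)
tgt = rng.normal(size=(5, 8)); tgt /= np.linalg.norm(tgt, axis=1, keepdims=True)
print(align(src, tgt))
```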

    Fine-grained Arabic named entity recognition

    This thesis addresses the problem of fine-grained NER for Arabic, which poses unique linguistic challenges to NER, such as the absence of capitalisation and short vowels, complex morphology, and a highly inflectional word-formation process. Instead of classifying the detected NE phrases into a small set of classes, we target a broader range (50 fine-grained classes organised in a two-level hierarchy) to increase the depth of the semantic knowledge extracted. This increases the number of classes and complicates the task, compared with traditional (coarse-grained) NER, because of the larger number of semantic classes and the smaller semantic differences between fine-grained classes. Our approach to developing fine-grained NER relies on two supervised Machine Learning (ML) techniques, Maximum Entropy (ME) and Conditional Random Fields (CRF), which require annotated training data in order to learn by extracting informative features. We develop a methodology which exploits the richness of Arabic Wikipedia (AW) in order to automatically create a scalable fine-grained lexical resource and a corpus. Moreover, two gold-standard corpora from different genres were also created to perform comparable evaluation. The thesis also develops a new approach to feature representation that relies on the dependency structure of the sentence to overcome the limitations of the traditional window-based (n-gram) representation. Furthermore, the richness of unannotated textual data was exploited to extract global informative features using a word-level clustering technique. Each contribution was evaluated via controlled experiments and reported using three commonly applied metrics: precision, recall and the harmonic F-measure.
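
    To illustrate the contrast between the traditional window-based representation and the dependency-based and cluster-based features described above, here is a small sketch; the toy sentence, its parse, and the cluster identifiers are hypothetical, not the thesis's feature set.

```python
# Feature extraction sketch for a token at position i: a window-based (n-gram)
# view versus a dependency-based view, plus a word-cluster feature derived
# from unannotated text.

def window_features(tokens, i, size=2):
    """Surface features from a fixed window around token i."""
    feats = {"word": tokens[i]}
    for off in range(-size, size + 1):
        if off and 0 <= i + off < len(tokens):
            feats[f"w[{off}]"] = tokens[i + off]
    return feats

def dependency_features(tokens, heads, labels, i):
    """Features from the token's syntactic head instead of linear neighbours."""
    h = heads[i]  # index of the governing word (-1 for root)
    return {"word": tokens[i], "dep_label": labels[i],
            "head_word": tokens[h] if h >= 0 else "ROOT"}

def cluster_feature(word, clusters):
    """Brown-style cluster id learned from unannotated text (toy mapping)."""
    return {"cluster": clusters.get(word, "UNK")}

# Toy Arabic sentence ("The president visited Paris") with a hypothetical
# dependency parse and cluster map.
toks = ["زار", "الرئيس", "باريس"]
heads = [-1, 0, 0]
labels = ["root", "subj", "obj"]
clusters = {"باريس": "0110", "الرئيس": "1001"}
print(window_features(toks, 2))
print(dependency_features(toks, heads, labels, 2))
print(cluster_feature(toks[2], clusters))
```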

    An Urdu semantic tagger - lexicons, corpora, methods and tools

    Extracting and analysing meaning-related information from natural language data has attracted the attention of researchers in various fields, such as Natural Language Processing (NLP), corpus linguistics and data science. An important aspect of such automatic information extraction and analysis is the semantic annotation of language data using a semantic annotation tool (also known as a semantic tagger). Different semantic annotation tools have been designed to carry out various levels of semantic annotation, for instance sentiment analysis, word sense disambiguation, content analysis and semantic role labelling. These tools identify or tag only part of the core semantic information in language data; moreover, they tend to be applicable only to English and other European languages. A semantic annotation tool that can annotate the semantic senses of all lexical units (words) in Urdu, based on the USAS (UCREL Semantic Analysis System) semantic taxonomy, is still needed in order to provide comprehensive semantic analysis of Urdu text. This research reports on the development of an Urdu semantic tagging tool and discusses the challenging issues faced during this Ph.D. work. Since standard NLP pipeline tools are not widely available for Urdu, a suite of newly developed tools has been created alongside the Urdu semantic tagger: a sentence tokenizer, a word tokenizer and a part-of-speech (POS) tagger. Results for these tools are as follows: the word tokenizer achieves an F1 of 94.01% and an accuracy of 97.21%, the sentence tokenizer an F1 of 92.59% and an accuracy of 93.15%, and the POS tagger an accuracy of 95.14%. The Urdu semantic tagger incorporates semantic resources (a lexicon and corpora) as well as semantic field disambiguation methods. In terms of novelty, the NLP pre-processing tools are developed using rule-based, statistical or hybrid techniques. Furthermore, all semantic lexicons have been developed using a novel combination of automatic and semi-automatic approaches: mapping, crowdsourcing, statistical machine translation, GIZA++, word embeddings and named entities. A large multi-target annotated corpus was also constructed using a semi-automatic approach to test the accuracy of the Urdu semantic tagger; this corpus is also used to train and test supervised multi-target machine learning classifiers. The results show that the Random k-labEL Disjoint Pruned Sets and Classifier Chain multi-target classifiers outperform all other classifiers on the proposed corpus, with a Hamming Loss of 0.06 and an Accuracy of 0.94. Lexical coverage of 88.59%, 99.63%, 96.71% and 89.63% is obtained on several test corpora, and the Urdu semantic tagger achieves an encouraging precision of 79.47% on the proposed test corpus.
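
    As a rough illustration of lexicon-driven semantic tagging in the USAS style, the sketch below looks each token up in a small lexicon and falls back to the USAS unmatched tag Z99. The lexicon entries, the example tags, and the naive first-candidate disambiguation are assumptions for illustration, not the thesis's resources or disambiguation methods.

```python
# Toy Urdu-to-USAS lexicon; each word maps to a list of candidate semantic
# field tags (illustrative assignments).
URDU_LEXICON = {
    "کتاب": ["Q4.1"],        # "book"  -> media: books
    "پانی": ["O1.2", "F2"],  # "water" -> liquid substances / drinks
}

def semantic_tag(tokens, lexicon=URDU_LEXICON):
    """Tag each token with its top candidate semantic field, Z99 if unknown."""
    tagged = []
    for tok in tokens:
        candidates = lexicon.get(tok, ["Z99"])  # Z99 = unmatched in USAS
        tagged.append((tok, candidates[0]))     # naive: keep the first candidate
    return tagged

print(semantic_tag(["پانی", "کتاب", "قلم"]))
```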