57 research outputs found

    Projecting named entity recognizers without annotated or parallel corpora

    Get PDF
    Named entity recognition (NER) is a well-researched task in the field of NLP, which typically requires large annotated corpora for training usable models. This is a problem for languages which lack large annotated corpora, such as Finnish. We propose an approach to create a named entity recognizer with no annotated or parallel documents, by leveraging strong NER models that exist for English. We automatically gather a large amount of chronologically matched data in two languages, then project named entity annotations from the English documents onto the Finnish ones, by resolving the matches with limited linguistic rules. We use this “artificially” annotated data to train a BiLSTM-CRF model. Our results show that this method can produce annotated instances with high precision, and the resulting model achieves state-of-the-art performance.Peer reviewe

    Multilingual Named Entity Recognition through Data and Model Transfer

    Get PDF
    Maisterintutkielma kÀsittelee monikielistÀ nimien tunnistusta. Tutkielmassa testataan kahta lÀhestymistapaa monikieliseen nimien tunnistukseen: annotoidun datan siirtoa toisille kielille, sekÀ monikielisen mallin luomista. LisÀksi nÀmÀ kaksi lÀhestymistapaa yhdistetÀÀn. Tarkoitus on löytÀÀ menetelmiÀ, joilla nimien tunnistusta voidaan tehdÀ luotettavasti myös pienemmillÀ kielillÀ, joilla annotoituja nimientunnistusaineistoja ei ole suuressa mÀÀrin saatavilla. Tutkielmassa koulutetaan ja testataan malleja neljÀllÀ kielellÀ: suomeksi, viroksi, hollanniksi ja espanjaksi. EnsimmÀisessÀ metodissa annotoitu data siirretÀÀn kieleltÀ toiselle monikielisen paralleelikorpuksen avulla, ja nÀin syntynyttÀ dataa kÀytetÀÀn neuroverkkoja hyödyntÀvÀn koneoppimismallin opettamiseen. Toisessa metodissa kÀytetÀÀn monikielistÀ BERT-mallia. Mallin koulutukseen kÀytetÀÀn annotoituja korpuksia, jotka yhdistetÀÀn monikieliseksi opetusaineistoksi. Kolmannessa metodissa kaksi edellistÀ metodia yhdistetÀÀn, ja kieleltÀ toiselle siirrettyÀ dataa kÀytetÀÀn monikielisen BERT-mallin koulutuksessa. Kaikkia kolmea lÀhestymistapaa testataan kunkin kielen annotoidulla testisetillÀ, ja tuloksia verrataan toisiinsa. Metodi, jossa rakennettiin monikielinen BERT-malli, saavutti selkeÀsti parhaimmat tulokset nimien tunnistamisessa. Neuroverkkomallit, jotka koulutettiin kielestÀ toiseen siirretyillÀ annotaatioilla, saivat selkeÀsti heikompia tuloksia. BERT-mallin kouluttaminen siirretyillÀ annotaatioilla tuotti myös heikkoja tuloksia. Annotaatioiden siirtÀminen kieleltÀ toiselle osoittautui haastavaksi, ja tuloksena syntynyt data sisÀlsi virheitÀ. Tulosten heikkouteen vaikutti myös opetusaineiston ja testiaineiston kuuluminen eri genreen. Monikielinen BERT-malli on tutkielman mukaan testatuista parhaiten toimiva metodi, ja sopii myös kielille, joilla annotoituja aineistoja ei ole paljon saatavilla

    Cross-lingual Word Clusters for Direct Transfer of Linguistic Structure

    Get PDF
    It has been established that incorporating word cluster features derived from large unlabeled corpora can significantly improve prediction of linguistic structure. While previous work has focused primarily on English, we extend these results to other languages along two dimensions. First, we show that these results hold true for a number of languages across families. Second, and more interestingly, we provide an algorithm for inducing cross-lingual clusters and we show that features derived from these clusters significantly improve the accuracy of cross-lingual structure prediction. Specifically, we show that by augmenting direct-transfer systems with cross-lingual cluster features, the relative error of delexicalized dependency parsers, trained on English treebanks and transferred to foreign languages, can be reduced by up to 13%. When applying the same method to direct transfer of named-entity recognizers, we observe relative improvements of up to 26%

    Multilingual Spoken Language Translation

    Get PDF

    Character-level and syntax-level models for low-resource and multilingual natural language processing

    Get PDF
    There are more than 7000 languages in the world, but only a small portion of them benefit from Natural Language Processing resources and models. Although languages generally present different characteristics, “cross-lingual bridges” can be exploited, such as transliteration signals and word alignment links. Such information, together with the availability of multiparallel corpora and the urge to overcome language barriers, motivates us to build models that represent more of the world’s languages. This thesis investigates cross-lingual links for improving the processing of low-resource languages with language-agnostic models at the character and syntax level. Specifically, we propose to (i) use orthographic similarities and transliteration between Named Entities and rare words in different languages to improve the construction of Bilingual Word Embeddings (BWEs) and named entity resources, and (ii) exploit multiparallel corpora for projecting labels from high- to low-resource languages, thereby gaining access to weakly supervised processing methods for the latter. In the first publication, we describe our approach for improving the translation of rare words and named entities for the Bilingual Dictionary Induction (BDI) task, using orthography and transliteration information. In our second work, we tackle BDI by enriching BWEs with orthography embeddings and a number of other features, using our classification-based system to overcome script differences among languages. The third publication describes cheap cross-lingual signals that should be considered when building mapping approaches for BWEs since they are simple to extract, effective for bootstrapping the mapping of BWEs, and overcome the failure of unsupervised methods. The fourth paper shows our approach for extracting a named entity resource for 1340 languages, including very low-resource languages from all major areas of linguistic diversity. We exploit parallel corpus statistics and transliteration models and obtain improved performance over prior work. Lastly, the fifth work models annotation projection as a graph-based label propagation problem for the part of speech tagging task. Part of speech models trained on our labeled sets outperform prior work for low-resource languages like Bambara (an African language spoken in Mali), Erzya (a Uralic language spoken in Russia’s Republic of Mordovia), Manx (the Celtic language of the Isle of Man), and Yoruba (a Niger-Congo language spoken in Nigeria and surrounding countries)

    Agile in-litero experiments:how can semi-automated information extraction from neuroscientific literature help neuroscience model building?

    Get PDF
    In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles in peer-reviewed journals. One challenge for modern neuroinformatics is to design methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and its integration into computational models. In this thesis, we introduce novel natural language processing (NLP) models and systems to mine the neuroscientific literature. In addition to in vivo, in vitro or in silico experiments, we coin the NLP methods developed in this thesis as in litero experiments, aiming at analyzing and making accessible the extended body of neuroscientific literature. In particular, we focus on two important neuroscientific entities: brain regions and neural cells. An integrated NLP model is designed to automatically extract brain region connectivity statements from very large corpora. This system is applied to a large corpus of 25M PubMed abstracts and 600K full-text articles. Central to this system is the creation of a searchable database of brain region connectivity statements, allowing neuroscientists to gain an overview of all brain regions connected to a given region of interest. More importantly, the database enables researcher to provide feedback on connectivity results and links back to the original article sentence to provide the relevant context. The database is evaluated by neuroanatomists on real connectomics tasks (targets of Nucleus Accumbens) and results in significant effort reduction in comparison to previous manual methods (from 1 week to 2h). Subsequently, we introduce neuroNER to identify, normalize and compare instances of identify neuronsneurons in the scientific literature. Our method relies on identifying and analyzing each of the domain features used to annotate a specific neuron mention, like the morphological term 'basket' or brain region 'hippocampus'. We apply our method to the same corpus of 25M PubMed abstracts and 600K full-text articles and find over 500K unique neuron type mentions. To demonstrate the utility of our approach, we also apply our method towards cross-comparing the NeuroLex and Human Brain Project (HBP) cell type ontologies. By decoupling a neuron mention's identity into its specific compositional features, our method can successfully identify specific neuron types even if they are not explicitly listed within a predefined neuron type lexicon, thus greatly facilitating cross-laboratory studies. In order to build such large databases, several tools and infrastructureslarge-scale NLP were developed: a robust pipeline to preprocess full-text PDF articles, as well as bluima, an NLP processing pipeline specialized on neuroscience to perform text-mining at PubMed scale. During the development of those two NLP systems, we acknowledged the need for novel NLP approaches to rapidly develop custom text mining solutions. This led to the formalization of the agile text miningagile text-mining methodology to improve the communication and collaboration between subject matter experts and text miners. Agile text mining is characterized by short development cycles, frequent tasks redefinition and continuous performance monitoring through integration tests. To support our approach, we developed Sherlok, an NLP framework designed for the development of agile text mining applications

    Acta Cybernetica : Volume 18. Number 3.

    Get PDF

    Empowering machine learning models with contextual knowledge for enhancing the detection of eating disorders in social media posts

    Get PDF
    Social networks have become information dissemination channels, where announcements are posted frequently; they also serve as frameworks for debates in various areas (e.g., scientific, political, and social). In particular, in the health area, social networks represent a channel to communicate and disseminate novel treatments' success; they also allow ordinary people to express their concerns about a disease or disorder. The Artificial Intelligence (AI) community has developed analytical methods to uncover and predict patterns from posts that enable it to explain news about a particular topic, e.g., mental disorders expressed as eating disorders or depression. Albeit potentially rich while expressing an idea or concern, posts are presented as short texts, preventing, thus, AI models from accurately encoding these posts' contextual knowledge. We propose a hybrid approach where knowledge encoded in community-maintained knowledge graphs (e.g., Wikidata) is combined with deep learning to categorize social media posts using existing classification models. The proposed approach resorts to state-of-the-art named entity recognizers and linkers (e.g., Falcon 2.0) to extract entities in short posts and link them to concepts in knowledge graphs. Then, knowledge graph embeddings (KGEs) are utilized to compute latent representations of the extracted entities, which result in vector representations of the posts that encode these entities' contextual knowledge extracted from the knowledge graphs. These KGEs are combined with contextualized word embeddings (e.g., BERT) to generate a context-based representation of the posts that empower prediction models. We apply our proposed approach in the health domain to detect whether a publication is related to an eating disorder (e.g., anorexia or bulimia) and uncover concepts within the discourse that could help healthcare providers diagnose this type of mental disorder. We evaluate our approach on a dataset of 2,000 tweets about eating disorders. Our experimental results suggest that combining contextual knowledge encoded in word embeddings with the one built from knowledge graphs increases the reliability of the predictive models. The ambition is that the proposed method can support health domain experts in discovering patterns that may forecast a mental disorder, enhancing early detection and more precise diagnosis towards personalized medicine

    Predicting Linguistic Structure with Incomplete and Cross-Lingual Supervision

    Get PDF
    Contemporary approaches to natural language processing are predominantly based on statistical machine learning from large amounts of text, which has been manually annotated with the linguistic structure of interest. However, such complete supervision is currently only available for the world's major languages, in a limited number of domains and for a limited range of tasks. As an alternative, this dissertation considers methods for linguistic structure prediction that can make use of incomplete and cross-lingual supervision, with the prospect of making linguistic processing tools more widely available at a lower cost. An overarching theme of this work is the use of structured discriminative latent variable models for learning with indirect and ambiguous supervision; as instantiated, these models admit rich model features while retaining efficient learning and inference properties. The first contribution to this end is a latent-variable model for fine-grained sentiment analysis with coarse-grained indirect supervision. The second is a model for cross-lingual word-cluster induction and the application thereof to cross-lingual model transfer. The third is a method for adapting multi-source discriminative cross-lingual transfer models to target languages, by means of typologically informed selective parameter sharing. The fourth is an ambiguity-aware self- and ensemble-training algorithm, which is applied to target language adaptation and relexicalization of delexicalized cross-lingual transfer parsers. The fifth is a set of sequence-labeling models that combine constraints at the level of tokens and types, and an instantiation of these models for part-of-speech tagging with incomplete cross-lingual and crowdsourced supervision. In addition to these contributions, comprehensive overviews are provided of structured prediction with no or incomplete supervision, as well as of learning in the multilingual and cross-lingual settings. Through careful empirical evaluation, it is established that the proposed methods can be used to create substantially more accurate tools for linguistic processing, compared to both unsupervised methods and to recently proposed cross-lingual methods. The empirical support for this claim is particularly strong in the latter case; our models for syntactic dependency parsing and part-of-speech tagging achieve the hitherto best published results for a wide number of target languages, in the setting where no annotated training data is available in the target language
    • 

    corecore