1,092 research outputs found

    Entity Linking to Wikipedia : Grounding entity mentions in natural language text using thematic context distance and collective search

    Get PDF
    This thesis proposes new methods for entity linking in natural language text that assigns entity mentions in unstructured natural language text to the semi-structured encyclopedia Wikipedia. Doing so, entity linking grounds a mention to an encyclopedic entry in Wikipedia and embeds it into this Linked-Open-Data hub. This enables a higher level view on single documents, provides hints for further reading and may be used to add details from other sources. Furthermore, enriching text documents with such links simultaneously resolves the ambiguity of entity names. This ambiguity is an unsolved challenge for many text mining applications: one entity may be designated by a multitude of names and every mention may denote a multitude of entities. Resolving the ambiguity of entity names is thus a crucial step for entity based retrieval, an open problem for most information retrieval and extraction tasks. For instance, search engines relying on heuristic string matches often retrieve irrelevant results as they can not satisfyingly resolve ambiguity. Moreover, there is a huge number of entity mentions that can not be linked to Wikipedia since albeit of its size, Wikipedia has a restricted coverage. Earlier and current work often ignored this and consequently all mentions of uncovered entities. Other approaches handle only entity mentions of specific types or are focussed on English as target language. Apart from such restrictions, no method achieves perfect linking performance. These are the tasks approached in this thesis. We introduce new methods for candidate entity retrieval and candidate entity consolidation, the key components to recall and precision, exploiting both the vast amount of structured and unstructured information stored in Wikipedia. First, we propose a new contextual similarity measure based on latent topic distributions inferred from unstructured natural language text. We show that this thematic distance between mention and candidate entity contexts yields a lower linking error rate than purely word based distances. Being language independent, this method enables high performance entity linking in previously neglected languages such as German and French. This approach is especially suitable, albeit not restricted to link person names, the class of mentions with highest ambiguity. We next propose a new candidate retrieval method to enable successful entity linking also for other entities that are not referenced canonically or exhibit the thematic coherence of persons. We introduce collective search that uses the structured information encoded in Wikipedia’s hyperlink graph to arrive at sets of strongly related candidate entities. This enables us to better handle synonymy, one of the hardest problems in entity linking and not thoroughly treated in previous work. We emphasize on general applicability and evaluate this method on a broad collection of benchmark corpora both in a supervised as well as in an unsupervised setting. We show that candidate enhancement through collective search increases linking performance on nearly all of these corpora and that our method is the most stable compared to other state-of-the-art approaches. Presenting the first unification of diverse performance measures, we also make a step forward to the comparability of entity linking methods. In conclusion, we provide state-of-the-art entity linking methods for nearly all of the current use cases. When it comes to fine-tuning, we note that entity linking has subjective aspects and adaptions may be necessary depending on the task at hand

    Suomenkielisen geojäsentimen kehittäminen: kuinka hankkia sijaintitietoa jäsentelemättömistä tekstiaineistoista

    Get PDF
    Alati enemmän aineistoa tuotetaan ja jaetaan internetin kautta. Aineistot ovat vaihtelevia muodoiltaan, kuten verkkoartikkelien ja sosiaalisen media julkaisujen kaltaiset digitaaliset tekstit, ja niillä on usein spatiaalinen ulottuvuus. Teksteissä geospatiaalisuutta ilmaistaan paikannimien kautta, mutta tavanomaisilla paikkatietomenetelmillä ei kyetä käsittelemään tietoa epätäsmällisessä kielellisessä asussaan. Tämä on luonut tarpeen muuntaa tekstimuotoisen sijaintitiedon näkyvään muotoon, koordinaateiksi. Ongelmaa ratkaisemaan on kehitetty geojäsentimiä, jotka tunnistavat ja paikantavat paikannimet vapaista teksteistä, ja jotka oikein toimiessaan voisivat toimia paikkatiedon lähteenä maantieteellisessä tutkimuksessa. Geojäsentämistä onkin sovellettu katastrofihallinnasta kirjallisuudentutkimukseen. Merkittävässä osassa geojäsentämisen tutkimusta tutkimusaineiston kielenä on ollut englanti ja geojäsentimetkin ovat kielikohtaisia – tämä jättää pimentoon paitsi geojäsentimien kehitykseen vaikuttavat havainnot pienemmistä kielistä myös kyseisten kielten puhujien näkemykset. Maisterintutkielmassani pyrin vastaamaan kolmeen tutkimuskysymykseen: Mitkä ovat edistyneimmät geojäsentämismenetelmät? Mitkä kielelliset ja maantieteelliset monitulkintaisuudet vaikeuttavat tämän monitahoisen ongelman ratkaisua? Ja miten arvioida geojäsentimien luotettavuutta ja käytettävyyttä? Tutkielman soveltavassa osuudessa esittelen Fingerin, geojäsentimen suomen kielelle, ja kuvaan sen kehitystä sekä suorituskyvyn arviointia. Arviointia varten loin kaksi testiaineistoa, joista toinen koostuu Twitter-julkaisuista ja toinen uutisartikkeleista. Finger-geojäsennin, testiaineistot ja relevantit ohjelmakoodit jaetaan avoimesti. Geojäsentäminen voidaan jakaa kahteen alitehtävään: paikannimien tunnistamiseen tekstivirrasta ja paikannimien ratkaisemiseen oikeaan koordinaattipisteeseen mahdollisesti useasta kandidaatista. Molemmissa vaiheissa uusimmat metodit nojaavat syväoppimismalleihin ja -menetelmiin, joiden syötteinä ovat sanaupotusten kaltaiset vektorit. Geojäsentimien suoriutumista testataan aineistoilla, joissa paikannimet ja niiden koordinaatit tiedetään. Mittatikkuna tunnistamisessa on vastaavuus ja ratkaisemisessa etäisyys oikeasta sijainnista. Finger käyttää paikannimitunnistinta, joka hyödyntää suomenkielistä BERT-kielimallia, ja suoraviivaista tietokantahakua paikannimien ratkaisemiseen. Ohjelmisto tuottaa taulukkomuotoiseksi jäsenneltyä paikkatietoa, joka sisältää syötetekstit ja niistä mahdollisesti tunnistetut paikannimet koordinaattisijainteineen. Testiaineistot eroavat aihepiireiltään, mutta Finger suoriutuu niillä likipitäen samoin, ja suoriutuu englanninkielisillä aineistoilla tehtyihin arviointeihin suhteutettuna kelvollisesti. Virheanalyysi paljastaa useita virhelähteitä, jotka johtuvat kielten tai maantieteellisen todellisuuden luontaisesta epäselvyydestä tai ovat prosessoinnin aiheuttamia, kuten perusmuotoistamisvirheet. Kaikkia osia Fingerissä voidaan parantaa, muun muassa kehittämällä kielellistä käsittelyä pidemmälle ja luomalla kattavampia testiaineistoja. Samoin tulevaisuuden geojäsentimien tulee kyetä käsittelemään monimutkaisempia kielellisiä ja maantieteellisiä kuvaustapoja kuin pelkät paikannimet ja koordinaattipisteet. Finger ei nykymuodossaan tuota valmista paikkatietoa, jota kannattaisi kritiikittä käyttää. Se on kuitenkin lupaava ensiaskel suomen kielen geojäsentimille ja astinlauta vastaisuuden soveltavalle tutkimukselle.Ever more data is available and shared through the internet. The big data masses often have a spatial dimension and can take many forms, one of which are digital texts, such as articles or social media posts. The geospatial links in these texts are made through place names, also called toponyms, but traditional GIS methods are unable to deal with the fuzzy linguistic information. This creates the need to transform the linguistic location information to an explicit coordinate form. Several geoparsers have been developed to recognize and locate toponyms in free-form texts: the task of these systems is to be a reliable source of location information. Geoparsers have been applied to topics ranging from disaster management to literary studies. Major language of study in geoparser research has been English and geoparsers tend to be language-specific, which threatens to leave the experiences provided by studying and expressed in smaller languages unexplored. This thesis seeks to answer three research questions related to geoparsing: What are the most advanced geoparsing methods? What linguistic and geographical features complicate this multi-faceted problem? And how to evaluate the reliability and usability of geoparsers? The major contributions of this work are an open-source geoparser for Finnish texts, Finger, and two test datasets, or corpora, for testing Finnish geoparsers. One of the datasets consists of tweets and the other of news articles. All of these resources, including the relevant code for acquiring the test data and evaluating the geoparser, are shared openly. Geoparsing can be divided into two sub-tasks: recognizing toponyms amid text flows and resolving them to the correct coordinate location. Both tasks have seen a recent turn to deep learning methods and models, where the input texts are encoded as, for example, word embeddings. Geoparsers are evaluated against gold standard datasets where toponyms and their coordinates are marked. Performance is measured on equivalence and distance-based metrics for toponym recognition and resolution respectively. Finger uses a toponym recognition classifier built on a Finnish BERT model and a simple gazetteer query to resolve the toponyms to coordinate points. The program outputs structured geodata, with input texts and the recognized toponyms and coordinate locations. While the datasets represent different text types in terms of formality and topics, there is little difference in performance when evaluating Finger against them. The overall performance is comparable to the performance of geoparsers of English texts. Error analysis reveals multiple error sources, caused either by the inherent ambiguousness of the studied language and the geographical world or are caused by the processing itself, for example by the lemmatizer. Finger can be improved in multiple ways, such as refining how it analyzes texts and creating more comprehensive evaluation datasets. Similarly, the geoparsing task should move towards more complex linguistic and geographical descriptions than just toponyms and coordinate points. Finger is not, in its current state, a ready source of geodata. However, the system has potential to be the first step for geoparsers for Finnish and it can be a steppingstone for future applied research

    A Theatre of Places: Mapping 17 th French Theatre

    Get PDF
    International audienceAs a form of distant reading, mapping texts allows scholars to read classical works anew. Using 17 th French theatre as a test case, we describe an easily reproducible and fully open-source workflow used for extracting and mapping place names, then reach conclusions on literary influences and the strength of genre during the Grand Siècle based

    Named Entity Recognition for early-modern textual sources: a review of capabilities and challenges with strategies for the future

    Get PDF
    Purpose: By mapping-out the capabilities, challenges and limitations of named-entity recognition (NER), this article aims to synthesise the state of the art of NER in the context of the early modern research field and to inform discussions about the kind of resources, methods and directions that may be pursued to enrich the application of the technique going forward. // Design/methodology/approach: Through an extensive literature review, this article maps out the current capabilities, challenges and limitations of NER and establishes the state of the art of the technique in the context of the early modern, digitally augmented research field. It also presents a new case study of NER research undertaken by Enlightenment Architectures: Sir Hans Sloane's Catalogues of his Collections (2016–2021), a Leverhulme funded research project and collaboration between the British Museum and University College London, with contributing expertise from the British Library and the Natural History Museum. // Findings: Currently, it is not possible to benchmark the capabilities of NER as applied to documents of the early modern period. The authors also draw attention to the situated nature of authority files, and current conceptualisations of NER, leading them to the conclusion that more robust reporting and critical analysis of NER approaches and findings is required. // Research limitations/implications: This article examines NER as applied to early modern textual sources, which are mostly studied by Humanists. As addressed in this article, detailed reporting of NER processes and outcomes is not necessarily valued by the disciplines of the Humanities, with the result that it can be difficult to locate relevant data and metrics in project outputs. The authors have tried to mitigate this by contacting projects discussed in this paper directly, to further verify the details they report here. // Practical implications: The authors suggest that a forum is needed where tools are evaluated according to community standards. Within the wider NER community, the MUC and ConLL corpora are used for such experimental set-ups and are accompanied by a conference series, and may be seen as a useful model for this. The ultimate nature of such a forum must be discussed with the whole research community of the early modern domain. // Social implications: NER is an algorithmic intervention that transforms data according to certain rules-, patterns- or training data and ultimately affects how the authors interpret the results. The creation, use and promotion of algorithmic technologies like NER is not a neutral process, and neither is their output A more critical understanding of the role and impact of NER on early modern documents and research and focalization of some of the data- and human-centric aspects of NER routines that are currently overlooked are called for in this paper. // Originality/value: This article presents a state of the art snapshot of NER, its applications and potential, in the context of early modern research. It also seeks to inform discussions about the kinds of resources, methods and directions that may be pursued to enrich the application of NER going forward. It draws attention to the situated nature of authority files, and current conceptualisations of NER, and concludes that more robust reporting of NER approaches and findings are urgently required. The Appendix sets out a comprehensive summary of digital tools and resources surveyed in this article

    A Survey on Semantic Processing Techniques

    Full text link
    Semantic processing is a fundamental research domain in computational linguistics. In the era of powerful pre-trained language models and large language models, the advancement of research in this domain appears to be decelerating. However, the study of semantics is multi-dimensional in linguistics. The research depth and breadth of computational semantic processing can be largely improved with new technologies. In this survey, we analyzed five semantic processing tasks, e.g., word sense disambiguation, anaphora resolution, named entity recognition, concept extraction, and subjectivity detection. We study relevant theoretical research in these fields, advanced methods, and downstream applications. We connect the surveyed tasks with downstream applications because this may inspire future scholars to fuse these low-level semantic processing tasks with high-level natural language processing tasks. The review of theoretical research may also inspire new tasks and technologies in the semantic processing domain. Finally, we compare the different semantic processing techniques and summarize their technical trends, application trends, and future directions.Comment: Published at Information Fusion, Volume 101, 2024, 101988, ISSN 1566-2535. The equal contribution mark is missed in the published version due to the publication policies. Please contact Prof. Erik Cambria for detail

    Agile in-litero experiments:how can semi-automated information extraction from neuroscientific literature help neuroscience model building?

    Get PDF
    In neuroscience, as in many other scientific domains, the primary form of knowledge dissemination is through published articles in peer-reviewed journals. One challenge for modern neuroinformatics is to design methods to make the knowledge from the tremendous backlog of publications accessible for search, analysis and its integration into computational models. In this thesis, we introduce novel natural language processing (NLP) models and systems to mine the neuroscientific literature. In addition to in vivo, in vitro or in silico experiments, we coin the NLP methods developed in this thesis as in litero experiments, aiming at analyzing and making accessible the extended body of neuroscientific literature. In particular, we focus on two important neuroscientific entities: brain regions and neural cells. An integrated NLP model is designed to automatically extract brain region connectivity statements from very large corpora. This system is applied to a large corpus of 25M PubMed abstracts and 600K full-text articles. Central to this system is the creation of a searchable database of brain region connectivity statements, allowing neuroscientists to gain an overview of all brain regions connected to a given region of interest. More importantly, the database enables researcher to provide feedback on connectivity results and links back to the original article sentence to provide the relevant context. The database is evaluated by neuroanatomists on real connectomics tasks (targets of Nucleus Accumbens) and results in significant effort reduction in comparison to previous manual methods (from 1 week to 2h). Subsequently, we introduce neuroNER to identify, normalize and compare instances of identify neuronsneurons in the scientific literature. Our method relies on identifying and analyzing each of the domain features used to annotate a specific neuron mention, like the morphological term 'basket' or brain region 'hippocampus'. We apply our method to the same corpus of 25M PubMed abstracts and 600K full-text articles and find over 500K unique neuron type mentions. To demonstrate the utility of our approach, we also apply our method towards cross-comparing the NeuroLex and Human Brain Project (HBP) cell type ontologies. By decoupling a neuron mention's identity into its specific compositional features, our method can successfully identify specific neuron types even if they are not explicitly listed within a predefined neuron type lexicon, thus greatly facilitating cross-laboratory studies. In order to build such large databases, several tools and infrastructureslarge-scale NLP were developed: a robust pipeline to preprocess full-text PDF articles, as well as bluima, an NLP processing pipeline specialized on neuroscience to perform text-mining at PubMed scale. During the development of those two NLP systems, we acknowledged the need for novel NLP approaches to rapidly develop custom text mining solutions. This led to the formalization of the agile text miningagile text-mining methodology to improve the communication and collaboration between subject matter experts and text miners. Agile text mining is characterized by short development cycles, frequent tasks redefinition and continuous performance monitoring through integration tests. To support our approach, we developed Sherlok, an NLP framework designed for the development of agile text mining applications

    DARIAH and the Benelux

    Get PDF

    Few-shot entity linking of food names

    Get PDF
    Entity linking (EL), the task of automatically matching mentions in text to concepts in a target knowledge base, remains under-explored when it comes to the food domain, despite its many potential applications, e.g., finding the nutritional value of ingredients in databases. In this paper, we describe the creation of new resources supporting the development of EL methods applied to the food domain: the E.Care Knowledge Base (E.Care KB) which contains 664 food concepts and the E.Care dataset, a corpus of 468 cooking recipes where ingredient names have been manually linked to corresponding concepts in the E.Care KB. We developed and evaluated different methods for EL, namely, deep learning-based approaches underpinned by Siamese networks trained under a few-shot learning setting, traditional machine learning-based approaches underpinned by support vector machines (SVMs) and unsupervised approaches based on string matching algorithms. Combining the strengths of each of these approaches, we built a hybrid model for food EL that balances the trade-offs between performance and inference speed. Specifically, our hybrid model obtains 89.40% accuracy and links mentions at an average speed of 0.24 seconds per mention, whereas our best deep learning-based model, SVM model and unsupervised model obtain accuracies of 86.99%, 87.19% and 87.43% at inference speeds of 0.007, 0.66 and 0.02 seconds per mention, respectively
    corecore