125 research outputs found

    A Bayesian Learning, Greedy agglomerative clustering approach and evaluation techniques for Author Name Disambiguation Problem

    Author names often suffer from ambiguity because the same author may appear under different names and multiple authors may possess similar names. This ambiguity makes it difficult to associate a scholarly work with the person who wrote it, introducing inaccuracy into credit attribution, bibliometric analysis, search-by-author in digital libraries, and expert discovery. A plethora of techniques for disambiguating author names has been proposed in the literature. I focus on the research efforts targeted at disambiguating author names. I first go through the conventional methods, then discuss evaluation techniques and the clustering model, which finally leads to the Bayesian learning and greedy agglomerative approach. I believe this concentrated review will be useful to the research community because it discusses techniques applied to a very large real database that is actively used worldwide. The Bayesian and greedy agglomerative approaches will help tackle author name disambiguation (AND) problems in a better way. Finally, I outline a few directions for future work.
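
    As a concrete illustration of the greedy agglomerative step surveyed above, the following is a minimal Python sketch: every record starts as its own cluster, and the most similar pair of clusters is merged as long as the best average similarity stays above a threshold. The pairwise scorer shown (Jaccard overlap of coauthor sets) and the threshold are hypothetical stand-ins for the Bayesian pairwise model, not code from the surveyed systems.

```python
from itertools import combinations

def cluster_records(records, pair_sim, threshold=0.3):
    """Greedy agglomerative clustering of author records.

    records: list of record dicts (coauthors, venue, title terms, ...).
    pair_sim: scores how likely two records share the same author.
    Clusters are merged while the best average-linkage similarity
    between any two clusters stays above `threshold`.
    """
    clusters = [[r] for r in records]  # start: one cluster per record
    while len(clusters) > 1:
        best_score, best_pair = -1.0, None
        for i, j in combinations(range(len(clusters)), 2):
            # average-linkage similarity between clusters i and j
            score = sum(pair_sim(a, b) for a in clusters[i] for b in clusters[j])
            score /= len(clusters[i]) * len(clusters[j])
            if score > best_score:
                best_score, best_pair = score, (i, j)
        if best_score < threshold:
            break  # no pair is similar enough to merge
        i, j = best_pair
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters

def jaccard_coauthor_sim(a, b):
    """Toy pairwise similarity: Jaccard overlap of coauthor sets."""
    union = a["coauthors"] | b["coauthors"]
    return len(a["coauthors"] & b["coauthors"]) / len(union) if union else 0.0

records = [
    {"coauthors": {"J. Smith", "A. Gupta"}},
    {"coauthors": {"J. Smith", "L. Chen"}},
    {"coauthors": {"M. Rossi"}},
]
print([len(c) for c in cluster_records(records, jaccard_coauthor_sim)])  # [2, 1]
```

    In a Bayesian variant, `pair_sim` would instead return a learned posterior probability that two records refer to the same author, estimated from features such as coauthors, venues, and title terms.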

    Engineering a semantic web trust infrastructure

    The ability to judge the trustworthiness of information is an important and challenging problem in the field of Semantic Web research. In this thesis, we take an end-to-end look at the challenges posed by trust on the Semantic Web and present contributions in three areas: a Semantic Web identity vocabulary, a system for bootstrapping trust environments, and a framework for trust-aware information management. Typically, Semantic Web agents, which consume and produce information, are not described with sufficient information to permit those interacting with them to make good judgements of trustworthiness. A descriptive vocabulary for agent identity is required to enable effective inter-agent discourse and the growth of trust and reputation within the Semantic Web; we therefore present such a foundational identity ontology for describing web-based agents. It is anticipated that the Semantic Web will suffer from a trust network bootstrapping problem. In this thesis, we propose a novel approach which harnesses open data to bootstrap trust in new trust environments. This approach brings together public records published by a range of trusted institutions in order to encourage trust in identities within new environments. Information integrity and provenance are both critical prerequisites for well-founded judgements of information trustworthiness. We propose a modification to the RDF Named Graph data model in order to address serious representational limitations of the named graph proposal, which affect the ability to cleanly represent claims and provenance records. Next, we propose a novel graph-based approach for recording the provenance of derived information. This approach offers computational and memory savings while maintaining the ability to answer graph-level provenance questions. In addition, it allows new optimisations, such as strategies to avoid needless repeat computation, and a delta-based storage strategy which avoids data duplication.
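
    The delta-based provenance storage idea can be made concrete with a toy model. The sketch below is our own illustration under stated assumptions, not the thesis's implementation: a hypothetical `DeltaGraphStore` keeps full triples only for base named graphs and records each derived graph as added/removed triples plus a provenance pointer to its parent.

```python
class DeltaGraphStore:
    """Toy named-graph store: derived graphs are kept as deltas over a
    parent graph, avoiding data duplication while still being able to
    answer graph-level provenance questions."""

    def __init__(self):
        self.base = {}    # graph_id -> frozenset of (s, p, o) triples
        self.deltas = {}  # graph_id -> (parent_id, added, removed)

    def put_base(self, graph_id, triples):
        self.base[graph_id] = frozenset(triples)

    def derive(self, new_id, parent_id, added=(), removed=()):
        # store only the change plus a pointer to the parent graph
        self.deltas[new_id] = (parent_id, frozenset(added), frozenset(removed))

    def materialise(self, graph_id):
        """Reconstruct the full triple set by replaying the delta chain."""
        if graph_id in self.base:
            return set(self.base[graph_id])
        parent_id, added, removed = self.deltas[graph_id]
        return (self.materialise(parent_id) - removed) | added

    def provenance(self, graph_id):
        """Which graphs does this graph derive from, nearest first?"""
        chain = []
        while graph_id in self.deltas:
            graph_id = self.deltas[graph_id][0]
            chain.append(graph_id)
        return chain

store = DeltaGraphStore()
store.put_base("g1", {("ex:a", "ex:knows", "ex:b")})
store.derive("g2", "g1", added={("ex:b", "ex:knows", "ex:c")})
print(store.materialise("g2"))  # both triples; the base triple is stored once
print(store.provenance("g2"))   # ['g1']
```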

    Entities with quantities: extraction, search, and ranking

    Quantities are more than numeric values. They denote measures of the world's entities, such as heights of buildings, running times of athletes, energy efficiency of car models, or energy production of power plants, all expressed as numbers with associated units. Entity-centric search and question answering (QA) are well supported by modern search engines. However, they do not work well when queries involve quantity filters, such as searching for athletes who ran 200m under 20 seconds or companies with quarterly revenue above $2 billion. State-of-the-art systems fail to understand the quantities, including the condition (less than, above, etc.), the unit of interest (seconds, dollars, etc.), and the context of the quantity (200m race, quarterly revenue, etc.). QA systems based on structured knowledge bases (KBs) also fail, as quantities are poorly covered by state-of-the-art KBs. In this dissertation, we developed new methods to advance the state of the art in quantity knowledge extraction and search. Our main contributions are the following:
    ‱ First, we present Qsearch [Ho et al., 2019, Ho et al., 2020], a system that can handle advanced queries with quantity filters by using cues present both in the question and in the text sources. Qsearch comprises two main contributions: a deep neural network model designed to extract quantity-centric tuples from text sources, and a novel query-matching model for finding and ranking matching tuples.
    ‱ Second, to additionally tap heterogeneous tables, we present QuTE [Ho et al., 2021a, Ho et al., 2021b], a system for extracting quantity information from web sources, in particular ad-hoc web tables in HTML pages. QuTE contributes a method for linking quantity and entity columns that leverages external text sources. For question answering, we contextualise the extracted entity-quantity pairs with informative cues from the table and present a new method for consolidating and improving the ranking of answer candidates through inter-fact consistency.
    ‱ Third, we present QL [Ho et al., 2022], a recall-oriented method for enriching knowledge bases (KBs) with quantity facts. Modern KBs such as Wikidata or YAGO cover many entities and their relevant information, but often miss important quantity properties. QL is query-driven and based on iterative learning, with two main contributions for improving KB coverage: a query expansion method that captures a larger pool of fact candidates, and a self-consistency technique that takes the value distributions of quantities into account.
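
    To illustrate what "understanding" a quantity filter involves, here is a minimal rule-based sketch that parses the condition, value, and unit from a query and applies them to extracted (value, unit) facts. It is a deliberately naive, hypothetical stand-in for Qsearch's neural extraction and query-matching models; the pattern list and unit handling are illustrative only.

```python
import re

# Map surface comparators to a canonical form.
COMPARATORS = {"under": "<", "below": "<", "less than": "<",
               "over": ">", "above": ">", "more than": ">"}

def parse_quantity_filter(query):
    """Extract (comparator, value, unit) from a query such as
    'athletes who ran 200m under 20 seconds'."""
    pattern = r"(under|below|less than|over|above|more than)\s+\$?([\d.]+)\s*(\w+)?"
    m = re.search(pattern, query, flags=re.IGNORECASE)
    if m is None:
        return None
    comp, value, unit = m.groups()
    return COMPARATORS[comp.lower()], float(value), (unit or "").lower()

def matches(fact_value, fact_unit, qfilter):
    """Test an extracted (value, unit) fact against the parsed filter."""
    comp, value, unit = qfilter
    if fact_unit != unit:  # a real system would also convert units
        return False
    return fact_value < value if comp == "<" else fact_value > value

qf = parse_quantity_filter("athletes who ran 200m under 20 seconds")
print(qf)                             # ('<', 20.0, 'seconds')
print(matches(19.72, "seconds", qf))  # True
```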

    Eesti keele ĂŒldvaldkonna tekstide laia kattuvusega automaatne sĂŒndmusanalĂŒĂŒs (Broad-coverage automatic event analysis of general-domain Estonian texts)

    Due to massive-scale digitalisation processes and a switch from traditional written communication to digital written communication, vast amounts of natural language text are becoming machine-readable. Machine-readability holds the potential to ease the human effort of searching and organising large text collections, allowing applications such as automatic text summarisation and question answering. However, current tools for automatic text analysis do not reach the level of text understanding required to make these applications generic. It is hypothesised that automatic analysis of events in texts leads us closer to this goal, as many texts can be interpreted as stories/narratives that are decomposable into events. This thesis explores event analysis as a broad-coverage, general-domain automatic language analysis problem in Estonian, providing an investigation that starts from time-oriented event analysis and moves towards generic event analysis. We adapt the TimeML framework to Estonian, and create an automatic temporal expression tagger and a news corpus manually annotated for temporal semantics (event mentions, temporal expressions, and temporal relations) for the language; we analyse the consistency of human annotation of event mentions and temporal relations, and, finally, provide a preliminary study on event coreference resolution in Estonian news.
    The current work also makes suggestions on how future research can improve Estonian event and temporal semantic annotation, and the language resources developed in this work will allow experimentation with end-user applications (such as automatic answering of temporal questions) as well as provide a basis for developing automatic semantic analysis tools.
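
    To give a flavour of what a TimeML-style temporal expression tagger produces, the toy sketch below wraps matched expressions in TIMEX3 elements. The patterns are illustrative English examples, not the Estonian rules of the tagger built in this thesis, and the normalised `value` attribute of TIMEX3 is omitted.

```python
import re

# Illustrative patterns for temporal expressions (TimeML TIMEX3).
PATTERNS = [
    r"\b\d{4}-\d{2}-\d{2}\b",                 # ISO dates, e.g. 2023-03-05
    r"\b\d{1,2} (?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December) \d{4}\b",
    r"\byesterday\b|\btoday\b|\btomorrow\b",  # simple relative expressions
]

def tag_timex(text):
    """Wrap each matched temporal expression in a TIMEX3 element."""
    tid = 0
    for pattern in PATTERNS:
        def wrap(m):
            nonlocal tid
            tid += 1
            return f'<TIMEX3 tid="t{tid}" type="DATE">{m.group(0)}</TIMEX3>'
        text = re.sub(pattern, wrap, text, flags=re.IGNORECASE)
    return text

print(tag_timex("The election was held on 2023-03-05; results came today."))
```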

    Grounding event references in news

    Events are frequently discussed in natural language, and their accurate identification is central to language understanding. Yet they are diverse and complex in ontology and reference; computational processing hence proves challenging. News provides a shared basis for communication by reporting events. We perform several studies into news event reference. One annotation study characterises each news report in terms of its update and topic events, but finds that topic is better considered through explicit references to background events. In this context, we propose the event linking task, which, analogous to named entity linking or disambiguation, models the grounding of references to notable events. It defines the disambiguation of an event reference as a link to the archival article that first reports it. When two references are linked to the same article, they need not be references to the same event. Event linking aims to provide an intuitive approximation to coreference, erring on the side of over-generation in contrast with the literature. The task is also distinguished in considering event references from multiple perspectives over time. We diagnostically evaluate the task by first linking references to past, newsworthy events in news and opinion pieces to an archive of the Sydney Morning Herald. The intensive annotation results in only a small corpus of 229 distinct links. However, we observe that a number of hyperlinks targeting online news correspond to event links. We thus acquire two large corpora of hyperlinks at very low cost. From these we learn weights for temporal and term-overlap features in a retrieval system. These noisy data lead to significant performance gains over a bag-of-words baseline. While our initial system can accurately predict many event links, most will require deep linguistic processing for their disambiguation.
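
    A minimal sketch of the retrieval step described above, assuming the feature weights have already been learned from the hyperlink corpora: each candidate archive article is scored by a weighted combination of term overlap with the reference's context and temporal proximity to the referring article. The feature definitions and weights here are illustrative, not the exact ones used in this work.

```python
import math
from datetime import date

def term_overlap(ref_terms, art_terms):
    """Jaccard overlap between the reference's context terms and an
    archive article's terms (a stand-in for the term-overlap features)."""
    union = ref_terms | art_terms
    return len(ref_terms & art_terms) / len(union) if union else 0.0

def temporal_proximity(ref_date, art_date, scale_days=30.0):
    """Decay with the gap between the referring article's date and the
    candidate archive article's publication date."""
    gap = abs((ref_date - art_date).days)
    return math.exp(-gap / scale_days)

def score(article, reference, w_term=0.7, w_time=0.3):
    # w_term / w_time stand in for the weights learned from hyperlink data
    return (w_term * term_overlap(reference["terms"], article["terms"])
            + w_time * temporal_proximity(reference["date"], article["date"]))

def link(reference, archive):
    """Return the archive article that best grounds the event reference."""
    return max(archive, key=lambda art: score(art, reference))

archive = [
    {"id": 1, "terms": {"earthquake", "newcastle"}, "date": date(1989, 12, 28)},
    {"id": 2, "terms": {"earthquake", "chile"}, "date": date(2010, 2, 27)},
]
ref = {"terms": {"newcastle", "earthquake", "anniversary"},
       "date": date(1990, 12, 28)}
print(link(ref, archive)["id"])  # 1
```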

    Entity centric neural models for natural language processing

    This thesis explores how to enhance natural language understanding by incorporating entity information into neural network models. It tackles three key questions:

    1. Leveraging entities for understanding tasks: This work introduces Entity-GCN, a model that performs multi-step reasoning on a graph where nodes represent entity mentions and edges represent relationships. This method achieved state-of-the-art results on a multi-document question-answering dataset.

    2. Identifying and disambiguating entities using large language models: This research proposes a novel system that retrieves entities by generating their names token-by-token, overcoming limitations of traditional methods and significantly reducing the memory footprint. This approach is also extended to a multilingual setting and further optimized for efficiency.

    3. Interpreting and controlling entity knowledge within models: This thesis presents a post-hoc interpretation technique to analyze how decisions are made across layers in neural models, allowing for visualization and analysis of knowledge representation. Additionally, a method for editing factual knowledge about entities is proposed, enabling correction of model predictions without costly retraining.
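
    The token-by-token entity retrieval in point 2 can be sketched as constrained decoding over a prefix trie of entity names: at each step, only tokens that extend some valid name may be generated, so the decoder can never produce a name outside the catalogue, and the trie replaces a large dense index. The code below is our own illustration of that idea, with a trivial scorer standing in for the language model; it is not the thesis's implementation.

```python
class PrefixTrie:
    """Trie over tokenised entity names; `allowed` lists the tokens
    that can legally follow a given prefix."""

    def __init__(self, names):
        self.root = {}
        for name in names:
            node = self.root
            for tok in name.split():
                node = node.setdefault(tok, {})
            node["<end>"] = {}  # marks a complete entity name

    def allowed(self, prefix):
        node = self.root
        for tok in prefix:
            node = node[tok]
        return list(node)

def generate(trie, lm_score, max_len=8):
    """Greedy constrained decoding: at each step pick the
    highest-scoring token among those the trie allows."""
    prefix = []
    for _ in range(max_len):
        best = max(trie.allowed(prefix), key=lambda tok: lm_score(prefix, tok))
        if best == "<end>":
            break
        prefix.append(best)
    return " ".join(prefix)

trie = PrefixTrie(["New York City", "New York Times", "York Minster"])
# Toy scorer: prefer tokens that occur in the mention's context.
context = {"newspaper", "times"}
lm_score = lambda prefix, tok: 1.0 if tok.lower() in context else 0.0
print(generate(trie, lm_score))  # New York Times
```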
