Neural Architecture for Question Answering Using a Knowledge Graph and Web Corpus
In Web search, entity-seeking queries often trigger a special Question
Answering (QA) system. It may use a parser to interpret the question to a
structured query, execute that on a knowledge graph (KG), and return direct
entity responses. QA systems based on precise parsing tend to be brittle: minor
syntax variations may dramatically change the response. Moreover, KG coverage
is patchy. At the other extreme, a large corpus may provide broader coverage,
but in an unstructured, unreliable form. We present AQQUCN, a QA system that
gracefully combines KG and corpus evidence. AQQUCN accepts a broad spectrum of
query syntax, from well-formed questions to short 'telegraphic' keyword
sequences. In the face of inherent query ambiguities, AQQUCN aggregates signals
from KGs and large corpora to directly rank KG entities, rather than commit to
one semantic interpretation of the query. AQQUCN models the ideal
interpretation as an unobservable or latent variable. Interpretations and
candidate entity responses are scored as pairs, by combining signals from
multiple convolutional networks that operate collectively on the query, KG and
corpus. On four public query workloads, amounting to over 8,000 queries with
diverse query syntax, we see a 5-16% absolute improvement in mean average
precision (MAP), compared to the entity ranking performance of recent systems.
Our system is also competitive at entity set retrieval, almost doubling F1
scores for challenging short queries.
Comment: Accepted to Information Retrieval Journal
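The idea of ranking entities directly while treating the interpretation as a latent variable can be illustrated with a minimal sketch. It assumes that scores for (interpretation, entity) pairs are already available (in AQQUCN they come from convolutional networks, which are not modeled here) and that each entity keeps the score of its best interpretation. The max aggregation, the function name `rank_entities`, and the example scores are assumptions made for illustration; the abstract does not specify them.

```python
from collections import defaultdict


def rank_entities(pair_scores):
    """Rank entities from scores of (interpretation, entity) pairs.

    pair_scores: dict mapping (interpretation, entity) -> float.
    The interpretation is treated as latent: no single interpretation is
    chosen for the query. Each entity keeps the score of its best-scoring
    interpretation (max aggregation is an assumption of this sketch).
    """
    best = defaultdict(lambda: float("-inf"))
    for (_interpretation, entity), score in pair_scores.items():
        best[entity] = max(best[entity], score)
    # Highest score first; ties broken by entity name for determinism.
    return sorted(best.items(), key=lambda kv: (-kv[1], kv[0]))


# Hypothetical pair scores for an ambiguous keyword query.
example_scores = {
    ("type=city", "Paris"): 0.9,
    ("type=person", "Paris"): 0.1,
    ("type=city", "Paris_Texas"): 0.6,
    ("type=person", "Paris_Hilton"): 0.4,
}
ranking = rank_entities(example_scores)
```

Because aggregation happens per entity, an entity supported by any plausible interpretation can still rank highly, which is the sense in which the system avoids committing to one reading of the query.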
SMAPH: A Piggyback Approach for Entity-Linking in Web Queries
We study the problem of linking the terms of a web-search query to a semantic representation given by the set of entities (a.k.a. concepts) mentioned in it. We introduce SMAPH, a system that performs this task using the information coming from a web search engine, an approach we call “piggybacking.” We employ search engines to alleviate the noise and irregularities that characterize the language of queries. Snippets returned as search results also provide a context for the query that makes it easier to disambiguate the meaning of the query. From the search results, SMAPH builds a set of candidate entities with high coverage. This set is filtered by linking back the candidate entities to the terms occurring in the input query, ensuring high precision. A greedy disambiguation algorithm performs this filtering; it maximizes the coherence of the solution by iteratively discovering the pertinent entities mentioned in the query. We propose three versions of SMAPH that outperform state-of-the-art solutions on the known benchmarks and on the GERDAQ dataset, a novel dataset that we have built specifically for this problem via crowd-sourcing and that we make publicly available.
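The filtering step, in which candidate entities are linked back to query terms by a greedy procedure, can be sketched in simplified form. This is not SMAPH's actual algorithm: the sketch assumes each candidate is a (start, end, entity, score) tuple over query token positions, and it greedily keeps the highest-scoring candidates whose spans do not overlap. The coherence objective mentioned in the abstract is replaced here by a plain per-candidate score, and all names and scores are hypothetical.

```python
def greedy_link(candidates):
    """Greedily pick non-overlapping (start, end, entity, score) candidates.

    Spans are half-open token ranges [start, end) over the query.
    Candidates are visited from highest to lowest score, and one is kept
    only if its span does not overlap any span already chosen.
    """
    chosen = []
    for start, end, entity, score in sorted(candidates, key=lambda c: -c[3]):
        if all(end <= s or start >= e for s, e, _, _ in chosen):
            chosen.append((start, end, entity, score))
    # Return the annotations in query order.
    return sorted(chosen)


# Hypothetical candidates for the query "armstrong moon landing".
query_candidates = [
    (0, 1, "Neil_Armstrong", 0.8),
    (0, 1, "Lance_Armstrong", 0.5),
    (1, 3, "Moon_landing", 0.9),
    (1, 2, "Moon", 0.6),
]
links = greedy_link(query_candidates)
```

In this example the two-token span for "moon landing" is chosen first, which blocks the overlapping single-token candidate "Moon"; "armstrong" is then linked to its higher-scoring candidate.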
Joint models for information and knowledge extraction
Information and knowledge extraction from natural language text is a key asset for question answering, semantic search, automatic summarization, and other machine reading applications. There are many sub-tasks involved, such as named entity recognition, named entity disambiguation, co-reference resolution, relation extraction, event detection, discourse parsing, and others. Solving these tasks is challenging as natural language text is unstructured, noisy, and ambiguous. Key challenges, which focus on identifying and linking named entities, as well as discovering relations between them, include:
• High NERD Quality. Named entity recognition and disambiguation, NERD for short, are performed first in the extraction pipeline. Their results may affect other downstream tasks.
• Coverage vs. Quality of Relation Extraction. Model-based information extraction methods achieve high extraction quality at low coverage, whereas open information extraction methods capture relational phrases between entities. However, the quality of the latter is degraded by non-canonicalized and noisy output. These limitations need to be overcome.
• On-the-fly Knowledge Acquisition. Real-world applications such as question answering, monitoring content streams, etc. demand on-the-fly knowledge acquisition. Building such an end-to-end system is challenging because it requires high throughput, high extraction quality, and high coverage.
This dissertation addresses the above challenges, developing new methods to advance the state of the art. The first contribution is a robust model for joint inference between entity recognition and disambiguation. The second contribution is a novel model for relation extraction and entity disambiguation on Wikipedia-style text. The third contribution is an end-to-end system for constructing query-driven, on-the-fly knowledge bases.