11 research outputs found
End-to-end Learning for Short Text Expansion
Effectively making sense of short texts is a critical task for many real
world applications such as search engines, social media services, and
recommender systems. The task is particularly challenging as a short text
contains very sparse information, often too sparse for a machine learning
algorithm to pick up useful signals. A common practice for analyzing short text
is to first expand it with external information, which is usually harvested
from a large collection of longer texts. In literature, short text expansion
has been done with all kinds of heuristics. We propose an end-to-end solution
that automatically learns how to expand short text to optimize a given learning
task. A novel deep memory network is proposed to automatically find relevant
information from a collection of longer documents and reformulate the short
text through a gating mechanism. Using short text classification as a
demonstrating task, we show that the deep memory network significantly
outperforms classical text expansion methods with comprehensive experiments on
real world data sets.
Comment: KDD'2017
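The retrieve-and-gate idea described in this abstract can be sketched in a few lines. This is a hedged, illustrative single-hop version with random toy weights; the function names, dimensions, and sigmoid gate form are our assumptions, not the paper's actual architecture.

```python
# Illustrative sketch (not the paper's implementation): one memory-network
# "hop" that retrieves evidence for a short text and merges it via a gate.
# Dimensions and the random weights below are toy assumptions.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def expand_short_text(query, memory, W_g):
    """query: (d,) embedding of the short text.
    memory: (n, d) embeddings of candidate long documents.
    W_g: (d, 2d) gate weights. Returns the expanded representation."""
    scores = memory @ query                 # relevance of each memory slot
    attn = softmax(scores)                  # soft attention over documents
    retrieved = attn @ memory               # weighted evidence vector, shape (d,)
    gate = 1.0 / (1.0 + np.exp(-(W_g @ np.concatenate([query, retrieved]))))
    return gate * query + (1.0 - gate) * retrieved  # gated reformulation

d, n = 8, 5
q = rng.normal(size=d)
mem = rng.normal(size=(n, d))
Wg = rng.normal(size=(d, 2 * d))
expanded = expand_short_text(q, mem, Wg)
print(expanded.shape)  # → (8,)
```

In the paper this whole pipeline is trained end-to-end against the downstream task loss, which is what distinguishes it from heuristic expansion.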
Entity Query Feature Expansion Using Knowledge Base Links
Recent advances in automatic entity linking and knowledge base
construction have resulted in entity annotations for document and
query collections. For example, annotations of entities from large
general purpose knowledge bases, such as Freebase and the Google
Knowledge Graph. Understanding how to leverage these entity
annotations of text to improve ad hoc document retrieval is an open
research area. Query expansion is a commonly used technique to
improve retrieval effectiveness. Most previous query expansion
approaches focus on text, mainly using unigram concepts. In this
paper, we propose a new technique, called entity query feature
expansion (EQFE) which enriches the query with features from
entities and their links to knowledge bases, including structured
attributes and text. We experiment using both explicit query entity
annotations and latent entities. We evaluate our technique on TREC
text collections automatically annotated with knowledge base entity
links, including the Google Freebase Annotations (FACC1) data.
We find that entity-based feature expansion results in significant
improvements in retrieval effectiveness over state-of-the-art text
expansion approaches.
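The core move of enriching a query with knowledge-base features can be sketched simply. This is an illustrative simplification in the spirit of EQFE, not the paper's model; the feature weights, field names, and the knowledge-base entry below are made-up examples.

```python
# Illustrative sketch of entity-based query feature expansion: the query's
# unigram features are augmented with features drawn from knowledge-base
# entries of linked entities. Weights and KB content are toy assumptions.
def expand_query(query_terms, linked_entities, knowledge_base):
    """Enrich a bag-of-words query with entity-derived features."""
    features = {("text", t): 1.0 for t in query_terms}
    for entity in linked_entities:
        entry = knowledge_base.get(entity, {})
        for alias in entry.get("aliases", []):
            features[("alias", alias)] = 0.5      # down-weighted expansion
        for etype in entry.get("types", []):
            features[("type", etype)] = 0.3
        for term in entry.get("description", []):
            features[("desc", term)] = 0.2
    return features

kb = {"Neil_Armstrong": {
    "aliases": ["armstrong"],
    "types": ["/people/astronaut"],
    "description": ["apollo", "moon", "nasa"],
}}
f = expand_query(["first", "moon", "landing"], ["Neil_Armstrong"], kb)
print(len(f))  # → 8
```

The expanded feature set would then be fed to a retrieval model; in the paper the entity annotations come from explicit or latent entity linking over FACC1-style data rather than a hand-built dictionary.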
Mining Social Media to Extract Structured Knowledge through Semantic Roles
Semantics is a well-kept secret in texts, accessible only to humans. Artificial Intelligence struggles to enrich machines with human-like features, so accessing this treasure and sharing it with computers is one of the main challenges the computational linguistics domain faces today. In order to teach computers to understand humans, language models need to be specified and created from human knowledge. While still far from completely decoding hidden messages in political speeches, computer scientists and linguists have joined efforts to make language easier for machines to understand. This paper introduces the VoxPopuli platform, an instrument to collect user-generated content, analyze it, and generate a map of semantically related concepts by capturing crowd intelligence.
A Sheaf Model of Contradictions and Disagreements. Preliminary Report and Discussion
We introduce a new formal model -- based on the mathematical construct of
sheaves -- for representing contradictory information in textual sources. This
model has the advantage of letting us (a) identify the causes of the
inconsistency; (b) measure how strong it is; (c) and do something about it,
e.g. suggest ways to reconcile inconsistent advice. This model naturally
represents the distinction between contradictions and disagreements. It is
based on the idea of representing natural language sentences as formulas with
parameters sitting on lattices, creating partial orders based on predicates
shared by theories, and building sheaves on these partial orders with products
of lattices as stalks. Degrees of disagreement are measured by the existence of
global and local sections.
Limitations of the sheaf approach and connections to recent work in natural
language processing, as well as the topics of contextuality in physics, data
fusion, topological data analysis and epistemology are also discussed.
Comment: This paper was presented at ISAIM 2018, International Symposium on Artificial Intelligence and Mathematics, Fort Lauderdale, FL, January 3-5, 2018. Minor typographical errors have been corrected.
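One minimal way to read the construction the abstract sketches, in our own notation (an illustrative assumption, not the paper's exact definitions):

```latex
% Illustrative formalization: contexts form a poset, stalks are products of
% parameter lattices, and agreement is witnessed by sections.
Let $(P,\le)$ be the partial order of contexts and let $F$ assign to each
$p \in P$ a stalk $F(p) = L_1 \times \cdots \times L_k$, a product of
parameter lattices, with restriction maps $\rho_{pq} : F(p) \to F(q)$ for
$q \le p$, satisfying $\rho_{pp} = \mathrm{id}$ and
$\rho_{qr} \circ \rho_{pq} = \rho_{pr}$ for $r \le q \le p$.
A family $(s_p)_{p \in P}$ with $s_p \in F(p)$ is a \emph{global section}
if $\rho_{pq}(s_p) = s_q$ whenever $q \le p$. Sources agree when a global
section exists; they disagree when only sections over parts of $P$ exist;
and the degree of disagreement tracks how far local sections fail to glue.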
Understanding User Intent Modeling for Conversational Recommender Systems: A Systematic Literature Review
Context: User intent modeling is a crucial process in Natural Language
Processing that aims to identify the underlying purpose behind a user's
request, enabling personalized responses. With a vast array of approaches
introduced in the literature (over 13,000 papers in the last decade),
understanding the related concepts and commonly used models in AI-based systems
is essential. Method: We conducted a systematic literature review to gather
data on models typically employed in designing conversational recommender
systems. From the collected data, we developed a decision model to assist
researchers in selecting the most suitable models for their systems.
Additionally, we performed two case studies to evaluate the effectiveness of
our proposed decision model. Results: Our study analyzed 59 distinct models and
identified 74 commonly used features. We provided insights into potential model
combinations, trends in model selection, quality concerns, evaluation measures,
and frequently used datasets for training and evaluating these models.
Contribution: Our study contributes practical insights and a comprehensive
understanding of user intent modeling, empowering the development of more
effective and personalized conversational recommender systems. With the proposed decision model, researchers can assess fitting intent modeling frameworks more systematically and efficiently.
Factoid question answering for spoken documents
In this dissertation, we present a factoid question answering system, specifically tailored for Question Answering (QA) on spoken documents.
This work explores, for the first time, which techniques can be robustly adapted from the usual QA on written documents to the more difficult spoken documents scenario. More specifically, we study new information retrieval (IR) techniques designed for speech, and utilize several levels of linguistic information for the speech-based QA task. These include named-entity detection with phonetic information, syntactic parsing applied to speech transcripts, and the use of coreference resolution.
Our approach is largely based on supervised machine learning techniques, with special focus on the answer extraction step, and makes little use of handcrafted knowledge. Consequently, it should be easily adaptable to other domains and languages.
In the work resulting from this thesis, we initiated and coordinated the creation of an evaluation framework for the task of QA on spoken documents. The framework, named QAst, provides multilingual corpora, evaluation questions, and answer keys. These corpora were used in the QAst evaluations held at the CLEF workshop in 2007, 2008 and 2009, thus helping to develop state-of-the-art techniques for this particular topic.
The presented QA system and all its modules are extensively evaluated on the European Parliament Plenary Sessions
English corpus, composed of manual transcripts and automatic transcripts obtained by three different Automatic Speech Recognition (ASR) systems that exhibit significantly different word error rates. This data belongs to the CLEF 2009 track for QA on speech transcripts.
The main results confirm that syntactic information is very useful for learning to rank question candidates, improving results on both manual and automatic transcripts unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the state of the art on this corpus, confirming the validity of our approach.
In this thesis, we present a factoid Question Answering (QA) system, specifically tailored to work with spoken documents.
In its development we explore, for the first time, which of the techniques usually employed in QA on written documents are robust enough to work in the more difficult scenario of spoken documents. More specifically, we study new Information Retrieval (IR) methods designed to deal with speech, and we use several levels of linguistic information, namely: Named Entity detection using phonetic information, syntactic parsing applied to speech transcripts, and the use of a coreference detection and resolution sub-system.
Our approach to the problem relies largely on supervised Machine Learning techniques, focused especially on the answer extraction step, and uses as little human-crafted knowledge as possible. Consequently, the whole QA process can be adapted to other domains or languages with relative ease.
An additional result of the work behind this thesis is that we initiated and coordinated the creation of an evaluation framework for the task of QA on spoken documents. This framework, named QAst (Question Answering on Speech Transcripts), provides a multilingual corpus of spoken documents, sets of evaluation questions, and their correct answers. These data were used in the QAst evaluations held within the CLEF conferences in 2007, 2008 and 2009, thereby promoting and helping to build a state of the art of techniques for this particular problem.
The QA system we present, and all of its submodules, have been evaluated extensively using the English EPPS corpus (transcripts of the European Parliament Plenary Sessions), which contains manual transcripts of all speeches as well as automatic transcripts obtained with three different Automatic Speech Recognition (ASR) systems. The recognizers have different characteristics and results, allowing a quantitative and qualitative evaluation of the task. These data belong to the QAst 2009 evaluation.
The main results of our work confirm that syntactic information is very useful for automatically learning to rank candidate answers, improving previous results on both manual and automatic transcripts, unless the ASR quality is very low. Overall, the performance of our system is comparable to or better than the other state-of-the-art systems, thus confirming the validity of our approach.
RANCANG BANGUN APLIKASI QUESTION ANSWERING (QA) SYSTEM PADA TERJEMAHAN AL QURAN MENGGUNAKAN EPHYRA FRAMEWORK
The Quran is guidance that Muslims are obliged to follow. Many of the religious problems of everyday life are addressed in the Quran. In the Quran, a given issue does not refer to just one verse or one chapter, so manual searching takes a long time given the large number of verses and chapters the Quran contains. An application is therefore needed that can easily recognize and search for the issue a user needs, so that the system can display Quranic verses as references. On this basis, in this research a QA System application for Quran translations was built using the Ephyra Framework, with a vector space model as the IR system, implemented in PHP. The research used three question categories, covering person, place and time, with the question words siapa, siapakah, kapan, kapankah, dimana, dimanakah, kemana, kemanakah, darimana, and darimanakah (Indonesian forms of "who", "when", "where", "where to", and "where from"). Overall, the QA System application achieved a precision of 42.31%.
Keywords: Ephyra Framework, Information Retrieval, Vector Space Model, Question Answering System, Quran Translation
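The vector space model this abstract relies on can be sketched compactly: TF-IDF weighting plus cosine similarity to rank documents against a question. The original system was built in PHP; this is a minimal Python sketch, and the two example "documents" are placeholders, not actual Quran translation text.

```python
# Minimal vector space model sketch: TF-IDF term weights and cosine
# similarity, used to rank candidate documents against a query.
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build a sparse TF-IDF vector (dict) for each token list in docs."""
    df = Counter(t for d in docs for t in set(d))
    n = len(docs)
    return [{t: tf * math.log(n / df[t]) for t, tf in Counter(d).items()}
            for d in docs]

def cosine(u, v):
    dot = sum(u[t] * v.get(t, 0.0) for t in u)
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

docs = [["moses", "spoke", "to", "pharaoh"],
        ["noah", "built", "the", "ark"]]
query = ["who", "spoke", "to", "pharaoh"]
vecs = tfidf_vectors(docs + [query])          # vectorize docs and query together
scores = [cosine(vecs[-1], v) for v in vecs[:-1]]
print(scores.index(max(scores)))  # → 0, the best-matching document
```

In the full system, the Ephyra pipeline classifies the question type (person, place, time) before this retrieval step narrows down candidate verses.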
Entity-based Enrichment for Information Extraction and Retrieval
The goal of this work is to leverage cross-document entity relationships for improved understanding of queries and documents. We define an entity to be a thing or concept that exists in the world, such as a politician, a battle, a film, or a color. Entity-based enrichment (EBE) is a new expansion model for both queries and documents using features from similar entity mentions in the document collection and external knowledge resources. It uses task-specific features from entities beyond words, including name aliases, fine-grained entity types, categories, and relationships to other entities. EBE addresses the problem of sparse or noisy local evidence due to multiple topics, implicit context, or informal writing. With the ultimate goal of improving information retrieval effectiveness, we start from unstructured text and, through information extraction, build up rich entity-based representations linked to external knowledge resources. We study the application of entity-based enrichment at each step in the pipeline: 1) named entity recognition, 2) entity linking, and 3) ad hoc document retrieval. The empirical results for EBE on each of these tasks show significant improvements.
Our first application of entity-based enrichment is the task of detecting entities in named entity recognition. We enrich the representation of observed words likely to represent entities. We perform weighted feature copying of recognition features from similar tokens in the corpus and external collections. The evaluation shows statistically significant improvements in in-domain newswire accuracy and demonstrates that the models are more robust on out-of-domain data.
In the second part of this work, we apply EBE to the task of entity linking. The proposed entity linking method for disambiguating the detected mentions to entries in an external knowledge base is based on information retrieval. The neighborhood relevance model, an enrichment model, identifies salient associations between an entity mention and other entity mentions in the document. The results show that the enrichment model outperforms other context models and results in a system that is competitive with leading methods.
Using the constructed entity representation, the final task is ad hoc document retrieval. We enrich the query representation with entity features. Retrieval is performed over documents annotated with entities linked to Wikipedia and Freebase by our system, as well as the publicly available Google FACC1 annotations. To effectively leverage linked entity features, we extend dependency-based retrieval models to include structured attributes. We also define a new query-specific entity context model, which builds a model for disambiguated entities from retrieved documents. Our results demonstrate significant improvements over competitive query expansion baselines on several standard test collections.
Automatic population of knowledge bases with multimodal data about named entities
Knowledge bases are of great importance for Web search, recommendations, and many Information Retrieval tasks. However, maintaining them for not so popular entities is often a bottleneck. Typically, such entities have limited textual coverage and only a few ontological facts. Moreover, these entities are not well populated with multimodal data, such as images, videos, or audio recordings.
The goals in this thesis are (1) to populate a given knowledge base with multimodal data about entities, such as images or audio recordings, and (2) to ease the task of maintaining and expanding the textual knowledge about a given entity, by recommending valuable text excerpts to the contributors of knowledge bases.
The thesis makes three main contributions. The first two contributions concentrate on finding images of named entities with high precision, high recall, and high visual diversity. Our main focus are less popular entities, for which the image search engines fail to retrieve good results. Our methods utilize background knowledge about the entity, such as ontological facts or a short description, and a visual-based image similarity to rank and diversify a set of candidate images.
Our third contribution is an approach for extracting text contents related to a given entity. It leverages a language-model-based similarity between a short description of the entity and the text sources, and solves a budget-constrained optimization program without any assumptions on the text structure. Moreover, our approach is also able to reliably extract entity-related audio excerpts from news podcasts. We derive the time boundaries from the usually very noisy audio transcriptions.
Knowledge bases are of great importance for Web search, recommendation services, and many other Information Retrieval tasks. However, maintaining them turns out to be difficult for less popular entities. Typically, the number of texts about such entities is limited, and there are only a few ontological facts. Moreover, little multimodal data, such as images, videos, or audio recordings, is available for these entities.
The goals of this dissertation are therefore (1) to enrich a given knowledge base with multimodal data about entities, such as images or audio recordings, and (2) to ease the task of maintaining and expanding the texts about a given entity by suggesting useful text excerpts to the contributors of a knowledge base.
This dissertation makes three main contributions. The first two contributions concern finding images of named entities with high precision, high recall, and high visual diversity. The main focus is on less popular entities, for which image search engines usually do not deliver good results. Our methods use background knowledge about the entity, such as ontological facts or a short description, as well as a visual image-similarity measure, to rank and diversify a set of candidate images.
The third contribution is an approach for extracting text contents that relate to a given entity. The approach uses a language-model-based similarity measure between a short description of the entity and the text sources, and solves a budget-constrained optimization problem that makes no assumptions about the text structure. Moreover, the approach is able to reliably extract audio recordings related to an entity from news podcasts. For this, time boundaries are derived from the usually very noisy audio transcriptions.
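The budget-constrained excerpt selection can be sketched with a simple greedy heuristic. This is an illustrative simplification: the similarity scores and excerpts below are made up, and the thesis solves an optimization program rather than this greedy ranking.

```python
# Illustrative greedy sketch of budget-constrained excerpt selection:
# pick entity-related text excerpts by similarity-per-cost until a length
# budget is exhausted. Scores and texts are toy stand-ins for the
# language-model-based similarity used in the thesis.
def select_excerpts(excerpts, budget):
    """excerpts: list of (text, similarity_to_entity) pairs.
    budget: maximum total number of characters to select."""
    ranked = sorted(excerpts,
                    key=lambda e: e[1] / max(len(e[0]), 1),
                    reverse=True)  # best similarity per character first
    chosen, used = [], 0
    for text, score in ranked:
        if used + len(text) <= budget:
            chosen.append(text)
            used += len(text)
    return chosen

pool = [("short relevant note", 0.9),
        ("a much longer but only mildly related passage about the topic", 0.4),
        ("irrelevant aside", 0.05)]
picked = select_excerpts(pool, budget=30)
print(picked)  # → ['short relevant note']
```

A real system would compute the similarity scores from a language model over the entity's short description, and could replace the greedy loop with an exact knapsack-style solver when the candidate pool is small.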