Named entity recognition and classification in search queries
Named Entity Recognition and Classification is the task of extracting instances of different
entity classes, such as person, location, or company, from text. This task has recently been
applied to web search queries in order to better understand their semantics, where a search
query consists of linguistic units that users submit to a search engine to convey their search
need. Discovering and analysing the linguistic units comprising a search query enables search
engines to reveal and meet users' search intents. As a result, recent research has concentrated
on analysing the constituent units comprising search queries. However, search queries are
short, unstructured, and ambiguous. This thesis therefore presents an approach to detecting
and classifying named entities in which each query is augmented with the text snippets of its
search results.
The thesis makes the following contributions:
1. A novel method for detecting candidate named entities in search queries, which utilises
both query grammatical annotation and query segmentation.
2. A novel method to classify the detected candidate entities into a set of target entity
classes, by using a seed expansion approach; the method presented exploits the representation
of the sets of contextual clues surrounding the entities in the snippets as vectors
in a common vector space.
3. An exploratory analysis of three main categories of search refiners (nouns, verbs, and
adjectives) that users often incorporate in entity-centric queries in order to further refine
the entity-related search results.
4. A taxonomy of named entities derived from a search engine query log.
Experiments on a large commercial query log provide evidence that the work presented herein
is competitive with existing research in the field of entity recognition and classification
in search queries.
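The second contribution above, comparing entities by the contextual clues around them in a common vector space, can be illustrated with a minimal sketch. This is not the thesis's implementation: the bag-of-words representation, the cosine similarity measure, and the names `context_vector`, `cosine`, and `classify` are all assumptions chosen for illustration, and the snippets in the usage note are made up.

```python
import math
from collections import Counter


def context_vector(snippets, entity):
    """Count the words that surround `entity` in the given snippets.

    Assumption: a plain bag-of-words over whitespace tokens stands in for
    the "contextual clues" mentioned in the abstract.
    """
    entity_words = set(entity.lower().split())
    counts = Counter()
    for snippet in snippets:
        for word in snippet.lower().split():
            if word not in entity_words:
                counts[word] += 1
    return counts


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(candidate_vec, class_vectors):
    """Return the class whose seed-derived vector is closest to the candidate."""
    return max(class_vectors, key=lambda c: cosine(candidate_vec, class_vectors[c]))
```

In use, each target class would get a vector built from the snippets of a few seed entities, and a detected candidate would be assigned to the most similar class:

```python
class_vectors = {
    "person": context_vector(
        ["barack obama was born in hawaii", "obama is a politician born 1961"],
        "barack obama",
    ),
    "location": context_vector(["paris is the capital city of france"], "paris"),
}
candidate = context_vector(
    ["angela merkel is a politician born in hamburg"], "angela merkel"
)
label = classify(candidate, class_vectors)
```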
Target Type Identification for Entity-Bearing Queries
Identifying the target types of entity-bearing queries can help improve
retrieval performance as well as the overall search experience. In this work,
we address the problem of automatically detecting the target types of a query
with respect to a type taxonomy. We propose a supervised learning approach with
a rich variety of features. Using a purpose-built test collection, we show that
our approach outperforms existing methods by a remarkable margin. This is an
extended version of the article published with the same title in the
Proceedings of SIGIR'17.
Comment: Extended version of SIGIR'17 short paper, 5 pages
Dublin City University at QA@CLEF 2008
We describe our participation in Multilingual Question Answering at CLEF 2008 using German and English as our source and target languages respectively. The system was built using UIMA (Unstructured Information Management Architecture) as the underlying framework.
"i have a feeling trump will win..................": Forecasting Winners and Losers from User Predictions on Twitter
Social media users often make explicit predictions about upcoming events.
Such statements vary in the degree of certainty the author expresses toward the
outcome: "Leonardo DiCaprio will win Best Actor" vs. "Leonardo DiCaprio may win"
or "No way Leonardo wins!". Can popular beliefs on social media predict who
will win? To answer this question, we build a corpus of tweets annotated for
veridicality on which we train a log-linear classifier that detects positive
veridicality with high precision. We then forecast uncertain outcomes using the
wisdom of crowds, by aggregating users' explicit predictions. Our method for
forecasting winners is fully automated, relying only on a set of contenders as
input. It requires no training data of past outcomes and outperforms sentiment
and tweet volume baselines on a broad range of contest prediction tasks. We
further demonstrate how our approach can be used to measure the reliability of
individual accounts' predictions and retrospectively identify surprise
outcomes.
Comment: Accepted at EMNLP 2017 (long paper)
Naive Bayes Classification in The Question and Answering System
A question and answering (QA) system is a system that answers questions based on collections of unstructured text or text in the form of human language. In general, a QA system consists of four stages: question analysis, document selection, passage retrieval, and answer extraction. In this study we added two processes: document classification and passage classification. We use Naïve Bayes for classification, Dynamic Passage Partitioning for finding answers, and Lucene for document selection. The experiment was conducted using 100 questions over 3000 documents related to disease, and the results were compared with a system that does not use the classification process. In the tests, the system works best when using the 10 most relevant documents, the 5 passages with the highest scores, and the 10 answers with the closest distance. The Mean Reciprocal Rank (MRR) value for the QA system with classification is 0.41960, which is 4.9% better than the MRR value for the QA system without classification.
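For readers unfamiliar with the metric reported above, Mean Reciprocal Rank averages, over all questions, the reciprocal of the rank at which the first correct answer appears. A minimal sketch follows, assuming each question is summarised by the 1-based rank of its first correct answer (or `None` when no correct answer is returned); the function name and input format are illustrative and not taken from the paper.

```python
def mean_reciprocal_rank(first_correct_ranks):
    """Compute MRR from the rank of the first correct answer per question.

    `first_correct_ranks` holds one entry per question: a 1-based rank, or
    None when the system returned no correct answer (contributing 0).
    """
    if not first_correct_ranks:
        return 0.0
    total = sum(1.0 / rank for rank in first_correct_ranks if rank is not None)
    return total / len(first_correct_ranks)
```

For example, four questions answered correctly at ranks 1, 2, never, and 4 give (1 + 0.5 + 0 + 0.25) / 4 = 0.4375.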
Same but Different: Distant Supervision for Predicting and Understanding Entity Linking Difficulty
Entity Linking (EL) is the task of automatically identifying entity mentions
in a piece of text and resolving them to a corresponding entity in a reference
knowledge base like Wikipedia. There is a large number of EL tools available
for different types of documents and domains, yet EL remains a challenging task
where the lack of precision on particularly ambiguous mentions often spoils the
usefulness of automated disambiguation results in real applications. A priori
approximations of the difficulty of linking a particular entity mention can
facilitate flagging of critical cases as part of semi-automated EL systems,
while detecting latent factors that affect the EL performance, like
corpus-specific features, can provide insights on how to improve a system based
on the special characteristics of the underlying corpus. In this paper, we
first introduce a consensus-based method to generate difficulty labels for
entity mentions on arbitrary corpora. The difficulty labels are then exploited
as training data for a supervised classification task able to predict the EL
difficulty of entity mentions using a variety of features. Experiments over a
corpus of news articles show that EL difficulty can be estimated with high
accuracy, revealing also latent features that affect EL performance. Finally,
evaluation results demonstrate the effectiveness of the proposed method to
inform semi-automated EL pipelines.
Comment: Preprint of paper accepted for publication in the 34th ACM/SIGAPP
Symposium On Applied Computing (SAC 2019)
MIRACLE at GeoCLEF Query Parsing 2007: Extraction and Classification of Geographical Information
This paper describes the participation of the MIRACLE research consortium in the Query Parsing task of GeoCLEF 2007. Our system is composed of three main modules. First, the Named Geo-entity Identifier, whose objective is to perform the geo-entity identification and tagging, i.e., to extract the “where” component of the geographical query, should there be any. This module is based on a gazetteer built up from the Geonames geographical database and carries out a sequential process in three steps that consist of geo-entity recognition, geo-entity selection and query tagging. Then, the Query Analyzer parses this tagged query to identify the “what” and “geo-relation” components by means of a rule-based grammar. Finally, a two-level multiclassifier first decides whether the query is indeed a geographical query and, should it be positive, then determines the query type according to the type of information that the user is supposed to be looking for: map, yellow page or information. According to a strict evaluation criterion where a match should have all fields correct, our system reaches a precision value of 42.8% and a recall of 56.6% and our submission is ranked 1st out of 6 participants in the task. A detailed evaluation of the confusion matrices reveals that some extra effort must be invested in “user-oriented” disambiguation techniques to improve the first level binary classifier for detecting geographical queries, as it is a key component to eliminate many false-positives
TEQUILA: Temporal Question Answering over Knowledge Bases
Question answering over knowledge bases (KB-QA) poses challenges in handling complex questions that need to be decomposed into sub-questions. An important case, addressed here, is that of temporal questions, where cues for temporal relations need to be discovered and handled. We present TEQUILA, an enabler method for temporal QA that can run on top of any KB-QA engine. TEQUILA has four stages. It detects if a question has temporal intent. It decomposes and rewrites the question into non-temporal sub-questions and temporal constraints. Answers to sub-questions are then retrieved from the underlying KB-QA engine. Finally, TEQUILA uses constraint reasoning on temporal intervals to compute final answers to the full question. Comparisons against state-of-the-art baselines show the viability of our method
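The final stage described above, reasoning over temporal intervals, can be pictured with a small sketch. It is only an illustration under stated assumptions: the year-granularity `Interval` type, the three relation names, and the `filter_answers` helper are hypothetical and are not taken from TEQUILA itself.

```python
from typing import Dict, List, NamedTuple


class Interval(NamedTuple):
    """A closed time interval in years (assumed granularity)."""
    start: int
    end: int


def satisfies(candidate: Interval, relation: str, reference: Interval) -> bool:
    """Check whether `candidate` stands in `relation` to `reference`."""
    if relation == "BEFORE":
        return candidate.end <= reference.start
    if relation == "AFTER":
        return candidate.start >= reference.end
    if relation == "OVERLAP":
        return candidate.start <= reference.end and reference.start <= candidate.end
    raise ValueError(f"unknown temporal relation: {relation}")


def filter_answers(
    candidates: Dict[str, Interval], relation: str, reference: Interval
) -> List[str]:
    """Keep the sub-question answers whose interval meets the constraint."""
    return [
        answer
        for answer, interval in candidates.items()
        if satisfies(interval, relation, reference)
    ]
```

Here `candidates` would hold the answers to a non-temporal sub-question together with their validity intervals, and `reference` the interval obtained from the temporal constraint; both are placeholders for whatever the underlying KB-QA engine returns.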