14 research outputs found

    Performance Tags - Who's running the show?

    We describe a pilot study that examines the prevalence and characteristics of performance tags on several sites. We identify post-coordination of tags as a useful step in studying this phenomenon, and in other approaches that leverage tags through text and/or sentiment analysis. We demonstrate an approach to automating this process that post-coordinates (segments) terms by means of a probabilistic model based on Markov chains. The effectiveness of this approach to parsing is evaluated against the wide range of constructions visible on various services. Several candidate approaches for the later stages of automated classification are identified.
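    As a rough illustration of segmenting a run-together tag with a Markov-chain model, the sketch below picks the most probable split under a first-order (bigram) chain over words. It is not the authors' model: the vocabulary, the probabilities, the `BACKOFF` constant, and the function names are all invented for this example.

```python
import math

START = "<s>"
BACKOFF = 0.4  # assumed penalty when no bigram is known

# Hypothetical toy probabilities, invented for illustration only.
UNIGRAM = {"to": 0.2, "read": 0.2, "toread": 0.01, "must": 0.1, "see": 0.1}
BIGRAM = {("to", "read"): 0.6, ("must", "see"): 0.5}


def transition(prev, word):
    """P(word | prev): use the bigram if known, else a backed-off unigram."""
    if (prev, word) in BIGRAM:
        return BIGRAM[(prev, word)]
    return BACKOFF * UNIGRAM[word]


def segment(tag, max_len=10):
    """Return the most probable word sequence whose concatenation is `tag`."""
    # best[i] maps the last word of a split of tag[:i] to (log-prob, backpointer).
    best = [dict() for _ in range(len(tag) + 1)]
    best[0][START] = (0.0, None)
    for end in range(1, len(tag) + 1):
        for start in range(max(0, end - max_len), end):
            word = tag[start:end]
            if word not in UNIGRAM:
                continue
            for prev, (score, _) in best[start].items():
                cand = score + math.log(transition(prev, word))
                if word not in best[end] or cand > best[end][word][0]:
                    best[end][word] = (cand, (start, prev))
    if not best[len(tag)]:
        return [tag]  # no split covers the tag: leave it unsegmented
    # Follow backpointers from the best final word to the start symbol.
    pos = len(tag)
    word = max(best[pos], key=lambda w: best[pos][w][0])
    words = []
    while word != START:
        words.append(word)
        _, (pos, word) = best[pos][word]
    return words[::-1]


if __name__ == "__main__":
    for tag in ("toread", "mustsee", "xyz"):
        print(tag, "->", segment(tag))
```

    A tag with no covering split is returned whole, which keeps unknown constructions visible for later inspection instead of forcing a bad segmentation.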

    SMAPH: A Piggyback Approach for Entity-Linking in Web Queries

    We study the problem of linking the terms of a web-search query to a semantic representation given by the set of entities (a.k.a. concepts) mentioned in it. We introduce SMAPH, a system that performs this task using the information coming from a web search engine, an approach we call “piggybacking.” We employ search engines to alleviate the noise and irregularities that characterize the language of queries. Snippets returned as search results also provide a context for the query that makes it easier to disambiguate the meaning of the query. From the search results, SMAPH builds a set of candidate entities with high coverage. This set is filtered by linking back the candidate entities to the terms occurring in the input query, ensuring high precision. A greedy disambiguation algorithm performs this filtering; it maximizes the coherence of the solution by iteratively discovering the pertinent entities mentioned in the query. We propose three versions of SMAPH that outperform state-of-the-art solutions on the known benchmarks and on the GERDAQ dataset, a novel dataset that we have built specifically for this problem via crowd-sourcing and that we make publicly available.

    A New Approach to Query Segmentation for Relevance Ranking in Web Search

    In this paper, we try to determine how best to improve state-of-the-art methods for relevance ranking in web search by query segmentation. Query segmentation is meant to separate the input query into segments, typically natural language phrases. We propose employing the re-ranking approach in query segmentation, which first employs a generative model to create the top k candidates and then employs a discriminative model to re-rank the candidates to obtain the final segmentation result. The method has been widely utilized for structure prediction in natural language processing, but has not been applied to query segmentation, as far as we know. Furthermore, we propose a new method for using the results of query segmentation in relevance ranking, which takes both the original query words and the segmented query phrases as units of query representation. We investigate whether our method can improve three relevance models, namely n-gram BM25, key n-gram model and term dependency model, within the framework of learning to rank. Our experimental results on large scale web search datasets show that our method can indeed significantly improve relevance ranking in all three cases.
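    The two-stage idea described above (a generative model proposes the top k segmentations, a discriminative model re-ranks them) can be sketched as follows. This is a minimal illustration under invented assumptions, not the paper's models: the phrase table `PHRASE_PROB`, the entity list `KNOWN_PHRASES`, the feature set, and the weights are all hypothetical.

```python
import math

# Hypothetical phrase probabilities for the generative stage (invented numbers).
PHRASE_PROB = {
    "new york times": 0.2, "new york": 0.05, "times square": 0.04,
    "york times": 0.001, "new": 0.02, "york": 0.005,
    "times": 0.02, "square": 0.02,
}
UNSEEN = 1e-6  # assumed probability for phrases missing from the table

# Hypothetical dictionary of known phrases used as a discriminative feature.
KNOWN_PHRASES = {"new york", "times square", "new york times"}
WEIGHTS = (1.0, 2.0)  # assumed weights: (generative score, known-phrase ratio)


def all_segmentations(words):
    """Yield every way to cut the word list into contiguous segments."""
    n = len(words)
    for mask in range(1 << (n - 1)):
        segments, current = [], [words[0]]
        for i in range(1, n):
            if (mask >> (i - 1)) & 1:  # bit set: cut before word i
                segments.append(" ".join(current))
                current = [words[i]]
            else:
                current.append(words[i])
        segments.append(" ".join(current))
        yield tuple(segments)


def generative_score(segmentation):
    """Log-probability of a segmentation as a product of phrase probabilities."""
    return sum(math.log(PHRASE_PROB.get(s, UNSEEN)) for s in segmentation)


def top_k(words, k):
    """Stage 1: keep the k candidates the generative model likes best."""
    return sorted(all_segmentations(words), key=generative_score, reverse=True)[:k]


def rerank(candidates):
    """Stage 2: pick the candidate with the highest linear feature score."""
    def score(segmentation):
        known = sum(s in KNOWN_PHRASES for s in segmentation) / len(segmentation)
        features = (generative_score(segmentation), known)
        return sum(w * f for w, f in zip(WEIGHTS, features))
    return max(candidates, key=score)


if __name__ == "__main__":
    words = "new york times square".split()
    candidates = top_k(words, k=3)
    print("generative best:", candidates[0])
    print("after re-ranking:", rerank(candidates))
```

    In this toy setup the generative stage alone prefers "new york times | square", and the extra known-phrase feature moves "new york | times square" to the top, which is the kind of correction a re-ranker is meant to provide.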

    A Comparison of Retrieval Models using Term Dependencies


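    Free-text search with the FAST search engine integrated into the PostgreSQL relational database (original title: Fritt søk med FAST søkemotor integrert i PostgreSQL relasjonsdatabase)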

    The focus of this thesis is the relationship between databases and information retrieval systems (IRS). As background, the first part gives a general presentation of databases and information retrieval systems, along with some examples of existing efforts to combine the two. While these examples typically extend either a database system or an IRS to obtain multi-functionality, we have attempted to bridge the two systems. Our prototype integrates FDS (Fast Data Search) into the PostgreSQL database management system as a new index access method. FDS is a powerful and scalable commercial enterprise search platform using a typical search engine query language. PostgreSQL, being open source and a general basis for research, lends itself well to customization. The new index access method provides the database with powerful free-text capabilities while retaining the power of the relational model for structured data. Preliminary results, including a simple performance test, verify the feasibility of the integration and demonstrate the scalability of the prototype. Storage, indexing, updating and search functions are implemented, but ACID properties could not be guaranteed, because the external indexing system offers no such guarantee. I also present a prototype for automatic extraction of related structured data in the relational database to XML. Combining these two prototypes, by allowing the extracted information to be searched using the full-text index, makes it possible to search the database without knowledge of the underlying database schema. Finally, I discuss potential extensions of our implementation (indexing data other than text, multi-column indexing, and moving complex evaluation from PostgreSQL to FDS) and suggest how these could be done. The thesis is written in Norwegian.

    Semantic Interpretation of User Queries for Question Answering on Interlinked Data

    The Web of Data contains a wealth of knowledge belonging to a large number of domains. Retrieving data from such interlinked knowledge bases is a challenge. By taking the structure of data into account, the upcoming generation of search engines is expected to move toward question answering systems, which directly answer user questions. But developing a question answering system over these interlinked data sources is still challenging because of two inherent characteristics. First, different datasets employ heterogeneous schemas and each one may only contain a part of the answer for a certain question. Second, constructing a federated formal query across different datasets requires exploiting links between these datasets on both the schema and instance levels. In this respect, several challenges arise, such as resource disambiguation, vocabulary mismatch, inference, and link traversal. In this dissertation, we address these challenges in order to build a question answering system for Linked Data. We present our question answering system Sina, which transforms user-supplied queries (i.e. either natural language queries or keyword queries) into conjunctive SPARQL queries over a set of interlinked data sources. The contributions of this work are as follows: 1. A novel approach for determining the most suitable resources for a user-supplied query from different datasets (disambiguation approach). We employed a Hidden Markov Model, whose parameters were bootstrapped with different distribution functions. 2. A novel method for constructing federated formal queries using the disambiguated resources and leveraging the linking structure of the underlying datasets. This approach essentially relies on a combination of domain and range inference as well as a link traversal method for constructing a connected graph, which ultimately renders a corresponding SPARQL query. 3. Regarding the problem of vocabulary mismatch, our contribution is divided into two parts. First, we introduce a number of new query expansion features based on semantic and linguistic inferencing over Linked Data. We evaluate the effectiveness of each feature individually as well as their combinations, employing Support Vector Machines and Decision Trees. Second, we propose a novel method for automatic query expansion, which employs a Hidden Markov Model to obtain the optimal tuples of derived words. 4. We provide two benchmarks for two different tasks to the community of question answering systems. The first one is used for the task of question answering on interlinked datasets (i.e. federated queries over Linked Data). The second one is used for the vocabulary mismatch task. We evaluate the accuracy of our approach using measures like mean reciprocal rank, precision, recall, and F-measure on three interlinked life-science datasets as well as DBpedia. The results of our accuracy evaluation demonstrate the effectiveness of our approach. Moreover, we study the runtime of our approach in its sequential as well as parallel implementations and draw conclusions on the scalability of our approach on Linked Data.
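    The evaluation measures named above have standard definitions, which the short sketch below spells out. It shows only the textbook formulas; the function names and the sample data are made up for illustration and say nothing about how the dissertation's benchmarks were actually scored.

```python
def reciprocal_rank(ranked, relevant):
    """1/rank of the first relevant item in a ranked list, or 0.0 if none."""
    for rank, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (ranked_list, relevant_set) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, rel) for r, rel in runs) / len(runs)


def precision_recall_f1(returned, relevant):
    """Set-based precision, recall, and balanced F-measure (F1)."""
    returned, relevant = set(returned), set(relevant)
    true_positives = len(returned & relevant)
    precision = true_positives / len(returned) if returned else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Hypothetical answers for three questions (invented data).
    runs = [(["x", "a"], {"a"}), (["b"], {"b"}), (["z"], {"q"})]
    print("MRR:", mean_reciprocal_rank(runs))
    print("P/R/F1:", precision_recall_f1(["a", "b", "c", "d"], ["a", "b", "e"]))
```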

    Query understanding: applying machine learning algorithms for named entity recognition

    The term frequency-inverse document frequency (tf-idf) paradigm, which is often used in general search engines for ranking the relevance of documents in a corpus to a given user query, is based on the frequency of occurrence of the search key terms in the corpus. These search terms are mostly expressed in natural language, thus requiring natural language processing methods. But for domain-specific search engines like a software download portal, search terms are usually expressed in forms that do not conform to the grammatical rules of natural language and, as such, cannot be tackled using natural language processing techniques. This thesis proposes named entity recognition using supervised machine learning methods as a means of understanding queries for such domain-specific search engines. In particular, our main objective is to apply machine learning techniques to automatically learn to recognize and classify search terms according to the predefined named entity classes they belong to. By so doing, we are able to understand user intents and rank result sets according to their relevance to the named entities detected in the search query. Our approach involved three machine learning algorithms: Hidden Markov Models (HMM), Conditional Random Fields (CRF) and Neural Networks (NN). We followed the supervised learning approach, training these algorithms on labeled data from sample queries and then evaluating their performance on new, unseen queries. Our empirical results showed precisions of 93% for NN, which was based on distributed representations proposed by Yoshua Bengio, 85.60% for CRF and 82.84% for HMM. CRF's precision improved by about 2 percentage points, to 87.40%, after we generated gazetteer-based and morphological features. From our results, we were able to show that machine learning methods for named entity recognition are useful for understanding query intents.
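    To make "gazetteer-based and morphological features" concrete, here is a small sketch of per-token feature extraction of the kind commonly fed to a CRF tagger. The thesis does not list its exact features, so everything here is an assumption: the `GAZETTEER` entries, the feature names, and the version-number pattern are hypothetical.

```python
import re

# Hypothetical gazetteer of software names; a real one would come from
# the portal's own catalogue.
GAZETTEER = {"photoshop", "firefox", "winrar"}

# Assumed pattern for version-like tokens such as "3.6" or "v10".
VERSION_RE = re.compile(r"v?\d+(\.\d+)*")


def token_features(tokens, i):
    """Build a feature dict for tokens[i], using its neighbours as context."""
    token = tokens[i]
    lower = token.lower()
    return {
        "lower": lower,
        "in_gazetteer": lower in GAZETTEER,            # gazetteer feature
        "has_digit": any(ch.isdigit() for ch in token),  # morphological
        "looks_like_version": bool(VERSION_RE.fullmatch(lower)),
        "suffix3": lower[-3:],                           # morphological
        "prev": tokens[i - 1].lower() if i > 0 else "<s>",
        "next": tokens[i + 1].lower() if i + 1 < len(tokens) else "</s>",
    }


def query_features(query):
    """Return one feature dict per whitespace-separated token of the query."""
    tokens = query.split()
    return [token_features(tokens, i) for i in range(len(tokens))]


if __name__ == "__main__":
    for features in query_features("Firefox 3.6 download"):
        print(features)
```

    Feature dicts in this shape can be passed to most sequence-labelling toolkits; which toolkit the thesis used is not stated here.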