    CONTEXT-BASED AUTOSUGGEST ON GRAPH DATA

    Autosuggest is an important feature in any search application. Currently, most applications suggest only a single term, based on how frequently that term appears in the indexed documents or how often it is searched for. These approaches might not provide the most relevant suggestions, because users often enter a series of related query terms to answer a question they have in mind. In this project, we implemented the Smart Solr Suggester plugin using a context-based approach that takes into account the relationships among search keywords. In particular, we used the keywords that the user has chosen so far in the search text box as the context to autosuggest their next incomplete keyword. This context-based approach uses the relationships between entities in the graph data that the user is searching and would therefore provide more meaningful suggestions.
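
The context-based idea in the abstract above can be illustrated with a minimal sketch. This is not the Smart Solr Suggester plugin and uses no Solr API; the graph contents, the function names `build_graph` and `suggest`, and the scoring rule (count how many already-chosen keywords are direct neighbours of a candidate) are all assumptions made for illustration.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected adjacency map from (entity, entity) pairs."""
    graph = defaultdict(set)
    for a, b in edges:
        graph[a].add(b)
        graph[b].add(a)
    return graph


def suggest(graph, context, prefix, limit=5):
    """Suggest completions for `prefix`, ranked by links to the context terms.

    The score of a candidate is the number of context keywords it is
    directly connected to in the graph (an assumed, simplistic measure).
    Ties are broken alphabetically.
    """
    prefix = prefix.lower()
    scored = []
    for entity in graph:
        if entity in context or not entity.lower().startswith(prefix):
            continue
        score = sum(1 for term in context if term in graph[entity])
        scored.append((-score, entity))
    scored.sort()
    return [entity for _, entity in scored[:limit]]


# Hypothetical toy graph of related entities.
EDGES = [
    ("java", "spring"),
    ("java", "solr"),
    ("lucene", "solr"),
    ("java", "lucene"),
    ("python", "scikit-learn"),
    ("python", "spark"),
    ("java", "spark"),
]
GRAPH = build_graph(EDGES)

if __name__ == "__main__":
    # The same prefix "s" yields different rankings under different contexts.
    print(suggest(GRAPH, ["java", "lucene"], "s"))
    print(suggest(GRAPH, ["python"], "s"))
```

A frequency-only suggester would return the same list for the prefix "s" regardless of what was typed before; here the ranking changes with the chosen keywords.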

    Toward Entity-Aware Search

    As the Web has evolved into a data-rich repository, with the standard "page view," current search engines are becoming increasingly inadequate for a wide range of query tasks. While we often search for various data "entities" (e.g., phone number, paper PDF, date), today's engines only take us indirectly to pages. In my Ph.D. study, we focus on a novel type of Web search that is aware of data entities inside pages, a significant departure from traditional document retrieval. We study the various essential aspects of supporting entity-aware Web search. To begin with, we tackle the core challenge of ranking entities by distilling its underlying conceptual model, the Impression Model, and developing a probabilistic ranking framework, EntityRank, that seamlessly integrates both local and global information in ranking. We also report a prototype system built to show the initial promise of the proposal. Then, we aim at distilling and abstracting the essential computation requirements of entity search. From the dual views of reasoning (entity as input and entity as output), we propose a dual-inversion framework, with two indexing and partition schemes, towards efficient and scalable query processing. Further, to recognize more entity instances, we study the problem of entity synonym discovery through mining query log data. The results obtained so far show clear promise for entity-aware search in its usefulness, effectiveness, efficiency and scalability.

    Predicate-based indexing for desktop search

    Google and other products have revolutionized the way we search for information. There are, however, still a number of research challenges. One challenge that arises specifically in desktop search is to exploit the structure and semantics of documents, as defined by the application program that generated the data (e.g., Word, Excel, or Outlook). The current generation of search products does not understand these structures and therefore often returns wrong results. This paper shows how today's search technology can be extended to take the specific semantics of certain structures into account. The key idea is to extend inverted file index structures with predicates that encode the circumstances under which certain keywords of a document become visible to a user. This paper provides a framework for expressing the semantics of structures in documents, together with algorithms to construct enhanced, predicate-based indexes. Furthermore, this paper shows how keyword and phrase queries can be processed efficiently on such enhanced indexes. The proposed approach is shown to have superior retrieval performance with regard to both recall and precision, with tolerable space and query running time overhead.
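
The key idea of the abstract above, postings annotated with a visibility predicate, can be sketched as follows. This is a toy illustration under stated assumptions, not the paper's framework or index construction algorithm: the class `PredicateIndex`, the `view` dictionary, and the `show_deleted` flag (standing in for something like tracked changes in a word processor) are all hypothetical.

```python
from collections import defaultdict


def always_visible(view):
    """Default predicate: the keyword is visible in every view."""
    return True


class PredicateIndex:
    """Inverted index whose postings carry a visibility predicate."""

    def __init__(self):
        # keyword -> list of (doc_id, predicate) postings
        self._postings = defaultdict(list)

    def add(self, doc_id, keyword, predicate=always_visible):
        """Index `keyword` for `doc_id`, visible only when `predicate(view)` holds."""
        self._postings[keyword.lower()].append((doc_id, predicate))

    def search(self, keyword, view):
        """Return the sorted ids of documents in which `keyword` is visible under `view`."""
        hits = {
            doc_id
            for doc_id, predicate in self._postings.get(keyword.lower(), [])
            if predicate(view)
        }
        return sorted(hits)


# Hypothetical example: "draft" appears only in deleted (tracked-change) text.
INDEX = PredicateIndex()
INDEX.add("report.doc", "budget")
INDEX.add("report.doc", "draft", lambda view: view.get("show_deleted", False))

if __name__ == "__main__":
    print(INDEX.search("draft", {"show_deleted": False}))
    print(INDEX.search("draft", {"show_deleted": True}))
```

A plain inverted index would return `report.doc` for "draft" in both cases, which is the kind of wrong result the abstract attributes to search products that ignore document structure.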

    Content-Aware DataGuides for Indexing Large Collections of XML Documents

    XML is well-suited for modelling structured data with textual content. However, most indexing approaches perform structure and content matching independently, combining the retrieved path and keyword occurrences in a third step. This paper shows that retrieval in XML documents can be accelerated significantly by processing text and structure simultaneously during all retrieval phases. To this end, the Content-Aware DataGuide (CADG) enhances the well-known DataGuide with (1) simultaneous keyword and path matching and (2) a precomputed content/structure join. Extensive experiments prove the CADG to be 50-90% faster than the DataGuide for various sorts of query and document, including difficult cases such as poorly structured queries and recursive document paths. A new query classification scheme identifies precise query characteristics with a predominant influence on the performance of the individual indices. The experiments show that the CADG is applicable to many real-world applications, in particular large collections of heterogeneously structured XML documents.

    What Users Ask a Search Engine: Analyzing One Billion Russian Question Queries

    We analyze the question queries submitted to a large commercial web search engine to gain insight into what people ask and to better tailor the search results to the users’ needs. Based on a dataset of about one billion question queries submitted during the year 2012, we investigate askers’ querying behavior with the support of automatic query categorization. While the importance of question queries is likely to increase, at present they only make up 3–4% of the total search traffic. Since questions are such a small part of the query stream and are more likely to be unique than shorter queries, clickthrough information is typically rather sparse. Thus, query categorization methods based on the categories of clicked web documents do not work well for questions. As an alternative, we propose a robust question query classification method that uses the labeled questions from a large community question answering (CQA) platform as a training set. The resulting classifier is then transferred to the web search questions. Even though questions on CQA platforms tend to differ from web search questions, our categorization method proves competitive with strong baselines with respect to classification accuracy. To show the scalability of our proposed method, we apply the classifiers to about one billion question queries and discuss the trade-offs between performance and accuracy that different classification models offer. Our findings reveal what people ask a search engine and how this contrasts with behavior on a CQA platform.

    CAMEL: Concept Annotated iMagE Libraries

    Copyright 2001 Society of Photo-Optical Instrumentation Engineers. One print or electronic copy may be made for personal use only. Systematic electronic or print reproduction and distribution, duplication of any material in this paper for a fee or for commercial purposes, or modification of the content of the paper are prohibited. http://dx.doi.org/10.1117/12.410975
    The problem of content-based image searching has received considerable attention in the last few years. Thousands of images are now available on the internet, and many important applications require searching of images in domains such as E-commerce, medical imaging, weather prediction, satellite imagery, and so on. Yet content-based image querying is still largely unestablished as a mainstream field, nor is it widely used by search engines. We believe that two of the major reasons for this poor acceptance are poor retrieval quality and poor usability. In this paper, we introduce the CAMEL system (an acronym for Concept Annotated iMagE Libraries) as an effort to address both of these problems. The CAMEL system provides an easy-to-use, yet powerful, text-only query interface, which allows users to search for images based on visual concepts, identified by specifying relevant keywords. Conceptually, CAMEL annotates images with the visual concepts that are relevant to them. In practice, CAMEL defines visual concepts by looking at sample images off-line and extracting their relevant visual features. Once defined, such visual concepts can be used to search for relevant images on the fly, using content-based search methods. The visual concepts are stored in a Concept Library and are represented by an associated set of wavelet features, which in our implementation were extracted by the WALRUS image querying system. Even though the CAMEL framework applies independently of the underlying query engine, for our prototype we have chosen WALRUS as a back-end, due to its ability to extract and query with image region features. CAMEL improves retrieval quality because it allows experts to build very accurate representations of visual concepts that can be used even by novice users. At the same time, CAMEL improves usability by supporting the familiar text-only interface currently used by most search engines on the web. Both improvements represent a departure from traditional approaches to improving image query systems: instead of focusing on query execution, we emphasize query specification by allowing simpler and yet more precise query specification.

    Indexing and Searching Document Collections using Lucene

    The amount of information available to a person is growing day by day; hence, retrieving the correct information in a timely manner plays a very important role. This thesis discusses indexing document collections and fetching the right information with the help of a database. The primary role of the database is to store additional information that may or may not be available in the document collection itself. The indexing of the document collection is performed by Lucene, while the search application is strongly integrated with a database. In this thesis, a highly efficient, scalable, customized search tool is built using Lucene. The search tool is capable of indexing and searching databases, PDF documents, Word documents and text files.

    XMach-1: A Benchmark for XML Data Management

    We propose a scalable multi-user benchmark called XMach-1 (XML Data Management benchmark) for evaluating the performance of XML data management systems. It is based on a web application and considers different types of XML data, in particular text documents, schema-less data and structured data. We specify the structure of the benchmark database and the generation of its contents. Furthermore, we define a mix of XML queries and update operations for which system performance is determined. The primary performance metric, Xqps, measures the query throughput of a system under response time constraints. We will use XMach-1 to evaluate both native XML data management systems and XML-enabled relational DBMS.

    Closest Keyword Search in Dynamic Multidimensional Data Sets

    Attaching text to database records enables new applications and functionality based on keyword search. The application considered here concerns data points that are tagged with keywords and located in a geographical area. For these datasets, our main goal is to locate groups of points that satisfy search queries. We propose a method called Projection and Multi Scale Hashing, which combines random projection and hashing to provide high scalability and efficiency. Both an exact and an approximate version of the algorithm are presented. Experimental and analytical studies show that multi-dimensional hashing offers up to a 65-fold improvement in overall efficiency. Objects are commonly represented as points in a multi-dimensional feature space. For example, images are described using feature vectors composed of colour components, and a textual description (such as tags or keywords) is typically associated with each image.
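
The combination of random projection and hashing mentioned in the abstract above can be sketched roughly as below. This is a simplified illustration, not the authors' algorithm: the bucket scheme, the function names, the choice of Gaussian random directions, and the rule of trying finer bucket widths first are all assumptions. Points are projected onto a few random unit vectors, each projection is quantized into a bucket of a given width, and a bucket whose points jointly carry every query keyword is reported as a candidate group.

```python
import math
import random
from collections import defaultdict


def random_unit_vectors(dim, count, seed=0):
    """Draw `count` random directions in `dim` dimensions, normalized to length 1."""
    rng = random.Random(seed)
    vectors = []
    for _ in range(count):
        v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(x * x for x in v))
        vectors.append([x / norm for x in v])
    return vectors


def bucket_key(coords, vectors, width):
    """Hash a point: quantize its projection onto each random vector."""
    return tuple(
        math.floor(sum(c * x for c, x in zip(coords, v)) / width)
        for v in vectors
    )


def candidate_groups(points, query_keywords, vectors, widths):
    """Find buckets whose points together cover all query keywords.

    `points` is a list of (coords, keyword_set). Scales are tried from the
    finest width to the coarsest; the first scale with a hit is returned
    as (width, list_of_index_groups). Returns (None, []) if nothing matches.
    """
    query = set(query_keywords)
    for width in sorted(widths):
        buckets = defaultdict(list)
        for idx, (coords, keywords) in enumerate(points):
            if keywords & query:  # ignore points carrying no query keyword
                buckets[bucket_key(coords, vectors, width)].append(idx)
        groups = [
            ids for ids in buckets.values()
            if query <= set().union(*(points[i][1] for i in ids))
        ]
        if groups:
            return width, groups
    return None, []


if __name__ == "__main__":
    # Hypothetical keyword-tagged 2-D points.
    pts = [((0.0, 0.0), {"cafe"}), ((0.0, 0.0), {"wifi"}), ((100.0, 100.0), {"cafe"})]
    vecs = random_unit_vectors(dim=2, count=3, seed=1)
    print(candidate_groups(pts, ["cafe", "wifi"], vecs, widths=[1.0, 4.0]))
```

Because nearby points can still straddle a bucket boundary, a scheme like this returns candidates only; an exact method would need additional verification or overlapping buckets.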