1,202 research outputs found

    Semantic Interpretation of User Queries for Question Answering on Interlinked Data

    The Web of Data contains a wealth of knowledge spanning a large number of domains, yet retrieving data from these interlinked knowledge bases remains difficult. By taking the structure of the data into account, the upcoming generation of search engines is expected to move toward question answering systems, which answer user questions directly. Developing a question answering system over these interlinked data sources is still challenging because of two inherent characteristics. First, different datasets employ heterogeneous schemas, and each one may contain only part of the answer to a given question. Second, constructing a federated formal query across different datasets requires exploiting links between these datasets on both the schema and instance levels. This raises several challenges, such as resource disambiguation, vocabulary mismatch, inference, and link traversal. In this dissertation, we address these challenges in order to build a question answering system for Linked Data. We present our question answering system Sina, which transforms user-supplied queries (i.e., either natural language queries or keyword queries) into conjunctive SPARQL queries over a set of interlinked data sources. The contributions of this work are as follows: 1. A novel approach for determining the most suitable resources for a user-supplied query from different datasets (disambiguation approach). We employed a Hidden Markov Model, whose parameters were bootstrapped with different distribution functions. 2. A novel method for constructing federated formal queries using the disambiguated resources and leveraging the linking structure of the underlying datasets. This approach essentially relies on a combination of domain and range inference as well as a link traversal method for constructing a connected graph, which ultimately renders a corresponding SPARQL query. 3. Regarding the problem of vocabulary mismatch, our contribution is divided into two parts. First, we introduce a number of new query expansion features based on semantic and linguistic inferencing over Linked Data. We evaluate the effectiveness of each feature individually as well as their combinations, employing Support Vector Machines and Decision Trees. Second, we propose a novel method for automatic query expansion, which employs a Hidden Markov Model to obtain the optimal tuples of derived words. 4. We provide two benchmarks for two different tasks to the community of question answering systems. The first one is used for the task of question answering on interlinked datasets (i.e., federated queries over Linked Data). The second one is used for the vocabulary mismatch task. We evaluate the accuracy of our approach using measures such as mean reciprocal rank, precision, recall, and F-measure on three interlinked life-science datasets as well as DBpedia. The results of our accuracy evaluation demonstrate the effectiveness of our approach. Moreover, we study the runtime of our approach in its sequential as well as parallel implementations and draw conclusions on the scalability of our approach on Linked Data.
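
The abstract above names mean reciprocal rank, precision, recall, and F-measure as its evaluation measures. As a rough illustration of how such measures are commonly computed (not the dissertation's own evaluation code; the function names and the answer-set representation are assumptions made for this sketch):

```python
from typing import Sequence, Set, Tuple


def reciprocal_rank(ranked: Sequence[str], relevant: Set[str]) -> float:
    """Return 1/rank of the first relevant item, or 0.0 if none appears."""
    for position, item in enumerate(ranked, start=1):
        if item in relevant:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: Sequence[Sequence[str]],
                         gold: Sequence[Set[str]]) -> float:
    """Average the reciprocal rank over all queries (hypothetical input layout)."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in zip(runs, gold)) / len(runs)


def precision_recall_f1(returned: Set[str],
                        relevant: Set[str]) -> Tuple[float, float, float]:
    """Set-based precision, recall, and balanced F-measure for one query."""
    hits = len(returned & relevant)
    precision = hits / len(returned) if returned else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1
```

Mean reciprocal rank rewards placing a correct answer early in a ranked list, while precision, recall, and F-measure judge the returned answer set as a whole.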

    A probabilistic approach for mapping free-text queries to complex web forms

    Web applications with complex interfaces consisting of multiple input fields should understand free-text queries. We propose a probabilistic approach to map parts of a free-text query to the fields of a complex web form. Our method uses token models rather than only static dictionaries to create this mapping, offering greater flexibility and requiring less domain knowledge than existing systems. We evaluate different implementations of our mapping model and show that our system effectively maps free-text queries without using a dictionary. If a dictionary is available, performance increases and is significantly better than that of a rule-based baseline.

    Content-based Video Retrieval by Integrating Spatio-Temporal and Stochastic Recognition of Events

    As the amount of publicly available video data grows, the need to query this data efficiently becomes significant. Consequently, content-based retrieval of video data turns out to be a challenging and important problem. We address the specific aspect of inferring semantics automatically from raw video data. In particular, we introduce a new video data model that supports the integrated use of two different approaches for mapping low-level features to high-level concepts. First, the model is extended with a rule-based approach that supports spatio-temporal formalization of high-level concepts, and then with a stochastic approach. Furthermore, results on real tennis video data are presented, demonstrating the validity of both approaches, as well as the advantages of their integrated use.

    Document Indexing Strategies in Big Data A Survey

    Over the past few years, Internet activity has grown significantly, and individuals and organizations were unaware of the resulting data explosion. Because of the increasing quantity and diversity of digital documents available to end users, mechanisms for their effective and efficient retrieval are given the highest importance. One crucial aspect of such a mechanism is indexing, which allows documents to be located quickly. The problem is that users want to retrieve documents on the basis of context, and individual words provide unreliable evidence about the contextual topic or meaning of a document. Hence, the available solutions cannot meet the processing needs of growing heterogeneous data, which results in inefficient information retrieval and poor search query results. Indexing strategies that can support this need are required. Various indexing strategies are used to solve Big Data management issues, and they can also serve as a base for the design of more efficient indexing strategies. The aim is to explore document indexing strategies for Big Data manageability. Existing approaches such as Latent Semantic Indexing, inverted indexing, semantic indexing, and the Vector Space Model each have their own challenges: they may demand high computational performance, consume more memory, require longer data processing time, limit the search space, fail to produce the exact answer, return wrong answers because of synonymy and polysemy, or depend on a formal ontology. This paper describes and compares the various indexing techniques and presents their characteristics and challenges.
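
Of the strategies listed in the survey abstract above, inverted indexing is the simplest to illustrate. The following is a minimal sketch, assuming whitespace tokenization and an in-memory dictionary; the function names and sample documents are hypothetical and not taken from the survey:

```python
from collections import defaultdict
from typing import Dict, List, Set


def build_inverted_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased term to the set of document ids that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index: Dict[str, Set[int]], query: str) -> List[int]:
    """Return the sorted ids of documents containing every query term."""
    terms = query.lower().split()
    if not terms:
        return []
    postings = [index.get(term, set()) for term in terms]
    return sorted(set.intersection(*postings))


# Hypothetical sample documents for illustration only.
sample_docs = {
    1: "big data indexing",
    2: "semantic indexing of documents",
    3: "big documents",
}
sample_index = build_inverted_index(sample_docs)
```

Because the lookup matches exact terms only, a query using a synonym of an indexed word finds nothing, which is the synonymy weakness the abstract attributes to word-based approaches.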

    Interpreting XML keyword query using hidden Markov model

    Keyword search on XML databases has attracted considerable research interest. As XML documents are very different from flat documents, effective search of XML documents needs special consideration. The traditional bag-of-words model does not take the roles of keywords and the relationships between keywords into consideration, and thus is not suited for XML keyword search. In this paper, we present a novel model, called semi-structured keyword query (SSQ), which understands a keyword query in a different way: a keyword query is composed of several query units, where each unit represents a query condition. To interpret a keyword query under this model, we take two steps. First, we propose a probabilistic approach based on a Hidden Markov Model for computing the best mapping of the query keywords into the database terms, i.e., elements, attributes, and values. Second, we generate SSQs based on the mapping. Experimental results verify the effectiveness of our methods.
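
The abstract above describes using a Hidden Markov Model to find the best mapping of query keywords onto database terms. The usual way to decode the most likely state sequence of an HMM is the Viterbi algorithm, sketched below. The states, keywords, and all probabilities are toy values invented for illustration; the paper's actual model and parameters are not given in the abstract.

```python
from typing import Dict, List, Optional, Tuple


def viterbi(observations: List[str],
            states: List[str],
            start_p: Dict[str, float],
            trans_p: Dict[str, Dict[str, float]],
            emit_p: Dict[str, Dict[str, float]]) -> List[str]:
    """Return the most probable hidden-state sequence for the observations."""
    if not observations:
        return []
    # best[t][s] = (probability of the best path ending in s at step t, previous state)
    best: List[Dict[str, Tuple[float, Optional[str]]]] = [
        {s: (start_p[s] * emit_p[s].get(observations[0], 0.0), None) for s in states}
    ]
    for obs in observations[1:]:
        column: Dict[str, Tuple[float, Optional[str]]] = {}
        for s in states:
            column[s] = max(
                (best[-1][p][0] * trans_p[p][s] * emit_p[s].get(obs, 0.0), p)
                for p in states
            )
        best.append(column)
    # Backtrack from the most probable final state.
    last = max(states, key=lambda s: best[-1][s][0])
    path = [last]
    for column in reversed(best[1:]):
        path.append(column[path[-1]][1])
    path.reverse()
    return path


# Toy model (hypothetical): each keyword is either an element name or a value.
toy_states = ["element", "value"]
toy_start = {"element": 0.7, "value": 0.3}
toy_trans = {
    "element": {"element": 0.2, "value": 0.8},
    "value": {"element": 0.6, "value": 0.4},
}
toy_emit = {
    "element": {"author": 0.6, "title": 0.4},
    "value": {"smith": 0.5, "xml": 0.3, "author": 0.2},
}
toy_mapping = viterbi(["author", "smith"], toy_states, toy_start, toy_trans, toy_emit)
```

With these toy numbers, the keyword "author" is decoded as an element name and "smith" as a value, which loosely corresponds to one query unit of the form element = value.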

    Textpresso: An Ontology-Based Information Retrieval and Extraction System for Biological Literature

    We have developed Textpresso, a new text-mining system for scientific literature whose capabilities go far beyond those of a simple keyword search engine. Textpresso's two major elements are a collection of the full text of scientific articles split into individual sentences, and the implementation of categories of terms for which a database of articles and individual sentences can be searched. The categories are classes of biological concepts (e.g., gene, allele, cell or cell group, phenotype, etc.) and classes that relate two objects (e.g., association, regulation, etc.) or describe one (e.g., biological process, etc.). Together they form a catalog of types of objects and concepts called an ontology. After this ontology is populated with terms, the whole corpus of articles and abstracts is marked up to identify terms of these categories. The current ontology comprises 33 categories of terms. A search engine enables the user to search for one or a combination of these tags and/or keywords within a sentence or document, and as the ontology allows word meaning to be queried, it is possible to formulate semantic queries. Full text access increases recall of biological data types from 45% to 95%. Extraction of particular biological facts, such as gene-gene interactions, can be accelerated significantly by ontologies, with Textpresso automatically performing nearly as well as expert curators to identify sentences; in searches for two uniquely named genes and an interaction term, the ontology confers a 3-fold increase of search efficiency. Textpresso currently focuses on Caenorhabditis elegans literature, with 3,800 full text articles and 16,000 abstracts. The lexicon of the ontology contains 14,500 entries, each of which includes all versions of a specific word or phrase, and it includes all categories of the Gene Ontology database. Textpresso is a useful curation tool, as well as a search engine for researchers, and can readily be extended to other organism-specific corpora of text. Textpresso can be accessed at http://www.textpresso.org or via WormBase at http://www.wormbase.org.

    Radio Oranje: Enhanced Access to a Historical Spoken Word Collection

    Access to historical audio collections is typically very restricted: content is often only available on physical (analog) media and the metadata is usually limited to keywords, giving access at the level of relatively large fragments, e.g., an entire tape. Many spoken word heritage collections are now being digitized, which allows the introduction of more advanced search technology. This paper presents an approach that supports online access and search for recordings of historical speeches. A demonstrator has been built, based on the so-called Radio Oranje collection, which contains radio speeches by the Dutch Queen Wilhelmina that were broadcast during World War II. The audio has been aligned with its original 1940s manual transcriptions to create a time-stamped index that enables the speeches to be searched at the word level. Results are presented together with related photos from an external database.
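
The word-level, time-stamped index described above can be pictured as a mapping from each transcribed word to the recordings and offsets where it is spoken. The sketch below assumes the alignment step has already produced (word, start time) pairs; the recording ids, words, and timings are hypothetical placeholders, not data from the Radio Oranje collection.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

AlignedWord = Tuple[str, float]  # (word, start time in seconds)
Hit = Tuple[str, float]          # (recording id, start time in seconds)


def build_time_index(recordings: Dict[str, Iterable[AlignedWord]]) -> Dict[str, List[Hit]]:
    """Index every aligned word by recording id and start time."""
    index: Dict[str, List[Hit]] = defaultdict(list)
    for recording_id, words in recordings.items():
        for word, start in words:
            index[word.lower()].append((recording_id, start))
    return index


def find_word(index: Dict[str, List[Hit]], word: str) -> List[Hit]:
    """Return all (recording id, start time) positions where the word occurs."""
    return index.get(word.lower(), [])


# Hypothetical aligned transcripts for illustration only.
sample_recordings = {
    "speech_a": [("people", 0.0), ("freedom", 2.4)],
    "speech_b": [("freedom", 1.1)],
}
sample_time_index = build_time_index(sample_recordings)
```

A player could then seek directly to each returned offset instead of requiring the listener to scan an entire tape-sized fragment.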

    Keyword Search in Relational Databases: Architecture, Approaches and Considerations

    This thesis presents the various solutions proposed in the literature for applying the keyword search paradigm to relational databases, and aims to outline a general architecture for defining and developing such systems. To this end, the solutions presented by the scientific community were analyzed with a focus on the individual components of the search pipeline. Finally, the experimental evaluation processes of these systems were analyzed.