759 research outputs found

    Light syntactically-based index pruning for information retrieval

    Get PDF
    Most index pruning techniques eliminate terms from an index on the basis of the contribution of those terms to the content of the documents. We present a novel syntactically-based index pruning technique, which uses exclusively shallow syntactic evidence to decide upon which terms to prune. This type of evidence is document-independent, and is based on the assumption that, in a general collection of documents, there exists an approximately proportional relation between the frequency and content of ‘blocks of parts of speech’ (<i>P</i><i>O</i><i>S</i> <i>blocks</i>) [5]. POS blocks are fixed-length sequences of nouns, verbs, and other parts of speech, extracted from a corpus. We remove from the index, terms that correspond to low-frequency POS blocks, using two different strategies: (i) considering that low-frequency POS blocks correspond to sequences of content-poor words, and (ii) considering that low-frequency POS blocks, which also contain ‘non content-bearing parts of speech’, such as prepositions for example, correspond to sequences of content-poor words. We experiment with two TREC test collections and two statistically different weighting models. Using full indices as our baseline, we show that syntactically-based index pruning overall enhances retrieval performance, in terms of both average and early precision, for light pruning levels, while also reducing the size of the index. Our novel low-cost technique performs at least similarly to other related work, even though it does not consider document-specific information, and as such it is more general

    Extending weighting models with a term quality measure

    Get PDF
    Weighting models use lexical statistics, such as term frequencies, to derive term weights, which are used to estimate the relevance of a document to a query. Apart from the removal of stopwords, there is no other consideration of the quality of words that are being ‘weighted’. It is often assumed that term frequency is a good indicator for a decision to be made as to how relevant a document is to a query. Our intuition is that raw term frequency could be enhanced to better discriminate between terms. To do so, we propose using non-lexical features to predict the ‘quality’ of words, before they are weighted for retrieval. Specifically, we show how parts of speech (e.g. nouns, verbs) can help estimate how informative a word generally is, regardless of its relevance to a query/document. Experimental results with two standard TREC collections show that integrating the proposed term quality to two established weighting models enhances retrieval performance, over a baseline that uses the original weighting models, at all times

    Part of Speech Based Term Weighting for Information Retrieval

    Full text link
    Automatic language processing tools typically assign to terms so-called weights corresponding to the contribution of terms to information content. Traditionally, term weights are computed from lexical statistics, e.g., term frequencies. We propose a new type of term weight that is computed from part of speech (POS) n-gram statistics. The proposed POS-based term weight represents how informative a term is in general, based on the POS contexts in which it generally occurs in language. We suggest five different computations of POS-based term weights by extending existing statistical approximations of term information measures. We apply these POS-based term weights to information retrieval, by integrating them into the model that matches documents to queries. Experiments with two TREC collections and 300 queries, using TF-IDF & BM25 as baselines, show that integrating our POS-based term weights to retrieval always leads to gains (up to +33.7% from the baseline). Additional experiments with a different retrieval model as baseline (Language Model with Dirichlet priors smoothing) and our best performing POS-based term weight, show retrieval gains always and consistently across the whole smoothing range of the baseline

    Towards a Linked Semantic Web: Precisely, Comprehensively and Scalably Linking Heterogeneous Data in the Semantic Web

    Get PDF
    The amount of Semantic Web data is growing rapidly today. Individual users, academic institutions and businesses have already published and are continuing to publish their data in Semantic Web standards, such as RDF and OWL. Due to the decentralized nature of the Semantic Web, the same real world entity may be described in various data sources with different ontologies and assigned syntactically distinct identifiers. Furthermore, data published by each individual publisher may not be complete. This situation makes it difficult for end users to consume the available Semantic Web data effectively. In order to facilitate data utilization and consumption in the Semantic Web, without compromising the freedom of people to publish their data, one critical problem is to appropriately interlink such heterogeneous data. This interlinking process is sometimes referred to as Entity Coreference, i.e., finding which identifiers refer to the same real world entity. In the Semantic Web, the owl:sameAs predicate is used to link two equivalent (coreferent) ontology instances. An important question is where these owl:sameAs links come from. Although manual interlinking is possible on small scales, when dealing with large-scale datasets (e.g., millions of ontology instances), automated linking becomes necessary. This dissertation summarizes contributions to several aspects of entity coreference research in the Semantic Web. First of all, by developing the EPWNG algorithm, we advance the performance of the state-of-the-art by 1% to 4%. EPWNG finds coreferent ontology instances from different data sources by comparing every pair of instances and focuses on achieving high precision and recall by appropriately collecting and utilizing instance context information domain-independently. We further propose a sampling and utility function based context pruning technique, which provides a runtime speedup factor of 30 to 75. Furthermore, we develop an on-the-fly candidate selection algorithm, P-EPWNG, that enables the coreference process to run 2 to 18 times faster than the state-of-the-art on up to 1 million instances while only making a small sacrifice in the coreference F1-scores. This is achieved by utilizing the matching histories of the instances to prune instance pairs that are not likely to be coreferent. We also propose Offline, another candidate selection algorithm, that not only provides similar runtime speedup to P-EPWNG but also helps to achieve higher candidate selection and coreference F1-scores due to its more accurate filtering of true negatives. Different from P-EPWNG, Offline pre-selects candidate pairs by only comparing their partial context information that is selected in an unsupervised, automatic and domain-independent manner.In order to be able to handle really heterogeneous datasets, a mechanism for automatically determining predicate comparability is proposed. Combing this property matching approach with EPWNG and Offline, our system outperforms state-of-the-art algorithms on the 2012 Billion Triples Challenge dataset on up to 2 million instances for both coreference F1-score and runtime. An interesting project, where we apply the EPWNG algorithm for assisting cervical cancer screening, is discussed in detail. By applying our algorithm to a combination of different patient clinical test results and biographic information, we achieve higher accuracy compared to its ablations. We end this dissertation with the discussion of promising and challenging future work

    Unsupervised Terminological Ontology Learning based on Hierarchical Topic Modeling

    Full text link
    In this paper, we present hierarchical relationbased latent Dirichlet allocation (hrLDA), a data-driven hierarchical topic model for extracting terminological ontologies from a large number of heterogeneous documents. In contrast to traditional topic models, hrLDA relies on noun phrases instead of unigrams, considers syntax and document structures, and enriches topic hierarchies with topic relations. Through a series of experiments, we demonstrate the superiority of hrLDA over existing topic models, especially for building hierarchies. Furthermore, we illustrate the robustness of hrLDA in the settings of noisy data sets, which are likely to occur in many practical scenarios. Our ontology evaluation results show that ontologies extracted from hrLDA are very competitive with the ontologies created by domain experts

    Netlang: A software for the linguistic analysis of corpora by means of complex networks

    Get PDF
    To date there is no software that directly connects the linguistic analysis of a conversation to a network program. Networks programs are able to extract statistical information from data basis with information about systems of interacting elements. Language has also been conceived and studied as a complex system. However, most proposals do not analyze language according to linguistic theory, but use instead computational systems that should save time at the price of leaving aside many crucial aspects for linguistic theory. Some approaches to network studies on language do apply precise linguistic analyses, made by a linguist. The problem until now has been the lack of interface between the analysis of a sentence and its integration into the network that could be managed by a linguist and that could save the analysis of any language. Previous works have used old software that was not created for these purposes and that often produced problems with some idiosyncrasies of the target language. The desired interface should be able to deal with the syntactic peculiarities of a particular language, the options of linguistic theory preferred by the user and the preservation of morpho-syntactic information (lexical categories and syntactic relations between items). Netlang is the first program able to do that. Recently, a new kind of linguistic analysis has been developed, which is able to extract a complexity pattern from the speaker’s linguistic production which is depicted as a network where words are inside nodes, and these nodes connect each other by means of edges or links (the information inside the edge can be syntactic, semantic, etc.). The Netlang software has become the bridge between rough linguistic data and the network program. Netlang has integrated and improved the functions of programs used in the past, namely the DGA annotator and two scripts (ToXML.pl and Xml2Pairs.py) used for transforming and pruning data. Netlang allows the researcher to make accurate linguistic analysis by means of syntactic dependency relations between words, while tracking record of the nature of such syntactic relationships (subject, object, etc). The Netlang software is presented as a new tool that solve many problems detected in the past. The most important improvement is that Netlang integrates three past applications into one program, and is able to produce a series of file formats that can be read by a network program. Through the Netlang software, the linguistic network analysis based on syntactic analyses, characterized for its low cost and the completely non-invasive procedure aims to evolve into a sufficiently fine grained tool for clinical diagnosis in potential cases of language disorders
    • …
    corecore