616 research outputs found

    On Term Selection Techniques for Patent Prior Art Search

    A patent is a set of exclusive rights granted to an inventor to protect an invention for a limited period of time. Patent prior art search involves finding previously granted patents, scientific articles, product descriptions, or any other published work that may be relevant to a new patent application. Many well-known information retrieval (IR) techniques (e.g., typical query expansion methods), which are proven effective for ad hoc search, are unsuccessful for patent prior art search. In this thesis, we mainly investigate the reasons that generic IR techniques are not effective for prior art search on the CLEF-IP test collection. First, we analyse the errors caused by data curation and by experimental settings, such as applying the International Patent Classification codes assigned to the patent topics to filter the search results. Then, we investigate the influence of term selection on retrieval performance on the CLEF-IP prior art test collection, starting with the description section of the reference patent and using language model (LM) and BM25 scoring functions. We find that an oracular relevance feedback system, which extracts terms from the judged relevant documents, far outperforms the baseline (i.e., 0.11 vs. 0.48) and performs twice as well on mean average precision (MAP) as the best participant in CLEF-IP 2010 (i.e., 0.22 vs. 0.48). We find a very clear term selection value threshold for use when choosing terms. We also notice that most of the useful feedback terms are actually present in the original query and hypothesise that the baseline system can be substantially improved by removing negative query terms. We try four simple automated approaches to identify negative terms for query reduction, but we are unable to improve on the baseline performance with any of them. However, we show that a simple, minimal-feedback interactive approach, where terms are selected from only the first retrieved relevant document, outperforms the best result from CLEF-IP 2010, suggesting the promise of interactive methods for term selection in patent prior art search.
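
To make the idea of threshold-based term selection from relevance feedback concrete, the following is a minimal sketch, not the thesis's actual method. The scoring rule (a term's document-frequency rate in relevant documents minus its rate in non-relevant documents), the function names `term_scores` and `select_terms`, and the toy documents are all assumptions made for illustration; the abstract does not state the scoring formula or the threshold value.

```python
from collections import Counter


def term_scores(relevant_docs, nonrelevant_docs):
    """Score terms by how much more often they occur in relevant documents.

    Hypothetical scoring rule: fraction of relevant documents containing the
    term minus the fraction of non-relevant documents containing it.
    """
    def rate(docs):
        counts = Counter()
        for doc in docs:
            # Count each term at most once per document.
            counts.update(set(doc.lower().split()))
        n = max(len(docs), 1)
        return {term: c / n for term, c in counts.items()}

    rel, non = rate(relevant_docs), rate(nonrelevant_docs)
    return {t: rel.get(t, 0.0) - non.get(t, 0.0) for t in set(rel) | set(non)}


def select_terms(query_terms, scores, threshold=0.0):
    """Query reduction: keep only query terms scoring above the threshold."""
    return [t for t in query_terms if scores.get(t, 0.0) > threshold]


# Toy data (assumed): two judged-relevant and two non-relevant documents.
relevant = ["rotor blade turbine", "turbine blade cooling"]
nonrelevant = ["method apparatus blade", "method system"]
scores = term_scores(relevant, nonrelevant)
print(select_terms(["method", "turbine", "blade", "system"], scores))
# ['turbine', 'blade']
```

In this sketch, generic patent vocabulary such as "method" receives a negative score and is dropped, which mirrors the abstract's hypothesis that removing negative query terms could improve the baseline.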

    CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines

    Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective. The technical perspective includes an up-to-date view on content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines. From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.

    Term selection in information retrieval

    Systems trained on linguistically annotated data achieve strong performance for many language processing tasks. This encourages the idea that annotations can improve any language processing task if applied in the right way. However, despite widespread acceptance and availability of highly accurate parsing software, it is not clear that ad hoc information retrieval (IR) techniques using annotated documents and requests consistently improve search performance compared to techniques that use no linguistic knowledge. In many cases, retrieval gains made using language processing components, such as part-of-speech tagging and head-dependent relations, are offset by significant negative effects. This results in a minimal positive, or even negative, overall impact for linguistically motivated approaches compared to approaches that do not use any syntactic or domain knowledge. In some cases, it may be that syntax does not reveal anything of practical importance about document relevance. Yet without a convincing explanation for why linguistic annotations fail in IR, the intuitive appeal of search systems that ‘understand’ text can result in the repeated application, and mis-application, of language processing to enhance search performance. This dissertation investigates whether linguistics can improve the selection of query terms by better modelling the alignment process between natural language requests and search queries. It is the most comprehensive work on the utility of linguistic methods in IR to date. Term selection in this work focuses on identification of informative query terms of 1-3 words that both represent the semantics of a request and discriminate between relevant and non-relevant documents. Approaches to word association are discussed with respect to linguistic principles, and evaluated with respect to semantic characterization and discriminative ability. 
Analysis is organised around three theories of language that emphasize different structures for the identification of terms: phrase structure theory, dependency theory and lexicalism. The structures identified by these theories play distinctive roles in the organisation of language. Evidence is presented regarding the value of different methods of word association based on these structures, and the effect of method and term combinations. Two highly effective, novel methods for the selection of terms from verbose queries are also proposed and evaluated. The first method focuses on the semantic phenomenon of ellipsis with a discriminative filter that leverages diverse text features. The second method exploits a term ranking algorithm, PhRank, that uses no linguistic information and relies on a network model of query context. The latter focuses queries so that 1-5 terms in an unweighted model achieve better retrieval effectiveness than weighted IR models that use up to 30 terms. In addition, unlike models that use a weighted distribution of terms or subqueries, the concise terms identified by PhRank are interpretable by users. Evaluation with newswire and web collections demonstrates that PhRank-based query reformulation significantly improves the performance of verbose queries, by up to 14% compared to highly competitive IR models, and is at least as good for short keyword queries with the same models. Results illustrate that linguistic processing may help with the selection of word associations but does not necessarily translate into improved IR performance. Statistical methods are necessary to overcome the limits of syntactic parsing and word adjacency measures for ad hoc IR. As a result, probabilistic frameworks that discover, and make use of, many forms of linguistic evidence may deliver small improvements in IR effectiveness, but methods that use simple features can be substantially more efficient and equally, or more, effective. 
Various explanations for this finding are suggested, including the probabilistic nature of grammatical categories, a lack of homomorphism between syntax and semantics, the impact of lexical relations, variability in collection data, and systemic effects in language systems.

    The anatomy of a search and mining system for digital humanities : Search And Mining Tools for Language Archives (SAMTLA)

    Humanities researchers are faced with an overwhelming volume of digitised primary source material, and "born digital" information, of relevance to their research as a result of large-scale digitisation projects. The current digital tools do not provide consistent support for analysing the content of digital archives that are potentially large in scale, multilingual, and come in a range of data formats. The current language-dependent, or project-specific, approach to tool development often puts the tools out of reach for many research disciplines in the humanities. In addition, the tools can be incompatible with the way researchers locate and compare the relevant sources. For instance, researchers are interested in shared structural text patterns, known as "parallel passages", that describe a specific cultural, social, or historical context relevant to their research topic. Identifying these shared structural text patterns is challenging due to their repeated yet highly variable nature, as a result of differences in the domain, author, language, time period, and orthography. The contribution of the thesis is a novel infrastructure that directly addresses the need for generic, flexible, extendable, and sustainable digital tools that are applicable to a wide range of digital archives and research in the humanities. The infrastructure adopts a character-level n-gram Statistical Language Model (SLM), stored in a space-optimised k-truncated suffix tree data structure, as its underlying data model. A character-level n-gram model is a relatively new approach that is competitive with word-level n-gram models, but has the added advantage that it is domain- and language-independent, requiring little or no preprocessing of the document text, unlike word-level models that require some form of language-dependent tokenisation and stemming. 
Character-level n-grams capture word-internal features that are ignored by word-level n-gram models, which provides greater flexibility in addressing the information need of the user through tolerant search, and compensation for erroneous query specification or spelling errors in the document text. Furthermore, the SLM provides a unified approach to information retrieval and text mining, where traditional approaches have tended to adopt separate data models that are often ad hoc or based on heuristic assumptions. In addition, the performance of the character-level n-gram SLM was formally evaluated through crowdsourcing, which demonstrates that the retrieval performance of the SLM is close to human-level performance. The proposed infrastructure supports the development of Samtla (Search And Mining Tools for Language Archives), which provides humanities researchers with digital tools for search, browsing, and text mining of digital archives in any domain or language, within a single system. Samtla supersedes many of the existing tools for humanities researchers by supporting the same or similar functionality, but with a domain-independent and language-independent approach. The functionality includes a browsing tool constructed from the metadata and named entities extracted from the document text, and a hybrid recommendation system for recommending related queries and documents. However, some tools are novel and were developed in response to the specific needs of the researchers, such as the document comparison tool for visualising shared sequences between groups of related documents. Furthermore, Samtla is the first practical example of a system with an SLM as its primary data model that supports the real research needs of several case studies covering different areas of research in the humanities.
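
As a rough illustration of why a character-level n-gram model tolerates spelling variation, here is a minimal sketch of query-likelihood scoring. It is not Samtla's implementation: it stores counts in plain dictionaries rather than the k-truncated suffix tree described above, and the class name `CharNgramLM`, the add-one smoothing, and the example strings are all assumptions made for illustration.

```python
import math
from collections import Counter


class CharNgramLM:
    """Character-level n-gram language model over a single document (sketch)."""

    def __init__(self, text, n=3):
        self.n = n
        # Pad with spaces so the first characters also have a full context.
        padded = " " * (n - 1) + text
        spans = range(len(padded) - n + 1)
        self.ngrams = Counter(padded[i:i + n] for i in spans)
        self.contexts = Counter(padded[i:i + n - 1] for i in spans)
        self.vocab = set(text) | {" "}

    def log_prob(self, query):
        """Log-likelihood of the query, with add-one smoothing (assumed)."""
        padded = " " * (self.n - 1) + query
        v = len(self.vocab) + 1  # one extra slot for unseen characters
        total = 0.0
        for i in range(len(padded) - self.n + 1):
            gram = padded[i:i + self.n]
            total += math.log(
                (self.ngrams[gram] + 1) / (self.contexts[gram[:-1]] + v)
            )
        return total


# The misspelt query "color" still shares trigrams with "colour".
doc_a = CharNgramLM("the colour of the sky")
doc_b = CharNgramLM("patent prior art search")
print(doc_a.log_prob("color") > doc_b.log_prob("color"))  # True
```

No tokenisation or stemming is applied to either document or query, which is the property the abstract highlights as making the approach language-independent.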

    Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation

    This paper surveys the current state of the art in Natural Language Generation (NLG), defined as the task of generating text or speech from non-linguistic input. A survey of NLG is timely in view of the changes that the field has undergone over the past decade or so, especially in relation to new (usually data-driven) methods, as well as new applications of NLG technology. This survey therefore aims to (a) give an up-to-date synthesis of research on the core tasks in NLG and the architectures in which such tasks are organised; (b) highlight a number of relatively recent research topics that have arisen partly as a result of growing synergies between NLG and other areas of artificial intelligence; and (c) draw attention to the challenges in NLG evaluation, relating them to similar challenges faced in other areas of Natural Language Processing, with an emphasis on different evaluation methods and the relationships between them. Comment: Published in Journal of AI Research (JAIR), volume 61, pp 75-170. 118 pages, 8 figures, 1 table.