On Term Selection Techniques for Patent Prior Art Search
A patent is a set of exclusive rights granted to an inventor to protect the invention for a limited period of time. Patent prior art search involves finding previously granted patents, scientific articles, product descriptions, or any other published work that may be relevant to a new patent application. Many well-known information retrieval (IR) techniques (e.g., typical query expansion methods), which are proven effective for ad hoc search, are unsuccessful for patent prior art search. In this thesis, we mainly investigate why generic IR techniques are not effective for prior art search on the CLEF-IP test collection. First, we analyse the errors caused by data curation and by experimental settings such as applying the International Patent Classification codes assigned to the patent topics to filter the search results. Then, we investigate the influence of term selection on retrieval performance on the CLEF-IP prior art test collection, starting with the description section of the reference patent and using language model (LM) and BM25 scoring functions. We find that an oracular relevance feedback system, which extracts terms from the judged relevant documents, far outperforms the baseline (i.e., 0.11 vs. 0.48) and performs twice as well on mean average precision (MAP) as the best participant in CLEF-IP 2010 (i.e., 0.22 vs. 0.48). We find a very clear term selection value threshold for use when choosing terms. We also notice that most of the useful feedback terms are actually present in the original query and hypothesise that the baseline system can be substantially improved by removing negative query terms. We try four simple automated approaches to identify negative terms for query reduction, but we are unable to improve on the baseline performance with any of them. However, we show that a simple, minimal-feedback interactive approach, where terms are selected from only the first retrieved relevant document, outperforms the best result from CLEF-IP 2010, suggesting the promise of interactive methods for term selection in patent prior art search.
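The oracular term selection described above can be illustrated with a minimal sketch. The abstract does not give the scoring formula, so the score used here (average term frequency in judged relevant documents minus average term frequency in judged non-relevant documents) and the names `avg_term_freq`, `select_terms`, and `threshold` are assumptions made for illustration only, not the thesis's implementation.

```python
from collections import Counter


def avg_term_freq(docs):
    """Average per-document frequency of each term over a list of token lists."""
    totals = Counter()
    for doc in docs:
        totals.update(doc)
    n = max(len(docs), 1)  # avoid division by zero for an empty list
    return {term: count / n for term, count in totals.items()}


def select_terms(query_terms, relevant_docs, nonrelevant_docs, threshold=0.0):
    """Keep query terms whose (assumed) feedback score exceeds the threshold.

    Score = avg frequency in relevant docs - avg frequency in non-relevant docs.
    Terms at or below the threshold are treated as negative terms and dropped.
    """
    rel = avg_term_freq(relevant_docs)
    non = avg_term_freq(nonrelevant_docs)
    scores = {t: rel.get(t, 0.0) - non.get(t, 0.0) for t in set(query_terms)}
    kept = sorted(t for t, s in scores.items() if s > threshold)
    return kept, scores
```

In this sketch, query reduction amounts to discarding every query term whose score does not exceed `threshold`; the reduced query would then be passed to whatever scoring function (for example BM25) the baseline uses.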
Information retrieval and text mining technologies for chemistry
Efficient access to chemical information contained in scientific literature, patents, technical reports, or the web is a pressing need shared by researchers and patent attorneys from different chemical disciplines. Retrieval of important chemical information in most cases starts with finding relevant documents for a particular chemical compound or family. Targeted retrieval of chemical documents is closely connected to the automatic recognition of chemical entities in the text, which commonly involves the extraction of the entire list of chemicals mentioned in a document, including any associated information. In this Review, we provide a comprehensive and in-depth description of fundamental concepts, technical implementations, and current technologies for meeting these information demands. A strong focus is placed on community challenges addressing systems performance, more particularly CHEMDNER and CHEMDNER patents tasks of BioCreative IV and V, respectively. Considering the growing interest in the construction of automatically annotated chemical knowledge bases that integrate chemical information and biological data, cheminformatics approaches for mapping the extracted chemical names into chemical structures and their subsequent annotation together with text mining applications for linking chemistry with biological information are also presented. Finally, future trends and current challenges are highlighted as a roadmap proposal for research in this emerging field.
A.V. and M.K. acknowledge funding from the European Community’s Horizon 2020 Program (project reference: 654021 - OpenMinted). M.K. additionally acknowledges the Encomienda MINETAD-CNIO as part of the Plan for the Advancement of Language Technology. O.R. and J.O. thank the Foundation for Applied Medical Research (FIMA), University of Navarra (Pamplona, Spain). This work was partially funded by Consellería de Cultura, Educación e Ordenación Universitaria (Xunta de Galicia), and FEDER (European Union), and the Portuguese Foundation for Science and Technology (FCT) under the scope of the strategic funding of UID/BIO/04469/2013 unit and COMPETE 2020 (POCI-01-0145-FEDER-006684). We thank Iñigo García-Yoldi for useful feedback and discussions during the preparation of the manuscript.
Topical relevance models
An inherent characteristic of information retrieval (IR) is that the query expressing a user's information need is often multi-faceted, that is, it encapsulates more than one specific potential sub-information need. This multifacetedness of queries manifests itself as a topic distribution in the retrieved set of documents, where each document can be considered as a mixture of topics, one or more of which may correspond to the sub-information needs expressed in the query. In some specific domains of IR, such as patent prior art search, where the queries are full patent articles and the objective is to (in)validate the claims contained therein, the queries themselves are multi-topical in addition to the retrieved set of documents. The overall objective of the research described in this thesis involves investigating techniques to recognize and exploit these multi-topical characteristics of the retrieved documents and the queries in IR and relevance feedback in IR.
First, we hypothesize that segments of documents in close proximity to the query terms are indicative of these segments being topically related to the query terms. An intuitive choice for the unit of such segments, in close proximity to query terms within documents, is the sentence, which characteristically represents a collection of semantically related terms. This way of utilizing term proximity through the use of sentences is empirically shown to select potentially relevant topics from among those present in a retrieved document set and thus improve relevance feedback in IR.
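A rough sketch of sentence-based relevance feedback, under stated assumptions, may help make the idea concrete. The abstract does not specify how sentences are scored or how expansion terms are chosen; here sentences are ranked by the number of distinct query terms they contain, and expansion terms are the most frequent non-query terms in the top-ranked sentences. The function names and the parameters `top_sentences` and `top_terms` are hypothetical.

```python
import re
from collections import Counter


def split_sentences(text):
    """Naive sentence splitter on ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tokens(sentence):
    """Lowercased alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", sentence.lower())


def expansion_terms(query, feedback_docs, top_sentences=2, top_terms=3):
    """Pick expansion terms from the sentences that share most terms with the query."""
    q = set(tokens(query))
    scored = []
    for doc in feedback_docs:
        for sent in split_sentences(doc):
            overlap = len(q & set(tokens(sent)))
            if overlap:  # ignore sentences with no query term at all
                scored.append((overlap, sent))
    scored.sort(key=lambda pair: -pair[0])  # stable: ties keep document order
    counts = Counter()
    for _, sent in scored[:top_sentences]:
        counts.update(t for t in tokens(sent) if t not in q)
    return [term for term, _ in counts.most_common(top_terms)]
```

Sentences that contain no query term never contribute expansion terms, which is the point of using proximity: off-topic parts of an otherwise relevant document are left out of the feedback step.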
Secondly, to handle the very long queries of patent prior art search which are essentially multi-topical in nature, we hypothesize that segmenting these queries into topically focused segments and then using these topically focused segments as separate queries for retrieval can retrieve potentially relevant documents for each of these segments. The results for each of these segments then need to be merged to obtain a final retrieval result set for the whole query.
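The segment-then-merge procedure can be sketched as follows. This is only an illustration under assumptions: the query is cut into fixed-size word windows rather than topically focused segments, the retriever is a toy term-overlap scorer, and the per-segment result lists are merged by summing scores (a CombSUM-style fusion). None of these choices is stated in the abstract, and all function names are hypothetical.

```python
from collections import defaultdict


def segment_query(query_tokens, size):
    """Cut a long query into consecutive windows of `size` tokens (assumed segmentation)."""
    return [query_tokens[i:i + size] for i in range(0, len(query_tokens), size)]


def overlap_retrieve(segment, index, k):
    """Toy retriever: score = number of distinct segment terms found in the document."""
    seg = set(segment)
    scored = [(doc_id, len(seg & set(doc))) for doc_id, doc in index.items()]
    scored = [(doc_id, score) for doc_id, score in scored if score > 0]
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:k]


def merge_results(result_lists):
    """Fuse per-segment rankings by summing each document's scores."""
    fused = defaultdict(float)
    for results in result_lists:
        for doc_id, score in results:
            fused[doc_id] += score
    return sorted(fused.items(), key=lambda pair: (-pair[1], pair[0]))


def segmented_search(query_tokens, index, size=3, k=10):
    """Retrieve separately for each query segment, then merge into one ranking."""
    lists = [overlap_retrieve(seg, index, k) for seg in segment_query(query_tokens, size)]
    return merge_results(lists)
```

With score summation, a document that matches several segments moderately can rank alongside one that matches a single segment strongly; other fusion rules (for example round-robin interleaving) would weight these cases differently.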
These two conceptual approaches for utilizing the topical relatedness of terms in both the retrieved documents and the queries are then integrated more formally within a single statistical generative model, called the topical relevance model (TRLM). This model utilizes the underlying multi-topical nature of both retrieved documents and the query. Moreover, the model is used as the basis for construction of a novel search interface, called TopicVis, which lets the user visualize the topic distributions in the retrieved set of documents and the query. This visualization of the topics is beneficial to the user in the following ways. Firstly, through visualization of the ranked retrieval list, TopicVis helps the user choose one or more facets of interest from the query in a feedback step, after which it retrieves documents primarily composed of the selected facets at top ranks. Secondly, the system provides an access link to the first segment within a document focusing on the selected topic and also supports navigation links to subsequent segments on the same topic in other documents.
The methods proposed in this thesis are evaluated on datasets from the TREC IR benchmarking workshop series, and the CLEF-IP 2010 data, a patent prior art search data set. Experimental results show that relevance feedback using sentences and segmented retrieval for patent prior art search queries significantly improve IR effectiveness for the standard ad-hoc IR and patent prior art search tasks. Moreover, the topical relevance model (TRLM), designed to encapsulate these two complementary approaches within a single framework, significantly improves IR effectiveness for both standard ad-hoc IR and patent prior art search. Furthermore, a task-based user study experiment shows that novel features of topic visualization, topic-based feedback and topic-based navigation, implemented in the TopicVis interface, lead to effective and efficient task completion, achieving good user satisfaction.
What Presentation of Search Engine Results Do Health Information Searchers Prefer?
A study of a sample of online health information searchers was conducted to determine their preferences among four different display styles for search engine results on health topics. Screenshots of search result display screens were presented to the participants via a Qualtrics (www.qualtrics.com) online survey. The four display types were Display 1: Google standard display, Display 2: Google enhanced with faceted browsable categories, Display 3: Google enhanced with a word cloud for each search result, and Display 4: Google enhanced with an overview word cloud for the collection of search results. For each search task, participants were asked to rate the search engine results displays on quality indicators, using Likert-type item rating scales. At the end, in three concluding questions, the participants were asked to choose the display(s) that were best at meeting three specific criteria, based on overall impressions. The evaluations by the participants suggest that the standard Google search results display and the Google screen enhanced with faceted browsable categories were favored over the other two display types.
Master of Science in Information Scienc