On Term Selection Techniques for Patent Prior Art Search
A patent is a set of exclusive rights granted to an inventor to
protect his invention for
a limited period of time. Patent prior art search involves
finding previously granted
patents, scientific articles, product descriptions, or any other
published work that
may be relevant to a new patent application. Many well-known
information retrieval
(IR) techniques (e.g., typical query expansion methods), which have proven
effective for ad hoc search, are unsuccessful for patent prior art search.
In this thesis, we
mainly investigate the reasons that generic IR techniques are not
effective for prior
art search on the CLEF-IP test collection. First, we analyse errors caused by
data curation and by experimental settings, such as using the International Patent
Classification codes assigned to the patent topics to filter the search results.
Then, we investigate
the influence of term selection on retrieval performance on the
CLEF-IP prior art
test collection, starting with the description section of the
reference patent and using
language models (LM) and BM25 scoring functions. We find that an oracular
relevance feedback system, which extracts terms from the judged relevant
documents, far outperforms the baseline (0.48 vs. 0.11 in mean average
precision, MAP) and achieves twice the MAP of the best participant in
CLEF-IP 2010 (0.48 vs. 0.22). We find a clear threshold on term selection
value to use when choosing feedback terms. We
also notice that most of the useful feedback terms are actually
present in the original
query and hypothesise that the baseline system can be
substantially improved by removing
negative query terms. We try four simple automated approaches to
identify
negative terms for query reduction, but we are unable to improve on the baseline
performance with any of them. However, we show that a simple, minimal-feedback
interactive approach, in which terms are selected from only the first retrieved
relevant document, outperforms the best result from CLEF-IP 2010, suggesting the
promise of interactive methods for term selection in patent prior art search.
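To make the oracular feedback idea concrete, one can rank candidate terms by how much more often they occur in the judged relevant documents than in the collection at large, keeping only terms whose selection value clears a threshold. The Python sketch below is an illustrative reconstruction under those assumptions; the log-odds scoring heuristic, the smoothing constants, and the function name oracle_feedback_terms are ours, not the thesis's.

```python
import math
from collections import Counter

def oracle_feedback_terms(relevant_docs, collection_docs,
                          threshold=2.0, top_k=30):
    """Rank candidate terms by a log-odds 'selection value': how much
    more often a term occurs in judged relevant documents than in the
    collection overall. An illustrative heuristic, not the thesis method."""
    rel_df, col_df = Counter(), Counter()
    for doc in relevant_docs:
        rel_df.update(set(doc.lower().split()))   # document frequency in relevant set
    for doc in collection_docs:
        col_df.update(set(doc.lower().split()))   # document frequency in collection

    n_rel, n_col = len(relevant_docs), len(collection_docs)
    scored = {}
    for term, df_rel in rel_df.items():
        p_rel = (df_rel + 0.5) / (n_rel + 1.0)        # smoothed P(term | relevant)
        p_col = (col_df[term] + 0.5) / (n_col + 1.0)  # smoothed P(term | collection)
        scored[term] = math.log(p_rel / p_col)        # selection value

    # Keep only terms whose selection value clears the threshold.
    kept = [(t, s) for t, s in scored.items() if s >= threshold]
    return sorted(kept, key=lambda ts: -ts[1])[:top_k]
```

The selected terms would then be issued as a query and scored with BM25 or a language model, as in the experiments described above.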
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions for research.
Term selection in information retrieval
Systems trained on linguistically annotated data achieve strong performance for many
language processing tasks. This encourages the idea that annotations can improve any
language processing task if applied in the right way. However, despite widespread
acceptance and availability of highly accurate parsing software, it is not clear that ad
hoc information retrieval (IR) techniques using annotated documents and requests consistently
improve search performance compared to techniques that use no linguistic
knowledge. In many cases, retrieval gains made using language processing components,
such as part-of-speech tagging and head-dependent relations, are offset by significant
negative effects. This results in a minimal positive, or even negative, overall
impact for linguistically motivated approaches compared to approaches that do not use
any syntactic or domain knowledge.
In some cases, it may be that syntax does not reveal anything of practical importance
about document relevance. Yet without a convincing explanation for why linguistic
annotations fail in IR, the intuitive appeal of search systems that ‘understand’ text
can result in the repeated application, and mis-application, of language processing to
enhance search performance. This dissertation investigates whether linguistics can improve
the selection of query terms by better modelling the alignment process between
natural language requests and search queries. It is the most comprehensive work on
the utility of linguistic methods in IR to date.
Term selection in this work focuses on identification of informative query terms of
1-3 words that both represent the semantics of a request and discriminate between relevant
and non-relevant documents. Approaches to word association are discussed with
respect to linguistic principles, and evaluated with respect to semantic characterization
and discriminative ability. Analysis is organised around three theories of language that
emphasize different structures for the identification of terms: phrase structure theory,
dependency theory and lexicalism. The structures identified by these theories play
distinctive roles in the organisation of language. Evidence is presented regarding the
value of different methods of word association based on these structures, and the effect
of method and term combinations.
Two highly effective, novel methods for the selection of terms from verbose queries
are also proposed and evaluated. The first method focuses on the semantic phenomenon
of ellipsis with a discriminative filter that leverages diverse text features. The second
method exploits a term ranking algorithm, PhRank, that uses no linguistic information
and relies on a network model of query context. The latter focuses queries so that 1-5
terms in an unweighted model achieve better retrieval effectiveness than weighted IR
models that use up to 30 terms. In addition, unlike models that use a weighted distribution
of terms or subqueries, the concise terms identified by PhRank are interpretable by
users. Evaluation with newswire and web collections demonstrates that PhRank-based
query reformulation significantly improves the performance of verbose queries, by up
to 14% compared to highly competitive IR models, and is at least as good for short
keyword queries with the same models.
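The abstract does not spell out PhRank's algorithm, but the family it names, ranking candidate terms over a network model of query context, can be sketched as a weighted PageRank-style walk on a word co-occurrence graph built from context documents. Everything in the sketch below (the window size, damping factor, and function name) is an assumption for illustration, not the published method.

```python
from collections import defaultdict

def rank_terms_by_walk(context_docs, window=4, damping=0.85, iters=30):
    """Rank candidate query terms with a PageRank-style random walk over
    a word co-occurrence graph: a sketch of the general family PhRank
    belongs to, not the published algorithm. Words co-occurring within
    a fixed window are linked; edge weights count co-occurrences."""
    edges, nodes = defaultdict(float), set()
    for doc in context_docs:
        words = doc.lower().split()
        for i, a in enumerate(words):
            for b in words[i + 1:i + window]:
                if a != b:
                    edges[(a, b)] += 1.0
                    edges[(b, a)] += 1.0
                    nodes.update((a, b))
    if not nodes:
        return []

    out_weight = defaultdict(float)
    for (a, _), w in edges.items():
        out_weight[a] += w

    # Power iteration over the weighted graph.
    rank = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        nxt = {n: (1.0 - damping) / len(nodes) for n in nodes}
        for (a, b), w in edges.items():
            nxt[b] += damping * rank[a] * w / out_weight[a]
        rank = nxt
    return sorted(rank.items(), key=lambda ts: -ts[1])
```

Taking the top 1-5 ranked terms as the reformulated query corresponds to the focused, unweighted queries evaluated above.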
Results illustrate that linguistic processing may help with the selection of word associations
but does not necessarily translate into improved IR performance. Statistical
methods are necessary to overcome the limits of syntactic parsing and word adjacency
measures for ad hoc IR. As a result, probabilistic frameworks that discover, and make
use of, many forms of linguistic evidence may deliver small improvements in IR effectiveness,
but methods that use simple features can be substantially more efficient
and equally, or more, effective. Various explanations for this finding are suggested,
including the probabilistic nature of grammatical categories, a lack of homomorphism
between syntax and semantics, the impact of lexical relations, variability in collection
data, and systemic effects in language systems.
The anatomy of a search and mining system for digital humanities: Search And Mining Tools for Language Archives (SAMTLA)
Humanities researchers are faced with an overwhelming volume of digitised
primary source material and "born digital" information of relevance to their
research, as a result of large-scale digitisation projects. The current digital tools
do not provide consistent support for analysing the content of digital archives
that are potentially large in scale, multilingual, and come in a range of data
formats. The current language-dependent, or project-specific, approach to tool
development often puts the tools out of reach for many research disciplines in
the humanities. In addition, the tools can be incompatible with the way
researchers locate and compare the relevant sources. For instance, researchers
are interested in shared structural text patterns, known as "parallel passages"
that describe a specific cultural, social, or historical context relevant to their
research topic. Identifying these shared structural text patterns is challenging
due to their repeated yet highly variable nature, as a result of differences in
the domain, author, language, time period, and orthography.
The contribution of the thesis is a novel infrastructure that directly addresses
the need for generic, flexible, extendable, and sustainable digital tools
that are applicable to a wide range of digital archives and research in the
humanities. The infrastructure adopts a character-level n-gram Statistical
Language Model (SLM), stored in a space-optimised k-truncated suffix tree
data structure as its underlying data model. A character-level n-gram model
is a relatively new approach that is competitive with word-level n-gram models,
but has the added advantage that it is domain- and language-independent,
requiring little or no preprocessing of the document text, unlike word-level
models, which require some form of language-dependent tokenisation and stemming.
Character-level n-grams capture word-internal features that are ignored
by word-level n-gram models, which provides greater flexibility in addressing
the information need of the user through tolerant search, and compensation
for erroneous query specification or spelling errors in the document text. Furthermore,
the SLM provides a unified approach to information retrieval and
text mining, where traditional approaches have tended to adopt separate data
models that are often ad-hoc or based on heuristic assumptions. In addition,
the performance of the character-level n-gram SLM was formally evaluated
through crowdsourcing, which demonstrates that the retrieval performance of
the SLM is close to human-level performance.
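To make the data model concrete: a character-level n-gram model scores a string directly from counts of its character n-grams, with no tokenisation or stemming, which is what makes it domain- and language-independent. The sketch below is a minimal illustration using plain dictionaries and add-one smoothing; Samtla itself stores the counts in a space-optimised k-truncated suffix tree, and the class name and smoothing choice here are assumptions.

```python
import math
from collections import defaultdict

class CharNgramLM:
    """Character-level n-gram language model with add-one smoothing.
    A minimal sketch of the modelling idea; a production system such as
    Samtla would store the counts in a k-truncated suffix tree."""

    def __init__(self, n=4):
        self.n = n
        self.context_counts = defaultdict(int)  # counts of (n-1)-char contexts
        self.ngram_counts = defaultdict(int)    # counts of full n-grams
        self.vocab = set()

    def train(self, text):
        padded = "^" * (self.n - 1) + text      # pad so every char has a context
        for i in range(len(text)):
            context, char = padded[i:i + self.n - 1], padded[i + self.n - 1]
            self.context_counts[context] += 1
            self.ngram_counts[context + char] += 1
            self.vocab.add(char)

    def log_prob(self, text):
        """Log-probability of a string; works on raw text in any language."""
        padded = "^" * (self.n - 1) + text
        logp = 0.0
        for i in range(len(text)):
            context, char = padded[i:i + self.n - 1], padded[i + self.n - 1]
            num = self.ngram_counts[context + char] + 1          # add-one smoothing
            den = self.context_counts[context] + len(self.vocab)
            logp += math.log(num / den)
        return logp
```

Documents can then be ranked against a query by the log-probability their models assign to it, giving the unified retrieval-and-mining view described above.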
The proposed infrastructure supports the development of Samtla (Search
And Mining Tools for Language Archives), which provides humanities researchers
with digital tools for search, browsing, and text mining of digital archives
in any domain or language, within a single system. Samtla supersedes many of
the existing tools for humanities researchers by supporting the same or similar
functionality, but with a domain-independent and language-independent
approach. The functionality includes a browsing tool constructed
from the metadata and named entities extracted from the document text, a
hybrid recommendation system for recommending related queries and documents.
Some tools are novel, developed in response to
the specific needs of the researchers, such as the document comparison tool
for visualising shared sequences between groups of related documents. Furthermore,
Samtla is the first practical example of a system with an SLM as
its primary data model that supports the real research needs of several case
studies covering different areas of research in the humanities.
Remedying Security Concerns at an Internet Scale
The state of security across the Internet is poor, and it has been so since the advent of the modern Internet. While the research community has made tremendous progress over the years in learning how to design and build secure computer systems, network protocols, and algorithms, we are far from a world where we can truly trust the security of deployed Internet systems. In reality, we may never reach such a world. Security concerns continue to be identified at scale throughout the software ecosystem, with thousands of vulnerabilities discovered each year. Meanwhile, attacks have become ever more frequent and consequential. As Internet systems will continue to be inevitably affected by newly found security concerns, the research community must develop more effective ways to remedy these issues. To that end, in this dissertation, we conduct extensive empirical measurements to understand how remediation occurs in practice for Internet systems, and explore methods for spurring improved remediation behavior. This dissertation provides a treatment of the complete remediation life cycle, investigating the creation, dissemination, and deployment of remedies. We start by focusing on security patches that address vulnerabilities, and analyze at scale their creation process, the characteristics of the resulting fixes, and how these impact vulnerability remediation. We then investigate and systematize how administrators of Internet systems deploy software updates that patch vulnerabilities across the many machines they manage on behalf of organizations. Finally, we conduct the first systematic exploration of Internet-scale outreach efforts to disseminate information about security concerns and their remedies to system administrators, with the aim of driving their remediation decisions. Our results show that such outreach campaigns can effectively galvanize positive reactions. Improving remediation, particularly at scale, is challenging, as the problem space exhibits many dimensions beyond traditional technical considerations, including human, social, organizational, economic, and policy facets. To make meaningful progress, this work uses a diversity of empirical methods, from software data mining to user studies to Internet-wide network measurements, to systematically collect and evaluate large-scale datasets. Ultimately, this dissertation establishes broad empirical grounding on security remediation in practice today, as well as new approaches for improved remediation at an Internet scale.
Survey of the State of the Art in Natural Language Generation: Core tasks, applications and evaluation
This paper surveys the current state of the art in Natural Language
Generation (NLG), defined as the task of generating text or speech from
non-linguistic input. A survey of NLG is timely in view of the changes that the
field has undergone over the past decade or so, especially in relation to new
(usually data-driven) methods, as well as new applications of NLG technology.
This survey therefore aims to (a) give an up-to-date synthesis of research on
the core tasks in NLG and the architectures within which such tasks are
organised; (b) highlight a number of relatively recent research topics that
have arisen partly as a result of growing synergies between NLG and other areas
of artificial intelligence; (c) draw attention to the challenges in NLG
evaluation, relating them to similar challenges faced in other areas of Natural
Language Processing, with an emphasis on different evaluation methods and the
relationships between them. (Published in the Journal of AI Research (JAIR), volume 61, pp. 75-170; 118 pages, 8 figures, 1 table.)