3 research outputs found
Final FLaReNet deliverable: Language Resources for the Future - The Future of Language Resources
Language Technologies (LT), together with their backbone, Language Resources (LR), provide an essential support to the challenge of Multilingualism and ICT of the future. The main task of language technologies is to bridge language barriers and to help creating a new environment where information flows smoothly across frontiers and languages, no matter the country, and the language, of origin. To achieve this goal, all players involved need to act as a community able to join forces on a set of shared priorities. However, until now the field of Language Resources and Technology has long suffered from an excess of individuality and fragmentation, with a lack of coherence concerning the priorities for the field, the direction to move, not to mention a common timeframe. The context encountered by the FLaReNet project was thus represented by an active field needing a coherence that can only be given by sharing common priorities and endeavours. FLaReNet has contributed to the creation of this coherence by gathering a wide community of experts and making them participate in the definition of an exhaustive set of recommendations
Exploiting Social Semantics for Multilingual Information Retrieval
In this thesis we consider how user-generated content that is assembled by different popular Web portals can be exploited for Multilingual Information Retrieval. We define the knowledge that can be derived from such portals as Social Semantics. We present to approaches, Cross-lingual Explicit Semantic Analysis and Discriminative Retrieval Models, that are able to support multilingual retrieval models by integrating Social Semantics derived from Wikipedia and Yahoo! Answers
Term selection in information retrieval
Systems trained on linguistically annotated data achieve strong performance for many
language processing tasks. This encourages the idea that annotations can improve any
language processing task if applied in the right way. However, despite widespread
acceptance and availability of highly accurate parsing software, it is not clear that ad
hoc information retrieval (IR) techniques using annotated documents and requests consistently
improve search performance compared to techniques that use no linguistic
knowledge. In many cases, retrieval gains made using language processing components,
such as part-of-speech tagging and head-dependent relations, are offset by significant
negative effects. This results in a minimal positive, or even negative, overall
impact for linguistically motivated approaches compared to approaches that do not use
any syntactic or domain knowledge.
In some cases, it may be that syntax does not reveal anything of practical importance
about document relevance. Yet without a convincing explanation for why linguistic
annotations fail in IR, the intuitive appeal of search systems that ‘understand’ text
can result in the repeated application, and mis-application, of language processing to
enhance search performance. This dissertation investigates whether linguistics can improve
the selection of query terms by better modelling the alignment process between
natural language requests and search queries. It is the most comprehensive work on
the utility of linguistic methods in IR to date.
Term selection in this work focuses on identification of informative query terms of
1-3 words that both represent the semantics of a request and discriminate between relevant
and non-relevant documents. Approaches to word association are discussed with
respect to linguistic principles, and evaluated with respect to semantic characterization
and discriminative ability. Analysis is organised around three theories of language that
emphasize different structures for the identification of terms: phrase structure theory,
dependency theory and lexicalism. The structures identified by these theories play
distinctive roles in the organisation of language. Evidence is presented regarding the
value of different methods of word association based on these structures, and the effect
of method and term combinations.
Two highly effective, novel methods for the selection of terms from verbose queries
are also proposed and evaluated. The first method focuses on the semantic phenomenon
of ellipsis with a discriminative filter that leverages diverse text features. The second
method exploits a term ranking algorithm, PhRank, that uses no linguistic information
and relies on a network model of query context. The latter focuses queries so that 1-5
terms in an unweighted model achieve better retrieval effectiveness than weighted IR
models that use up to 30 terms. In addition, unlike models that use a weighted distribution
of terms or subqueries, the concise terms identified by PhRank are interpretable by
users. Evaluation with newswire and web collections demonstrates that PhRank-based
query reformulation significantly improves performance of verbose queries up to 14%
compared to highly competitive IR models, and is at least as good for short, keyword
queries with the same models.
Results illustrate that linguistic processing may help with the selection of word associations
but does not necessarily translate into improved IR performance. Statistical
methods are necessary to overcome the limits of syntactic parsing and word adjacency
measures for ad hoc IR. As a result, probabilistic frameworks that discover, and make
use of, many forms of linguistic evidence may deliver small improvements in IR effectiveness,
but methods that use simple features can be substantially more efficient
and equally, or more, effective. Various explanations for this finding are suggested,
including the probabilistic nature of grammatical categories, a lack of homomorphism
between syntax and semantics, the impact of lexical relations, variability in collection
data, and systemic effects in language systems