How Part-of-Speech Tags Affect Text Retrieval and Filtering Performance
Natural language processing (NLP) applied to information retrieval (IR) and
filtering problems may assign part-of-speech tags to terms and, more generally,
modify queries and documents. Analytic models can predict the performance of a
text filtering system as it incorporates changes suggested by NLP, allowing us
to make precise statements about the average effect of NLP operations on IR.
Here we provide a model of retrieval and tagging that allows us both to compute
the performance change due to syntactic parsing and to understand
what factors affect performance and how. In addition to a prediction of
performance with tags, upper and lower bounds for retrieval performance are
derived, giving the best and worst effects of including part-of-speech tags.
Empirical grounds for selecting sets of tags are considered. Comment: uuencoded and compressed postscript
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost. Comment: 37 pages
Exploratory Analysis of Highly Heterogeneous Document Collections
We present an effective multifaceted system for exploratory analysis of
highly heterogeneous document collections. Our system is based on intelligently
tagging individual documents in a purely automated fashion and exploiting these
tags in a powerful faceted browsing framework. Tagging strategies employed
include both unsupervised and supervised approaches based on machine learning
and natural language processing. As one of our key tagging strategies, we
introduce the KERA algorithm (Keyword Extraction for Reports and Articles).
KERA extracts topic-representative terms from individual documents in a purely
unsupervised fashion and is revealed to be significantly more effective than
state-of-the-art methods. Finally, we evaluate our system in its ability to
help users locate documents pertaining to military critical technologies buried
deep in a large heterogeneous sea of information. Comment: 9 pages; KDD 2013: 19th ACM SIGKDD Conference on Knowledge Discovery
and Data Mining
Visual Affect Around the World: A Large-scale Multilingual Visual Sentiment Ontology
Every culture and language is unique. Our work expressly focuses on the
uniqueness of culture and language in relation to human affect, specifically
sentiment and emotion semantics, and how they manifest in social multimedia. We
develop sets of sentiment- and emotion-polarized visual concepts by adapting
semantic structures called adjective-noun pairs, originally introduced by Borth
et al. (2013), but in a multilingual context. We propose a new
language-dependent method for automatic discovery of these adjective-noun
constructs. We show how this pipeline can be applied on a social multimedia
platform for the creation of a large-scale multilingual visual sentiment
concept ontology (MVSO). Unlike the flat structure in Borth et al. (2013), our
unified ontology is organized hierarchically by multilingual clusters of
visually detectable nouns and subclusters of emotionally biased versions of
these nouns. In addition, we present an image-based prediction task to show how
generalizable language-specific models are in a multilingual context. A new,
publicly available dataset of >15.6K sentiment-biased visual concepts across 12
languages with language-specific detector banks, >7.36M images and their
metadata is also released. Comment: 11 pages, to appear at ACM MM'1
Natural language processing
Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST Chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems - text summarization, information extraction, information retrieval, etc., including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) evaluation of NLP systems.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as by domain experts who participated in the CHORUS Think-Tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and a socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective, we inventory the impact and legal consequences of these technical advances and point out future directions of research.