5,939 research outputs found
Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval
Although more and more language pairs are covered by machine translation
services, there are still many pairs that lack translation resources.
Cross-language information retrieval (CLIR) is an application which needs
translation functionality of a relatively low level of sophistication since
current models for information retrieval (IR) are still based on a
bag-of-words. The Web provides a vast resource for the automatic construction
of parallel corpora which can be used to train statistical translation models
automatically. The resulting translation models can be embedded in several ways
in a retrieval model. In this paper, we will investigate the problem of
automatically mining parallel texts from the Web and different ways of
integrating the translation models within the retrieval process. Our
experiments on standard test collections for CLIR show that the Web-based
translation models can surpass commercial MT systems in CLIR tasks. These
results open the perspective of constructing a fully automatic query
translation device for CLIR at a very low cost.Comment: 37 page
Probabilistic Bag-Of-Hyperlinks Model for Entity Linking
Many fundamental problems in natural language processing rely on determining
what entities appear in a given text. Commonly referenced as entity linking,
this step is a fundamental component of many NLP tasks such as text
understanding, automatic summarization, semantic search or machine translation.
Name ambiguity, word polysemy, context dependencies and a heavy-tailed
distribution of entities contribute to the complexity of this problem.
We here propose a probabilistic approach that makes use of an effective
graphical model to perform collective entity disambiguation. Input mentions
(i.e.,~linkable token spans) are disambiguated jointly across an entire
document by combining a document-level prior of entity co-occurrences with
local information captured from mentions and their surrounding context. The
model is based on simple sufficient statistics extracted from data, thus
relying on few parameters to be learned.
Our method does not require extensive feature engineering, nor an expensive
training procedure. We use loopy belief propagation to perform approximate
inference. The low complexity of our model makes this step sufficiently fast
for real-time usage. We demonstrate the accuracy of our approach on a wide
range of benchmark datasets, showing that it matches, and in many cases
outperforms, existing state-of-the-art methods
Investigation on Applying Modular Ontology to Statistical Language Model for Information Retrieval
The objective of this research is to provide a novel approach to improving retrieval performance by exploiting Ontology with the statistical language model (SLM). The proposed methods consist of two major processes, namely ontology-based query expansion (OQE) and ontology-based document classification (ODC). Research experiments have required development of an independent search tool that can combine the OQE and ODC in a traditional SLM-based information retrieval (IR) process using a Web document collection.
This research considers the ongoing challenges of modular ontology enhanced SLM-based search and addresses three contribution aspects. The first concerns how to apply modular ontology to query expansion, in a bespoke language model search tool (LMST). The second considers how to incorporate OQE with the language model to improve the search performance. The third examines how to manipulate such semantic-based document classification to improve the smoothing accuracy. The role of ontology in the research is to provide formally described domains of interest that serve as context, to enhance system query effectiveness
Neural Speed Reading with Structural-Jump-LSTM
Recurrent neural networks (RNNs) can model natural language by sequentially
'reading' input tokens and outputting a distributed representation of each
token. Due to the sequential nature of RNNs, inference time is linearly
dependent on the input length, and all inputs are read regardless of their
importance. Efforts to speed up this inference, known as 'neural speed
reading', either ignore or skim over part of the input. We present
Structural-Jump-LSTM: the first neural speed reading model to both skip and
jump text during inference. The model consists of a standard LSTM and two
agents: one capable of skipping single words when reading, and one capable of
exploiting punctuation structure (sub-sentence separators (,:), sentence end
symbols (.!?), or end of text markers) to jump ahead after reading a word. A
comprehensive experimental evaluation of our model against all five
state-of-the-art neural reading models shows that Structural-Jump-LSTM achieves
the best overall floating point operations (FLOP) reduction (hence is faster),
while keeping the same accuracy or even improving it compared to a vanilla LSTM
that reads the whole text.Comment: 10 page
Automatic tagging and geotagging in video collections and communities
Automatically generated tags and geotags hold great promise
to improve access to video collections and online communi-
ties. We overview three tasks offered in the MediaEval 2010
benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features
A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity
Since the emergence in the popularity of XML for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm PCXSS that keeps the heterogeneous XML documents into various groups according to their similar structural and semantic representations. We develop a global criterion function CPSim that progressively measures the similarity between a XML document and existing clusters, ignoring the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate
- …