Search CORE

5,939 research outputs found

Embedding Web-based Statistical Translation Models in Cross-Language Information Retrieval

Author: Kraaij Wessel
Nie Jian-Yun
Simard Michel
Publication venue
Publication date: 01/01/2003
Field of study

Although more and more language pairs are covered by machine translation services, there are still many pairs that lack translation resources. Cross-language information retrieval (CLIR) is an application which needs translation functionality of a relatively low level of sophistication since current models for information retrieval (IR) are still based on a bag-of-words. The Web provides a vast resource for the automatic construction of parallel corpora which can be used to train statistical translation models automatically. The resulting translation models can be embedded in several ways in a retrieval model. In this paper, we will investigate the problem of automatically mining parallel texts from the Web and different ways of integrating the translation models within the retrieval process. Our experiments on standard test collections for CLIR show that the Web-based translation models can surpass commercial MT systems in CLIR tasks. These results open the perspective of constructing a fully automatic query translation device for CLIR at a very low cost.Comment: 37 page

arXiv.org e-Print Archive

CiteSeerX

Leiden University Scholary Publications

Probabilistic Bag-Of-Hyperlinks Model for Entity Linking

Author: Bunescu R. C.
Cheng X.
Cucerzan S.
He Z.
Recht B.
Rizzo G.
Spitkovsky V. I.
Yedidia J.
Publication venue
Publication date: 29/01/2016
Field of study

Many fundamental problems in natural language processing rely on determining what entities appear in a given text. Commonly referenced as entity linking, this step is a fundamental component of many NLP tasks such as text understanding, automatic summarization, semantic search or machine translation. Name ambiguity, word polysemy, context dependencies and a heavy-tailed distribution of entities contribute to the complexity of this problem. We here propose a probabilistic approach that makes use of an effective graphical model to perform collective entity disambiguation. Input mentions (i.e.,~linkable token spans) are disambiguated jointly across an entire document by combining a document-level prior of entity co-occurrences with local information captured from mentions and their surrounding context. The model is based on simple sufficient statistics extracted from data, thus relying on few parameters to be learned. Our method does not require extensive feature engineering, nor an expensive training procedure. We use loopy belief propagation to perform approximate inference. The low complexity of our model makes this step sufficiently fast for real-time usage. We demonstrate the accuracy of our approach on a wide range of benchmark datasets, showing that it matches, and in many cases outperforms, existing state-of-the-art methods

arXiv.org e-Print Archive

Crossref

Investigation on Applying Modular Ontology to Statistical Language Model for Information Retrieval

Author: Chang Jia Kang
Publication venue
Publication date
Field of study

The objective of this research is to provide a novel approach to improving retrieval performance by exploiting Ontology with the statistical language model (SLM). The proposed methods consist of two major processes, namely ontology-based query expansion (OQE) and ontology-based document classification (ODC). Research experiments have required development of an independent search tool that can combine the OQE and ODC in a traditional SLM-based information retrieval (IR) process using a Web document collection. This research considers the ongoing challenges of modular ontology enhanced SLM-based search and addresses three contribution aspects. The first concerns how to apply modular ontology to query expansion, in a bespoke language model search tool (LMST). The second considers how to incorporate OQE with the language model to improve the search performance. The third examines how to manipulate such semantic-based document classification to improve the smoothing accuracy. The role of ontology in the research is to provide formally described domains of interest that serve as context, to enhance system query effectiveness

CLoK

Neural Speed Reading with Structural-Jump-LSTM

Author: Alstrup Stephen
Hansen Casper
Hansen Christian
Lioma Christina
Simonsen Jakob Grue
Publication venue
Publication date: 01/01/2019
Field of study

Recurrent neural networks (RNNs) can model natural language by sequentially 'reading' input tokens and outputting a distributed representation of each token. Due to the sequential nature of RNNs, inference time is linearly dependent on the input length, and all inputs are read regardless of their importance. Efforts to speed up this inference, known as 'neural speed reading', either ignore or skim over part of the input. We present Structural-Jump-LSTM: the first neural speed reading model to both skip and jump text during inference. The model consists of a standard LSTM and two agents: one capable of skipping single words when reading, and one capable of exploiting punctuation structure (sub-sentence separators (,:), sentence end symbols (.!?), or end of text markers) to jump ahead after reading a word. A comprehensive experimental evaluation of our model against all five state-of-the-art neural reading models shows that Structural-Jump-LSTM achieves the best overall floating point operations (FLOP) reduction (hence is faster), while keeping the same accuracy or even improving it compared to a vanilla LSTM that reads the whole text.Comment: 10 page

arXiv.org e-Print Archive

Copenhagen University Research Information System

Automatic tagging and geotagging in video collections and communities

Author: Jones Gareth J.F.
Larson Martha
Serdyukov Pavel
Soleymani Mohammad
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 01/04/2011
Field of study

Automatically generated tags and geotags hold great promise to improve access to video collections and online communi- ties. We overview three tasks offered in the MediaEval 2010 benchmarking initiative, for each, describing its use scenario, definition and the data set released. For each task, a reference algorithm is presented that was used within MediaEval 2010 and comments are included on lessons learned. The Tagging Task, Professional involves automatically matching episodes in a collection of Dutch television with subject labels drawn from the keyword thesaurus used by the archive staff. The Tagging Task, Wild Wild Web involves automatically predicting the tags that are assigned by users to their online videos. Finally, the Placing Task requires automatically assigning geo-coordinates to videos. The specification of each task admits the use of the full range of available information including user-generated metadata, speech recognition transcripts, audio, and visual features

Irish Universities

DCU Online Research Access Service

A Progressive Clustering Algorithm to Group the XML Data by Structural and Semantic Similarity

Author: Nayak Richi
Tran Tien
Publication venue: 'World Scientific Pub Co Pte Lt'
Publication date: 01/01/2007
Field of study

Since the emergence in the popularity of XML for data representation and exchange over the Web, the distribution of XML documents has rapidly increased. It has become a challenge for researchers to turn these documents into a more useful information utility. In this paper, we introduce a novel clustering algorithm PCXSS that keeps the heterogeneous XML documents into various groups according to their similar structural and semantic representations. We develop a global criterion function CPSim that progressively measures the similarity between a XML document and existing clusters, ignoring the need to compute the similarity between two individual documents. The experimental analysis shows the method to be fast and accurate

CiteSeerX

Queensland University of Technology ePrints Archive