57 research outputs found

Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

Author: Badenes-Olmedo Carlos
Blei David M
Boyd-Graber Jordan
Hakkani-Tur D
Hearst Marti
Kenter Tom
Luo Wenhan
Pritchard Jonathan K.
Rao C Radhakrishna
Towne W Ben
Wang Chong
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 15/12/2020

With the ongoing growth in the number of digital articles in a widening set of languages and the expanding use of different languages, we need annotation methods that enable browsing multi-lingual corpora. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations on collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the number of scenarios in which the technique can be applied and makes it difficult to scale up to situations where a huge collection of multi-lingual documents is required during the training phase. This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora, or any other type of translation resource. The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels and describes documents by hierarchies of multi-lingual concepts from independently-trained models. Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpora reveal promising results on classifying and sorting documents by similar content.
Comment: Accepted at the 10th International Conference on Knowledge Capture (K-CAP 2019)

arXiv.org e-Print Archive

Crossref

Cross-language high similarity search using a conceptual thesaurus

Author: A. Chowdhury
A.Z. Broder
D. Pinto
J. Dean
M. Anderka
M. Potthast
M.S. Charikar
P. Mcnamee
P.F. Brown
R. Steinberger
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2012

This work addresses the issue of cross-language high similarity and near-duplicate search, where, for a given document, a highly similar one is to be identified from a large cross-language collection of documents. We propose a concept-based similarity model for the problem which is very light in computation and memory. We evaluate the model on three corpora of different nature and two language pairs, English-German and English-Spanish, using the Eurovoc conceptual thesaurus. Our model is compared with two state-of-the-art models and we find that, although the proposed model is very generic, it produces competitive results and is significantly stable and consistent across the corpora.
This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and it has been partially funded by the European Commission as part of the WIQ-EI IRSES project (grant no. 269180) within the FP 7 Marie Curie People Framework, and by the Text-Enterprise 2.0 research project (TIN2009-13391-C04-03). The research work of the second author is supported by the CONACyT 192021/302009 grant.
Gupta, P.; Barrón Cedeño, LA.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8
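The concept-based similarity idea in this abstract can be illustrated with a minimal sketch. It is not the authors' model: the toy thesaurus `CONCEPTS`, its concept ids, the whitespace tokenization, and the cosine scoring are all assumptions made for illustration. The only point it demonstrates is that documents in different languages become comparable once their terms are mapped to shared, language-independent concept ids.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus: surface term -> language-independent concept id.
# The terms and ids are illustrative only, not real Eurovoc entries.
CONCEPTS = {
    "en": {"fishing": "c1", "quota": "c2", "agriculture": "c3"},
    "es": {"pesca": "c1", "cuota": "c2", "agricultura": "c3"},
}


def concept_vector(text, lang):
    """Count the thesaurus concepts whose term occurs as a token in the text."""
    lookup = CONCEPTS[lang]
    return Counter(lookup[tok] for tok in text.lower().split() if tok in lookup)


def cosine(u, v):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(
        sum(x * x for x in v.values())
    )
    return dot / norm if norm else 0.0


def most_similar(query, query_lang, candidates, cand_lang):
    """Return the candidate whose concept vector is closest to the query's."""
    q = concept_vector(query, query_lang)
    return max(candidates, key=lambda d: cosine(q, concept_vector(d, cand_lang)))
```

Because each document is reduced to a small vector of concept counts, a sketch like this needs no translation step and little memory, which is in the spirit of the "very light in computation and memory" claim, though the abstract does not specify the actual representation or scoring function.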

Crossref

RiuNet

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Computer-aided Document Indexing System

Author: Bojana Dalbelo Bašić
Igor Vukmirović
Jan Šnajder
Mladen Kolar
Publication venue: 'University of Zagreb - University Computing Centre'
Publication date: 01/01/2005

An enormous number of documents are being produced that have to be stored, searched and accessed. Document indexing represents an efficient way to tackle this problem. Contributing to the document indexing process, we developed the Computer-Aided Document Indexing System (CADIS) that applies controlled vocabulary keywords from the EUROVOC thesaurus. The main contribution of this paper is the introduction of the special CADIS internal data structure that copes with the morphological complexity of the Croatian language. The CADIS internal data structure ensures efficient statistical analysis of input documents and quick generation of visual feedback, which helps index documents more quickly, accurately and uniformly than manual indexing.

Crossref

HRČAK - Portal of Croatian Scientific and Professional Journals

Hrčak - Portal of scientific journals of Croatia

Feature-based method for document alignment in comparable news corpora

Author: Ai Ti Aw
Min Zhang
Thuy Vu
Publication date: 01/01/2009

In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It contributes 4.1% and 8% to performance improvement over Pearson’s correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of 23.2% and 15.3% on the two sets of bilingual corpora compared with a prior information retrieval-based method.
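To give a rough feel for a Discrete Fourier Transform-based term frequency distribution feature, the following is a minimal sketch under stated assumptions. The abstract does not describe how the feature is computed, so everything here is hypothetical: each term is represented by a series of per-day frequencies, the function names `dft_magnitudes` and `spectral_similarity` are invented for illustration, and the two magnitude spectra are compared with cosine similarity.

```python
import cmath
import math


def dft_magnitudes(series):
    """Magnitudes of the discrete Fourier transform of a real-valued series."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n) for t, x in enumerate(series)))
        for k in range(n)
    ]


def spectral_similarity(freq_a, freq_b):
    """Cosine similarity between the DFT magnitude spectra of two series.

    freq_a and freq_b are assumed to be per-day frequencies of a term (or of
    its translation) over the same number of days.
    """
    if len(freq_a) != len(freq_b):
        raise ValueError("both series must cover the same number of days")
    a, b = dft_magnitudes(freq_a), dft_magnitudes(freq_b)
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0
```

One property of this particular sketch is that the magnitude spectrum ignores phase, so two series that differ only by a circular shift in time (for example, a story reported one day later in the other language) score as identical. Whether the paper's feature relies on this property is not stated in the abstract.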

CiteSeerX

Crossref

A tool set for the quick and efficient exploration of large document collections

Author: Erjavec Tomaz
Ignat Camelia
Pouliquen Bruno
Steinberger Ralf
Publication date: 01/01/2005

We present a set of multilingual text analysis tools that can help analysts in any field to explore large document collections quickly in order to determine whether the documents contain information of interest, and to find the relevant text passages. The automatic tool, which currently exists as a fully functional prototype, is expected to be particularly useful when users repeatedly have to sift through large collections of documents such as those downloaded automatically from the internet. The proposed system takes a whole document collection as input. It first carries out some automatic analysis tasks (named entity recognition, geo-coding, clustering, term extraction), annotates the texts with the generated meta-information and stores the meta-information in a database. The system then generates a zoomable and hyperlinked geographic map enhanced with information on the entities and terms found. When the system is used on a regular basis, it builds up a historical database that contains information on which names have been mentioned together with which other names or places, and users can query this database to retrieve information extracted in the past.
Comment: 10 pages

arXiv.org e-Print Archive

CiteSeerX