57 research outputs found
Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies
With the ongoing growth in number of digital articles in a wider set of
languages and the expanding use of different languages, we need annotation
methods that enable browsing multi-lingual corpora. Multilingual probabilistic
topic models have recently emerged as a group of semi-supervised machine
learning models that can be used to perform thematic explorations on
collections of texts in multiple languages. However, these approaches require
theme-aligned training data to create a language-independent space. This
constraint limits the number of scenarios in which this technique can be
applied and makes it difficult to scale up to situations where a
huge collection of multi-lingual documents is required during the training
phase. This paper presents an unsupervised document similarity algorithm that
does not require parallel or comparable corpora, or any other type of
translation resource. The algorithm annotates topics automatically created from
documents in a single language with cross-lingual labels and describes
documents by hierarchies of multi-lingual concepts from independently-trained
models. Experiments performed on the English, Spanish and French editions of
JRC-Acquis corpora reveal promising results on classifying and sorting
documents by similar content.
Comment: Accepted at the 10th International Conference on Knowledge Capture
(K-CAP 2019)
Cross-language high similarity search using a conceptual thesaurus
This work addresses the issue of cross-language high similarity and
near-duplicates search, where, for the given document, a highly similar one is to
be identified from a large cross-language collection of documents. We propose
a concept-based similarity model for the problem which is very light in computation
and memory. We evaluate the model on three corpora of different nature
and two language pairs, English-German and English-Spanish, using the Eurovoc
conceptual thesaurus. Our model is compared with two state-of-the-art models
and we find that, although the proposed model is very generic, it produces competitive
results and is significantly stable and consistent across the corpora.
This work was done in the framework of the VLC/CAMPUS Microcluster on Multimodal Interaction in Intelligent Systems and it has been partially
funded by the European Commission as part of the WIQ-EI IRSES project (grant no.
269180) within the FP 7 Marie Curie People Framework, and by the Text-Enterprise
2.0 research project (TIN2009-13391-C04-03). The research work of the second author
is supported by the CONACyT 192021/302009 grant.
Gupta, P.; Barrón Cedeño, LA.; Rosso, P. (2012). Cross-language high similarity search using a conceptual thesaurus. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics. Springer Verlag (Germany). 7488:67-75. https://doi.org/10.1007/978-3-642-33247-0_8
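To make the idea of concept-based cross-language similarity concrete, here is a minimal sketch, not the authors' model. It assumes a hypothetical toy thesaurus (`THESAURUS`) that maps surface terms in each language to shared concept IDs, and it compares documents by the cosine of their concept-count vectors; the abstract does not specify the actual weighting or matching scheme.

```python
import math
from collections import Counter

# Hypothetical toy thesaurus (an assumption for illustration only):
# surface term -> language-independent concept ID.
THESAURUS = {
    "en": {"fishing": "c1", "agriculture": "c2", "trade": "c3"},
    "es": {"pesca": "c1", "agricultura": "c2", "comercio": "c3"},
}

def concept_vector(text, lang):
    """Count thesaurus concepts found in a whitespace-tokenized text."""
    terms = THESAURUS[lang]
    return Counter(terms[tok] for tok in text.lower().split() if tok in terms)

def cosine(a, b):
    """Cosine similarity between two sparse concept-count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0

english = concept_vector("fishing and trade policy", "en")
spanish = concept_vector("pesca y comercio", "es")
unrelated = concept_vector("agricultura", "es")
print(cosine(english, spanish), cosine(english, unrelated))
```

Because both documents are projected onto the same concept IDs, no translation step is needed at comparison time; only the thesaurus lookup is language-specific.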
Computer-aided Document Indexing System
An enormous number of documents is being produced that have to be stored, searched and accessed. Document indexing represents an efficient way to tackle this problem. Contributing to the document indexing process, we developed the Computer-Aided Document Indexing System (CADIS) that applies controlled vocabulary keywords from the EUROVOC thesaurus. The main contribution of this paper is the introduction of the special CADIS internal data structure that copes with the morphological complexity of the Croatian language. The CADIS internal data structure ensures efficient statistical analysis of input documents and quick visual feedback generation that helps index documents more quickly, accurately and uniformly than manual indexing.
Feature-based method for document alignment in comparable news corpora
In this paper, we present a feature-based method to align documents with similar content across two sets of bilingual comparable corpora from daily news texts. We evaluate the contribution of each individual feature and investigate the incorporation of these diverse statistical and heuristic features for the task of bilingual document alignment. Experimental results on the English-Chinese and English-Malay comparable news corpora show that our proposed Discrete Fourier Transform-based term frequency distribution feature is very effective. It contributes 4.1% and 8% to performance improvement over Pearson's correlation method on the two comparable corpora. In addition, when more heuristic and statistical features as well as a bilingual dictionary are utilized, our method shows an absolute performance improvement of 23.2% and 15.3% on the two sets of bilingual corpora compared with a prior information retrieval-based method.
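One way to picture a Discrete Fourier Transform-based term frequency distribution feature is sketched below. This is an illustrative assumption, not the authors' feature definition: each term is represented by its frequency over equal-length time windows, the series is mapped to its DFT magnitude spectrum, and two spectra are compared by cosine similarity. The function names `dft_magnitudes` and `spectrum_similarity` are hypothetical.

```python
import cmath
import math

def dft_magnitudes(series):
    """Magnitude spectrum of a real-valued series via a direct O(n^2) DFT."""
    n = len(series)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                for t, x in enumerate(series)))
        for k in range(n)
    ]

def spectrum_similarity(a, b):
    """Cosine similarity of two magnitude spectra (series of equal length)."""
    fa, fb = dft_magnitudes(a), dft_magnitudes(b)
    dot = sum(x * y for x, y in zip(fa, fb))
    norm = math.sqrt(sum(x * x for x in fa)) * math.sqrt(sum(y * y for y in fb))
    return dot / norm if norm else 0.0

# Made-up daily frequencies of a term and its translation over one week.
burst = [0, 5, 1, 0, 0, 0, 0]
burst_next_day = [0, 0, 5, 1, 0, 0, 0]   # same burst, reported a day later
flat = [1, 1, 1, 1, 1, 1, 1]             # evenly spread term

print(spectrum_similarity(burst, burst_next_day))
print(spectrum_similarity(burst, flat))
```

The magnitude spectrum is unchanged by a circular time shift, so a burst reported one day later in the other language still matches closely, whereas a point-by-point correlation of the raw series would penalize the shift.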
A tool set for the quick and efficient exploration of large document collections
We are presenting a set of multilingual text analysis tools that can help
analysts in any field to explore large document collections quickly in order to
determine whether the documents contain information of interest, and to find
the relevant text passages. The automatic tool, which currently exists as a
fully functional prototype, is expected to be particularly useful when users
repeatedly have to sieve through large collections of documents such as those
downloaded automatically from the internet. The proposed system takes a whole
document collection as input. It first carries out some automatic analysis
tasks (named entity recognition, geo-coding, clustering, term extraction),
annotates the texts with the generated meta-information and stores the
meta-information in a database. The system then generates a zoomable and
hyperlinked geographic map enhanced with information on entities and terms
found. When the system is used on a regular basis, it builds up a historical
database that contains information on which names have been mentioned together
with which other names or places, and users can query this database to retrieve
information extracted in the past.
Comment: 10 pages