7 research outputs found

    Scalable Cross-lingual Document Similarity through Language-specific Concept Hierarchies

    With the ongoing growth in the number of digital articles published in an ever wider set of languages, we need annotation methods that enable browsing multilingual corpora. Multilingual probabilistic topic models have recently emerged as a group of semi-supervised machine learning models that can be used to perform thematic explorations of collections of texts in multiple languages. However, these approaches require theme-aligned training data to create a language-independent space. This constraint limits the number of scenarios in which the technique can be applied and makes it difficult to scale up to settings where a huge collection of multilingual documents is required during training. This paper presents an unsupervised document similarity algorithm that does not require parallel or comparable corpora, or any other type of translation resource. The algorithm annotates topics automatically created from documents in a single language with cross-lingual labels and describes documents by hierarchies of multilingual concepts drawn from independently trained models. Experiments performed on the English, Spanish and French editions of the JRC-Acquis corpus reveal promising results on classifying and sorting documents by similar content.
    Comment: Accepted at the 10th International Conference on Knowledge Capture (K-CAP 2019).
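
    As a rough illustration of the comparison step, the sketch below (not taken from the paper) scores two documents by the overlap of their cross-lingual concept annotations; the concept labels and weights are hypothetical stand-ins for annotations produced by independently trained monolingual topic models.

# Minimal sketch, not the paper's implementation: comparing documents that have
# been annotated with weighted, language-independent concept labels.

def hierarchy_similarity(concepts_a, concepts_b):
    """Weighted Jaccard similarity between two concept annotations,
    each given as {concept_label: weight}."""
    labels = set(concepts_a) | set(concepts_b)
    num = sum(min(concepts_a.get(c, 0.0), concepts_b.get(c, 0.0)) for c in labels)
    den = sum(max(concepts_a.get(c, 0.0), concepts_b.get(c, 0.0)) for c in labels)
    return num / den if den else 0.0

# Hypothetical annotations for an English and a Spanish document.
doc_en = {"agriculture": 0.5, "subsidies": 0.3, "trade": 0.2}
doc_es = {"agriculture": 0.4, "trade": 0.4, "environment": 0.2}

# Documents can then be ranked by this score to find similar content.
print(hierarchy_similarity(doc_en, doc_es))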

    Caracterización del sector de Tecnologías del Lenguaje mediante modelado de tópicos y análisis de grafos: Visión general de la participación española [Characterization of the Language Technologies sector through topic modeling and graph analysis: An overview of Spanish participation]

    This paper aims at landscaping the Human Language Technologies (HLT) sector by applying topic modeling and graph analysis to the scientific literature in the ACL Anthology, with special emphasis on Spanish participation. The analysis takes into account both structured and unstructured data to offer an overview of the HLT landscape in Spain, identifying its main underlying themes and their evolution in recent years compared to the international HLT community. The results are presented through an interactive visualization that allows exploration of the HLT landscape over the period 1983-2018.
    This work has been carried out in the framework of the Spanish State Plan for Natural Language Technologies. The work of J. Arenas-García has also been partly funded by MINECO projects TEC2014-52289-R and TEC2017-83838-R.
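
    For readers unfamiliar with this kind of pipeline, the following sketch combines a topic model with a document graph in the general spirit described above; it assumes gensim and networkx and uses made-up toy abstracts, not the ACL Anthology data or the paper's actual pipeline.

# Illustrative sketch only: topic modeling plus graph analysis over a tiny corpus.
import networkx as nx
from gensim import corpora, models

# Hypothetical tokenised abstracts standing in for papers in the corpus.
texts = [["parsing", "dependency", "treebank"],
         ["translation", "neural", "attention"],
         ["parsing", "neural", "transition"]]

dictionary = corpora.Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = models.LdaModel(bows, num_topics=2, id2word=dictionary, random_state=0)

def topic_vector(bow, k=2):
    """Dense topic-proportion vector for one document."""
    vec = [0.0] * k
    for topic, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic] = prob
    return vec

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

vectors = [topic_vector(b) for b in bows]

# Document graph with cosine-similarity edge weights; graph measures such as
# centrality or community structure can then characterise the landscape.
G = nx.Graph()
for i in range(len(vectors)):
    for j in range(i + 1, len(vectors)):
        G.add_edge(i, j, weight=cosine(vectors[i], vectors[j]))

print(nx.degree_centrality(G))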

    Validation of scientific topic models using graph analysis and corpus metadata

    Probabilistic topic modeling algorithms like Latent Dirichlet Allocation (LDA) have become powerful tools for the analysis of large collections of documents (such as papers, projects, or funding applications) in science, technology and innovation (STI) policy design and monitoring. However, selecting an appropriate and stable topic model for a specific application (by adjusting the hyperparameters of the algorithm) is not a trivial problem. Common validation metrics like coherence or perplexity, which focus on the quality of the topics, are not a good fit in applications where the quality of the document similarity relations inferred from the topic model is especially relevant. Relying on graph analysis techniques, the aim of our work is to establish a new methodology for hyperparameter selection that is specifically oriented to optimizing the similarity metrics emanating from the topic model. To do this, we propose two graph metrics: the first measures the variability of the similarity graphs that result from different runs of the algorithm for a fixed value of the hyperparameters, while the second measures the alignment between the graph derived from the LDA model and another obtained using metadata available for the corresponding corpus. Through experiments on various corpora related to STI, it is shown that the proposed metrics provide relevant indicators for selecting the number of topics and building persistent topic models that are consistent with the metadata. Their use, which can be extended to other topic models beyond LDA, could facilitate the systematic adoption of this kind of technique in STI policy analysis and design.
    Open Access funding provided thanks to the CRUE-CSIC agreement with Springer Nature. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 101004870 (IntelComp project), and has also been partially supported by FEDER/Spanish Ministry of Science, Innovation and Universities, State Agency of Research, project TEC2017-83838-R.
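
    The two kinds of graph metric described above can be pictured with a small sketch; the edge sets below are hypothetical, and in practice the similarity graphs would come from thresholded document-to-document similarities of the trained topic models and from the corpus metadata.

# Rough sketch of run-to-run stability and metadata alignment for similarity graphs,
# each graph represented as a set of (i, j) document edges kept after thresholding.

def edge_overlap(edges_a, edges_b):
    """Jaccard overlap between two edge sets; used both for run-to-run
    variability and for alignment with a metadata-based graph."""
    union = edges_a | edges_b
    return len(edges_a & edges_b) / len(union) if union else 1.0

# Hypothetical similarity graphs from two runs with the same hyperparameters...
run_1 = {(0, 1), (1, 2), (2, 3)}
run_2 = {(0, 1), (1, 2), (1, 3)}
# ...and a graph built from corpus metadata (e.g., shared categories or citations).
metadata_graph = {(0, 1), (2, 3)}

stability = edge_overlap(run_1, run_2)           # high overlap -> low variability
alignment = edge_overlap(run_1, metadata_graph)  # agreement with the metadata graph
print(stability, alignment)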

    Minnesota and the competition for immigrants


    Distributing Text Mining tasks with librAIry

    We present librAIry, a novel architecture to store, process and analyze large collections of textual resources, integrating existing algorithms and tools into a common, distributed, high-performance workflow. Available text mining techniques can be incorporated into the framework as independent plug-and-play modules working in a collaborative manner. In the absence of a pre-defined flow, librAIry leverages the aggregation of operations executed by different components in response to an emergent chain of events. Extensive use of Linked Data (LD) and Representational State Transfer (REST) principles is made to provide individually addressable resources from textual documents. We have described the architecture design and its implementation and tested its effectiveness in real-world scenarios such as collections of research papers, patents or ICT aids, with the objective of providing solutions for decision makers and experts in those domains. Major advantages of the framework and lessons learned from these experiments are reported.
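
    The event-driven, plug-and-play idea can be sketched with a toy publish/subscribe bus; this is a conceptual illustration only, not the librAIry API, and the event names and modules are made up.

# Conceptual sketch: independent modules cooperating through a chain of events
# rather than a pre-defined workflow.
from collections import defaultdict

class EventBus:
    """Very small publish/subscribe bus standing in for a distributed broker."""
    def __init__(self):
        self.handlers = defaultdict(list)

    def subscribe(self, event, handler):
        self.handlers[event].append(handler)

    def publish(self, event, payload):
        for handler in self.handlers[event]:
            handler(payload)

bus = EventBus()

# Each text-mining module reacts to an event and emits a new one.
def tokenizer(doc):
    doc["tokens"] = doc["text"].lower().split()
    bus.publish("document.tokenized", doc)

def topic_annotator(doc):
    # Placeholder annotation; a real module would call a trained model.
    doc["topics"] = ["topic-0"] if "mining" in doc["tokens"] else []
    print("annotated:", doc["topics"])

bus.subscribe("document.created", tokenizer)
bus.subscribe("document.tokenized", topic_annotator)

bus.publish("document.created", {"text": "Distributed text mining with plug-and-play modules"})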

    Annual Report of the Board of Regents of the Smithsonian Institution showing the operations, expenditures, and condition of the Institution to July, 1896, Pt 1.

    Annual Report of the Smithsonian Institution. 1 July. HD 352 (pts. 1 and 2), 54-2, v72-73, 1909p. [3548-3549] Research related to the American Indian