5 research outputs found

    Research on Multilingual News Clustering Based on Cross-Language Word Embeddings

    Classifying the same event reported by different countries is of significant importance for public opinion control and intelligence gathering. Because news comes in many forms, relying solely on human translators would be costly and inefficient, while relying solely on machine translation systems would incur considerable overhead in invoking translation interfaces and storing translated texts. To address this issue, we focus on the clustering of cross-lingual news. Specifically, we represent a news article by combining a sentence-vector representation of its headline in a mixed semantic space with the topic probability distribution of its content. When training the cross-lingual model, we employ knowledge distillation to fit two semantic spaces into one mixed semantic space. We abandon traditional static clustering methods such as K-Means and AGNES in favor of the incremental clustering algorithm Single-Pass, which we further modify to better suit cross-lingual news clustering scenarios. Our main contributions are as follows: (1) We adopt the standard English BERT as the teacher model and XLM-RoBERTa as the student model, training through knowledge distillation a cross-lingual model that can represent sentence-level bilingual text in both Chinese and English. (2) We use the LDA topic model to represent news as a combination of cross-lingual headline vectors and content topic probability distributions, introducing concepts such as topic similarity to address the cross-lingual issue in news content representation. (3) We adapt the Single-Pass clustering algorithm to the news setting to make it more applicable. Our optimizations of Single-Pass include adjusting the distance computation between samples and clusters, adding a cluster-merging operation, and incorporating a news time parameter.
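    The core of the Single-Pass algorithm described above can be sketched in a few lines: each incoming item joins the most similar existing cluster or starts a new one. The sketch below uses cosine similarity against cluster centroids and a fixed threshold; the threshold value, the centroid update, and the function name are illustrative assumptions, and the paper's additional modifications (cluster merging, the news time parameter) are omitted for brevity.

    ```python
    import numpy as np

    def single_pass_cluster(vectors, threshold=0.8):
        """Incrementally assign each vector to the most similar existing
        cluster (cosine similarity against the cluster centroid); if no
        cluster clears the threshold, start a new one."""
        centroids, members = [], []
        for i, v in enumerate(vectors):
            v = np.asarray(v, dtype=float)
            v = v / np.linalg.norm(v)
            best, best_sim = None, threshold
            for c, centroid in enumerate(centroids):
                sim = float(v @ centroid) / float(np.linalg.norm(centroid))
                if sim >= best_sim:
                    best, best_sim = c, sim
            if best is None:
                centroids.append(v.copy())   # open a new cluster
                members.append([i])
            else:
                n = len(members[best])       # running-mean centroid update
                centroids[best] = (centroids[best] * n + v) / (n + 1)
                members[best].append(i)
        return members
    ```

    Because items are seen exactly once, the method is incremental and suits streaming news, at the cost of order sensitivity, which is one reason the authors add a merging step.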

    A Grammar-Based Semantic Similarity Algorithm for Natural Language Sentences

    This paper presents a grammar and semantic-corpus based similarity algorithm for natural language sentences. Natural language, as opposed to "artificial language" such as computer programming languages, is the language used by the general public for daily communication. Traditional information retrieval approaches, such as vector models, LSA, HAL, or even ontology-based approaches that extend to concept similarity comparison instead of co-occurrence of terms/words, may fail to find a good match when there is no obvious relation or concept overlap between two natural language sentences. This paper proposes a sentence similarity algorithm that takes advantage of a corpus-based ontology and grammatical rules to overcome these problems. Experiments on two well-known benchmarks demonstrate that the proposed algorithm yields a significant performance improvement on sentences/short texts with arbitrary syntax and structure.
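    To make concrete why pure lexical overlap fails and why a corpus/ontology lookup helps, here is a deliberately tiny sketch: token overlap (Jaccard) extended so that two tokens also match if a synonym table links them. The synonym table stands in for the paper's corpus-based ontology; the function name and scoring are illustrative assumptions, not the authors' algorithm, which additionally uses grammatical rules.

    ```python
    def sentence_similarity(s1, s2, synonyms=None):
        """Toy similarity: Jaccard over tokens, where tokens match if they
        are identical or linked in a small synonym table (a stand-in for a
        corpus-based ontology)."""
        synonyms = synonyms or {}
        t1, t2 = set(s1.lower().split()), set(s2.lower().split())

        def match(w, pool):
            # exact match, or any listed synonym appears in the other sentence
            return w in pool or any(s in pool for s in synonyms.get(w, ()))

        inter = sum(1 for w in t1 if match(w, t2))
        union = len(t1 | t2)
        return inter / union if union else 0.0
    ```

    Without the synonym table, "the cat sat" and "the feline sat" share only two of four distinct tokens; with a `{"cat": ("feline",)}` entry the pair scores markedly higher, which is the effect the concept-level approaches in the abstract aim for.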

    Characterisation and adaptive learning in interactive video retrieval

    The main objective of this thesis is to use latent topic models effectively to address the problem of automatic video retrieval (content-based video retrieval, CBVR), improving both the efficiency and the accuracy of the current state of the art. In general, latent topic models are a family of statistical tools for extracting the generating patterns of a data collection. Despite their potential to uncover the hidden structure of a collection, they have traditionally been unable to provide a competitive advantage in CBVR because of the high computational cost of their algorithms and the complexity of the latent space in the visual domain. Throughout this thesis we focus on designing new models and tools based on topic models to take advantage of the latent space in CBVR. Specifically, we have worked on four different areas within the retrieval process: vocabulary reduction, encoding, modelling and ranking, with our most important contributions relating to modelling and ranking.
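    The ranking side of the pipeline above reduces to comparing probability distributions over topics. A minimal sketch, assuming items are already encoded as topic distributions (e.g. by LDA), ranks documents by Hellinger distance to the query; the distance choice and function name are illustrative assumptions, not the thesis's specific ranking model.

    ```python
    import numpy as np

    def rank_by_topics(query_topics, doc_topics):
        """Rank documents by closeness of their topic distributions to the
        query's, using the Hellinger distance (a natural metric for
        probability vectors). Returns document indices, best match first."""
        q = np.sqrt(np.asarray(query_topics, dtype=float))
        dists = []
        for i, d in enumerate(doc_topics):
            h = float(np.linalg.norm(q - np.sqrt(np.asarray(d, dtype=float)))) / np.sqrt(2)
            dists.append((h, i))
        return [i for _, i in sorted(dists)]
    ```

    Because each comparison is a few vector operations over a low-dimensional topic vector, ranking in topic space can be far cheaper than matching raw visual descriptors, which is the efficiency argument the abstract makes.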

    Exploiting sparsity for machine learning in big data

    The rapid development of modern information technology has significantly facilitated the generation, collection, transmission and storage of all kinds of data. With so-called "big data" generated at an unprecedented rate, we face significant challenges in learning knowledge from it. Traditional machine learning algorithms often suffer from the unmatched volume and complexity of such big data; sparsity, however, has recently been studied as a way to tackle this challenge. With reasonable assumptions and effective use of sparsity, we can learn models that are simpler, more efficient and more robust to noise. The goal of this dissertation is to study and exploit sparsity in designing learning algorithms that effectively and efficiently solve challenging and significant real-world machine learning tasks. I integrate and introduce my work from three perspectives: sample complexity, computational complexity, and noise reduction. Intuitively, these three aspects correspond to models that require less data to learn, are more computationally efficient, and still perform well when the data is noisy. Specifically, the thesis is organized around these three aspects as follows. First, I focus on the sample complexity of machine learning algorithms for an important task, compressed sensing. I propose a novel algorithm based on a nonconvex sparsity-inducing penalty, the first work to utilize such a penalty, and prove through extensive theoretical analysis and numerical experiments that it significantly improves on the best previously known sample complexity. Second, from the perspective of computational complexity, I study expectation-maximization (EM) algorithms in high-dimensional scenarios. In contrast to the conventional regime, the maximization step (M-step) in the high-dimensional scenario can be very computationally expensive or even not well defined.
    To address this challenge, I propose an efficient algorithm based on a novel semi-stochastic gradient descent with variance reduction, which naturally incorporates sparsity in the model parameters, greatly reduces the computational cost of each iteration and simultaneously enjoys faster convergence rates. We believe the proposed semi-stochastic variance-reduced gradient is of general interest for nonconvex optimization with bivariate structure. Third, I look into the noise reduction problem and target an important text-mining task, event detection. To overcome the noise in text data that hampers the detection of real events, I design an efficient algorithm based on a sparsity-inducing fused lasso framework. Experimental results on various datasets show that the algorithm effectively smooths out noise and captures real events, consistently outperforming several state-of-the-art methods in noisy settings. To sum up, this thesis focuses on critical issues of machine learning in big data from the perspective of sparsity in the data and the model. The proposed methods clearly show that utilizing sparsity is of great importance for a range of significant machine learning tasks.
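    The sparsity-inducing penalties running through this abstract all build on one primitive: the proximal (soft-thresholding) operator of the ℓ1 norm. As a hedged illustration, the sketch below shows the standard convex baseline, ISTA, for the sparse-recovery problem min 0.5·||Ax − y||² + λ·||x||₁; the dissertation's own contribution uses a nonconvex penalty and a different algorithm, so this is only the textbook starting point, with the function names and step size chosen here for illustration.

    ```python
    import numpy as np

    def soft_threshold(x, lam):
        """Prox of the l1 norm: shrink every entry toward zero by lam."""
        return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

    def ista(A, y, lam=0.1, iters=200):
        """Iterative soft-thresholding for min_x 0.5*||Ax - y||^2 + lam*||x||_1."""
        A = np.asarray(A, dtype=float)
        y = np.asarray(y, dtype=float)
        L = np.linalg.norm(A, 2) ** 2       # Lipschitz constant of the gradient
        x = np.zeros(A.shape[1])
        for _ in range(iters):
            # gradient step on the smooth term, then the l1 prox
            x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
        return x
    ```

    Each iteration costs only two matrix-vector products, and the thresholding zeroes out small coefficients, producing exactly the simpler, noise-robust sparse models the abstract argues for; the fused lasso used for event detection replaces the ℓ1 term with a penalty on differences of adjacent coefficients.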