Fisher Kernels and Probabilistic Latent Semantic Models
Tasks that rely on the semantic content of documents, notably Information Retrieval and Document Classification, can benefit from a good account of document context, i.e. the semantic association between documents. To this effect, the scheme of latent semantics blends individual words appearing throughout a document collection into latent topics, providing a way to handle documents that is less constrained than the conventional approach based on the mere appearance of particular words. Probabilistic latent semantic models take this further by providing assumptions on how the documents observed in the collection would have been generated. This allows the derivation of inference algorithms that fit the model parameters to the observed document collection; once set, these parameters can be used to compute similarities between documents. Fisher kernels, similarity functions rooted in information geometry, are good candidates for measuring the similarity between documents in the framework of probabilistic latent semantic models. In this context, we study the use of Fisher kernels for the Probabilistic Latent Semantic Indexing (PLSI) model. By thoroughly analysing the generative process of PLSI, we derive the proper Fisher kernel for PLSI and expose the hypotheses that relate earlier work to this kernel. In particular, we confirm that the Fisher information matrix (FIM) should not be approximated by the identity in the case of PLSI. We also study the impact on the Fisher kernel's performance of the contribution of the latent topics and of the distribution of words among the topics; finally, we provide empirical evidence and theoretical arguments showing that the Fisher kernel originally published by Hofmann, corrected to account for the FIM, is the best of the PLSI Fisher kernels. It can compete with the strong BM25 baseline, and even significantly outperforms it when documents sharing few words must be matched.
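The Fisher kernel construction underlying this work can be stated compactly. This is the standard definition (following Jaakkola and Haussler), not the PLSI-specific derivation of the thesis: with P(d|θ) the generative model, U_d the Fisher score and G the Fisher information matrix,

```latex
% Fisher score of a document d under model parameters \theta
U_d = \nabla_\theta \log P(d \mid \theta)
% Fisher information matrix (expectation under the model)
G = \mathbb{E}_\theta\!\left[ U_d U_d^{\top} \right]
% Fisher kernel between documents d_1 and d_2;
% the abstract stresses that G must not be replaced by the identity for PLSI
K(d_1, d_2) = U_{d_1}^{\top}\, G^{-1}\, U_{d_2}
```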
We further study PLSI document similarities by applying the language model approach. This approach departs from the usual IR paradigm that considers documents and queries to be of a similar nature; instead, it treats documents as representative of language models and uses probabilistic tools to determine which of these models would have generated the query with the highest probability. Using this scheme in the framework of PLSI bypasses the issue of query representation, which constitutes one of the specific challenges of PLSI. We find the language model approach to perform as well as the best of the Fisher kernels when enough latent categories are provided. Finally, we propose a new probabilistic latent semantic model consisting of a mixture of Smoothed Dirichlet distributions which, by better modeling word burstiness, provides a more realistic account of empirical observations on real document collections than the commonly used Multinomials.
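The language model approach described above amounts to query-likelihood scoring. A minimal sketch with Jelinek-Mercer smoothing (a standard smoothing choice, not necessarily the one used in the thesis):

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Score log P(q | theta_d) under a smoothed document language model:
    P(w|d) = (1 - lam) * tf(w, d)/|d| + lam * cf(w)/|C|  (Jelinek-Mercer)."""
    doc_tf = Counter(doc)          # term frequencies in the document
    coll_tf = Counter(collection)  # term frequencies in the whole collection
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_tf[w] / dlen + lam * coll_tf[w] / clen
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

Documents are token lists and `collection` is their concatenation; the document whose model assigns the query the highest log-likelihood is ranked first.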
Metric for selecting the number of topics in the LDA Model
The latest technological trends are driving a vast and growing amount of textual data. Topic modeling is a useful tool for extracting information from large corpora of text. A topic model starts from a corpus of documents, discovers the topics that permeate the corpus, and assigns documents to those topics. The Latent Dirichlet Allocation (LDA) model is the main, or most popular, of the probabilistic topic models. The LDA model is conditioned by three parameters: two Dirichlet hyperparameters (α and β) and the number of topics (K). Determining the parameter K is extremely important and not extensively explored in the literature, mainly due to the intensive computation and long processing time involved. Most topic modeling methods implicitly assume that the number of topics is known in advance, treating it as an exogenous parameter. This is a burden for the researcher, as it leaves the technique prone to subjectivity. The quality of insights offered by LDA is quite sensitive to the value of K, and an excess of subjectivity in its choice might undermine the confidence managers place in the technique's results, and thus its usage by firms. This dissertation's main objective is to develop a metric that identifies the ideal value of the parameter K of the LDA model, allowing an adequate representation of the corpus within a tolerable processing time. We apply the proposed metric alongside existing metrics to two datasets. Experiments show that the proposed method selects a number of topics similar to that of other metrics, but with better performance in terms of processing time. Although each metric has its own method for determining the number of topics, some results are similar for the same database, as evidenced in the study. Our metric is superior when considering processing time, and experiments show the method is effective.
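Metrics for choosing K commonly score a model fitted at each candidate K by the coherence of its top topic words. A minimal sketch of the standard UMass coherence (illustrative only; this is not the dissertation's proposed metric):

```python
import math

def umass_coherence(topic_words, docs):
    """UMass coherence of one topic's top words: sum over ordered word pairs
    of log((D(wi, wj) + 1) / D(wj)), where D counts documents containing
    the given words. Assumes every word in topic_words occurs in some doc."""
    def d(*ws):
        return sum(1 for doc in docs if all(w in doc for w in ws))
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((d(wi, wj) + 1) / d(wj))
    return score
```

In a K-selection loop, one would fit LDA for each candidate K, average this score over the K topics, and pick the K with the highest average; words that co-occur often score closer to zero, while unrelated words pull the score down.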
Neural Generative Models and Representation Learning for Information Retrieval
Information Retrieval (IR) concerns the structure, analysis, organization, storage, and retrieval of information. Among the retrieval models proposed in past decades, generative retrieval models, especially those under the statistical probabilistic framework, are among the most popular techniques and have been widely applied to Information Retrieval problems. While they are known for their well-grounded theory and good empirical performance in text retrieval, their applications in IR are often limited by their complexity and low extensibility in modeling high-dimensional information. Recently, advances in deep learning techniques have provided new opportunities for representation learning and generative models for information retrieval. In contrast to statistical models, neural models have much more flexibility because they model information and data correlation in latent spaces without explicitly relying on any prior knowledge. Previous studies in pattern recognition and natural language processing have shown that semantically meaningful representations of text, images, and many other types of information can be acquired with neural models through supervised or unsupervised training. Nonetheless, the effectiveness of neural models for information retrieval is mostly unexplored. In this thesis, we study how to develop new generative models and representation learning frameworks with neural models for information retrieval.
Specifically, our contributions comprise three main components: (1) Theoretical analysis: we present the first theoretical analysis and adaptation of existing neural embedding models for ad-hoc retrieval tasks; (2) Design practice: based on our experience and knowledge, we show how to design an embedding-based neural generative model for practical information retrieval tasks such as personalized product search; and (3) Generic framework: we further generalize our proposed neural generative framework to complicated heterogeneous information retrieval scenarios involving text, images, knowledge entities, and their relationships. Empirical results show that the proposed neural generative framework can effectively learn information representations and construct retrieval models that outperform state-of-the-art systems in a variety of IR tasks.
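The core idea behind embedding-based ad-hoc retrieval can be illustrated with a hedged sketch (the thesis's actual models are more elaborate): represent query and document as averaged word vectors and rank by cosine similarity in the latent space.

```python
import math

def avg_vector(tokens, embeddings):
    """Average the word vectors of tokens found in the embedding table;
    returns None if no token is covered."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Because matching happens in the embedding space rather than over exact terms, a document can score well on a query with which it shares no words, which is precisely the flexibility the paragraph above attributes to neural models.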
Recovering from a Decade: A Systematic Mapping of Information Retrieval Approaches to Software Traceability
Engineers in large-scale software development have to manage large amounts of information spread across many artifacts. Several researchers have proposed expressing the retrieval of trace links among artifacts, i.e. trace recovery, as an Information Retrieval (IR) problem. The objective of this study is to produce a map of work on IR-based trace recovery, with a particular focus on previous evaluations and the strength of evidence. We conducted a systematic mapping of IR-based trace recovery. Of the 79 publications classified, a majority applied algebraic IR models. While a set of studies on students indicates that IR-based trace recovery tools support certain work tasks, most previous studies do not go beyond reporting precision and recall of candidate trace links from evaluations using datasets containing fewer than 500 artifacts. Our review identified a need for industrial case studies. Furthermore, we conclude that the overall quality of reporting should be improved with regard to context and tool details, the measures reported, and the use of IR terminology. Finally, based on our empirical findings, we present suggestions on how to advance research on IR-based trace recovery.
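The algebraic IR models that dominate the surveyed work reduce to vector-space matching: a minimal sketch of TF-IDF cosine ranking of candidate trace links between tokenized artifacts (illustrative only, not any surveyed tool's implementation):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for tokenized artifacts
    (raw term frequency, idf = log(N / df) + 1)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking every target artifact against a source artifact by this score yields the candidate trace links whose precision and recall the surveyed evaluations report.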
Automatic aspect extraction in information retrieval diversity
In this master thesis we describe a new automatic aspect extraction algorithm that incorporates relevance information into the dynamics of Probabilistic Latent Semantic Analysis. A utility-biased likelihood statistical framework is described to formalize the incorporation of prior relevance information intrinsically into the dynamics of the algorithm. Moreover, a general abstract algorithm is presented to incorporate arbitrary new feature variables into the analysis.
A tempering procedure is derived for this general algorithm as an entropic regularization of the utility-biased likelihood functional, and a geometric interpretation of the algorithm is described, showing the intrinsic changes produced in the information space of the problem when different sources of prior utility estimations are provided over the same data.
The general algorithm is applied to several information retrieval, recommendation and personalization tasks. Moreover, a set of post-processing aspect filters is presented. Some characteristics of the aspect distributions, such as sparsity or low entropy, are identified as enhancing the overall diversity attained by the diversification algorithm. The proposed filters ensure that the final aspect space has those properties, thus leading to better diversity levels.
An experimental setup over TREC Web Track 09-12 data shows that the algorithm surpasses classic pLSA as an aspect extraction tool for search diversification. Additional theoretical applications of the general procedure to information retrieval, recommendation and personalization tasks are given, leading to new relevance-aware models incorporating several variables into the latent semantic analysis.
Finally, the problem of optimizing the aspect space size for diversification is addressed. Analytical formulas for the dependency of diversity metrics on the choice of an automatically extracted aspect space are given under a simplified generative model of the relation between system aspects and evaluation true aspects. An experimental analysis of this dependence is performed over TREC Web Track data using pLSA as the aspect extraction algorithm.
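Tempering in pLSA is usually realized, as in Hofmann's tempered EM, by raising the E-step posterior to an inverse-temperature exponent β. One common form is shown below as a hedged sketch; the thesis's utility-biased functional modifies this standard construction rather than reproducing it:

```latex
% Tempered E-step of pLSA: \beta \le 1 acts as an entropic regularizer,
% flattening the posterior over latent aspects z
P_\beta(z \mid d, w) =
  \frac{\bigl[ P(z \mid d)\, P(w \mid z) \bigr]^{\beta}}
       {\sum_{z'} \bigl[ P(z' \mid d)\, P(w \mid z') \bigr]^{\beta}}
```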
Clustering and Topic Modelling: A New Approach for Analysis of National Cyber security Strategies
The consequences of cybersecurity attacks can be severe for nation states and their people. Recently many nations have revisited their national cybersecurity strategies (NCSs) to ensure that their cybersecurity capabilities are sufficient to protect their citizens and cyberspace. This study is an initial attempt to compare NCSs by using clustering and topic modelling methods to investigate the similarities and differences between them. We also aimed to identify the underlying topics that appear in NCSs. We collected and examined 60 NCSs developed during 2003-2016. Drawing on institutional theories, we found that membership in international institutions could be a determinant factor for harmonization and integration between NCSs. By applying a hierarchical clustering method, we noticed stronger similarities between NCSs developed by EU or NATO members. We also found that public-private partnerships, protection of critical infrastructure, and defending citizens and public IT systems are among the topics that have received considerable attention in the majority of NCSs. We further argue that the topic modeling method LDA can be used as an automated technique for the analysis and understanding of textual documents by policy makers and governments during the development and review of national strategies and policies.
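The hierarchical clustering step can be sketched minimally as average-linkage agglomerative merging over a pairwise distance matrix (illustrative only; the study's actual feature representation of the NCS texts is not reproduced here):

```python
def agglomerative(dist, k):
    """Average-linkage agglomerative clustering on a symmetric distance
    matrix `dist`, merging the closest pair of clusters until only
    `k` clusters remain. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Cutting the merge process at different cluster counts yields the dendrogram levels on which groupings such as the EU/NATO similarity observed in the study can be read off.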
Ontology-based semantic reminiscence support system
This thesis addresses the needs of people who find reminiscence helpful by focusing on the development of a computerised reminiscence support system, which facilitates access to and retrieval of stored memories used as the basis for positive interactions between elderly and young, and also between people with cognitive impairment and members of their family or caregivers.
To model users' background knowledge, this research defines a lightweight user-oriented ontology and its building principles. The ontology is flexible, with a simplified knowledge structure populated with semantically homogeneous ontology concepts. The user-oriented ontology differs from generic ontology models in that it does not rely on knowledge experts; its structure enables users to browse, edit and create new entries on their own.
To address the semantic gap problem in personal information retrieval, this thesis proposes a semantic ontology-based feature matching method. It involves natural language processing and semantic feature extraction/selection using the user-oriented ontology, and comprises four stages: (i) user-oriented ontology building; (ii) semantic feature extraction for building vectors representing information objects; (iii) semantic feature selection using the user-oriented ontology; and (iv) measuring the similarity between the information objects.
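Stage (iv) can be illustrated with a hedged sketch: cosine similarity over sparse term vectors in which terms matching user-oriented ontology concepts are up-weighted. The function name and boosting scheme are hypothetical, not the thesis's exact method:

```python
import math

def ontology_weighted_similarity(vec_a, vec_b, ontology_concepts, boost=2.0):
    """Cosine similarity of two sparse term vectors (dicts term -> weight),
    with terms that match ontology concepts up-weighted by `boost`."""
    def weighted(v):
        return {t: w * (boost if t in ontology_concepts else 1.0)
                for t, w in v.items()}
    a, b = weighted(vec_a), weighted(vec_b)
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Boosting ontology terms makes two memories that share a concept from the user's background knowledge rank as more similar than memories that merely share an incidental word.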
To facilitate personal information management and dynamic generation of content, the system uses ontologies and advanced algorithms for semantic feature matching. An algorithm named Onto-SVD is also proposed, which uses the user-oriented ontology to automatically detect the semantic relations within the stored memories. It combines semantic feature selection with matrix factorisation and k-means clustering to achieve topic identification based on semantic relations.
The thesis further proposes an ontology-based personalised retrieval mechanism for the system. It aims to assist people to recall, browse and re-discover events from their lives by considering their profiles and background knowledge, and providing them with customised retrieval results. Furthermore, a user profile space model is defined and its construction method described. The model combines multiple user-oriented ontologies and has a self-organised structure based on relevance feedback. The identification of a person's search intentions in this mechanism operates at the conceptual level and involves the person's background knowledge. Based on the identified search intentions, knowledge spanning trees are automatically generated from the ontologies or user profile spaces. The knowledge spanning trees are used to expand and reformulate queries, enhancing the queries' semantic representations by applying domain knowledge.
The crowdsourcing-based system evaluation measures users' satisfaction with the content generated by Sem-LSB. It compares the advantages and disadvantages of three types of content presentation (i.e. unstructured, LSB-based and semantic/knowledge-based). Based on users' feedback, the semantic/knowledge-based presentation is considered to have higher overall satisfaction and stronger reminiscing support effects than the others.
Multi Domain Semantic Information Retrieval Based on Topic Model
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as huge amounts of information have accumulated on the Web. This information explosion increases the need for new tools that retrieve meaningful knowledge from various complex information sources. Techniques to search and extract important information from numerous database sources have thus become a key challenge for current IR systems.
Topic modeling is one of the most recent techniques for discovering hidden thematic structures in large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively in many applications. Latent Dirichlet Allocation (LDA) is the best-known topic model; it generates topics from large corpora of resources such as text, images, and audio. It has been widely used in information retrieval and data mining, providing an efficient way of identifying latent topics in document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA does not consider the meaning of words, but rather infers hidden topics through a statistical approach. As a result, LDA can suffer either a reduction in the quality of topic words or an increase in loose relations between topics.
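For reference, the generative process that LDA assumes (standard formulation, with α and β the Dirichlet hyperparameters and K the number of topics):

```latex
% topic-word distributions, one per topic
\varphi_k \sim \mathrm{Dirichlet}(\beta), \quad k = 1, \dots, K
% per-document topic proportions
\theta_d \sim \mathrm{Dirichlet}(\alpha)
% for each word position n in document d: draw a topic, then a word
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\varphi_{z_{d,n}})
```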
To solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested to address the difficulties associated with LDA. The main strength of our proposed model is that it narrows semantic concepts from broad domain knowledge down to a specific domain, which solves the unknown-domain problem. Our proposed model is extensively tested on various applications (query expansion, classification, and summarization) to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we take stock of the impact and legal consequences of these technical advances and point out future directions of research.