Fisher Kernels and Probabilistic Latent Semantic Models
Tasks that rely on the semantic content of documents, notably Information Retrieval and Document Classification, can benefit from a good account of document context, i.e. the semantic association between documents. To this effect, the scheme of latent semantics blends individual words appearing throughout a document collection into latent topics, providing a way to handle documents that is less constrained than the conventional approach based on the mere appearance of particular words. Probabilistic latent semantic models take this further by providing assumptions on how the documents observed in the collection would have been generated. This allows the derivation of inference algorithms that fit the model parameters to the observed document collection; once set, these parameters can be used to compute similarities between documents. Fisher kernels, similarity functions rooted in information geometry, are good candidates for measuring the similarity between documents in the framework of probabilistic latent semantic models. In this context, we study the use of Fisher kernels for the Probabilistic Latent Semantic Indexing (PLSI) model. By thoroughly analysing the generative process of PLSI, we derive the proper Fisher kernel for PLSI and expose the hypotheses that relate earlier work to this kernel. In particular, we confirm that the Fisher information matrix (FIM) should not be approximated by the identity in the case of PLSI. We also study the impact on the Fisher kernel's performance of the contribution of the latent topics and of the distribution of words among the topics; finally, we provide empirical evidence and theoretical arguments showing that the Fisher kernel originally published by Hofmann, corrected to account for the FIM, is the best of the PLSI Fisher kernels. It can compete with the strong BM25 baseline, and even significantly outperforms it when documents sharing few words must be matched.
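The Fisher kernel construction underlying this work can be stated compactly. This is the standard definition (following Jaakkola and Haussler), not the PLSI-specific derivation of the thesis: with P(d|θ) the generative model, U_d the Fisher score and G the Fisher information matrix,

```latex
% Fisher score of a document d under model parameters \theta
U_d = \nabla_\theta \log P(d \mid \theta)
% Fisher information matrix (expectation under the model)
G = \mathbb{E}_\theta\!\left[ U_d U_d^{\top} \right]
% Fisher kernel between documents d_1 and d_2;
% the abstract stresses that G must not be replaced by the identity for PLSI
K(d_1, d_2) = U_{d_1}^{\top}\, G^{-1}\, U_{d_2}
```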
We further study PLSI document similarities by applying the language model approach. This approach departs from the usual IR paradigm that considers documents and queries to be of a similar nature; instead, it treats documents as representative of language models and uses probabilistic tools to determine which of these models would have generated the query with the highest probability. Using this scheme in the framework of PLSI bypasses the issue of query representation, which constitutes one of the specific challenges of PLSI. We find the language model approach to perform as well as the best of the Fisher kernels when enough latent categories are provided. Finally, we propose a new probabilistic latent semantic model consisting of a mixture of Smoothed Dirichlet distributions which, by better modeling word burstiness, provides a more realistic account of empirical observations on real document collections than the commonly used Multinomials.
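The language model approach described above amounts to query-likelihood scoring. A minimal sketch with Jelinek-Mercer smoothing (a standard smoothing choice, not necessarily the one used in the thesis):

```python
import math
from collections import Counter

def query_log_likelihood(query, doc, collection, lam=0.5):
    """Score log P(q | theta_d) under a smoothed document language model:
    P(w|d) = (1 - lam) * tf(w, d)/|d| + lam * cf(w)/|C|  (Jelinek-Mercer)."""
    doc_tf = Counter(doc)          # term frequencies in the document
    coll_tf = Counter(collection)  # term frequencies in the whole collection
    dlen, clen = len(doc), len(collection)
    score = 0.0
    for w in query:
        p = (1 - lam) * doc_tf[w] / dlen + lam * coll_tf[w] / clen
        score += math.log(p) if p > 0 else float("-inf")
    return score
```

Documents are token lists and `collection` is their concatenation; the document whose model assigns the query the highest log-likelihood is ranked first.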
Metric for selecting the number of topics in the LDA Model
The latest technological trends are driving a vast and growing amount of textual data. Topic modeling is a useful tool for extracting information from large corpora of text. A topic model starts from a corpus of documents, discovers the topics that permeate the corpus, and assigns documents to those topics. The Latent Dirichlet Allocation (LDA) model is the main, or most popular, of the probabilistic topic models. The LDA model is conditioned by three parameters: two Dirichlet hyperparameters (α and β) and the number of topics (K). Determining the parameter K is extremely important and not extensively explored in the literature, mainly due to the intensive computation and long processing time involved. Most topic modeling methods implicitly assume that the number of topics is known in advance, treating it as an exogenous parameter. This is a burden for the researcher, as it leaves the technique prone to subjectivity. The quality of insights offered by LDA is quite sensitive to the value of K, and an excess of subjectivity in its choice might undermine the confidence managers place in the technique's results, and thus its usage by firms. This dissertation's main objective is to develop a metric that identifies the ideal value of the parameter K of the LDA model, allowing an adequate representation of the corpus within a tolerable processing time. We apply the proposed metric alongside existing metrics to two datasets. Experiments show that the proposed method selects a number of topics similar to that of other metrics, but with better performance in terms of processing time. Although each metric has its own method for determining the number of topics, some results are similar for the same database, as evidenced in the study. Our metric is superior when considering processing time, and experiments show the method is effective.
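Metrics for choosing K commonly score a model fitted at each candidate K by the coherence of its top topic words. A minimal sketch of the standard UMass coherence (illustrative only; this is not the dissertation's proposed metric):

```python
import math

def umass_coherence(topic_words, docs):
    """UMass coherence of one topic's top words: sum over ordered word pairs
    of log((D(wi, wj) + 1) / D(wj)), where D counts documents containing
    the given words. Assumes every word in topic_words occurs in some doc."""
    def d(*ws):
        return sum(1 for doc in docs if all(w in doc for w in ws))
    score = 0.0
    for i in range(1, len(topic_words)):
        for j in range(i):
            wi, wj = topic_words[i], topic_words[j]
            score += math.log((d(wi, wj) + 1) / d(wj))
    return score
```

In a K-selection loop, one would fit LDA for each candidate K, average this score over the K topics, and pick the K with the highest average; words that co-occur often score closer to zero, while unrelated words pull the score down.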
Neural Generative Models and Representation Learning for Information Retrieval
Information Retrieval (IR) concerns the structure, analysis, organization, storage, and retrieval of information. Among the retrieval models proposed in past decades, generative retrieval models, especially those under the statistical probabilistic framework, are among the most popular techniques and have been widely applied to Information Retrieval problems. While they are known for their well-grounded theory and good empirical performance in text retrieval, their applications in IR are often limited by their complexity and low extensibility in modeling high-dimensional information. Recently, advances in deep learning techniques have provided new opportunities for representation learning and generative models for information retrieval. In contrast to statistical models, neural models have much more flexibility because they model information and data correlation in latent spaces without explicitly relying on any prior knowledge. Previous studies in pattern recognition and natural language processing have shown that semantically meaningful representations of text, images, and many other types of information can be acquired with neural models through supervised or unsupervised training. Nonetheless, the effectiveness of neural models for information retrieval is mostly unexplored. In this thesis, we study how to develop new generative models and representation learning frameworks with neural models for information retrieval.
Specifically, our contributions comprise three main components: (1) Theoretical analysis: we present the first theoretical analysis and adaptation of existing neural embedding models for ad-hoc retrieval tasks; (2) Design practice: based on our experience and knowledge, we show how to design an embedding-based neural generative model for practical information retrieval tasks such as personalized product search; and (3) Generic framework: we further generalize our proposed neural generative framework to complicated heterogeneous information retrieval scenarios involving text, images, knowledge entities, and their relationships. Empirical results show that the proposed neural generative framework can effectively learn information representations and construct retrieval models that outperform state-of-the-art systems in a variety of IR tasks.
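The core idea behind embedding-based ad-hoc retrieval can be illustrated with a hedged sketch (the thesis's actual models are more elaborate): represent query and document as averaged word vectors and rank by cosine similarity in the latent space.

```python
import math

def avg_vector(tokens, embeddings):
    """Average the word vectors of tokens found in the embedding table;
    returns None if no token is covered."""
    vecs = [embeddings[t] for t in tokens if t in embeddings]
    if not vecs:
        return None
    dim = len(vecs[0])
    return [sum(v[i] for v in vecs) / len(vecs) for i in range(dim)]

def cosine(a, b):
    """Cosine similarity of two dense vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)
```

Because matching happens in the embedding space rather than over exact terms, a document can score well on a query with which it shares no words, which is precisely the flexibility the paragraph above attributes to neural models.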
Recovering from a Decade: A Systematic Mapping of Information Retrieval Approaches to Software Traceability
Engineers in large-scale software development have to manage large amounts of information spread across many artifacts. Several researchers have proposed expressing the retrieval of trace links among artifacts, i.e. trace recovery, as an Information Retrieval (IR) problem. The objective of this study is to produce a map of work on IR-based trace recovery, with a particular focus on previous evaluations and the strength of evidence. We conducted a systematic mapping of IR-based trace recovery. Of the 79 publications classified, a majority applied algebraic IR models. While a set of studies on students indicates that IR-based trace recovery tools support certain work tasks, most previous studies do not go beyond reporting precision and recall of candidate trace links from evaluations using datasets containing fewer than 500 artifacts. Our review identified a need for industrial case studies. Furthermore, we conclude that the overall quality of reporting should be improved with regard to context and tool details, the measures reported, and the use of IR terminology. Finally, based on our empirical findings, we present suggestions on how to advance research on IR-based trace recovery.
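The algebraic IR models that dominate the surveyed work reduce to vector-space matching: a minimal sketch of TF-IDF cosine ranking of candidate trace links between tokenized artifacts (illustrative only, not any surveyed tool's implementation):

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Build sparse TF-IDF vectors for tokenized artifacts
    (raw term frequency, idf = log(N / df) + 1)."""
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))
    idf = {w: math.log(n / df[w]) + 1.0 for w in df}
    return [{w: tf * idf[w] for w, tf in Counter(d).items()} for d in docs]

def cosine(u, v):
    """Cosine similarity of two sparse vectors (dicts term -> weight)."""
    dot = sum(u[w] * v.get(w, 0.0) for w in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0
```

Ranking every target artifact against a source artifact by this score yields the candidate trace links whose precision and recall the surveyed evaluations report.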
Automatic aspect extraction in information retrieval diversity
In this master thesis we describe a new automatic aspect extraction algorithm that incorporates relevance information into the dynamics of Probabilistic Latent Semantic Analysis. A utility-biased likelihood statistical framework is described to formalize the incorporation of prior relevance information intrinsically into the dynamics of the algorithm. Moreover, a general abstract algorithm is presented to incorporate arbitrary new feature variables into the analysis.
A tempering procedure is derived for this general algorithm as an entropic regularization of the utility-biased likelihood functional, and a geometric interpretation of the algorithm is described, showing the intrinsic changes produced in the information space of the problem when different sources of prior utility estimations are provided over the same data.
The general algorithm is applied to several information retrieval, recommendation and personalization tasks. Moreover, a set of post-processing aspect filters is presented. Some characteristics of the aspect distributions, such as sparsity or low entropy, are identified as enhancing the overall diversity attained by the diversification algorithm. The proposed filters ensure that the final aspect space has those properties, thus leading to better diversity levels.
An experimental setup over TREC Web Track 09-12 data shows that the algorithm surpasses classic pLSA as an aspect extraction tool for search diversification. Additional theoretical applications of the general procedure to information retrieval, recommendation and personalization tasks are given, leading to new relevance-aware models incorporating several variables into the latent semantic analysis.
Finally, the problem of optimizing the aspect space size for diversification is addressed. Analytical formulas for the dependency of diversity metrics on the choice of an automatically extracted aspect space are given under a simplified generative model of the relation between system aspects and evaluation true aspects. An experimental analysis of this dependence is performed over TREC Web Track data using pLSA as the aspect extraction algorithm.
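Tempering in pLSA is usually realized, as in Hofmann's tempered EM, by raising the E-step posterior to an inverse-temperature exponent β. One common form is shown below as a hedged sketch; the thesis's utility-biased functional modifies this standard construction rather than reproducing it:

```latex
% Tempered E-step of pLSA: \beta \le 1 acts as an entropic regularizer,
% flattening the posterior over latent aspects z
P_\beta(z \mid d, w) =
  \frac{\bigl[ P(z \mid d)\, P(w \mid z) \bigr]^{\beta}}
       {\sum_{z'} \bigl[ P(z' \mid d)\, P(w \mid z') \bigr]^{\beta}}
```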
Clustering and Topic Modelling: A New Approach for Analysis of National Cyber security Strategies
The consequences of cybersecurity attacks can be severe for nation states and their people. Recently many nations have revisited their national cybersecurity strategies (NCSs) to ensure that their cybersecurity capabilities are sufficient to protect their citizens and cyberspace. This study is an initial attempt to compare NCSs by using clustering and topic modelling methods to investigate the similarities and differences between them. We also aimed to identify the underlying topics that appear in NCSs. We collected and examined 60 NCSs developed during 2003-2016. Drawing on institutional theories, we found that membership in international institutions could be a determinant factor for harmonization and integration between NCSs. By applying a hierarchical clustering method, we noticed stronger similarities between NCSs developed by EU or NATO members. We also found that public-private partnerships, protection of critical infrastructure, and defending citizens and public IT systems are among the topics that have received considerable attention in the majority of NCSs. We further argue that the topic modeling method LDA can be used as an automated technique for the analysis and understanding of textual documents by policy makers and governments during the development and review of national strategies and policies.
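The hierarchical clustering step can be sketched minimally as average-linkage agglomerative merging over a pairwise distance matrix (illustrative only; the study's actual feature representation of the NCS texts is not reproduced here):

```python
def agglomerative(dist, k):
    """Average-linkage agglomerative clustering on a symmetric distance
    matrix `dist`, merging the closest pair of clusters until only
    `k` clusters remain. Returns clusters as lists of point indices."""
    clusters = [[i] for i in range(len(dist))]

    def linkage(a, b):
        # Average pairwise distance between the two clusters.
        return sum(dist[i][j] for i in a for j in b) / (len(a) * len(b))

    while len(clusters) > k:
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Cutting the merge process at different cluster counts yields the dendrogram levels on which groupings such as the EU/NATO similarity observed in the study can be read off.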
Ontology-based semantic reminiscence support system
This thesis addresses the needs of people who find reminiscence helpful by focusing on the development of a computerised reminiscence support system, which facilitates access to and retrieval of stored memories used as the basis for positive interactions between elderly and young, and also between people with cognitive impairment and members of their family or caregivers.
To model users' background knowledge, this research defines a lightweight user-oriented ontology and its building principles. The ontology is flexible, with a simplified knowledge structure populated with semantically homogeneous ontology concepts. The user-oriented ontology differs from generic ontology models in that it does not rely on knowledge experts; its structure enables users to browse, edit and create new entries on their own.
To address the semantic gap problem in personal information retrieval, this thesis proposes a semantic ontology-based feature matching method. It involves natural language processing and semantic feature extraction/selection using the user-oriented ontology, and comprises four stages: (i) user-oriented ontology building; (ii) semantic feature extraction for building vectors representing information objects; (iii) semantic feature selection using the user-oriented ontology; and (iv) measuring the similarity between the information objects.
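Stage (iv) can be illustrated with a hedged sketch: cosine similarity over sparse term vectors in which terms matching user-oriented ontology concepts are up-weighted. The function name and boosting scheme are hypothetical, not the thesis's exact method:

```python
import math

def ontology_weighted_similarity(vec_a, vec_b, ontology_concepts, boost=2.0):
    """Cosine similarity of two sparse term vectors (dicts term -> weight),
    with terms that match ontology concepts up-weighted by `boost`."""
    def weighted(v):
        return {t: w * (boost if t in ontology_concepts else 1.0)
                for t, w in v.items()}
    a, b = weighted(vec_a), weighted(vec_b)
    dot = sum(a[t] * b.get(t, 0.0) for t in a)
    na = math.sqrt(sum(x * x for x in a.values()))
    nb = math.sqrt(sum(x * x for x in b.values()))
    return dot / (na * nb) if na and nb else 0.0
```

Boosting ontology terms makes two memories that share a concept from the user's background knowledge rank as more similar than memories that merely share an incidental word.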
To facilitate personal information management and dynamic generation of content, the system uses ontologies and advanced algorithms for semantic feature matching. An algorithm named Onto-SVD is also proposed, which uses the user-oriented ontology to automatically detect the semantic relations within the stored memories. It combines semantic feature selection with matrix factorisation and k-means clustering to achieve topic identification based on semantic relations.
The thesis further proposes an ontology-based personalised retrieval mechanism for the system. It aims to assist people to recall, browse and re-discover events from their lives by considering their profiles and background knowledge, and providing them with customised retrieval results. Furthermore, a user profile space model is defined and its construction method described. The model combines multiple user-oriented ontologies and has a self-organised structure based on relevance feedback. The identification of a person's search intentions in this mechanism operates at the conceptual level and involves the person's background knowledge. Based on the identified search intentions, knowledge spanning trees are automatically generated from the ontologies or user profile spaces. The knowledge spanning trees are used to expand and reformulate queries, enhancing the queries' semantic representations by applying domain knowledge.
The crowdsourcing-based system evaluation measures users' satisfaction with the content generated by Sem-LSB. It compares the advantages and disadvantages of three types of content presentation (i.e. unstructured, LSB-based and semantic/knowledge-based). Based on users' feedback, the semantic/knowledge-based presentation is considered to have higher overall satisfaction and stronger reminiscing support effects than the others.
Multi Domain Semantic Information Retrieval Based on Topic Model
Over the last decades, there have been remarkable shifts in the area of Information Retrieval (IR) as huge amounts of information have accumulated on the Web. This information explosion increases the need for new tools that retrieve meaningful knowledge from various complex information sources. Techniques to search and extract important information from numerous database sources have thus become a key challenge for current IR systems.
Topic modeling is one of the most recent techniques for discovering hidden thematic structures in large data collections without human supervision. Several topic models have been proposed in various fields of study and have been utilized extensively in many applications. Latent Dirichlet Allocation (LDA) is the best-known topic model; it generates topics from large corpora of resources such as text, images, and audio. It has been widely used in information retrieval and data mining, providing an efficient way of identifying latent topics in document collections. However, LDA has a drawback: topic cohesion within a concept is attenuated when estimating infrequently occurring words. Moreover, LDA does not consider the meaning of words, but rather infers hidden topics through a statistical approach. As a result, LDA can suffer either a reduction in the quality of topic words or an increase in loose relations between topics.
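For reference, the generative process that LDA assumes (standard formulation, with α and β the Dirichlet hyperparameters and K the number of topics):

```latex
% topic-word distributions, one per topic
\varphi_k \sim \mathrm{Dirichlet}(\beta), \quad k = 1, \dots, K
% per-document topic proportions
\theta_d \sim \mathrm{Dirichlet}(\alpha)
% for each word position n in document d: draw a topic, then a word
z_{d,n} \sim \mathrm{Multinomial}(\theta_d), \qquad
w_{d,n} \sim \mathrm{Multinomial}(\varphi_{z_{d,n}})
```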
To solve these problems, we propose a domain-specific topic model that combines domain concepts with LDA. Two domain-specific algorithms are suggested to address the difficulties associated with LDA. The main strength of our proposed model is that it narrows semantic concepts from broad domain knowledge down to a specific domain, which solves the unknown-domain problem. Our proposed model is extensively tested on various applications (query expansion, classification, and summarization) to demonstrate its effectiveness. Experimental results show that the proposed model significantly increases the performance of these applications.
CHORUS Deliverable 2.1: State of the Art on Multimedia Search Engines
Based on the information provided by European projects and national initiatives related to multimedia search, as well as domain experts who participated in the CHORUS think-tanks and workshops, this document reports on the state of the art in multimedia content search from a technical and socio-economic perspective.
The technical perspective includes an up-to-date view of content-based indexing and retrieval technologies, multimedia search in the context of mobile devices and peer-to-peer networks, and an overview of current evaluation and benchmark initiatives to measure the performance of multimedia search engines.
From a socio-economic perspective we take stock of the impact and legal consequences of these technical advances and point out future directions of research.