
    Utilisation of metadata fields and query expansion in cross-lingual search of user-generated Internet video

    Recent years have seen significant efforts in the area of Cross Language Information Retrieval (CLIR) for text retrieval. This work initially focused on formally published content, but more recently research has begun to concentrate on CLIR for informal social media content. However, despite the current expansion in online multimedia archives, there has been little work on CLIR for this content. While there has been some limited work on Cross-Language Video Retrieval (CLVR) for professional videos, such as documentaries or TV news broadcasts, there has, to date, been no significant investigation of CLVR for the rapidly growing archives of informal user-generated content (UGC). Key differences between such UGC and professionally produced content are the nature and structure of the textual UGC metadata associated with it, as well as the form and quality of the content itself. In this setting, retrieval effectiveness may suffer not only from the translation errors common to all CLIR tasks, but also from recognition errors introduced by the automatic speech recognition (ASR) systems used to transcribe the spoken content of the video, and from the informality and inconsistency of the user-created metadata for each video. This work proposes and evaluates techniques to improve CLIR effectiveness for such noisy UGC content. Our experimental investigation shows that different sources of evidence, e.g. the content from different fields of the structured metadata, significantly affect CLIR effectiveness. Results from our experiments also show that each metadata field has a varying robustness to query expansion (QE), and hence QE can have a negative impact on CLIR effectiveness. Our work proposes a novel adaptive QE technique that predicts the most reliable source for expansion, and shows how this technique can be effective for improving CLIR effectiveness for UGC content.
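    The abstract above does not detail the prediction mechanism, so the following is a minimal sketch of the general idea, assuming a simple coherence-based reliability score: each metadata field is scored as an expansion source for the current query, and expansion terms are drawn only from the field predicted to be most reliable. The field names, scoring function, and data layout are all illustrative assumptions, not the thesis's actual method.

```python
from collections import Counter

def field_reliability(query_terms, field_texts):
    """Hypothetical reliability score for one metadata field: the fraction
    of top-ranked documents whose text in this field contains at least one
    query term (a crude coherence proxy, assumed for illustration)."""
    hits = sum(any(t in text.lower() for t in query_terms)
               for text in field_texts)
    return hits / max(len(field_texts), 1)

def adaptive_expansion(query_terms, top_docs,
                       fields=("title", "description", "asr"), k=5):
    """Pick the most reliable field, then take its k most frequent
    non-query terms from the top-ranked documents as expansion terms."""
    best = max(fields, key=lambda f: field_reliability(
        query_terms, [d[f] for d in top_docs]))
    counts = Counter(w for d in top_docs for w in d[best].lower().split()
                     if w not in query_terms)
    return best, [w for w, _ in counts.most_common(k)]

docs = [{"title": "cooking pasta at home", "description": "easy pasta recipe video",
         "asr": "today we boil water and add salt"},
        {"title": "pasta sauce tutorial", "description": "tomato sauce from scratch",
         "asr": "chop the onions finely"}]
print(adaptive_expansion(["pasta"], docs))
```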

    On Clustering and Evaluation of Narrow Domain Short-Text Corpora

    This doctoral thesis investigates the problem of clustering special collections of documents called narrow-domain short texts. To carry out this task, several corpora and clustering methods have been analysed. Moreover, a number of corpus evaluation measures, term selection techniques, and cluster validity measures have been introduced, with the aim of studying the following problems: determining the relative difficulty of a corpus to be clustered, and studying some of its characteristics, such as text length, domain broadness, stylometry, class imbalance and structure; and contributing to the state of the art on the clustering of corpora composed of narrow-domain short texts. The research work carried out is partially focused on "short-text clustering". This topic is considered relevant given the current and future ways in which people tend to use a "reduced language" made up of short texts (for example, blogs, snippets, news items, and text-message generation such as email and chat). Additionally, the domain broadness of corpora is studied. In this sense, a corpus may be considered narrow- or broad-domain if the degree of vocabulary overlap is high or low, respectively. In the categorisation task, it is quite difficult to deal with narrow-domain corpora such as scientific articles, technical reports, patents, etc. The main goal of this work is to study possible strategies for dealing with the following two problems: a) the low term frequencies of the vocabulary in short texts, and b) the high vocabulary overlap associated with narrow domains. While each of the above problems is a hard enough challenge on its own, when dealing with short texts from narrow domains the complexity of the problem increases. Pinto Avendaño, DE. (2008). On Clustering and Evaluation of Narrow Domain Short-Text Corpora [unpublished doctoral thesis]. Universitat Politècnica de València. https://doi.org/10.4995/Thesis/10251/2641
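    As a rough illustration of the domain-broadness notion described above, the sketch below computes the average pairwise vocabulary overlap between document classes; a high average overlap indicates a narrow domain. The Jaccard overlap used here is an assumed, simple choice, not necessarily one of the corpus evaluation measures introduced in the thesis.

```python
from itertools import combinations

def class_vocab(docs):
    """Vocabulary of a class: the set of all tokens in its documents."""
    return {w for d in docs for w in d.lower().split()}

def mean_vocab_overlap(classes):
    """Average Jaccard overlap between the vocabularies of all class pairs.
    Values near 1 suggest a narrow domain; values near 0 a broad one."""
    vocabs = [class_vocab(c) for c in classes]
    pairs = list(combinations(vocabs, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)

# Two classes of short scientific titles with heavily shared vocabulary.
corpus = [["protein folding dynamics", "protein structure prediction"],
          ["protein binding sites", "structure of membrane protein"]]
print(round(mean_vocab_overlap(corpus), 3))
```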

    Learning to select for information retrieval

    The effective ranking of documents in search engines is based on various document features, such as the frequency of the query terms in each document, the length, or the authoritativeness of each document. In order to obtain better retrieval performance, instead of using a single feature or a few features, there is a growing trend to create a ranking function by applying a learning to rank technique on a large set of features. Learning to rank techniques aim to generate an effective document ranking function by combining a large number of document features. Different ranking functions can be generated by using different learning to rank techniques or different document feature sets. While the generated ranking function may be uniformly applied to all queries, several studies have shown that different ranking functions favour different queries, and that retrieval performance can be significantly enhanced if an appropriate ranking function is selected for each individual query. This thesis proposes Learning to Select (LTS), a novel framework that selectively applies an appropriate ranking function on a per-query basis, regardless of the given query's type and the number of candidate ranking functions. In the learning to select framework, the effectiveness of a ranking function for an unseen query is estimated from the available neighbouring training queries. The proposed framework employs a classification technique (e.g. k-nearest neighbour) to identify neighbouring training queries for an unseen query by using a query feature. In particular, a divergence measure (e.g. Jensen-Shannon), which determines the extent to which a document ranking function alters the scores of an initial ranking of documents for a given query, is proposed for use as a query feature. The ranking function which performs best on the identified training query set is then chosen for the unseen query. The proposed framework is thoroughly evaluated on two different TREC retrieval tasks (namely, the Web search and ad hoc search tasks) and on two large standard LETOR feature sets, which contain as many as 64 document features, deriving conclusions concerning the key components of LTS, namely the query feature and the identification of neighbouring queries. Two different types of experiments are conducted. The first is to select an appropriate ranking function from a number of candidate ranking functions. The second is to select multiple appropriate document features, from a number of candidate document features, for building a ranking function. Experimental results show that our proposed LTS framework is effective both in selecting an appropriate ranking function and in selecting multiple appropriate document features, on a per-query basis. In addition, retrieval performance is further enhanced when the number of candidates increases, suggesting the robustness of the learning to select framework. This thesis also demonstrates how the LTS framework can be deployed to other search applications. These applications include the selective integration of a query independent feature into a document weighting scheme (e.g. BM25), the selective estimation of the relative importance of different query aspects in a search diversification task (the goal of the task is to retrieve a ranked list of documents that provides maximum coverage for a given query, while avoiding excessive redundancy), and the selective application of an appropriate resource for expanding and enriching a given query for document search within an enterprise. The effectiveness of the LTS framework is observed across these search applications, and on different collections, including a large-scale Web collection that contains over 50 million documents. This suggests the generality of the proposed learning to select framework. The main contributions of this thesis are the introduction of the LTS framework and the proposed use of divergence measures as query features for identifying similar queries. In addition, this thesis draws insights from a large set of experiments, involving four different standard collections, four different search tasks and large document feature sets. This illustrates the effectiveness, robustness and generality of the LTS framework in tackling various retrieval applications.
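    To make the proposed query feature concrete, here is a minimal sketch: the Jensen-Shannon divergence between the normalised score distributions of an initial ranking and a re-ranking measures how much a candidate ranking function alters the initial ranking, and a k-nearest-neighbour step then selects the ranker that performed best among the neighbouring training queries. The normalisation, the one-dimensional feature space, and the ranker names in the example are simplifying assumptions, not the thesis's exact formulation.

```python
import math

def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    def kl(a, b):
        return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b) if ai > 0)
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

def to_dist(scores):
    """Normalise a list of positive document scores into a distribution."""
    total = sum(scores)
    return [s / total for s in scores]

def query_feature(initial_scores, reranked_scores):
    """The query feature: how much the ranker altered the initial scores."""
    return js_divergence(to_dist(initial_scores), to_dist(reranked_scores))

def select_ranker(unseen_feature, train_queries, k=3):
    """k-NN over training queries; among the k nearest neighbours, return
    the ranker stored with the highest effectiveness."""
    nearest = sorted(train_queries,
                     key=lambda q: abs(q["feature"] - unseen_feature))[:k]
    return max(nearest, key=lambda q: q["effectiveness"])["best_ranker"]

# Ranker names below are just labels for illustration.
train = [{"feature": 0.02, "effectiveness": 0.31, "best_ranker": "AdaRank"},
         {"feature": 0.15, "effectiveness": 0.44, "best_ranker": "RankSVM"},
         {"feature": 0.11, "effectiveness": 0.40, "best_ranker": "RankSVM"}]
f = query_feature([3.0, 2.0, 1.0], [2.5, 2.5, 1.0])
print(f, select_ranker(f, train))
```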

    A semi-automated FAQ retrieval system for HIV/AIDS

    This thesis describes a semi-automated FAQ retrieval system that can be queried by users through short text messages on low-end mobile phones to provide answers to HIV/AIDS related queries. First we address the issue of result presentation on low-end mobile phones by proposing an iterative interaction retrieval strategy where the user engages with the FAQ retrieval system in the question answering process. At each iteration, the system returns only one question-answer pair to the user, and the iterative process terminates once the user's information need has been satisfied. Since the proposed system is iterative, this thesis attempts to reduce the number of iterations (search length) between the users and the system so that users do not abandon the search process before their information need has been satisfied. Moreover, we conducted a user study to determine the number of iterations that users are willing to tolerate before abandoning the iterative search process. We subsequently used the bad abandonment statistics from this study to develop an evaluation measure for estimating the probability that any random user will be satisfied when using our FAQ retrieval system. In addition, we used a query log and its click-through data to address three main FAQ document collection deficiency problems, in order to improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. Conclusions are derived concerning whether we can reduce the rate at which users abandon their search before their information need has been satisfied by using information from previous searches to: address the term mismatch problem between the users' SMS queries and the relevant FAQ documents in the collection; selectively rank the FAQ documents according to how often they have been previously identified as relevant by users for a particular query term; and identify those queries that do not have a relevant FAQ document in the collection. In particular, we proposed a novel template-based approach that uses queries from a query log, for which the true relevant FAQ documents are known, to enrich the FAQ documents with additional terms in order to alleviate the term mismatch problem. These terms are added as a separate field in a field-based model using two different proposed enrichment strategies, namely the Term Frequency and the Term Occurrence strategies. This thesis thoroughly investigates the effectiveness of the aforementioned FAQ document enrichment strategies using three different field-based models. Our findings suggest that we can improve the overall recall and the probability that any random user will be satisfied by enriching the FAQ documents with additional terms from queries in our query log. Moreover, our investigation suggests that it is important to use an FAQ document enrichment strategy that takes into consideration the number of times a term occurs in the query when enriching the FAQ documents. We subsequently show that our proposed enrichment approach for alleviating the term mismatch problem generalises well to other datasets. Through the evaluation of our proposed approach for selectively ranking the FAQ documents, we show that we can improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by incorporating the click popularity score of a query term t on an FAQ document d into the scoring and ranking process. Our results generalised well to a new dataset.
However, when we deployed the click popularity score of a query term t on an FAQ document d on an enriched FAQ document collection, we saw a decrease in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. Furthermore, we used our query log to build a binary classifier for detecting those queries that do not have a relevant FAQ document in the collection (Missing Content Queries, MCQs). Before building such a classifier, we empirically evaluated several feature sets in order to determine the best combination of features for building a model that yields the best classification accuracy in identifying the MCQs and the non-MCQs. Using a different dataset, we show that we can improve the overall retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system by deploying an MCQ detection subsystem in our FAQ retrieval system to filter out the MCQs. Finally, this thesis demonstrates that correcting spelling errors can help improve the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system. We tested our FAQ retrieval system with two different test sets, one containing the original SMS queries and the other containing the SMS queries manually corrected for spelling errors. Our results show a significant improvement in the retrieval performance and the probability that any random user will be satisfied when using our FAQ retrieval system.
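    The two enrichment strategies named above admit a small illustration: for each FAQ document whose relevant queries are known from the log, the enrichment field either counts every occurrence of a query term (Term Frequency) or counts each term at most once per query (Term Occurrence). The data layout and function names below are assumptions for illustration only.

```python
from collections import Counter

def enrich(faq_queries, strategy="tf"):
    """Build the enrichment field for one FAQ document from the log
    queries known to be relevant to it.
    strategy 'tf' (Term Frequency):  count every term occurrence.
    strategy 'to' (Term Occurrence): count each term once per query."""
    field = Counter()
    for q in faq_queries:
        terms = q.lower().split()
        field.update(terms if strategy == "tf" else set(terms))
    return field

queries = ["what are hiv symptoms", "symptoms symptoms of hiv"]
print(enrich(queries, "tf"))  # 'symptoms' counted 3 times in total
print(enrich(queries, "to"))  # 'symptoms' counted twice (once per query)
```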

    Towards effective cross-lingual search of user-generated internet speech

    The very rapid growth in user-generated social (UGS) spoken content on online platforms is creating new challenges for Spoken Content Retrieval (SCR) technologies. There are many potential choices for how to design a robust SCR framework for UGS content, but the current lack of detailed investigation means that there is little understanding of the specific challenges, and little or no guidance available to inform these choices. This thesis investigates the challenges of effective SCR for UGS content, and proposes novel SCR methods that are designed to cope with them. The work presented in this thesis can be divided into three areas of contribution, as follows. The first contribution of this work is a critique of the issues and challenges that influence the effectiveness of searching UGS content in both mono-lingual and cross-lingual settings. The second contribution is the development of an effective Query Expansion (QE) method for UGS. This research reports that the variation in the length, quality and structure of the relevant documents encountered in UGS content can harm the effectiveness of QE techniques across different queries. Seeking to address this issue, this work examines the utilisation of Query Performance Prediction (QPP) techniques for improving QE in UGS, and presents a novel framework specifically designed for predicting the effectiveness of QE. Thirdly, this work extends the utilisation of QPP in UGS search to improve cross-lingual search for UGS by predicting translation effectiveness. The thesis proposes novel methods to estimate the quality of translation for cross-lingual UGS search. An empirical evaluation demonstrates the quality of the proposed methods on alternative translation outputs extracted from several Machine Translation (MT) systems developed for this task. The research then shows how this framework can be integrated into cross-lingual UGS search to find relevant translations for improved retrieval performance.
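    The abstract does not specify which QPP predictors the framework uses, so the sketch below shows one standard signal of the kind such a framework could build on: a simplified query clarity score, i.e. the KL divergence between a language model of the top-ranked feedback documents and the collection model, where higher clarity suggests QE is more likely to help. The smoothing scheme and the omission of query-likelihood weighting are simplifications of the published clarity measure.

```python
import math
from collections import Counter

def clarity_score(feedback_docs, collection_docs, mu=0.5):
    """Simplified query clarity: KL divergence between a language model of
    the top-ranked (feedback) documents and the collection language model.
    Higher values indicate a more focused result set."""
    fb = Counter(w for d in feedback_docs for w in d.lower().split())
    coll = Counter(w for d in collection_docs for w in d.lower().split())
    fb_total, coll_total = sum(fb.values()), sum(coll.values())
    score = 0.0
    for w, c in fb.items():
        p_fb = (1 - mu) * c / fb_total + mu * coll[w] / coll_total  # smoothed
        p_coll = coll[w] / coll_total
        if p_coll > 0:
            score += p_fb * math.log(p_fb / p_coll)
    return score

coll = ["water boils at high temperature", "rivers flow to the sea",
        "speech retrieval of video content", "video speech transcripts"]
fb = ["speech retrieval of video content", "video speech transcripts"]
print(round(clarity_score(fb, coll), 3))
```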

    The voting model for people search

    The thesis investigates how persons in an enterprise organisation can be ranked in response to a query, so that those persons with relevant expertise to the query topic are ranked first. The expertise areas of the persons are represented by documentary evidence of expertise, known as candidate profiles. The statement of this research work is that the expert search task in an enterprise setting can be successfully and effectively modelled using a voting paradigm. In the so-called Voting Model, when a document is retrieved for a query, this document represents a vote for every expert associated with the document to have relevant expertise to the query topic. This voting paradigm is manifested by the proposition of various voting techniques that aggregate the votes from documents to candidate experts. Moreover, the research work demonstrates that these voting techniques can be modelled in terms of a Bayesian belief network, providing probabilistic semantics for the proposed voting paradigm. The proposed voting techniques are thoroughly evaluated on three standard expert search test collections, deriving conclusions concerning each component of the Voting Model, namely the method used to identify the documents that represent each candidate's expertise areas, the weighting models that are used to rank the documents, and the voting techniques which are used to convert the ranking of documents into a ranking of experts. Effective settings are identified and insights about the behaviour of each voting technique are derived. Moreover, the practical aspects of deploying an expert search engine, such as its efficiency and how it should be trained, are also discussed. This thesis includes an investigation of the relationship between the quality of the underlying ranking of documents and the resulting effectiveness of the voting techniques. The thesis shows that various effective document retrieval approaches have a positive impact on the performance of the voting techniques. Interestingly, it also shows that a 'perfect' ranking of documents does not necessarily translate into an equally perfect ranking of candidates. Insights are provided into the reasons for this, which relate to the complexity of evaluating tasks based on ranking aggregates of documents. Furthermore, it is shown how query expansion can be adapted and integrated into the expert search process, such that the query expansion successfully acts on a pseudo-relevant set containing only a list of names of persons. Five ways of performing query expansion in the expert search task are proposed, which vary in the extent to which they tackle expert search-specific problems, in particular the occurrence of topic drift within the expertise evidence for each candidate. Not all documentary evidence of expertise for a given person is equally useful, nor may there be sufficient expertise evidence for a relevant person within an enterprise. This thesis investigates various approaches to identify the high-quality evidence for each person, and shows how the World Wide Web can be mined as a resource to find additional expertise evidence. This thesis also demonstrates how the proposed model can be applied to other people search tasks, such as ranking blog(ger)s in the blogosphere setting, and suggesting reviewers for papers submitted to an academic conference. The central contributions of this thesis are the introduction of the Voting Model, and the definition of a number of voting techniques within the model.
The thesis draws insights from an extremely large and exhaustive set of experiments, involving many experimental parameters, and using different test collections for several people search tasks. This illustrates the effectiveness and the generality of the Voting Model at tackling various people search tasks and, indeed, the retrieval of aggregates of documents in general.
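    The voting paradigm lends itself to a compact illustration. The sketch below implements two simple aggregation techniques in the spirit of the model, Votes (each retrieved document counts as one vote per associated candidate) and CombSUM (each candidate accumulates the retrieval scores of their documents); the full set of voting techniques and the weighting details studied in the thesis are not reproduced here.

```python
from collections import defaultdict

def rank_experts(doc_ranking, doc_to_candidates, technique="CombSUM"):
    """Aggregate a ranking of documents into a ranking of candidate experts.
    'Votes'   : each retrieved document is one vote per candidate.
    'CombSUM' : each candidate accumulates the scores of their documents."""
    scores = defaultdict(float)
    for doc, score in doc_ranking:
        for cand in doc_to_candidates.get(doc, []):
            scores[cand] += 1.0 if technique == "Votes" else score
    return sorted(scores.items(), key=lambda x: x[1], reverse=True)

# A document ranking and the candidate profiles each document belongs to.
ranking = [("d1", 3.2), ("d2", 2.7), ("d3", 1.4)]
profiles = {"d1": ["alice"], "d2": ["alice", "bob"], "d3": ["bob"]}
print(rank_experts(ranking, profiles, "CombSUM"))  # alice ranked first
print(rank_experts(ranking, profiles, "Votes"))
```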

    Impact de l'hyperparamètre alpha sur l'algorithme d'analyse de textes Latent Dirichlet Allocation

    The unsupervised document classification algorithm Latent Dirichlet Allocation (LDA) has, within the space of about ten years, become one of the most cited in the classification literature. This algorithm has the particularity of allowing a document to belong to several topics in varying proportions. It relies on a hyperparameter that has so far received little attention in the scientific community: the parameter α, which controls the variability of the topics for each document. This parameter is the sole parameter of the Dirichlet distribution, and it defines the prior probability of documents in the LDA model. At the extremes of the range of values that can be assigned to this parameter, it is possible to restrict each document to a single topic, or to force all documents to share all topics uniformly. This thesis illustrates the role of the α parameter and demonstrates the effect it can have on the classification performance of the algorithm. The parameter α is a vector whose length corresponds to the number of topics and which is generally fixed at a constant value; this value can either be set arbitrarily or estimated during the learning phase. A small value steers the classification towards a small number of topics per document, and conversely a large value leads to several topics being assigned per document. Work by Wallach et al. has shown that non-uniform distributions for this parameter can improve the classification performance of the LDA algorithm. That work was conducted on real data, for which the underlying topic distribution is unknown. Such data therefore cannot establish whether the improvement obtained comes from the topic distribution actually being non-uniform in reality, or whether other factors, such as local minima of LDA or other circumstantial factors, explain the improvement. To investigate this question, our study uses synthetic data. LDA is a generative model that naturally lends itself to the creation of synthetic documents: documents are generated from known latent parameters. The natural hypothesis is, of course, that matching the α used for generating the data with the α used for training will yield the best performance. The results show that, contrary to expectations, the performance of LDA is not necessarily optimal when the generation and training α are identical. The optimal performance varies with the α values of the corpus. The most marked differences occur when the corpus tends to be composed of mono-topic documents, in which case uniform training αs provide the best performance. The performance differences shrink as the α values grow and the corpora are composed of multiple well-distributed topics; in that case we observe fewer performance differences, and no clear trend emerges as to the optimal setting. These results do not corroborate Wallach et al.'s finding that a non-uniform distribution for α can give better results; however, the reasons for their improvement remain hypothetical. On the one hand, their results come from real corpora, which may be more complex than, or relatively different from, the LDA model. On the other hand, the difference may stem from the approach used to train the latent variables, or from the fact that the asymmetry of their α parameter was weaker than in our study; their improvement could also come from a local maximum, since, unlike in our study, it is difficult with real data to explore the space of a corpus's latent parameters, which are unknown. Another contribution of this study is an improvement to LDA performance through the initialisation of one of its latent parameters, the distribution of words per topic (the β matrix), using an unsupervised classification method based on the naive Bayes algorithm. This yields a substantial performance gain on mono-topic corpora, in addition to greater reliability, as the results are more stable across simulation runs. A final contribution addresses the problem of comparing classifications according to their topic representations. This led us to define a matrix similarity measure that is robust to permutation and rotation. This work is still in progress, but we report the partial results as an appendix to this document, as we believe they are a significant contribution. Beyond our own context, this measure can find applications in several other fields where the results of unsupervised algorithms must be evaluated and compared, such as non-negative matrix factorisation (NMF), or any other setting where an algorithm's results are expressed as a matrix that may be transformed by permutation and rotation of its dimensions, which makes the comparison complex.
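    The generative process mentioned above is straightforward to sketch: per-document topic proportions are drawn from Dirichlet(α) and words are then sampled from known topic-word distributions (the β matrix), so the effect of matching or mismatching the training α can be studied against known ground truth. The sketch below is the standard LDA generative process with illustrative sizes, not the author's experimental code.

```python
import numpy as np

rng = np.random.default_rng(0)

def generate_corpus(alpha, beta, n_docs=100, doc_len=50):
    """Generate synthetic documents from the LDA generative process.
    alpha : Dirichlet parameter vector, one entry per topic.
    beta  : topic-word matrix, shape (n_topics, vocab_size), rows sum to 1."""
    n_topics, vocab = beta.shape
    docs = []
    for _ in range(n_docs):
        theta = rng.dirichlet(alpha)               # per-document topic mix
        topics = rng.choice(n_topics, size=doc_len, p=theta)
        words = [rng.choice(vocab, p=beta[z]) for z in topics]
        docs.append(words)
    return docs

# Small example: 3 topics over a 6-word vocabulary.
beta = np.array([[.5, .5, 0, 0, 0, 0],
                 [0, 0, .5, .5, 0, 0],
                 [0, 0, 0, 0, .5, .5]])
mono = generate_corpus(np.full(3, 0.01), beta)   # near mono-topic documents
mixed = generate_corpus(np.full(3, 5.0), beta)   # well-mixed topics
print(len(set(mono[0])), len(set(mixed[0])))     # vocabulary diversity per doc
```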