
    Feature extraction and classification of spam emails


    Approaches to better context modeling and categorization


    New sampling and optimization methods for topic inference and text classification

    Topic modelling (TM) methods, such as latent Dirichlet allocation (LDA), are advanced statistical models used to uncover hidden thematic structures, or topics, in unstructured text. In this context, a topic is a distribution over words, and a document is a distribution over topics. Topic models are usually unsupervised; however, supervised variants have been proposed, such as supervised LDA (SLDA), which can be used for text classification. To evaluate a supervised topic model, one can measure its classification accuracy. Evaluating an unsupervised topic model is less straightforward and is usually done by computing two metrics: held-out perplexity, which measures the model’s ability to generalize to unseen documents, and coherence, which measures how semantically related the words within each topic are.

    This thesis explores ideas for enhancing the performance of TM, both supervised and unsupervised. Firstly, multi-objective topic modelling (MOEA-TM) is proposed, which uses a multi-objective evolutionary algorithm (MOEA) to optimize two objectives: coverage and coherence. MOEA-TM has two settings: ‘start from scratch’ and ‘start from an estimated topic model’. In the latter, held-out perplexity is added as an additional objective. In both settings, MOEA-TM achieves highly coherent topics. Further, a genetic algorithm is developed with the LDA log-likelihood as its fitness function. This algorithm can improve log-likelihood by up to 10%; however, perplexity scores slightly deteriorate due to over-fitting.

    Hyperparameters play a significant role in TM; thus, Gibbs-Newton (GN), an efficient approach to learning the parameter of a multivariate Pólya distribution, is proposed. A closer look at the LDA model reveals that it comprises two multivariate Pólya distributions: one models topics, while the other models topic proportions in documents. Consequently, a better approach to learning the multivariate Pólya distribution parameter may enhance TM. GN is benchmarked against Minka’s fixed-point iteration, a slice sampling technique, and the method of moments. We find that GN provides the same level of accuracy as Minka’s fixed-point iteration in less time, and better accuracy than the other approaches. LDA-GN, which uses the GN method in topic modelling, is also proposed; it achieves better perplexity scores than the original LDA on the three corpora tested. Moreover, LDA-GN is tested on a supervised task via SLDA-GN, the SLDA model equipped with the GN method to learn its hyperparameters. SLDA-GN outperforms the original SLDA, which optimizes its hyperparameters using Minka’s fixed-point iteration. Furthermore, LDA-GN is evaluated on a spam filtering task using the Multi-corpus LDA (MC-LDA) model, where it shows more stable performance than the standard LDA.

    Finally, most topic models rest on the “bag of words” assumption, under which a document’s word order is discarded and only word frequencies are preserved. We propose the LDA-crr model, which represents word order as an observed variable. LDA-crr introduces only minor additional complexity to TM; thus, it can be applied readily to large corpora. LDA-crr is benchmarked against the original LDA using fixed hyperparameters to isolate their influence; it outperforms LDA in terms of perplexity and shows slightly more coherent topics as the number of topics increases.
    Also, LDA-crr is equipped with both the GN approach and the slice sampling technique, in the LDA-crrGN and LDA-crrGSS models respectively. LDA-crrGN shows a slightly better ability to generalize to unseen documents than LDA-GN on one corpus when the number of topics is high; in general, however, LDA-crrGSS shows better coherence scores than both LDA-GN and the original LDA. Furthermore, to investigate LDA-crr in a classification task, SLDA is extended to incorporate word order in the SLDA-crr model, with the GN and GSS techniques used to learn hyperparameters in the SLDA-crrGN and SLDA-crrGSS models respectively. Compared with SLDA-GN and the original SLDA, SLDA-crrGN classifies unseen documents more accurately, revealing that it picks up more useful information from the training corpus, which in turn helps the model perform better.
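    As a concrete point of reference for the benchmark above: Minka’s fixed-point iteration for the multivariate Pólya (Dirichlet-multinomial) parameter is standard enough to sketch. The snippet below is an illustrative NumPy/SciPy implementation of that baseline only, not the thesis’s code or its GN method:

```python
import numpy as np
from scipy.special import digamma

def polya_mle_minka(counts, n_iter=1000, tol=1e-7):
    """Maximum-likelihood parameter of a multivariate Polya
    (Dirichlet-multinomial) distribution via Minka's fixed-point iteration.

    counts: (D, K) array; row d holds the K category counts of sample d.
    Assumes every category occurs at least once somewhere in the data,
    otherwise its alpha component collapses to zero.
    """
    counts = np.asarray(counts, dtype=float)
    row_sums = counts.sum(axis=1)            # n_d: total count of sample d
    alpha = counts.mean(axis=0) + 1e-2       # crude positive initialisation
    for _ in range(n_iter):
        alpha_sum = alpha.sum()
        # alpha_k <- alpha_k * sum_d [psi(n_dk + a_k) - psi(a_k)]
        #                    / sum_d [psi(n_d + a0)  - psi(a0)]
        num = (digamma(counts + alpha) - digamma(alpha)).sum(axis=0)
        den = (digamma(row_sums + alpha_sum) - digamma(alpha_sum)).sum()
        new_alpha = alpha * num / den        # multiplicative step keeps alpha > 0
        if np.max(np.abs(new_alpha - alpha)) < tol:
            return new_alpha
        alpha = new_alpha
    return alpha
```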
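    Likewise, the two evaluation metrics the abstract leans on can be sketched compactly; the exact perplexity and coherence variants used in the thesis may differ (a UMass-style coherence is assumed here):

```python
import numpy as np

def heldout_perplexity(counts, theta, phi):
    """exp(-log-likelihood per token) on held-out documents; lower is better.

    counts: (D, V) word counts of held-out documents.
    theta:  (D, K) per-document topic proportions (rows sum to 1).
    phi:    (K, V) per-topic word distributions (rows sum to 1).
    """
    p_w_given_d = theta @ phi                    # p(w|d) = sum_k theta_dk phi_kw
    log_lik = (counts * np.log(p_w_given_d + 1e-12)).sum()
    return np.exp(-log_lik / counts.sum())

def umass_coherence(top_word_ids, doc_occurrence):
    """UMass coherence of one topic (less negative is better).

    top_word_ids:   the topic's top words, most probable first.
    doc_occurrence: (D, V) boolean matrix of word occurrence per document.
    """
    score = 0.0
    for m in range(1, len(top_word_ids)):
        for l in range(m):
            w_m, w_l = top_word_ids[m], top_word_ids[l]
            co_df = np.logical_and(doc_occurrence[:, w_m],
                                   doc_occurrence[:, w_l]).sum()
            df = doc_occurrence[:, w_l].sum()    # assumed nonzero for top words
            score += np.log((co_df + 1.0) / df)
    return score
```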

    Recommender Systems

    The ongoing rapid expansion of the Internet greatly increases the need for effective recommender systems to filter the abundant information. Extensive research on recommender systems is conducted by a broad range of communities, including social and computer scientists, physicists, and interdisciplinary researchers. Despite substantial theoretical and practical achievements, unification and comparison of different approaches are lacking, which impedes further advances. In this article, we review recent developments in recommender systems and discuss the major challenges. We compare and evaluate available algorithms and examine their roles in future developments. In addition to algorithms, physical aspects are described to illustrate the macroscopic behavior of recommender systems. Potential impacts and future directions are discussed. We emphasize that recommendation has great scientific depth and combines diverse research fields, which makes it of interest to physicists as well as interdisciplinary researchers. (97 pages, 20 figures; to appear in Physics Reports.)
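    As an illustration of the kind of memory-based algorithm such surveys compare, here is a minimal user-based collaborative filtering predictor. It is purely a sketch under our own conventions (a dense rating matrix with 0 meaning "unrated"), not code from the article:

```python
import numpy as np

def predict_user_based(ratings, user, item, k=10):
    """Predict ratings[user, item] from the k most cosine-similar users
    who rated the item; 0 denotes 'unrated'.
    """
    target = ratings[user]
    norms = np.linalg.norm(ratings, axis=1) * np.linalg.norm(target)
    sims = ratings @ target / np.where(norms > 0, norms, 1.0)
    candidates = np.where(ratings[:, item] > 0)[0]
    candidates = candidates[candidates != user]          # exclude the user themself
    top = candidates[np.argsort(sims[candidates])[-k:]]  # k nearest neighbours
    if top.size == 0 or sims[top].sum() == 0:
        return 0.0                                       # no evidence to predict from
    return sims[top] @ ratings[top, item] / sims[top].sum()
```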

    Discovering and Mitigating Social Data Bias

    Exabytes of data are created online every day, and nowhere is this deluge more apparent than on social media. Naturally, finding ways to leverage this unprecedented source of human information is an active area of research. Social media platforms have become laboratories for conducting experiments about people at scales thought unimaginable only a few years ago. Researchers and practitioners use social media to extract actionable patterns, such as where aid should be distributed in a crisis. However, the validity of these patterns relies on having a representative dataset. As this dissertation shows, the data collected from social media is seldom representative of the activity of the site itself, and less so of human activity. This means that the results of many studies are limited by the quality of the data they collect.

    The finding that social media data is biased inspires the main challenge addressed by this thesis. I introduce three sets of methodologies to correct for bias. First, I design methods to deal with data collection bias: a methodology that finds bias within a social media dataset by comparing the collected data with other sources, and a data collection strategy, built around a crawling procedure, that minimizes the amount of bias in the resulting dataset. Second, I introduce a methodology to identify bots and shills within a social media dataset, directly addressing the concern that the users of a social media site are not representative; applying these methodologies allows the population under study on a social media site to better match that of the real world. Finally, the dissertation discusses perceptual biases, explains how they affect analysis, and introduces computational approaches to mitigate them.

    The results of the dissertation allow for the discovery and removal of different levels of bias within a social media dataset. This has important implications for social media mining, namely that the behavioral patterns and insights extracted from social media will be more representative of the populations under study.
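    The bias-detection methodology is described only at a high level. As a hedged illustration of the general idea of comparing a collected sample against a fuller reference source, one could run a simple goodness-of-fit test; the function and attribute choices below are hypothetical, not the dissertation’s method:

```python
import numpy as np
from scipy.stats import chisquare

def collection_bias_test(sample_counts, reference_props):
    """Goodness-of-fit check of a crawled sample against a reference source.

    sample_counts:   observed counts per category (e.g. language or region)
                     in the collected dataset.
    reference_props: those categories' proportions in the fuller reference
                     stream; both inputs are hypothetical illustrations.
    """
    sample_counts = np.asarray(sample_counts, dtype=float)
    expected = np.asarray(reference_props, dtype=float) * sample_counts.sum()
    stat, p_value = chisquare(sample_counts, f_exp=expected)
    return stat, p_value   # a small p-value flags likely collection bias
```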

    Statistical Learning Approaches to Information Filtering

    Enabling computer systems to understand human thinking or behavior has long been an exciting challenge to computer scientists. In recent years one such topic, information filtering, has emerged to help users find desired information items (e.g. movies, books, news) among large amounts of available data, and has become crucial in many applications, such as product recommendation, image retrieval, spam email filtering, news filtering, and web navigation. An information filtering system must be able to understand users' information needs. Existing approaches either infer a user's profile by exploring his or her connections to other users, i.e. collaborative filtering (CF), or by analyzing the content descriptions of liked or disliked examples annotated by the user, i.e. content-based filtering (CBF). These methods work well to some extent, but face difficulties due to a lack of insight into the problem.

    This thesis studies a wide scope of information filtering technologies. Novel and principled machine learning methods are proposed to model users' information needs. The work demonstrates that the uncertainty of user profiles and the connections between them can be effectively modelled using probability theory and Bayes' rule. As one major contribution, the work clarifies the "structure" of information filtering and gives rise to principled solutions. In summary, the thesis covers the following three aspects.

    Collaborative filtering: We develop a probabilistic model for memory-based collaborative filtering (PMCF) with clear links to classical memory-based CF. Various heuristics to improve memory-based CF have been proposed in the literature; in contrast, extensions based on PMCF can be made in a principled probabilistic way. With PMCF, we describe a CF paradigm that, instead of passively receiving data from users as in conventional CF, interacts with them and actively chooses the most informative patterns to learn, thereby greatly reducing user effort and computational cost.

    Content-based filtering: One major problem for CBF is the deficiency and high dimensionality of content-descriptive features. Information items (e.g. images or articles) are typically described by high-dimensional features with mixed attribute types that appear to have been developed independently yet are intrinsically related. We derive a generalized principal component analysis to merge high-dimensional, heterogeneous content features into a low-dimensional continuous latent space. The derived features bring great convenience to CBF, because most existing algorithms cope easily with low-dimensional continuous data and, more importantly, the extracted features highlight the intrinsic semantics of the original content features.

    Hybrid filtering: How to combine CF and CBF in a "smart" way remains one of the most challenging problems in information filtering, and little principled work exists so far. This thesis shows that people's information needs can be naturally modelled with hierarchical Bayesian thinking, where each individual's data are generated from his or her own profile model, which is itself a sample from a common distribution over the population of user profiles. Users are thus connected to each other via this common distribution. Because such a distribution is complex in real-world applications, commonly applied parametric models are too restrictive, so we introduce a nonparametric hierarchical Bayesian model based on the Dirichlet process. We derive effective and efficient algorithms to learn this model. The resulting hybrid filtering methods are surprisingly simple and intuitive, offering clear insights into previous work on pure CF, pure CBF, and hybrid filtering.
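    The Dirichlet process underpinning the hybrid model admits a compact illustration via its standard stick-breaking construction, sketched below; this shows the construction only, not the thesis’s learning algorithms:

```python
import numpy as np

def stick_breaking_weights(alpha, n_atoms, seed=None):
    """Truncated draw of Dirichlet-process mixture weights by stick-breaking:
    beta_k ~ Beta(1, alpha); pi_k = beta_k * prod_{j<k} (1 - beta_j).
    """
    rng = np.random.default_rng(seed)
    betas = rng.beta(1.0, alpha, size=n_atoms)
    # stick remaining before each break: 1, (1-b1), (1-b1)(1-b2), ...
    remaining = np.concatenate(([1.0], np.cumprod(1.0 - betas[:-1])))
    return betas * remaining   # weights; sum -> 1 as n_atoms -> infinity
```

    In the hybrid-filtering view, each weight would govern how often one candidate profile model is shared across the user population, with smaller alpha concentrating users on fewer shared profiles.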

    Automatic aspect extraction in information retrieval diversity

    In this master's thesis we describe a new automatic aspect extraction algorithm that incorporates relevance information into the dynamics of probabilistic latent semantic analysis (pLSA). A utility-biased likelihood statistical framework is described to formalize the incorporation of prior relevance information intrinsically into the dynamics of the algorithm. Moreover, a general abstract algorithm is presented to incorporate arbitrary new feature variables into the analysis. A tempering procedure is inferred for this general algorithm as an entropic regularization of the utility-biased likelihood functional, and a geometric interpretation of the algorithm is described, showing the intrinsic changes produced in the information space of the problem when different sources of prior utility estimation are provided over the same data. The general algorithm is applied to several information retrieval, recommendation and personalization tasks.

    Moreover, a set of post-processing aspect filters is presented. Characteristics of the aspect distributions such as sparsity and low entropy are identified as enhancing the overall diversity attained by the diversification algorithm; the proposed filters ensure that the final aspect space has these properties, leading to better diversity levels. An experimental setup over TREC web track 09-12 data shows that the algorithm surpasses classic pLSA as an aspect extraction tool for search diversification. Additional theoretical applications of the general procedure to information retrieval, recommendation and personalization tasks are given, leading to new relevance-aware models that incorporate several variables into the latent semantic analysis.

    Finally, the problem of optimizing the aspect space size for diversification is addressed. Analytical formulas for the dependence of diversity metrics on the choice of an automatically extracted aspect space are given under a simplified generative model of the relation between system aspects and true evaluation aspects. An experimental analysis of this dependence is performed over TREC web track data using pLSA as the aspect extraction algorithm.
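    Since the thesis builds on pLSA with tempering, a minimal tempered-EM fit of plain pLSA may help fix ideas; this is the classic Hofmann-style baseline under our own naming, and the utility-biased likelihood extension described above is deliberately not reproduced:

```python
import numpy as np

def plsa_tempered_em(n_dw, n_topics, beta=0.9, n_iter=100, seed=None):
    """Plain pLSA fitted with tempered EM (beta = 1 recovers standard EM).

    n_dw: (D, W) document-term count matrix.
    Returns P(z|d) of shape (D, K) and P(w|z) of shape (K, W).
    """
    rng = np.random.default_rng(seed)
    n_dw = np.asarray(n_dw, dtype=float)
    D, W = n_dw.shape
    p_z_d = rng.dirichlet(np.ones(n_topics), size=D)   # P(z|d), rows sum to 1
    p_w_z = rng.dirichlet(np.ones(W), size=n_topics)   # P(w|z), rows sum to 1
    for _ in range(n_iter):
        # E-step: responsibilities P(z|d,w), tempered by exponent beta
        joint = (p_z_d[:, :, None] * p_w_z[None, :, :]) ** beta   # (D, K, W)
        resp = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        weighted = resp * n_dw[:, None, :]                        # n(d,w) P(z|d,w)
        # M-step: re-estimate both conditionals from expected counts
        p_w_z = weighted.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True)
        p_z_d = weighted.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True)
    return p_z_d, p_w_z
```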