
    Evolutionary Computation Applied to Adaptive Information Filtering

    Information Filtering is concerned with filtering data streams in such a way as to leave only pertinent data (information) to be perused. When the data streams are produced in a changing environment, the filtering has to adapt as well in order to remain effective. Adaptive Information Filtering is concerned with filtering in changing environments. The changes may occur both on the transmission side (the nature of the streams can change) and on the reception side (the interest of a user can change). Weighted trigram analysis is a quick and flexible technique for describing the contents of a document. A novel application of evolutionary computation is its use in adaptive information filtering for optimizing various parameters, notably the weights associated with trigrams. The research described in this paper combines weighted trigram analysis, clustering, and a special two-pool evolutionary algorithm to create an Adaptive Information Filtering system with such useful properties as domain independence, insensitivity to spelling errors, adaptability, and optimal use of user feedback, while minimizing the amount of user feedback required to function properly. We designed a special evolutionary algorithm with a two-pool strategy for this changing environment.
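
    The trigram weights mentioned above are exactly the kind of parameters the evolutionary algorithm is used to optimize. As a minimal sketch of weighted trigram analysis (the function names, weighting scheme, and example strings below are illustrative assumptions, not the paper's actual method), a document can be profiled by its weighted character trigrams and scored against a user-interest profile:

from collections import Counter
import math

def trigram_profile(text, weights=None):
    """Count character trigrams and apply optional per-trigram weights."""
    text = text.lower()
    counts = Counter(text[i:i + 3] for i in range(len(text) - 2))
    weights = weights or {}
    return {t: c * weights.get(t, 1.0) for t, c in counts.items()}

def cosine(p, q):
    """Cosine similarity between two sparse trigram profiles."""
    dot = sum(p[t] * q[t] for t in set(p) & set(q))
    norm = math.sqrt(sum(v * v for v in p.values())) * math.sqrt(sum(v * v for v in q.values()))
    return dot / norm if norm else 0.0

# An evolutionary algorithm would evolve the weight dictionary from user feedback;
# here the weights are simply left at their defaults.
interest = trigram_profile("evolutionary computation for adaptive information filtering")
document = trigram_profile("filtering information streams adaptively with evolutionary methods")
print(cosine(interest, document))  # higher score -> more relevant to the user profile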

    Text Mining Infrastructure in R

    During the last decade, text mining has become a widely used discipline that draws on statistical and machine learning methods. We present the tm package, which provides a framework for text mining applications within R. We give a survey of text mining facilities in R and explain how typical application tasks can be carried out using our framework. We present techniques for count-based analysis methods, text clustering, text classification, and string kernels.
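
    The tm package itself is an R framework; as a language-neutral illustration only, the sketch below shows an analogous count-based pipeline (document-term matrix construction followed by clustering) in Python with scikit-learn. The tiny corpus is a stand-in example, not part of the original paper:

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import KMeans

corpus = [
    "text mining extracts patterns from documents",
    "machine learning methods classify text documents",
    "clustering groups similar documents together",
    "string kernels compare documents as character sequences",
]

# Build a document-term count matrix (the analogue of a term-document matrix in tm).
vectorizer = CountVectorizer(stop_words="english")
dtm = vectorizer.fit_transform(corpus)

# Simple count-based clustering of the documents.
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(dtm)
print(dict(zip(corpus, km.labels_)))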

    Querying Temporal Drifts at Multiple Granularities

    There exists a large body of work on online drift detection with the goal of dynamically finding and maintaining changes in data streams. In this paper, we adopt a query-based approach to drift detection. Our approach relies on a drift index, a structure that captures drift at different time granularities and enables flexible drift queries. We formalize different drift queries that represent real-world scenarios and develop query evaluation algorithms that use different materializations of the drift index as well as strategies for online index maintenance. We describe a thorough study of the performance of our algorithms on real-world and synthetic datasets with varying change rates.
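
    The abstract does not spell out the drift index structure, so the following is only a hypothetical sketch of the general idea: per-batch drift scores are folded into windows at several time granularities during online maintenance, and range queries can then be answered at a chosen granularity. All names and the aggregation rule are assumptions for the sketch:

from collections import defaultdict

class SimpleDriftIndex:
    def __init__(self, granularities=(1, 4, 16)):        # window sizes in batches
        self.granularities = granularities
        self.scores = defaultdict(dict)                   # granularity -> {window id: max drift}

    def insert(self, batch_id, drift_score):
        """Online maintenance: fold a new batch's drift score into every granularity."""
        for g in self.granularities:
            window = batch_id // g
            self.scores[g][window] = max(self.scores[g].get(window, 0.0), drift_score)

    def query(self, start_batch, end_batch, granularity, threshold):
        """Return windows at the given granularity whose drift exceeds the threshold."""
        lo, hi = start_batch // granularity, end_batch // granularity
        return sorted(w for w, s in self.scores[granularity].items()
                      if lo <= w <= hi and s > threshold)

idx = SimpleDriftIndex()
for b, s in enumerate([0.05, 0.04, 0.70, 0.06, 0.65, 0.03, 0.02, 0.80]):
    idx.insert(b, s)
print(idx.query(0, 7, granularity=4, threshold=0.5))      # coarse windows containing drift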

    Finding structure in language

    Since the Chomskian revolution, it has become apparent that natural language is richly structured, being naturally represented hierarchically, and requiring complex context-sensitive rules to define regularities over these representations. It is widely assumed that the richness of the posited structure has strong nativist implications for mechanisms which might learn natural language, since it seemed unlikely that such structures could be derived directly from the observation of linguistic data (Chomsky 1965). This thesis investigates the hypothesis that simple statistics of a large, noisy, unlabelled corpus of natural language can be exploited to discover some of the structure which exists in natural language automatically. The strategy is to initially assume no knowledge of the structures present in natural language, save that they might be found by analysing statistical regularities which pertain between a word and the words which typically surround it in the corpus. To achieve this, various statistical methods are applied to define similarity between statistical distributions, and to infer a structure for a domain given knowledge of the similarities which pertain within it. Using these tools, it is shown that it is possible to form a hierarchical classification of many domains, including words in natural language. When this is done, it is shown that all the major syntactic categories can be obtained, and the classification is both relatively complete and very much in accord with a standard linguistic conception of how words are classified in natural language. Once this has been done, the categorisation derived is used as the basis of a similar classification of short sequences of words. If these are analysed in a similar way, then several syntactic categories can be derived. These include simple noun phrases, various tensed forms of verbs, and simple prepositional phrases. Once this has been done, the same technique can be applied one level higher, and at this level simple sentences and verb phrases, as well as more complicated noun phrases and prepositional phrases, are shown to be derivable.
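
    As a toy sketch of the distributional strategy described above (the corpus, context window, and clustering settings are illustrative assumptions, not the thesis's actual setup), each word can be described by the counts of the words that immediately surround it, and the words can then be clustered hierarchically by the similarity of those context distributions:

from collections import Counter, defaultdict
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist

corpus = ("the cat sat on the mat the dog sat on the rug "
          "a cat ate the fish a dog ate the bone").split()

# Describe each word by its immediate left and right neighbours.
contexts = defaultdict(Counter)
for i, w in enumerate(corpus):
    if i > 0:
        contexts[w]["L:" + corpus[i - 1]] += 1
    if i < len(corpus) - 1:
        contexts[w]["R:" + corpus[i + 1]] += 1

words = sorted(contexts)
features = sorted({f for c in contexts.values() for f in c})
X = np.array([[contexts[w][f] for f in features] for w in words], dtype=float)

# Agglomerative clustering over context-vector distances; with a large corpus the
# resulting tree groups words into syntactic categories (nouns, verbs, determiners, ...).
Z = linkage(pdist(X, metric="cosine"), method="average")
print(words)
print(Z)   # merge tree; scipy.cluster.hierarchy.dendrogram(Z) would plot it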

    Text Document Categorization using Enhanced Sentence Vector Space Model and Bi-Gram Text Representation Model Based on Novel Fusion Techniques

    Text document classification falls under the automatic classification (also known as pattern recognition) problem in machine learning and text mining. Large text documents must be classified into specific classes to make them clear and simple to search, and classified data are easy for users to browse. The central issue in text document classification is representing the features used to classify an unknown document into predefined categories. Combinations of classifiers can be fused together to increase classification accuracy for a single text document. This paper presents a novel fusion approach to classifying text documents that considers ES-VSM and bigram representation models. ES-VSM, the Enhanced Sentence Vector Space Model, is an advanced form of the sentence-based vector space model and an extension of the simple VSM, and it is considered here for the constructive representation of text documents. The main objective of the study is to boost the accuracy of text classification by accounting for the features extracted from the text document. The proposed system concatenates two different representation models of the text documents, originally designed for two different classifiers, and feeds them as one input to a single classifier. An enhanced S-VSM and an interval-valued representation model are considered for the effective representation of text documents. A word-level neural-network bigram representation of text documents is proposed to effectively capture the semantic information present in the text data. The proposed approach improves the overall accuracy of text document classification to a significant extent. Keywords: ES-VSM, Fusion, Text Document Classification, Neural Network, Text Representation, Machine Learning. DOI: 10.7176/NMMC/93-03. Publication date: September 30th, 2020.
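
    As a simplified sketch of the fusion idea (standard TF-IDF unigrams and word bigrams stand in for ES-VSM and the neural bigram model, and the toy documents and labels are assumptions), two representations of the same documents can be concatenated and fed as one input to a single classifier:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

docs = [
    "stock markets rallied on strong quarterly earnings",
    "the team won the championship game in overtime",
    "the central bank raises interest rates again",
    "the star striker scores twice in the cup final",
]
labels = ["finance", "sports", "finance", "sports"]

# Concatenate two representation models of the same text into one fused feature vector.
fused = FeatureUnion([
    ("unigram_vsm", TfidfVectorizer(ngram_range=(1, 1))),    # stand-in for ES-VSM
    ("bigram_model", TfidfVectorizer(ngram_range=(2, 2))),   # stand-in for the bigram model
])
clf = Pipeline([("features", fused), ("classifier", LogisticRegression(max_iter=1000))])
clf.fit(docs, labels)
print(clf.predict(["the goalkeeper saves a penalty in extra time"]))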

    Advanced Probabilistic Models for Clustering and Projection

    Probabilistic modeling for data mining and machine learning problems is a fundamental research area. The general approach is to assume a generative model underlying the observed data, and estimate model parameters via likelihood maximization. It has deep probability theory as its mathematical background, and draws on a large number of methods from statistical learning, sampling theory, and Bayesian statistics. In this thesis we study several advanced probabilistic models for data clustering and feature projection, two important unsupervised learning problems. The goal of clustering is to group similar data points together to uncover the data clusters. While numerous methods exist for various clustering tasks, one important question still remains, i.e., how to automatically determine the number of clusters. The first part of the thesis answers this question from a mixture modeling perspective. A finite mixture model is first introduced for clustering, in which each mixture component is assumed to be an exponential family distribution for generality. The model is then extended to an infinite mixture model, and its strong connection to the Dirichlet process (DP), a non-parametric Bayesian framework, is uncovered. A variational Bayesian algorithm called VBDMA is derived from this new insight to learn the number of clusters automatically, and empirical studies on some 2D data sets and an image data set verify the effectiveness of this algorithm. In feature projection, we are interested in dimensionality reduction and aim to find a low-dimensional feature representation for the data. We first review the well-known principal component analysis (PCA) and its probabilistic interpretation (PPCA), and then generalize PPCA to a novel probabilistic model that is able to handle the non-linear projection known as kernel PCA. An expectation-maximization (EM) algorithm is derived for kernel PCA such that it is fast and applicable to large data sets. Then we propose a novel supervised projection method called MORP, which can take the output information into account in a supervised learning context. Empirical studies on various data sets show much better results compared to unsupervised projection and other supervised projection methods. Finally, we generalize MORP probabilistically to propose SPPCA for supervised projection, and we can also naturally extend the model to S2PPCA, a semi-supervised projection method. This allows us to incorporate both the label information and the unlabeled data into the projection process. In the third part of the thesis, we introduce a unified probabilistic model which can handle data clustering and feature projection jointly. The model can be viewed as a clustering model with projected features, and a projection model with structured documents. A variational Bayesian learning algorithm can be derived, and it turns out to iterate the clustering operations and projection operations until convergence. Superior performance can be obtained for both clustering and projection.
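
    As a small, self-contained sketch of the automatic-cluster-number idea (this uses scikit-learn's variational Dirichlet-process mixture on synthetic 2D data, which is in the same family as the DP treatment described above but is not the thesis's VBDMA algorithm), one can start with more mixture components than needed and let the DP prior drive the unused components towards zero weight:

import numpy as np
from sklearn.mixture import BayesianGaussianMixture

rng = np.random.default_rng(0)
# Synthetic 2D data with three true clusters.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(3.0, 3.0), scale=0.3, size=(100, 2)),
    rng.normal(loc=(0.0, 4.0), scale=0.3, size=(100, 2)),
])

# Fit with more components than needed; the Dirichlet-process prior prunes the extras.
dpgmm = BayesianGaussianMixture(
    n_components=10,
    weight_concentration_prior_type="dirichlet_process",
    covariance_type="full",
    random_state=0,
).fit(X)

effective = int(np.sum(dpgmm.weights_ > 0.01))
print("effective number of clusters:", effective)   # typically 3 for this data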