
    A Statistical Approach for Multilingual Document Clustering and Topic Extraction from Clusters

    2000 Mathematics Subject Classification: 62H30. This paper describes a statistics-based methodology for unsupervised document clustering and for extracting topics from the resulting clusters. Multiword lexical units (MWUs) of any length are automatically extracted from corpora using the LiPXtractor extractor, a language-independent statistics-based tool. The MWUs are taken as base features to describe documents. These features are transformed and a document similarity matrix is constructed. From this matrix, a reduced set of features is selected using an approach based on Principal Component Analysis. Then, using Model-Based Clustering Analysis software, the best number of clusters can be determined. Precision and recall for document-cluster assignment are above 90%. The most important MWUs are extracted from each cluster and taken as document cluster topics. Results on new-document classification are only briefly mentioned.
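    The PCA-based reduction step described above can be illustrated with a short sketch. This is not the paper's code: the input similarity matrix `S`, the component count, and plain eigendecomposition PCA are all assumptions made for the example.

```python
import numpy as np

def pca_reduce(S, n_components=2):
    """Select a reduced feature set from a document similarity matrix
    by projecting onto the top principal components."""
    Sc = S - S.mean(axis=0)                  # center each column
    cov = Sc.T @ Sc / (len(S) - 1)           # sample covariance
    vals, vecs = np.linalg.eigh(cov)         # eigenvalues in ascending order
    top = vecs[:, np.argsort(vals)[::-1][:n_components]]
    return Sc @ top                          # reduced document features

# Example: a small random "similarity" matrix for 10 documents
rng = np.random.default_rng(0)
S = rng.random((10, 10))
X = pca_reduce(S, n_components=2)
```

    Model-based clustering (e.g. fitting Gaussian mixtures and choosing the number of components by a criterion such as BIC) could then be run over the reduced features `X` to pick the best number of clusters.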

    Distributed Representations of Sentences and Documents

    Many machine learning algorithms require the input to be represented as a fixed-length feature vector. When it comes to texts, one of the most common fixed-length features is bag-of-words. Despite their popularity, bag-of-words features have two major weaknesses: they lose the ordering of the words and they also ignore semantics of the words. For example, "powerful," "strong" and "Paris" are equally distant. In this paper, we propose Paragraph Vector, an unsupervised algorithm that learns fixed-length feature representations from variable-length pieces of texts, such as sentences, paragraphs, and documents. Our algorithm represents each document by a dense vector which is trained to predict words in the document. Its construction gives our algorithm the potential to overcome the weaknesses of bag-of-words models. Empirical results show that Paragraph Vectors outperform bag-of-words models as well as other techniques for text representations. Finally, we achieve new state-of-the-art results on several text classification and sentiment analysis tasks.
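    The core idea — a dense per-document vector trained to predict the document's words — can be sketched in a toy distributed-bag-of-words style (similar in spirit to the paper's PV-DBOW variant). This is an illustrative sketch only: the paper's implementation also has a PV-DM variant and uses hierarchical softmax or negative sampling rather than the full softmax below, and all names and hyperparameters here are assumptions.

```python
import numpy as np

def train_pv_dbow(docs, dim=8, epochs=300, lr=0.1, seed=0):
    """Toy PV-DBOW: learn one dense vector per document by training it
    to predict (via softmax over the vocabulary) the words in that document."""
    rng = np.random.default_rng(seed)
    vocab = sorted({w for d in docs for w in d})
    idx = {w: i for i, w in enumerate(vocab)}
    D = rng.normal(scale=0.1, size=(len(docs), dim))   # document vectors
    W = rng.normal(scale=0.1, size=(dim, len(vocab)))  # shared softmax weights
    for _ in range(epochs):
        for d, doc in enumerate(docs):
            for w in doc:
                logits = D[d] @ W
                p = np.exp(logits - logits.max())
                p /= p.sum()
                p[idx[w]] -= 1.0                       # softmax gradient wrt logits
                grad_d = W @ p
                W -= lr * np.outer(D[d], p)
                D[d] -= lr * grad_d
    return D

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

docs = [["strong", "powerful"], ["strong", "powerful"], ["paris", "france"]]
D = train_pv_dbow(docs)
```

    Documents that contain the same words end up with similar vectors, which is what lets the learned representations capture more than raw word counts.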

    A Multi-label Text Classification Framework: Using Supervised and Unsupervised Feature Selection Strategy

    Text classification, the task of assigning metadata to documents, requires significant human time and effort. Since online-generated content is growing explosively, manually annotating large-scale, unstructured data has become a challenge. Recently, various state-of-the-art text mining methods have been applied to the classification process based on keyword extraction. However, when these keywords are used as features in the classification task, the number of feature dimensions is commonly large, and selecting which keywords to use as features is itself a major challenge. In particular, traditional machine learning algorithms applied to big data incur very long computation times. Moreover, about 80% of real-world data is unstructured and unlabeled, so conventional supervised feature selection methods cannot be used directly to select entities from massive data. Instead, statistical strategies are usually employed to extract features from unlabeled data according to their importance scores. We propose a novel method to extract key features effectively before feeding them into the classification task. Another challenge in text classification is the multi-label problem: the assignment of multiple non-exclusive labels to a document, which makes classification more complicated than the single-label case. To address these issues, we develop a framework for extracting features and reducing data dimensionality that solves the multi-label problem on both labeled and unlabeled datasets. To reduce data dimensionality, we develop a hybrid feature selection method that extracts meaningful features according to the importance of each feature. Word2Vec is applied to represent each document by a feature vector for document categorization on the big dataset.
    The unsupervised approach is used to extract features from real online-generated data for text classification. Our unsupervised feature selection method is applied to extract depression symptoms from social media such as Twitter. In the future, these depression symptoms will be used for depression self-screening and diagnosis.
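    An importance-score ranking of the kind the abstract describes could, for instance, look like a mean-TF-IDF scoring over unlabeled documents. This is a hedged, minimal sketch of one possible statistical strategy, not the authors' hybrid method (which also combines supervised selection and Word2Vec); the toy corpus is invented for the example.

```python
import math
from collections import Counter

def select_features(docs, k=3):
    """Unsupervised feature selection: score each term by its summed
    TF-IDF across the corpus and keep the top-k terms."""
    n_docs = len(docs)
    df = Counter(w for d in docs for w in set(d))       # document frequency
    scores = Counter()
    for d in docs:
        tf = Counter(d)
        for w, c in tf.items():
            scores[w] += (c / len(d)) * math.log(n_docs / df[w])
    return [w for w, _ in scores.most_common(k)]

docs = [
    "sad sad hopeless".split(),
    "tired sad sleepless".split(),
    "happy happy weather".split(),
]
```

    Terms that are frequent within a document but rare across the corpus score highest, so they survive the cut as candidate features for the downstream classifier.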

    Unsupervised learning of document image types

    In a system where medical paper documents have been converted to a digital format by scanning, understanding the document types that exist in the system could enable vital data indexing and retrieval. When millions of document images have been scanned, it is infeasible to expect a supervised algorithm, or a tedious human effort, to discover the document types; the most sensible and practical approach is an unsupervised algorithm. Many clustering techniques have been developed for unsupervised classification, but many rely on all data being presented at once, on the number of clusters being known in advance, or both. This thesis presents a two-threshold clustering scheme that relies on a hierarchical decomposition of the features. On a subset of document images, it discovers document types at an acceptable level and confidently classifies unknown document images.
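    A two-threshold incremental clustering of the general kind described can be sketched as follows. This is an illustrative sketch under assumed Euclidean feature vectors; the thesis's hierarchical feature decomposition is not reproduced, and the threshold values and sample points are invented for the example.

```python
import math

def two_threshold_cluster(points, t_assign=1.0, t_new=3.0):
    """Incremental clustering with two distance thresholds: a point joins
    its nearest cluster if it is closer than t_assign, starts a new cluster
    if it is farther than t_new from every centroid, and is otherwise left
    unassigned ("unknown")."""
    centroids, clusters, unknown = [], [], []
    for p in points:
        if centroids:
            dist, j = min((math.dist(p, c), j) for j, c in enumerate(centroids))
        else:
            dist, j = float("inf"), -1
        if dist <= t_assign:
            clusters[j].append(p)
            n = len(clusters[j])
            centroids[j] = [m + (x - m) / n for m, x in zip(centroids[j], p)]
        elif dist > t_new:
            centroids.append(list(p))            # open a new cluster
            clusters.append([p])
        else:
            unknown.append(p)                    # ambiguous: neither close nor far
    return clusters, unknown

# Two tight groups, plus one ambiguous point between the thresholds
feats = [(0.0, 0.0), (0.5, 0.0), (10.0, 10.0), (10.5, 10.0), (2.0, 0.0)]
clusters, unknown = two_threshold_cluster(feats)
```

    Because points arrive one at a time and the number of clusters is never fixed, this kind of scheme sidesteps the two limitations the abstract names: needing all data at once and needing the cluster count in advance.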

    An Empirical Evaluation of Zero Resource Acoustic Unit Discovery

    Acoustic unit discovery (AUD) is the process of automatically identifying a categorical acoustic unit inventory from speech and producing the corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero resource setting, where expert-provided linguistic knowledge and transcribed speech are unavailable. Therefore, to further facilitate the zero-resource AUD process, in this paper we demonstrate that acoustic feature representations can be significantly improved by (i) performing linear discriminant analysis (LDA) in an unsupervised, self-trained fashion, and (ii) leveraging resources of other languages by building a multilingual bottleneck (BN) feature extractor for effective cross-lingual generalization. Moreover, we perform comprehensive evaluations of AUD efficacy on multiple downstream speech applications, and their correlated performance suggests that AUD evaluations are feasible using different alternative language resources when only a subset of these evaluation resources is available in typical zero resource applications.
    Comment: 5 pages, 1 figure; accepted for publication at ICASSP 201
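    The "unsupervised, self-trained" LDA idea can be sketched as: derive pseudo-labels from a first unsupervised pass over the features (here a crude k-means, standing in for the paper's AUD tokenization), then fit a Fisher LDA projection on those pseudo-labels. Everything below — the k-means initialization, the blob data, the dimensions — is an assumption made for illustration, not the paper's pipeline.

```python
import numpy as np

def self_trained_lda(X, n_clusters=2, n_dims=1, iters=20):
    """Self-trained LDA sketch: pseudo-label the frames with a crude
    k-means, then learn a discriminant projection from those labels."""
    # Farthest-point initialization, then plain k-means for pseudo-labels
    C = [X[0]]
    for _ in range(1, n_clusters):
        d2 = np.min([((X - c) ** 2).sum(1) for c in C], axis=0)
        C.append(X[np.argmax(d2)])
    C = np.array(C)
    for _ in range(iters):
        labels = np.argmin(((X[:, None] - C[None]) ** 2).sum(-1), axis=1)
        C = np.array([X[labels == k].mean(0) for k in range(n_clusters)])
    # Fisher LDA on the pseudo-labels
    mu = X.mean(0)
    d = X.shape[1]
    Sw = np.zeros((d, d))                        # within-class scatter
    Sb = np.zeros((d, d))                        # between-class scatter
    for k in range(n_clusters):
        Xk = X[labels == k]
        Sw += (Xk - C[k]).T @ (Xk - C[k])
        Sb += len(Xk) * np.outer(C[k] - mu, C[k] - mu)
    vals, vecs = np.linalg.eig(np.linalg.pinv(Sw) @ Sb)
    order = np.argsort(-vals.real)[:n_dims]
    return (X - mu) @ vecs[:, order].real, labels

# Example: two well-separated blobs of synthetic "acoustic features"
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.5, (50, 3)), rng.normal(4, 0.5, (50, 3))])
Z, labels = self_trained_lda(X)
```

    No transcriptions are used anywhere: the discriminant directions come entirely from labels the model assigned to itself, which is what makes the trick viable in a zero resource setting.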