    Categorization of unorganized text corpora for better domain-specific language modeling

    This paper describes the process of categorizing unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the text categorization process is represented by a vector space model with term weighting based on term frequency and inverse document frequency. Text documents are then classified automatically as in-domain or out-of-domain with a predefined threshold, using one of the selected distance/similarity measures to compare each document against the list of key phrases. Experimental results of language modeling and adaptation to the judicial domain show a significant relative improvement in model perplexity of about 19% and a relative decrease in the word error rate of the Slovak transcription and dictation system of about 5.54%.
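    As a rough illustration of the scheme described above, the following Python sketch maps documents and an in-domain key-phrase list into a shared TF-IDF space and splits the corpus with a cosine-similarity threshold; the function name and the threshold value are assumptions, not the paper's settings.

        # Hedged sketch: TF-IDF vector space + cosine similarity against an
        # in-domain key-phrase profile, with a predefined threshold.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def split_in_domain(documents, key_phrases, threshold=0.1):
            """Return (in_domain, out_of_domain) document lists."""
            vectorizer = TfidfVectorizer()
            # Fit on the documents plus the key-phrase profile so that both
            # are weighted in the same term space.
            matrix = vectorizer.fit_transform(documents + [" ".join(key_phrases)])
            docs, profile = matrix[:-1], matrix[-1]
            sims = cosine_similarity(docs, profile).ravel()
            in_domain = [d for d, s in zip(documents, sims) if s >= threshold]
            out_of_domain = [d for d, s in zip(documents, sims) if s < threshold]
            return in_domain, out_of_domain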

    Latent Topic Text Representation Learning on Statistical Manifolds

    The explosive growth of text data requires effective methods to represent and classify these texts. Many text learning methods have been proposed, such as statistics-based methods, semantic similarity methods, and deep learning methods. Statistics-based methods focus on comparing the substructure of texts, which ignores the semantic similarity between different words. Semantic similarity methods learn a text representation by training word embeddings and representing a text as the average vector of all its words; however, these methods cannot clearly capture the topic diversity of words and texts. Recently, deep learning methods such as CNNs and RNNs have been studied, but the vanishing gradient problem and the time complexity of parameter selection limit their applications. In this paper, we propose a novel and efficient text learning framework named Latent Topic Text Representation Learning. Our method aims to provide an effective text representation and text measurement with latent topics. Under the assumption that words on the same topic follow a Gaussian distribution, texts are represented as a mixture of topics, i.e., a Gaussian mixture model. Our framework is able to measure text distance effectively to perform text categorization tasks by leveraging statistical manifolds. Experimental results on text representation and classification, as well as topic coherence, demonstrate the effectiveness of the proposed method.
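    A minimal sketch of the central idea, under the assumption that pretrained word vectors are available as a dict-like `word_vectors`: each text is represented as a Gaussian mixture fitted over its word embeddings, and a sampled symmetric log-likelihood gap serves as a crude stand-in for the paper's manifold-based distance.

        # Hedged sketch: text as a Gaussian mixture over word embeddings.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def text_as_gmm(tokens, word_vectors, n_topics=3, seed=0):
            """Fit a GMM over the embeddings of one text's words."""
            vecs = np.array([word_vectors[t] for t in tokens if t in word_vectors])
            # Assumes the text has at least n_topics in-vocabulary words.
            return GaussianMixture(n_components=n_topics,
                                   covariance_type="diag",
                                   random_state=seed).fit(vecs)

        def gmm_distance(gmm_a, gmm_b, n_samples=512):
            """Monte Carlo symmetric-KL-style distance between two text GMMs
            (a simple substitute for the manifold-based measure)."""
            xa, _ = gmm_a.sample(n_samples)
            xb, _ = gmm_b.sample(n_samples)
            return 0.5 * ((gmm_a.score(xa) - gmm_b.score(xa))
                          + (gmm_b.score(xb) - gmm_a.score(xb)))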

    Feature Augmentation for Improved Topic Modeling of YouTube Lecture Videos using Latent Dirichlet Allocation

    The application of topic models in text mining of educational data, and more specifically the text data obtained from lecture videos, is a largely unexplored area of research that holds great potential. This work seeks empirical evidence for an improvement in topic modeling achieved by pre-extracting bigram tokens and adding them as additional features in the Latent Dirichlet Allocation (LDA) algorithm, a widely recognized topic modeling technique. The dataset considered for analysis is a collection of transcripts of video lectures on Machine Learning scraped from YouTube. Using the cosine similarity distance measure as a metric, the experiment showed a statistically significant improvement in topic model performance over the baseline topic model, which did not use extra features, thus confirming the hypothesis. By introducing explainable features before modeling and using deep-learning-based text representation only at the post-modeling evaluation stage, the overall model interpretability is retained. This empowers educators and researchers alike not only to benefit from the LDA model in their own fields but also to play a substantial role in efforts to improve model performance. It also sets the direction for future work, which could use the feature-augmented topic model as the input to other more common text mining tasks such as document categorization and information retrieval.
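    The augmentation step can be sketched as follows with scikit-learn, using the vectorizer's n-gram range to add bigram tokens alongside unigrams; the sample transcripts and topic count are placeholders, not the paper's data or settings.

        # Hedged sketch: baseline unigram LDA vs. LDA with added bigram features.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        transcripts = [  # placeholder lecture-transcript snippets
            "gradient descent minimizes the loss function step by step",
            "neural networks learn feature representations with gradient descent",
        ]

        def fit_lda(docs, ngram_range, n_topics=2, seed=0):
            vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
            counts = vectorizer.fit_transform(docs)
            lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
            return lda, lda.fit_transform(counts)

        baseline_lda, baseline_topics = fit_lda(transcripts, ngram_range=(1, 1))
        augmented_lda, augmented_topics = fit_lda(transcripts, ngram_range=(1, 2))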

    Image categorization by a classifier based on probabilistic topic model

    With the rapid increase in the number of accessible images and videos, the ability to recognize visual information is becoming more and more important for content-based information retrieval. Recently, probabilistic topic models, which were originally developed for text analysis, have been used successfully for image categorization. Usually, topics that represent the contents of an image are detected based on the underlying probabilistic model, and image categorization is then carried out using the topic distribution as the input feature. A typical method is to use a k-nearest-neighbor classifier based on L2 distance after topic discovery; in that method, the topic distribution is treated simply as a feature point. In this paper, we propose a categorization method based on a more natural use of the topic distribution, which is derived using the pLSA model. Categorization is carried out by estimating the conditional probability p(category|data). We present two types of image categorization tasks, scene classification and document image segmentation, and show that the proposed method performs very well. In addition, we examine the performance of the proposed method when only a limited number of labeled examples is available, and show that our method performs quite well even in such circumstances.
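    As a loose illustration of classifying by p(category|data) rather than by nearest neighbors in topic space, the sketch below scores categories with a naive-Bayes-style rule over doc-topic proportions; the estimation details are assumptions, and any topic model's output can stand in for pLSA here.

        # Hedged sketch: categorization via an estimated p(category | data).
        import numpy as np

        def fit_category_profiles(doc_topics, labels):
            """Per-category topic profile p(z|c) and prior p(c) from labeled docs."""
            profiles, priors = {}, {}
            for c in set(labels):
                rows = doc_topics[[i for i, y in enumerate(labels) if y == c]]
                profiles[c] = rows.mean(axis=0) + 1e-9  # smoothed p(z|c)
                priors[c] = len(rows) / len(labels)     # p(c)
            return profiles, priors

        def classify(theta, profiles, priors):
            """argmax_c  log p(c) + sum_z theta[z] * log p(z|c)."""
            scores = {c: np.log(priors[c]) + theta @ np.log(profiles[c])
                      for c in profiles}
            return max(scores, key=scores.get)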

    Minimally Supervised Categorization of Text with Metadata

    Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real-world problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags; such metadata serve as compelling topic indicators and should be leveraged in the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework for categorizing text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines. (10 pages; accepted to SIGIR 2020.)
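    MetaCat's actual generative model is not reproduced here; the toy sketch below only illustrates the underlying intuition that metadata can be pushed into the same representation space as words, by appending author and tag pseudo-tokens before vectorization. All names and values are hypothetical.

        # Toy illustration only, not MetaCat: inject metadata as pseudo-tokens
        # so words and metadata land in one shared feature space.
        from sklearn.feature_extraction.text import TfidfVectorizer

        def with_metadata(text, author=None, tags=()):
            extra = [f"AUTHOR_{author}"] if author else []
            extra += [f"TAG_{t}" for t in tags]
            return text + " " + " ".join(extra)

        docs = [
            with_metadata("sparse attention for long documents",
                          author="smith", tags=("nlp", "transformers")),
            with_metadata("indexing strategies for web search",
                          author="jones", tags=("ir",)),
        ]
        matrix = TfidfVectorizer(lowercase=False).fit_transform(docs)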

    Total Variability Space for LDA-based multi-view text categorization

    Published under the title "Compact Multiview Representation of Documents Based on the Total Variability Space". Mapping text documents into an LDA-based topic space is a classical way to extract a high-level representation of text documents. Unfortunately, LDA is highly sensitive to hyper-parameters related to the number of classes or the word and topic distributions, and there is no systematic way to estimate optimal configurations in advance. Moreover, various hyper-parameter configurations offer complementary views of the document. In this paper, we propose a method based on a two-step process that first expands the representation space by using a set of topic spaces and then compacts the representation space by removing poorly relevant dimensions. These two steps are based respectively on multi-view LDA-based representation spaces and factor-analysis models. This model provides a view-independent representation of documents while extracting complementary information from a massive multi-view representation. Experiments are conducted on the DECODA conversation corpus and the Reuters-21578 textual dataset. Results show the effectiveness of the proposed multi-view compact representation paradigm. The proposed categorization system reaches an accuracy of 86.9% and 86.5% with manual and automatic transcriptions of conversations, respectively, and a macro-F1 of 80% on a classification task over the well-studied Reuters-21578 corpus, a significant gain compared to the baseline (best single topic space configuration) as well as to previously studied methods and document representations.
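    The two-step process can be sketched as follows, with several scikit-learn LDA models of different topic counts serving as the "views" and plain FactorAnalysis standing in for the paper's total-variability / factor-analysis compaction; the topic counts and output dimensionality are assumptions.

        # Hedged sketch: expand with multiple LDA views, compact with factor analysis.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation, FactorAnalysis

        def multiview_representation(docs, topic_counts=(10, 20, 40), out_dim=16):
            counts = CountVectorizer(stop_words="english").fit_transform(docs)
            # Step 1: expansion -- one doc-topic view per hyper-parameter setting.
            views = [LatentDirichletAllocation(n_components=k, random_state=0)
                     .fit_transform(counts) for k in topic_counts]
            expanded = np.hstack(views)
            # Step 2: compaction -- a low-dimensional, view-independent factor space.
            return FactorAnalysis(n_components=out_dim).fit_transform(expanded)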