    Categorization of unorganized text corpora for better domain-specific language modeling

    This paper describes the process of categorizing unorganized text data gathered from the Internet into in-domain and out-of-domain data for better domain-specific language modeling and speech recognition. An algorithm for text categorization and topic detection based on the most frequent key phrases is presented. In this scheme, each document entering the text categorization process is represented by a vector space model with term weighting based on term frequency and inverse document frequency. Text documents are then classified automatically as in-domain or out-of-domain with a predefined threshold, using one of the selected distance/similarity measures to compare each document against the list of key phrases. Experimental results of language modeling and adaptation to the judicial domain show a significant relative improvement in model perplexity of about 19% and a relative decrease in the word error rate of the Slovak transcription and dictation system of about 5.54%.
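    As a rough illustration of the scheme described above, the following Python sketch maps documents and an in-domain key-phrase list into a shared TF-IDF space and splits the corpus with a cosine-similarity threshold; the function name and the threshold value are assumptions, not the paper's settings.

        # Hedged sketch: TF-IDF vector space + cosine similarity against an
        # in-domain key-phrase profile, with a predefined threshold.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        def split_in_domain(documents, key_phrases, threshold=0.1):
            """Return (in_domain, out_of_domain) document lists."""
            vectorizer = TfidfVectorizer()
            # Fit on the documents plus the key-phrase profile so that both
            # are weighted in the same term space.
            matrix = vectorizer.fit_transform(documents + [" ".join(key_phrases)])
            docs, profile = matrix[:-1], matrix[-1]
            sims = cosine_similarity(docs, profile).ravel()
            in_domain = [d for d, s in zip(documents, sims) if s >= threshold]
            out_of_domain = [d for d, s in zip(documents, sims) if s < threshold]
            return in_domain, out_of_domain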

    Latent Topic Text Representation Learning on Statistical Manifolds

    The explosive growth of text data requires effective methods to represent and classify these texts. Many text learning methods have been proposed, such as statistics-based methods, semantic similarity methods, and deep learning methods. Statistics-based methods focus on comparing the substructure of texts, which ignores the semantic similarity between different words. Semantic similarity methods learn a text representation by training word embeddings and representing a text as the average vector of all its words; however, these methods cannot clearly capture the topic diversity of words and texts. Recently, deep learning methods such as CNNs and RNNs have been studied, but the vanishing gradient problem and the time complexity of parameter selection limit their applications. In this paper, we propose a novel and efficient text learning framework named Latent Topic Text Representation Learning. Our method aims to provide an effective text representation and text measurement with latent topics. Under the assumption that words on the same topic follow a Gaussian distribution, texts are represented as a mixture of topics, i.e., a Gaussian mixture model. Our framework is able to measure text distance effectively to perform text categorization tasks by leveraging statistical manifolds. Experimental results on text representation and classification, as well as topic coherence, demonstrate the effectiveness of the proposed method.
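    A minimal sketch of the central idea, under the assumption that pretrained word vectors are available as a dict-like `word_vectors`: each text is represented as a Gaussian mixture fitted over its word embeddings, and a sampled symmetric log-likelihood gap serves as a crude stand-in for the paper's manifold-based distance.

        # Hedged sketch: text as a Gaussian mixture over word embeddings.
        import numpy as np
        from sklearn.mixture import GaussianMixture

        def text_as_gmm(tokens, word_vectors, n_topics=3, seed=0):
            """Fit a GMM over the embeddings of one text's words."""
            vecs = np.array([word_vectors[t] for t in tokens if t in word_vectors])
            # Assumes the text has at least n_topics in-vocabulary words.
            return GaussianMixture(n_components=n_topics,
                                   covariance_type="diag",
                                   random_state=seed).fit(vecs)

        def gmm_distance(gmm_a, gmm_b, n_samples=512):
            """Monte Carlo symmetric-KL-style distance between two text GMMs
            (a simple substitute for the manifold-based measure)."""
            xa, _ = gmm_a.sample(n_samples)
            xb, _ = gmm_b.sample(n_samples)
            return 0.5 * ((gmm_a.score(xa) - gmm_b.score(xa))
                          + (gmm_b.score(xb) - gmm_a.score(xb)))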

    Feature Augmentation for Improved Topic Modeling of YouTube Lecture Videos using Latent Dirichlet Allocation

    The application of topic models in text mining of educational data, and more specifically the text data obtained from lecture videos, is a largely unexplored area of research that holds great potential. This work seeks empirical evidence for an improvement in topic modeling achieved by pre-extracting bigram tokens and adding them as additional features in the Latent Dirichlet Allocation (LDA) algorithm, a widely recognized topic modeling technique. The dataset considered for analysis is a collection of transcripts of video lectures on Machine Learning scraped from YouTube. Using the cosine similarity distance measure as a metric, the experiment showed a statistically significant improvement in topic model performance over the baseline topic model, which did not use extra features, thus confirming the hypothesis. By introducing explainable features before modeling and using deep-learning-based text representation only at the post-modeling evaluation stage, the overall model interpretability is retained. This empowers educators and researchers alike not only to benefit from the LDA model in their own fields but also to play a substantial role in efforts to improve model performance. It also sets the direction for future work, which could use the feature-augmented topic model as the input to other more common text mining tasks such as document categorization and information retrieval.
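    The augmentation step can be sketched as follows with scikit-learn, using the vectorizer's n-gram range to add bigram tokens alongside unigrams; the sample transcripts and topic count are placeholders, not the paper's data or settings.

        # Hedged sketch: baseline unigram LDA vs. LDA with added bigram features.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation

        transcripts = [  # placeholder lecture-transcript snippets
            "gradient descent minimizes the loss function step by step",
            "neural networks learn feature representations with gradient descent",
        ]

        def fit_lda(docs, ngram_range, n_topics=2, seed=0):
            vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
            counts = vectorizer.fit_transform(docs)
            lda = LatentDirichletAllocation(n_components=n_topics, random_state=seed)
            return lda, lda.fit_transform(counts)

        baseline_lda, baseline_topics = fit_lda(transcripts, ngram_range=(1, 1))
        augmented_lda, augmented_topics = fit_lda(transcripts, ngram_range=(1, 2))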

    Image categorization by a classifier based on probabilistic topic model

    With the rapid increase in the number of accessible images and videos, the ability to recognize visual information is becoming more and more important for content-based information retrieval. Recently, probabilistic topic models, which were originally developed for text analysis, have been used successfully for image categorization. Usually, topics that represent the contents of an image are detected based on the underlying probabilistic model, and image categorization is then carried out using the topic distribution as the input feature. A typical method is to use a k-nearest-neighbor classifier based on L2 distance after topic discovery; in that method, the topic distribution is treated simply as a feature point. In this paper, we propose a categorization method based on a more natural use of the topic distribution, which is derived using the pLSA model. Categorization is carried out by estimating the conditional probability p(category|data). We present two types of image categorization tasks, scene classification and document image segmentation, and show that the proposed method performs very well. In addition, we examine the performance of the proposed method when only a limited number of labeled examples is available, and show that our method performs quite well even in such circumstances.
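    As a loose illustration of classifying by p(category|data) rather than by nearest neighbors in topic space, the sketch below scores categories with a naive-Bayes-style rule over doc-topic proportions; the estimation details are assumptions, and any topic model's output can stand in for pLSA here.

        # Hedged sketch: categorization via an estimated p(category | data).
        import numpy as np

        def fit_category_profiles(doc_topics, labels):
            """Per-category topic profile p(z|c) and prior p(c) from labeled docs."""
            profiles, priors = {}, {}
            for c in set(labels):
                rows = doc_topics[[i for i, y in enumerate(labels) if y == c]]
                profiles[c] = rows.mean(axis=0) + 1e-9  # smoothed p(z|c)
                priors[c] = len(rows) / len(labels)     # p(c)
            return profiles, priors

        def classify(theta, profiles, priors):
            """argmax_c  log p(c) + sum_z theta[z] * log p(z|c)."""
            scores = {c: np.log(priors[c]) + theta @ np.log(profiles[c])
                      for c in profiles}
            return max(scores, key=scores.get)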

    Minimally Supervised Categorization of Text with Metadata

    Document categorization, which aims to assign a topic label to each document, plays a fundamental role in a wide variety of applications. Despite the success of existing studies in conventional supervised document classification, they are less concerned with two real-world problems: (1) the presence of metadata: in many domains, text is accompanied by various additional information such as authors and tags; such metadata serve as compelling topic indicators and should be leveraged in the categorization framework; (2) label scarcity: labeled training samples are expensive to obtain in some cases, where categorization needs to be performed using only a small set of annotated data. In recognition of these two challenges, we propose MetaCat, a minimally supervised framework for categorizing text with metadata. Specifically, we develop a generative process describing the relationships between words, documents, labels, and metadata. Guided by the generative model, we embed text and metadata into the same semantic space to encode heterogeneous signals. Then, based on the same generative process, we synthesize training samples to address the bottleneck of label scarcity. We conduct a thorough evaluation on a wide range of datasets. Experimental results prove the effectiveness of MetaCat over many competitive baselines. (10 pages; accepted to SIGIR 2020.)
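    MetaCat's actual generative model is not reproduced here; the toy sketch below only illustrates the underlying intuition that metadata can be pushed into the same representation space as words, by appending author and tag pseudo-tokens before vectorization. All names and values are hypothetical.

        # Toy illustration only, not MetaCat: inject metadata as pseudo-tokens
        # so words and metadata land in one shared feature space.
        from sklearn.feature_extraction.text import TfidfVectorizer

        def with_metadata(text, author=None, tags=()):
            extra = [f"AUTHOR_{author}"] if author else []
            extra += [f"TAG_{t}" for t in tags]
            return text + " " + " ".join(extra)

        docs = [
            with_metadata("sparse attention for long documents",
                          author="smith", tags=("nlp", "transformers")),
            with_metadata("indexing strategies for web search",
                          author="jones", tags=("ir",)),
        ]
        matrix = TfidfVectorizer(lowercase=False).fit_transform(docs)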

    Total Variability Space for LDA-based multi-view text categorization

    Published under the title "Compact Multiview Representation of Documents Based on the Total Variability Space". Mapping text documents into an LDA-based topic space is a classical way to extract a high-level representation of text documents. Unfortunately, LDA is highly sensitive to hyper-parameters related to the number of classes or the word and topic distributions, and there is no systematic way to estimate optimal configurations in advance. Moreover, various hyper-parameter configurations offer complementary views of the document. In this paper, we propose a method based on a two-step process that first expands the representation space by using a set of topic spaces and then compacts the representation space by removing poorly relevant dimensions. These two steps are based respectively on multi-view LDA-based representation spaces and factor-analysis models. This model provides a view-independent representation of documents while extracting complementary information from a massive multi-view representation. Experiments are conducted on the DECODA conversation corpus and the Reuters-21578 textual dataset. Results show the effectiveness of the proposed multi-view compact representation paradigm. The proposed categorization system reaches an accuracy of 86.9% and 86.5% with manual and automatic transcriptions of conversations, respectively, and a macro-F1 of 80% on a classification task over the well-studied Reuters-21578 corpus, a significant gain compared to the baseline (best single topic space configuration) as well as to previously studied methods and document representations.
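    The two-step process can be sketched as follows, with several scikit-learn LDA models of different topic counts serving as the "views" and plain FactorAnalysis standing in for the paper's total-variability / factor-analysis compaction; the topic counts and output dimensionality are assumptions.

        # Hedged sketch: expand with multiple LDA views, compact with factor analysis.
        import numpy as np
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation, FactorAnalysis

        def multiview_representation(docs, topic_counts=(10, 20, 40), out_dim=16):
            counts = CountVectorizer(stop_words="english").fit_transform(docs)
            # Step 1: expansion -- one doc-topic view per hyper-parameter setting.
            views = [LatentDirichletAllocation(n_components=k, random_state=0)
                     .fit_transform(counts) for k in topic_counts]
            expanded = np.hstack(views)
            # Step 2: compaction -- a low-dimensional, view-independent factor space.
            return FactorAnalysis(n_components=out_dim).fit_transform(expanded)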