
    Large-scale Multi-Label Text Classification for an Online News Monitoring System

    This thesis provides a detailed exploration of numerous methods, some established and some novel, considered in the construction of a text-categorization system for use in a large-scale, online news-monitoring system known as PULS. PULS is an information extraction (IE) system consisting of a number of tools for automatically collecting named entities from text. The system also has access to large training corpora in the business domain, where documents are annotated with associated industry sectors. These assets are leveraged in the construction of a multi-label industry-sector classifier, whose output is displayed for new articles on the web-based front-end of PULS. Through a review of the background literature and direct experimentation with each stage of development, we illuminate many major challenges of multi-label classification. These challenges include: working effectively in a real-world scenario that poses time and memory restrictions; organizing and processing semi-structured, pre-annotated text corpora; handling large-scale data sets and label sets with significant class imbalances; weighing the trade-offs of different learning algorithms and feature-selection methods with respect to end-user performance; and finding meaningful evaluations for each system component. In addition to presenting the challenges associated with large-scale multi-label learning, this thesis presents a number of experiments and evaluations to determine methods which enhance overall performance. The major outcome of these experiments is a multi-stage, multi-label classifier that combines IE-based rote classification, using features extracted by the PULS system, with an array of balanced statistical classifiers. Evaluation of this multi-stage system shows improvement over a baseline classifier and, for certain evaluations, over state-of-the-art performance from the literature when tested on a commonly used corpus. Aspects of the classification method and their associated experimental results have also been published in international conference proceedings.
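    The multi-stage design described in this abstract can be sketched as a rote, keyword-driven first stage whose matches are merged with the predictions of a balanced one-vs-rest statistical classifier. The sketch below is not the PULS implementation; the toy lexicon, features, and classifier choices are illustrative assumptions.

```python
# Minimal sketch of a two-stage multi-label sector classifier (not the PULS code).
# Stage 1: rote keyword lookup; Stage 2: balanced one-vs-rest statistical model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MultiLabelBinarizer

ROTE_RULES = {"crude oil": "energy", "semiconductor": "technology"}  # toy lexicon

def rote_labels(text):
    """Sector labels triggered by exact keyword matches (the rote stage)."""
    text = text.lower()
    return {label for keyword, label in ROTE_RULES.items() if keyword in text}

docs = ["OPEC discusses crude oil output cuts", "New semiconductor fab announced"]
labels = [{"energy"}, {"technology"}]

binarizer = MultiLabelBinarizer()
Y = binarizer.fit_transform(labels)

# Balanced statistical stage: one binary classifier per industry sector.
statistical = make_pipeline(
    TfidfVectorizer(),
    OneVsRestClassifier(LogisticRegression(class_weight="balanced", max_iter=1000)),
)
statistical.fit(docs, Y)

def classify(text):
    """Union of rote matches and statistical predictions."""
    predicted = binarizer.inverse_transform(statistical.predict([text]))[0]
    return rote_labels(text) | set(predicted)

print(classify("Crude oil prices rise on supply fears"))
```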

    Unsupervised feature analysis for high dimensional big data

    In practice we often encounter the scenario where label information is unavailable, due either to the high cost of manual labeling or to users' unwillingness to label. When label information is not available, traditional supervised learning cannot be applied directly, so we need to study unsupervised methods that can work well even without supervision. Feature analysis has proven effective and important for many applications. Feature analysis is a broad research field whose topics include, but are not limited to, feature selection, feature extraction, feature construction, and feature composition (e.g., in topic discovery the learned topics can be viewed as compound features). In many real systems, it is often necessary and important to perform feature analysis to determine which individual or compound features should be used for posterior learning tasks. The effectiveness of traditional feature analysis often relies on the labels of the training examples. However, in the era of big data, label information is often unavailable, which makes unsupervised feature analysis more challenging. Two important research topics in unsupervised feature analysis are unsupervised feature selection and unsupervised feature composition (e.g., discovering topics as compound features). This naturally creates two lines of research for unsupervised feature analysis. Combined with the single-view or multi-view nature of the data, this yields a table with four cells. Except for single-view feature composition (or topic discovery), where much work has already been done (e.g., PLSA, LDA, and NMF), the other three cells correspond to new research topics on which little work has been done. For single-view unsupervised feature analysis, we propose two unsupervised feature selection methods. For multi-view unsupervised feature analysis, we focus on text-image web news data and propose a multi-view unsupervised feature selection method and a text-image topic model. Specifically, for single-view unsupervised feature selection, we propose a new method called Robust Unsupervised Feature Selection (RUFS), where pseudo cluster labels are learned via local-learning-regularized robust NMF and feature selection is performed simultaneously by robust joint l_{2,1}-norm minimization. Outliers can be handled effectively, and redundant or noisy features can be reduced. We also design a (projected) limited-memory BFGS-based, linear-time iterative algorithm to solve the optimization problem efficiently. We further study how the choice of norms for the data-fitting and feature-selection terms affects the ultimate unsupervised feature selection performance. Specifically, we propose to use a joint adaptive loss and l_2/l_0 minimization for data fitting and feature selection, and we mathematically explain the desirable properties of joint adaptive-loss and l_2/l_0 minimization over recent unsupervised feature selection models. We solve the optimization problem with an efficient iterative algorithm whose computational complexity and memory cost are linear in both sample size and feature size. For multi-view unsupervised feature selection, we propose a more effective approach for high-dimensional text-image web news data. We propose to use raw text features in label learning to avoid information loss. We propose a new multi-view unsupervised feature selection method in which image local-learning-regularized orthogonal nonnegative matrix factorization is used to learn pseudo labels, while robust joint l_{2,1}-norm minimization is performed simultaneously to select discriminative features; this encourages as much cross-view consensus on the pseudo labels as possible. For multi-view topic discovery, we study how to systematically mine topics from high-dimensional text-image web news data. This application problem is important because almost all news articles have an associated picture. Unlike traditional topic modeling, which considers text alone, the new task aims to discover heterogeneous topics from web news comprising multiple data types. We propose to tackle the problem with a regularized, nonnegativity-constrained l_{2,1}-norm minimization framework, and we present a new iterative algorithm to solve the optimization problem. The proposed single-view feature selection methods can be applied to almost all single-view data. The proposed multi-view methods are designed to process text-image web news data, but the idea generalizes naturally to any multi-view data. Practitioners can run the proposed methods to select features for posterior learning tasks, and can run our multi-view topic model to analyze and visualize topics in text-image web news corpora to help interpret the data.
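    For reference, the l_{2,1} norm that recurs throughout this abstract, together with a generic template of the joint pseudo-label-learning and feature-selection objectives this family of methods optimizes, can be written as follows. This is a schematic of the family, not the exact RUFS or multi-view formulation; the regularizers and constraints in the actual models differ.

```latex
% Row-sparsity-inducing l_{2,1} norm of a matrix W with rows w_i:
\[
  \|W\|_{2,1} \;=\; \sum_{i=1}^{d} \|w_i\|_2
              \;=\; \sum_{i=1}^{d} \sqrt{\textstyle\sum_{j} W_{ij}^{2}} .
\]
% Schematic joint objective: X is the n-by-d data matrix, G (n-by-k) holds
% pseudo cluster labels, F (k-by-d) holds cluster representatives, and
% W (d-by-k) maps features to pseudo labels; the rows of W with the largest
% l_2 norms identify the selected features.
\[
  \min_{F \ge 0,\; G \ge 0,\; W}\;
    \|X - G F\|_{2,1}
    \;+\; \alpha \,\|X W - G\|_{2,1}
    \;+\; \beta \,\|W\|_{2,1} .
\]
```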

    Multi-Label Classifier Chains for Bird Sound

    Bird sound data collected with unattended microphones for automatic surveys, or with mobile devices for citizen science, typically contain multiple simultaneously vocalizing birds of different species. However, few works have considered the multi-label structure in birdsong. We propose to use an ensemble of classifier chains combined with a histogram-of-segments representation for multi-label classification of birdsong. The proposed method is compared with binary relevance and three multi-instance multi-label learning (MIML) algorithms from prior work (which focus more on structure in the sound and less on structure in the label sets). Experiments are conducted on two real-world birdsong datasets and show that the proposed method usually outperforms binary relevance (using the same features and base classifier), and is better in some cases and worse in others compared to the MIML algorithms. Comment: 6 pages, 1 figure, submission to the ICML 2013 workshop on bioacoustics. Note: this is a minor revision; the blind-submission format has been replaced with one that shows author names, and a few corrections have been made.
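    A brief, hedged sketch of the core idea, an ensemble of classifier chains versus binary relevance, is shown below using scikit-learn; the random features stand in for the histogram-of-segments representation and are illustrative only.

```python
# Sketch: binary relevance vs. an ensemble of classifier chains (scikit-learn).
# Random features stand in for the histogram-of-segments representation.
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.multioutput import ClassifierChain, MultiOutputClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 32)                       # placeholder acoustic features
Y = (rng.rand(200, 10) < 0.2).astype(int)   # 10 species, multi-label targets

# Binary relevance: one independent classifier per species label.
binary_relevance = MultiOutputClassifier(RandomForestClassifier(random_state=0))
binary_relevance.fit(X, Y)

# Ensemble of classifier chains: each chain uses a random label order and feeds
# earlier predictions to later classifiers, capturing label correlations that
# binary relevance ignores; the ensemble averages over label orders.
chains = [
    ClassifierChain(RandomForestClassifier(random_state=i),
                    order="random", random_state=i)
    for i in range(10)
]
for chain in chains:
    chain.fit(X, Y)

Y_pred = np.mean([chain.predict_proba(X) for chain in chains], axis=0) >= 0.5
```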

    Automatically extracting polarity-bearing topics for cross-domain sentiment classification

    The joint sentiment-topic (JST) model was previously proposed to detect sentiment and topic simultaneously from text. The only supervision required for JST model learning is a set of domain-independent polarity word priors. In this paper, we modify the JST model by incorporating word polarity priors into the topic-word Dirichlet priors. We study the polarity-bearing topics extracted by JST and show that, by augmenting the original feature space with polarity-bearing topics, in-domain supervised classifiers learned from the augmented feature representation achieve state-of-the-art performance of 95% on the movie review data and an average of 90% on the multi-domain sentiment dataset. Furthermore, using feature augmentation and selection according to the information gain criterion for cross-domain sentiment classification, our proposed approach performs better than or comparably to previous approaches, while being much simpler and requiring no difficult parameter tuning.
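    The feature-augmentation step can be sketched as appending document-topic proportions to the original bag-of-words features and pruning the combined space with an information-gain-style criterion. In the sketch below, plain LDA stands in for the JST model (so the topics are not polarity-bearing) and mutual information is used as the selection score; the data and names are illustrative assumptions.

```python
# Sketch: augment bag-of-words features with topic proportions, then select
# features by an information-gain-style (mutual information) criterion.
# Plain LDA is a stand-in for the JST model; data and names are illustrative.
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.linear_model import LogisticRegression

docs = [
    "great acting and a truly moving plot",
    "a warm, funny and well written film",
    "dull, predictable and badly shot",
    "a boring script with terrible pacing",
]
labels = [1, 1, 0, 0]  # 1 = positive, 0 = negative sentiment

vectorizer = CountVectorizer()
X_words = vectorizer.fit_transform(docs)

lda = LatentDirichletAllocation(n_components=2, random_state=0)
X_topics = lda.fit_transform(X_words)              # document-topic proportions

# Augmented feature space: original word features plus topic features.
X_augmented = hstack([X_words, csr_matrix(X_topics)])

# Information-gain-style selection over the augmented space.
selector = SelectKBest(mutual_info_classif, k=5)
X_selected = selector.fit_transform(X_augmented, labels)

classifier = LogisticRegression().fit(X_selected, labels)
```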