Kernel Multivariate Analysis Framework for Supervised Subspace Learning: A Tutorial on Linear and Kernel Multivariate Methods
Feature extraction and dimensionality reduction are important tasks in many
fields of science dealing with signal processing and analysis. The relevance of
these techniques is increasing as current sensory devices are developed with
ever higher resolution, and problems involving multimodal data sources become
more common. A plethora of feature extraction methods are available in the
literature collectively grouped under the field of Multivariate Analysis (MVA).
This paper provides a uniform treatment of several methods: Principal Component
Analysis (PCA), Partial Least Squares (PLS), Canonical Correlation Analysis
(CCA) and Orthonormalized PLS (OPLS), as well as their non-linear extensions
derived by means of the theory of reproducing kernel Hilbert spaces. We also
review their connections to other methods for classification and statistical
dependence estimation, and introduce some recent developments to deal with the
extreme cases of large-scale and low-sized problems. To illustrate the wide
applicability of these methods in both classification and regression problems,
we analyze their performance in a benchmark of publicly available data sets,
and pay special attention to specific real applications involving audio
processing for music genre prediction and hyperspectral satellite images for
Earth and climate monitoring.
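As a rough illustration of the simplest method in this family, linear PCA, the following Python sketch finds the first principal direction by power iteration on the sample covariance matrix. The function name, the toy data, and the use of power iteration are assumptions made for illustration; they are not taken from the tutorial itself.

```python
import math


def first_principal_component(rows, iters=200):
    """Return the unit-norm leading eigenvector of the sample covariance."""
    n, d = len(rows), len(rows[0])
    # Centre every feature so the covariance is taken about the mean.
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix C = X^T X / (n - 1).
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    # Power iteration: repeatedly apply C and renormalise.
    v = [1.0] * d
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Hypothetical toy data lying close to the line y = 2x.
data = [[1.0, 2.0], [2.0, 4.1], [3.0, 5.9], [4.0, 8.0]]
pc = first_principal_component(data)
print(pc)
```

Projecting the centred data onto `pc` gives the one-dimensional feature that retains the most variance. The kernel variants covered by the tutorial replace the covariance matrix with a kernel (Gram) matrix, which this sketch does not attempt.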
Knowledge Base Population using Semantic Label Propagation
A crucial aspect of a knowledge base population system that extracts new
facts from text corpora is the generation of training data for its relation
extractors. In this paper, we present a method that maximizes the effectiveness
of newly trained relation extractors at a minimal annotation cost. Manual
labeling can be significantly reduced by Distant Supervision, which is a method
to construct training data automatically by aligning a large text corpus with
an existing knowledge base of known facts. For example, all sentences
mentioning both 'Barack Obama' and 'US' may serve as positive training
instances for the relation born_in(subject,object). However, distant
supervision typically results in a highly noisy training set: many training
sentences do not really express the intended relation. We propose to combine
distant supervision with minimal manual supervision in a technique called
feature labeling, to eliminate noise from the large and noisy initial training
set, resulting in a significant increase of precision. We further improve on
this approach by introducing the Semantic Label Propagation method, which uses
the similarity between low-dimensional representations of candidate training
instances to extend the training set in order to increase recall while
maintaining high precision. Our proposed strategy for generating training data
is studied and evaluated on an established test collection designed for
knowledge base population tasks. The experimental results show that the
Semantic Label Propagation strategy leads to substantial performance gains when
compared to existing approaches, while requiring an almost negligible manual
annotation effort.
Comment: Submitted to Knowledge Based Systems, special issue on Knowledge
Bases for Natural Language Processing
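The distant supervision step described in this abstract can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function name `distant_labels`, the toy knowledge base, the example sentences, and the naive substring matching are all hypothetical, and none of it reproduces the authors' system.

```python
def distant_labels(sentences, kb_facts, relation):
    """Mark a sentence as a positive training instance for `relation`
    whenever it mentions both arguments of a known fact."""
    labeled = []
    for sentence in sentences:
        for subj, rel, obj in kb_facts:
            # Naive substring matching; a real system would use entity linking.
            if rel == relation and subj in sentence and obj in sentence:
                labeled.append((sentence, subj, obj))
    return labeled


# Hypothetical knowledge base with a single known fact.
kb = [("Barack Obama", "born_in", "US")]

corpus = [
    "Barack Obama was born in the US.",
    "Barack Obama returned to the US after a state visit.",
    "Angela Merkel met Barack Obama in Berlin.",
]

positives = distant_labels(corpus, kb, "born_in")
for sentence, subj, obj in positives:
    print(f"born_in({subj}, {obj}) <- {sentence}")
```

The second sentence is selected even though it does not express born_in, which is the kind of noise that the feature labeling and Semantic Label Propagation steps in the abstract are meant to reduce.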
Optimized complex power quality classifier using one vs. rest support vector machine
Nowadays, power quality issues are becoming a significant research topic because of the increasing inclusion of very sensitive devices and considerable renewable energy sources. In general, most previous power quality classification techniques focused on single power quality events and did not include an optimal feature selection process. This paper presents a classification system that employs the Wavelet Transform and the RMS profile to extract the main features of measured waveforms containing either single or complex disturbances. A data mining process is designed to select the optimal set of features that best describes each disturbance present in the waveform. Support Vector Machine binary classifiers organized in a "One vs. Rest" architecture are individually optimized to classify single and complex disturbances. The parameters that govern the performance of each binary classifier are also individually adjusted using a grid search algorithm that helps them achieve optimal performance. This specialized process significantly improves the total classification accuracy. Several single and complex disturbances were simulated in order to train and test the algorithm. The results show that the classifier is capable of identifying >99% of single disturbances and >97% of complex disturbances.
Affiliation: de Yong, David Marcelo. Universidad Nacional de Río Cuarto. Facultad de Ingeniería. Departamento de Electricidad y Electrónica; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba; Argentina
Affiliation: Bhowmik, Sudipto. Nexant Inc; United States
Affiliation: Magnago, Fernando. Universidad Nacional de Río Cuarto. Facultad de Ingeniería. Departamento de Electricidad y Electrónica; Argentina. Consejo Nacional de Investigaciones Científicas y Técnicas. Centro Científico Tecnológico Conicet - Córdoba; Argentina
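The one-vs-rest arrangement with a separate grid search per binary classifier can be sketched as follows. This is only an illustration under several assumptions: a small linear SVM trained by subgradient descent stands in for whatever SVM solver and kernel the authors used, the two-dimensional features and class names "A", "B", "C" are hypothetical placeholders for the wavelet and RMS features, and the grid covers a single regularisation parameter `lam`.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def train_linear_svm(X, y, lam, epochs=200, lr=0.01):
    """Binary linear SVM (labels +1/-1) fitted by subgradient descent on the
    L2-regularised hinge loss. A stand-in for a full SVM solver."""
    w, b = [0.0] * len(X[0]), 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * (dot(w, xi) + b) < 1
            # Weight decay from the regulariser, plus a hinge step if needed.
            w = [wj - lr * lam * wj + (lr * yi * xj if violated else 0.0)
                 for wj, xj in zip(w, xi)]
            if violated:
                b += lr * yi
    return w, b


def accuracy(model, X, y):
    w, b = model
    hits = sum((1 if dot(w, xi) + b >= 0 else -1) == yi for xi, yi in zip(X, y))
    return hits / len(X)


def fit_one_vs_rest(X_tr, y_tr, X_val, y_val, grid):
    """Train one binary classifier per class, each tuned by its own grid search."""
    models = {}
    for cls in sorted(set(y_tr)):
        bin_tr = [1 if label == cls else -1 for label in y_tr]
        bin_val = [1 if label == cls else -1 for label in y_val]
        best = None
        for lam in grid:
            model = train_linear_svm(X_tr, bin_tr, lam)
            score = accuracy(model, X_val, bin_val)
            if best is None or score > best["score"]:
                best = {"lam": lam, "model": model, "score": score}
        models[cls] = best
    return models


def predict(models, x):
    """Return the class whose binary classifier gives the largest score."""
    return max(models,
               key=lambda c: dot(models[c]["model"][0], x) + models[c]["model"][1])


# Hypothetical 2-D features for three disturbance classes.
X_train = [(0.0, 0.0), (0.5, 0.2), (0.2, 0.5), (-0.3, 0.1),
           (4.0, 0.0), (4.5, 0.3), (3.8, -0.2), (4.2, 0.4),
           (2.0, 4.0), (2.3, 4.4), (1.8, 3.7), (2.1, 4.2)]
y_train = ["A"] * 4 + ["B"] * 4 + ["C"] * 4
X_val = [(0.1, 0.3), (4.1, 0.1), (2.2, 3.9)]
y_val = ["A", "B", "C"]
GRID = [0.001, 0.01, 0.1]

models = fit_one_vs_rest(X_train, y_train, X_val, y_val, GRID)
for cls, info in models.items():
    print(cls, "lam =", info["lam"], "validation accuracy =", info["score"])
print(predict(models, (0.1, 0.1)))
```

Here `predict` returns a single label by taking the highest score. For waveforms with several simultaneous disturbances, one plausible variant is to report every class whose binary classifier scores above zero; whether the paper does this is not stated in the abstract.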
The Role of Text Pre-processing in Sentiment Analysis
It is challenging to understand the latest trends and summarise the state of general opinion about products because of the great diversity and size of social media data, and this creates the need for automated, real-time opinion extraction and mining. Mining online opinion is a form of sentiment analysis that is treated as a difficult text classification task. In this paper, we explore the role of text pre-processing in sentiment analysis, and report on experimental results that demonstrate that with appropriate feature selection and representation, sentiment analysis accuracies using support vector machines (SVM) in this area may be significantly improved. The level of accuracy achieved is shown to be comparable to that achieved in topic categorisation, although sentiment analysis is considered to be a much harder problem in the literature.
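A minimal sketch of the kind of text pre-processing this abstract refers to is given below. The specific steps (lowercasing, stripping URLs and HTML tags, removing stop words, bag-of-words counts), the tiny stop-word list, and the example review are assumptions chosen for illustration; the abstract does not say which steps the authors applied.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "is", "it", "this", "and", "to", "of"}


def preprocess(text):
    """Lowercase, strip URLs and HTML tags, tokenise, drop stop words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    text = re.sub(r"<[^>]+>", " ", text)       # drop HTML tags
    tokens = re.findall(r"[a-z']+", text)      # keep alphabetic tokens
    return [t for t in tokens if t not in STOP_WORDS]


def bag_of_words(text):
    """Represent a document as token counts, a common input for an SVM."""
    return Counter(preprocess(text))


review = ("This phone is GREAT!!! See <b>more</b> at "
          "http://example.com and it's great value.")
features = bag_of_words(review)
print(features)
```

The resulting counts would normally be mapped to a fixed vocabulary and weighted (for example by TF-IDF) before being passed to a classifier.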