Nuevos Modelos de Aprendizaje Híbrido para Clasificación y Ordenamiento Multi-Etiqueta (New Hybrid Learning Models for Multi-Label Classification and Ranking)
Over the last decade, multi-label learning has become an important research task, largely because of the growing number of real-world problems that involve multi-label data. This thesis studied two problems on multi-label data: improving the performance of algorithms on complex multi-label data, and improving the performance of algorithms using unlabeled data. The first problem was addressed with feature estimation methods. The effectiveness of the proposed feature estimation methods was evaluated on improving the performance of nearest-neighbor algorithms, by parametrizing the distance functions used to retrieve the nearest examples. The effectiveness of the estimation methods in the feature selection task was also demonstrated. In addition, a nearest-neighbor algorithm inspired by the data gravitation-based classification approach was developed. This algorithm provides a suitable balance between efficiency and effectiveness on complex multi-label data. The second problem was solved with active learning techniques, which reduce the costs of labeling data and of training a better model. Two active learning strategies were proposed. The first strategy solves the multi-label active learning problem effectively and efficiently by combining two measures that represent the utility of an unlabeled example. The second strategy addresses batch-mode multi-label active learning: a multi-objective problem optimizing three measures was formulated, and the resulting optimization problem was solved with an evolutionary algorithm.
As complementary results derived from this thesis, a software tool was developed that supports the implementation of active learning methods and experimentation in this area. In addition, two approaches were proposed that evaluate the performance of active learning techniques more adequately and robustly than the approach commonly used in the literature. All methods proposed in this thesis were evaluated in an adequate experimental framework: numerous datasets were used, and the performance of the algorithms was compared against other state-of-the-art methods. The results, which were verified with non-parametric statistical tests, demonstrate the effectiveness of the proposed methods and thus confirm the hypotheses put forward in this thesis.
In the last decade, multi-label learning has become an important area of research
due to the large number of real-world problems that contain multi-label data. This
doctoral thesis focuses on the multi-label learning paradigm. Two problems were
studied: first, improving the performance of algorithms on complex multi-label
data, and second, improving performance by exploiting unlabeled data.
The first problem was solved by means of feature estimation methods. The effectiveness
of the proposed feature estimation methods was evaluated by improving
the performance of multi-label lazy algorithms. Parametrizing the distance
functions with a weight vector made it possible to retrieve examples with relevant
label sets for classification. The effectiveness of the feature
estimation methods in the feature selection task was also demonstrated. In addition, a lazy
algorithm based on a data gravitation model was proposed. This lazy algorithm
offers a good trade-off between effectiveness and efficiency in
multi-label lazy learning.
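The idea of parametrizing a distance function with a weight vector can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the function names, the toy data, the majority-vote rule, and the hand-picked weights are not taken from the thesis, which estimates the weights with its own methods.

```python
import math

def weighted_distance(a, b, weights):
    """Euclidean distance with one non-negative weight per feature."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))

def predict_labels(query, examples, weights, k=3, threshold=0.5):
    """Predict a label set by voting among the k nearest examples.

    examples: list of (feature_vector, label_set) pairs.
    A label is predicted when more than `threshold` of the neighbors carry it.
    """
    nearest = sorted(
        examples, key=lambda ex: weighted_distance(query, ex[0], weights)
    )[:k]
    votes = {}
    for _, labels in nearest:
        for label in labels:
            votes[label] = votes.get(label, 0) + 1
    return {label for label, count in votes.items() if count / k > threshold}

# Toy data (hypothetical): feature 0 is informative, feature 1 is noise.
train = [
    ((0.0, 9.0), {"a"}),
    ((0.1, 0.0), {"a", "b"}),
    ((0.2, 5.0), {"a"}),
    ((1.0, 0.1), {"c"}),
    ((1.1, 0.2), {"c"}),
    ((0.9, 0.3), {"b", "c"}),
]
query = (0.05, 0.2)

# Uniform weights let the noisy feature decide the neighborhood.
uniform = predict_labels(query, train, weights=(1.0, 1.0))
# A weight vector that suppresses the noisy feature changes the neighbors.
learned = predict_labels(query, train, weights=(1.0, 0.0))
```

With uniform weights the noisy feature pulls in neighbors from the wrong region. Setting its weight to zero retrieves the three examples that share the query's value on the informative feature.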
The second problem was solved by means of active learning techniques. The active
learning methods reduced the costs of the data labeling process and of
training an accurate model. Two active learning strategies were proposed. The
first strategy effectively solves the multi-label active learning problem. In this
strategy, two measures that represent the utility of an unlabeled example were
defined and combined. The second active learning strategy
addresses the batch-mode active learning problem, where the aim is to select a
batch of unlabeled examples that are informative and whose information redundancy
is minimal. Batch-mode active learning was formulated as a multi-objective
problem in which three measures were optimized. The multi-objective problem was
solved with an evolutionary algorithm.
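Combining two utility measures to pick the next example to label might look like the sketch below. The abstract does not name the two measures, so the ones used here (prediction uncertainty and distance to the labeled set) and their combination by product are assumptions chosen purely for illustration. The batch-mode, multi-objective strategy is not covered.

```python
def uncertainty(probabilities):
    """Mean closeness of the per-label probabilities to 0.5, scaled to [0, 1]."""
    return sum(1.0 - abs(2.0 * p - 1.0) for p in probabilities) / len(probabilities)

def novelty(features, labeled_features):
    """Euclidean distance to the closest already-labeled example."""
    return min(
        sum((x - y) ** 2 for x, y in zip(features, seen)) ** 0.5
        for seen in labeled_features
    )

def select_query(pool, labeled_features):
    """Return the index of the pool example with the highest combined score.

    pool: list of (feature_vector, per_label_probabilities) pairs, where the
    probabilities are assumed to come from the current model.
    """
    scores = [
        uncertainty(probs) * novelty(feats, labeled_features)
        for feats, probs in pool
    ]
    return max(range(len(pool)), key=scores.__getitem__)

# Hypothetical labeled set and unlabeled pool.
labeled = [(0.0, 0.0), (1.0, 1.0)]
pool = [
    ((0.1, 0.0), (0.5, 0.5)),    # very uncertain, but close to a labeled example
    ((3.0, 3.0), (0.99, 0.01)),  # far from labeled data, but confidently predicted
    ((2.0, 0.0), (0.6, 0.4)),    # fairly uncertain and fairly novel
]
chosen = select_query(pool, labeled)
```

Multiplying the two scores favors examples that do reasonably well on both measures over examples that are extreme on only one.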
This thesis also led to the creation of a computational framework for developing
active learning methods and supporting experimentation in the active
learning area. In addition, a methodology based on non-parametric
tests that allows a more adequate evaluation of active learning performance was
proposed. All proposed methods were evaluated by means of extensive
experimental studies. Several multi-label datasets from different domains were used, and
the methods were compared to the most significant state-of-the-art algorithms. The
results were validated using non-parametric statistical tests. The evidence showed
the effectiveness of the proposed methods, supporting the hypotheses formulated at
the beginning of this thesis.
Conditional Graphical Lasso for Multi-label Image Classification
© 2016 IEEE. Multi-label image classification aims to predict multiple labels for a single image that contains diverse content. Various techniques have been developed to improve classification performance by utilizing label correlations. However, existing methods either neglect image features when exploiting label correlations or lack the ability to learn image-dependent conditional label structures. In this paper, we develop conditional graphical Lasso (CGL) to handle these challenges. CGL provides a unified Bayesian framework for structure and parameter learning conditioned on image features. We formulate multi-label prediction as a CGL inference problem, which is solved by a mean field variational approach. CGL learning is efficient thanks to a tailored proximal gradient procedure derived by applying the maximum a posteriori (MAP) methodology. CGL performs competitively for multi-label image classification on the benchmark datasets MULAN scene, PASCAL VOC 2007 and PASCAL VOC 2012, compared with state-of-the-art multi-label classification algorithms.
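For readers unfamiliar with proximal gradient methods, the following is a generic sketch of the technique applied to a toy L1-regularized problem. It is not the tailored CGL procedure: the abstract does not state the CGL objective, so the objective, step size, and function names here are assumptions for illustration only.

```python
def soft_threshold(value, amount):
    """Proximal operator of amount * |x|: shrink value toward zero."""
    if value > amount:
        return value - amount
    if value < -amount:
        return value + amount
    return 0.0

def proximal_gradient(grad, x0, step, l1_weight, iterations=200):
    """Minimize smooth(x) + l1_weight * ||x||_1 by proximal gradient descent.

    grad: function returning the gradient of the smooth part at x.
    """
    x = list(x0)
    for _ in range(iterations):
        g = grad(x)
        # Gradient step on the smooth part, then the L1 proximal step.
        x = [
            soft_threshold(xi - step * gi, step * l1_weight)
            for xi, gi in zip(x, g)
        ]
    return x

# Toy problem (hypothetical): minimize 0.5 * ||x - target||^2 + 0.5 * ||x||_1.
target = [3.0, 0.2, -1.5]
solution = proximal_gradient(
    grad=lambda x: [xi - ti for xi, ti in zip(x, target)],
    x0=[0.0, 0.0, 0.0],
    step=0.5,
    l1_weight=0.5,
)
```

For this toy objective the minimizer is known in closed form: each coordinate of `target` shrunk toward zero by 0.5, with the small middle coordinate set exactly to zero. That exact zero is the sparsity-inducing behavior that makes L1 penalties useful for structure learning.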
Local selection of features and its applications to image search and annotation
In multimedia applications, direct representations of data objects typically involve hundreds or thousands of features. Given a query object, the similarity between the query object and a database object can be computed as the distance between their feature vectors. The neighborhood of the query object consists of those database objects that are close to the query object. The semantic quality of the neighborhood, which can be measured as the proportion of neighboring objects that share the same class label as the query object, is crucial for many applications, such as content-based image retrieval and automated image annotation. However, due to the existence of noisy or irrelevant features, errors introduced into similarity measurements are detrimental to the neighborhood quality of data objects.
One way to alleviate the negative impact of noisy features is to use feature selection techniques in data preprocessing. From the original vector space, feature selection techniques select a subset of features, which can be used subsequently in supervised or unsupervised learning algorithms for better performance. However, their performance on improving the quality of data neighborhoods is rarely evaluated in the literature. In addition, most traditional feature selection techniques are global, in the sense that they compute a single set of features across the entire database. As a consequence, the possibility that the feature importance may vary across different data objects or classes of objects is neglected.
To compute a better neighborhood structure for objects in high-dimensional feature spaces, this dissertation proposes several techniques for selecting features that are important to the local neighborhood of individual objects. These techniques are then applied to image applications such as content-based image retrieval and image label propagation. Firstly, an iterative K-NN graph construction method for image databases is proposed. A local variant of the Laplacian Score is designed for the selection of features for individual images. Noisy features are detected and sparsified iteratively from the original standardized feature vectors. This technique is incorporated into an approximate K-NN graph construction method so as to improve the semantic quality of the graph. Secondly, in a content-based image retrieval system, a generalized version of the Laplacian Score is used to compute different feature subspaces for images in the database. For online search, a query image is ranked in the feature spaces of database images. Those database images for which the query image is ranked highly are selected as the query results. Finally, a supervised method for the local selection of image features is proposed, for refining the similarity graph used in an image label propagation framework. By using only the selected features to compute the edges leading from labeled image nodes to unlabeled image nodes, better annotation accuracy can be achieved.
Experimental results on several datasets are provided in this dissertation to demonstrate the effectiveness of the proposed techniques for the local selection of features, and for the image applications under consideration.
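As background for the Laplacian Score mentioned above, here is a minimal sketch of the standard, global Laplacian Score computed over a k-NN graph with heat-kernel weights. The local and generalized variants proposed in the dissertation are not reproduced. The function name, the `bandwidth` parameter, and the toy data are assumptions for illustration.

```python
import math

def laplacian_scores(rows, k=2, bandwidth=1.0):
    """Return one Laplacian Score per feature (lower means the feature
    better preserves the neighborhood structure of the data)."""
    n, m = len(rows), len(rows[0])

    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Symmetric k-NN graph with heat-kernel edge weights.
    S = [[0.0] * n for _ in range(n)]
    for i in range(n):
        neighbours = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: sqdist(rows[i], rows[j]),
        )[:k]
        for j in neighbours:
            S[i][j] = S[j][i] = math.exp(-sqdist(rows[i], rows[j]) / bandwidth)

    degree = [sum(row) for row in S]
    total = sum(degree)
    scores = []
    for r in range(m):
        f = [row[r] for row in rows]
        # Remove the degree-weighted mean of the feature.
        mean = sum(fi * di for fi, di in zip(f, degree)) / total
        f = [fi - mean for fi in f]
        # f^T L f equals half the weighted sum of squared differences on edges.
        smooth = 0.5 * sum(
            S[i][j] * (f[i] - f[j]) ** 2 for i in range(n) for j in range(n)
        )
        variance = sum(fi * fi * di for fi, di in zip(f, degree))
        scores.append(smooth / variance if variance > 0 else float("inf"))
    return scores

# Toy data (hypothetical): feature 0 separates two clusters, feature 1 alternates.
data = [
    (0.0, 1.0), (0.1, -1.0), (0.2, 1.0),
    (5.0, -1.0), (5.1, 1.0), (5.2, -1.0),
]
scores = laplacian_scores(data)
```

On this data the cluster-separating feature receives the lower score, so a selection step that keeps low-scoring features would retain it and drop the alternating one.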
Streaming Feature Grouping and Selection (SFGS) for Big Data Classification
Real-time data has always been essential for organizations whose business depends on fast data delivery. Today, organizations understand the importance of real-time data analysis for continuing to benefit from the data they generate. Real-time data analysis is also known as real-time analytics, streaming analytics, real-time streaming analytics, and event processing. Stream processing is the key to getting results in real time: it processes the data stream as it arrives. Streaming data are generated dynamically, and the full stream is unknown or even infinite. Such data become massive and diverse and form what is known as a big data challenge. In machine learning, streaming feature selection has long been a preferred method for preprocessing streaming data. Recently, feature grouping, which can measure the hidden information between selected features, has begun to gain attention. This dissertation's main contribution is to address the extremely high dimensionality of streaming big data by delivering a streaming feature grouping and selection algorithm. The literature review also presents a comprehensive survey of current streaming feature selection approaches and highlights the state-of-the-art algorithms in this area. The proposed algorithm groups similar features together to reduce redundancy and handles the stream of features in an online fashion. It was implemented and evaluated on benchmark datasets against state-of-the-art streaming feature selection algorithms and feature grouping techniques. The results showed better prediction accuracy than the state-of-the-art algorithms.
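The general idea of grouping similar features as they arrive can be sketched as follows. This is not the SFGS algorithm, whose details the abstract does not give. The similarity measure (absolute Pearson correlation), the threshold, and the rule for choosing a group representative are all assumptions made for illustration.

```python
import math

def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)

def group_streaming_features(stream, target, similarity=0.9):
    """Assign each arriving feature to the first group whose representative
    it correlates with, or open a new group.

    stream: iterable of (name, values) pairs, consumed one feature at a time.
    The representative of a group is its member most correlated with target.
    """
    groups = []
    for name, values in stream:
        for group in groups:
            _, rep_values = group["representative"]
            if abs(pearson(values, rep_values)) >= similarity:
                group["members"].append(name)
                if abs(pearson(values, target)) > abs(pearson(rep_values, target)):
                    group["representative"] = (name, values)
                break
        else:
            # No existing group is similar enough: start a new one.
            groups.append({"members": [name], "representative": (name, values)})
    return groups

# Hypothetical stream of three features and a numeric target.
target = [1, 2, 3, 4, 5, 6]
stream = [
    ("f1", [1, 2, 3, 4, 5, 7]),
    ("f2", [2, 4, 6, 8, 10, 12]),
    ("f3", [5, 1, 4, 2, 6, 3]),
]
groups = group_streaming_features(stream, target)
selected = [group["representative"][0] for group in groups]
```

Keeping one representative per group is one simple way to reduce redundancy: `f1` and `f2` carry nearly the same information, so only the one more strongly related to the target is selected.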
Using data mining to repurpose German language corpora. An evaluation of data-driven analysis methods for corpus linguistics
A growing number of studies report interesting insights gained from existing data resources. Among those, there are analyses on textual data, giving reason to consider such methods for linguistics as well. However, the field of corpus linguistics usually works with purposefully collected, representative language samples that aim to answer only a limited set of research questions.
This thesis aims to shed some light on the potential of data-driven analysis based on machine learning and predictive modelling for corpus linguistic studies, investigating the possibility of repurposing existing German language corpora for linguistic inquiry by using methodologies developed for data science and computational linguistics. The study focuses on predictive modelling and machine-learning-based data mining and gives a detailed overview and evaluation of currently popular strategies and methods for analysing corpora with computational methods.
The thesis first introduces strategies and methods that have already been used on language data, discusses how they can assist corpus linguistic analysis, and refers to available toolkits and software as well as to state-of-the-art research and further references. The introduced methodological toolset is then applied in two differently shaped corpus studies that utilize readily available corpora for German. The first study explores linguistic correlates of holistic text quality ratings on student essays, while the second deals with age-related language features in computer-mediated communication and interprets age prediction models to answer a set of research questions based on previous research in the field. Both studies give linguistic insights that integrate into the current understanding of the investigated phenomena in German, and they systematically test the methodological toolset introduced beforehand, allowing a detailed discussion of the added value and remaining challenges of machine-learning-based data mining methods in corpus linguistics at the end of the thesis.
Ranking to Learn and Learning to Rank: On the Role of Ranking in Pattern Recognition Applications
The last decade has seen a revolution in the theory and application of
machine learning and pattern recognition. Through these advancements, variable
ranking has emerged as an active and growing research area and it is now
beginning to be applied to many new problems. The rationale behind this fact is
that many pattern recognition problems are by nature ranking problems. The main
objective of a ranking algorithm is to sort objects according to some criteria,
so that the most relevant items appear early in the produced result list.
Ranking methods can be analyzed from two different methodological perspectives:
ranking to learn and learning to rank. The former aims at studying methods and
techniques to sort objects for improving the accuracy of a machine learning
model. Enhancing a model's performance can be challenging at times. For example,
in pattern classification tasks, different data representations can complicate
and hide the different explanatory factors of variation behind the data. In
particular, hand-crafted features contain many cues that are either redundant
or irrelevant, which reduces the overall accuracy of the classifier.
In such cases, feature selection is used: by producing ranked lists of
features, it helps to filter out the unwanted information. Moreover, in real-time
systems (e.g., visual trackers), ranking approaches are used as optimization
procedures that improve the robustness of a system that must cope with the high
variability of image streams that change over time. Conversely,
learning to rank is necessary in the construction of ranking models for
information retrieval, biometric authentication, re-identification, and
recommender systems. In this context, the ranking model's purpose is to sort
objects according to their degrees of relevance, importance, or preference as
defined in the specific application.
Comment: European PhD Thesis. arXiv admin note: text overlap with
arXiv:1601.06615, arXiv:1505.06821, arXiv:1704.02665 by other authors
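The "ranking to learn" idea of producing a ranked list of features so that unwanted ones can be filtered out can be illustrated with a simple filter. The criterion used here, a Fisher-style ratio of between-class to within-class scatter, is an assumption chosen for illustration and is not claimed to be one of the thesis's methods. The data and names are hypothetical.

```python
def fisher_score(values, labels):
    """Ratio of between-class scatter to within-class scatter for one feature."""
    by_class = {}
    for value, label in zip(values, labels):
        by_class.setdefault(label, []).append(value)
    overall_mean = sum(values) / len(values)
    between = sum(
        len(group) * (sum(group) / len(group) - overall_mean) ** 2
        for group in by_class.values()
    )
    within = sum(
        sum((value - sum(group) / len(group)) ** 2 for value in group)
        for group in by_class.values()
    )
    return between / within if within > 0 else float("inf")

def rank_features(columns, labels):
    """Return feature names sorted from most to least discriminative."""
    return sorted(
        columns,
        key=lambda name: fisher_score(columns[name], labels),
        reverse=True,
    )

# Hypothetical two-class data with three features.
labels = [0, 0, 0, 1, 1, 1]
columns = {
    "noise": [0.2, 0.9, 0.5, 0.4, 0.8, 0.3],
    "strong": [1.0, 1.1, 0.9, 3.0, 3.2, 2.8],
    "weak": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
}
ranking = rank_features(columns, labels)
top_two = ranking[:2]
```

Truncating the ranked list, as `top_two` does, is the filtering step: features at the bottom of the ranking are treated as redundant or irrelevant and dropped before training.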