Embedding Feature Selection for Large-scale Hierarchical Classification
Large-scale Hierarchical Classification (HC) involves datasets consisting of
thousands of classes and millions of training instances with high-dimensional
features posing several big data challenges. Feature selection that aims to
select the subset of discriminant features is an effective strategy to deal
with the large-scale HC problem. It speeds up the training process, reduces the
prediction time and minimizes the memory requirements by compressing the total
size of learned model weight vectors. The majority of studies have also shown
feature selection to be competent and successful in improving the
classification accuracy by removing irrelevant features. In this work, we
investigate various filter-based feature selection methods for dimensionality
reduction to solve the large-scale HC problem. Our experimental evaluation on
text and image datasets with varying distribution of features, classes and
instances shows up to 3x order of speed-up on massive datasets and up to 45% lower
memory requirements for storing the weight vectors of the learned model without any
significant loss (improvement for some datasets) in the classification
accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.
Comment: IEEE International Conference on Big Data (IEEE BigData 2016)
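As a rough illustration of what a filter-based method does, the sketch below scores each feature independently of any classifier and keeps the top k. Mutual information is used here only as one common filter criterion; the abstract does not say which criteria the authors evaluated, and the function names and toy data are hypothetical.

```python
import math
from collections import Counter


def mutual_information(feature_values, labels):
    """Mutual information (in nats) between a discrete feature and the labels."""
    n = len(labels)
    joint = Counter(zip(feature_values, labels))
    f_counts = Counter(feature_values)
    y_counts = Counter(labels)
    mi = 0.0
    for (f, y), c in joint.items():
        # p(f, y) * log(p(f, y) / (p(f) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (f_counts[f] * y_counts[y]))
    return mi


def select_top_k(X, y, k):
    """Return the (sorted) column indices of the k highest-scoring features."""
    n_features = len(X[0])
    scores = [
        mutual_information([row[j] for row in X], y) for j in range(n_features)
    ]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Toy data (made up): feature 0 tracks the label, feature 1 is constant,
# feature 2 is independent of the label.
X = [[1, 0, 1], [1, 0, 0], [0, 0, 1], [0, 0, 0]]
y = ["a", "a", "b", "b"]

selected = select_top_k(X, y, k=1)
X_reduced = [[row[j] for j in selected] for row in X]
```

Because the scores are computed once, before training, the reduced matrix can be fed to any classifier; fewer columns also means shorter weight vectors per class, which is where the memory saving described above would come from.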
Machine Learning Based Autism Detection Using Brain Imaging
Autism Spectrum Disorder (ASD) is a group of heterogeneous developmental disabilities that manifest in early childhood. Currently, ASD is primarily diagnosed by assessing the behavioral and intellectual abilities of a child. This behavioral diagnosis can be subjective, time consuming, and inconclusive; it does not provide insight into the underlying etiology and is not suitable for early detection. Diagnosis based on brain magnetic resonance imaging (MRI), a widely used non-invasive tool, can be objective, can help understand the brain alterations in ASD, and can be suitable for early diagnosis. However, the brain morphological findings in ASD from MRI studies have been inconsistent. Moreover, there has been limited success in machine learning based ASD detection using MRI-derived brain features. In this thesis, we begin by demonstrating that the low success in ASD detection and the inconsistent findings are likely attributable to the heterogeneity of brain alterations in ASD. We then show that ASD detection can be significantly improved by mitigating the heterogeneity with the help of behavioral and demographic information. Here we demonstrate that finding brain markers in well-defined sub-groups of ASD is easier and more insightful than identifying markers across the whole spectrum. Finally, our study focused on brain MRI of a pediatric cohort (3 to 4 years) and achieved a high classification success (AUC of 95%). Results of this study indicate three main alterations in early ASD brains: 1) abnormally large ventricles, 2) highly folded cortices, and 3) low image intensity in white matter regions, suggesting myelination deficits indicative of decreased structural connectivity. Results of this thesis demonstrate that meaningful brain markers of ASD can be extracted by applying machine learning techniques to brain MRI data. This data-driven technique can be a powerful tool for early detection and for understanding the brain anatomical underpinnings of ASD.
An Overview of Deep Semi-Supervised Learning
Deep neural networks demonstrated their ability to provide remarkable
performances on a wide range of supervised learning tasks (e.g., image
classification) when trained on extensive collections of labeled data (e.g.,
ImageNet). However, creating such large datasets requires a considerable amount
of resources, time, and effort. Such resources may not be available in many
practical cases, limiting the adoption and the application of many deep
learning methods. In a search for more data-efficient deep learning methods to
overcome the need for large annotated datasets, there is a rising research
interest in semi-supervised learning and its applications to deep neural
networks to reduce the amount of labeled data required, by either developing
novel methods or adopting existing semi-supervised learning frameworks for a
deep learning setting. In this paper, we provide a comprehensive overview of
deep semi-supervised learning, starting with an introduction to the field,
followed by a summarization of the dominant semi-supervised approaches in deep
learning.
Comment: Preprint
Data-Driven Representation Learning in Multimodal Feature Fusion
Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is central to achieving improved model robustness and inference performance. This dissertation focuses on representation learning approaches as the fusion strategy. Specifically, the objective is to learn a shared latent representation that jointly exploits the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction.
We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and descriptors for activity recognition. Designed to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision and other areas. Utilizing the MKL formulation, we next describe an auto-context algorithm for learning image context via fusion with low-level descriptors. Furthermore, a principled fusion algorithm using deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems.
In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently, special design of the learning architecture is needed. In order to improve the temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model, coupled with a triplet ranking loss as the metric learning framework, is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, in order to perform community detection on multilayer graphs, a fusion algorithm is described to derive node embeddings from word embedding techniques and also exploit the complementary relational information contained in each layer of the graph.
Dissertation/Thesis. Doctoral Dissertation, Electrical Engineering 201
On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems
Nowadays, data are fundamental for companies, providing operational support by facilitating daily
transactions. Data has also become the cornerstone of strategic decision-making processes in
businesses. For this purpose, there are numerous techniques for extracting knowledge and
value from data. For example, optimisation algorithms excel at supporting decision-making
processes to improve the use of resources, time and costs in the organisation. In the current
industrial context, organisations usually rely on business processes to orchestrate their daily
activities while collecting large amounts of information from heterogeneous sources. Therefore,
the support of Big Data technologies (which are based on distributed environments) is required
given the volume, variety and speed of data. Then, in order to extract value from the data, a set
of techniques or activities is applied in an orderly way and at different stages. This set of
techniques or activities, which facilitate the acquisition, preparation, and analysis of data, is known
in the literature as Big Data pipelines.
In this thesis, the improvement of three stages of the Big Data pipelines is tackled: Data
Preparation, Data Quality assessment, and Data Analysis. These improvements can be
addressed from an individual perspective, by focussing on each stage, or from a more complex
and global perspective, implying the coordination of these stages to create data workflows.
The first stage to improve is Data Preparation, by supporting the preparation of data with
complex structures (i.e., data with various levels of nested structures, such as arrays).
Shortcomings have been found in the literature and current technologies for transforming complex
data in a simple way. Therefore, this thesis aims to improve the Data Preparation stage through
Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use cases.
While one of them is a general-purpose Data Transformation language, the other is a DSL aimed
at extracting event logs in a standard format for process mining algorithms.
The second area for improvement is related to the assessment of Data Quality. Depending on the
type of Data Analysis algorithm, poor-quality data can seriously skew the results. Optimisation
algorithms are a clear example: if the data are not sufficiently accurate and complete, the search
space can be severely affected. Therefore, this thesis formulates a methodology for modelling
Data Quality rules adjusted to the context of use, as well as a tool that facilitates the automation
of their assessment. This makes it possible to discard the data that do not meet the quality criteria defined
by the organisation. In addition, the proposal includes a framework that helps to select actions to
improve the usability of the data.
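One way to picture rule-based quality assessment is as a set of named predicates evaluated against each record, with failing records set aside together with the rules they violated. The sketch below is only an illustration of that idea: the rule names, record fields, and functions are hypothetical and are not taken from the thesis or its tool.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical context-specific rules: one completeness check, one accuracy check.
RULES: Dict[str, Rule] = {
    "completeness:customer_id": lambda r: r.get("customer_id") not in (None, ""),
    "accuracy:amount_non_negative": lambda r: (
        isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0
    ),
}


def assess(record: Record, rules: Dict[str, Rule]) -> List[str]:
    """Return the names of the rules that the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]


def split_by_quality(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Separate usable records from those that fail at least one rule."""
    usable, discarded = [], []
    for record in records:
        violations = assess(record, rules)
        if violations:
            discarded.append((record, violations))
        else:
            usable.append(record)
    return usable, discarded


records = [
    {"customer_id": "c1", "amount": 10.0},
    {"customer_id": "", "amount": 5},
    {"customer_id": "c3", "amount": -2},
]
usable, discarded = split_by_quality(records, RULES)
```

Keeping the violated rule names alongside each discarded record leaves room for a later step that chooses a repair action instead of dropping the record outright.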
The third and last proposal involves the Data Analysis stage. In this case, this thesis faces the
challenge of supporting the use of optimisation problems in Big Data pipelines. There is a lack of
methodological solutions that allow computing exhaustive optimisation problems in distributed
environments (i.e., those optimisation problems that guarantee the finding of an optimal solution
by exploring the whole search space). The resolution of this type of problem in the Big Data
context is computationally complex, and can be NP-complete. This is caused by two different
factors. On the one hand, the search space can increase significantly as the amount of data to
be processed by the optimisation algorithms increases. This challenge is addressed through a
technique to generate and group problems with distributed data. On the other hand, processing
optimisation problems with complex models and large search spaces in distributed environments
is not trivial. Therefore, a proposal is presented for a particular case in this type of scenario.
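The idea of generating and grouping problems from distributed data can be sketched as follows, under strong simplifying assumptions: rows are grouped by a key, each group becomes an independent small problem, and each problem is solved exhaustively by enumerating its whole search space. The knapsack-style objective, the field names, and the functions are hypothetical stand-ins; in a distributed engine the per-group solve would run on separate workers rather than in a local loop.

```python
from collections import defaultdict
from itertools import combinations


def group_problems(rows, key):
    """Group rows by a key so that each group forms one independent problem."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return dict(groups)


def solve_exhaustively(items, capacity):
    """Enumerate every subset and return (best_value, names) within capacity.

    Exploring all 2^n subsets guarantees an optimal answer, at exponential cost.
    """
    best = (0, ())
    for size in range(len(items) + 1):
        for subset in combinations(items, size):
            weight = sum(item["weight"] for item in subset)
            value = sum(item["value"] for item in subset)
            if weight <= capacity and value > best[0]:
                best = (value, tuple(item["name"] for item in subset))
    return best


# Made-up input rows; "site" is the grouping key.
rows = [
    {"site": "A", "name": "a1", "weight": 2, "value": 3},
    {"site": "A", "name": "a2", "weight": 3, "value": 4},
    {"site": "A", "name": "a3", "weight": 4, "value": 6},
    {"site": "B", "name": "b1", "weight": 5, "value": 1},
    {"site": "B", "name": "b2", "weight": 1, "value": 2},
]

solutions = {
    site: solve_exhaustively(items, capacity=5)
    for site, items in group_problems(rows, "site").items()
}
```

Grouping keeps each search space small: enumerating three items and two items separately costs 8 + 4 subsets, whereas treating all five rows as a single problem would cost 32.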
As a result, this thesis develops methodologies that have been published in scientific journals and
conferences. The methodologies have been implemented in software tools that are integrated with
the Apache Spark data processing engine. The solutions have been validated through tests and use cases with real datasets.