1,374 research outputs found

    Embedding Feature Selection for Large-scale Hierarchical Classification

    Full text link
    Large-scale Hierarchical Classification (HC) involves datasets consisting of thousands of classes and millions of training instances with high-dimensional features posing several big data challenges. Feature selection that aims to select the subset of discriminant features is an effective strategy to deal with large-scale HC problem. It speeds up the training process, reduces the prediction time and minimizes the memory requirements by compressing the total size of learned model weight vectors. Majority of the studies have also shown feature selection to be competent and successful in improving the classification accuracy by removing irrelevant features. In this work, we investigate various filter-based feature selection methods for dimensionality reduction to solve the large-scale HC problem. Our experimental evaluation on text and image datasets with varying distribution of features, classes and instances shows upto 3x order of speed-up on massive datasets and upto 45% less memory requirements for storing the weight vectors of learned model without any significant loss (improvement for some datasets) in the classification accuracy. Source Code: https://cs.gmu.edu/~mlbio/featureselection.Comment: IEEE International Conference on Big Data (IEEE BigData 2016

    How Biased is our Validation (Data) for {AS} Relationships?

    Get PDF

    Machine Learning Based Autism Detection Using Brain Imaging

    Get PDF
    Autism Spectrum Disorder (ASD) is a group of heterogeneous developmental disabilities that manifest in early childhood. Currently, ASD is primarily diagnosed by assessing the behavioral and intellectual abilities of a child. This behavioral diagnosis can be subjective, time consuming, inconclusive, does not provide insight on the underlying etiology, and is not suitable for early detection. Diagnosis based on brain magnetic resonance imaging (MRI)—a widely used non- invasive tool—can be objective, can help understand the brain alterations in ASD, and can be suitable for early diagnosis. However, the brain morphological findings in ASD from MRI studies have been inconsistent. Moreover, there has been limited success in machine learning based ASD detection using MRI derived brain features. In this thesis, we begin by demonstrating that the low success in ASD detection and the inconsistent findings are likely attributable to the heterogeneity of brain alterations in ASD. We then show that ASD detection can be significantly improved by mitigating the heterogeneity with the help of behavioral and demographics information. Here we demonstrate that finding brain markers in well-defined sub-groups of ASD is easier and more insightful than identifying markers across the whole spectrum. Finally, our study focused on brain MRI of a pediatric cohort (3 to 4 years) and achieved a high classification success (AUC of 95%). Results of this study indicate three main alterations in early ASD brains: 1) abnormally large ventricles, 2) highly folded cortices, and 3) low image intensity in white matter regions suggesting myelination deficits indicative of decreased structural connectivity. Results of this thesis demonstrate that the meaningful brain markers of ASD can be extracted by applying machine learning techniques on brain MRI data. This data-driven technique can be a powerful tool for early detection and understanding brain anatomical underpinnings of ASD

    An Overview of Deep Semi-Supervised Learning

    Full text link
    Deep neural networks demonstrated their ability to provide remarkable performances on a wide range of supervised learning tasks (e.g., image classification) when trained on extensive collections of labeled data (e.g., ImageNet). However, creating such large datasets requires a considerable amount of resources, time, and effort. Such resources may not be available in many practical cases, limiting the adoption and the application of many deep learning methods. In a search for more data-efficient deep learning methods to overcome the need for large annotated datasets, there is a rising research interest in semi-supervised learning and its applications to deep neural networks to reduce the amount of labeled data required, by either developing novel methods or adopting existing semi-supervised learning frameworks for a deep learning setting. In this paper, we provide a comprehensive overview of deep semi-supervised learning, starting with an introduction to the field, followed by a summarization of the dominant semi-supervised approaches in deep learning.Comment: Preprin

    Data-Driven Representation Learning in Multimodal Feature Fusion

    Get PDF
    abstract: Modern machine learning systems leverage data and features from multiple modalities to gain more predictive power. In most scenarios, the modalities are vastly different and the acquired data are heterogeneous in nature. Consequently, building highly effective fusion algorithms is at the core to achieve improved model robustness and inferencing performance. This dissertation focuses on the representation learning approaches as the fusion strategy. Specifically, the objective is to learn the shared latent representation which jointly exploit the structural information encoded in all modalities, such that a straightforward learning model can be adopted to obtain the prediction. We first consider sensor fusion, a typical multimodal fusion problem critical to building a pervasive computing platform. A systematic fusion technique is described to support both multiple sensors and descriptors for activity recognition. Targeted to learn the optimal combination of kernels, Multiple Kernel Learning (MKL) algorithms have been successfully applied to numerous fusion problems in computer vision etc. Utilizing the MKL formulation, next we describe an auto-context algorithm for learning image context via the fusion with low-level descriptors. Furthermore, a principled fusion algorithm using deep learning to optimize kernel machines is developed. By bridging deep architectures with kernel optimization, this approach leverages the benefits of both paradigms and is applied to a wide variety of fusion problems. In many real-world applications, the modalities exhibit highly specific data structures, such as time sequences and graphs, and consequently, special design of the learning architecture is needed. In order to improve the temporal modeling for multivariate sequences, we developed two architectures centered around attention models. A novel clinical time series analysis model is proposed for several critical problems in healthcare. Another model coupled with triplet ranking loss as metric learning framework is described to better solve speaker diarization. Compared to state-of-the-art recurrent networks, these attention-based multivariate analysis tools achieve improved performance while having a lower computational complexity. Finally, in order to perform community detection on multilayer graphs, a fusion algorithm is described to derive node embedding from word embedding techniques and also exploit the complementary relational information contained in each layer of the graph.Dissertation/ThesisDoctoral Dissertation Electrical Engineering 201

    On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems

    Get PDF
    Nowadays, data are fundamental for companies, providing operational support by facilitating daily transactions. Data has also become the cornerstone of strategic decision-making processes in businesses. For this purpose, there are numerous techniques that allow to extract knowledge and value from data. For example, optimisation algorithms excel at supporting decision-making processes to improve the use of resources, time and costs in the organisation. In the current industrial context, organisations usually rely on business processes to orchestrate their daily activities while collecting large amounts of information from heterogeneous sources. Therefore, the support of Big Data technologies (which are based on distributed environments) is required given the volume, variety and speed of data. Then, in order to extract value from the data, a set of techniques or activities is applied in an orderly way and at different stages. This set of techniques or activities, which facilitate the acquisition, preparation, and analysis of data, is known in the literature as Big Data pipelines. In this thesis, the improvement of three stages of the Big Data pipelines is tackled: Data Preparation, Data Quality assessment, and Data Analysis. These improvements can be addressed from an individual perspective, by focussing on each stage, or from a more complex and global perspective, implying the coordination of these stages to create data workflows. The first stage to improve is the Data Preparation by supporting the preparation of data with complex structures (i.e., data with various levels of nested structures, such as arrays). Shortcomings have been found in the literature and current technologies for transforming complex data in a simple way. Therefore, this thesis aims to improve the Data Preparation stage through Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use cases. While one of them is a general-purpose Data Transformation language, the other is a DSL aimed at extracting event logs in a standard format for process mining algorithms. The second area for improvement is related to the assessment of Data Quality. Depending on the type of Data Analysis algorithm, poor-quality data can seriously skew the results. A clear example are optimisation algorithms. If the data are not sufficiently accurate and complete, the search space can be severely affected. Therefore, this thesis formulates a methodology for modelling Data Quality rules adjusted to the context of use, as well as a tool that facilitates the automation of their assessment. This allows to discard the data that do not meet the quality criteria defined by the organisation. In addition, the proposal includes a framework that helps to select actions to improve the usability of the data. The third and last proposal involves the Data Analysis stage. In this case, this thesis faces the challenge of supporting the use of optimisation problems in Big Data pipelines. There is a lack of methodological solutions that allow computing exhaustive optimisation problems in distributed environments (i.e., those optimisation problems that guarantee the finding of an optimal solution by exploring the whole search space). The resolution of this type of problem in the Big Data context is computationally complex, and can be NP-complete. This is caused by two different factors. On the one hand, the search space can increase significantly as the amount of data to be processed by the optimisation algorithms increases. This challenge is addressed through a technique to generate and group problems with distributed data. On the other hand, processing optimisation problems with complex models and large search spaces in distributed environments is not trivial. Therefore, a proposal is presented for a particular case in this type of scenario. As a result, this thesis develops methodologies that have been published in scientific journals and conferences.The methodologies have been implemented in software tools that are integrated with the Apache Spark data processing engine. The solutions have been validated through tests and use cases with real datasets

    Web Page Classification and Hierarchy Adaptation

    Get PDF
    corecore