14 research outputs found

    XLearn : learning activity labels across heterogeneous datasets

    Get PDF
    Sensor-driven systems often need to map sensed data into meaningfully labelled activities to classify the phenomena being observed. A motivating and challenging example comes from human activity recognition in which smart home and other datasets are used to classify human activities to support applications such as ambient assisted living, health monitoring, and behavioural intervention. Building a robust and meaningful classifier needs annotated ground truth, labelled with what activities are actually being observed—and acquiring high-quality, detailed, continuous annotations remains a challenging, time-consuming, and error-prone task, despite considerable attention in the literature. In this article, we use knowledge-driven ensemble learning to develop a technique that can combine classifiers built from individually labelled datasets, even when the labels are sparse and heterogeneous. The technique both relieves individual users of the burden of annotation and allows activities to be learned individually and then transferred to a general classifier. We evaluate our approach using four third-party, real-world smart home datasets and show that it enhances activity recognition accuracies even when given only a very small amount of training data.PostprintPeer reviewe

    Doctor of Philosophy

    Get PDF
    dissertationMachine learning is the science of building predictive models from data that automatically improve based on past experience. To learn these models, traditional learning algorithms require labeled data. They also require that the entire dataset fits in the memory of a single machine. Labeled data are available or can be acquired for small and moderately sized datasets but curating large datasets can be prohibitively expensive. Similarly, massive datasets are usually too huge to fit into the memory of a single machine. An alternative is to distribute the dataset over multiple machines. Distributed learning, however, poses new challenges as most existing machine learning techniques are inherently sequential. Additionally, these distributed approaches have to be designed keeping in mind various resource limitations of real-world settings, prime among them being intermachine communication. With the advent of big datasets machine learning algorithms are facing new challenges. Their design is no longer limited to minimizing some loss function but, additionally, needs to consider other resources that are critical when learning at scale. In this thesis, we explore different models and measures for learning with limited resources that have a budget. What budgetary constraints are posed by modern datasets? Can we reuse or combine existing machine learning paradigms to address these challenges at scale? How does the cost metrics change when we shift to distributed models for learning? These are some of the questions that have been investigated in this thesis. The answers to these questions hold the key to addressing some of the challenges faced when learning on massive datasets. In the first part of this thesis, we present three different budgeted scenarios that deal with scarcity of labeled data and limited computational resources. The goal is to leverage transfer information from related domains to learn under budgetary constraints. Our proposed techniques comprise semisupervised transfer, online transfer and active transfer. In the second part of this thesis, we study distributed learning with limited communication. We present initial sampling based results, as well as, propose communication protocols for learning distributed linear classifiers

    A novel spectral-spatial co-training algorithm for the transductive classification of hyperspectral imagery data

    Get PDF
    The automatic classification of hyperspectral data is made complex by several factors, such as the high cost of true sample labeling coupled with the high number of spectral bands, as well as the spatial correlation of the spectral signature. In this paper, a transductive collective classifier is proposed for dealing with all these factors in hyperspectral image classification. The transductive inference paradigm allows us to reduce the inference error for the given set of unlabeled data, as sparsely labeled pixels are learned by accounting for both labeled and unlabeled information. The collective inference paradigm allows us to manage the spatial correlation between spectral responses of neighboring pixels, as interacting pixels are labeled simultaneously. In particular, the innovative contribution of this study includes: (1) the design of an application-specific co-training schema to use both spectral information and spatial information, iteratively extracted at the object (set of pixels) level via collective inference; (2) the formulation of a spatial-aware example selection schema that accounts for the spatial correlation of predicted labels to augment training sets during iterative learning and (3) the investigation of a diversity class criterion that allows us to speed-up co-training classification. Experimental results validate the accuracy and efficiency of the proposed spectral-spatial, collective, co-training strategy

    Learning Visual Classifiers From Limited Labeled Images

    Get PDF
    Recognizing humans and their activities from images and video is one of the key goals of computer vision. While supervised learning algorithms like Support Vector Machines and Boosting have offered robust solutions, they require large amount of labeled data for good performance. It is often difficult to acquire large labeled datasets due to the significant human effort involved in data annotation. However, it is considerably easier to collect unlabeled data due to the availability of inexpensive cameras and large public databases like Flickr and YouTube. In this dissertation, we develop efficient machine learning techniques for visual classification from small amount of labeled training data by utilizing the structure in the testing data, labeled data in a different domain and unlabeled data. This dissertation has three main parts. In the first part of the dissertation, we consider how multiple noisy samples available during testing can be utilized to perform accurate visual classification. Such multiple samples are easily available in video-based recognition problem, which is commonly encountered in visual surveillance. Specifically, we study the problem of unconstrained human recognition from iris images. We develop a Sparse Representation-based selection and recognition scheme, which learns the underlying structure of clean images. This learned structure is utilized to develop a quality measure, and a quality-based fusion scheme is proposed to combine the varying evidence. Furthermore, we extend the method to incorporate privacy, an important requirement inpractical biometric applications, without significantly affecting the recognition performance. In the second part, we analyze the problem of utilizing labeled data in a different domain to aid visual classification. We consider the problem of shifts in acquisition conditions during training and testing, which is very common in iris biometrics. In particular, we study the sensor mismatch problem, where the training samples are acquired using a sensor much older than the one used for testing. We provide one of the first solutions to this problem, a kernel learning framework to adapt iris data collected from one sensor to another. Extensive evaluations on iris data from multiple sensors demonstrate that the proposed method leads to considerable improvement in cross sensor recognition accuracy. Furthermore, since the proposed technique requires minimal changes to the iris recognition pipeline, it can easily be incorporated into existing iris recognition systems. In the last part of the dissertation, we analyze how unlabeled data available during training can assist visual classification applications. Here, we consider still image-based vision applications involving humans, where explicit motion cues are not available. A human pose often conveys not only the configuration of the body parts, but also implicit predictive information about the ensuing motion. We propose a probabilistic framework to infer this dynamic information associated with a human pose, using unlabeled and unsegmented videos available during training. The inference problem is posed as a non-parametric density estimation problem on non-Euclidean manifolds. Since direct modeling is intractable, we develop a data driven approach, estimating the density for the test sample under consideration. Statistical inference on the estimated density provides us with quantities of interest like the most probable future motion of the human and the amount of motion informatio

    Semi-supervised learning and fairness-aware learning under class imbalance

    Get PDF
    With the advent of Web 2.0 and the rapid technological advances, there is a plethora of data in every field; however, more data does not necessarily imply more information, rather the quality of data (veracity aspect) plays a key role. Data quality is a major issue, since machine learning algorithms are solely based on historical data to derive novel hypotheses. Data may contain noise, outliers, missing values and/or class labels, and skewed data distributions. The latter case, the so-called class-imbalance problem, is quite old and still affects dramatically machine learning algorithms. Class-imbalance causes classification models to learn effectively one particular class (majority) while ignoring other classes (minority). In extend to this issue, machine learning models that are applied in domains of high societal impact have become biased towards groups of people or individuals who are not well represented within the data. Direct and indirect discriminatory behavior is prohibited by international laws; thus, there is an urgency of mitigating discriminatory outcomes from machine learning algorithms. In this thesis, we address the aforementioned issues and propose methods that tackle class imbalance, and mitigate discriminatory outcomes in machine learning algorithms. As part of this thesis, we make the following contributions: • Tackling class-imbalance in semi-supervised learning – The class-imbalance problem is very often encountered in classification. There is a variety of methods that tackle this problem; however, there is a lack of methods that deal with class-imbalance in the semi-supervised learning. We address this problem by employing data augmentation in semi-supervised learning process in order to equalize class distributions. We show that semi-supervised learning coupled with data augmentation methods can overcome class-imbalance propagation and significantly outperform the standard semi-supervised annotation process. • Mitigating unfairness in supervised models – Fairness in supervised learning has received a lot of attention over the last years. A growing body of pre-, in- and postprocessing approaches has been proposed to mitigate algorithmic bias; however, these methods consider error rate as the performance measure of the machine learning algorithm, which causes high error rates on the under-represented class. To deal with this problem, we propose approaches that operate in pre-, in- and post-processing layers while accounting for all classes. Our proposed methods outperform state-of-the-art methods in terms of performance while being able to mitigate unfair outcomes

    Unsupervised Graph-Based Similarity Learning Using Heterogeneous Features.

    Full text link
    Relational data refers to data that contains explicit relations among objects. Nowadays, relational data are universal and have a broad appeal in many different application domains. The problem of estimating similarity between objects is a core requirement for many standard Machine Learning (ML), Natural Language Processing (NLP) and Information Retrieval (IR) problems such as clustering, classiffication, word sense disambiguation, etc. Traditional machine learning approaches represent the data using simple, concise representations such as feature vectors. While this works very well for homogeneous data, i.e, data with a single feature type such as text, it does not exploit the availability of dfferent feature types fully. For example, scientic publications have text, citations, authorship information, venue information. Each of the features can be used for estimating similarity. Representing such objects has been a key issue in efficient mining (Getoor and Taskar, 2007). In this thesis, we propose natural representations for relational data using multiple, connected layers of graphs; one for each feature type. Also, we propose novel algorithms for estimating similarity using multiple heterogeneous features. Also, we present novel algorithms for tasks like topic detection and music recommendation using the estimated similarity measure. We demonstrate superior performance of the proposed algorithms (root mean squared error of 24.81 on the Yahoo! KDD Music recommendation data set and classiffication accuracy of 88% on the ACL Anthology Network data set) over many of the state of the art algorithms, such as Latent Semantic Analysis (LSA), Multiple Kernel Learning (MKL) and spectral clustering and baselines on large, standard data sets.Ph.D.Computer Science & EngineeringUniversity of Michigan, Horace H. Rackham School of Graduate Studieshttp://deepblue.lib.umich.edu/bitstream/2027.42/89824/1/mpradeep_1.pd

    P300 classification using deep belief nets

    Get PDF
    2014 Summer.Includes bibliographical references.Electroencephalogram (EEG) is measure of the electrical activity of the brain. One of the most important EEG paradigm that has been explored in BCI systems is the P300 signal. The P300 wave is an endogenous event-related-potential which can be captured during the process of decision making as a subject reacts to a stimulus. One way to detect the P300 signal is to show a subject two types of visual stimuli occurring at different rates. The event occurring less frequently than the other elicits a positive signal component with a latency of roughly 250-500 ms. P300 detection has many applications in the BCI field. One of the most common applications of P300 detection is the P300 speller which enables users to type letters on the screen. Machine Learning algorithms play a crucial role in designing a BCI system. One important purpose of using the machine learning algorithms in BCI systems is the classification of EEG signals. In order to translate EEG signals to a control signal, BCI systems should first capture the pattern of EEG signals and discriminate them into different command categories. This is usually done using different machine learning-based classifiers. In the past, different linear and nonlinear methods have been used to discriminate the P300 signals from nonP300 signals. This thesis provides the first attempt to implement and examine the performance of the Deep Belief Networks (DBN) to model the P300 data for classification. The highest classification accuracy we achieved with DBN is 97 percent for testing trials. In our experiments, we used EEG data collected by the BCI lab at Colorado State University on both healthy and disabled subjects

    Searching for Needles in the Cosmic Haystack

    Get PDF
    Searching for pulsar signals in radio astronomy data sets is a difficult task. The data sets are extremely large, approaching the petabyte scale, and are growing larger as instruments become more advanced. Big Data brings with it big challenges. Processing the data to identify candidate pulsar signals is computationally expensive and must utilize parallelism to be scalable. Labeling benchmarks for supervised classification is costly. To compound the problem, pulsar signals are very rare, e.g., only 0.05% of the instances in one data set represent pulsars. Furthermore, there are many different approaches to candidate classification with no consensus on a best practice. This dissertation is focused on identifying and classifying radio pulsar candidates from single pulse searches. First, to identify and classify Dispersed Pulse Groups (DPGs), we developed a supervised machine learning approach that consists of RAPID (a novel peak identification algorithm), feature extraction, and supervised machine learning classification. We tested six algorithms for classification with four imbalance treatments. Results showed that classifiers with imbalance treatments had higher recall values. Overall, classifiers using multiclass RandomForests combined with Synthetic Majority Oversampling TEchnique (SMOTE) were the most efficient; they identified additional known pulsars not in the benchmark, with less false positives than other classifiers. Second, we developed a parallel single pulse identification method, D-RAPID, and introduced a novel automated multiclass labeling (ALM) technique that we combined with feature selection to improve execution performance. D-RAPID improved execution performance over RAPID by a factor of 5. We also showed that the combination of ALM and feature selection sped up the execution performance of RandomForest by 54% on average with less than a 2% average reduction in classification performance. Finally, we proposed CoDRIFt, a novel classification algorithm that is distributed for scalability and employs semi-supervised learning to leverage unlabeled data to inform classification. We evaluated and compared CoDRIFt to eleven other classifiers. The results showed that CoDRIFt excelled at classifying candidates in imbalanced benchmarks with a majority of non-pulsar signals (\u3e95%). Furthermore, CoDRIFt models created with very limited sets of labeled data (as few as 22 labeled minority class instances) were able to achieve high recall (mean = 0.98). In comparison to the other algorithms trained on similar sets, CoDRIFt outperformed them all, with recall 2.9% higher than the next best classifier and a 35% average improvement over all eleven classifiers. CoDRIFt is customizable for other problem domains with very large, imbalanced data sets, such as fraud detection and cyber attack detection

    Machine Learning Approaches for Semantic Segmentation on Partly-Annotated Medical Images

    Get PDF
    Semantic segmentation of medical images plays a crucial role in assisting medical practitioners in providing accurate and swift diagnoses; nevertheless, deep neural networks require extensive labelled data to learn and generalise appropriately. This is a major issue in medical imagery because most of the datasets are not fully annotated. Training models with partly-annotated datasets generate plenty of predictions that belong to correct unannotated areas that are categorised as false positives; as a result, standard segmentation metrics and objective functions do not work correctly, affecting the overall performance of the models. In this thesis, the semantic segmentation of partly-annotated medical datasets is extensively and thoroughly studied. The general objective is to improve the segmentation results of medical images via innovative supervised and semi-supervised approaches. The main contributions of this work are the following. Firstly, a new metric, specifically designed for this kind of dataset, can provide a reliable score to partly-annotated datasets with positive expert feedback in their generated predictions by exploiting all the confusion matrix values except the false positives. Secondly, an innovative approach to generating better pseudo-labels when applying co-training with the disagreement selection strategy. This method expands the pixels in disagreement utilising the combined predictions as a guide. Thirdly, original attention mechanisms based on disagreement are designed for two cases: intra-model and inter-model. These attention modules leverage the disagreement between layers (from the same or different model instances) to enhance the overall learning process and generalisation of the models. Lastly, innovative deep supervision methods improve the segmentation results by training neural networks one subnetwork at a time following the order of the supervision branches. The methods are thoroughly evaluated on several histopathological datasets showing significant improvements
    corecore