1,561 research outputs found

    Feature Selection: A Data Perspective

    Full text link
    Feature selection, as a data preprocessing strategy, has been proven to be effective and efficient in preparing data (especially high-dimensional data) for various data mining and machine learning problems. The objectives of feature selection include: building simpler and more comprehensible models, improving data mining performance, and preparing clean, understandable data. The recent proliferation of big data has presented some substantial challenges and opportunities to feature selection. In this survey, we provide a comprehensive and structured overview of recent advances in feature selection research. Motivated by current challenges and opportunities in the era of big data, we revisit feature selection research from a data perspective and review representative feature selection algorithms for conventional data, structured data, heterogeneous data and streaming data. Methodologically, to emphasize the differences and similarities of most existing feature selection algorithms for conventional data, we categorize them into four main groups: similarity based, information theoretical based, sparse learning based and statistical based methods. To facilitate and promote the research in this community, we also present an open-source feature selection repository that consists of most of the popular feature selection algorithms (\url{http://featureselection.asu.edu/}). Also, we use it as an example to show how to evaluate feature selection algorithms. At the end of the survey, we present a discussion about some open problems and challenges that require more attention in future research

    Discovery of Shifting Patterns in Sequence Classification

    Full text link
    In this paper, we investigate the multi-variate sequence classification problem from a multi-instance learning perspective. Real-world sequential data commonly show discriminative patterns only at specific time periods. For instance, we can identify a cropland during its growing season, but it looks similar to a barren land after harvest or before planting. Besides, even within the same class, the discriminative patterns can appear in different periods of sequential data. Due to such property, these discriminative patterns are also referred to as shifting patterns. The shifting patterns in sequential data severely degrade the performance of traditional classification methods without sufficient training data. We propose a novel sequence classification method by automatically mining shifting patterns from multi-variate sequence. The method employs a multi-instance learning approach to detect shifting patterns while also modeling temporal relationships within each multi-instance bag by an LSTM model to further improve the classification performance. We extensively evaluate our method on two real-world applications - cropland mapping and affective state recognition. The experiments demonstrate the superiority of our proposed method in sequence classification performance and in detecting discriminative shifting patterns

    Online Semi-Supervised Learning with Deep Hybrid Boltzmann Machines and Denoising Autoencoders

    Full text link
    Two novel deep hybrid architectures, the Deep Hybrid Boltzmann Machine and the Deep Hybrid Denoising Auto-encoder, are proposed for handling semi-supervised learning problems. The models combine experts that model relevant distributions at different levels of abstraction to improve overall predictive performance on discriminative tasks. Theoretical motivations and algorithms for joint learning for each are presented. We apply the new models to the domain of data-streams in work towards life-long learning. The proposed architectures show improved performance compared to a pseudo-labeled, drop-out rectifier network

    Unveiling Contextual Similarity of Things via Mining Human-Thing Interactions in the Internet of Things

    Full text link
    With recent advances in radio-frequency identification (RFID), wireless sensor networks, and Web services, physical things are becoming an integral part of the emerging ubiquitous Web. Finding correlations of ubiquitous things is a crucial prerequisite for many important applications such as things search, discovery, classification, recommendation, and composition. This article presents DisCor-T, a novel graph-based method for discovering underlying connections of things via mining the rich content embodied in human-thing interactions in terms of user, temporal and spatial information. We model these various information using two graphs, namely spatio-temporal graph and social graph. Then, random walk with restart (RWR) is applied to find proximities among things, and a relational graph of things (RGT) indicating implicit correlations of things is learned. The correlation analysis lays a solid foundation contributing to improved effectiveness in things management. To demonstrate the utility, we develop a flexible feature-based classification framework on top of RGT and perform a systematic case study. Our evaluation exhibits the strength and feasibility of the proposed approach

    Weakly Supervised Deep Learning Approach in Streaming Environments

    Full text link
    The feasibility of existing data stream algorithms is often hindered by the weakly supervised condition of data streams. A self-evolving deep neural network, namely Parsimonious Network (ParsNet), is proposed as a solution to various weakly-supervised data stream problems. A self-labelling strategy with hedge (SLASH) is proposed in which its auto-correction mechanism copes with \textit{the accumulation of mistakes} significantly affecting the model's generalization. ParsNet is developed from a closed-loop configuration of the self-evolving generative and discriminative training processes exploiting shared parameters in which its structure flexibly grows and shrinks to overcome the issue of concept drift with/without labels. The numerical evaluation has been performed under two challenging problems, namely sporadic access to ground truth and infinitely delayed access to the ground truth. Our numerical study shows the advantage of ParsNet with a substantial margin from its counterparts in the high-dimensional data streams and infinite delay simulation protocol. To support the reproducible research initiative, the source code of ParsNet along with supplementary materials are made available at https://bit.ly/2qNW7p4.Comment: This paper has been accepted for publication in The 2019 IEEE International Conference on Big Data (IEEE BigData 2019), Los Angeles, CA, US

    A Survey on Multi-output Learning

    Full text link
    Multi-output learning aims to simultaneously predict multiple outputs given an input. It is an important learning problem due to the pressing need for sophisticated decision making in real-world applications. Inspired by big data, the 4Vs characteristics of multi-output imposes a set of challenges to multi-output learning, in terms of the volume, velocity, variety and veracity of the outputs. Increasing number of works in the literature have been devoted to the study of multi-output learning and the development of novel approaches for addressing the challenges encountered. However, it lacks a comprehensive overview on different types of challenges of multi-output learning brought by the characteristics of the multiple outputs and the techniques proposed to overcome the challenges. This paper thus attempts to fill in this gap to provide a comprehensive review on this area. We first introduce different stages of the life cycle of the output labels. Then we present the paradigm on multi-output learning, including its myriads of output structures, definitions of its different sub-problems, model evaluation metrics and popular data repositories used in the study. Subsequently, we review a number of state-of-the-art multi-output learning methods, which are categorized based on the challenges.Comment: Paper accepted by IEEE Transactions on Neural Networks and Learning System

    Short Text Topic Modeling Techniques, Applications, and Performance: A Survey

    Full text link
    Analyzing short texts infers discriminative and coherent latent topics that is a critical and fundamental task since many real-world applications require semantic understanding of short texts. Traditional long text topic modeling algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this problem very well since only very limited word co-occurrence information is available in short texts. Therefore, short text topic modeling has already attracted much attention from the machine learning research community in recent years, which aims at overcoming the problem of sparseness in short texts. In this survey, we conduct a comprehensive review of various short text topic modeling techniques proposed in the literature. We present three categories of methods based on Dirichlet multinomial mixture, global word co-occurrences, and self-aggregation, with example of representative approaches in each category and analysis of their performance on various tasks. We develop the first comprehensive open-source library, called STTM, for use in Java that integrates all surveyed algorithms within a unified interface, benchmark datasets, to facilitate the expansion of new methods in this research field. Finally, we evaluate these state-of-the-art methods on many real-world datasets and compare their performance against one another and versus long text topic modeling algorithm.Comment: arXiv admin note: text overlap with arXiv:1808.02215 by other author

    A Survey of Location Prediction on Twitter

    Full text link
    Locations, e.g., countries, states, cities, and point-of-interests, are central to news, emergency events, and people's daily lives. Automatic identification of locations associated with or mentioned in documents has been explored for decades. As one of the most popular online social network platforms, Twitter has attracted a large number of users who send millions of tweets on daily basis. Due to the world-wide coverage of its users and real-time freshness of tweets, location prediction on Twitter has gained significant attention in recent years. Research efforts are spent on dealing with new challenges and opportunities brought by the noisy, short, and context-rich nature of tweets. In this survey, we aim at offering an overall picture of location prediction on Twitter. Specifically, we concentrate on the prediction of user home locations, tweet locations, and mentioned locations. We first define the three tasks and review the evaluation metrics. By summarizing Twitter network, tweet content, and tweet context as potential inputs, we then structurally highlight how the problems depend on these inputs. Each dependency is illustrated by a comprehensive review of the corresponding strategies adopted in state-of-the-art approaches. In addition, we also briefly review two related problems, i.e., semantic location prediction and point-of-interest recommendation. Finally, we list future research directions.Comment: Accepted to TKDE. 30 pages, 1 figur

    Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics

    Full text link
    Learning speaker-specific features is vital in many applications like speaker recognition, diarization and speech recognition. This paper provides a novel approach, we term Neural Predictive Coding (NPC), to learn speaker-specific characteristics in a completely unsupervised manner from large amounts of unlabeled training data that even contain many non-speech events and multi-speaker audio streams. The NPC framework exploits the proposed short-term active-speaker stationarity hypothesis which assumes two temporally-close short speech segments belong to the same speaker, and thus a common representation that can encode the commonalities of both the segments, should capture the vocal characteristics of that speaker. We train a convolutional deep siamese network to produce "speaker embeddings" by learning to separate `same' vs `different' speaker pairs which are generated from an unlabeled data of audio streams. Two sets of experiments are done in different scenarios to evaluate the strength of NPC embeddings and compare with state-of-the-art in-domain supervised methods. First, two speaker identification experiments with different context lengths are performed in a scenario with comparatively limited within-speaker channel variability. NPC embeddings are found to perform the best at short duration experiment, and they provide complementary information to i-vectors for full utterance experiments. Second, a large scale speaker verification task having a wide range of within-speaker channel variability is adopted as an upper-bound experiment where comparisons are drawn with in-domain supervised methods
    • …
    corecore