Feature Selection: A Data Perspective
Feature selection, as a data preprocessing strategy, has been proven to be
effective and efficient in preparing data (especially high-dimensional data)
for various data mining and machine learning problems. The objectives of
feature selection include: building simpler and more comprehensible models,
improving data mining performance, and preparing clean, understandable data.
The recent proliferation of big data has presented some substantial challenges
and opportunities to feature selection. In this survey, we provide a
comprehensive and structured overview of recent advances in feature selection
research. Motivated by current challenges and opportunities in the era of big
data, we revisit feature selection research from a data perspective and review
representative feature selection algorithms for conventional data, structured
data, heterogeneous data and streaming data. Methodologically, to emphasize the
differences and similarities of most existing feature selection algorithms for
conventional data, we categorize them into four main groups: similarity based,
information theoretical based, sparse learning based and statistical based
methods. To facilitate and promote the research in this community, we also
present an open-source feature selection repository that consists of most of
the popular feature selection algorithms
(\url{http://featureselection.asu.edu/}). Also, we use it as an example to show
how to evaluate feature selection algorithms. At the end of the survey, we
present a discussion of some open problems and challenges that require more
attention in future research.
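As an illustration of the statistical-based family the survey describes, the sketch below ranks features by the absolute Pearson correlation between each feature column and the target and keeps the top-k. This is a generic filter-style selector written for this summary, not an algorithm from the survey's repository; the data and function names are hypothetical.

```python
# Minimal statistical filter-style feature selector: rank features by the
# absolute Pearson correlation with the target, then keep the top-k.
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0

def select_top_k(X, y, k):
    """X: list of samples (each a list of feature values); y: targets.
    Returns indices of the k features most correlated with y."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        scores.append((abs(pearson(col, y)), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

# Toy data: features 0 and 2 track the target; feature 1 is noise.
X = [[1, 5, 0.1], [2, 3, 0.2], [3, 8, 0.3], [4, 1, 0.4]]
y = [1, 2, 3, 4]
print(sorted(select_top_k(X, y, 2)))  # the two informative features
```

Similarity-based, information-theoretical, and sparse-learning methods replace the per-feature correlation score with graph-, entropy-, or regularization-based criteria, but the rank-and-keep structure is the same.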
Discovery of Shifting Patterns in Sequence Classification
In this paper, we investigate the multi-variate sequence classification
problem from a multi-instance learning perspective. Real-world sequential data
commonly show discriminative patterns only at specific time periods. For
instance, we can identify a cropland during its growing season, but it looks
similar to barren land after harvest or before planting. Moreover, even within
the same class, the discriminative patterns can appear in different periods of
the sequential data. Due to this property, these discriminative patterns are
also referred to as shifting patterns. Shifting patterns in sequential data
severely degrade the performance of traditional classification methods when
sufficient training data is unavailable.
We propose a novel sequence classification method that automatically mines
shifting patterns from multi-variate sequences. The method employs a
multi-instance learning approach to detect shifting patterns while also
modeling temporal relationships within each multi-instance bag by an LSTM model
to further improve the classification performance. We extensively evaluate our
method on two real-world applications - cropland mapping and affective state
recognition. The experiments demonstrate the superiority of our proposed method
in sequence classification performance and in detecting discriminative shifting
patterns.
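The multi-instance view described above can be sketched as follows: the sequence is split into windows (instances), each window is scored, and the bag takes its best instance's score, so a discriminative pattern counts no matter where it shifts to. The window scorer here is a hypothetical stand-in for the paper's LSTM-based model.

```python
# Multi-instance framing of sequence classification with shifting patterns.

def make_bag(sequence, window, stride):
    """Split a sequence into fixed-length instance windows."""
    return [sequence[i:i + window]
            for i in range(0, len(sequence) - window + 1, stride)]

def bag_score(bag, instance_scorer):
    """Max-pooling MIL assumption: the bag scores as its best instance,
    so the pattern may appear at any (shifted) position."""
    return max(instance_scorer(inst) for inst in bag)

# Toy scorer: "pattern present" means a window with a high mean value.
toy_scorer = lambda inst: sum(inst) / len(inst)

seq = [0.0, 0.1, 0.0, 0.9, 0.8, 0.7, 0.1, 0.0]   # pattern in the middle
bag = make_bag(seq, window=3, stride=1)
print(round(bag_score(bag, toy_scorer), 2))  # score of the best window
```

A conventional classifier averaging over the whole sequence would dilute the middle pattern; the max over instances preserves it regardless of where it occurs.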
Online Semi-Supervised Learning with Deep Hybrid Boltzmann Machines and Denoising Autoencoders
Two novel deep hybrid architectures, the Deep Hybrid Boltzmann Machine and
the Deep Hybrid Denoising Auto-encoder, are proposed for handling
semi-supervised learning problems. The models combine experts that model
relevant distributions at different levels of abstraction to improve overall
predictive performance on discriminative tasks. Theoretical motivations and
algorithms for joint learning for each are presented. We apply the new models
to the domain of data-streams in work towards life-long learning. The proposed
architectures show improved performance compared to a pseudo-labeled, drop-out
rectifier network.
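The core hybrid idea, combining experts that model different distributions so unlabeled data can inform a discriminative decision, can be sketched as a log-linear combination of per-class expert scores. The two experts below are hypothetical toy models, not the paper's Deep Hybrid Boltzmann Machine or Denoising Auto-encoder.

```python
# Sketch: combine a generative expert's and a discriminative expert's
# per-class log-scores; the generative side can be shaped by unlabeled
# data while labels drive the discriminative side.
import math

def combine_experts(log_gen, log_disc, weight=0.5):
    """Weighted log-linear combination of per-class expert scores;
    returns the argmax class."""
    scores = {c: weight * log_gen[c] + (1 - weight) * log_disc[c]
              for c in log_gen}
    return max(scores, key=scores.get)

log_gen = {"pos": math.log(0.7), "neg": math.log(0.3)}   # generative expert
log_disc = {"pos": math.log(0.4), "neg": math.log(0.6)}  # discriminative expert
print(combine_experts(log_gen, log_disc))  # prints "pos"
```

With `weight=0` the decision reduces to the discriminative expert alone, which is roughly the position of the pseudo-labeled rectifier-network baseline.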
Unveiling Contextual Similarity of Things via Mining Human-Thing Interactions in the Internet of Things
With recent advances in radio-frequency identification (RFID), wireless
sensor networks, and Web services, physical things are becoming an integral
part of the emerging ubiquitous Web. Finding correlations of ubiquitous things
is a crucial prerequisite for many important applications such as things
search, discovery, classification, recommendation, and composition. This
article presents DisCor-T, a novel graph-based method for discovering
underlying connections of things via mining the rich content embodied in
human-thing interactions in terms of user, temporal, and spatial information. We
model this information using two graphs, namely a spatio-temporal graph and a
social graph. Then, random walk with restart (RWR) is applied to find
proximities among things, and a relational graph of things (RGT) indicating
implicit correlations of things is learned. The correlation analysis lays a
solid foundation contributing to improved effectiveness in things management.
To demonstrate the utility, we develop a flexible feature-based classification
framework on top of RGT and perform a systematic case study. Our evaluation
exhibits the strength and feasibility of the proposed approach.
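Random walk with restart, the proximity step mentioned above, iterates p ← (1 − c)·Wᵀp + c·e, where e is the one-hot vector of the query node and c is the restart probability. A pure-Python sketch on a toy thing-graph (the graph and restart probability are illustrative, not from the article):

```python
# Random walk with restart (RWR) over an adjacency-list graph.

def rwr(adj, seed, c=0.15, iters=100):
    """adj: {node: [neighbors]}. Power-iterate the RWR update and
    return each node's steady-state proximity to `seed`."""
    nodes = list(adj)
    p = {n: 1.0 if n == seed else 0.0 for n in nodes}
    for _ in range(iters):
        nxt = {n: (c if n == seed else 0.0) for n in nodes}  # restart mass
        for u in nodes:
            out = adj[u]
            for v in out:                     # spread (1-c) of u's mass
                nxt[v] += (1 - c) * p[u] / len(out)
        p = nxt
    return p

# Hypothetical "things" graph: lamp - desk - chair.
adj = {"lamp": ["desk"], "desk": ["lamp", "chair"], "chair": ["desk"]}
prox = rwr(adj, seed="lamp")
print({n: round(v, 3) for n, v in prox.items()})
```

The resulting proximities (which sum to 1) are exactly the kind of pairwise scores from which a relational graph of things can be thresholded and learned.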
Weakly Supervised Deep Learning Approach in Streaming Environments
The feasibility of existing data stream algorithms is often hindered by the
weakly supervised condition of data streams. A self-evolving deep neural
network, namely Parsimonious Network (ParsNet), is proposed as a solution to
various weakly-supervised data stream problems. A self-labelling strategy with
hedge (SLASH) is proposed, whose auto-correction mechanism copes with
\textit{the accumulation of mistakes} that would otherwise significantly affect
the model's generalization. ParsNet is developed from a closed-loop
configuration of self-evolving generative and discriminative training processes
exploiting shared parameters, in which the structure flexibly grows and shrinks
to overcome concept drift with or without labels. The numerical evaluation has
been performed under two challenging problems, namely sporadic access to ground
truth and infinitely delayed access to the ground truth. Our numerical study
shows that ParsNet outperforms its counterparts by a substantial margin on
high-dimensional data streams and under the infinite-delay simulation protocol.
To support the reproducible research initiative, the source code of ParsNet
along with supplementary materials is made available at https://bit.ly/2qNW7p4.
Comment: This paper has been accepted for publication in The 2019 IEEE
International Conference on Big Data (IEEE BigData 2019), Los Angeles, CA, US.
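The generic self-labelling idea behind such weakly supervised streams can be sketched as accepting the model's own prediction as a pseudo-label only when its confidence clears a threshold, which limits the accumulation of mistakes. This is an illustration only; SLASH's actual hedge and auto-correction mechanism is more involved.

```python
# Confidence-gated self-labelling for a weakly supervised data stream.
import math

def self_label(stream, predict_proba, threshold=0.9):
    """Yield (sample, pseudo_label) only for confident predictions."""
    for x in stream:
        probs = predict_proba(x)
        label, conf = max(probs.items(), key=lambda kv: kv[1])
        if conf >= threshold:
            yield x, label

# Hypothetical model: confident far from the decision boundary at 0.
def toy_predict(x):
    p_pos = 1 / (1 + math.exp(-4 * x))
    return {"pos": p_pos, "neg": 1 - p_pos}

stream = [-2.0, -0.1, 0.05, 1.5]
print(list(self_label(stream, toy_predict)))
# Only the two samples far from the boundary receive pseudo-labels.
```

The two borderline samples are left unlabeled rather than risked as wrong pseudo-labels, the failure mode the abstract describes as the accumulation of mistakes.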
A Survey on Multi-output Learning
Multi-output learning aims to simultaneously predict multiple outputs given
an input. It is an important learning problem due to the pressing need for
sophisticated decision making in real-world applications. Inspired by big data,
the 4Vs characteristics of multi-output imposes a set of challenges to
multi-output learning, in terms of the volume, velocity, variety and veracity
of the outputs. Increasing number of works in the literature have been devoted
to the study of multi-output learning and the development of novel approaches
for addressing the challenges encountered. However, it lacks a comprehensive
overview on different types of challenges of multi-output learning brought by
the characteristics of the multiple outputs and the techniques proposed to
overcome the challenges. This paper thus attempts to fill in this gap to
provide a comprehensive review on this area. We first introduce different
stages of the life cycle of the output labels. Then we present the multi-output
learning paradigm, including its myriad output structures, definitions of its
different sub-problems, model evaluation metrics, and popular data repositories
used in the study. Subsequently, we review a number of state-of-the-art
multi-output learning methods, which are categorized based on the challenges
they address.
Comment: Paper accepted by IEEE Transactions on Neural Networks and Learning
Systems.
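Two standard evaluation metrics for the multi-label setting of multi-output learning can make the paradigm concrete: Hamming loss (fraction of individual labels predicted wrongly) and exact-match ratio (fraction of samples with every output correct). The toy label matrices below are illustrative.

```python
# Multi-label evaluation: Hamming loss and exact-match ratio.

def hamming_loss(Y_true, Y_pred):
    """Fraction of wrong individual labels over all samples and outputs."""
    errors = sum(t != p for yt, yp in zip(Y_true, Y_pred)
                 for t, p in zip(yt, yp))
    return errors / (len(Y_true) * len(Y_true[0]))

def exact_match(Y_true, Y_pred):
    """Fraction of samples whose entire output vector is correct."""
    return sum(yt == yp for yt, yp in zip(Y_true, Y_pred)) / len(Y_true)

Y_true = [[1, 0, 1], [0, 1, 0], [1, 1, 0]]
Y_pred = [[1, 0, 1], [0, 1, 1], [1, 0, 0]]
print(hamming_loss(Y_true, Y_pred))  # 2 wrong labels out of 9
print(exact_match(Y_true, Y_pred))   # 1 of 3 samples fully correct
```

The gap between the two numbers shows why metric choice matters for multi-output models: per-label accuracy can look good while few outputs are entirely right.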
Short Text Topic Modeling Techniques, Applications, and Performance: A Survey
Inferring discriminative and coherent latent topics from short texts is a
critical and fundamental task, since many real-world applications require
semantic understanding of short texts. Traditional long-text topic modeling
algorithms (e.g., PLSA and LDA) based on word co-occurrences cannot solve this
problem well, since only very limited word co-occurrence information is
available in short texts. Therefore, short text topic modeling, which aims at
overcoming the sparseness problem in short texts, has attracted much attention
from the machine learning research community in recent years. In
this survey, we conduct a comprehensive review of various short text topic
modeling techniques proposed in the literature. We present three categories of
methods based on Dirichlet multinomial mixture, global word co-occurrences, and
self-aggregation, with examples of representative approaches in each category
and analysis of their performance on various tasks. We also develop the first
comprehensive open-source library, called STTM, written in Java, which
integrates all surveyed algorithms within a unified interface and provides
benchmark datasets, to facilitate the development of new methods in this field.
Finally, we evaluate these state-of-the-art methods on many real-world datasets
and compare their performance against one another and against long-text topic
modeling algorithms.
Comment: arXiv admin note: text overlap with arXiv:1808.02215 by other authors.
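The Dirichlet multinomial mixture (DMM) category can be sketched with a pared-down collapsed Gibbs sampler: each document draws a single topic, which suits texts too short for per-word topic mixtures. This illustrative sketch assumes no repeated words within a document and omits the refinements that library-quality samplers (such as those in STTM) include.

```python
# Minimal collapsed Gibbs sampler for the Dirichlet multinomial mixture.
import random
from collections import Counter

def dmm_gibbs(docs, K, alpha=0.1, beta=0.1, iters=50, seed=0):
    """docs: list of word lists. Returns one topic id per document."""
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})          # vocabulary size
    z = [rng.randrange(K) for _ in docs]           # topic of each document
    m = Counter(z)                                 # documents per topic
    nkw = [Counter() for _ in range(K)]            # word counts per topic
    nk = [0] * K                                   # total words per topic
    for d, k in zip(docs, z):
        nkw[k].update(d)
        nk[k] += len(d)
    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                               # remove doc i's counts
            m[k] -= 1
            nkw[k].subtract(d)
            nk[k] -= len(d)
            weights = []                           # sampling weights per topic
            for t in range(K):
                w = m[t] + alpha                   # topic popularity prior
                for j, word in enumerate(d):       # word fit to topic t
                    w *= (nkw[t][word] + beta) / (nk[t] + V * beta + j)
                weights.append(w)
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k                               # re-add doc i's counts
            m[k] += 1
            nkw[k].update(d)
            nk[k] += len(d)
    return z

docs = [["cat", "pet"], ["dog", "pet"], ["stock", "price"], ["price", "market"]]
print(dmm_gibbs(docs, K=2))
```

Because a whole short document commits to one topic, the shared words "pet" and "price" carry far more weight than they would in a per-word mixture, which is exactly how DMM-style methods combat sparseness.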
A Survey of Location Prediction on Twitter
Locations, e.g., countries, states, cities, and point-of-interests, are
central to news, emergency events, and people's daily lives. Automatic
identification of locations associated with or mentioned in documents has been
explored for decades. As one of the most popular online social network
platforms, Twitter has attracted a large number of users who send millions of
tweets on a daily basis. Due to the world-wide coverage of its users and
real-time freshness of tweets, location prediction on Twitter has gained
significant attention in recent years. Research efforts are spent on dealing
with new challenges and opportunities brought by the noisy, short, and
context-rich nature of tweets. In this survey, we aim at offering an overall
picture of location prediction on Twitter. Specifically, we concentrate on the
prediction of user home locations, tweet locations, and mentioned locations. We
first define the three tasks and review the evaluation metrics. By summarizing
Twitter network, tweet content, and tweet context as potential inputs, we then
structurally highlight how the problems depend on these inputs. Each dependency
is illustrated by a comprehensive review of the corresponding strategies
adopted in state-of-the-art approaches. In addition, we briefly review two
related problems, i.e., semantic location prediction and point-of-interest
recommendation. Finally, we list future research directions.
Comment: Accepted to TKDE. 30 pages, 1 figure.
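One classic tweet-content strategy for home-location prediction can be sketched as a naive Bayes classifier over location-indicative words: count how often each word appears in tweets from users with each known home city, then score a new user's words. The cities, words, and function names below are hypothetical toy data, not from the survey.

```python
# Content-based home-location prediction via naive Bayes word counts.
import math
from collections import Counter, defaultdict

def train(labeled_tweets):
    """labeled_tweets: list of (city, words). Returns per-city word counts."""
    counts = defaultdict(Counter)
    for city, words in labeled_tweets:
        counts[city].update(words)
    return counts

def predict(counts, words, smooth=1.0):
    """Score each city by smoothed log-likelihood of the user's words."""
    vocab = {w for c in counts.values() for w in c}
    best, best_lp = None, float("-inf")
    for city, c in counts.items():
        total = sum(c.values()) + smooth * len(vocab)
        lp = sum(math.log((c[w] + smooth) / total) for w in words)
        if lp > best_lp:
            best, best_lp = city, lp
    return best

data = [("nyc", ["subway", "bodega"]), ("nyc", ["subway", "broadway"]),
        ("sf", ["fog", "bart"]), ("sf", ["fog", "mission"])]
counts = train(data)
print(predict(counts, ["subway", "fog", "subway"]))  # prints "nyc"
```

Real systems layer the other inputs the survey highlights, Twitter network structure and tweet context, on top of this content signal.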
Neural Predictive Coding using Convolutional Neural Networks towards Unsupervised Learning of Speaker Characteristics
Learning speaker-specific features is vital in many applications like speaker
recognition, diarization, and speech recognition. This paper presents a novel
approach, which we term Neural Predictive Coding (NPC), to learn speaker-specific
characteristics in a completely unsupervised manner from large amounts of
unlabeled training data that even contain many non-speech events and
multi-speaker audio streams. The NPC framework exploits the proposed short-term
active-speaker stationarity hypothesis, which assumes that two temporally close
short speech segments belong to the same speaker; thus a common representation
that encodes the commonalities of both segments should capture the vocal
characteristics of that speaker. We train a convolutional deep siamese
network to produce "speaker embeddings" by learning to separate `same' vs.
`different' speaker pairs generated from unlabeled audio streams. Two sets of
experiments are conducted in different scenarios to evaluate
the strength of NPC embeddings and compare with state-of-the-art in-domain
supervised methods. First, two speaker identification experiments with
different context lengths are performed in a scenario with comparatively
limited within-speaker channel variability. NPC embeddings are found to perform
best in the short-duration experiment, and they provide complementary
information to i-vectors in the full-utterance experiments. Second, a
large-scale speaker verification task having a wide range of within-speaker
channel variability is adopted as an upper-bound experiment, where comparisons
are drawn with in-domain supervised methods.
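The pair generation implied by the stationarity hypothesis can be sketched directly: two temporally close segments from one stream are labelled `same`, while segments drawn from different streams are labelled `different`. The segment length and closeness window below are illustrative parameters, not values from the paper.

```python
# Unsupervised same/different pair generation for a siamese network,
# following a short-term active-speaker stationarity assumption.
import random

def make_pairs(streams, seg_len=2, max_gap=1, n_diff=2, seed=0):
    """streams: list of per-stream feature sequences.
    Returns (segment_a, segment_b, label) training pairs."""
    rng = random.Random(seed)
    pairs = []
    for s in streams:                               # temporally close => same
        for i in range(len(s) - 2 * seg_len - max_gap + 1):
            j = i + seg_len + rng.randint(0, max_gap)
            pairs.append((s[i:i + seg_len], s[j:j + seg_len], "same"))
    for _ in range(n_diff):                         # cross-stream => different
        a, b = rng.sample(range(len(streams)), 2)
        sa, sb = streams[a], streams[b]
        i = rng.randrange(len(sa) - seg_len + 1)
        j = rng.randrange(len(sb) - seg_len + 1)
        pairs.append((sa[i:i + seg_len], sb[j:j + seg_len], "different"))
    return pairs

streams = [[0.1, 0.2, 0.3, 0.4, 0.5, 0.6], [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]]
pairs = make_pairs(streams)
print(sum(l == "same" for _, _, l in pairs), "same pairs;",
      sum(l == "different" for _, _, l in pairs), "different pairs")
```

No speaker labels are needed: the `same` labels follow from temporal proximity within a stream, which is what makes the embedding training completely unsupervised.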