22,884 research outputs found
Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental
datasets and complex data analytics methods. The machine learning methods used
in bioinformatics are iterative and parallel. These methods can be scaled to
handle big data using distributed and parallel computing technologies.
However, big data tools usually perform computation in batch mode and are not
optimized for iterative processing or for high data dependency among operations.
In the recent years, parallel, incremental, and multi-view machine learning
algorithms have been proposed. Similarly, graph-based architectures and
in-memory big data tools have been developed to minimize I/O cost and optimize
iterative processing.
However, standard big data architectures and tools are still lacking for many
important bioinformatics problems, such as fast construction of co-expression
and regulatory networks and salient module identification, detection of
complexes over growing protein-protein interaction data, fast analysis of
massive DNA, RNA, and protein sequence data, and fast querying on incremental
and heterogeneous disease networks. This paper addresses the issues and
challenges posed by several big data problems in bioinformatics, and gives an
overview of the state of the art and of future research opportunities.
Comment: 20-page survey paper on big data analytics in bioinformatics
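To make the iterative-processing point concrete, the sketch below (ours, not from the survey) shows a toy iterative job in PySpark: the input is cached once in memory, so each pass of the loop avoids the disk I/O that a batch-mode tool would pay on every iteration. The gene identifiers and the refinement rule are placeholders.

```python
# Minimal sketch of in-memory iterative processing with PySpark.
# Assumes a local Spark installation; data and loop are illustrative only.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iterative-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical (gene_id, expression) pairs, cached so that every
# iteration below reads from memory rather than re-reading from disk.
expr = sc.parallelize([("g1", 2.0), ("g2", 5.0), ("g3", 3.0)]).cache()

score = 0.0
for _ in range(10):  # iterative refinement loop
    mean = expr.map(lambda kv: kv[1]).mean()  # reuses cached partitions
    score = 0.5 * score + 0.5 * mean

print(score)
spark.stop()
```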
Scalable Prototype Selection by Genetic Algorithms and Hashing
Classification in the dissimilarity space has become a very active research
area since it provides a possibility to learn from data given in the form of
pairwise non-metric dissimilarities, which otherwise would be difficult to cope
with. The selection of prototypes is a key step for the further creation of the
space. However, despite previous efforts to find good prototypes, how to
select the best representation set remains an open issue. In this paper we
propose scalable methods for selecting prototype sets from very large datasets.
The methods are based on genetic algorithms, dissimilarity-based hashing, and
two different unsupervised and supervised scalable criteria. The unsupervised
criterion is based on the Minimum Spanning Tree of the graph created by the
prototypes as nodes and the dissimilarities as edges. The supervised criterion
is based on counting matching labels of objects and their closest prototypes.
The suitability of this type of algorithm is analyzed for the specific case
of dissimilarity representations. The experimental results show that the
methods select good prototypes by taking advantage of the large datasets, and
that they do so at low runtime.
Comment: 26 pages, 8 figures
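As a concrete reading of the supervised criterion, the sketch below (our interpretation, with hypothetical names) scores a candidate prototype set by the fraction of objects whose class label matches the label of their closest prototype; a genetic algorithm would use such a score as its fitness function.

```python
import numpy as np

def supervised_criterion(D, labels, proto_idx):
    """Fraction of objects whose label matches that of their closest
    prototype. D: (n, n) precomputed dissimilarity matrix;
    labels: (n,) class labels; proto_idx: candidate prototype indices
    (e.g., one chromosome of the genetic algorithm)."""
    proto_idx = np.asarray(proto_idx)
    nearest = proto_idx[np.argmin(D[:, proto_idx], axis=1)]
    return float(np.mean(labels == labels[nearest]))

# Toy usage: 6 objects, 2 classes, candidate prototype set {0, 3}.
rng = np.random.default_rng(0)
D = rng.random((6, 6)); D = (D + D.T) / 2; np.fill_diagonal(D, 0)
labels = np.array([0, 0, 0, 1, 1, 1])
print(supervised_criterion(D, labels, [0, 3]))
```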
Clustering Time Series Data Stream - A Literature Survey
Mining time series data has attracted tremendous interest in recent years. To
indicate the state of the field, various implementations are studied and
summarized in order to identify the problems in existing applications.
Clustering time series is a problem with applications in a wide variety of
fields that has recently attracted a large amount of research. Time series
data are frequently large and may contain outliers. In addition, time series
are a special type of data set whose elements have a temporal ordering.
Clustering such data streams is therefore an important issue in the data
mining process. Numerous clustering techniques and algorithms have been
proposed for time series data streams. The clustering algorithms and their
effectiveness in various applications are compared in order to develop a new
method that solves the existing problems. This paper presents a survey of the
clustering algorithms available for time series datasets. Moreover, the
strengths and limitations of previous research are discussed, and several
feasible topics for future study are identified. Furthermore, the areas that
utilize time series clustering are summarized.
Comment: IEEE publication format, International Journal of Computer Science
and Information Security, IJCSIS, Vol. 8, No. 1, April 2010, USA. ISSN
1947-5500, http://sites.google.com/site/ijcsis
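For orientation, here is one generic baseline of the kind such surveys cover, not a method advocated by this paper: z-normalize the series so clustering reflects shape rather than scale, then apply hierarchical clustering to pairwise Euclidean distances.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
series = rng.standard_normal((20, 100))  # 20 toy series of length 100

# z-normalize each series so distances compare shapes, not scales
z = (series - series.mean(axis=1, keepdims=True)) / series.std(axis=1, keepdims=True)

Z = linkage(pdist(z), method="ward")               # hierarchical clustering
clusters = fcluster(Z, t=4, criterion="maxclust")  # cut into 4 clusters
print(clusters)
```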
Unsupervised Assignment Flow: Label Learning on Feature Manifolds by Spatially Regularized Geometric Assignment
This paper introduces the unsupervised assignment flow that couples the
assignment flow for supervised image labeling with Riemannian gradient flows
for label evolution on feature manifolds. The latter component of the approach
encompasses extensions of state-of-the-art clustering approaches to
manifold-valued data. Coupling label evolution with the spatially regularized
assignment flow induces a sparsifying effect that enables to learn compact
label dictionaries in an unsupervised manner. Our approach alleviates the
requirement for supervised labeling to have proper labels at hand, because an
initial set of labels can evolve and adapt to better values while being
assigned to given data. The separation between feature and assignment manifolds
enables the flexible application which is demonstrated for three scenarios with
manifold-valued features. Experiments demonstrate a beneficial effect in both
directions: adaptivity of labels improves image labeling, and steering label
evolution by spatially regularized assignments leads to proper labels, because
the assignment flow for supervised labeling is exactly used without any
approximation for label learning.Comment: 34 pages, 13 figures, published in Journal of Mathematical Imaging
and Vision (JMIV
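The following is a crude Euclidean caricature of the coupling (ours; the paper's actual formulation is Riemannian, with assignments evolving on a probability simplex and regularization over pixel neighborhoods on the image grid): assignments are updated toward nearby labels and smoothed, and each label then drifts toward the mean of the features assigned to it.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def coupled_step(F, L, W, tau=0.1, beta=5.0):
    """One explicit Euler step of the coupled dynamics in flat space.
    F: (n, d) features; L: (k, d) label prototypes; W: (n, k) soft assignments."""
    dist = ((F[:, None, :] - L[None, :, :]) ** 2).sum(-1)  # (n, k)
    # Assignment update: data fit, then a crude global averaging that
    # stands in for spatial regularization over neighborhoods.
    W = softmax(np.log(W + 1e-12) - beta * dist)
    W = 0.75 * W + 0.25 * W.mean(axis=0, keepdims=True)
    # Label update: in the Euclidean case, the gradient flow moves each
    # label toward the weighted mean of its assigned features.
    target = (W.T @ F) / (W.sum(axis=0)[:, None] + 1e-12)
    return L + tau * (target - L), W
```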
Using real-time cluster configurations of streaming asynchronous features as online state descriptors in financial markets
We present a scheme for online, unsupervised state discovery and detection
from streaming, multi-featured, asynchronous data in high-frequency financial
markets. Online feature correlations are computed using an unbiased, lossless
Fourier estimator. A high-speed maximum likelihood clustering algorithm is then
used to find the feature cluster configuration which best explains the
structure in the correlation matrix. We conjecture that this feature
configuration is a candidate descriptor for the temporal state of the system.
Using a simple cluster configuration similarity metric, we are able to
enumerate the state space based on prevailing feature configurations. The
proposed state representation removes the need for human-driven data
pre-processing for state attribute specification, allowing a learning agent to
find structure in streaming data, discern changes in the system, enumerate its
perceived state space, and learn suitable action-selection policies.
Comment: 19 pages, 6 figures, 3 tables, under review at Pattern Recognition
Letters
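A rough sketch of the pipeline follows, with standard substitutes for the paper's components: a plain Pearson correlation estimator in place of the unbiased Fourier estimator, scikit-learn agglomerative clustering in place of the maximum likelihood clustering, and the adjusted Rand index as a stand-in configuration similarity metric (assumes a recent scikit-learn).

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

def cluster_config(returns, k=3):
    """Feature cluster configuration for one window of return data."""
    C = np.corrcoef(returns.T)   # Pearson stand-in for the Fourier estimator
    D = 1.0 - np.abs(C)          # correlation -> dissimilarity
    model = AgglomerativeClustering(n_clusters=k, metric="precomputed",
                                    linkage="average")
    return model.fit_predict(D)

rng = np.random.default_rng(1)
X = rng.standard_normal((500, 8))   # toy returns for 8 features
w1, w2 = cluster_config(X[:250]), cluster_config(X[250:])
print(adjusted_rand_score(w1, w2))  # similarity of the two state descriptors
```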
Local Aggregation for Unsupervised Learning of Visual Embeddings
Unsupervised approaches to learning in neural networks are of substantial
interest for furthering artificial intelligence, both because they would enable
the training of networks without the need for large numbers of expensive
annotations, and because they would be better models of the kind of
general-purpose learning deployed by humans. However, unsupervised networks
have long lagged behind the performance of their supervised counterparts,
especially in the domain of large-scale visual recognition. Recent developments
in training deep convolutional embeddings to maximize non-parametric instance
separation and clustering objectives have shown promise in closing this gap.
Here, we describe a method that trains an embedding function to maximize a
metric of local aggregation, causing similar data instances to move together in
the embedding space, while allowing dissimilar instances to separate. This
aggregation metric is dynamic, allowing soft clusters of different scales to
emerge. We evaluate our procedure on several large-scale visual recognition
datasets, achieving state-of-the-art unsupervised transfer learning performance
on object recognition in ImageNet, scene recognition in Places 205, and object
detection in PASCAL VOC.
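A paraphrase of such an aggregation objective in code (names, index sets, and the temperature are our assumptions, not the authors' released implementation): each embedding is pulled toward its close neighbors relative to a larger background neighbor set, via a softmax over a memory bank.

```python
import torch

def local_aggregation_loss(v, memory, close_idx, back_idx, tau=0.07):
    """v: (n, d) current embeddings; memory: (N, d) memory bank, both
    L2-normalized. close_idx: (n, c) indices of close neighbors C_i;
    back_idx: (n, b) indices of background neighbors B_i, with C_i in B_i."""
    sim = v @ memory.t() / tau               # (n, N) scaled similarities
    close = torch.gather(sim, 1, close_idx)  # similarities to C_i
    back = torch.gather(sim, 1, back_idx)    # similarities to B_i
    # -log of the probability mass of C_i relative to B_i
    loss = -(torch.logsumexp(close, dim=1) - torch.logsumexp(back, dim=1))
    return loss.mean()
```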
Julia Language in Machine Learning: Algorithms, Applications, and Open Issues
Machine learning is driving development across many fields in science and
engineering. A simple and efficient programming language could accelerate
applications of machine learning in various fields. Currently, the programming
languages most commonly used to develop machine learning algorithms include
Python, MATLAB, and C/C++. However, none of these languages balances
efficiency and simplicity well. The Julia language is a fast, easy-to-use,
open-source programming language, originally designed for high-performance
computing, that can balance efficiency and simplicity well. This paper
summarizes the related research work and developments in
the application of the Julia language in machine learning. It first surveys the
popular machine learning algorithms that are developed in the Julia language.
Then, it investigates applications of the machine learning algorithms
implemented with the Julia language. Finally, it discusses the open issues and
the potential future directions that arise in the use of the Julia language in
machine learning.
Comment: Published in Computer Science Review
A survey on trajectory clustering analysis
This paper comprehensively surveys the development of trajectory clustering.
Considering the critical role of trajectory data mining in modern intelligent
systems for surveillance security, abnormal behavior detection, crowd behavior
analysis, and traffic control, trajectory clustering has attracted growing
attention. Existing trajectory clustering methods can be grouped into three
categories: unsupervised, supervised, and semi-supervised algorithms. Despite
a certain level of development, the success of trajectory clustering remains
limited by complex conditions such as application scenarios and data
dimensions. This paper provides a holistic understanding of and deep insight
into trajectory clustering, and presents a comprehensive analysis of
representative methods and promising future directions.
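As one concrete instance of the unsupervised category, the sketch below (generic, not a method from the survey) clusters 2-D trajectories by their pairwise symmetric Hausdorff distance followed by average-linkage hierarchical clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import directed_hausdorff, squareform

def hausdorff(a, b):
    """Symmetric Hausdorff distance between two point sequences."""
    return max(directed_hausdorff(a, b)[0], directed_hausdorff(b, a)[0])

rng = np.random.default_rng(2)
trajs = [rng.standard_normal((30, 2)).cumsum(axis=0) for _ in range(12)]  # toy tracks

n = len(trajs)
D = np.zeros((n, n))
for i in range(n):
    for j in range(i + 1, n):
        D[i, j] = D[j, i] = hausdorff(trajs[i], trajs[j])

labels = fcluster(linkage(squareform(D), method="average"), t=3, criterion="maxclust")
print(labels)
```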
Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues
As a key technique for enabling artificial intelligence, machine learning
(ML) is capable of solving complex problems without explicit programming.
Motivated by its successful applications to many practical tasks like image
recognition, both industry and the research community have advocated the
applications of ML in wireless communication. This paper comprehensively
surveys the recent advances of the applications of ML in wireless
communication, which are classified as: resource management in the MAC layer,
networking and mobility management in the network layer, and localization in
the application layer. The applications in resource management further include
power control, spectrum management, backhaul management, cache management,
beamformer design and computation resource management, while ML based
networking focuses on the applications in clustering, base station switching
control, user association, and routing. Moreover, the literature on each
aspect is organized according to the adopted ML techniques. In addition,
several conditions for applying ML to wireless communication are identified
to help readers decide whether to use ML and which kind of ML technique to
use. Traditional approaches are also summarized, together with a comparison
of their performance against ML-based approaches, to clarify the motivations
of the surveyed works for adopting ML. Given the extensiveness of the
research area, challenges and unresolved issues are presented to facilitate
future studies, including ML-based network slicing, infrastructure updates to
support ML-based paradigms, open data sets and platforms for researchers, and
theoretical guidance for ML implementation.
Comment: 34 pages, 8 figures
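To ground one of the listed applications, here is a generic tabular Q-learning sketch for base-station ON/OFF switching; the state space, traffic dynamics, and reward are invented toy assumptions, not taken from the survey.

```python
import numpy as np

rng = np.random.default_rng(3)
n_states, n_actions = 5, 2        # discretized load levels x {OFF, ON}
Q = np.zeros((n_states, n_actions))
alpha, gamma, eps = 0.1, 0.9, 0.1

def reward(state, action):
    # ON (1) costs energy; OFF (0) under high load risks overload.
    return -1.0 * action - 5.0 * (1 - action) * (state >= 3)

state = 0
for _ in range(5000):
    # epsilon-greedy action selection
    action = rng.integers(n_actions) if rng.random() < eps else int(Q[state].argmax())
    r = reward(state, action)
    next_state = int(rng.integers(n_states))   # toy i.i.d. traffic dynamics
    Q[state, action] += alpha * (r + gamma * Q[next_state].max() - Q[state, action])
    state = next_state

print(Q)   # learned ON/OFF preferences per load level
```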
Training a Restricted Boltzmann Machine for Classification by Labeling Model Samples
We propose an alternative method for training a classification model. Using
the MNIST set of handwritten digits and Restricted Boltzmann Machines, it is
possible to reach classification performance competitive with semi-supervised
learning if we first train a model in an unsupervised fashion on unlabeled
data only, and then manually add labels to model samples, rather than to
training data samples, with the help of a GUI. This approach benefits from
the fact that model samples can be presented to the human labeler in a
video-like fashion, resulting in a higher number of labeled examples. Also,
after some initial training, hard-to-classify examples can be distinguished
from easy ones automatically, saving manual work.
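A minimal sketch of the pipeline, with scikit-learn's BernoulliRBM standing in for the paper's model and random binary data in place of MNIST; the GUI labeling step is a placeholder comment.

```python
import numpy as np
from sklearn.neural_network import BernoulliRBM

rng = np.random.default_rng(4)
X = (rng.random((1000, 784)) > 0.8).astype(float)  # stand-in for binarized MNIST

# 1) Unsupervised training on unlabeled data only.
rbm = BernoulliRBM(n_components=256, learning_rate=0.05, n_iter=10, random_state=0)
rbm.fit(X)

# 2) Draw model samples by Gibbs sampling; the paper would show such
#    frames to a human labeler in a video-like GUI.
v = X[:16].copy()
for _ in range(200):
    v = rbm.gibbs(v)          # one Gibbs step: visible -> hidden -> visible

# labels = gui_label(v)       # hypothetical manual-labeling step
```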