Big Data Analytics in Bioinformatics: A Machine Learning Perspective
Bioinformatics research is characterized by voluminous and incremental
datasets and complex data analytics methods. The machine learning methods used
in bioinformatics are iterative and parallel, and can be scaled to handle big
data using distributed and parallel computing technologies. However, big data
tools usually perform computation in batch mode and are not optimized for
iterative processing or for high data dependency among operations. In recent
years, parallel, incremental, and multi-view machine learning algorithms have
been proposed. Similarly, graph-based architectures and in-memory big data
tools have been developed to minimize I/O cost and optimize iterative
processing.
However, standard big data architectures and tools are still lacking for many
important bioinformatics problems, such as fast construction of co-expression
and regulatory networks and salient module identification, detection of
complexes over growing protein-protein interaction data, fast analysis of
massive DNA, RNA, and protein sequence data, and fast querying on incremental
and heterogeneous disease networks. This paper addresses the issues and
challenges posed by several big data problems in bioinformatics, and gives an
overview of the state of the art and of future research opportunities.
Fast communication-efficient spectral clustering over distributed data
The last decades have seen a surge of interest in distributed computing
thanks to advances in clustered computing and big data technology. Existing
distributed algorithms typically assume that all the data are already in one
place, and divide the data and conquer on multiple machines. However, it is
increasingly common that the data are located at a number of distributed
sites, and one wishes to compute over all the data with low communication
overhead.
For spectral clustering, we propose a novel framework that enables its
computation over such distributed data, with "minimal" communication and a
major speedup in computation. The loss in accuracy is negligible compared to
the non-distributed setting. Our approach allows local parallel computing
where the data are located, thus turning the distributed nature of the data
into a blessing; the speedup is most substantial when the data are evenly
distributed across sites. Experiments on synthetic and large UC Irvine
datasets show almost no loss in accuracy with our approach, while achieving
about a 2x speedup under various settings with two distributed sites. As the
transmitted data need not be in their original form, our framework readily
addresses privacy concerns for data sharing in distributed computing.
DAME: A Distributed Data Mining & Exploration Framework within the Virtual Observatory
Nowadays, many scientific areas share the same broad requirement of being
able to deal with massive and distributed datasets while, when possible,
being integrated with services and applications. To close the growing gap
between the ever-increasing generation of data and our understanding of it,
we must know how to access, retrieve, analyze, mine, and integrate data from
disparate sources. A fundamental aspect of any new-generation data mining
tool or package that aims to become a service for the community is the
ability to use it within complex workflows, which each user can fine-tune to
match the specific demands of their scientific goals.
These workflows often need to access different resources (data, providers,
computing facilities, and packages) and require strict interoperability on
(at least) the client side. The DAME (DAta Mining & Exploration) project
arises from these requirements by providing a distributed web-based data
mining infrastructure specialized in the exploration of massive data sets
with soft computing methods. Originally designed for astrophysical use cases,
where the first scientific applications have demonstrated its effectiveness,
the DAME Suite has grown into a multi-disciplinary, platform-independent tool
fully compliant with modern KDD (Knowledge Discovery in Databases)
requirements and Information & Communication Technology trends.
Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases
Time-domain astronomy (TDA) is facing a paradigm shift caused by the
exponential growth of the sample size, data complexity and data generation
rates of new astronomical sky surveys. For example, the Large Synoptic Survey
Telescope (LSST), which will begin operations in northern Chile in 2022, will
generate a nearly 150 Petabyte imaging dataset of the southern hemisphere sky.
The LSST will stream data at rates of 2 Terabytes per hour, effectively
capturing an unprecedented movie of the sky. The LSST is expected not only to
improve our understanding of time-varying astrophysical objects, but also to
reveal a plethora of as-yet-unknown faint and fast-varying phenomena. To cope
with this paradigm shift toward data-driven astronomy, the fields of
astroinformatics and astrostatistics have recently emerged. The new data-oriented paradigms
for astronomy combine statistics, data mining, knowledge discovery, machine
learning and computational intelligence, in order to provide the automated and
robust methods needed for the rapid detection and classification of known
astrophysical objects as well as the unsupervised characterization of novel
phenomena. In this article we present an overview of machine learning and
computational intelligence applications to TDA. Future big data challenges and
new lines of research in TDA, focusing on the LSST, are identified and
discussed from the viewpoint of computational intelligence/machine learning.
Interdisciplinary collaboration will be required to cope with the challenges
posed by the deluge of astronomical data coming from the LSST.
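As a flavor of the automated time-domain analysis discussed above, the following sketch extracts two simple light-curve features (best Lomb-Scargle period and peak-to-peak amplitude) using astropy. The survey does not prescribe this particular pipeline; the synthetic light curve and parameter values are purely illustrative.

```python
import numpy as np
from astropy.timeseries import LombScargle

def light_curve_features(t, mag):
    """Return (best_period, peak_to_peak_amplitude) for one light curve."""
    frequency, power = LombScargle(t, mag).autopower()
    return 1.0 / frequency[np.argmax(power)], np.ptp(mag)

# Synthetic periodic variable: 0.5-day period, irregular sampling, small noise.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 300))
mag = 15.0 + 0.4 * np.sin(2 * np.pi * t / 0.5) + rng.normal(0.0, 0.02, 300)
period, amplitude = light_curve_features(t, mag)
print(f"period ~ {period:.3f} d, amplitude ~ {amplitude:.2f} mag")
```

Features of this kind typically feed a downstream classifier for the rapid detection and classification of variable objects that the article surveys.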
Edge Computing Aware NOMA for 5G Networks
With the fast development of Internet of things (IoT), the fifth generation
(5G) wireless networks need to provide massive connectivity of IoT devices and
meet the demand for low latency. To satisfy these requirements, Non-Orthogonal
Multiple Access (NOMA) has been recognized as a promising solution for 5G
networks to significantly improve the network capacity. In parallel with the
development of NOMA techniques, Mobile Edge Computing (MEC) is becoming one of
the key emerging technologies to reduce the latency and improve the Quality of
Service (QoS) for 5G networks. In order to capture the potential gains of NOMA
in the context of MEC, this paper proposes an edge computing aware NOMA
technique which can enjoy the benefits of uplink NOMA in reducing MEC users'
uplink energy consumption. To this end, we formulate a NOMA based optimization
framework which minimizes the energy consumption of MEC users via optimizing
the user clustering, computing and communication resource allocation, and
transmit powers. In particular, similar to frequency Resource Blocks (RBs), we
divide the computing capacity available at the cloudlet into computing RBs.
Accordingly, we explore the joint allocation of the frequency and computing RBs
to the users that are assigned to different order indices within the NOMA
clusters. We also design an efficient heuristic algorithm for user clustering
and RBs allocation, and formulate a convex optimization problem for the power
control to be solved independently per NOMA cluster. The performance of the
proposed NOMA scheme is evaluated via simulations.
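The paper's exact heuristic is not reproduced here, but a minimal sketch of a common NOMA clustering rule (pair the strongest user with the weakest so their channel gains are well separated) plus a round-robin resource-block assignment illustrates the kind of user-clustering and RB-allocation step involved. The rule and all names below are assumptions for illustration only.

```python
import numpy as np

def pair_noma_clusters(channel_gains):
    """Pair strongest with weakest, 2nd strongest with 2nd weakest, etc.
    Returns a list of (strong_idx, weak_idx) two-user NOMA clusters."""
    order = np.argsort(channel_gains)[::-1]  # strongest first
    n = len(order)
    return [(order[i], order[n - 1 - i]) for i in range(n // 2)]

def assign_rbs(clusters, n_freq_rbs, n_comp_rbs):
    """Round-robin assignment of frequency and computing RBs to clusters."""
    return [
        {"cluster": c,
         "freq_rb": i % n_freq_rbs,   # shared by both users via NOMA
         "comp_rb": i % n_comp_rbs}   # slice of the cloudlet's capacity
        for i, c in enumerate(clusters)
    ]

gains = np.array([0.9, 0.1, 0.7, 0.3, 0.5, 0.2])
print(assign_rbs(pair_noma_clusters(gains), n_freq_rbs=3, n_comp_rbs=2))
```

In the paper's framework, a convex power-control problem would then be solved independently within each such cluster.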
CADDeLaG: Framework for distributed anomaly detection in large dense graph sequences
Random walk based distance measures for graphs such as commute-time distance
are useful in a variety of graph algorithms, such as clustering, anomaly
detection, and creating low dimensional embeddings. Since such measures hinge
on the spectral decomposition of the graph, the computation becomes a
bottleneck for large graphs and does not scale easily to graphs that cannot be
loaded in memory. Most existing graph mining libraries for large graphs either
resort to sampling or exploit the sparsity structure of such graphs for
spectral analysis. However, such methods do not work for dense graphs
constructed for studying pairwise relationships among entities in a data set.
Examples of such studies include analyzing pairwise relationships among
locations in gridded climate data to discover long-distance climate
phenomena. These graph representations are fully connected by construction
and cannot be sparsified
without loss of meaningful information. In this paper we describe CADDeLaG, a
framework for scalable computation of commute-time distance based anomaly
detection in large dense graphs without the need to load the entire graph in
memory. The framework relies on Apache Spark's memory-centric cluster-computing
infrastructure and consists of two building blocks: a decomposable algorithm
for commute time distance computation and a distributed linear system solver.
We illustrate the scalability of CADDeLaG and its dependency on various factors
using both synthetic and real world data sets. We demonstrate the usefulness of
CADDeLaG by identifying anomalies in a climate graph sequence that have
historically been missed due to ad hoc graph sparsification, and in an
election donation data set.
Distributed mining of large scale remote sensing image archives on public computing infrastructures
Earth Observation (EO) mining aims at supporting efficient access and
exploration of petabyte-scale space- and airborne remote sensing archives that
are currently expanding at rates of terabytes per day. A significant challenge
is performing the analysis required by envisaged applications (for instance,
process mapping for environmental risk management) in reasonable
time. In this work, we address the problem of content-based image retrieval via
example-based queries from EO data archives. In particular, we focus on the
analysis of polarimetric SAR data, for which target decomposition theorems have
proved fundamental in discovering patterns in the data and characterizing the ground
scattering properties. To this end, we propose an interactive region-oriented
content-based image mining system in which 1) unsupervised ingestion
processes are distributed onto virtual machines in elastic, on-demand
computing infrastructures; 2) archive-scale hierarchical content indexing is
implemented in terms of a "big data" analytics cluster-computing framework;
and 3) query processing amounts to traversing the generated binary tree
index, computing distances that correspond to descriptor-based similarity
measures between image groups and a query image tile. We describe in depth both the strategies and the
actual implementations for the ingestion and indexing components, and verify
the approach by experiments carried out on the NASA/JPL UAVSAR full
polarimetric data archive. We report the results of the tests performed on
computer clusters by using a public Infrastructure-as-a-Service and evaluating
the impact of cluster configuration on system performance. Results are
promising for data mapping and information retrieval applications.
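As a rough illustration of the query step described above (greedy descent of a binary tree index using descriptor distances), here is a hedged sketch; the node layout and the Euclidean descriptor distance are hypothetical choices, not the system's actual index format.

```python
import numpy as np

class Node:
    """Hypothetical index node: internal nodes hold a group descriptor and
    two children; leaves hold the tile ids of one image group."""
    def __init__(self, descriptor, left=None, right=None, tiles=None):
        self.descriptor = np.asarray(descriptor)
        self.left, self.right = left, right
        self.tiles = tiles or []

def query(root, q):
    """Greedy descent: at each internal node, follow the child whose group
    descriptor is nearer to the query tile's descriptor q."""
    node = root
    while node.left is not None:
        d_left = np.linalg.norm(q - node.left.descriptor)
        d_right = np.linalg.norm(q - node.right.descriptor)
        node = node.left if d_left <= d_right else node.right
    return node.tiles  # candidate tiles; final ranking happens downstream
```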
Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues
As a key technique for enabling artificial intelligence, machine learning
(ML) is capable of solving complex problems without explicit programming.
Motivated by its successful applications to many practical tasks like image
recognition, both industry and the research community have advocated the
application of ML in wireless communication. This paper comprehensively
surveys recent advances in applying ML to wireless communication, classified
as: resource management in the MAC layer,
networking and mobility management in the network layer, and localization in
the application layer. The applications in resource management further include
power control, spectrum management, backhaul management, cache management,
beamformer design and computation resource management, while ML based
networking focuses on the applications in clustering, base station switching
control, user association, and routing. Moreover, the literature on each
aspect is organized according to the adopted ML techniques. In addition, several
conditions for applying ML to wireless communication are identified to help
readers decide whether to use ML and which kind of ML techniques to use, and
traditional approaches are also summarized together with their performance
comparison with ML based approaches, based on which the motivations of the
surveyed works to adopt ML are clarified. Given the extensiveness of the research
area, challenges and unresolved issues are presented to facilitate future
studies, where ML based network slicing, infrastructure update to support ML
based paradigms, open data sets and platforms for researchers, theoretical
guidance for ML implementation and so on are discussed.
clusterNOR: A NUMA-Optimized Clustering Framework
Clustering algorithms are iterative and have complex data access patterns
that result in many small random memory accesses. The performance of parallel
implementations suffers from synchronous barriers at each iteration and from
skewed workloads. We rethink the parallelization of clustering for modern
non-uniform memory architectures (NUMA) to maximize independent,
asynchronous computation.
We eliminate many barriers, reduce remote memory accesses, and maximize cache
reuse. We implement the 'Clustering NUMA Optimized Routines' (clusterNOR)
extensible parallel framework that provides algorithmic building blocks. The
system is generic; we demonstrate nine modern clustering algorithms that have
simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external
memory, and (iii) distributed memory execution, enabling computation for
varying memory and hardware budgets. For algorithms that rely on Euclidean
distance, clusterNOR defines an updated Elkan's triangle inequality pruning
algorithm that uses asymptotically less memory so that it works on
billion-point data sets. clusterNOR extends the 'knor' library for k-means
clustering by generalizing its underlying principles, providing a uniform
programming interface, and broadening its scope to hierarchical and linear
algebraic classes of algorithms. The compound effect of our
optimizations is an order of magnitude improvement in speed over other
state-of-the-art solutions, such as Spark's MLlib and Apple's Turi.
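To illustrate the triangle-inequality pruning that clusterNOR builds on, here is a minimal single-threaded assignment pass: if the distance between the current best center and another center is at least twice the point's distance to its best center, the other center cannot win, so its distance is never computed. clusterNOR's updated variant reduces the memory footprint further; this sketch keeps the full center-center matrix for clarity and is not the library's implementation.

```python
import numpy as np

def assign_with_pruning(X, centers):
    """One k-means assignment pass with Elkan-style pruning."""
    cc = np.linalg.norm(centers[:, None] - centers[None, :], axis=2)
    labels = np.empty(len(X), dtype=int)
    pruned = 0
    for i, x in enumerate(X):
        best, d_best = 0, np.linalg.norm(x - centers[0])
        for j in range(1, len(centers)):
            # If d(c_best, c_j) >= 2 * d(x, c_best), the triangle inequality
            # gives d(x, c_j) >= d(x, c_best): center j can be skipped.
            if cc[best, j] >= 2.0 * d_best:
                pruned += 1
                continue
            d = np.linalg.norm(x - centers[j])
            if d < d_best:
                best, d_best = j, d
        labels[i] = best
    return labels, pruned
```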
Parallel Spectral Clustering Algorithm Based on Hadoop
Spectral clustering and cloud computing are emerging branches of computer
science and related disciplines. Spectral clustering overcomes the
shortcomings of some traditional clustering algorithms and guarantees
convergence to the optimal solution, and has therefore received widespread
attention. This article first introduces the research background and
significance of parallel spectral clustering, then gives a detailed
introduction to the Hadoop cloud computing framework and to spectral
clustering itself, then presents the parallelization of the spectral
clustering method and the relevant steps, and finally reports and summarizes
the related experiments.
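For readers unfamiliar with the underlying steps, a compact single-machine reference implementation of spectral clustering follows; the similarity-matrix construction and the eigen-decomposition are the stages a Hadoop/MapReduce version would parallelize. The Gaussian kernel and sigma are illustrative assumptions, not choices taken from the paper.

```python
import numpy as np
from scipy.linalg import eigh
from sklearn.cluster import KMeans

def spectral_clustering(X, k, sigma=1.0):
    # 1. Gaussian similarity matrix (embarrassingly parallel per pair).
    sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-sq / (2.0 * sigma**2))
    np.fill_diagonal(W, 0.0)
    # 2. Symmetric normalized Laplacian L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(X)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    # 3. Embed each point using the k smallest eigenvectors of L.
    _, U = eigh(L, subset_by_index=[0, k - 1])
    U /= np.linalg.norm(U, axis=1, keepdims=True)  # row-normalize
    # 4. Cluster the embedded rows with k-means.
    return KMeans(n_clusters=k, n_init=10).fit_predict(U)
```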