
    Big Data Analytics in Bioinformatics: A Machine Learning Perspective

    Bioinformatics research is characterized by voluminous and incremental datasets and complex data analytics methods. The machine learning methods used in bioinformatics are iterative and parallel. These methods can be scaled to handle big data using distributed and parallel computing technologies. Big data tools usually perform computation in batch mode and are not optimized for iterative processing or for high data dependency among operations. In recent years, parallel, incremental, and multi-view machine learning algorithms have been proposed. Similarly, graph-based architectures and in-memory big data tools have been developed to minimize I/O cost and optimize iterative processing. However, standard big data architectures and tools are lacking for many important bioinformatics problems, such as fast construction of co-expression and regulatory networks and salient module identification, detection of complexes over growing protein-protein interaction data, fast analysis of massive DNA, RNA, and protein sequence data, and fast querying on incremental and heterogeneous disease networks. This paper addresses the issues and challenges posed by several big data problems in bioinformatics, and gives an overview of the state of the art and the future research opportunities. Comment: 20 pages survey paper on Big data analytics in Bioinformatic

    Fast communication-efficient spectral clustering over distributed data

    The last decades have seen a surge of interest in distributed computing thanks to advances in clustered computing and big data technology. Existing distributed algorithms typically assume that all the data are already in one place, and divide the data and conquer on multiple machines. However, the data are increasingly often located at a number of distributed sites, and one wishes to compute over all the data with low communication overhead. For spectral clustering, we propose a novel framework that enables its computation over such distributed data, with "minimal" communication while achieving a major speedup in computation. The loss in accuracy is negligible compared to the non-distributed setting. Our approach allows local parallel computing where the data are located, thus turning the distributed nature of the data into a blessing; the speedup is most substantial when the data are evenly distributed across sites. Experiments on synthetic and large UC Irvine datasets show almost no loss in accuracy with our approach and about a 2x speedup under various settings with two distributed sites. As the transmitted data need not be in their original form, our framework readily addresses the privacy concern for data sharing in distributed computing. Comment: 27 pages, 7 figure

    DAME: A Distributed Data Mining & Exploration Framework within the Virtual Observatory

    Nowadays, many scientific areas share the same broad requirements: being able to deal with massive and distributed datasets while, when possible, being integrated with services and applications. To close the growing gap between the incremental generation of data and our understanding of it, one must know how to access, retrieve, analyze, mine and integrate data from disparate sources. A fundamental aspect of any new-generation data mining tool or package that aims to become a service for the community is the possibility of using it within complex workflows, which each user can fine-tune to match the specific demands of their scientific goal. These workflows often need to access different resources (data, providers, computing facilities and packages) and require strict interoperability on (at least) the client side. The DAME (DAta Mining & Exploration) project arises from these requirements by providing a distributed web-based data mining infrastructure specialized in the exploration of Massive Data Sets with Soft Computing methods. Originally designed to deal with astrophysical use cases, where the first scientific application examples have demonstrated its effectiveness, the DAME Suite is a multi-disciplinary, platform-independent tool compliant with modern KDD (Knowledge Discovery in Databases) requirements and Information & Communication Technology trends. Comment: 20 pages, INGRID 2010 - 5th International Workshop on Distributed Cooperative Laboratories: "Instrumenting" the Grid, May 12-14, 2010, Poznan, Poland; Volume Remote Instrumentation for eScience and Related Aspects, 2011, F. Davoli et al. (eds.), SPRINGER N

    Computational Intelligence Challenges and Applications on Large-Scale Astronomical Time Series Databases

    Time-domain astronomy (TDA) is facing a paradigm shift caused by the exponential growth of the sample size, data complexity and data generation rates of new astronomical sky surveys. For example, the Large Synoptic Survey Telescope (LSST), which will begin operations in northern Chile in 2022, will generate a nearly 150 Petabyte imaging dataset of the southern hemisphere sky. The LSST will stream data at rates of 2 Terabytes per hour, effectively capturing an unprecedented movie of the sky. The LSST is expected not only to improve our understanding of time-varying astrophysical objects, but also to reveal a plethora of yet unknown faint and fast-varying phenomena. To cope with a change of paradigm to data-driven astronomy, the fields of astroinformatics and astrostatistics have been created recently. The new data-oriented paradigms for astronomy combine statistics, data mining, knowledge discovery, machine learning and computational intelligence, in order to provide the automated and robust methods needed for the rapid detection and classification of known astrophysical objects as well as the unsupervised characterization of novel phenomena. In this article we present an overview of machine learning and computational intelligence applications to TDA. Future big data challenges and new lines of research in TDA, focusing on the LSST, are identified and discussed from the viewpoint of computational intelligence/machine learning. Interdisciplinary collaboration will be required to cope with the challenges posed by the deluge of astronomical data coming from the LSST

    Edge Computing Aware NOMA for 5G Networks

    With the fast development of the Internet of Things (IoT), fifth generation (5G) wireless networks need to provide massive connectivity for IoT devices and meet the demand for low latency. To satisfy these requirements, Non-Orthogonal Multiple Access (NOMA) has been recognized as a promising solution for 5G networks to significantly improve the network capacity. In parallel with the development of NOMA techniques, Mobile Edge Computing (MEC) is becoming one of the key emerging technologies to reduce the latency and improve the Quality of Service (QoS) for 5G networks. In order to capture the potential gains of NOMA in the context of MEC, this paper proposes an edge computing aware NOMA technique that exploits uplink NOMA to reduce MEC users' uplink energy consumption. To this end, we formulate a NOMA based optimization framework which minimizes the energy consumption of MEC users by optimizing the user clustering, the computing and communication resource allocation, and the transmit powers. In particular, similar to frequency Resource Blocks (RBs), we divide the computing capacity available at the cloudlet into computing RBs. Accordingly, we explore the joint allocation of the frequency and computing RBs to the users that are assigned to different order indices within the NOMA clusters. We also design an efficient heuristic algorithm for user clustering and RB allocation, and formulate a convex optimization problem for the power control to be solved independently per NOMA cluster. The performance of the proposed NOMA scheme is evaluated via simulations.

    CADDeLaG: Framework for distributed anomaly detection in large dense graph sequences

    Random walk based distance measures for graphs, such as commute-time distance, are useful in a variety of graph algorithms, such as clustering, anomaly detection, and creating low dimensional embeddings. Since such measures hinge on the spectral decomposition of the graph, the computation becomes a bottleneck for large graphs and does not scale easily to graphs that cannot be loaded in memory. Most existing graph mining libraries for large graphs either resort to sampling or exploit the sparsity structure of such graphs for spectral analysis. However, such methods do not work for dense graphs constructed for studying pairwise relationships among entities in a data set. Examples of such studies include analyzing pairwise locations in gridded climate data for discovering long distance climate phenomena. These graph representations are fully connected by construction and cannot be sparsified without loss of meaningful information. In this paper we describe CADDeLaG, a framework for scalable computation of commute-time distance based anomaly detection in large dense graphs without the need to load the entire graph in memory. The framework relies on Apache Spark's memory-centric cluster-computing infrastructure and consists of two building blocks: a decomposable algorithm for commute time distance computation and a distributed linear system solver. We illustrate the scalability of CADDeLaG and its dependency on various factors using both synthetic and real world data sets. We demonstrate the usefulness of CADDeLaG in identifying anomalies, historically missed due to ad hoc graph sparsification, in a climate graph sequence, and in an election donation data set.
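
The commute-time distance mentioned in the abstract above can be illustrated on a tiny graph. The sketch below is not CADDeLaG's decomposable algorithm and does not use Spark; it is a single-machine toy that assumes a small, connected, undirected graph given as a dense weight matrix. It relies on the standard identity that commute time equals the graph volume times the effective resistance, and obtains the resistance by solving one Laplacian linear system. The function names `solve_linear_system` and `commute_time` are hypothetical and chosen for illustration only.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the caller's matrix is not modified.
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system (is the graph connected?)")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    # Back substitution.
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def commute_time(weights, i, j):
    """Commute-time distance between nodes i and j of a connected graph.

    weights is a symmetric dense matrix of non-negative edge weights.
    """
    n = len(weights)
    if i == j:
        return 0.0
    degrees = [sum(row) for row in weights]
    volume = sum(degrees)
    # Graph Laplacian L = D - W.
    lap = [[(degrees[r] if r == c else 0.0) - weights[r][c]
            for c in range(n)] for r in range(n)]
    # Ground the last node (potential 0) to make the system non-singular.
    keep = list(range(n - 1))
    reduced = [[lap[r][c] for c in keep] for r in keep]
    # Inject one unit of current at i and extract it at j.
    rhs = [(1.0 if k == i else 0.0) - (1.0 if k == j else 0.0) for k in keep]
    potentials = solve_linear_system(reduced, rhs) + [0.0]
    effective_resistance = potentials[i] - potentials[j]
    return volume * effective_resistance


# Path graph 0 - 1 - 2 with unit edge weights.
path = [[0.0, 1.0, 0.0],
        [1.0, 0.0, 1.0],
        [0.0, 1.0, 0.0]]
print(commute_time(path, 0, 2))
```

The dense solve here costs cubic time and needs the whole matrix in memory, which is exactly the limitation the abstract says the framework is designed to avoid.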

    Distributed mining of large scale remote sensing image archives on public computing infrastructures

    Earth Observation (EO) mining aims at supporting efficient access and exploration of petabyte-scale space- and airborne remote sensing archives that are currently expanding at rates of terabytes per day. A significant challenge is performing the analysis required by envisaged applications, such as process mapping for environmental risk management, in reasonable time. In this work, we address the problem of content-based image retrieval via example-based queries from EO data archives. In particular, we focus on the analysis of polarimetric SAR data, for which target decomposition theorems have proved fundamental in discovering patterns in data and characterizing the ground scattering properties. To this end, we propose an interactive region-oriented content-based image mining system in which 1) unsupervised ingestion processes are distributed onto virtual machines in elastic, on-demand computing infrastructures; 2) archive-scale hierarchical content indexing is implemented in terms of a "big data" analytics cluster-computing framework; and 3) query processing amounts to traversing the generated binary tree index, computing distances that correspond to descriptor-based similarity measures between image groups and a query image tile. We describe in depth both the strategies and the actual implementations for the ingestion and indexing components, and verify the approach by experiments carried out on the NASA/JPL UAVSAR full polarimetric data archive. We report the results of the tests performed on computer clusters by using a public Infrastructure-as-a-Service and evaluating the impact of cluster configuration on system performance. Results are promising for data mapping and information retrieval applications.

    Application of Machine Learning in Wireless Networks: Key Techniques and Open Issues

    As a key technique for enabling artificial intelligence, machine learning (ML) is capable of solving complex problems without explicit programming. Motivated by its successful applications to many practical tasks like image recognition, both industry and the research community have advocated the application of ML in wireless communication. This paper comprehensively surveys recent advances in the application of ML in wireless communication, which are classified as: resource management in the MAC layer, networking and mobility management in the network layer, and localization in the application layer. The applications in resource management further include power control, spectrum management, backhaul management, cache management, beamformer design and computation resource management, while ML based networking focuses on the applications in clustering, base station switching control, user association and routing. Moreover, the literature in each aspect is organized according to the adopted ML techniques. In addition, several conditions for applying ML to wireless communication are identified to help readers decide whether to use ML and which kind of ML techniques to use. Traditional approaches are also summarized together with their performance comparison with ML based approaches, which clarifies the motivations of the surveyed literature for adopting ML. Given the extensiveness of the research area, challenges and unresolved issues are presented to facilitate future studies, where ML based network slicing, infrastructure update to support ML based paradigms, open data sets and platforms for researchers, theoretical guidance for ML implementation and so on are discussed. Comment: 34 pages, 8 figure

    clusterNOR: A NUMA-Optimized Clustering Framework

    Clustering algorithms are iterative and have complex data access patterns that result in many small random memory accesses. The performance of parallel implementations suffers from synchronous barriers for each iteration and skewed workloads. We rethink the parallelization of clustering for modern non-uniform memory architectures (NUMA) to maximize independent, asynchronous computation. We eliminate many barriers, reduce remote memory accesses, and maximize cache reuse. We implement the 'Clustering NUMA Optimized Routines' (clusterNOR) extensible parallel framework that provides algorithmic building blocks. The system is generic: we demonstrate nine modern clustering algorithms that have simple implementations. clusterNOR includes (i) in-memory, (ii) semi-external memory, and (iii) distributed memory execution, enabling computation for varying memory and hardware budgets. For algorithms that rely on Euclidean distance, clusterNOR defines an updated Elkan's triangle inequality pruning algorithm that uses asymptotically less memory so that it works on billion-point data sets. clusterNOR extends and expands the scope of the 'knor' library for k-means clustering by generalizing underlying principles, providing a uniform programming interface and expanding the scope to hierarchical and linear algebraic classes of algorithms. The compound effect of our optimizations is an order of magnitude improvement in speed over other state-of-the-art solutions, such as Spark's MLlib and Apple's Turi. Comment: arXiv admin note: Journal version of arXiv:1606.0890
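
The triangle inequality pruning referred to in the abstract above rests on a simple observation: if the distance between two centers c and c' is at least twice the distance from a point x to c, then x cannot be closer to c' than to c, so the distance to c' need not be computed. The sketch below shows only this basic test inside a k-means assignment step. It is not clusterNOR's memory-reduced variant or the 'knor' API, and the names `euclidean` and `assign_with_pruning` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def assign_with_pruning(points, centers):
    """Assign each point to its nearest center, skipping provably useless
    distance computations via the triangle inequality.

    Returns (labels, skipped), where skipped counts the point-to-center
    distances that were never computed.
    """
    k = len(centers)
    # Center-to-center distances are computed once per assignment pass.
    center_dist = [[euclidean(centers[a], centers[b]) for b in range(k)]
                   for a in range(k)]
    labels = []
    skipped = 0
    for p in points:
        best = 0
        best_dist = euclidean(p, centers[0])
        for c in range(1, k):
            # If d(best, c) >= 2 * d(p, best), then d(p, c) >= d(p, best),
            # so center c cannot improve on the current assignment.
            if center_dist[best][c] >= 2.0 * best_dist:
                skipped += 1
                continue
            d = euclidean(p, centers[c])
            if d < best_dist:
                best, best_dist = c, d
        labels.append(best)
    return labels, skipped


points = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.2, 0.0)]
centers = [(0.0, 0.0), (10.0, 0.0)]
labels, skipped = assign_with_pruning(points, centers)
print(labels, skipped)
```

The full Elkan algorithm additionally keeps per-point upper and lower bounds across iterations; the memory cost of those bounds is what the abstract says the updated variant reduces.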

    Parallel Spectral Clustering Algorithm Based on Hadoop

    Spectral clustering and cloud computing are emerging branches of computer science and related disciplines. Spectral clustering overcomes the shortcomings of some traditional clustering algorithms and guarantees convergence to the optimal solution, and has therefore attracted widespread attention. This article first introduces the research background and significance of the parallel spectral clustering algorithm, then gives a detailed introduction to the Hadoop cloud computing framework, then introduces spectral clustering and the method and steps for parallelizing the spectral clustering algorithm, and finally presents the related experiments and summarizes their results.