Search CORE

42,230 research outputs found

The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: 'The Royal Society'
Publication date: 20/01/2020
Field of study

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large scale computational platforms to meet both the memory and computational requirements. These applications differ from scientific simulations that dominate the workload on high end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing

arXiv.org e-Print Archive

eScholarship - University of California

Spectral Sequence Motif Discovery

Author: Colombo Nicolò
Vlassis Nikos
Publication venue
Publication date: 01/01/2014
Field of study

Sequence discovery tools play a central role in several fields of computational biology. In the framework of Transcription Factor binding studies, motif finding algorithms of increasingly high performance are required to process the big datasets produced by new high-throughput sequencing technologies. Most existing algorithms are computationally demanding and often cannot support the large size of new experimental data. We present a new motif discovery algorithm that is built on a recent machine learning technique, referred to as Method of Moments. Based on spectral decompositions, this method is robust under model misspecification and is not prone to locally optimal solutions. We obtain an algorithm that is extremely fast and designed for the analysis of big sequencing data. In a few minutes, we can process datasets of hundreds of thousand sequences and extract motif profiles that match those computed by various state-of-the-art algorithms.Comment: 20 pages, 3 figures, 1 tabl

arXiv.org e-Print Archive

CiteSeerX

Open Repository and Bibliography - Luxembourg

Efficient Algorithms for the Closest Pair Problem and Applications

Author: Pathak Sudipta
Rajasekaran Sanguthevar
Publication venue
Publication date: 21/07/2014
Field of study

The closest pair problem (CPP) is one of the well studied and fundamental problems in computing. Given a set of points in a metric space, the problem is to identify the pair of closest points. Another closely related problem is the fixed radius nearest neighbors problem (FRNNP). Given a set of points and a radius

R

, the problem is, for every input point

p

, to identify all the other input points that are within a distance of

R

from

p

. A naive deterministic algorithm can solve these problems in quadratic time. CPP as well as FRNNP play a vital role in computational biology, computational finance, share market analysis, weather prediction, entomology, electro cardiograph, N-body simulations, molecular simulations, etc. As a result, any improvements made in solving CPP and FRNNP will have immediate implications for the solution of numerous problems in these domains. We live in an era of big data and processing these data take large amounts of time. Speeding up data processing algorithms is thus much more essential now than ever before. In this paper we present algorithms for CPP and FRNNP that improve (in theory and/or practice) the best-known algorithms reported in the literature for CPP and FRNNP. These algorithms also improve the best-known algorithms for related applications including time series motif mining and the two locus problem in Genome Wide Association Studies (GWAS)

arXiv.org e-Print Archive

CiteSeerX

Delay Parameter Selection in Permutation Entropy Using Topological Data Analysis

Author: Khasawneh Firas A.
Myers Audun D.
Publication venue
Publication date: 10/05/2019
Field of study

Permutation Entropy (PE) is a powerful tool for quantifying the predictability of a sequence which includes measuring the regularity of a time series. Despite its successful application in a variety of scientific domains, PE requires a judicious choice of the delay parameter

\tau

. While another parameter of interest in PE is the motif dimension

n

, Typically

n

is selected between

4

and

8

with

5

6

giving optimal results for the majority of systems. Therefore, in this work we focus solely on choosing the delay parameter. Selecting

\tau

is often accomplished using trial and error guided by the expertise of domain scientists. However, in this paper, we show that persistent homology, the flag ship tool from Topological Data Analysis (TDA) toolset, provides an approach for the automatic selection of

\tau

. We evaluate the successful identification of a suitable

\tau

from our TDA-based approach by comparing our results to a variety of examples in published literature

arXiv.org e-Print Archive

Probabilistic Approach to Structural Change Prediction in Evolving Social Networks

Author: Budka Marcin
Gonczarek A.
Juszczyszyn Krzysztof
Musial Katarzyna
Tomczak J.
Publication venue
Publication date: 01/01/2012
Field of study

We propose a predictive model of structural changes in elementary subgraphs of social network based on Mixture of Markov Chains. The model is trained and verified on a dataset from a large corporate social network analyzed in short, one day-long time windows, and reveals distinctive patterns of evolution of connections on the level of local network topology. We argue that the network investigated in such short timescales is highly dynamic and therefore immune to classic methods of link prediction and structural analysis, and show that in the case of complex networks, the dynamic subgraph mining may lead to better prediction accuracy. The experiments were carried out on the logs from the Wroclaw University of Technology mail server

Bournemouth University Research Online