66 research outputs found

    Fractal MapReduce decomposition of sequence alignment

    This work was supported in part by the Center for Clinical and Translational Sciences of the University of Alabama at Birmingham under contract no. 5UL1 RR025777-03 from the NIH National Center for Research Resources, by National Cancer Institute grant 1U24CA143883-01, and by the European Union FP7 PNEUMOPATH project (HEALTH-F3-2009-222983).
    Background: The dramatic fall in the cost of genomic sequencing, together with the increasing convenience of distributed cloud computing resources, positions the MapReduce coding pattern as a cornerstone of scalable bioinformatics algorithm development. In some cases an algorithm finds a natural distribution: map functions process vectorized components, and a reduce step aggregates the intermediate results. For other data analysis procedures, however, such as sequence analysis, a more fundamental reformulation may be required.
    Results: In this report we describe a solution to sequence comparison that can be thoroughly decomposed into multiple rounds of map and reduce operations. The route taken uses iterated maps, a fractal analysis technique that has been found to provide an "alignment-free" solution to sequence analysis and comparison, that is, one that does not require dynamic programming and relies instead on a numeric Chaos Game Representation (CGR) data structure. We demonstrate this by calculating the length of the longest similar segment from the USM coordinates of two analogous units alone, with no resort to dynamic programming.
    Conclusions: The procedure described is an attempt at extreme decomposition and parallelization of sequence alignment, in anticipation of a volume of genomic sequence data that current algorithmic frameworks cannot meet. The solution is delivered as a browser-based application (webApp), highlighting the browser's emergence as an environment for high-performance distributed computing.
    Availability: The accompanying software library is publicly distributed, with open source and version control, at http://usm.github.com. It is also available as a webApp through Google Chrome's WebStore (http://chrome.google.com/webstore; search for "usm").
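    The core construction is small enough to sketch. The following is a minimal, self-contained Java illustration of the idea above, not the authors' USM library: it builds a standard Chaos Game Representation of DNA (A, C, G, T at the corners of the unit square, each symbol moving the current point halfway toward its corner) and estimates a shared suffix length from coordinates alone, using the property that k shared trailing symbols leave the two CGR points within about 2^-k of each other per axis. The corner assignment and all names are illustrative choices, not taken from the paper.

```java
/**
 * Minimal sketch, not the authors' USM library: a standard Chaos Game
 * Representation (CGR) of DNA with A, C, G, T at the corners of the unit
 * square, where each symbol moves the current point halfway toward its
 * corner.  If two sequences end in the same k symbols, their CGR points
 * differ by at most 2^-k per axis, so the shared suffix length can be
 * read off the coordinates without dynamic programming.  The corner
 * assignment is one common convention, not necessarily the paper's.
 */
public class CgrSketch {

    // One common corner convention for the four nucleotides.
    static double[] corner(char base) {
        switch (base) {
            case 'A': return new double[]{0, 0};
            case 'C': return new double[]{0, 1};
            case 'G': return new double[]{1, 1};
            case 'T': return new double[]{1, 0};
            default:  throw new IllegalArgumentException("bad base: " + base);
        }
    }

    // CGR coordinate after reading the whole sequence s.
    static double[] cgr(String s) {
        double[] p = {0.5, 0.5};                  // start at the centre of the square
        for (int i = 0; i < s.length(); i++) {
            double[] c = corner(s.charAt(i));
            p[0] = (p[0] + c[0]) / 2;             // move halfway toward the corner
            p[1] = (p[1] + c[1]) / 2;
        }
        return p;
    }

    // Estimate of the suffix length shared by a and b, from coordinates only.
    static int sharedSuffixEstimate(String a, String b) {
        double[] pa = cgr(a), pb = cgr(b);
        double d = Math.max(Math.abs(pa[0] - pb[0]), Math.abs(pa[1] - pb[1]));
        if (d == 0) return Math.min(a.length(), b.length());
        return (int) Math.floor(-Math.log(d) / Math.log(2));
    }

    public static void main(String[] args) {
        // Both strings end in "GATTACA" (7 symbols); the estimate prints 7.
        System.out.println(sharedSuffixEstimate("CCCGATTACA", "TTTGATTACA"));
    }
}
```

    In a MapReduce setting, map tasks would emit such coordinates per sequence position and reduce tasks would aggregate the pairwise comparisons; the sketch keeps everything in one process for clarity.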

    QMachine: commodity supercomputing in web browsers


    A system’s approach to cache hierarchy-aware decomposition of data-parallel computations

    Dissertation submitted to obtain the Master's degree in Informatics Engineering (Engenharia Informática).
    The architecture of today's processors is very complex, comprising several computational cores and an intricate hierarchy of cache memories. The latter, in particular, differ considerably between the many processors currently available on the market, resulting in a wide variety of configurations. Application development is typically oblivious to this complexity and diversity, taking into consideration only the number of available execution cores, which prevents such applications from fully harnessing the computing power available in these architectures. The community has recognized this problem and proposed languages and models to express and tune applications according to the underlying machine's hierarchy. These, however, lack the desired abstraction level, forcing the programmer to have deep knowledge of computer architecture and parallel programming in order to ensure performance portability across a wide range of architectures. Given these limitations, the goal of this thesis is to delegate these hierarchy-aware optimizations to the runtime system. Accordingly, the programmer's responsibilities are confined to the definition of procedures for decomposing an application's domain into an arbitrary number of partitions; the programmer only has to reason about the application's data representation and manipulation. We prototyped our proposal on top of a Java parallel programming framework and evaluated it, from a performance perspective, against cache-neglectful domain decompositions. The results demonstrate that our optimizations deliver significant speedups over decomposition strategies based solely on the number of execution cores, without requiring the programmer to reason about the machine's hardware. These facts allow us to conclude that performance gains can be obtained by transferring hierarchy-aware optimization concerns to the runtime system.
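    As a rough illustration of this division of responsibilities, the sketch below is a conceptual approximation only, not the thesis' actual framework or API: the programmer supplies a decomposition procedure for the domain, while a stand-in "runtime" chooses the number of partitions from an assumed per-core cache budget rather than from the core count alone. The L2_BYTES_PER_CORE constant and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/**
 * Conceptual sketch only; the thesis' framework and its API are not shown here.
 * The programmer supplies a decomposition procedure, and a stand-in "runtime"
 * decides how many partitions to create from an assumed per-core cache budget
 * (L2_BYTES_PER_CORE is hypothetical) instead of using one partition per core.
 */
public class HierarchyAwareDecomposition {

    /** Programmer-supplied: how to cut a double[] domain into n partitions. */
    interface Decomposer {
        List<double[]> decompose(double[] domain, int partitions);
    }

    // Assumed per-task cache budget; a real runtime would query the machine.
    static final long L2_BYTES_PER_CORE = 256 * 1024;

    // "Runtime" decision: partition count driven by cache size, not core count alone.
    static int choosePartitionCount(double[] domain) {
        long bytes = (long) domain.length * Double.BYTES;
        int cores = Runtime.getRuntime().availableProcessors();
        int byCache = (int) Math.max(1, bytes / L2_BYTES_PER_CORE);
        return Math.max(cores, byCache);
    }

    public static void main(String[] args) throws Exception {
        double[] domain = new double[4_000_000];
        java.util.Arrays.fill(domain, 1.0);

        // The programmer only reasons about the data representation: an even split.
        Decomposer evenSplit = (d, n) -> {
            List<double[]> parts = new ArrayList<>();
            int chunk = (d.length + n - 1) / n;
            for (int i = 0; i < d.length; i += chunk)
                parts.add(java.util.Arrays.copyOfRange(d, i, Math.min(d.length, i + chunk)));
            return parts;
        };

        int n = choosePartitionCount(domain);
        ExecutorService pool = Executors.newWorkStealingPool();
        List<Future<Double>> partialSums = new ArrayList<>();
        for (double[] part : evenSplit.decompose(domain, n)) {
            partialSums.add(pool.submit(() -> {
                double s = 0;
                for (double v : part) s += v;   // per-partition work, sized to cache
                return s;
            }));
        }
        double total = 0;
        for (Future<Double> f : partialSums) total += f.get();
        pool.shutdown();
        System.out.printf("partitions=%d, sum=%.1f%n", n, total);
    }
}
```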

    The Role of Distributed Computing in Big Data Science: Case Studies in Forensics and Bioinformatics

    2014 - 2015
    The era of Big Data is driving the generation of large amounts of data, which require storage and analysis capabilities that can only be addressed by distributed computing systems. To facilitate large-scale distributed computing, many programming paradigms and frameworks have been proposed, such as MapReduce and Apache Hadoop, which transparently address some issues of distributed systems and hide most of their technical details. Hadoop is currently the most popular and mature framework supporting the MapReduce paradigm, and it is widely used to store and process Big Data on a cluster of computers. Solutions such as Hadoop are attractive because they simplify the transformation of an application from non-parallel to distributed by means of general utilities and without requiring specialist skills. However, without any algorithm engineering activity, some target applications are not altogether fast and efficient, and they can suffer from several problems and drawbacks when executed on a distributed system. In fact, a distributed implementation is a necessary but not sufficient condition for obtaining remarkable performance with respect to a non-parallel counterpart. It is therefore necessary to assess how distributed solutions run on a Hadoop cluster, and how their performance can be improved to reduce resource consumption and completion times.
    In this dissertation, we show how Hadoop-based implementations can be enhanced by careful algorithm engineering, tuning, profiling and code improvements. We also analyze how to achieve these goals by working on some critical points, such as data-local computation, input split size, number and granularity of tasks, cluster configuration, and input/output representation. In particular, to address these issues, we choose case studies from two research areas where the amount of data is rapidly increasing, namely Digital Image Forensics and Bioinformatics. We describe full-fledged implementations to show how to design, engineer, improve and evaluate Hadoop-based solutions for the Source Camera Identification problem, i.e., recognizing the camera used to take a given digital image, adopting the algorithm by Fridrich et al., and for two of the main problems in Bioinformatics, i.e., alignment-free sequence comparison and extraction of k-mer cumulative or local statistics.
    The results achieved by our improved implementations show that they are substantially faster than the non-parallel counterparts, and remarkably faster than the corresponding naive Hadoop-based implementations. In some cases, for example, our solution for k-mer statistics is approximately 30× faster than our naive Hadoop-based implementation, and about 40× faster than an analogous tool built on Hadoop. In addition, our applications are scalable, i.e., execution times are (approximately) halved by doubling the computing units. Indeed, algorithm engineering based on smart improvements, supported by careful profiling and tuning, can lead to much better experimental performance while avoiding potential problems. We also highlight how the proposed solutions, tips, tricks and insights can be used in other research areas and problems.
    Although Hadoop simplifies some tasks of distributed environments, it must be thoroughly understood to achieve remarkable performance. Being an expert of the application domain is not enough to build good Hadoop-based implementations; achieving good performance also requires expertise in distributed systems, algorithm engineering, tuning, profiling, and so on. Therefore, the best performance depends heavily on the degree of cooperation between the domain expert and the distributed algorithm engineer. [edited by Author]
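    One of the bioinformatics case studies mentioned above, k-mer statistics, follows a pattern that is easy to sketch without Hadoop itself. The framework-free Java sketch below (not the dissertation's implementation) mimics the map and reduce phases with plain collections and illustrates in-mapper combining, i.e., aggregating counts locally inside each input split before emitting them, one of the algorithm engineering choices that typically separates a tuned job from a naive one because it shrinks intermediate key/value traffic.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/**
 * Minimal, framework-free sketch of the k-mer counting pattern (not the
 * dissertation's Hadoop code): a "map" phase emits one partial count table
 * per input split, with local (in-mapper) combining, and a "reduce" phase
 * merges the partial tables into global k-mer statistics.
 */
public class KmerCountSketch {

    // Map phase with in-mapper combining: one partial count table per split.
    static Map<String, Long> mapSplit(String sequence, int k) {
        Map<String, Long> partial = new HashMap<>();
        for (int i = 0; i + k <= sequence.length(); i++)
            partial.merge(sequence.substring(i, i + k), 1L, Long::sum);
        return partial;
    }

    // Reduce phase: merge the partial tables produced by all splits.
    static Map<String, Long> reduce(List<Map<String, Long>> partials) {
        Map<String, Long> total = new HashMap<>();
        for (Map<String, Long> p : partials)
            p.forEach((kmer, count) -> total.merge(kmer, count, Long::sum));
        return total;
    }

    public static void main(String[] args) {
        // Two "input splits" of a sequence, counted independently and then merged.
        List<Map<String, Long>> partials = List.of(
                mapSplit("ACGTACGTAC", 3),
                mapSplit("GTACGTTTTA", 3));
        reduce(partials).forEach((kmer, n) -> System.out.println(kmer + "\t" + n));
    }
}
```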

    Intertextuality in scientific publications (L'intertextualité dans les publications scientifiques)

    The IEEE bibliographic database contains a number of confirmed duplications, with indication of the copied originals. This corpus is used to test an authorship attribution method. Combining intertextual distance with a sliding window and various classification techniques makes it possible to identify these duplications with a very low risk of error. The experiment also shows that several factors blur the identity of the scientific author, notably research collectives of varying composition and a strong dose of intertextuality that is accepted or even sought after.
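    For readers unfamiliar with the measure, the sketch below shows one common formulation of intertextual distance (in the spirit of Labbé's measure): the longer text's word frequencies are rescaled to the shorter text's length, absolute differences are summed over the joint vocabulary, and the result is normalized to lie between 0 and 1. This is an illustrative approximation, not the article's exact procedure, and the example texts are invented; near-copies yield distances close to 0, which is how duplications stand out when the measure is applied over a sliding window.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/**
 * Illustrative sketch, not the article's exact procedure: a common
 * formulation of intertextual distance (in the spirit of Labbé's measure).
 * The longer text's word frequencies are rescaled to the shorter text's
 * length, absolute differences are summed over the joint vocabulary, and
 * the result is normalized to [0, 1].  Near-duplicates score close to 0.
 * The example texts below are invented.
 */
public class IntertextualDistanceSketch {

    static Map<String, Long> counts(String text) {
        Map<String, Long> c = new HashMap<>();
        for (String w : text.toLowerCase().split("\\W+"))
            if (!w.isEmpty()) c.merge(w, 1L, Long::sum);
        return c;
    }

    static long length(Map<String, Long> f) {
        return f.values().stream().mapToLong(Long::longValue).sum();
    }

    /** Distance in [0, 1] between the word-frequency profiles of x and y. */
    static double distance(String x, String y) {
        Map<String, Long> fx = counts(x), fy = counts(y);
        long nx = length(fx), ny = length(fy);
        Map<String, Long> fa = nx <= ny ? fx : fy;   // shorter text
        Map<String, Long> fb = nx <= ny ? fy : fx;   // longer text
        long na = Math.min(nx, ny), nb = Math.max(nx, ny);
        double scale = (double) na / nb;             // rescale the longer text
        Set<String> vocab = new HashSet<>(fa.keySet());
        vocab.addAll(fb.keySet());
        double sum = 0;
        for (String w : vocab)
            sum += Math.abs(fa.getOrDefault(w, 0L) - scale * fb.getOrDefault(w, 0L));
        return sum / (2.0 * na);
    }

    public static void main(String[] args) {
        String original = "we propose a distributed method for sequence analysis";
        String copy     = "we propose a distributed method for sequence analysis in browsers";
        String other    = "cache hierarchies differ considerably between processors";
        // The near-duplicate scores far lower than the unrelated text.
        System.out.printf("copy: %.2f  unrelated: %.2f%n",
                distance(original, copy), distance(original, other));
    }
}
```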

    Book of Abstracts of the Sixth SIAM Workshop on Combinatorial Scientific Computing

    Book of Abstracts of CSC14, edited by Bora Uçar. The Sixth SIAM Workshop on Combinatorial Scientific Computing, CSC14, was organized at the École Normale Supérieure de Lyon, France, on 21-23 July 2014. This two-and-a-half-day event marked the sixth in a series that started ten years earlier in San Francisco, USA. The focus of CSC14 was combinatorial mathematics and algorithms in high-performance computing, broadly interpreted. The workshop featured three invited talks, 27 contributed talks and eight poster presentations. The invited talks focused on two fields of research in particular: randomized algorithms for numerical linear algebra, and network analysis. The contributed talks and posters targeted modeling, analysis, bisection, clustering, and partitioning of graphs, applied in the context of networks, sparse matrix factorizations, iterative solvers, fast multipole methods, automatic differentiation, high-performance computing, and linear programming. The workshop was held at the premises of the LIP laboratory of ENS Lyon and was generously supported by the LABEX MILYON (ANR-10-LABX-0070, Université de Lyon, within the program "Investissements d'Avenir", ANR-11-IDEX-0007, operated by the French National Research Agency) and by SIAM.

    Big Data Computing for Geospatial Applications

    The convergence of big data and geospatial computing has brought forth challenges and opportunities for Geographic Information Science with regard to geospatial data management, processing, analysis, modeling, and visualization. This book highlights recent advancements in integrating new computing approaches, spatial methods, and data management strategies to tackle geospatial big data challenges, while also demonstrating opportunities for using big data in geospatial applications. Crucial to the advancements highlighted here is the integration of computational thinking and spatial thinking, and the transformation of abstract ideas and models into concrete data structures and algorithms.