
4,099 research outputs found

The Parallelism Motifs of Genomic Data Analysis

Author: Awan Muaaz
Azad Ariful
Brock Benjamin
Buluc Aydin
Egan Rob
Ekanayake Saliya
Ellis Marquita
Georganas Evangelos
Guidi Giulia
Hofmeyr Steven
Oliker Leonid
Selvitopi Oguz
Teodoropol Cristina
Yelick Katherine
Publication venue: The Royal Society
Publication date: 20/01/2020

Genomic data sets are growing dramatically as the cost of sequencing continues to decline and small sequencing devices become available. Enormous community databases store and share this data with the research community, but some of these genomic data analysis problems require large-scale computational platforms to meet both the memory and computational requirements. These applications differ from the scientific simulations that dominate the workload on high-end parallel systems today and place different requirements on programming support, software libraries, and parallel architectural design. For example, they involve irregular communication patterns such as asynchronous updates to shared data structures. We consider several problems in high-performance genomics analysis, including alignment, profiling, clustering, and assembly for both single genomes and metagenomes. We identify some of the common computational patterns or motifs that help inform parallelization strategies and compare our motifs to some of the established lists, arguing that at least two key patterns, sorting and hashing, are missing.
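As a rough illustration of the two motifs the abstract names, sorting and hashing, the sketch below counts k-mers (length-k substrings of sequencing reads) in both ways. The choice of k-mer counting as the example task, the function names, and the toy reads are assumptions made here for illustration; none of them come from the paper, and a real parallel implementation would distribute the hash table or the sort across nodes.

```python
from collections import Counter
from itertools import groupby


def count_kmers_hash(reads, k):
    """Hashing motif: tally each k-mer in a hash table (Counter)."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return dict(counts)


def count_kmers_sort(reads, k):
    """Sorting motif: sort all k-mers, then count runs of equal neighbours."""
    kmers = sorted(
        read[i:i + k] for read in reads for i in range(len(read) - k + 1)
    )
    return {kmer: sum(1 for _ in group) for kmer, group in groupby(kmers)}


# Hypothetical toy input, not real sequencing data.
example_reads = ["ACGTAC", "GTACG"]
```

Both functions return the same counts; they differ only in the access pattern, which is the property that matters when choosing a parallelization strategy.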

arXiv.org e-Print Archive

eScholarship - University of California

Recombination between heterologous human acrocentric chromosomes

Author: Buonaiuto Silvia
Gomes de Lima Leonardo
Guarracino Andrea
Marco Santiago
Potapova Tamara
Rhie Arang
Publication venue: Nature Research
Publication date: 01/01/2023

The short arms of the human acrocentric chromosomes 13, 14, 15, 21 and 22 (SAACs) share large homologous regions, including ribosomal DNA repeats and extended segmental duplications [1,2]. Although the resolution of these regions in the first complete assembly of a human genome, the Telomere-to-Telomere Consortium’s CHM13 assembly (T2T-CHM13), provided a model of their homology [3], it remained unclear whether these patterns were ancestral or maintained by ongoing recombination exchange. Here we show that acrocentric chromosomes contain pseudo-homologous regions (PHRs) indicative of recombination between non-homologous sequences. Utilizing an all-to-all comparison of the human pangenome from the Human Pangenome Reference Consortium [4] (HPRC), we find that contigs from all of the SAACs form a community. A variation graph [5] constructed from centromere-spanning acrocentric contigs indicates the presence of regions in which most contigs appear nearly identical between heterologous acrocentric chromosomes in T2T-CHM13. Except on chromosome 15, we observe faster decay of linkage disequilibrium in the pseudo-homologous regions than in the corresponding short and long arms, indicating higher rates of recombination [6,7]. The pseudo-homologous regions include sequences that have previously been shown to lie at the breakpoint of Robertsonian translocations [8], and their arrangement is compatible with crossover in inverted duplications on chromosomes 13, 14 and 21. The ubiquity of signals of recombination between heterologous acrocentric chromosomes seen in the HPRC draft pangenome suggests that these shared sequences form the basis for recurrent Robertsonian translocations, providing sequence and population-based confirmation of hypotheses first developed from cytogenetic studies 50 years ago [9].
Our work depends on the HPRC draft human pangenome resource established in the accompanying Article [4], and we thank the production and assembly groups for their efforts in establishing this resource.
This work used the computational resources of the UTHSC Octopus cluster and NIH HPC Biowulf cluster. We acknowledge support in maintaining these systems that was critical to our analyses. The authors thank M. Miller for the development of a graphical synopsis of our study (Fig. 5); and R. Williams and N. Soranzo for support and guidance in the design and discussion of our work. This work was supported, in part, by National Institutes of Health/NIDA U01DA047638 (E.G.), National Institutes of Health/NIGMS R01GM123489 (E.G.), NSF PPoSS Award no. 2118709 (E.G. and C.F.), the Tennessee Governor’s Chairs programme (C.F. and E.G.), National Institutes of Health/NCI R01CA266339 (T.P., L.G.d.L. and J.L.G.), and the Intramural Research Program of the National Human Genome Research Institute, National Institutes of Health (A.R., S.K. and A.M.P.). We acknowledge support from Human Technopole (A.G.), Consiglio Nazionale delle Ricerche, Italy (S.B. and V.C.), and Stowers Institute for Medical Research (T.P., L.G.d.L., B.R. and J.L.G.).
Peer reviewed.
Article signed by 13 authors: Andrea Guarracino, Silvia Buonaiuto, Leonardo Gomes de Lima, Tamara Potapova, Arang Rhie, Sergey Koren, Boris Rubinstein, Christian Fischer, Human Pangenome Reference Consortium, Jennifer L. Gerton, Adam M. Phillippy, Vincenza Colonna & Erik Garrison.
Human Pangenome Reference Consortium: Haley J. Abel, Lucinda L. Antonacci-Fulton, Mobin Asri, Gunjan Baid, Carl A. Baker, Anastasiya Belyaeva, Konstantinos Billis, Guillaume Bourque, Silvia Buonaiuto, Andrew Carroll, Mark J. P. Chaisson, Pi-Chuan Chang, Xian H. Chang, Haoyu Cheng, Justin Chu, Sarah Cody, Vincenza Colonna, Daniel E. Cook, Robert M. Cook-Deegan, Omar E. Cornejo, Mark Diekhans, Daniel Doerr, Peter Ebert, Jana Ebler, Evan E. Eichler, Jordan M. Eizenga, Susan Fairley, Olivier Fedrigo, Adam L. Felsenfeld, Xiaowen Feng, Christian Fischer, Paul Flicek, Giulio Formenti, Adam Frankish, Robert S. Fulton, Yan Gao, Shilpa Garg, Erik Garrison, Nanibaa’ A. Garrison, Carlos Garcia Giron, Richard E. Green, Cristian Groza, Andrea Guarracino, Leanne Haggerty, Ira Hall, William T. Harvey, Marina Haukness, David Haussler, Simon Heumos, Glenn Hickey, Kendra Hoekzema, Thibaut Hourlier, Kerstin Howe, Miten Jain, Erich D. Jarvis, Hanlee P. Ji, Eimear E. Kenny, Barbara A. Koenig, Alexey Kolesnikov, Jan O. Korbel, Jennifer Kordosky, Sergey Koren, HoJoon Lee, Alexandra P. Lewis, Heng Li, Wen-Wei Liao, Shuangjia Lu, Tsung-Yu Lu, Julian K. Lucas, Hugo Magalhães, Santiago Marco-Sola, Pierre Marijon, Charles Markello, Tobias Marschall, Fergal J. Martin, Ann McCartney, Jennifer McDaniel, Karen H. Miga, Matthew W. Mitchell, Jean Monlong, Jacquelyn Mountcastle, Katherine M. Munson, Moses Njagi Mwaniki, Maria Nattestad, Adam M. Novak, Sergey Nurk, Hugh E. Olsen, Nathan D. Olson, Benedict Paten, Trevor Pesout, Adam M. Phillippy, Alice B. Popejoy, David Porubsky, Pjotr Prins, Daniela Puiu, Mikko Rautiainen, Allison A. Regier, Arang Rhie, Samuel Sacco, Ashley D. Sanders, Valerie A. Schneider, Baergen I. Schultz, Kishwar Shafin, Jonas A. Sibbesen, Jouni Sirén, Michael W. Smith, Heidi J. Sofia, Ahmad N. Abou Tayoun, Françoise Thibaud-Nissen, Chad Tomlinson, Francesca Floriana Tricomi, Flavia Villani, Mitchell R. Vollger, Justin Wagner, Brian Walenz, Ting Wang, Jonathan M. D. Wood, Aleksey V. Zimin & Justin M. Zook.
Postprint (published version)

UPCommons. Portal del coneixement obert de la UPC

Parallel Paths Analysis Using Function Call Graphs

Author: Naeimian Arman
Publication venue: University of Waterloo
Publication date: 18/09/2019

Call graphs have been used widely in different software engineering areas. Since call graphs provide us with detailed information about the structure of software elements and components and how they are connected with each other, they could be used in detecting specific structures and patterns in the code such as malware, code clones, unreachable code, and many other software symptoms that could be searched by their structural features. In this work, we have analyzed parallel paths in function call graphs in three Java open-source projects. Parallel paths emerge when there is more than one path between two nodes in the call graph. We investigated why such paths are created and what they are used for, and also the problems that result in removing them. Moreover, we have used the results of our analyses to find instances of parallel paths in the projects that we analyzed and suggest some changes to developers based on that. Based on our results, we found three categories of problems associated with parallel paths and four categories of usages of them.
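The definition of a parallel path given in the abstract (more than one path between two nodes of the call graph) can be sketched as follows. This is a minimal illustration, not the thesis's tooling: the adjacency-dictionary representation, the function name `parallel_path_pairs`, and the toy graph are assumptions, and the sketch assumes an acyclic call graph (no recursion), since path counting by memoized traversal does not terminate on cycles.

```python
from functools import lru_cache


def parallel_path_pairs(call_graph):
    """Return {(caller, callee): path_count} for pairs joined by 2+ paths.

    call_graph maps each function name to the list of functions it calls.
    Assumption: the graph is acyclic (no direct or indirect recursion).
    """
    nodes = set(call_graph)
    for callees in call_graph.values():
        nodes.update(callees)

    @lru_cache(maxsize=None)
    def count_paths(src, dst):
        # One path (the empty one) from a node to itself; otherwise sum
        # the path counts over every function that src calls directly.
        if src == dst:
            return 1
        return sum(count_paths(nxt, dst) for nxt in call_graph.get(src, ()))

    pairs = {}
    for src in nodes:
        for dst in nodes:
            if src != dst:
                n = count_paths(src, dst)
                if n > 1:
                    pairs[(src, dst)] = n
    return pairs


# Hypothetical call graph: main calls a and b, and both call helper.
example_graph = {
    "main": ["a", "b"],
    "a": ["helper"],
    "b": ["helper"],
    "helper": [],
}
```

On the toy graph, the only pair reported is `("main", "helper")`, which is reachable both through `a` and through `b`.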

University of Waterloo's Institutional Repository

Climate Dynamics: A Network-Based Approach for the Analysis of Global Precipitation

Precipitation is one of the most important meteorological variables for defining the climate dynamics, but the spatial patterns of precipitation have not been fully investigated yet. Complex network theory, which provides a robust tool to investigate the statistical interdependence of many interacting elements, is used here to analyze the spatial dynamics of annual precipitation over seventy years (1941-2010). The precipitation network is built by associating a node with a geographical region, which has a temporal distribution of precipitation, and identifying possible links among nodes through the correlation function. The precipitation network reveals significant spatial variability with barely connected regions, such as Eastern China and Japan, and highly connected regions, such as the African Sahel, Eastern Australia and, to a lesser extent, Northern Europe. Sahel and Eastern Australia are remarkably dry regions, where low amounts of rainfall are uniformly distributed on continental scales and small-scale extreme events are rare. As a consequence, the precipitation gradient is low, making these regions well connected on a large spatial scale. On the contrary, the Asiatic South-East is often reached by extreme events such as monsoons, tropical cyclones and heat waves, which can all contribute to reducing the correlation to the short-range scale only. Some patterns emerging between mid-latitude and tropical regions suggest a possible impact of the propagation of planetary waves on precipitation at a global scale. Other links can be qualitatively associated with the atmospheric and oceanic circulation. To analyze the sensitivity of the network to the physical closeness of the nodes, short-term connections are broken. The African Sahel, Eastern Australia and Northern Europe regions again appear as the supernodes of the network, further confirming their long-range connection structure. Almost all North-American and Asian nodes vanish, revealing that extreme events can enhance high precipitation gradients, leading to a systematic absence of long-range patterns.
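The network construction described in the abstract (one node per region, links identified through correlation of the regions' precipitation series) might look roughly like the sketch below. The linking rule used here, an absolute Pearson correlation at or above a fixed threshold, is an assumption, as are the function names and the toy series; the abstract does not state the paper's actual criterion.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def build_precipitation_network(series, threshold):
    """Link two regions when |correlation| >= threshold (assumed rule)."""
    edges = set()
    for u, v in combinations(sorted(series), 2):
        if abs(pearson(series[u], series[v])) >= threshold:
            edges.add((u, v))
    return edges


def node_degrees(series, edges):
    """Number of links per region; highly connected nodes stand out."""
    degree = {region: 0 for region in series}
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    return degree


# Hypothetical annual precipitation series for three regions.
example_series = {
    "A": [1.0, 2.0, 3.0, 4.0],
    "B": [2.0, 4.0, 6.0, 8.0],
    "C": [1.0, -1.0, 1.0, -1.0],
}
example_edges = build_precipitation_network(example_series, threshold=0.9)
```

In this toy case only regions A and B are linked, so C ends up with degree zero, the analogue of a barely connected region.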

arXiv.org e-Print Archive

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

PORTO@iris (Publications Open Repository TOrino - Politecnico di Torino)

PORTO Publications Open Repository TOrino

FigShare

REPdenovo: Inferring De Novo Repeat Motifs from Short Sequence Reads.

Author: Chu Chong
Nielsen Rasmus
Wu Yufeng
Publication venue: eScholarship, University of California
Publication date: 01/01/2016

Repeat elements are important components of eukaryotic genomes. One limitation in our understanding of repeat elements is that most analyses rely on reference genomes that are incomplete and often contain missing data in highly repetitive regions that are difficult to assemble. To overcome this problem we develop a new method, REPdenovo, which assembles repeat sequences directly from raw shotgun sequencing data. REPdenovo can construct various types of repeats that are highly repetitive and have low sequence divergence within copies. We show that REPdenovo is substantially better than existing methods both in terms of the number and the completeness of the repeat sequences that it recovers. The key advantage of REPdenovo is that it can reconstruct long repeats from sequence reads. We apply the method to human data and discover a number of potentially new repeat sequences that have been missed by previous repeat annotations. Many of these sequences are incorporated into various parasite genomes, possibly because the filtering process for host DNA involved in the sequencing of the parasite genomes failed to exclude the host-derived repeat sequences. REPdenovo is a powerful new computational tool for annotating genomes and for addressing questions regarding the evolution of repeat families. The software tool, REPdenovo, is available for download at https://github.com/Reedwarbler/REPdenovo

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

Communities in Networks

Author: Peter J. Mucha
Jukka-Pekka Onnela
Mason A. Porter
Publication date: 01/01/2009

We survey some of the concepts, methods, and applications of community detection, which has become an increasingly important area of network science. To help ease newcomers into the field, we provide a guide to available methodology and open problems, and discuss why scientists from diverse backgrounds are interested in these problems. As a running theme, we emphasize the connections of community detection to problems in statistical physics and computational optimization.
Comment: survey/review article on community structure in networks; published version is available at http://people.maths.ox.ac.uk/~porterm/papers/comnotices.pd

arXiv.org e-Print Archive

CiteSeerX

Oxford University Research Archive

Distances and Isomorphism between Networks and the Stability of Network Invariants

Author: Chowdhury Samir
Mémoli Facundo
Publication date: 10/04/2018

We develop the theoretical foundations of a network distance that has recently been applied to various subfields of topological data analysis, namely persistent homology and hierarchical clustering. While this network distance has previously appeared in the context of finite networks, we extend the setting to that of compact networks. The main challenge in this new setting is the lack of an easy notion of sampling from compact networks; we solve this problem in the process of obtaining our results. The generality of our setting means that we automatically establish results for exotic objects such as directed metric spaces and Finsler manifolds. We identify readily computable network invariants and establish their quantitative stability under this network distance. We also discuss the computational complexity involved in precisely computing this distance, and develop easily computable lower bounds by using the identified invariants. By constructing a wide range of explicit examples, we show that these lower bounds are effective in distinguishing between networks. Finally, we provide a simple algorithm that computes a lower bound on the distance between two networks in polynomial time and illustrate our metric and invariant constructions on a database of random networks and a database of simulated hippocampal networks.

arXiv.org e-Print Archive

Dagstuhl Reports : Volume 1, Issue 2, February 2011

Author: Schloss Dagstuhl Leibniz-Zentrum für Informatik
Publication date: 09/09/2011

Online Privacy: Towards Informational Self-Determination on the Internet (Dagstuhl Perspectives Workshop 11061): Simone Fischer-Hübner, Chris Hoofnagle, Kai Rannenberg, Michael Waidner, Ioannis Krontiris and Michael Marhöfer
Self-Repairing Programs (Dagstuhl Seminar 11062): Mauro Pezzé, Martin C. Rinard, Westley Weimer and Andreas Zeller
Theory and Applications of Graph Searching Problems (Dagstuhl Seminar 11071): Fedor V. Fomin, Pierre Fraigniaud, Stephan Kreutzer and Dimitrios M. Thilikos
Combinatorial and Algorithmic Aspects of Sequence Processing (Dagstuhl Seminar 11081): Maxime Crochemore, Lila Kari, Mehryar Mohri and Dirk Nowotka
Packing and Scheduling Algorithms for Information and Communication Services (Dagstuhl Seminar 11091): Klaus Jansen, Claire Mathieu, Hadas Shachnai and Neal E. Youn

Hochschulschriftenserver - Universität Frankfurt am Main