28 research outputs found

    BLAMM : BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs

    Background: The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources, for which a number of efficient yet complex algorithms have been proposed. Results: We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for the 1404 PWMs in the JASPAR database in 13 min with a p-value of 10^-4 using a 36-core machine. On a dual-GPU system, the same task can be performed in under 5 min. Conclusions: BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm
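    The core trick — expressing sliding-window PWM scoring as a matrix product that a BLAS GEMM can evaluate — can be sketched in a few lines. The following is an illustrative NumPy sketch of the idea only, not the BLAMM implementation (which batches many PWMs and streams the sequence):

```python
import numpy as np

def one_hot(seq):
    """Encode a DNA string as a 4 x n indicator matrix (rows: A, C, G, T)."""
    idx = {"A": 0, "C": 1, "G": 2, "T": 3}
    m = np.zeros((4, len(seq)))
    for j, base in enumerate(seq):
        m[idx[base], j] = 1.0
    return m

def pwm_scores(pwm, seq):
    """Score every length-m window of seq against a 4 x m PWM.

    Each window score is the entrywise inner product of the PWM with the
    one-hot window; stacking windows as columns turns the whole scan into
    a single dense product of the kind BLAS evaluates efficiently.
    """
    S = one_hot(seq)
    m = pwm.shape[1]
    n = S.shape[1] - m + 1
    # Stack flattened one-hot windows as columns of a (4*m) x n matrix ...
    W = np.column_stack([S[:, i:i + m].ravel() for i in range(n)])
    # ... so one matrix-vector product (matrix-matrix when scoring many
    # PWMs at once) yields all window scores.
    return pwm.ravel() @ W
```

    Scoring a batch of PWMs then becomes a single matrix-matrix multiplication, which is exactly the workload shape optimized BLAS libraries are built for.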

    Main findings and advances in bioinformatics and biomedical engineering: IWBBIO 2018

    We wish to thank the reviewers of each of the papers for their excellent work, together with the editorial team of BMC Bioinformatics for the great interest shown in the IWBBIO Conference. Special thanks to D. Omar El Bakry for his interest and great help in making this Special Issue possible. We also thank the Ministry of Spain for the economic resources provided within the project with reference RTI2018-101674-B-I00. In the current supplement, we are proud to present seventeen relevant contributions from the 6th International Work-Conference on Bioinformatics and Biomedical Engineering (IWBBIO 2018), which was held during April 25-27, 2018 in Granada (Spain). These contributions were chosen for their quality and the importance of their findings. This research has been partially supported by the projects with reference RTI2018-101674-B-I00 (Ministry of Spain) and B-TIC-414-UGR18 (FEDER, Junta de Andalucía and UGR)

    Benchmarking a New Paradigm: An Experimental Analysis of a Real Processing-in-Memory Architecture

    Many modern workloads, such as neural networks, databases, and graph processing, are fundamentally memory-bound. For such workloads, the data movement between main memory and CPU cores imposes a significant overhead in terms of both latency and energy. A major reason is that this communication happens through a narrow bus with high latency and limited bandwidth, and the low data reuse in memory-bound workloads is insufficient to amortize the cost of main memory access. Fundamentally addressing this data movement bottleneck requires a paradigm where the memory system assumes an active role in computing by integrating processing capabilities. This paradigm is known as processing-in-memory (PIM). Recent research explores different forms of PIM architectures, motivated by the emergence of new 3D-stacked memory technologies that integrate memory with a logic layer where processing elements can be easily placed. Past works evaluate these architectures in simulation or, at best, with simplified hardware prototypes. In contrast, the UPMEM company has designed and manufactured the first publicly-available real-world PIM architecture. This paper provides the first comprehensive analysis of this architecture. We make two key contributions. First, we conduct an experimental characterization of the UPMEM-based PIM system using microbenchmarks to assess various architecture limits such as compute throughput and memory bandwidth, yielding new insights. Second, we present PrIM, a benchmark suite of 16 workloads from different application domains (e.g., linear algebra, databases, graph processing, neural networks, bioinformatics). Our open-source software is available at https://github.com/CMU-SAFARI/prim-benchmark
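    The claim that low data reuse cannot amortize main-memory cost can be made concrete with a roofline-style estimate. The peak and bandwidth figures below are illustrative assumptions only, not measured UPMEM numbers:

```python
def attainable_gflops(intensity, peak_gflops, bandwidth_gbs):
    """Roofline model: attainable performance is capped by either the
    compute peak or the memory bandwidth times the kernel's arithmetic
    intensity (FLOPs performed per byte moved)."""
    return min(peak_gflops, bandwidth_gbs * intensity)

# Illustrative numbers only (not UPMEM specifications):
PEAK = 1000.0  # GFLOP/s compute peak
BW = 25.0      # GB/s DRAM bandwidth over a narrow bus

# Streaming vector add: 1 FLOP per 12 bytes (2 loads + 1 store of float32),
# so performance is bandwidth-limited far below the compute peak.
stream = attainable_gflops(1 / 12, PEAK, BW)

# Dense GEMM with heavy data reuse (~64 FLOPs per byte) hits the compute peak.
gemm = attainable_gflops(64.0, PEAK, BW)
```

    Under these assumed numbers the streaming kernel reaches only about 2 GFLOP/s of a 1000 GFLOP/s peak, which is the data-movement bottleneck PIM architectures attack by computing next to the memory arrays.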

    CSP for Executable Scientific Workflows


    Predicting the past: Mathematical models and numerical methods in molecular phylogenetics

    Molecular phylogenetics is the study of phylogenies and processes of evolution by the analysis of DNA or amino acid sequence data. In this thesis we describe a computationally efficient Bayesian methodology for inferring species trees and demographics from unlinked binary markers. The new diffusion approach, coupled with state-of-the-art numerical algorithms, allows for analyses of datasets containing hundreds or even thousands of individuals. We demonstrate the scale of analyses possible using SNP data sampled from 399 freshwater turtles in 41 populations. The method, which we call Snapper, is the successor of the coalescent-based method Snapp. A reanalysis of soybean SNP data demonstrates that the two methods are hard to distinguish in practice. We also describe a Bayesian methodology for inferring niches of present and ancestral species of plants from environmental measurements and estimated phylogenies. Fitting the phylogenetic niche model to three conifer species endemic to New Zealand confirms that viable ancestral niches can be inferred. Lastly, in anticipation of even larger genomic datasets, we look into graphics processing units as computational tools for efficient model fitting. We introduce a new GPU-based algorithm designed to fit long-chain hidden Markov models, applying this approach to a hidden Markov model for nonvolcanic tremor events. Our implementation resulted in a 1000-fold increase in speed over the standard single-processor algorithm, allowing for a full Bayesian inference of model parameters
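    The long-chain HMM fitting mentioned above builds on the standard forward recursion, whose per-step matrix operation is what maps naturally onto GPU parallelism. A minimal log-space NumPy sketch of that recursion (an illustration, not the thesis implementation):

```python
import numpy as np

def forward_loglik(log_pi, log_A, log_B):
    """Log-likelihood of an observation sequence under an HMM via the
    forward algorithm, computed in log space for numerical stability.

    log_pi: (S,)   initial state log-probabilities
    log_A:  (S, S) transition log-probabilities, A[i, j] = P(j | i)
    log_B:  (T, S) per-step emission log-likelihoods of the observed data
    """
    alpha = log_pi + log_B[0]
    for t in range(1, len(log_B)):
        # log-sum-exp over previous states; this dense matrix step is the
        # part that parallelizes well on a GPU for long chains
        m = alpha.max()
        alpha = m + np.log(np.exp(alpha - m) @ np.exp(log_A)) + log_B[t]
    m = alpha.max()
    return m + np.log(np.exp(alpha - m).sum())
```

    Each time step is a dense (S,) x (S, S) product, so a long chain becomes a sequence of small GEMMs — the workload shape a GPU implementation can batch and accelerate.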

    Variants in transcription factor binding sites altering gene expression in prostate cancer

    Prostate cancer is the second most prevalent cancer and the fifth most common cause of death among men worldwide. There are several methods to treat prostate cancer, such as surgery, radiation therapy, hormone therapy, and chemotherapy. Non-lethal primary prostate cancer can develop into lethal castration-resistant prostate cancer. Prostate cancer development is caused by environmental and genetic factors. One promising explanation for prostate cancer development is transcription factor binding in cis-regulatory regions, which promotes or inhibits gene expression. Variants in these cis-regulatory elements can change the binding of transcription factors and, therefore, alter gene expression. In many cases, the effects of noncoding regions of the genome on gene expression are unclear. Noncoding regions include many essential parts of gene expression regulation, such as promoters, enhancers, and silencers. ATAC-sequencing is a sequencing method used to study chromatin accessibility genome-wide. Open chromatin peaks accessed by ATAC-sequencing contain active parts of the genome, which is why it is a suitable method to study active noncoding regions. The first aim of this Master's thesis was to perform variant calling with parameters suitable for ATAC-seq data. The second aim was to discover common variants within different transcription factor binding sites (TFBSs). The third aim was to find out how variants affect the ability of a transcription factor (TF) to bind to its binding site; this was accomplished by comparing position weight matrix (PWM) scores of wild-type and mutated sequences. Together, these three aims addressed the main objective: to discover whether, and which, variants in TFBSs can change the expression of genes close to these regulatory regions. Variant calling was performed with sufficient quality: a median of 91.4 % of the variants called from ATAC-sequencing data were also found among whole-genome sequencing variants.
The five most common transcription factor binding sites across all cell lines were CTCF, AR, ESR1, FOXA1, and MYC, and across prostate cell lines AR, FOXA1, ERG, CTCF, and E2F1. After running the Wilcoxon rank-sum test and Benjamini-Hochberg multiple-testing correction for each gene in samples with and without the variant, 443 genes had a p-value below 0.05. Of these, eight were considered significant in three transcription factors and 112 in two transcription factors. The eight genes present in three transcription factor binding sites were ZNF195, RFXANK, PTPN3, MAP4K5, KRIT1, ITGAL, DDX17, and AHCY. Previous studies of ITGAL, DDX17, and AHCY have indicated that these genes play a role in prostate cancer development. To establish whether the variants in transcription factor binding sites were actually the cause of the changes in gene expression, further studies would be required, for example using STARR-sequencing to estimate enhancer activity directly and quantitatively.
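The per-gene testing pipeline described in the abstract — a Wilcoxon rank-sum test on expression in samples with versus without a variant, followed by Benjamini-Hochberg correction — can be sketched as follows. This is a self-contained illustration (normal approximation, no tie handling), not the thesis code:

```python
import math
import numpy as np

def ranksum_pvalue(x, y):
    """Two-sided Wilcoxon rank-sum test (normal approximation, assuming
    no ties): does expression differ between samples carrying the
    variant (x) and samples without it (y)?"""
    combined = np.concatenate([x, y])
    ranks = np.argsort(np.argsort(combined)) + 1.0  # 1-based ranks
    n1, n2 = len(x), len(y)
    w = ranks[:n1].sum()                     # rank sum of group x
    mean = n1 * (n1 + n2 + 1) / 2.0
    sd = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12.0)
    z = (w - mean) / sd
    return math.erfc(abs(z) / math.sqrt(2))  # two-sided p-value

def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    scaled = p[order] * n / np.arange(1, n + 1)
    # enforce monotonicity from the largest p-value downwards
    scaled = np.minimum.accumulate(scaled[::-1])[::-1]
    out = np.empty(n)
    out[order] = np.clip(scaled, 0, 1)
    return out
```

Running `ranksum_pvalue` once per gene and feeding the resulting vector to `bh_adjust` mirrors the 443-gene screen described above.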

    High-Performance Modelling and Simulation for Big Data Applications

    This open access book was prepared as a Final Publication of the COST Action IC1406 "High-Performance Modelling and Simulation for Big Data Applications (cHiPSet)" project. Long considered important pillars of the scientific method, Modelling and Simulation have evolved from traditional discrete numerical methods to complex data-intensive continuous analytical optimisations. Resolution, scale, and accuracy have become essential to predict and analyse natural and complex systems in science and engineering. As their level of abstraction rises to give a better discernment of the domain at hand, their representation becomes increasingly demanding of computational and data resources. On the other hand, High Performance Computing typically entails the effective use of parallel and distributed processing units coupled with efficient storage, communication and visualisation systems to underpin complex data-intensive applications in distinct scientific and technical domains. A seamless interaction of High Performance Computing with Modelling and Simulation is therefore arguably required in order to store, compute, analyse, and visualise large data sets in science and engineering. Funded by the European Commission, cHiPSet has provided a dynamic trans-European forum for its members and distinguished guests to openly discuss novel perspectives and topics of interest for these two communities. This cHiPSet compendium presents a set of selected case studies related to healthcare, biological data, computational advertising, multimedia, finance, bioinformatics, and telecommunications

    Design and Code Optimization for Systems with Next-generation Racetrack Memories

    With the rise of computationally expensive application domains such as machine learning, genomics, and fluid simulation, the quest for performance and energy-efficient computing has gained unprecedented momentum. The significant increase in computing and memory devices in modern systems has resulted in an unsustainable surge in energy consumption, a substantial portion of which is attributed to the memory system. The scaling of conventional memory technologies and their suitability for next-generation systems is also questionable. This has led to the emergence and rise of nonvolatile memory (NVM) technologies. Today, in different development stages, several NVM technologies are competing for rapid access to the market. Racetrack memory (RTM) is one such nonvolatile memory technology that promises SRAM-comparable latency, reduced energy consumption, and unprecedented density compared to other technologies. However, RTM is sequential in nature: data in an RTM cell needs to be shifted to an access port before it can be accessed. These shift operations incur performance and energy penalties. An ideal RTM, requiring at most one shift per access, can easily outperform SRAM. However, in the worst-case shifting scenario, RTM can be an order of magnitude slower than SRAM. This thesis presents an overview of RTM device physics, its evolution, strengths and challenges, and its application in the memory subsystem. We develop tools that enable the programmability and modeling of RTM-based systems. For shift minimization, we propose a set of techniques including optimal, near-optimal, and evolutionary algorithms for efficient scalar and instruction placement in RTMs. For array accesses, we explore schedule and layout transformations that eliminate the longer overhead shifts in RTMs. We present an automatic compilation framework that analyzes static control flow programs and transforms the loop traversal order and memory layout to maximize accesses to consecutive RTM locations and minimize shifts. We develop a simulation framework called RTSim that models various RTM parameters and enables accurate architectural-level simulation. Finally, to demonstrate the potential of RTM in non-von-Neumann in-memory computing paradigms, we exploit its device attributes to implement logic and arithmetic operations. As a concrete use case, we implement an entire hyperdimensional computing framework in RTM to accelerate the language recognition problem. Our evaluation shows considerable performance and energy improvements compared to conventional von Neumann models and state-of-the-art accelerators
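    A simplified cost model makes the shift-minimization problem concrete: with a single access port, serving an access costs a number of shifts proportional to the distance between the port and the requested cell, so data placement determines the total shift count. The model and the frequency heuristic below are illustrative simplifications, not the thesis algorithms:

```python
from collections import Counter

def shift_cost(layout, trace):
    """Total shifts to serve an access trace on a single-port racetrack.

    layout: dict mapping variable -> cell offset on the track.
    The port starts aligned with offset 0; serving an access shifts the
    track until the variable's cell sits under the port.
    """
    pos, cost = 0, 0
    for var in trace:
        target = layout[var]
        cost += abs(target - pos)  # shifts needed to reach the cell
        pos = target
    return cost

def frequency_placement(trace):
    """Heuristic placement: put the most frequently accessed variables
    closest to the port (offset 0), a natural shift-minimization baseline."""
    ordered = [v for v, _ in Counter(trace).most_common()]
    return {v: i for i, v in enumerate(ordered)}
```

    On a skewed access trace, frequency-ordered placement needs fewer shifts than an arbitrary layout; the thesis's optimal and evolutionary placement algorithms push this further by also exploiting the order of accesses.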