
    Faster algorithms for 1-mappability of a sequence

    In the k-mappability problem, we are given a string x of length n and integers m and k, and we are asked to count, for each length-m factor y of x, the number of other factors of length m of x that are at Hamming distance at most k from y. We focus here on the version of the problem where k = 1. The fastest known algorithm for k = 1 requires time O(mn log n / log log n) and space O(n). We present two algorithms that require worst-case time O(mn) and O(n log^2 n), respectively, and space O(n), thus greatly improving the state of the art. Moreover, we present an algorithm that requires average-case time and space O(n) for integer alphabets if m = Ω(log n / log σ), where σ is the alphabet size.
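
    For concreteness, the quantity being computed can be checked against a minimal brute-force sketch, assuming the Hamming-distance definition above; it runs in O(n^2 m) time and is nothing like the paper's algorithms, but it pins down the output (function names are illustrative only):

```python
# Brute-force 1-mappability: for each length-m factor of x, count the
# other length-m factors at Hamming distance at most 1. O(n^2 * m) time,
# intended only as a reference for small inputs.

def hamming(a: str, b: str) -> int:
    """Hamming distance between two equal-length strings."""
    return sum(c1 != c2 for c1, c2 in zip(a, b))

def one_mappability(x: str, m: int) -> list[int]:
    factors = [x[i:i + m] for i in range(len(x) - m + 1)]
    return [
        sum(1 for j, g in enumerate(factors) if j != i and hamming(f, g) <= 1)
        for i, f in enumerate(factors)
    ]

print(one_mappability("abaababa", 3))  # one count per starting position
```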

    Efficient Computation of Sequence Mappability

    Sequence mappability is an important task in genome re-sequencing. In the (k,m)-mappability problem, for a given sequence T of length n, our goal is to compute a table whose i-th entry is the number of indices j ≠ i such that the length-m substrings of T starting at positions i and j have at most k mismatches. Previous works on this problem focused on heuristic approaches to compute a rough approximation of the result or on the case of k = 1. We present several efficient algorithms for the general case of the problem. Our main result is an algorithm that works in O(n min{m^k, log^{k+1} n}) time and O(n) space for k = O(1). It requires a careful adaptation of the technique of Cole et al. [STOC 2004] to avoid multiple counting of pairs of substrings. We also show O(n^2)-time algorithms to compute all results for a fixed m and all k = 0,...,m, or for a fixed k and all m = k,...,n-1. Finally, we show that the (k,m)-mappability problem cannot be solved in strongly subquadratic time for k, m = Θ(log n) unless the Strong Exponential Time Hypothesis fails. Comment: Accepted to SPIRE 2018.
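
    As a small companion to the table definition, the k = 0 special case simply counts exact repeats of each length-m substring, which a hash map gives in expected O(nm) time; this sketch illustrates the output format only and does not reflect the paper's O(n min{m^k, log^{k+1} n}) technique:

```python
# (0,m)-mappability: entry i = number of indices j != i with
# T[i:i+m] == T[j:j+m]. Exact matching only (k = 0).
from collections import Counter

def zero_mappability(T: str, m: int) -> list[int]:
    counts = Counter(T[i:i + m] for i in range(len(T) - m + 1))
    return [counts[T[i:i + m]] - 1 for i in range(len(T) - m + 1)]

print(zero_mappability("banana", 2))  # [0, 1, 1, 1, 1]: 'an' and 'na' repeat
```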

    Diminishing Return for Increased Mappability with Longer Sequencing Reads: Implications of the k-mer Distributions in the Human Genome

    The amount of non-unique sequence (non-singletons) in a genome directly affects the difficulty of read alignment to a reference assembly for high-throughput sequencing data. Although a greater length increases the chance of reads being uniquely mapped to the reference genome, a quantitative analysis of the influence of read length on mappability has been lacking. To address this question, we evaluate the k-mer distribution of the human reference genome. The k-mer frequency is determined for k ranging from 20 to 1000 basepairs. We use the proportion of non-singleton k-mers to evaluate the mappability of reads for a corresponding read length. We observe that the proportion of non-singletons decreases slowly with increasing k, and can be fitted by piecewise power-law functions with different exponents at different k ranges. A faster decay at smaller values of k indicates more limited gains for read lengths > 200 basepairs. The frequency distributions of k-mers exhibit long tails in a power-law-like trend, and rank-frequency plots exhibit a concave Zipf's curve. The locations of the most frequent 1000-mers comprise 172 kilobase-ranged regions, including four large stretches on chromosomes 1 and X, containing genes with biomedical implications. Even a read length of 1000 would be insufficient to reliably sequence these specific regions.
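
    The paper's central measure, the proportion of non-singleton k-mers, is easy to reproduce on small inputs; the sketch below uses a random stand-in sequence rather than the human reference assembly used in the study:

```python
# Fraction of k-mer positions whose k-mer occurs more than once in the
# sequence (non-singletons); higher values mean poorer mappability at
# read length k.
from collections import Counter
import random

def non_singleton_fraction(genome: str, k: int) -> float:
    kmers = [genome[i:i + k] for i in range(len(genome) - k + 1)]
    counts = Counter(kmers)
    return sum(1 for km in kmers if counts[km] > 1) / len(kmers)

random.seed(0)
genome = "".join(random.choice("ACGT") for _ in range(10_000))
for k in (10, 20, 50):
    print(k, round(non_singleton_fraction(genome, k), 4))  # decreases with k
```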

    The k-Mappability Problem Revisited

    The k-mappability problem has two integer parameters m and k. For every subword of size m in a text S, we wish to report the number of indices in S at which the word occurs with at most k mismatches. The problem was recently tackled by Alzamel et al. [Mai Alzamel et al., 2018]. For a text with a constant-size alphabet Σ and k ∈ O(1), they present an algorithm with linear space and O(n log^{k+1} n) time. For the case in which k = 1 and a constant-size alphabet, a faster algorithm with linear space and O(n log n log log n) time was presented in [Mai Alzamel et al., 2020]. In this work, we enhance the techniques of [Mai Alzamel et al., 2020] to obtain an algorithm with linear space and O(n log n) time for k = 1. Our algorithm removes the constraint of the alphabet being of constant size. We also present linear algorithms for the case of k = 1, |Σ| ∈ O(1) and m = Ω(√n).

    CNVScan: detecting borderline copy number variations in NGS data via scan statistics

    Background. Next Generation Sequencing (NGS) data has been extensively exploited in the last decade to analyse genome variations and to understand their role in complex diseases. Copy number variations (CNVs) are genomic structural variants estimated to account for about 1.2% of the total variation in humans. CNVs in coding or regulatory regions may have an impact on gene expression, often also at a functional level, and contribute to diseases such as cancer, autism and cardiovascular disease. Computational methods developed for the detection of CNVs from NGS data and based on the depth of coverage are limited to the identification of medium/large events and are heavily influenced by the level of coverage. Result. In this paper we propose CNVScan, a CNV detection method based on scan statistics that overcomes the limitations of previous read count (RC) based approaches, mainly by being window-less. Scan statistics have previously been used mainly in epidemiology and ecology studies but, to the best of our knowledge, have never before been applied to the CNV detection problem. Since we avoid windowing, we do not have to choose an optimal window size, which is a key step in many previous approaches. Extensive simulated experiments with single-read data in extreme situations (low coverage, short reads, homo-/heterozygosity) show that this approach is very effective for a range of small CNVs (200-500 bp) for which previous state-of-the-art methods are not suitable. Conclusion. The scan statistics technique is applied and adapted in this paper for the first time to the CNV detection problem. Comparison with state-of-the-art methods shows that the approach is quite effective in discovering short CNVs in rather extreme situations in which previous methods fail or have degraded performance. CNVScan thus extends the range of CNV sizes and types that can be detected via read counts with single-read data.
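
    As a rough illustration of the underlying statistic (not CNVScan's window-less adaptation, which the paper develops), a classical scan statistic over read start positions with a Monte Carlo p-value under a uniform null might look like the following; all names are illustrative:

```python
# Classical fixed-width scan statistic: the maximum number of read
# starts in any half-open interval of length w, with significance
# assessed by simulating uniform read placement.
import bisect
import random

def scan_statistic(starts: list[int], w: int) -> int:
    s = sorted(starts)
    # For each start, count starts in [x, x + w); the max over intervals
    # anchored at a point equals the max over all intervals.
    return max(bisect.bisect_left(s, x + w) - i for i, x in enumerate(s))

def mc_pvalue(starts: list[int], w: int, genome_len: int,
              n_sim: int = 1000, seed: int = 0) -> float:
    rng = random.Random(seed)
    obs = scan_statistic(starts, w)
    hits = sum(
        scan_statistic([rng.randrange(genome_len) for _ in starts], w) >= obs
        for _ in range(n_sim)
    )
    return hits / n_sim

starts = [120, 130, 135, 140, 980, 2400, 5100, 7800]  # toy read starts
print(mc_pvalue(starts, w=50, genome_len=10_000))     # small p: a cluster
```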

    Longest Common Prefixes with k-Errors and Applications

    Although real-world text datasets, such as DNA sequences, are far from being uniformly random, average-case string searching algorithms perform significantly better than worst-case ones in most applications of interest. In this paper, we study the problem of computing the longest prefix of each suffix of a given string of length n over a constant-sized alphabet that occurs elsewhere in the string with k errors. This problem has already been studied under the Hamming distance model. Our first result is an improvement upon the state-of-the-art average-case time complexity for non-constant k, using only linear space, under the Hamming distance model. Notably, we show that our technique can be extended to the edit distance model with the same time and space complexities. Specifically, our algorithms run in O(n log^k n log log n) time on average using O(n) space. We show that our technique is applicable to several algorithmic problems in computational biology and elsewhere.
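
    For the Hamming-distance variant, the quantity being computed can be validated on small strings with a brute-force O(n^3) sketch (the paper's algorithms are, of course, far faster):

```python
# For each suffix of x, the length of its longest prefix that occurs
# starting at some other position with at most k mismatches.

def lcp_k_errors(x: str, k: int) -> list[int]:
    n = len(x)
    out = []
    for i in range(n):
        best = 0
        for j in range(n):
            if j == i:
                continue
            errs, length = 0, 0
            while i + length < n and j + length < n:
                errs += x[i + length] != x[j + length]
                if errs > k:
                    break
                length += 1
            best = max(best, length)
        out.append(best)
    return out

print(lcp_k_errors("abaab", 1))  # one value per suffix
```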

    Zero-Inflated Models to Identify Transcription Factor Binding Sites in ChIP-seq Experiments

    It is essential to determine protein-DNA binding sites to understand many biological processes. A transcription factor is a particular type of protein that binds to DNA and controls gene regulation in living organisms. Chromatin immunoprecipitation followed by high-throughput sequencing (ChIP-seq) is considered the gold standard for locating these binding sites, and the programs used to identify DNA-transcription factor binding sites are known as peak-callers. ChIP-seq data are known to exhibit considerable background noise and other biases. In this study, we propose a negative binomial model (NB), a zero-inflated Poisson model (ZIP) and a zero-inflated negative binomial model (ZINB) for peak-calling. Using real ChIP-seq datasets, we show that the ZINB model is the best model for ChIP-seq data. We then incorporate control data, GC count information, and mappability information into the ZINB regression model as covariates using two link functions. We implemented this approach in C++, and our peak-caller chooses the optimal parameter combination for a given dataset. The performance of our approach is compared with two frequently used peak-callers: QuEST and MACS.
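
    For intuition, here is a minimal sketch of the ZIP model the study compares, fit by direct likelihood maximization with SciPy; the authors' C++ peak-caller and its covariate regression are not reproduced:

```python
# Zero-inflated Poisson: with probability pi a bin is a structural
# zero, otherwise counts follow Poisson(lam). Fit by maximizing the
# log-likelihood over unconstrained parameters.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import poisson

def zip_negloglik(params, y):
    logit_pi, log_lam = params
    pi = 1.0 / (1.0 + np.exp(-logit_pi))   # zero-inflation probability
    lam = np.exp(log_lam)                  # Poisson mean
    ll_zero = np.log(pi + (1.0 - pi) * np.exp(-lam))    # log P(Y = 0)
    ll_pos = np.log(1.0 - pi) + poisson.logpmf(y, lam)  # log P(Y = y), y > 0
    return -np.sum(np.where(y == 0, ll_zero, ll_pos))

y = np.array([0] * 60 + [1, 2, 3, 2, 1, 4, 5, 3, 2, 1] * 4)  # toy bin counts
fit = minimize(zip_negloglik, x0=[0.0, 0.0], args=(y,))
print(1 / (1 + np.exp(-fit.x[0])), np.exp(fit.x[1]))  # pi_hat, lam_hat
```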

    Systematic Evaluation of Factors Influencing ChIP-Seq Fidelity

    We performed a systematic evaluation of how variations in sequencing depth and other parameters influence interpretation of chromatin immunoprecipitation followed by sequencing (ChIP-seq) experiments. Using Drosophila S2 cells, we generated ChIP-seq datasets for a site-specific transcription factor (Suppressor of Hairy-wing) and a histone modification (H3K36me3). We detected a chromatin-state bias: open chromatin regions yielded higher coverage, which led to false positives if not corrected and had a greater effect on detection specificity than any base-composition bias. Paired-end sequencing revealed that single-end data underestimated ChIP library complexity at high coverage. The removal of reads originating at the same base reduced false positives while having little effect on detection sensitivity. Even at a depth of ~1 read/bp coverage of the mappable genome, ~1% of the narrow peaks detected on a tiling array were missed by ChIP-seq. Evaluation of widely used ChIP-seq analysis tools suggests that adjustments or algorithm improvements are required to handle datasets with deep coverage.
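
    The duplicate-removal step evaluated here, collapsing reads that originate at the same base, can be sketched as follows; the record fields are illustrative, not a specific aligner's format:

```python
# Keep one read per (chromosome, strand, start position); further
# reads with the same origin are dropped.
from typing import Iterable, NamedTuple

class Read(NamedTuple):
    chrom: str
    strand: str
    start: int

def dedup_by_origin(reads: Iterable[Read]) -> list[Read]:
    seen, kept = set(), []
    for r in reads:
        key = (r.chrom, r.strand, r.start)
        if key not in seen:
            seen.add(key)
            kept.append(r)
    return kept

reads = [Read("chr2L", "+", 100), Read("chr2L", "+", 100),
         Read("chr2L", "-", 100), Read("chr2L", "+", 250)]
print(len(dedup_by_origin(reads)))  # 3: the repeated (+, 100) read is dropped
```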

    WisecondorX: improved copy number detection for routine shallow whole-genome sequencing

    Shallow whole-genome sequencing to infer copy number alterations (CNAs) in the human genome is rapidly becoming the method par excellence for routine diagnostic use. Numerous tools exist to deduce aberrations from massively parallel sequencing data, yet most are optimized for research and often fail to meet paramount needs in a clinical setting. Optimally, read depth-based analytical software should be able to deal with single-end and low-coverage data, so as to keep sequencing costs feasible. Other important factors include runtime, applicability to a variety of analyses and overall performance. We compared the most important aspect, normalization, across six different CNA tools, selected for their assumed ability to satisfy these needs. In conclusion, WISECONDOR, which uses a within-sample normalization technique, undoubtedly produced the best results concerning variance, distributional assumptions and basic ability to detect true variations. Nonetheless, as is the case with every tool, WISECONDOR has limitations, which stem from its exclusive focus on non-invasive prenatal testing. Therefore, this work additionally presents WisecondorX, an improved WISECONDOR that enables its use for varying types of applications. WisecondorX is freely available at https://github.com/CenterForMedicalGeneticsGhent/WisecondorX
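
    As a toy illustration of the within-sample idea, normalizing bins against the sample itself rather than against external reference samples (WISECONDOR's per-bin selection of internally similar reference bins is more sophisticated and not reproduced here):

```python
# Express each bin's read count as a log2 ratio against the median of
# the sample's own nonzero bins; zero-count bins map to -inf.
import numpy as np

def within_sample_log2(bin_counts: np.ndarray) -> np.ndarray:
    counts = bin_counts.astype(float)
    med = np.median(counts[counts > 0])
    with np.errstate(divide="ignore"):
        return np.log2(counts / med)

counts = np.array([98, 102, 100, 150, 149, 151, 99, 101])  # toy bin counts
print(within_sample_log2(counts).round(2))  # the ~150 bins stand out at ~+0.56
```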