189 research outputs found
Efficient HTTP based I/O on very large datasets for high performance computing with the libdavix library
Remote data access for data analysis in high performance computing is
commonly done with specialized data access protocols and storage systems. These
protocols are highly optimized for high throughput on very large datasets,
multi-streams, high availability, low latency and efficient parallel I/O. The
purpose of this paper is to describe how we have adapted a generic protocol,
the Hypertext Transfer Protocol (HTTP), to make it a competitive alternative
for high performance I/O and data analysis applications in a global computing
grid: the Worldwide LHC Computing Grid. In this work, we first analyze the
design differences between the HTTP protocol and the most common high
performance I/O protocols, pointing out the main performance weaknesses of
HTTP. Then, we describe in detail how we solved these issues. Our solutions
have been implemented in a toolkit called davix, available through several
recent Linux distributions. Finally, we describe the results of our benchmarks
where we compare the performance of davix against an HPC-specific protocol for a
data analysis use case. Comment: Presented at: Very Large Data Bases (VLDB) 2014, Hangzho
Exponentially hard problems are sometimes polynomial, a large deviation analysis of search algorithms for the random Satisfiability problem, and its application to stop-and-restart resolutions
A large deviation analysis of the solving complexity of random
3-Satisfiability instances slightly below threshold is presented. While finding
a solution for such instances demands an exponential effort with high
probability, we show that an exponentially small fraction of resolutions
require a computation scaling linearly in the size of the instance only. This
exponentially small probability of easy resolutions is analytically calculated,
and the corresponding exponent shown to be smaller (in absolute value) than the
growth exponent of the typical resolution time. Our study therefore gives some
theoretical basis to heuristic stop-and-restart solving procedures, and
suggests a natural cut-off (the size of the instance) for the restart. Comment: Revtex file, 4 figure
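A minimal sketch of the stop-and-restart idea described in this abstract: each run is abandoned once it has explored as many search nodes as the instance has variables, then restarted with fresh random choices. The solver below is a generic randomized backtracking search with unit propagation, used here as an assumption for illustration; it is not the specific algorithm analysed in the paper, and the names `solve_with_restarts`, `search` and `simplify` are hypothetical.

```python
import random


class BudgetExceeded(Exception):
    """Raised when a single run uses up its node budget."""


def simplify(clauses, literal):
    """Set `literal` true; return the reduced clauses, or None on a contradiction."""
    reduced = []
    for clause in clauses:
        if literal in clause:
            continue  # clause already satisfied
        rest = tuple(l for l in clause if l != -literal)
        if not rest:
            return None  # every literal of this clause is now false
        reduced.append(rest)
    return reduced


def search(clauses, assignment, budget, rng):
    """Randomized backtracking; `budget` is a one-element list of nodes left."""
    if not clauses:
        return assignment
    if budget[0] <= 0:
        raise BudgetExceeded
    budget[0] -= 1
    unit = next((c[0] for c in clauses if len(c) == 1), None)
    if unit is not None:
        choices = [unit]  # forced move, the opposite value falsifies the clause
    else:
        literal = rng.choice(rng.choice(clauses))
        choices = [literal, -literal]
    for literal in choices:
        reduced = simplify(clauses, literal)
        if reduced is not None:
            result = search(reduced, assignment + [literal], budget, rng)
            if result is not None:
                return result
    return None


def solve_with_restarts(clauses, n_vars, max_restarts=1000, seed=0):
    """Restart whenever a run exceeds the cut-off of `n_vars` search nodes."""
    rng = random.Random(seed)
    for attempt in range(1, max_restarts + 1):
        try:
            result = search(clauses, [], [n_vars], rng)
        except BudgetExceeded:
            continue  # stop this run and restart with new random choices
        status = "sat" if result is not None else "unsat"
        return status, result, attempt
    return "unknown", None, max_restarts
```

Literals are signed, 1-based variable indices, so `(1, -2, 3)` means x1 OR NOT x2 OR x3. A call such as `solve_with_restarts([(1, 2, 3), (-1, 2, 3), (1, -2, 3)], n_vars=3)` returns a status, a (possibly partial) satisfying assignment, and the number of runs used.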
Comparison of sequencing-based methods to profile DNA methylation and identification of monoallelic epigenetic modifications.
Analysis of DNA methylation patterns relies increasingly on sequencing-based profiling methods. The four most frequently used sequencing-based technologies are the bisulfite-based methods MethylC-seq and reduced representation bisulfite sequencing (RRBS), and the enrichment-based techniques methylated DNA immunoprecipitation sequencing (MeDIP-seq) and methylated DNA binding domain sequencing (MBD-seq). We applied all four methods to biological replicates of human embryonic stem cells to assess their genome-wide CpG coverage, resolution, cost, concordance and the influence of CpG density and genomic context. The methylation levels assessed by the two bisulfite methods were concordant (their difference did not exceed a given threshold) for 82% of CpGs and 99% of the non-CpG cytosines. Using binary methylation calls, the two enrichment methods were 99% concordant and regions assessed by all four methods were 97% concordant. We combined MeDIP-seq with methylation-sensitive restriction enzyme sequencing (MRE-seq) for comprehensive methylome coverage at lower cost. This, along with RNA-seq and ChIP-seq of the ES cells, enabled us to detect regions with allele-specific epigenetic states, identifying most known imprinted regions and new loci with monoallelic epigenetic marks and monoallelic expression.
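The concordance measure mentioned in this abstract (two methods agree at a site when their methylation levels differ by no more than a threshold) can be sketched as follows. The abstract does not state the threshold, so the value 0.25 in the usage example is purely an assumption, as are the function name `concordance` and the convention that `None` marks a site not covered by a method.

```python
def concordance(levels_a, levels_b, threshold):
    """Fraction of sites covered by both methods whose levels differ by <= threshold.

    `levels_a` and `levels_b` are per-site methylation levels in [0, 1],
    aligned by position; None marks a site that a method did not cover.
    """
    if len(levels_a) != len(levels_b):
        raise ValueError("both methods must be reported on the same sites")
    shared = [(a, b) for a, b in zip(levels_a, levels_b)
              if a is not None and b is not None]
    if not shared:
        raise ValueError("no site is covered by both methods")
    agreeing = sum(1 for a, b in shared if abs(a - b) <= threshold)
    return agreeing / len(shared)


# Hypothetical toy data: four sites, the last one missing from method A.
method_a = [0.9, 0.1, 0.5, None]
method_b = [0.8, 0.6, 0.5, 0.2]
print(concordance(method_a, method_b, threshold=0.25))
```

Restricting the denominator to sites covered by both methods keeps differences in coverage from being counted as disagreement.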
Genomic, Pathway Network, and Immunologic Features Distinguishing Squamous Carcinomas
This integrated, multiplatform PanCancer Atlas study co-mapped and identified distinguishing
molecular features of squamous cell carcinomas (SCCs) from five sites associated with smokin
A machine learning case–control classifier for schizophrenia based on DNA methylation in blood
Epigenetic dysregulation is thought to contribute to the etiology of schizophrenia (SZ), but the cell type-specificity of DNA methylation makes population-based epigenetic studies of SZ challenging. To train an SZ case–control classifier based on DNA methylation in blood, therefore, we focused on human genomic regions of systemic interindividual epigenetic variation (CoRSIVs), a subset of which are represented on the Illumina Human Methylation 450K (HM450) array. HM450 DNA methylation data on whole blood of 414 SZ cases and 433 non-psychiatric controls were used as training data for a classification algorithm with built-in feature selection, sparse partial least squares discriminant analysis (SPLS-DA); application of SPLS-DA to HM450 data has not been previously reported. Using the first two SPLS-DA dimensions we calculated a “risk distance” to identify individuals with the highest probability of SZ. The model was then evaluated on an independent HM450 data set on 353 SZ cases and 322 non-psychiatric controls. Our CoRSIV-based model classified 303 individuals as cases with a positive predictive value (PPV) of 80%, far surpassing the performance of a model based on polygenic risk score (PRS). Importantly, risk distance (based on CoRSIV methylation) was not associated with medication use, arguing against reverse causality. Risk distance and PRS were positively correlated (Pearson r = 0.28, P = 1.28 × 10^-12), and mediational analysis suggested that genetic effects on SZ are partially mediated by altered methylation at CoRSIVs. Our results indicate two innate dimensions of SZ risk: one based on genetic variants and the other on systemic epigenetic variants.
Ronin Governs Early Heart Development by Controlling Core Gene Expression Programs.
Ronin (THAP11), a DNA-binding protein that evolved from a primordial DNA transposon by molecular domestication, recognizes a hyperconserved promoter sequence to control developmentally and metabolically essential genes in pluripotent stem cells. However, it remains unclear whether Ronin or related THAP proteins perform similar functions in development. Here, we present evidence that Ronin functions within the nascent heart as it arises from the mesoderm and forms a four-chambered organ. We show that Ronin is vital for cardiogenesis during midgestation by controlling a set of critical genes. The activity of Ronin coincided with the recruitment of its cofactor, Hcf-1, and the elevation of H3K4me3 levels at specific target genes, suggesting the involvement of an epigenetic mechanism. On the strength of these findings, we propose that Ronin activity during cardiogenesis offers a template to understand how important gene programs are sustained across different cell types within a developing organ such as the heart.
ReadDepth: A Parallel R Package for Detecting Copy Number Alterations from Short Sequencing Reads
Copy number alterations are important contributors to many genetic diseases, including cancer. We present the readDepth package for R, which can detect these aberrations by measuring the depth of coverage obtained by massively parallel sequencing of the genome. In addition to achieving higher accuracy than existing packages, our tool runs much faster by utilizing multi-core architectures to parallelize the processing of these large data sets. In contrast to other published methods, readDepth does not require the sequencing of a reference sample, and uses a robust statistical model that accounts for overdispersed data. It includes a method for effectively increasing the resolution obtained from low-coverage experiments by utilizing breakpoint information from paired end sequencing to do positional refinement. We also demonstrate a method for inferring copy number using reads generated by whole-genome bisulfite sequencing, thus enabling integrative study of epigenomic and copy number alterations. Finally, we apply this tool to two genomes, showing that it performs well on genomes sequenced to both low and high coverage. The readDepth package runs on Linux and MacOSX, is released under the Apache 2.0 license, and is available at http://code.google.com/p/readdepth/
Cistrome and Transcriptome Analysis Identifies Unique Androgen Receptor (AR) and AR-V7 Splice Variant Chromatin Binding and Transcriptional Activities
The constitutively active androgen receptor (AR) splice variant, AR-V7, plays an important role in resistance to androgen deprivation therapy in castration-resistant prostate cancer (CRPC). Studies seeking to determine whether AR-V7 is a partial mimic of the AR, or also has unique activities, and whether the AR-V7 cistrome contains unique binding sites have yielded conflicting results. One limitation in many studies has been the low level of AR variant compared to AR. Here, LNCaP and VCaP cell lines in which AR-V7 expression can be induced to match the level of AR were used to compare the activities of AR and AR-V7. The two AR isoforms shared many targets, but overall had distinct transcriptomes. Optimal induction of novel targets sometimes required more receptor isoform than classical targets such as PSA. The isoforms displayed remarkably different cistromes with numerous differential binding sites. Some of the unique AR-V7 sites were located proximal to the transcription start sites (TSS). A de novo binding motif similar to a half ARE was identified in many AR-V7 preferential sites and, in contrast to conventional half ARE sites that bind AR-V7, FOXA1 was not enriched at these sites. This supports the concept that the AR isoforms have unique actions with the potential to serve as biomarkers or novel therapeutic targets.
