Covariate-assisted ranking and screening for large-scale two-sample inference

Author: Cai T. Tony
Sun Wenguang
Wang Weinan
Publication venue: eScholarship, University of California
Publication date: 01/01/2019

Two-sample multiple testing has a wide range of applications. The conventional practice first reduces the original observations to a vector of p-values and then chooses a cutoff to adjust for multiplicity. However, this data reduction step could cause significant loss of information and thus lead to suboptimal testing procedures. We introduce a new framework for two-sample multiple testing by incorporating a carefully constructed auxiliary variable in inference to improve the power. A data-driven multiple-testing procedure is developed by employing a covariate-assisted ranking and screening (CARS) approach that optimally combines the information from both the primary and the auxiliary variables. The proposed CARS procedure is shown to be asymptotically valid and optimal for false discovery rate control. The procedure is implemented in the R package CARS. Numerical results confirm the effectiveness of CARS in false discovery rate control and show that it achieves substantial power gain over existing methods. CARS is also illustrated through an application to the analysis of a satellite imaging data set for supernova detection.
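
The conventional practice mentioned in this abstract, reducing the data to p-values and then choosing a multiplicity-adjusted cutoff, is often carried out with the Benjamini-Hochberg step-up procedure. The following is a minimal Python sketch of that baseline only; it is not the CARS procedure, and the function name `benjamini_hochberg` and the example p-values are illustrative assumptions rather than anything taken from the paper or the R package.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans, True where the hypothesis is rejected.

    Baseline p-value thresholding (Benjamini-Hochberg step-up), shown only
    to illustrate the "conventional practice"; this is NOT CARS.
    """
    m = len(p_values)
    if m == 0:
        return []
    # Indices of the hypotheses, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Largest 1-based rank k such that p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff rank.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected


# Hypothetical p-values, for illustration only.
example_p = [0.001, 0.008, 0.039, 0.041, 0.20, 0.74]
decisions = benjamini_hochberg(example_p, alpha=0.05)
```

Because the cutoff depends only on the ranked p-values, any information the original two-sample data carry beyond the p-values is ignored, which is the loss that the abstract says the auxiliary variable is meant to recover.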

eScholarship - University of California

Codon Bias Patterns of $E. coli$'s Interacting Proteins

Author: Cimini Giulio
Deiana Antonio
Dilucca Maddalena
Giansanti Andrea
Semmoloni Andrea
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 01/01/2015

Synonymous codons, i.e., DNA nucleotide triplets coding for the same amino acid, are used differently across the variety of living organisms. The biological meaning of this phenomenon, known as codon usage bias, is still controversial. In order to shed light on this point, we propose a new codon bias index, CompAI, that is based on the competition between cognate and near-cognate tRNAs during translation, without being tuned to the usage bias of highly expressed genes. We perform a genome-wide evaluation of codon bias for E. coli, comparing CompAI with other widely used indices: tAI, CAI, and Nc. We show that CompAI and tAI capture similar information by being positively correlated with gene conservation, measured by ERI, and essentiality, whereas CAI and Nc appear to be less sensitive to evolutionary-functional parameters. Notably, the rate of variation of tAI and CompAI with ERI makes it possible to obtain sets of genes that consistently belong to specific clusters of orthologous genes (COGs). We also investigate the correlation of codon bias at the genomic level with the network features of protein-protein interactions in E. coli. We find that the most densely connected communities of the network share a similar level of codon bias (as measured by CompAI and tAI). Conversely, a small difference in codon bias between two genes is, statistically, a prerequisite for the corresponding proteins to interact. Importantly, among all codon bias indices, CompAI turns out to have the most coherent distribution over the communities of the interactome, pointing to the significance of competition among cognate and near-cognate tRNAs for explaining codon usage adaptation.

arXiv.org e-Print Archive

Crossref

Directory of Open Access Journals

PubMed Central

Archivio della ricerca della Scuola IMT Alti Studi Lucca

ART

Archivio della ricerca- Università di Roma La Sapienza

IMT Institutional Repository

FigShare

A method for exploratory repeated-measures analysis applied to a breast-cancer screening study

Author: Astley Susan M.
Boggis Caroline R. M.
Brentnall Adam R.
Crowder Martin J.
Duffy Stephen W.
Gilbert Fiona J.
Gillan Maureen G. C.
James Jonathan
Wallis Matthew G.
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/12/2011

When a model may be fitted separately to each individual statistical unit, inspection of the point estimates may help the statistician to understand between-individual variability and to identify possible relationships. However, some information will be lost in such an approach because estimation uncertainty is disregarded. We present a comparative method for exploratory repeated-measures analysis to complement the point estimates that was motivated by and is demonstrated by analysis of data from the CADET II breast-cancer screening study. The approach helped to flag up some unusual reader behavior, to assess differences in performance, and to identify potential random-effects models for further analysis. Comment: Published at http://dx.doi.org/10.1214/11-AOAS481 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)

arXiv.org e-Print Archive

Crossref

The University of Manchester - Institutional Repository

Queen Mary Research Online

Investigation of sequence features of hinge-bending regions in proteins with domain movements using kernel logistic regression

Author: Cawley Gavin
Hayward Steven
Veevers Ruth
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/04/2020

Background: Hinge-bending movements in proteins comprising two or more domains form a large class of functional movements. Hinge-bending regions demarcate protein domains and collectively control the domain movement. Consequently, the ability to recognise sequence features of hinge-bending regions and to be able to predict them from sequence alone would benefit various areas of protein research. For example, an understanding of how the sequence features of these regions relate to dynamic properties in multi-domain proteins would aid in the rational design of linkers in therapeutic fusion proteins. Results: The DynDom database of protein domain movements comprises sequences annotated to indicate whether the amino acid residue is located within a hinge-bending region or within an intradomain region. Using statistical methods and Kernel Logistic Regression (KLR) models, this data was used to determine sequence features that favour or disfavour hinge-bending regions. This is a difficult classification problem as the number of negative cases (intradomain residues) is much larger than the number of positive cases (hinge residues). The statistical methods and the KLR models both show that cysteine has the lowest propensity for hinge-bending regions and proline has the highest, even though it is the most rigid amino acid. As hinge-bending regions have been previously shown to occur frequently at the terminal regions of the secondary structures, the propensity for proline at these regions is likely due to its tendency to break secondary structures. The KLR models also indicate that isoleucine may act as a domain-capping residue. We have found that a quadratic KLR model outperforms a linear KLR model and that improvement in performance occurs up to very long window lengths (eighty residues) indicating long-range correlations. 
Conclusion: In contrast to the only other approach that focused solely on interdomain hinge-bending regions, the method provides a modest and statistically significant improvement over a random classifier. An explanation of the KLR results is that in the prediction of hinge-bending regions a long-range correlation is at play between a small number of amino acids that either favour or disfavour hinge-bending regions. The resulting sequence-based prediction tool, HingeSeek, is available to run through a webserver at hingeseek.cmp.uea.ac.uk

University of East Anglia digital repository

Functional Interpretation of High-Throughput Sequencing Data.

Author: Lee Chee
Publication date: 01/01/2016

Functional interpretation of high-throughput sequencing (HTS) data provides insight into biological systems, including important pathways in the context under study. A common approach is gene set enrichment (GSE) testing. GSE emerged in the age of microarrays as a way to biologically interpret long lists of differentially expressed genes (DEGs). However, HTS data has characteristics not present in microarray data that can bias GSE results. My thesis is focused on identifying, characterizing, and accounting for biases to improve functional interpretation in HTS data. In this thesis, I present GSE tests designed for ChIP-seq data and RNA-seq data. Our tests have applications beyond HTS data, which we show by using them to analyze genomic features, including mappability and repeat content. ChIP-Enrich is a GSE test for ChIP-seq data. It includes a database of locus definitions to annotate peaks to different gene loci (such as exons, introns, promoters, and other intergenic regions), which allows for biological discovery unique to different regions. ChIP-Enrich empirically adjusts for the observed bias due to the varying lengths of these gene loci in its enrichment test. RNA-Enrich is a GSE test for RNA-seq data. RNA-Enrich corrects for the selection bias often observed in RNA-seq data, where long and highly expressed genes are more likely to be identified as DEGs. Unlike other GSE tests for RNA-seq data, RNA-Enrich does not require permutations or a cut-off to define DEGs, and works well with small sample sizes. For both ChIP-Enrich and RNA-Enrich, we showed well-calibrated type I error compared to competing methods. Finally, we characterize sequence mappability, which is one potential bias in the interpretation of HTS data. We characterize properties of the main contributors of low mappability (transposons and segmental duplications), overall mappability, and their relationship with gene locus length and function. 
Across different transcribed and regulatory regions, certain gene functions showed unique signatures involving significantly more/fewer associated repeats, higher/lower mappability, and longer/shorter locus length. Our analyses provide insight into evolutionary selection pressures that maintain complexity of gene regulation. Overall, we demonstrate that considering characteristics of the human genome is essential to improving functional interpretation of HTS data. PhD, Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/120731/1/cheelee_1.pd
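
For orientation, the classic cutoff-based form of gene set enrichment testing that this abstract contrasts itself with can be written as a one-sided hypergeometric (Fisher-type) over-representation test. The sketch below is a generic illustration under that assumption; it is not ChIP-Enrich or RNA-Enrich, which according to the abstract adjust for locus length and selection bias, and the function name and argument names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(n_genes, n_in_set, n_degs, n_overlap):
    """Upper-tail probability P(X >= n_overlap) for a hypergeometric X.

    n_genes:   number of genes tested overall (the background)
    n_in_set:  number of background genes that belong to the gene set
    n_degs:    number of genes called differentially expressed (by a cutoff)
    n_overlap: number of DEGs that belong to the gene set

    Generic over-representation test; it does NOT correct for gene length
    or expression level.
    """
    total = comb(n_genes, n_degs)
    max_overlap = min(n_in_set, n_degs)
    # Sum the probability mass for every overlap at least as large as observed.
    tail = sum(
        comb(n_in_set, k) * comb(n_genes - n_in_set, n_degs - k)
        for k in range(n_overlap, max_overlap + 1)
    )
    return tail / total


# Toy numbers, for illustration only: 10 genes, 4 in the set, 3 DEGs, 2 overlapping.
p_value = hypergeom_enrichment_p(10, 4, 3, 2)
```

This test treats every gene as equally likely to be called a DEG, which is the assumption that the length and expression biases described in the abstract violate.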

Deep Blue Documents at the University of Michigan

Galaxy alignments: Observations and impact on cosmology

Author: Brown Michael L.
Cacciato Marcello
Choi Ami
Hoekstra Henk
Joachimi Benjamin
Kiessling Alina
Kirk Donnacha
Kitching Thomas D.
Leonard Adrienne
Mandelbaum Rachel
Rassat Anais
Schäfer Björn Malte
Sifón Cristóbal
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

Galaxy shapes are not randomly oriented; rather, they are statistically aligned in a way that can depend on formation environment, history and galaxy type. Studying the alignment of galaxies can therefore deliver important information about the physics of galaxy formation and evolution as well as the growth of structure in the Universe. In this review paper we summarise key measurements of galaxy alignments, divided by galaxy type, scale and environment. We also cover the statistics and formalism necessary to understand the observations in the literature. With the emergence of weak gravitational lensing as a precision probe of cosmology, galaxy alignments have taken on an added importance because they can mimic cosmic shear, the effect of gravitational lensing by large-scale structure on observed galaxy shapes. This makes galaxy alignments, commonly referred to as intrinsic alignments, an important systematic effect in weak lensing studies. We quantify the impact of intrinsic alignments on cosmic shear surveys and finish by reviewing practical mitigation techniques which attempt to remove contamination by intrinsic alignments. Comment: 52 pages excl. references, 16 figures; minor changes to match version published in Space Science Reviews; part of a topical volume on galaxy alignments, with companion papers arXiv:1504.05456 and arXiv:1504.0554

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Crossref

Leiden University Scholarly Publications

The University of Manchester - Institutional Repository

FigShare

Statistical Methods For Genomic And Transcriptomic Sequencing

Author: Jiang Yuchao
Publication venue: ScholarlyCommons
Publication date: 01/01/2017

Part 1: High-throughput sequencing of DNA coding regions has become a common way of assaying genomic variation in the study of human diseases. Copy number variation (CNV) is an important type of genomic variation, but CNV profiling from whole-exome sequencing (WES) is challenging due to the high level of biases and artifacts. We propose CODEX, a normalization and CNV calling procedure for WES data. CODEX includes a Poisson latent factor model, which includes terms that specifically remove biases due to GC content, exon capture and amplification efficiency, and latent systemic artifacts. CODEX also includes a Poisson likelihood-based segmentation procedure that explicitly models the count-based WES data. CODEX is compared to existing methods on germline CNV detection in HapMap samples using microarray-based gold standard and is further evaluated on 222 neuroblastoma samples with matched normal, with focus on somatic CNVs within the ATRX gene. Part 2: Cancer is a disease driven by evolutionary selection on somatic genetic and epigenetic alterations. We propose Canopy, a method for inferring the evolutionary phylogeny of a tumor using both somatic copy number alterations and single nucleotide alterations from one or more samples derived from a single patient. Canopy is applied to bulk sequencing datasets of both longitudinal and spatial experimental designs and to a transplantable metastasis model derived from human cancer cell line MDA-MB-231. Canopy successfully identifies cell populations and infers phylogenies that are in concordance with existing knowledge and ground truth. Through simulations, we explore the effects of key parameters on deconvolution accuracy, and compare against existing methods. Part 3: Allele-specific expression is traditionally studied by bulk RNA sequencing, which measures average expression across cells. 
Single-cell RNA sequencing (scRNA-seq) allows the comparison of expression distribution between the two alleles of a diploid organism and thus the characterization of allele-specific bursting. We propose SCALE to analyze genome-wide allele-specific bursting, with adjustment of technical variability. SCALE detects genes exhibiting allelic differences in bursting parameters, and genes whose alleles burst non-independently. We apply SCALE to mouse blastocyst and human fibroblast cells and find that, globally, cis control in gene expression overwhelmingly manifests as differences in burst frequency

ScholarlyCommons@Penn

Detecting Selection on Noncoding Nucleotide Variation: Methods and Applications

Author: Ding Yang
Publication venue: ScholarlyCommons
Publication date: 01/01/2015

There has been a long tradition in molecular evolution of studying selective pressures operating at the amino-acid level. But protein-coding variation is not the only level on which molecular adaptations occur, and it is not clear what role noncoding variation has played in evolutionary history, since it has not yet been systematically explored. In this dissertation I systematically explore several aspects of selective pressures on noncoding nucleotide variation. The first project (Chapter 2) describes research on the determinants of eukaryotic translation dynamics, which include selection on noncoding aspects of DNA variation. Deep sequencing of ribosome-protected mRNA fragments and polysome gradients in various eukaryotic organisms has revealed an intriguing pattern: shorter mRNAs tend to have a greater overall density of ribosomes than longer mRNAs. There is debate about the cause of this trend. To resolve this open question, I systematically analysed 5’ mRNA structure and codon usage patterns in short versus long genes across 100 sequenced eukaryotic genomes. My results showed that, compared with longer ones, short genes initiate faster and also elongate faster. Thus the higher ribosome density in short eukaryotic genes cannot be explained by translation elongation. Rather, it is the translation initiation rate that sets the pace for eukaryotic protein translation. This work was followed by modelling studies of translation dynamics in a yeast cell. Chapter 3 concerns detecting selective pressures on viral RNA structures. Most previous research on RNA viruses has focused on identifying amino-acid residues under positive or purifying selection, whereas selection on RNA structures has received less attention. 
I developed algorithms to scan along the viral genome and identify regions that exhibit signals of purifying or diversifying selection on RNA structure, by comparing the structural distances between actual viral RNA sequences against an appropriate null distribution. Unlike other algorithms that identify structural constraints, my approach accounts for the phylogenetic relationships among viral sequences, as well as the observed variation in amino-acid sequences. Applying it to influenza viruses, I found that a significant portion of influenza viral genomes has experienced purifying selection for RNA structure, in both the positive- and negative-sense RNA forms, over the past few decades; and I found the first evidence of positive selection on RNA structure in specific regions of these viral genomes. Overall, the projects presented in these chapters represent a systematic look at several novel aspects of selection on noncoding nucleotide variation. These projects should open up new directions in studying the molecular signatures of natural selection, including studies on interactions between different layers at which selection may operate simultaneously (e.g. RNA structure and protein sequence)

ScholarlyCommons@Penn

Proceedings of Abstracts Engineering and Computer Science Research Conference 2019

Author: Adams Roderick
Amafabia Daerefa-a
Barker Trevor
Beka Nathan
Bhavsar Ronakben
Bonivart Agnes
Canoville Paul
Cañamero Lola
CHEN Yong Kang
Chrysanthou Andreas
Counsell Nathan
Crook Brian
Davey Neil
David-West Opukuro
Denai Mouloud
Dhakal Hom
Drix Damien
Goncharenko Julia
Grasso Marzio
Hafner Verena Vanessa
Hall Samantha
Haritos George
Hassan Eheda
Helian Na
Herfatmanesh Mohammad Reza
Ismail Sikiru O.
Johnston Ian
Kadir Shabnam
Kaye Richard
Khan Imran
Kirner Raimund
Klaholz Ingo
Klusak Jan
Kourtessis Pandelis
Lane Peter
Lekkala Himayasri Rao
Lilley Mariana
Mayor David
McCluskey Daniel
Menon Catherine
Metzner Christoph
Miko Rebecca
Montalvão Diogo
Mporas Iosif
Munro Ian
Nehaniv Chrystopher
Newman James
Nwawe Richard
Panday Deepak
Partou Helen
Pissanidis Georgios
Polani Daniel
Ren Guogang
Robinson Matthew
Rosiello Vincenzo
Sayers Paul
Schilstra Maria
Schirmer Pascal
Schmuker Michael
Siadati Rana
Sinha Ankur
Skaltsas Grigorios
Steffert Tony
Steuber Volker
Suckow Bjorn
Sun Yi
Sun Yichuang
Sunmola Funlade
Sutton Samuel
te Boekhorst Rene
Toffe Gilles
Tracey Mark
Tveretina Olga
Veneziano Vito
Verma Alok
Wang Yuan
Wernick Paul
Publication venue: University of Hertfordshire
Publication date: 01/09/2019

© 2019 The Author(s). This is an open-access work distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited. For further details please see https://creativecommons.org/licenses/by/4.0/. Note: Keynote: Fluorescence visualisation to evaluate effectiveness of personal protective equipment for infection control is © 2019 Crown copyright and so is licensed under the Open Government Licence v3.0. Under this licence users are permitted to copy, publish, distribute and transmit the Information; adapt the Information; exploit the Information commercially and non-commercially, for example, by combining it with other Information, or by including it in your own product or application. Where you do any of the above you must acknowledge the source of the Information in your product or application by including or linking to any attribution statement specified by the Information Provider(s) and, where possible, provide a link to this licence: http://www.nationalarchives.gov.uk/doc/open-government-licence/version/3/ This book is the record of abstracts submitted and accepted for presentation at the Inaugural Engineering and Computer Science Research Conference held 17th April 2019 at the University of Hertfordshire, Hatfield, UK. This conference is a local event aiming at bringing together the research students, staff and eminent external guests to celebrate Engineering and Computer Science Research at the University of Hertfordshire. The ECS Research Conference aims to showcase the broad landscape of research taking place in the School of Engineering and Computer Science. The 2019 conference was articulated around three topical cross-disciplinary themes: Make and Preserve the Future; Connect the People and Cities; and Protect and Care

University of Hertfordshire Research Archive

Methods for Epigenetic Analyses from Long-Read Sequencing Data

Author: Snajder Rene Helmut
Publication date: 01/01/2023

Epigenetics, particularly the study of DNA methylation, is a cornerstone field for our understanding of human development and disease. DNA methylation has been included in the "hallmarks of cancer" due to its important function as a biomarker and its contribution to carcinogenesis and cancer cell plasticity. Long-read sequencing technologies, such as the Oxford Nanopore Technologies platform, have advanced the study of structural variations, while at the same time allowing direct measurement of DNA methylation on the same reads. With this, new avenues of analysis have opened up, such as long-range allele-specific methylation analysis, methylation analysis on structural variations, or relating nearby epigenetic modalities on the same read to one another. Basecalling and methylation calling of Nanopore reads are computationally expensive tasks which require complex machine learning architectures. Read-level methylation calls require different approaches to data management and analysis than those developed for methylation frequencies measured from short-read technologies or array data. Because of their 2-dimensional nature, read- and genome-associated DNA methylation calls, including methylation caller uncertainties, are much more costly to store than 1-dimensional methylation frequencies. Methods for storage, retrieval, and analysis of such data therefore require careful consideration. Downstream analysis tasks, such as methylation segmentation or differential methylation calling, have the potential to benefit from read information and to allow uncertainty propagation. These avenues had not been considered in existing tools. In my work, I explored the potential of long-read DNA methylation analysis and tackled some of the challenges of data management and downstream analysis using state-of-the-art software architecture and machine learning methods. 
I defined a storage standard for reference-anchored and read-assigned DNA methylation calls, including methylation calling uncertainties and read annotations such as haplotype or sample information. This storage container is defined as a schema for the Hierarchical Data Format version 5 (HDF5), includes an index for rapid access to genomic coordinates, and is optimized for parallel computing with even load balancing. It further includes a Python API for creation, modification, and data access, including convenience functions for the extraction of important quality statistics via a command line interface. Furthermore, I developed software solutions for the segmentation and differential methylation testing of DNA methylation calls from Nanopore sequencing. This implementation takes advantage of the performance benefits provided by my high-performance storage container. It includes a Bayesian methylome segmentation algorithm which allows for the consensus instance segmentation of multiple sample- and/or haplotype-assigned DNA methylation profiles, while considering methylation calling uncertainties. Based on this segmentation, the software can then perform differential methylation testing and provides a large number of options for statistical testing and multiple testing correction. I benchmarked all tools on both simulated and publicly available real data, and showed the performance benefits compared to previously existing and concurrently developed solutions. Next, I applied the methods to a cancer study on a chromothriptic cancer sample from a patient with Sonic Hedgehog Medulloblastoma. Here I report regulatory genomic regions differentially methylated before and after treatment, allele-specific methylation in the tumor, as well as methylation on chromothriptic structures. Finally, I developed specialized methylation callers for the combined DNA methylation profiling of CpG, GpC, and context-free adenine methylation. 
These callers can be used to measure chromatin accessibility in a NOMe-seq like setup, showing the potential of long-read sequencing for the profiling of transcription factor co-binding. In conclusion, this thesis presents and subsequently benchmarks new algorithmic and infrastructural solutions for the analysis of DNA methylation data from long-read sequencing

Heidelberger Dokumentenserver