725 research outputs found

    Elucidating the genetics of craniofacial shape

    Alterations in craniofacial size and shape are apparent in many monogenic diseases and syndromes, but remarkably little is known about the genetics of face shape within healthy populations. This may be set to change following publication of a study that combines unsupervised hierarchical spectral clustering and canonical correlation analysis to help identify common genetic variants associated with craniofacial shape.

    MGMR: leveraging RNA-Seq population data to optimize expression estimation

    Background: RNA-Seq is a technique that uses next-generation sequencing to identify transcripts and estimate transcription levels. When applying this technique for quantification, one must contend with reads that align to multiple positions in the genome (multireads). Previous efforts to resolve multireads have shown that RNA-Seq expression estimation can be improved by probabilistically allocating reads to genes. These methods use a probabilistic generative model of the data and resolve ambiguity with likelihood-based approaches. In many instances, RNA-Seq experiments are performed in the context of a population. The generative models of current methods do not take such population information into account, and it is an open question whether this information can improve quantification of the individual samples. Results: To explore the contribution of population-level information to RNA-Seq quantification, we apply a hierarchical probabilistic generative model, which assumes that the expression levels of different individuals are sampled from a Dirichlet distribution with parameters specific to the population, and that reads are sampled from the distribution of expression levels. We introduce an optimization procedure for estimating the model parameters, and use HapMap data and simulated data to demonstrate that the model yields a significant improvement in the accuracy of expression levels of paralogous genes. Conclusions: We provide a proof of principle of the benefit of drawing on population commonalities to estimate expression. The results of our experiments demonstrate that this approach can be beneficial, primarily for estimation at the gene level.
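    The hierarchical generative model described in this abstract can be illustrated with a minimal simulation. This is only a sketch under stated assumptions, not the MGMR implementation: the function names, the example `alpha` values, and the read counts are hypothetical, and the abstract's optimization procedure for parameter estimation is not reproduced here.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one expression profile from Dirichlet(alpha) via normalized gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_reads(expression, n_reads, rng):
    """Assign n_reads reads to genes in proportion to the expression profile."""
    counts = [0] * len(expression)
    for gene in rng.choices(range(len(expression)), weights=expression, k=n_reads):
        counts[gene] += 1
    return counts


def simulate_population(alpha, n_individuals, n_reads, seed=0):
    """Simulate (expression profile, read counts) pairs for a population.

    alpha holds the population-specific Dirichlet parameters (hypothetical values);
    each individual's profile is drawn from it, then reads are drawn from the profile.
    """
    rng = random.Random(seed)
    population = []
    for _ in range(n_individuals):
        theta = sample_dirichlet(alpha, rng)
        population.append((theta, sample_reads(theta, n_reads, rng)))
    return population


# Hypothetical example: three genes, five individuals, 1000 reads each.
population = simulate_population(alpha=[5.0, 2.0, 1.0], n_individuals=5, n_reads=1000)
```

    Larger `alpha` values pull individual profiles closer to the population mean, which is the kind of shared information the abstract proposes to exploit.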

    Iron Age and Anglo-Saxon genomes from East England reveal British migration history

    British population history has been shaped by a series of immigrations, including the early Anglo-Saxon migrations after 400 CE. It remains an open question how these events affected the genetic composition of the current British population. Here, we present whole-genome sequences from 10 individuals excavated close to Cambridge in the East of England, ranging from the late Iron Age to the middle Anglo-Saxon period. By analysing shared rare variants with hundreds of modern samples from Britain and Europe, we estimate that on average the contemporary East English population derives 38% of its ancestry from Anglo-Saxon migrations. We gain further insight with a new method, rarecoal, which infers population history and identifies fine-scale genetic ancestry from rare variants. Using rarecoal we find that the Anglo-Saxon samples are closely related to modern Dutch and Danish populations, while the Iron Age samples share ancestors with multiple Northern European populations including Britain.

    Compression of Structured High-Throughput Sequencing Data

    Large biological datasets are being produced at a rapid pace and create substantial storage challenges, particularly in the domain of high-throughput sequencing (HTS). Most approaches currently used to store HTS data are either unable to quickly adapt to the requirements of new sequencing or analysis methods (because they do not support schema evolution), or fail to provide state-of-the-art compression of the datasets. We have devised new approaches to store HTS data that support seamless data schema evolution and compress datasets substantially better than existing approaches. Building on these new approaches, we discuss and demonstrate how a multi-tier data organization can dramatically reduce the storage, computational and network burden of collecting, analyzing, and archiving large sequencing datasets. For instance, we show that spliced RNA-Seq alignments can be stored in less than 4% the size of a BAM file with perfect data fidelity. Compared to the previous compression state of the art, these methods reduce dataset size more than 40% when storing exome, gene expression or DNA methylation datasets. The approaches have been integrated in a comprehensive suite of software tools (http://goby.campagnelab.org) that support common analyses for a range of high-throughput sequencing assays. Funding: National Center for Research Resources (U.S.) (Grant UL1 RR024996); Leukemia & Lymphoma Society of America (Translational Research Program Grant LLS 6304-11); National Institute of Mental Health (U.S.) (R01 MH086883)

    Inference of population splits and mixtures from genome-wide allele frequency data

    Many aspects of the historical relationships between populations in a species are reflected in genetic data. Inferring these relationships from genetic data, however, remains a challenging task. In this paper, we present a statistical model for inferring the patterns of population splits and mixtures in multiple populations. In this model, the sampled populations in a species are related to their common ancestor through a graph of ancestral populations. Using genome-wide allele frequency data and a Gaussian approximation to genetic drift, we infer the structure of this graph. We applied this method to a set of 55 human populations and a set of 82 dog breeds and wild canids. In both species, we show that a simple bifurcating tree does not fully describe the data; in contrast, we infer many migration events. While some of the migration events that we find have been detected previously, many have not. For example, in the human data we infer that Cambodians trace approximately 16% of their ancestry to a population ancestral to other extant East Asian populations. In the dog data, we infer that both the boxer and basenji trace a considerable fraction of their ancestry (9% and 25%, respectively) to wolves subsequent to domestication, and that East Asian toy breeds (the Shih Tzu and the Pekingese) result from admixture between modern toy breeds and "ancient" Asian breeds. Software implementing the model described here, called TreeMix, is available at http://treemix.googlecode.com. Comment: 28 pages, 6 figures in main text. Attached supplement is 22 pages, 15 figures. This is an updated version of the preprint available at http://precedings.nature.com/documents/6956/version/

    Localizing triplet periodicity in DNA and cDNA sequences

    Background: The protein-coding regions (coding exons) of a DNA sequence exhibit a triplet periodicity (TP) due to the fact that coding exons contain a series of three-nucleotide codons that encode specific amino acid residues. Such periodicity is usually not observed in introns and intergenic regions. If a DNA sequence is divided into small segments and a Fourier Transform is applied to each segment, a strong peak at frequency 1/3 is typically observed in the Fourier spectrum of coding segments, but not in non-coding regions. This property has been used to identify the locations of protein-coding genes in unannotated sequence. The method is fast and requires no training. However, the need to compute the Fourier Transform across a segment (window) of arbitrary size affects the accuracy with which one can localize TP boundaries. Here, we report a technique that provides higher-resolution identification of these boundaries, and use the technique to explore the biological correlates of TP regions in the genome of the model organism C. elegans. Results: Using both simulated TP signals and the real C. elegans sequence F56F11 as an example, we demonstrate that (1) the Modified Wavelet Transform (MWT) can better define the boundary of a TP region than the conventional Short Time Fourier Transform (STFT); (2) the scale parameter (a) of the MWT determines the precision of TP boundary localization: bigger values of a give sharper TP boundaries but result in a lower signal-to-noise ratio; (3) RNA splicing sites have weaker TP signals than coding regions; (4) TP signals in coding regions can be destroyed or recovered by frame-shift mutations; (5) 6 bp periodicities in introns and intergenic regions can generate false positive signals, and these can be removed with a 6 bp MWT. Conclusions: The MWT can provide more precise TP boundaries than the STFT, and the boundaries can be further refined by a larger-scale MWT. Subtraction of 6 bp periodicity signals reduces the number of false positives. Experimentally introduced frame-shift mutations help recover TP signals that have been lost through possible ancient frame-shifts. More importantly, the TP signal has the potential to be used to detect splice junctions in fully spliced mRNA sequence.
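    The Fourier peak at frequency 1/3 mentioned in this abstract can be sketched in a few lines. This illustrates only the conventional windowed Fourier approach, not the authors' Modified Wavelet Transform; the function names, the window size, and the step are assumptions chosen for illustration.

```python
import cmath


def period3_power(seq):
    """Spectral power at frequency 1/3, summed over the four base indicator sequences.

    For each base, the indicator sequence is 1 where that base occurs and 0 elsewhere;
    its Fourier component at frequency 1/3 is computed directly, and the squared
    magnitudes are summed and divided by the segment length.
    """
    n = len(seq)
    if n == 0:
        return 0.0
    total = 0.0
    for base in "ACGT":
        component = sum(
            cmath.exp(-2j * cmath.pi * k / 3)
            for k, ch in enumerate(seq)
            if ch == base
        )
        total += abs(component) ** 2
    return total / n


def sliding_period3(seq, window=99, step=3):
    """Return (start, power) pairs for windows slid along the sequence."""
    return [
        (start, period3_power(seq[start:start + window]))
        for start in range(0, len(seq) - window + 1, step)
    ]


# A perfectly periodic toy sequence versus a homopolymer of the same length.
periodic_power = period3_power("ATG" * 33)
flat_power = period3_power("A" * 99)
```

    A fixed window like this one trades resolution for stability, which is the boundary-localization limitation the abstract sets out to address.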

    Population structure and genetic history of Tibetan Terriers

    Background: The Tibetan Terrier is a popular medium-sized companion dog breed. According to the history of the breed, the western population of Tibetan Terriers includes two lineages, Lamleh and Luneville. These two lineages derive from a small number of founder animals from the native Tibetan Terrier population, which were brought to Europe in the 1920s. For almost a century, the western population of Tibetan Terriers and the native population in Tibet were reproductively isolated. In this study, we analysed the structure of the western population of Tibetan Terriers, the original native population from Tibet, and different crosses between these two populations. We also examined the genetic relationships of Tibetan Terriers with other dog breeds, especially terriers and some Asian breeds, and the within-breed structure of both Tibetan Terrier populations. Results: Our analyses were based on high-density single nucleotide polymorphism (SNP) array (Illumina HD Canine 170 K) and microsatellite (18 loci) genotypes of 64 Tibetan Terriers belonging to different populations and lineages. For the comparative analysis, we used 348 publicly available SNP array genotypes of dogs from other breeds. We found that the western population of Tibetan Terriers and the native Tibetan Terriers clustered together with other Asian dog breeds, whereas all other terrier breeds formed a separate group. We were also able to differentiate the western Tibetan Terrier lineages (Lamleh and Luneville) from the native Tibetan Terrier population. Conclusions: Our results reveal the relationships between the western and native populations of Tibetan Terriers and support the hypothesis that the Tibetan Terrier belongs to the group of ancient dog breeds of Asian origin, which are close to the ancestors of the modern dog that were involved in the early domestication process. Thus, we were able to reject the initial hypothesis that Tibetan Terriers belong to the group of terrier breeds. The existence of this native population of Tibetan Terriers at its original location represents an exceptional and valuable genetic resource.

    Profiling allele-specific gene expression in brains from individuals with autism spectrum disorder reveals preferential minor allele usage.

    One fundamental but understudied mechanism of gene regulation in disease is allele-specific expression (ASE), the preferential expression of one allele. We leveraged RNA-sequencing data from human brain to assess ASE in autism spectrum disorder (ASD). When ASE is observed in ASD, the allele with lower population frequency (minor allele) is preferentially more highly expressed than the major allele, opposite to the canonical pattern. Importantly, genes showing ASE in ASD are enriched in those downregulated in ASD postmortem brains and in genes harboring de novo mutations in ASD. Two regions, 14q32 and 15q11, containing all known orphan C/D box small nucleolar RNAs (snoRNAs), are particularly enriched in shifts to higher minor allele expression. We demonstrate that this allele shifting enhances snoRNA-targeted splicing changes in ASD-related target genes in idiopathic ASD and 15q11-q13 duplication syndrome. Together, these results implicate allelic imbalance and dysregulation of orphan C/D box snoRNAs in ASD pathogenesis.
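    Allele-specific expression at a heterozygous site is often illustrated by comparing read counts for the two alleles against an even 50/50 split. The sketch below uses an exact two-sided binomial test for that purpose; this is an assumed, generic approach and not necessarily the statistical method of the study, and the function names and read counts are hypothetical.

```python
from math import comb


def minor_allele_fraction(minor_reads, major_reads):
    """Fraction of reads at a heterozygous site that carry the minor allele."""
    return minor_reads / (minor_reads + major_reads)


def binomial_two_sided_p(k, n, p=0.5):
    """Exact two-sided binomial p-value for k successes in n trials.

    Sums the probabilities of all outcomes that are no more likely than the
    observed one under the null hypothesis of balanced expression (p = 0.5).
    """
    probs = [comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(n + 1)]
    observed = probs[k]
    # Small relative tolerance so ties are not lost to floating-point rounding.
    return min(1.0, sum(q for q in probs if q <= observed * (1 + 1e-9)))


# Hypothetical site: 42 reads carry the minor allele, 18 carry the major allele.
fraction = minor_allele_fraction(42, 18)
p_value = binomial_two_sided_p(42, 60)
```

    A fraction above 0.5 together with a small p-value would mark a shift toward the minor allele at that site, the direction of imbalance the abstract reports.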