168 research outputs found

    Haplotype Threading Using the Positional Burrows-Wheeler Transform

    In the classic model of population genetics, one haplotype (the query) is treated as a mosaic of segments copied from a number of haplotypes in a panel; this is also described as threading the haplotype through the panel. The Li and Stephens model parameterizes this problem using a hidden Markov model (HMM). However, HMM algorithms are linear in the sample size and can be very expensive for biobank-scale panels. Here, we formulate the haplotype threading problem as the Minimal Positional Substring Cover problem, where a query is represented by a mosaic of a minimal number of substring matches from the panel. We show that this problem can be solved by a sequential set of greedy set-maximal matches. Moreover, the solution space can be bounded by the left-most and right-most solutions found by the greedy approach. Based on these results, we formulate and solve several variations of this problem. Although our results are yet to be generalized to cases with mismatches, they offer a theoretical framework for designing methods for genotype imputation and haplotype phasing.
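
    The greedy idea in this abstract can be illustrated with a deliberately naive sketch. It assumes exact matches only, haplotypes given as equal-length strings, and a brute-force scan of the panel at each step; the function name and the tuple layout are hypothetical, and this is not the PBWT-based method of the paper.

```python
def greedy_positional_cover(query, panel):
    """Cover `query` with as few positional exact matches from `panel` as possible.

    Returns a list of (start, end, hap_index) segments with `end` exclusive,
    or None if some site of the query matches no panel haplotype.
    Naive sketch: each step scans every haplotype from the current site.
    """
    m = len(query)
    segments = []
    start = 0
    while start < m:
        best_end, best_hap = start, None
        for h, hap in enumerate(panel):
            end = start
            # Extend the match at the same positions in query and haplotype.
            while end < m and hap[end] == query[end]:
                end += 1
            if end > best_end:
                best_end, best_hap = end, h
        if best_hap is None:
            return None  # no haplotype carries the query allele at this site
        segments.append((start, best_end, best_hap))
        start = best_end  # greedy jump to the end of the longest match
    return segments
```

    Because any sub-interval of a positional match is itself a match, always taking the longest match from the current site never uses more segments than any other cover, which is why the greedy jump is safe in this exact-match setting.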

    Detecting transcription of ribosomal protein pseudogenes in diverse human tissues from RNA-seq data

    Background: Ribosomal proteins (RPs) have about 2000 pseudogenes in the human genome. While anecdotal reports of RP pseudogene transcription exist, it is unclear to what extent these pseudogenes are transcribed. RP pseudogene transcription is difficult to identify in microarrays because of potential cross-hybridization between transcripts from the parent genes and pseudogenes. More recently, transcriptome sequencing (RNA-seq) has provided an opportunity to ascertain the transcription of pseudogenes. A challenge for pseudogene expression discovery in RNA-seq data is the difficulty of uniquely identifying reads mapped to pseudogene regions, which are typically also similar to the parent genes.
    Results: Here we developed a specialized pipeline for pseudogene transcription discovery. We first construct a composite genome that includes the entire human genome sequence as well as mRNA sequences of real ribosomal protein genes. We then map all sequence reads to the composite genome and retain only exact matches. Moreover, we restrict our analysis to strictly defined mappable regions and calculate RPKM values as a measure of pseudogene transcription levels. We report evidence for the transcription of RP pseudogenes in 16 human tissues. By analyzing the Human Body Map 2.0 study RNA-sequencing data using our pipeline, we found that one RP pseudogene (PGOHUM-249508) is transcribed with RPKM 170 in thyroid. Moreover, three other RP pseudogenes are transcribed with RPKM > 10, a level similar to that of the normal RP genes, in white blood cells, kidney, and testes, respectively. Furthermore, an additional thirteen RP pseudogenes have RPKM > 5, corresponding to the 20th to 30th percentile among all genes. Unlike ribosomal protein genes, which are constitutively expressed in almost all tissues, RP pseudogenes are differentially expressed, suggesting that they may contribute to tissue-specific biological processes.
    Conclusions: Using a specialized bioinformatics method, we identified the transcription of ribosomal protein pseudogenes in human tissues using RNA-seq data.
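
    For readers unfamiliar with the RPKM measure used above, the following is a minimal sketch of the standard definition (reads per kilobase of region per million mapped reads). The function and parameter names are hypothetical, and passing the length of the mappable part of a pseudogene as the region length is an assumption about how the restriction to mappable regions would enter; the numbers in the usage note are illustrative, not taken from the study.

```python
def rpkm(region_reads, region_length_bp, total_mapped_reads):
    """Reads Per Kilobase of region per Million mapped reads.

    region_reads       -- reads assigned to the region (e.g. exact matches only)
    region_length_bp   -- region length in base pairs (assumed: mappable length)
    total_mapped_reads -- total mapped reads in the sample
    """
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total read count must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return 1e9 * region_reads / (region_length_bp * total_mapped_reads)
```

    As an illustration, 50 reads on a 500 bp region in a sample of 20 million mapped reads would give `rpkm(50, 500, 20_000_000)`, i.e. an RPKM of 5.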

    Efficient Haplotype Block Matching in Bi-Directional PBWT

    Efficient haplotype matching search is of great interest as large genotyped cohorts become available. The Positional Burrows-Wheeler Transform (PBWT) enables efficient searching for blocks of haplotype matches. However, existing efficient PBWT algorithms sweep across the haplotype panel from left to right, capturing all exact matches. As a result, PBWT does not account for mismatches. It is also not easy to investigate the patterns of changes between the matching blocks. Here, we present an extension to PBWT, called bi-directional PBWT, that makes the information about the blocks of matches available on both sides of each site. We also present a set of algorithms to efficiently merge the matching blocks or examine the patterns of changes on both sides of each site. The time complexity of the algorithms to find and merge matching blocks using bi-directional PBWT is linear in the input size. Using real data from the UK Biobank, we demonstrate the run time and memory efficiency of our algorithms. More importantly, our algorithms can identify more blocks by enabling tolerance of mismatches. Moreover, by using mutual information (MI) between the forward and the reverse PBWT matching block sets as a measure of haplotype consistency, we found that the MI derived from European samples in the 1000 Genomes Project is highly correlated (Spearman correlation r=0.87) with the deCODE recombination map.
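
    The left-to-right sweep mentioned in this abstract can be sketched as follows. This is the standard forward PBWT update of the positional prefix array and divergence array for biallelic (0/1) data, not the bi-directional extension the abstract proposes; the function names and the list-of-lists panel layout are assumptions made for illustration.

```python
def pbwt_step(column, order, div, k):
    """Advance the PBWT ordering across site k.

    column -- alleles (0/1) at site k, indexed by haplotype id
    order  -- haplotype ids sorted by their reversed prefixes over sites [0, k)
    div    -- div[i] is the site where the match between order[i] and
              order[i-1] (ending just before site k) begins
    Returns the ordering and divergence array for sites [0, k].
    """
    p = q = k + 1  # match start for the next 0-allele / 1-allele haplotype
    zeros, ones, div_zeros, div_ones = [], [], [], []
    for hap, start in zip(order, div):
        p = max(p, start)
        q = max(q, start)
        if column[hap] == 0:
            zeros.append(hap)
            div_zeros.append(p)
            p = 0
        else:
            ones.append(hap)
            div_ones.append(q)
            q = 0
    # Stable partition: all 0-allele haplotypes first, then 1-allele ones.
    return zeros + ones, div_zeros + div_ones


def pbwt_sweep(haps):
    """Run the forward sweep over every site of a panel (list of 0/1 lists)."""
    n = len(haps)
    order, div = list(range(n)), [0] * n
    for k in range(len(haps[0])):
        column = [h[k] for h in haps]
        order, div = pbwt_step(column, order, div, k)
    return order, div
```

    Each step touches every haplotype once, so the whole sweep is linear in the size of the panel; a reverse sweep would apply the same update to the sites in right-to-left order.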

    Potential of tropical maize populations for improving an elite maize hybrid

    Identifying exotic maize (Zea mays L.) populations possessing favorable new alleles lacking in local elite hybrids is an important strategy for improving maize hybrids. Selection of an appropriate breeding method will increase the chance of successfully transferring these favorable new alleles into elite inbred lines of local hybrids. The objectives of this study were to: (i) evaluate 14 maize populations from CIMMYT and identify those containing favorable alleles for grain yield, ear length, ear diameter, kernel length, plant height, and ear height that are lacking in a local super hybrid [Jidan261 (W9706 × Ji853)], and to (ii) determine which inbred parent should be improved. These results showed that the populations Pob43, Pob501, and La Posta had positive and significant numbers of favorable alleles not found in hybrid W9706 × Ji853 that could be used for simultaneous improvement of its grain yield, ear length, and kernel length, and that population QPM-Y was also a good donor for improvement of ear diameter and kernel length in the hybrid. Based on allele frequencies in the two inbred lines and the donor population, when the populations Pob43, La Posta, Pob501, and QPM-Y were used as donors, inbred line W9706 would be improved by selfing the F1 of the cross W9706 × donor population. These results suggested that CIMMYT germplasm has potential to improve temperate elite hybrids. The relationship between GCA and SCA from a previous study and the parameters obtained from the Dudley method are discussed. The results showed that the values of Lplμ’ estimates obtained by applying the Dudley method had the same trend as GCA effects for grain yield but a less clear trend for ear length, while the trends in the relationship value were reversed for SCA between these populations and Lancaster-derived lines.

    A hidden Markov model for haplotype inference for present-absent data of clustered genes using identified haplotypes and haplotype patterns

    The majority of killer cell immunoglobulin-like receptor (KIR) genes are detected as either present or absent using locus-specific genotyping technology. Ambiguity arises from the presence of a specific KIR gene, since the exact copy number (one or two) of that gene is unknown. Haplotype inference for these genes is therefore more challenging because of this large portion of missing information. Meanwhile, many haplotypes and partial haplotype patterns have been previously identified owing to tight linkage disequilibrium (LD) among these clustered genes, and can thus be incorporated to facilitate haplotype inference. In this paper, we developed a hidden Markov model (HMM) based method that can incorporate identified haplotypes or partial haplotype patterns for haplotype inference from present-absent data of clustered genes (e.g., KIR genes). Through extensive simulations for KIR genes, we compared its performance with a previously developed expectation maximization (EM) based method in terms of haplotype assignments and haplotype frequency estimation. The simulation results showed that the new HMM based method outperformed the previous method when some incorrect haplotypes were included as identified haplotypes and/or the standard deviation of haplotype frequencies was small. We also compared the performance of our method with two methods that do not use previously identified haplotypes and haplotype patterns: an EM based method, HPALORE, and an HMM based method, MaCH. Our simulation results showed that the incorporation of identified haplotypes and partial haplotype patterns can improve accuracy for haplotype inference. The new software package HaploHMM is available and can be downloaded at http://www.soph.uab.edu/ssg/files/People/KZhang/HaploHMM/haplohmm-index.html

    Model selection and structure specification in ultra-high dimensional generalised semi-varying coefficient models

    In this paper, we study model selection and structure specification for generalised semi-varying coefficient models (GSVCMs), where the number of potential covariates is allowed to be larger than the sample size. We first propose a penalised likelihood method with the LASSO penalty function to obtain preliminary estimates of the functional coefficients. Then, using the quadratic approximation for the local log-likelihood function and the adaptive group LASSO penalty (or the local linear approximation of the group SCAD penalty) with the help of the preliminary estimates of the functional coefficients, we introduce a novel penalised weighted least squares procedure to select the significant covariates and identify the constant coefficients among the coefficients of the selected covariates, which thus specifies the semiparametric modelling structure. The developed model selection and structure specification approach not only inherits many nice statistical properties from local maximum likelihood estimation and the nonconcave penalised likelihood method, but is also computationally attractive thanks to the computational algorithm proposed to implement our method. Under some mild conditions, we establish the asymptotic properties of the proposed model selection and estimation procedure, such as sparsity and the oracle property. We also conduct simulation studies to examine the finite sample performance of the proposed method, and finally apply the method to analyse a real data set, which leads to some interesting findings.

    Nonparametric Homogeneity Pursuit in Functional-Coefficient Models

    This paper explores homogeneity of coefficient functions in nonlinear models with functional coefficients and identifies the underlying semiparametric modelling structure. With initial kernel estimates, we combine the classic hierarchical clustering method with a generalised version of the information criterion to estimate the number of clusters, each of which has a common functional coefficient, and determine the membership of each cluster. To identify a possible semi-varying coefficient modelling framework, we further introduce a penalised local least squares method to determine zero coefficients, non-zero constant coefficients and functional coefficients which vary with an index variable. Through the nonparametric kernel-based cluster analysis and the penalised approach, we can substantially reduce the number of unknown parametric and nonparametric components in the models, thereby achieving the aim of dimension reduction. Under some regularity conditions, we establish the asymptotic properties of the proposed methods, including the consistency of the homogeneity pursuit. Numerical studies, including Monte-Carlo experiments and two empirical applications, are given to demonstrate the finite-sample performance of our methods.

    Genetic characterization and linkage disequilibrium mapping of resistance to gray leaf spot in maize (Zea mays L.)

    Gray leaf spot (GLS), caused by Cercospora zeae-maydis, is an important foliar disease of maize (Zea mays L.) worldwide, resistance to which is controlled by multiple quantitative trait loci (QTL). To gain insights into the genetic architecture underlying the resistance to this disease, an association mapping population consisting of 161 inbred lines was evaluated for resistance to GLS in a plant pathology nursery at Shenyang in 2010 and 2011. Subsequently, a genome-wide association study using 41,101 single-nucleotide polymorphisms (SNPs) identified 51 SNPs significantly (P<0.001) associated with GLS resistance, which could be converted into 31 QTL. In addition, three candidate genes related to plant defense were identified, including nucleotide-binding-site/leucine-rich repeat and receptor-like kinase genes similar to those involved in basal defense. Two genic SNPs, PZE-103142893 and PZE-109119001, associated with GLS resistance in chromosome bins 3.07 and 9.07, can be used for marker-assisted selection (MAS) of GLS resistance. These results provide an important resource for developing molecular markers closely linked with the target trait, enhancing breeding efficiency.