39 research outputs found

    Group Additive Regression Models for Genomic Data Analysis

    One important problem in genomic research is to identify genomic features, such as gene expression data or DNA single nucleotide polymorphisms (SNPs), that are related to clinical phenotypes. Often these genomic data can be naturally divided into biologically meaningful groups, such as genes belonging to the same pathways or SNPs within genes. In this paper, we propose group additive regression models and a group gradient descent boosting procedure for identifying groups of genomic features that are related to clinical phenotypes. Our simulation results show that by dividing the variables into appropriate groups, we can better identify the group features that are related to the phenotypes. In addition, the prediction mean square errors are smaller than those of the component-wise boosting procedure. We demonstrate the application of the methods to pathway-based analysis of microarray gene expression data of breast cancer and to gene-based genetic association analysis of type 1 diabetes. Results from the analysis of two breast cancer data sets indicate that the pathways of Metalloendopeptidases (MMPs) and MMP inhibitors, as well as cell proliferation, cell growth and maintenance, are important to breast cancer relapse and survival. Results from the analysis of a set of nonsynonymous SNPs on chromosome 6 confirmed a few genes that are associated with type 1 diabetes.

    Variance adjusted weighted UniFrac: a powerful beta diversity measure for comparing communities based on phylogeny

    Background: Beta diversity, the assessment of differences between communities, is an important problem in ecological studies. Many statistical methods have been developed to quantify beta diversity; among them, UniFrac and weighted UniFrac (W-UniFrac) are widely used. W-UniFrac is a weighted sum of branch lengths in a phylogenetic tree of the sequences from the communities. However, W-UniFrac does not account for the variation of the weights under random sampling, which reduces its power to detect differences between communities. Results: We develop a new statistic, termed variance adjusted weighted UniFrac (VAW-UniFrac), to compare two communities based on the phylogenetic relationships of the individuals. VAW-UniFrac is used to test whether the two communities are different. To assess the power of VAW-UniFrac, we first ran a series of simulations, which revealed that it always outperforms W-UniFrac, and also outperforms UniFrac when the individuals are not uniformly distributed. Next, all three methods were applied to three large 16S rRNA sequence collections (human skin bacteria, mouse gut microbial communities, and microbial communities from hypersaline soil and sediments) and to a tropical forest census data set. Both the simulations and the applications to real data show that VAW-UniFrac can satisfactorily measure differences between communities, considering not only species composition but also abundance information. Conclusions: VAW-UniFrac can recover biological insights that cannot be revealed by other beta diversity measures, and it provides a novel alternative for comparing communities.
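    As an illustration of the "weighted sum of branch lengths" described above, the following is a minimal sketch of the raw (unnormalized) weighted UniFrac distance as it is commonly defined, not the variance-adjusted statistic proposed in the paper. The function name and the input layout (one tuple per branch holding its length and the number of descendant sequences from each community) are assumptions made for this example.

```python
def weighted_unifrac(branches, total_a, total_b):
    """Raw weighted UniFrac: sum over branches of
    length * |A_i / A_T - B_i / B_T|.

    branches: iterable of (length, count_a, count_b), where count_a and
              count_b are the numbers of sequences from communities A and B
              that descend from the branch (hypothetical input layout).
    total_a, total_b: total numbers of sequences in A and B.
    """
    if total_a <= 0 or total_b <= 0:
        raise ValueError("both communities must contain at least one sequence")
    return sum(
        length * abs(count_a / total_a - count_b / total_b)
        for length, count_a, count_b in branches
    )


# Two branches of length 1, each carrying sequences from only one community.
example = [(1.0, 2, 0), (1.0, 0, 2)]
print(weighted_unifrac(example, total_a=2, total_b=2))
```

    In this sketch, branches whose descendants are split evenly between the two communities contribute nothing, while branches unique to one community contribute their full length.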

    Iterative estimating equations: Linear convergence and asymptotic properties

    We propose an iterative estimating equations procedure for analysis of longitudinal data. We show that, under very mild conditions, the probability that the procedure converges at an exponential rate tends to one as the sample size increases to infinity. Furthermore, we show that the limiting estimator is consistent and asymptotically efficient, as expected. The method applies to semiparametric regression models with unspecified covariances among the observations. In the special case of linear models, the procedure reduces to iterative reweighted least squares. Finite sample performance of the procedure is studied by simulations, and compared with other methods. A numerical example from a medical study is considered to illustrate the application of the method. Comment: Published at http://dx.doi.org/10.1214/009053607000000208 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)

    Analysis of Similarity/Dissimilarity of DNA Sequences Based on Chaos Game Representation

    The Chaos Game is an algorithm for producing pictures of fractal structures. Considering that the four bases A, G, C, and T of DNA sequences can be divided into three classes according to their chemical structure, we propose different kinds of CGR-walk sequences. Based on the CGR coordinates of random sequences, we introduce some invariants for DNA primary sequences. As an application, we examine the similarity/dissimilarity among the first exons of the β-globin gene of different species. The results indicate that our method is efficient and captures more biological information.
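    For readers unfamiliar with the Chaos Game Representation (CGR), the sketch below shows the standard construction for a DNA sequence: start at the centre of the unit square and, for each base, move halfway toward that base's corner. It does not implement the CGR-walk sequences or invariants proposed in the paper, and the corner assignment and function name are assumptions chosen for this example.

```python
# Assumed corner layout for the unit square; other conventions exist.
CORNERS = {"A": (0.0, 0.0), "C": (0.0, 1.0), "G": (1.0, 1.0), "T": (1.0, 0.0)}


def cgr_points(sequence, corners=CORNERS):
    """Return the list of CGR coordinates for a DNA sequence.

    Each point is the midpoint between the previous point and the corner
    of the current base. Characters without a corner (e.g. N) are skipped.
    """
    x, y = 0.5, 0.5  # start at the centre of the square
    points = []
    for base in sequence.upper():
        if base not in corners:
            continue
        cx, cy = corners[base]
        x, y = (x + cx) / 2.0, (y + cy) / 2.0
        points.append((x, y))
    return points


print(cgr_points("AG"))
```

    Because each step halves the distance to a corner, the last few bases of a subsequence determine which sub-square a point falls in, which is what makes CGR coordinates usable as sequence descriptors.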

    DV-Curve Representation of Protein Sequences and Its Application

    Based on the detailed hydrophobic-hydrophilic (HP) model of amino acids, we propose the dual-vector curve (DV-curve) representation of protein sequences, which uses two vectors to represent each letter of a protein sequence. This graphical representation not only avoids degeneracy, but also remains easy to visualize regardless of sequence length, and reflects the length of the protein sequence. We then transform the 2D graphical representation into a numerical characterization that facilitates quantitative comparison of protein sequences. The utility of this approach is illustrated by two examples: one is a similarity/dissimilarity comparison among different ND6 protein sequences based on their DV-curve figures; the other is a phylogenetic analysis of coronaviruses based on their spike proteins.

    An Efficient Estimation of the Mean Residual Life Function with Length-Biased Right-Censored Data

    The mean residual life (MRL) function for a lifetime random variable T0 is one of the basic parameters of interest in survival analysis. In this paper, we propose a new estimator of the MRL function with length-biased right-censored data and evaluate its performance through a small Monte Carlo simulation study. The simulation results show that the proposed estimator outperforms the existing one referred to in the Data and Model Setup section in terms of Monte Carlo bias and mean square error, especially when censoring is heavy. We also show that the proposed estimator converges in distribution under some conditions.
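    To make the quantity concrete: the MRL function at time t is the expected remaining lifetime given survival beyond t, E[T0 - t | T0 > t]. The sketch below computes the naive empirical version from a complete sample, assuming no censoring and no length bias. It is therefore not the estimator proposed in the paper, and the function name is hypothetical.

```python
def empirical_mrl(lifetimes, t):
    """Empirical mean residual life at time t for fully observed lifetimes.

    Averages (x - t) over the observations that exceed t. Assumes the
    sample is complete (no censoring) and unbiased (no length bias).
    """
    residuals = [x - t for x in lifetimes if x > t]
    if not residuals:
        raise ValueError("no observations exceed t")
    return sum(residuals) / len(residuals)


# Observations 3 and 4 survive past t = 2, with residual lives 1 and 2.
print(empirical_mrl([1, 2, 3, 4], t=2))
```

    With length-biased or right-censored data, this simple average is generally biased, which is the situation the proposed estimator is designed to handle.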

    Usefulness and limitations of dK random graph models to predict interactions and functional homogeneity in biological networks under a pseudo-likelihood parameter estimation approach

    Background: Many aspects of biological function can be modeled by biological networks, such as protein interaction networks, metabolic networks, and gene coexpression networks. Studying the statistical properties of these networks in turn allows us to infer biological function. Complex statistical network models can potentially describe the networks more accurately, but it is not clear whether such complex models are better suited to finding biologically meaningful subnetworks. Results: Recent studies have shown that the degree distribution of the nodes is not an adequate statistic in many molecular networks. We sought to extend this statistic with 2nd and 3rd order degree correlations and developed a pseudo-likelihood approach to estimate the parameters. The approach was used to analyze the MIPS and BIOGRID yeast protein interaction networks and two yeast coexpression networks. We showed that 2nd order degree correlation information gave better predictions of gene interactions in both protein interaction and gene coexpression networks. However, in the biologically important task of predicting functionally homogeneous modules, degree correlation information performs marginally better in the case of the MIPS and BIOGRID protein interaction networks, but worse in the case of gene coexpression networks. Conclusion: Our use of dK models showed that incorporating degree correlations could increase predictive power in some contexts, albeit sometimes marginally, but in all contexts the use of third-order degree correlations decreased accuracy. However, it is possible that other parameter estimation methods, such as maximum likelihood, will show the usefulness of incorporating 2nd and 3rd order degree correlations in predicting functionally homogeneous modules.

    DDAB-assisted synthesis of iodine-rich CsPbI3 perovskite nanocrystals with improved stability in multiple environments

    © 2020 The Royal Society of Chemistry. All-inorganic cesium lead halide perovskite (CsPbX3, X = Cl, Br, I) nanocrystals (NCs) have attracted considerable attention due to their tunable optical properties and high optical quantum yield. However, their stability in various environments, such as different solvents, high temperature and UV light, remains to be addressed to enable their exploitation in devices. Here, we report on the synthesis of all-inorganic CsPbI3 perovskite nanocrystals capped with didodecyldimethylammonium bromide (DDAB). Monodispersed DDAB-capped CsPbI3 NCs have enhanced stability with respect to their morphological and optical properties compared to conventional oleic acid (OA)/oleylamine (OLA) capped nanocrystals. The DDAB-CsPbI3 NCs retain an optical quantum yield >80% for at least 60 days. The enhanced stability is explained by the binding of branched DDAB ligands to the NC surface, leading to the formation of a halogen-rich surface, as confirmed by X-ray photoelectron spectroscopy, with an iodine to lead atomic ratio of I : Pb = 4 : 1. These perovskites were used in light-emitting diodes (LEDs), which have a maximum external quantum efficiency (EQE) of 1.25% and a luminance of 468 cd m⁻², and demonstrated improved operational performance. The enhanced stability of DDAB-CsPbI3 in the environments relevant for device processing and operation is important for their exploitation in optoelectronics.