
    ALGORITHMS FOR CORRECTING NEXT GENERATION SEQUENCING ERRORS

    The advent of next generation sequencing (NGS) technologies has revolutionized biological research. However, new computational tools are needed to use the data they produce. Because NGS reads are significantly shorter and have a higher per-base error rate, more complicated approaches are required, and critical problems such as genome assembly remain unsatisfactorily solved. We therefore focus on improving the quality of NGS data; more precisely, we address error correction. Current error correction methods are not very accurate and do not adapt to the data. We propose a novel tool, HiTEC, to correct errors in NGS data. HiTEC is based on the suffix array data structure accompanied by a statistical analysis. Its accuracy is significantly higher than that of all previous methods, it is the only tool able to adjust to the given data set, and it is time and space efficient.
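
To make the idea of correcting a base from its repeated context concrete, here is a minimal sketch. It is not HiTEC: the abstract says HiTEC uses a suffix array and a statistical analysis to adapt to the data, whereas this toy substitutes a hash table of contexts and fixed thresholds. The function name `correct_reads` and the parameters `w`, `min_support`, and `max_error_support` are hypothetical.

```python
from collections import Counter, defaultdict


def correct_reads(reads, w=4, min_support=3, max_error_support=1):
    """Toy context-based read correction (illustrative only, not HiTEC)."""
    # support[context][letter] = number of times `letter` follows the
    # length-w `context` anywhere in the read set.
    support = defaultdict(Counter)
    for read in reads:
        for i in range(w, len(read)):
            support[read[i - w:i]][read[i]] += 1

    corrected = []
    for read in reads:
        bases = list(read)
        for i in range(w, len(bases)):
            # The context is taken from the already-corrected prefix.
            context = "".join(bases[i - w:i])
            counts = support[context]
            if counts[bases[i]] > max_error_support:
                continue  # the observed letter is well supported
            # Replace only when exactly one alternative is strongly supported.
            strong = [b for b, c in counts.items() if c >= min_support]
            if len(strong) == 1:
                bases[i] = strong[0]
        corrected.append("".join(bases))
    return corrected


# Five error-free copies of a made-up read and one copy with a substitution.
reads = ["ACGTACGGTA"] * 5 + ["ACGTACGCTA"]
fixed = correct_reads(reads)
print(fixed[-1])
```

In this sketch an error in the first `w` positions of a read cannot be corrected, because no full-length context precedes it. The fixed thresholds are the part a data-adaptive method would replace with values derived from the data set.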

    Estimation with Norm Regularization

    Analysis of non-asymptotic estimation error and structured statistical recovery based on norm regularized regression, such as Lasso, needs to consider four aspects: the norm, the loss function, the design matrix, and the noise model. This paper presents generalizations of such estimation error analysis on all four aspects compared to the existing literature. We characterize the restricted error set where the estimation error vector lies, establish relations between error sets for the constrained and regularized problems, and present an estimation error bound applicable to any norm. Precise characterizations of the bound are presented for isotropic as well as anisotropic subGaussian design matrices, subGaussian noise models, and convex loss functions, including least squares and generalized linear models. Generic chaining and associated results play an important role in the analysis. A key result from the analysis is that the sample complexity of all such estimators depends on the Gaussian width of a spherical cap corresponding to the restricted error set. Further, once the number of samples $n$ crosses the required sample complexity, the estimation error decreases as $\frac{c}{\sqrt{n}}$, where $c$ depends on the Gaussian width of the unit norm ball. Comment: Fixed technical issues. Generalized some result
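
For orientation, a norm-regularized estimator of the kind described above is commonly written as below. The notation is assumed here and not quoted from the paper: $\mathcal{L}$ is the loss on $n$ samples $Z^n$, $R$ is the regularizing norm, $\lambda_n$ is the regularization weight, $\theta^*$ is the true parameter, and the error is measured in the $\ell_2$ norm.

```latex
\hat{\theta}_{\lambda_n} \in \operatorname*{argmin}_{\theta \in \mathbb{R}^p}
  \; \mathcal{L}(\theta; Z^n) + \lambda_n R(\theta),
\qquad
\|\hat{\theta}_{\lambda_n} - \theta^*\|_2 \le \frac{c}{\sqrt{n}}
\quad \text{once } n \text{ exceeds the required sample complexity.}
```

Lasso is the special case $\mathcal{L}(\theta; Z^n) = \frac{1}{2n}\|y - X\theta\|_2^2$ and $R(\theta) = \|\theta\|_1$.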

    Probabilistic Structured Models for Plant Trait Analysis

    University of Minnesota Ph.D. dissertation. March 2017. Major: Communication Sciences and Disorders. Advisor: Arindam Banerjee. 1 computer file (PDF); xii, 171 pages. Many fields in modern science and engineering, such as ecology, computational biology, astronomy, signal processing, climate science, brain imaging, and natural language processing, involve collecting data sets in which the dimensionality of the data p exceeds the sample size n. Since it is usually impossible to obtain consistent procedures unless p < n, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse structured graphical models, low-rank matrices, and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with a regularization function that encourages the assumed structure. Of particular interest is structure learning of graphical models in the high-dimensional setting. Most statistical analyses of graphical model estimation assume that all the data are fully observed and that the data points are sampled from the same distribution, and they provide the sample complexity and convergence rate by considering only one graphical structure for all the observations. In this thesis, we extend these results to estimate the structure of graphical models where the data are partially observed or sampled from multiple distributions. First, we consider the problem of estimating the change in the dependency structure of two p-dimensional models, based on samples drawn from two graphical models. The change is assumed to be structured, e.g., sparse, block sparse, node-perturbed sparse, etc., such that it can be characterized by a suitable (atomic) norm. 
We present and analyze a norm-regularized estimator for directly estimating the change in structure, without having to estimate the structures of the individual graphical models. Next, we consider the problem of estimating the sparse structure of Gaussian copula distributions (corresponding to non-paranormal distributions) using samples with missing values. We prove that our proposed estimators consistently estimate the non-paranormal correlation matrix, where the convergence rate depends on the probability of missing values. In the second part of the thesis, we consider the matrix completion problem. Low-rank matrix completion methods have been successful in a variety of settings such as recommendation systems. However, most existing matrix completion methods only provide a point estimate of missing entries and do not characterize the uncertainty of the predictions. First, we show that the posterior distribution in latent factor models, such as probabilistic matrix factorization, when marginalized over one latent factor, has the Matrix Generalized Inverse Gaussian (MGIG) distribution. We show that the MGIG is unimodal, and that the mode can be obtained by solving an Algebraic Riccati Equation. The characterization leads to a novel Collapsed Monte Carlo inference algorithm for such latent factor models. Next, we propose a Bayesian hierarchical probabilistic matrix factorization (BHPMF) model to 1) incorporate hierarchical side information, and 2) provide uncertainty-quantified predictions. The former yields significant performance improvements in the problem of plant trait prediction, a key problem in ecology, by leveraging the taxonomic hierarchy in the plant kingdom. The latter helps identify low-confidence predictions, which can in turn be used to guide field work for data collection efforts. Finally, we consider applications of probabilistic structured models to plant trait analysis. We apply the BHPMF model to fill the gaps in the TRY database. 
The BHPMF model is the state-of-the-art model for plant trait prediction and is gaining visibility and usage in plant trait analysis. We have submitted an R package for BHPMF to CRAN. Next, we apply the Gaussian graphical model structure estimators to obtain the trait-trait interactions. We study the structure of trait-trait interactions in different climate zones and among different plant growth forms, and uncover the dependence of traits on climate and on vegetation.

    Robustness of trait connections across environmental gradients and growth forms

    Aim: Plant trait databases often contain traits that are correlated, but for which direct (undirected statistical dependency) and indirect (mediated by other traits) connections may be confounded. The confounding of correlation and connection hinders our understanding of plant strategies, and how these vary among growth forms and climate zones. We identified the direct and indirect connections across plant traits relevant to competition, resource acquisition and reproductive strategies using a global database and explored whether connections within and between traits from different tissue types vary across climates and growth forms. Location: Global. Major taxa studied: Plants. Time period: Present. Methods: We used probabilistic graphical models and a database of 10 plant traits (leaf area, specific leaf area, mass‐ and area‐based leaf nitrogen and phosphorus content, leaf life span, plant height, stem specific density and seed mass) with 16,281 records to describe direct and indirect connections across woody and non‐woody plants across tropical, temperate, arid, cold and polar regions. Results: Trait networks based on direct connections are sparser than those based on correlations. Land plants had high connectivity across traits within and between tissue types; leaf life span and stem specific density shared direct connections with all other traits. For both growth forms, two groups of traits form modules of more highly connected traits; one related to resource acquisition, the other to plant architecture and reproduction. Woody species had higher trait network modularity in polar compared to temperate and tropical climates, while non‐woody species did not show significant differences in modularity across climate regions. Main conclusions: Plant traits are highly connected both within and across tissue types, yet traits segregate into persistent modules of traits. 
Variation in the modularity of trait networks suggests that trait connectivity is shaped by prevailing environmental conditions and demonstrates that plants of different growth forms use alternative strategies to cope with local conditions. Funding: National Science Foundation, Grant/Award Number: IIS‐1563950; Advanced Research Projects Agency ‐ Energy, Grant/Award Number: DE‐SL0012677; H2020 European Research Council, Grant/Award Number: ERC‐SyG‐2013‐610028 IMBALANCE‐P; University of Minnesota, Grant/Award Number: CE140100008, 226299, 19‐14‐00038 and 22; Australian Research Council, Grant/Award Number: CE140100008; FP7; European Research Council; Russian Science Foundation, Grant/Award Number: # 19‐14‐0003
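
The distinction between a correlation and a direct connection can be illustrated with a three-variable partial correlation, which is the simplest Gaussian case of the graphical-model idea. This is a sketch on synthetic data, not the study's method or data: the trait names are used only as hypothetical labels, and both "leaf area" and "seed mass" are generated to depend on "height" alone.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def partial_corr(a, b, c):
    """Correlation of a and b after removing the linear effect of c."""
    r_ab, r_ac, r_bc = pearson(a, b), pearson(a, c), pearson(b, c)
    return (r_ab - r_ac * r_bc) / math.sqrt((1 - r_ac ** 2) * (1 - r_bc ** 2))


rng = random.Random(0)
# Synthetic, hypothetical traits: two traits driven only by a third.
height = [rng.gauss(0, 1) for _ in range(2000)]
leaf_area = [h + rng.gauss(0, 1) for h in height]
seed_mass = [h + rng.gauss(0, 1) for h in height]

marginal = pearson(leaf_area, seed_mass)            # indirect link shows up here
direct = partial_corr(leaf_area, seed_mass, height)  # near zero once height is fixed
print(round(marginal, 2), round(direct, 2))
```

The two synthetic traits are clearly correlated, yet the partial correlation given the shared driver is close to zero, so a network built from direct connections would omit that edge. This is consistent with the reported result that networks of direct connections are sparser than correlation networks.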

    Mapping local and global variability in plant trait distributions

    Our ability to understand and predict the response of ecosystems to a changing environment depends on quantifying vegetation functional diversity. However, representing this diversity at the global scale is challenging. Typically, in Earth system models, characterization of plant diversity has been limited to grouping related species into plant functional types (PFTs), with all trait variation in a PFT collapsed into a single mean value that is applied globally. Using the largest global plant trait database and state-of-the-art Bayesian modeling, we created fine-grained global maps of plant trait distributions that can be applied to Earth system models. Focusing on a set of plant traits closely coupled to photosynthesis and foliar respiration, namely specific leaf area (SLA) and dry mass-based concentrations of leaf nitrogen (Nm) and phosphorus (Pm), we characterize how traits vary within and among over 50,000 ~50×50-km cells across the entire vegetated land surface. We do this in several ways: without defining the PFT of each grid cell, and using 4 or 14 PFTs; each model's predictions are evaluated against out-of-sample data. This endeavor advances prior trait mapping by generating global maps that preserve variability across scales by using modern Bayesian spatial statistical modeling in combination with a database over three times larger than that in previous analyses. Our maps reveal that the most diverse grid cells possess trait variability close to the range of global PFT means.
    This research was supported as part of the Energy Exascale Earth System Model (E3SM) project, funded by the US Department of Energy, Office of Science, Office of Biological and Environmental Research (Grant DE-SC0012677 to P.B.R. and A.B.). O.K.A. acknowledges the support of the Australian Research Council (CE140100008). This research was also funded by programs from the NSF Long-Term Ecological Research (Grant DEB-1234162) and Long-Term Research in Environmental Biology (Grant DEB-1242531). A.B., F.F., and P.B.R. acknowledge funding from NSF Grant IIS-1563950. P.B.R. also acknowledges support from two University of Minnesota Institute on the Environment discovery grants. This study has been supported by the TRY initiative on plant traits (www.try-db.org). The TRY database is hosted at the Max Planck Institute for Biogeochemistry (Jena, Germany) and supported by DIVERSITAS/Future Earth, the German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, and the EU H2020 project BACI (Grant 640176). B.B. acknowledges a Natural Environment Research Council (NERC) independent research fellowship NE/M019160/1. J.P. acknowledges the financial support from the European Research Council Synergy Grant ERC-SyG-2013-610028 IMBALANCE-P, the Spanish Government Grant CGL2013-48074-P, and the Catalan Government Grant SGR 2014-274. B.B.-L. was supported by the Earth System Modeling program of the US Department of Energy, Office of Science, Office of Biological and Environmental Research. K.K. acknowledges the contribution of the Wageningen University and Research Investment theme Resilience for the project Resilient Forest (KB-29-009-003). P.M. acknowledges support from ARC Grant FT110100457 and NERC Grant NE/F002149/1. W.H. acknowledges support from the National Natural Science Foundation of China (Grant 41473068) and the “Light of West China” Program of the Chinese Academy of Sciences.

    BHPMF – a hierarchical Bayesian approach to gap-filling and trait prediction for macroecology and functional biogeography

    Aim: Functional traits of organisms are key to understanding and predicting biodiversity and ecological change, which motivates continuous collection of traits and their integration into global databases. Such trait matrices are inherently sparse, severely limiting their usefulness for further analyses. On the other hand, traits are characterized by the phylogenetic trait signal, trait–trait correlations and environmental constraints, all of which provide information that could be used to statistically fill gaps. We propose the application of probabilistic models which, for the first time, utilize all three characteristics to fill gaps in trait databases and predict trait values at larger spatial scales. Innovation: For this purpose we introduce BHPMF, a hierarchical Bayesian extension of probabilistic matrix factorization (PMF). PMF is a machine learning technique which exploits the correlation structure of sparse matrices to impute missing entries. BHPMF additionally utilizes the taxonomic hierarchy for trait prediction and provides uncertainty estimates for each imputation. In combination with multiple regression against environmental information, BHPMF allows for extrapolation from point measurements to larger spatial scales. We demonstrate the applicability of BHPMF in ecological contexts, using different plant functional trait datasets, also comparing results to taking the species mean and PMF. Main conclusions: Sensitivity analyses validate the robustness and accuracy of BHPMF: our method captures the correlation structure of the trait matrix as well as the phylogenetic trait signal – also for extremely sparse trait matrices – and provides a robust measure of confidence in prediction accuracy for each missing entry. The combination of BHPMF with environmental constraints provides a promising concept to extrapolate traits beyond sampled regions, accounting for intraspecific trait variability. 
We conclude that BHPMF and its derivatives have a high potential to support future trait-based research in macroecology and functional biogeography.
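
The gap-filling step that BHPMF builds on can be sketched with plain matrix factorization fitted by stochastic gradient descent. This is only the PMF-style point-estimate baseline under stated assumptions: it has no taxonomic hierarchy and no uncertainty estimates, and it is not the BHPMF R package. The function names, hyperparameters, and the tiny species-by-trait matrix are all hypothetical.

```python
import math
import random


def fit_pmf(observed, n_rows, n_cols, rank=2, lr=0.01, reg=0.01,
            epochs=3000, seed=0):
    """Fit row and column factors to the observed entries by SGD.

    `observed` maps (row, col) -> value. The L2 penalty `reg` plays the
    role of the Gaussian priors in PMF's MAP estimate.
    """
    rng = random.Random(seed)
    U = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_rows)]
    V = [[rng.gauss(0, 0.1) for _ in range(rank)] for _ in range(n_cols)]
    for _ in range(epochs):
        for (i, j), x in observed.items():
            err = x - sum(U[i][k] * V[j][k] for k in range(rank))
            for k in range(rank):
                u, v = U[i][k], V[j][k]
                U[i][k] += lr * (err * v - reg * u)
                V[j][k] += lr * (err * u - reg * v)
    return U, V


def predict(U, V, i, j):
    """Predicted value for cell (i, j), observed or missing."""
    return sum(a * b for a, b in zip(U[i], V[j]))


def rmse(U, V, observed):
    """Root-mean-square error over the observed entries."""
    sq = sum((x - predict(U, V, i, j)) ** 2 for (i, j), x in observed.items())
    return math.sqrt(sq / len(observed))


# Made-up rank-1 "species x trait" matrix with two cells held out as gaps.
species_scale = [1.0, 2.0, 3.0]
trait_scale = [1.0, 2.0, 1.5, 0.5]
held_out = {(0, 1), (2, 3)}
observed = {(i, j): s * t
            for i, s in enumerate(species_scale)
            for j, t in enumerate(trait_scale)
            if (i, j) not in held_out}

U0, V0 = fit_pmf(observed, 3, 4, epochs=0)  # untrained factors, for reference
U, V = fit_pmf(observed, 3, 4)
print(rmse(U0, V0, observed), rmse(U, V, observed))
print(predict(U, V, 2, 3))  # imputed value for one held-out cell
```

The held-out cells are filled purely from the correlation structure of the observed entries. A hierarchical extension would additionally share information among rows that belong to the same taxonomic group.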