ALGORITHMS FOR CORRECTING NEXT GENERATION SEQUENCING ERRORS
The advent of next generation sequencing (NGS) technologies generated a revolution in biological research. However, in order to use the data they produce, new computational tools are needed. Because the reads are significantly shorter and the per-base error rate is higher, more complicated approaches are employed, and critical problems such as genome assembly are still not satisfactorily solved. We therefore focus our attention on improving the quality of the NGS data. More precisely, we address the error correction issue. The current methods for correcting errors are not very accurate. In addition, they do not adapt to the data. We propose a novel tool, HiTEC, to correct errors in NGS data. HiTEC is based on the suffix array data structure accompanied by a statistical analysis. HiTEC's accuracy is significantly higher than that of all previous methods. Moreover, it is the only tool with the ability to adjust to the given data set. Finally, HiTEC is time and space efficient.
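The general idea of suffix-array-based correction can be illustrated with a minimal sketch. This is not HiTEC's actual algorithm: the function names, the naive suffix array construction, and the fixed thresholds `min_good` and `max_bad` are assumptions made for illustration, whereas the abstract states that HiTEC derives its parameters from a statistical analysis of the data set. The sketch sorts all suffixes of the concatenated reads, counts which base follows each length-k substring (a "witness"), and replaces a base that is poorly supported when another base is well supported after the same witness.

```python
from collections import Counter
from itertools import groupby


def witness_support(reads, k):
    """Count, for every length-k substring, the bases that follow it."""
    text = "$".join(reads) + "$"  # '$' separates reads
    # Naive suffix array: fine for a toy example, far too slow for real data.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    # Keep suffixes whose first k+1 characters lie inside a single read.
    usable = [i for i in sa
              if i + k < len(text) and "$" not in text[i:i + k + 1]]
    support = {}
    # Suffixes sharing a k-prefix are adjacent in the sorted order.
    for witness, group in groupby(usable, key=lambda i: text[i:i + k]):
        support[witness] = Counter(text[i + k] for i in group)
    return support


def correct_read(read, support, k, min_good=3, max_bad=1):
    """Replace weakly supported bases (hypothetical fixed thresholds)."""
    bases = list(read)
    for pos in range(k, len(bases)):
        counts = support.get("".join(bases[pos - k:pos]))
        if not counts:
            continue
        best, best_n = counts.most_common(1)[0]
        if (best != bases[pos] and best_n >= min_good
                and counts[bases[pos]] <= max_bad):
            bases[pos] = best
    return "".join(bases)


reads = ["ACGTAC", "ACGTAC", "ACGTAC", "ACGAAC"]  # last read has one error
support = witness_support(reads, 3)
print(correct_read("ACGAAC", support, 3))
```

In the toy input, the witness `ACG` is followed by `T` three times and by `A` once, so the `A` in the last read is treated as an error.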
Estimation with Norm Regularization
Analysis of non-asymptotic estimation error and structured statistical
recovery based on norm regularized regression, such as Lasso, needs to consider
four aspects: the norm, the loss function, the design matrix, and the noise
model. This paper presents generalizations of such estimation error analysis on
all four aspects compared to the existing literature. We characterize the
restricted error set where the estimation error vector lies, establish
relations between error sets for the constrained and regularized problems, and
present an estimation error bound applicable to any norm. Precise
characterizations of the bound are presented for isotropic as well as
anisotropic subGaussian design matrices, subGaussian noise models, and convex
loss functions, including least squares and generalized linear models. Generic
chaining and associated results play an important role in the analysis. A key
result from the analysis is that the sample complexity of all such estimators
depends on the Gaussian width of a spherical cap corresponding to the
restricted error set. Further, once the number of samples $n$ crosses the
required sample complexity, the estimation error decreases as
$\frac{c}{\sqrt{n}}$, where $c$ depends on the Gaussian width of the unit norm
ball.
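As a concrete instance of the norm-regularized regression this abstract analyzes, the following minimal sketch solves the Lasso, i.e. least squares loss with an L1 norm penalty, by proximal gradient descent (ISTA). The solver choice, function names, and toy data are assumptions for illustration; the paper concerns error bounds for such estimators, not this particular algorithm.

```python
def soft_threshold(x, t):
    """Proximal operator of t * |x|."""
    if x > t:
        return x - t
    if x < -t:
        return x + t
    return 0.0


def lasso_ista(X, y, lam, step, iters=500):
    """Minimize (1/2n) * ||y - X theta||^2 + lam * ||theta||_1 with ISTA."""
    n, p = len(X), len(X[0])
    theta = [0.0] * p
    for _ in range(iters):
        resid = [sum(X[i][j] * theta[j] for j in range(p)) - y[i]
                 for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) / n
                for j in range(p)]
        theta = [soft_threshold(theta[j] - step * grad[j], step * lam)
                 for j in range(p)]
    return theta


# Toy design with orthogonal columns, so that X^T X / n is the identity.
X = [[1, 1, 1], [1, -1, 1], [1, 1, -1], [1, -1, -1]]
y = [1, 1, 3, 3]  # noiseless responses for the sparse vector (2, 0, -1)
theta_hat = lasso_ista(X, y, lam=0.1, step=1.0)
print(theta_hat)
```

Because the toy design is orthogonal, the Lasso solution is the soft-thresholded true vector: the zero coefficient stays at zero and the nonzero ones shrink toward zero by `lam`.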
Probabilistic Structured Models for Plant Trait Analysis
University of Minnesota Ph.D. dissertation. March 2017. Major: Communication Sciences and Disorders. Advisor: Arindam Banerjee. 1 computer file (PDF); xii, 171 pages. Many fields in modern science and engineering, such as ecology, computational biology, astronomy, signal processing, climate science, brain imaging, natural language processing, and many more, involve collecting data sets in which the dimensionality of the data p exceeds the sample size n. Since it is usually impossible to obtain consistent procedures unless p < n, a line of recent work has studied models with various types of low-dimensional structure, including sparse vectors, sparse structured graphical models, low-rank matrices, and combinations thereof. In such settings, a general approach to estimation is to solve a regularized optimization problem, which combines a loss function measuring how well the model fits the data with some regularization function that encourages the assumed structure. Of particular interest is structure learning of graphical models in the high-dimensional setting. The majority of statistical analyses of graphical model estimation assume that all the data are fully observed and that the data points are sampled from the same distribution, and provide the sample complexity and convergence rate by considering only one graphical structure for all the observations. In this thesis, we extend the above results to estimate the structure of graphical models where the data are partially observed or sampled from multiple distributions. First, we consider the problem of estimating the change in the dependency structure of two p-dimensional models, based on samples drawn from two graphical models. The change is assumed to be structured, e.g., sparse, block sparse, node-perturbed sparse, etc., such that it can be characterized by a suitable (atomic) norm.
We present and analyze a norm-regularized estimator for directly estimating the change in structure, without having to estimate the structures of the individual graphical models. Next, we consider the problem of estimating the sparse structure of Gaussian copula distributions (corresponding to non-paranormal distributions) using samples with missing values. We prove that our proposed estimators consistently estimate the non-paranormal correlation matrix, where the convergence rate depends on the probability of missing values. In the second part of the thesis, we consider the matrix completion problem. Low-rank matrix completion methods have been successful in a variety of settings such as recommendation systems. However, most of the existing matrix completion methods only provide a point estimate of missing entries, and do not characterize uncertainties of the predictions. First, we illustrate that the posterior distribution in latent factor models, such as probabilistic matrix factorization, when marginalized over one latent factor has the Matrix Generalized Inverse Gaussian (MGIG) distribution. We show that the MGIG is unimodal, and the mode can be obtained by solving an Algebraic Riccati Equation. The characterization leads to a novel Collapsed Monte Carlo inference algorithm for such latent factor models. Next, we propose a Bayesian hierarchical probabilistic matrix factorization (BHPMF) model to 1) incorporate hierarchical side information, and 2) provide uncertainty quantified predictions. The former yields significant performance improvements in the problem of plant trait prediction, a key problem in ecology, by leveraging the taxonomic hierarchy in the plant kingdom. The latter is helpful in identifying predictions of low confidence which can in turn be used to guide field work for data collection efforts. Finally, we consider applications of probabilistic structured models to plant trait analysis. We apply the BHPMF model to fill the gaps in the TRY database.
The BHPMF model is the state-of-the-art model for plant trait prediction and is gaining visibility and usage in plant trait analysis. We have submitted an R package for BHPMF to CRAN. Next, we apply the Gaussian graphical model structure estimators to obtain the trait-trait interactions. We study the structure of trait-trait interactions in different climate zones and among different plant growth forms, and uncover the dependence of traits on climate and on vegetation.
Mapping local and global variability in plant trait distributions
Our ability to understand and predict the response of ecosystems to a changing environment depends on quantifying vegetation functional diversity. However, representing this diversity at the global scale is challenging. Typically, in Earth system models, characterization of plant diversity has been limited to grouping related species into plant functional types (PFTs), with all trait variation in a PFT collapsed into a single mean value that is applied globally. Using the largest global plant trait database and state of the art Bayesian modeling, we created fine-grained global maps of plant trait distributions that can be applied to Earth system models. Focusing on a set of plant traits closely coupled to photosynthesis and foliar respiration - specific leaf area (SLA) and dry mass-based concentrations of leaf nitrogen (Nm) and phosphorus (Pm), we characterize how traits vary within and among over 50,000 ~50 × 50-km cells across the entire vegetated land surface. We do this in several ways - without defining the PFT of each grid cell and using 4 or 14 PFTs; each model's predictions are evaluated against out-of-sample data. This endeavor advances prior trait mapping by generating global maps that preserve variability across scales by using modern Bayesian spatial statistical modeling in combination with a database over three times larger than that in previous analyses. Our maps reveal that the most diverse grid cells possess trait variability close to the range of global PFT means.
Robustness of trait connections across environmental gradients and growth forms
Aim
Plant trait databases often contain traits that are correlated, but for which direct (undirected statistical dependency) and indirect (mediated by other traits) connections may be confounded. The confounding of correlation and connection hinders our understanding of plant strategies, and how these vary among growth forms and climate zones. We identified the direct and indirect connections across plant traits relevant to competition, resource acquisition and reproductive strategies using a global database and explored whether connections within and between traits from different tissue types vary across climates and growth forms.
Location
Global.
Major taxa studied
Plants.
Time period
Present.
Methods
We used probabilistic graphical models and a database of 10 plant traits (leaf area, specific leaf area, mass- and area-based leaf nitrogen and phosphorus content, leaf life span, plant height, stem specific density and seed mass) with 16,281 records to describe direct and indirect connections across woody and non-woody plants across tropical, temperate, arid, cold and polar regions.
Results
Trait networks based on direct connections are sparser than those based on correlations. Land plants had high connectivity across traits within and between tissue types; leaf life span and stem specific density shared direct connections with all other traits. For both growth forms, two groups of traits form modules of more highly connected traits; one related to resource acquisition, the other to plant architecture and reproduction. Woody species had higher trait network modularity in polar compared to temperate and tropical climates, while non-woody species did not show significant differences in modularity across climate regions.
Main conclusions
Plant traits are highly connected both within and across tissue types, yet traits segregate into persistent modules of traits. Variation in the modularity of trait networks suggests that trait connectivity is shaped by prevailing environmental conditions and demonstrates that plants of different growth forms use alternative strategies to cope with local conditions.
National Science Foundation, Grant/Award Number: IIS-1563950; Advanced Research Projects Agency - Energy, Grant/Award Number: DE-SL0012677; H2020 European Research Council, Grant/Award Number: ERC-SyG-2013-610028 IMBALANCE-P; University of Minnesota, Grant/Award Number: CE140100008, 226299, 19-14-00038 and 22; Australian Research Council, Grant/Award Number: CE140100008; FP7; European Research Council; Russian Science Foundation, Grant/Award Number: # 19-14-0003
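The difference between a correlation and a direct connection, central to this abstract, can be illustrated with a first-order partial correlation. This is only a three-variable sketch under assumed numbers, not the graphical model estimator used in the study; the trait roles and correlation values are hypothetical.

```python
import math


def partial_corr(r_xy, r_xz, r_yz):
    """Correlation between x and y after conditioning on z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Hypothetical traits: x and y are each correlated with z at 0.8, and their
# marginal correlation of 0.64 is exactly what z alone would induce.
direct = partial_corr(r_xy=0.64, r_xz=0.8, r_yz=0.8)
print(direct)
```

Here x and y are clearly correlated, yet their partial correlation given z is approximately zero. A network built from direct connections would therefore omit the x-y edge, which is consistent with the reported result that such networks are sparser than correlation networks.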
Mapping local and global variability in plant trait distributions
Our ability to understand and predict the response of ecosystems to a changing environment depends on quantifying vegetation functional diversity. However, representing this diversity at the global scale is challenging. Typically, in Earth system models, characterization of plant diversity has been limited to grouping related species into plant functional types (PFTs), with all trait variation in a PFT collapsed into a single mean value that is applied globally. Using the largest global plant trait database and state of the art Bayesian modeling, we created fine-grained global maps of plant trait distributions that can be applied to Earth system models. Focusing on a set of plant traits closely coupled to photosynthesis and foliar respiration - specific leaf area (SLA) and dry mass-based concentrations of leaf nitrogen (Nm) and phosphorus (Pm), we characterize how traits vary within and among over 50,000 ~50 × 50-km cells across the entire vegetated land surface. We do this in several ways - without defining the PFT of each grid cell and using 4 or 14 PFTs; each model's predictions are evaluated against out-of-sample data. This endeavor advances prior trait mapping by generating global maps that preserve variability across scales by using modern Bayesian spatial statistical modeling in combination with a database over three times larger than that in previous analyses. Our maps reveal that the most diverse grid cells possess trait variability close to the range of global PFT means. This research was supported as part of the Energy Exascale Earth System Model (E3SM) project, funded by the US Department of
Energy, Office of Science, Office of Biological and Environmental Research
(Grant DE-SC0012677 to P.B.R. and A.B.). O.K.A. acknowledges the support of the Australian Research Council (CE140100008). This research was
also funded by programs from the NSF Long-Term Ecological Research
(Grant DEB-1234162) and Long-Term Research in Environmental Biology
(Grant DEB-1242531). A.B., F.F., and P.B.R. acknowledge funding from NSF
Grant IIS-1563950. P.B.R. also acknowledges support from two University
of Minnesota Institute on the Environment discovery grants. This study
has been supported by the TRY initiative on plant traits (www.try-db.org).
The TRY database is hosted at the Max Planck Institute for Biogeochemistry (Jena, Germany) and supported by DIVERSITAS/Future Earth, the German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig,
and the EU H2020 project BACI (Grant 640176). B.B. acknowledges a Natural Environment Research Council (NERC) independent research fellowship
NE/M019160/1. J.P. acknowledges the financial support from the European
Research Council Synergy Grant ERC-SyG-2013-610028 IMBALANCE-P, the
Spanish Government Grant CGL2013-48074-P, and the Catalan Government
Grant SGR 2014-274. B.B.-L. was supported by the Earth System Modeling
program of the US Department of Energy, Office of Science, Office of Biological and Environmental Research. K.K. acknowledges the contribution of the
Wageningen University and Research Investment theme Resilience for the
project Resilient Forest (KB-29-009-003). P.M. acknowledges support from
ARC Grant FT110100457 and NERC Grant NE/F002149/1. W.H. acknowledges
support from the National Natural Science Foundation of China (Grant
41473068) and the "Light of West China" Program of the Chinese Academy
of Sciences.
BHPMF - a hierarchical Bayesian approach to gap-filling and trait prediction for macroecology and functional biogeography
Aim: Functional traits of organisms are key to understanding and predicting biodiversity and ecological change, which motivates continuous collection of traits and their integration into global databases. Such trait matrices are inherently sparse, severely limiting their usefulness for further analyses. On the other hand, traits are characterized by the phylogenetic trait signal, trait-trait correlations and environmental constraints, all of which provide information that could be used to statistically fill gaps. We propose the application of probabilistic models which, for the first time, utilize all three characteristics to fill gaps in trait databases and predict trait values at larger spatial scales.
Innovation: For this purpose we introduce BHPMF, a hierarchical Bayesian extension of probabilistic matrix factorization (PMF). PMF is a machine learning technique which exploits the correlation structure of sparse matrices to impute missing entries. BHPMF additionally utilizes the taxonomic hierarchy for trait prediction and provides uncertainty estimates for each imputation. In combination with multiple regression against environmental information, BHPMF allows for extrapolation from point measurements to larger spatial scales. We demonstrate the applicability of BHPMF in ecological contexts, using different plant functional trait datasets, and compare its results with those obtained from species means and from PMF.
Main conclusions: Sensitivity analyses validate the robustness and accuracy of BHPMF: our method captures the correlation structure of the trait matrix as well as the phylogenetic trait signal, even for extremely sparse trait matrices, and provides a robust measure of confidence in prediction accuracy for each missing entry. The combination of BHPMF with environmental constraints provides a promising concept to extrapolate traits beyond sampled regions, accounting for intraspecific trait variability. We conclude that BHPMF and its derivatives have a high potential to support future trait-based research in macroecology and functional biogeography.
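How matrix factorization exploits correlation structure to impute a missing entry can be shown with a minimal sketch. It fits a rank-1 factorization with L2 penalties, which corresponds to a point estimate for plain PMF under Gaussian priors, using alternating least squares. It is not BHPMF: there is no taxonomic hierarchy and no uncertainty estimate, and the function name, the solver, and the toy species-by-trait matrix are assumptions for illustration.

```python
def pmf_rank1(rows, lam=1e-3, iters=200):
    """Rank-1 factorization of a matrix with missing entries (None)."""
    n, m = len(rows), len(rows[0])
    u, v = [1.0] * n, [1.0] * m
    for _ in range(iters):
        # Update each row factor using only that row's observed entries.
        for i in range(n):
            obs = [j for j in range(m) if rows[i][j] is not None]
            u[i] = (sum(v[j] * rows[i][j] for j in obs)
                    / (lam + sum(v[j] ** 2 for j in obs)))
        # Update each column factor using only that column's observed entries.
        for j in range(m):
            obs = [i for i in range(n) if rows[i][j] is not None]
            v[j] = (sum(u[i] * rows[i][j] for i in obs)
                    / (lam + sum(u[i] ** 2 for i in obs)))
    return u, v


# Hypothetical 3 species x 3 traits matrix with one gap; the observed
# values follow an exact rank-1 pattern in which the gap would be 12.
traits = [[1.0, 2.0, 4.0],
          [2.0, 4.0, 8.0],
          [3.0, 6.0, None]]
u, v = pmf_rank1(traits)
imputed = u[2] * v[2]
print(imputed)
```

The imputed value comes out close to 12 because the observed entries of the third row and third column constrain the factors that the gap depends on.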