HCmodelSets: An R package for specifying sets of well-fitting models in high dimensions
In the context of regression with a large number of explanatory variables, Cox and Battey (2017) emphasize that if there are alternative reasonable explanations of the data that are statistically indistinguishable, one should aim to specify as many of these explanations as is feasible. The standard practice, by contrast, is to report a single model effective for prediction. The present paper illustrates the R implementation of the new ideas in the package HCmodelSets, using simple reproducible examples and real data. Results of some simulation experiments are also reported
On inference in high-dimensional logistic regression models with separated data
Direct use of the likelihood function typically produces severely biased estimates when the dimension of the parameter vector is large relative to the effective sample size. With linearly separable data generated from a logistic regression model, the loglikelihood function asymptotes and the maximum likelihood estimator does not exist. We show that an exact analysis for each regression coefficient produces half-infinite confidence sets for some parameters when the data are separable. Such conclusions are not vacuous, but an honest portrayal of the limitations of the data. Finite confidence sets are only achievable when additional, perhaps implicit, assumptions are made. Under a notional double-asymptotic regime in which the dimension of the logistic coefficient vector increases with the sample size, the present paper considers the implications of enforcing a natural constraint on the vector of logistic-transformed probabilities. We derive a relationship between the logistic coefficients and a notional parameter obtained as a probability limit of an ordinary least squares estimator. The latter exists even when the data are separable. Consistency is ascertained under weak conditions on the design matrix
Nonparametric estimation of the intensity function of a spatial point process on a Riemannian manifold
This paper is concerned with nonparametric estimation of the intensity function of a point process on a Riemannian manifold. It provides a first-order asymptotic analysis of the proposed kernel estimator for Poisson processes, supplemented by empirical work to probe the behaviour in finite samples and under other generative regimes. The investigation highlights the scope for finite-sample improvements by allowing the bandwidth to adapt to local curvature
Communication-constrained distributed quantile regression with optimal statistical guarantees
We address the problem of how to achieve optimal inference in distributed quantile regression without stringent scaling conditions. This is challenging due to the non-smooth nature of the quantile regression (QR) loss function, which invalidates the use of existing methodology. The difficulties are resolved through a double-smoothing approach that is applied to the local (at each data source) and global objective functions. Despite the reliance on a delicate combination of local and global smoothing parameters, the quantile regression model is fully parametric, thereby facilitating interpretation. In the low-dimensional regime, we establish a finite-sample theoretical framework for the sequentially defined distributed QR estimators. This reveals a trade-off between the communication cost and statistical error. We further discuss and compare several alternative confidence set constructions, based on inversion of Wald and score-type tests and resampling techniques, detailing an improvement that is effective for more extreme quantile coefficients. In high dimensions, a sparse framework is adopted, where the proposed doubly-smoothed objective function is complemented with an ℓ1-penalty. We show that the corresponding distributed penalized QR estimator achieves the global convergence rate after a near-constant number of communication rounds. A thorough simulation study further elucidates our findings
Can you make morphometrics work when you know the right answer? Pick and mix approaches for apple identification
Morphological classification of living things has challenged science for several centuries and has led to a wide range of objective morphometric approaches in data gathering and analysis. In this paper we explore those methods using apple cultivars, a model biological system in which discrete groups are pre-defined but in which there is a high level of overall morphological similarity. The effectiveness of morphometric techniques in discovering the groups is evaluated using statistical learning tools. No one technique proved optimal in classification on every occasion, linear morphometric techniques slightly out-performing geometric (72.6% accuracy on test set versus 66.7%). The combined use of these techniques with post-hoc knowledge of their individual successes with particular cultivars achieves a notably higher classification accuracy (77.8%). From this we conclude that even with pre-determined discrete categories, a range of approaches is needed where those categories are intrinsically similar to each other, and we raise the question of whether in studies where potentially continuous natural variation is being categorised the level of match between categories is routinely set too high
SolCyc: a database hub at the Sol Genomics Network (SGN) for the manual curation of metabolic networks in Solanum and Nicotiana specific databases
SolCyc is the entry portal to pathway/genome databases (PGDBs) for major species of the Solanaceae family hosted at the Sol Genomics Network. Currently, SolCyc comprises six organism-specific PGDBs for tomato, potato, pepper, petunia, tobacco and one Rubiaceae, coffee. The metabolic networks of those PGDBs have been computationally predicted by the PathoLogic component of the Pathway Tools software using the manually curated multi-domain database MetaCyc (http://www.metacyc.org/) as reference. SolCyc has been recently extended by taxon-specific databases, i.e. the family-specific SolanaCyc database, containing only curated data pertinent to species of the nightshade family, and NicotianaCyc, a genus-specific database that stores all relevant metabolic data of the Nicotiana genus. Through manual curation of the published literature, new metabolic pathways have been created in those databases, which are complemented by the continuously updated, relevant species-specific pathways from MetaCyc. At present, SolanaCyc comprises 199 pathways and 29 superpathways and NicotianaCyc accounts for 72 pathways and 13 superpathways. Curator-maintained, taxon-specific databases such as SolanaCyc and NicotianaCyc are characterized by an enrichment of data specific to these taxa and are free of falsely predicted pathways. Both databases have been used to update recently created Nicotiana-specific databases for Nicotiana tabacum, Nicotiana benthamiana, Nicotiana sylvestris and Nicotiana tomentosiformis by propagating verifiable data into those PGDBs. In addition, in-depth curation of the pathways in N. tabacum has been carried out, which resulted in the elimination of 156 pathways from the 569 pathways predicted by Pathway Tools. Together, in-depth curation of the predicted pathway network and the supplementation with curated data from taxon-specific databases have substantially improved the curation status of the species-specific N. tabacum PGDB. The implementation of this strategy will significantly advance the curation status of all organism-specific databases in SolCyc, resulting in improved database accuracy, data analysis and visualization of biochemical networks in those species
Arabidopsis annexin1 mediates the radical-activated plasma membrane Ca2+- and K+-permeable conductance in root cells
Plant cell growth and stress signaling require Ca2+ influx through plasma membrane transport proteins that are regulated by reactive oxygen species. In root cell growth, adaptation to salinity stress, and stomatal closure, such proteins operate downstream of the plasma membrane NADPH oxidases that produce extracellular superoxide anion, a reactive oxygen species that is readily converted to extracellular hydrogen peroxide and hydroxyl radicals, OH•. In root cells, extracellular OH• activates a plasma membrane Ca2+-permeable conductance that permits Ca2+ influx. In Arabidopsis thaliana, distribution of this conductance resembles that of annexin1 (ANN1). Annexins are membrane binding proteins that can form Ca2+-permeable conductances in vitro. Here, the Arabidopsis loss-of-function mutant for annexin1 (Atann1) was found to lack the root hair and epidermal OH•-activated Ca2+- and K+-permeable conductance. This manifests in both impaired root cell growth and ability to elevate root cell cytosolic free Ca2+ in response to OH•. An OH•-activated Ca2+ conductance is reconstituted by recombinant ANN1 in planar lipid bilayers. ANN1 therefore presents as a novel Ca2+-permeable transporter providing a molecular link between reactive oxygen species and cytosolic Ca2+ in plants
The Protein Model Portal
Structural Genomics has been successful in determining the structures of many unique proteins in a high-throughput manner. Still, the number of known protein sequences is much larger than the number of experimentally solved protein structures. Homology (or comparative) modeling methods make use of experimental protein structures to build models for evolutionarily related proteins. Thereby, experimental structure determination efforts and homology modeling complement each other in the exploration of the protein structure space. One of the challenges in using model information effectively has been to access all models available for a specific protein in heterogeneous formats at different sites using various incompatible accession code systems. Often, structure models for hundreds of proteins can be derived from a given experimentally determined structure, using a variety of established methods. This has been done by all of the PSI centers, and by various independent modeling groups. The goal of the Protein Model Portal (PMP) is to provide a single portal which gives access to the various models that can be leveraged from PSI targets and other experimental protein structures. A single interface allows all existing pre-computed models across these various sites to be queried simultaneously, and provides links to interactive services for template selection, target-template alignment, model building, and quality assessment. The current release of the portal consists of 7.6 million model structures provided by different partner resources (CSMP, JCSG, MCSG, NESG, NYSGXRC, JCMM, ModBase, SWISS-MODEL Repository). The PMP is available at http://www.proteinmodelportal.org and from the PSI Structural Genomics Knowledgebase