The Degrees of Freedom of Partial Least Squares Regression
The derivation of statistical properties for Partial Least Squares regression
can be a challenging task. The reason is that the construction of latent
components from the predictor variables also depends on the response variable.
While this typically leads to good performance and interpretable models in
practice, it makes the statistical analysis more involved. In this work, we
study the intrinsic complexity of Partial Least Squares Regression. Our
contribution is an unbiased estimate of its Degrees of Freedom. It is defined
as the trace of the first derivative of the fitted values, seen as a function
of the response. We establish two equivalent representations that rely on the
close connection of Partial Least Squares to matrix decompositions and Krylov
subspace techniques. We show that the Degrees of Freedom depend on the
collinearity of the predictor variables: The lower the collinearity is, the
higher the Degrees of Freedom are. In particular, they are typically higher
than the naive approach that defines the Degrees of Freedom as the number of
components. Further, we illustrate how the Degrees of Freedom approach can be
used for the comparison of different regression methods. In the experimental
section, we show that our Degrees of Freedom estimate in combination with
information criteria is useful for model selection. Comment: to appear in the Journal of the American Statistical Association
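The Degrees of Freedom defined above — the trace of the first derivative of the fitted values with respect to the response — can be approximated numerically. The sketch below is illustrative only, not the paper's closed-form matrix-decomposition or Krylov representations: it pairs a minimal NIPALS-style PLS1 fit (an assumption; the paper does not prescribe this implementation) with a finite-difference trace.

```python
import numpy as np

def pls1_fit(X, y, m):
    """Minimal NIPALS-style PLS1 with m components; returns fitted values.
    Illustrative sketch, not the paper's implementation."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    Xd, yd = Xc.copy(), yc.copy()
    scores = []
    for _ in range(m):
        w = Xd.T @ yd                      # weight vector depends on y
        w /= np.linalg.norm(w)
        t = Xd @ w                         # latent component (score)
        scores.append(t)
        p = Xd.T @ t / (t @ t)             # loading; deflate X and y
        Xd = Xd - np.outer(t, p)
        yd = yd - t * (t @ yd) / (t @ t)
    T = np.column_stack(scores)
    coef, *_ = np.linalg.lstsq(T, yc, rcond=None)
    return y.mean() + T @ coef

def pls_dof(X, y, m, eps=1e-5):
    """Degrees of Freedom = trace of d(yhat)/dy, by one-sided differences.
    Because the components themselves depend on y, this typically exceeds
    the naive count of m components."""
    base = pls1_fit(X, y, m)
    tr = 0.0
    for i in range(len(y)):
        yp = y.copy()
        yp[i] += eps
        tr += (pls1_fit(X, yp, m)[i] - base[i]) / eps
    return tr
```

For predictors with low collinearity (e.g. i.i.d. Gaussian columns), the estimate is typically noticeably larger than the number of components, in line with the abstract's claim.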
Modelling time course gene expression data with finite mixtures of linear additive models
Summary: A model class of finite mixtures of linear additive models is presented. The component-specific parameters in the regression models are estimated using regularized likelihood methods. The advantages of the regularization are that (i) the pre-specified maximum degrees of freedom for the splines are less crucial than for unregularized estimation and that (ii) a suitable degree of freedom is selected automatically for each component individually. The performance is evaluated in a simulation study with artificial data as well as on a yeast cell cycle dataset of gene expression levels over time.
Morphology of obligate ectosymbionts reveals Paralaxus gen. nov.: A new circumtropical genus of marine stilbonematine nematodes
Stilbonematinae are a subfamily of conspicuous marine nematodes, distinguished by a coat of sulphur-oxidizing bacterial ectosymbionts on their cuticle. Like most nematodes, the worm hosts have a relatively simple anatomy and few taxonomically informative characters, and this has resulted in numerous taxonomic reassignments and synonymizations. Recent studies using a combination of morphological and molecular traits have helped to improve the taxonomy of Stilbonematinae but also raised questions on the validity of several genera. Here, we describe a new circumtropically distributed genus Paralaxus (Stilbonematinae) with three species: Paralaxus cocos sp. nov., P. bermudensis sp. nov. and P. columbae sp. nov. We used single worm metagenomes to generate host 18S rRNA and cytochrome c oxidase I (COI) as well as symbiont 16S rRNA gene sequences. Intriguingly, COI alignments and primer matching analyses suggest that the COI is not suitable for PCR-based barcoding approaches in Stilbonematinae, as the genera have a highly diverse base composition and no conserved primer sites. The phylogenetic analyses of all three gene sets, however, confirm the morphological assignments and support the erection of the new genus Paralaxus as well as corroborate the status of the other stilbonematine genera. Paralaxus most closely resembles the stilbonematine genus Laxus in overlapping sets of diagnostic features but can be distinguished from Laxus by the morphology of the genus-specific symbiont coat. Our re-analyses of key parameters of the symbiont coat morphology as a character for all Stilbonematinae genera show that with amended descriptions, including the coat, highly reliable genus assignments can be obtained.
New consistency index based on inertial operating speed
The occurrence of road crashes depends on several factors, with design consistency (i.e., conformance of highway geometry to drivers' expectations) being one of the most important. A new consistency model for evaluating the performance of tangent-to-curve transitions on two-lane rural roads was developed. This model was based on the inertial consistency index (ICI) defined for each transition. The ICI was calculated at the beginning point of the curve as the difference between the average operating speed on the previous 1-km road segment (inertial operating speed) and the actual operating speed at this point. For the calibration of the ICI and its thresholds, 88 road segments, which included 1,686 tangent-to-curve transitions, were studied. The relationship between those results and the crash rate associated with each transition was analyzed. The results showed that the higher the ICI was, the higher the crash rate; thus, the probability of accidents increased. Similar results were obtained from the study of the relationship between the ICI and the weighted average crash rate of the corresponding group of transitions. A graphical and statistical analysis established that road consistency might be considered good when the ICI was lower than 10 km/h, poor when the ICI was higher than 20 km/h, and fair otherwise. A validation process that considered 20 road segments was performed. The ICI values obtained were highly correlated to the number of crashes that had occurred at the analyzed transitions. Thus, the ICI and its consistency thresholds resulted in a new approach for evaluation of consistency.

The authors thank the Center for Studies and Experimentation of Public Works of the Spanish Ministry of Public Works, which partially subsidized the data collection for obtaining the empirical operating speed profiles used in the validation process. The authors also thank the General Directorate of Public Works of the Infrastructure and Transportation Department of the Valencian government, the Valencian Province Council, and the General Directorate of Traffic of the Ministry of the Interior of the Government of Spain for their cooperation in data gathering.

García García, A.; Llopis Castelló, D.; Camacho Torregrosa, F. J.; Pérez Zuriaga, A. M. (2013). New consistency index based on inertial operating speed. Transportation Research Record, 2391, 105-112. doi:10.3141/2391-10
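The index and its thresholds reduce to a simple calculation. A minimal sketch (function names are illustrative, not from the paper): the ICI is the inertial operating speed, i.e. the average operating speed over the preceding 1-km segment, minus the operating speed at the beginning point of the curve, classified with the calibrated 10 and 20 km/h thresholds.

```python
def inertial_consistency_index(speeds_prev_km, curve_speed):
    """ICI in km/h: average operating speed over the preceding 1-km
    segment (inertial operating speed) minus the operating speed at
    the beginning point of the curve."""
    inertial = sum(speeds_prev_km) / len(speeds_prev_km)
    return inertial - curve_speed

def consistency_level(ici):
    """Thresholds from the paper: good below 10 km/h,
    poor above 20 km/h, fair otherwise."""
    if ici < 10:
        return "good"
    if ici <= 20:
        return "fair"
    return "poor"
```

For example, an average approach speed of 95 km/h dropping to 70 km/h at the curve gives an ICI of 25 km/h, a poor-consistency transition.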
Evaluation strategies for isotope ratio measurements of single particles by LA-MC-ICPMS
Data evaluation is a crucial step when it comes to the determination of accurate and precise isotope ratios computed from transient signals measured by multi-collector inductively coupled plasma mass spectrometry (MC-ICPMS) coupled to, for example, laser ablation (LA). In the present study, the applicability of different data evaluation strategies (i.e. 'point-by-point', 'integration' and 'linear regression slope' method) for the computation of (235)U/(238)U isotope ratios measured in single particles by LA-MC-ICPMS was investigated. The analyzed uranium oxide particles (i.e. 9073-01-B, CRM U010 and NUSIMEP-7 test samples), having sizes down to the sub-micrometre range, are certified with respect to their (235)U/(238)U isotopic signature, which enabled evaluation of the applied strategies with respect to precision and accuracy. The different strategies were also compared with respect to their expanded uncertainties. Even though the 'point-by-point' method proved to be superior, the other methods are advantageous, as they take weighted signal intensities into account. For the first time, the use of a 'finite mixture model' is presented for the determination of an unknown number of different U isotopic compositions of single particles present on the same planchet. The model uses an algorithm that determines the number of isotopic signatures by attributing individual data points to computed clusters. The (235)U/(238)U isotope ratios are then determined by means of the slopes of linear regressions estimated for each cluster. The model was successfully applied for the accurate determination of different (235)U/(238)U isotope ratios of particles deposited on the NUSIMEP-7 test samples. ELECTRONIC SUPPLEMENTARY MATERIAL: The online version of this article (doi:10.1007/s00216-012-6674-3) contains supplementary material, which is available to authorized users.
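The three evaluation strategies named above differ only in how the per-sweep intensities are combined. A simplified sketch under stated assumptions (two aligned intensity channels, no background, dead-time or mass-bias corrections, and no uncertainty propagation, all of which a real evaluation would include):

```python
import numpy as np

def ratio_point_by_point(i235, i238):
    """'Point-by-point': mean of the per-sweep intensity ratios."""
    return float(np.mean(np.asarray(i235, float) / np.asarray(i238, float)))

def ratio_integration(i235, i238):
    """'Integration': ratio of the summed (integrated) intensities."""
    return float(np.sum(i235) / np.sum(i238))

def ratio_regression_slope(i235, i238):
    """'Linear regression slope': slope of the 235U vs 238U intensities
    fitted through the origin, which implicitly weights
    high-intensity sweeps more strongly."""
    x = np.asarray(i238, float)
    y = np.asarray(i235, float)
    return float(x @ y / (x @ x))
```

On noise-free data with a constant true ratio the three strategies agree exactly; they diverge once the transient signal is noisy, which is what the study's precision/accuracy comparison probes.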
Statistical analysis and significance testing of serial analysis of gene expression data using a Poisson mixture model
<p>Abstract</p> <p>Background</p> <p>Serial analysis of gene expression (SAGE) is used to obtain quantitative snapshots of the transcriptome. These profiles are count-based and are assumed to follow a Binomial or Poisson distribution. However, tag counts observed across multiple libraries (for example, one or more groups of biological replicates) have additional variance that cannot be accommodated by this assumption alone. Several models have been proposed to account for this effect, all of which utilize a continuous prior distribution to explain the excess variance. Here, a Poisson mixture model, which assumes excess variability arises from sampling a mixture of distinct components, is proposed and the merits of this model are discussed and evaluated.</p> <p>Results</p> <p>The goodness of fit of the Poisson mixture model on 15 sets of biological SAGE replicates is compared to the previously proposed hierarchical gamma-Poisson (negative binomial) model, and a substantial improvement is seen. In further support of the mixture model, we observe: 1) an increase in the number of mixture components needed to fit the expression of tags representing more than one transcript; and 2) a tendency for components to cluster libraries into the same groups. A confidence score is presented that can identify tags that are differentially expressed between groups of SAGE libraries. Several examples where this test outperforms those previously proposed are highlighted.</p> <p>Conclusion</p> <p>The Poisson mixture model performs well as a) a method to represent SAGE data from biological replicates, and b) a basis to assign significance when testing for differential expression between multiple groups of replicates. Code for the R statistical software package is included to assist investigators in applying this model to their own data.</p>
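The core idea, excess variance arising from a mixture of distinct Poisson components, can be fitted with a standard EM algorithm. A minimal sketch, not the paper's implementation (which also handles library sizes, grouping of replicates and selection of the number of components):

```python
import math

def poisson_mixture_em(counts, k=2, iters=200):
    """EM for a k-component Poisson mixture on tag counts.
    Returns (mixing weights, Poisson rates). Illustrative sketch."""
    n = len(counts)
    # spread initial rates across the observed range
    lam = [max(counts) * (j + 1) / (k + 1) + 1e-3 for j in range(k)]
    w = [1.0 / k] * k
    for _ in range(iters):
        # E-step: responsibilities from log Poisson densities (log-sum-exp)
        resp = []
        for c in counts:
            logp = [math.log(w[j]) + c * math.log(lam[j]) - lam[j]
                    - math.lgamma(c + 1) for j in range(k)]
            m = max(logp)
            p = [math.exp(v - m) for v in logp]
            s = sum(p)
            resp.append([v / s for v in p])
        # M-step: update weights and rates from the soft assignments
        for j in range(k):
            nj = sum(r[j] for r in resp)
            w[j] = nj / n
            lam[j] = sum(r[j] * c for r, c in zip(resp, counts)) / max(nj, 1e-12)
    return w, lam
```

On counts drawn from two well-separated expression levels, the fitted rates recover the low- and high-expression components, which is the behaviour the goodness-of-fit comparison against the gamma-Poisson model builds on.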
A comprehensive re-analysis of the Golden Spike data: Towards a benchmark for differential expression methods
<p>Abstract</p> <p>Background</p> <p>The Golden Spike data set has been used to validate a number of methods for summarizing Affymetrix data sets, sometimes with seemingly contradictory results. Much less use has been made of this data set to evaluate differential expression methods. It has been suggested that this data set should not be used for method comparison due to a number of inherent flaws.</p> <p>Results</p> <p>We have used this data set in a comparison of methods which is far more extensive than any previous study. We outline six stages in the analysis pipeline where decisions need to be made, and show how the results of these decisions can lead to the apparently contradictory results previously found. We also show that, while flawed, this data set is still a useful tool for method comparison, particularly for identifying combinations of summarization and differential expression methods that are unlikely to perform well on real data sets. We describe a new benchmark, AffyDEComp, that can be used for such a comparison.</p> <p>Conclusion</p> <p>We conclude with recommendations for preferred Affymetrix analysis tools, and for the development of future spike-in data sets.</p>
Accounting for uncertainty when assessing association between copy number and disease: a latent class model
<p>Abstract</p> <p>Background</p> <p>Copy number variations (CNVs) may play an important role in disease risk by altering dosage of genes and other regulatory elements, which may have functional and, ultimately, phenotypic consequences. Therefore, determining whether or not a CNV is associated with a given disease might be relevant in understanding the genesis and progression of human diseases. Current technologies give a CNV probe signal from which copy number status is inferred, so incorporating the uncertainty of CNV calling into the statistical analysis is highly important. In this paper, we present a framework for assessing association between CNVs and disease in case-control studies where this uncertainty is taken into account. We also indicate how to use the model to analyze continuous traits and adjust for confounding covariates.</p> <p>Results</p> <p>Through simulation studies, we show that our method outperforms other simple methods based on inferring the underlying CNV and assessing association using regular tests that do not propagate call uncertainty. We apply the method to a real data set from a controlled MLPA experiment with good results. The methodology is also extended to illustrate how to analyze aCGH data.</p> <p>Conclusion</p> <p>We demonstrate that our method is robust and achieves maximal theoretical power since it accommodates uncertainty when copy number status is inferred. We have made <monospace>R</monospace> functions freely available.</p>
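The paper's latent class model integrates over the uncertain copy number in the likelihood itself. A much simpler surrogate for the same idea, shown here purely for illustration (it is not the paper's method), replaces the single best-guess call with the posterior-mean dosage computed from the per-individual call probabilities, which can then be used as a covariate in an ordinary regression:

```python
def expected_copy_number(call_probs):
    """Posterior-mean dosage per individual from call probabilities
    [P(CN=0), P(CN=1), P(CN=2), ...], instead of the single most
    likely call. Illustrative helper, not the paper's full model."""
    return [sum(cn * p for cn, p in enumerate(probs)) for probs in call_probs]
```

An individual called CN=1 with probability 0.8 but CN=0/CN=2 with probability 0.1 each contributes a dosage of 1.0 but, unlike a hard call, an ambiguous individual (e.g. 0.5/0.5 between CN=1 and CN=2) contributes 1.5, so the downstream test is not misled by forced calls.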
Visualization of proteomics data using R and bioconductor.
Data visualization plays a key role in high-throughput biology. It is an essential tool for data exploration, shedding light on data structure and patterns of interest, and it is of paramount importance as a form of communicating data to a broad audience. Here, we provide a short overview of the application of the R software to the visualization of proteomics data. We present a summary of R's plotting systems and how they are used to visualize and understand raw and processed MS-based proteomics data.

LG was supported by the European Union 7th Framework Program (PRIME-XS project, grant agreement number 262067) and a BBSRC Strategic Longer and Larger grant (Award BB/L002817/1). LMB was supported by a BBSRC Tools and Resources Development Fund (Award BB/K00137X/1). TN was supported by an ERASMUS Placement scholarship.

This is the final published version of the article. It was originally published in Proteomics (PROTEOMICS Special Issue: Proteomics Data Visualisation, Volume 15, Issue 8, pages 1375-1389, April 2015. DOI: 10.1002/pmic.201400392). The final version is available at http://onlinelibrary.wiley.com/doi/10.1002/pmic.201400392/abstract