A general approach to simultaneous model fitting and variable elimination in response models for biological data with many more variables than observations
Background: With the advent of high-throughput biotechnology data acquisition platforms such as microarrays, SNP chips and mass spectrometers, data sets with many more variables than observations are now routinely being collected. Finding relationships between response variables of interest and variables in such data sets is an important problem akin to finding needles in a haystack. Whilst methods for a number of response types have been developed, a general approach has been lacking. Results: The major contribution of this paper is to present a unified methodology which allows many common (statistical) response models to be fitted to such data sets. The class of models includes virtually any model with a linear predictor in it, for example (but not limited to) multiclass logistic regression (classification), generalised linear models (regression) and survival models. A fast algorithm for finding sparse, well-fitting models is presented. The ideas are illustrated on real data sets with numbers of variables ranging from thousands to millions. R code implementing the ideas is available for download. Conclusion: The method described in this paper enables existing work on response models for cases with fewer variables than observations to be carried over to the situation where there are many more variables than observations. It is a powerful approach to finding parsimonious models for such data sets. The method is capable of handling problems with millions of variables and a large variety of response types within the one framework. The method compares favourably to existing methods such as support vector machines and random forests, but has the advantage of not requiring separate variable selection steps. It also works for data types which these methods were not designed to handle. The method usually produces very sparse models, which make biological interpretation simpler and more focused.
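The abstract above does not describe the algorithm itself, so the following is only a generic illustration of fitting a model and eliminating variables in one step when variables far outnumber observations. It swaps in L1-penalised logistic regression fitted by proximal gradient descent (ISTA), which is not claimed to be the paper's method. The function names, the penalty `lam`, the step size and the toy data are all assumptions made for this sketch.

```python
import math
import random


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def soft_threshold(v, t):
    """Proximal operator of the L1 penalty: shrink v towards zero by t."""
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0


def sparse_logistic_fit(X, y, lam=0.2, step=0.1, iters=200):
    """L1-penalised logistic regression (no intercept) via ISTA.

    X is a list of rows, y a list of 0/1 labels. Hypothetical helper,
    a stand-in for illustration and not the algorithm from the paper.
    """
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(iters):
        # Residuals of the current fit: predicted probability minus label.
        resid = [
            sigmoid(sum(b * x for b, x in zip(beta, row))) - yi
            for row, yi in zip(X, y)
        ]
        # Gradient of the mean logistic loss with respect to each coefficient.
        grad = [sum(r * row[j] for r, row in zip(resid, X)) / n for j in range(p)]
        # Gradient step followed by soft-thresholding, which sets the
        # coefficients of weakly associated variables exactly to zero.
        beta = [soft_threshold(b - step * g, step * lam) for b, g in zip(beta, grad)]
    return beta


def make_sparse_toy_data(n=30, p=100, seed=0):
    """Toy data with more variables than observations; only variable 0 matters."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    y = [1 if row[0] > 0 else 0 for row in X]
    return X, y
```

Calling `sparse_logistic_fit(*make_sparse_toy_data())` returns a coefficient list in which most entries are exactly zero, so the fitted model and the set of retained variables come out of the same optimisation.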
An Anomalous Type IV Secretion System in Rickettsia Is Evolutionarily Conserved
Bacterial type IV secretion systems (T4SSs) comprise a diverse transporter family functioning in conjugation, competence, and effector molecule (DNA and/or protein) translocation. Thirteen genome sequences from Rickettsia, obligate intracellular symbionts/pathogens of a wide range of eukaryotes, have revealed a reduced T4SS relative to the Agrobacterium tumefaciens archetype (vir). However, the Rickettsia T4SS has not been functionally characterized for its role in symbiosis/virulence, and none of its substrates are known. Superimposition of T4SS structural/functional information over previously identified Rickettsia components implicates a functional Rickettsia T4SS. virB4, virB8 and virB9 are duplicated, yet only one copy of each has the conserved features of similar genes in other T4SSs. An extraordinarily duplicated VirB6 gene encodes five hydrophobic proteins conserved only in a short region known to be involved in DNA transfer in A. tumefaciens. virB1, virB2 and virB7 are newly identified, revealing a Rickettsia T4SS lacking only virB5 relative to the vir archetype. Phylogeny estimation suggests vertical inheritance of all components, despite gene rearrangements into an archipelago of five islets. Similarities of Rickettsia VirB7/VirB9 to ComB7/ComB9 proteins of epsilon-proteobacteria, as well as phylogenetic affinities to the Legionella lvh T4SS, imply the Rickettsiales ancestor acquired a vir-like locus from distantly related bacteria, perhaps while residing in a protozoan host. Modern modifications of these systems likely reflect diversification with various eukaryotic host cells. We present the rvh (Rickettsiales vir homolog) T4SS, an evolutionarily conserved transporter with an unknown role in rickettsial biology. This work lays the foundation for future laboratory characterization of this system, and also identifies the Legionella lvh T4SS as a suitable genetic model.
Sparse feature learning using ensemble model for highly-correlated high-dimensional data
© Springer Nature Switzerland AG 2018. High-dimensional, highly correlated data exist in several domains such as genomics. Many feature selection techniques treat correlated features as redundant and therefore remove them. Several studies investigate the interpretation of correlated features in domains such as genomics, but the classification capabilities of correlated feature groups are also of interest in several domains. In this paper, a novel method is proposed that integrates ensemble feature ranking and co-expression networks to identify the optimal features for classification. The main advantage of the proposed method is that it does not treat correlated features as redundant; instead, it shows how important the selected correlated features are for improving classification performance. A series of experiments on five high-dimensional, highly correlated datasets with different imbalance ratios shows that the proposed method outperformed the state-of-the-art methods.
Combining Multiple Hypothesis Testing with Machine Learning Increases the Statistical Power of Genome-wide Association Studies
The standard approach to the analysis of genome-wide association studies (GWAS) is based on testing each position in the genome individually for statistical significance of its association with the phenotype under investigation. To improve the analysis of GWAS, we propose a combination of machine learning and statistical testing that accounts, in a mathematically well-controlled manner, for correlation structures within the set of SNPs under investigation. The novel two-step algorithm, COMBI, first trains a support vector machine to determine a subset of candidate SNPs and then performs hypothesis tests for these SNPs together with an adequate threshold correction. Applying COMBI to data from a WTCCC study (2007) and measuring performance as replication by independent GWAS published within the 2008–2015 period, we show that our method outperforms ordinary raw p-value thresholding as well as other state-of-the-art methods. COMBI presents higher power and precision than the examined alternatives while yielding fewer false (i.e. non-replicated) and more true (i.e. replicated) discoveries when its results are validated on later GWAS studies. More than 80% of the discoveries made by COMBI upon WTCCC data have been validated by independent studies. Implementations of the COMBI method are available as a part of the GWASpi toolbox 2.0.
EF acknowledges support from the advanced ERC grant (ERC-2011-AdG 295642-FEP) on the Foundation of Economic Preferences. MK, BM, and KRM were supported by the German National Science Foundation (DFG) under the grants MU 987/6-1 and RA 1894/1-1. TD and DS were supported by the German National Science Foundation (DFG) under the grants DI 1723/3-1 and SCHU 2828/2-1. GB and TS acknowledge support of the German National Science Foundation (DFG) under the research group grant FOR 1735. MK, DT, KRM, and GB acknowledge financial support by the FP7-ICT Programme of the European Community, under the PASCAL2 Network of Excellence. MK acknowledges a postdoctoral fellowship by the German Research Foundation (DFG), award KL 2698/2-1, and support from the Federal Ministry of Science and Education (BMBF), awards 031L0023A and 031B0187B. AN acknowledges support from the Spanish Multiple Sclerosis Network (REEM), of the Instituto de Salud Carlos III (RD12/0032/0011), the Spanish National Institute for Bioinformatics (PT13/0001/0026), the Spanish Government Grant BFU2012-38236 and from FEDER. This project has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No. 634143 (MedBioinformatics). MK and KRM were financially supported by the Ministry of Education, Science, and Technology, through the National Research Foundation of Korea under Grant R31-10008 (MK, KRM) and BK21 (KRM).
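The two-step procedure described in the COMBI abstract (train a support vector machine to pick candidate SNPs, then test only those candidates under a corrected threshold) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the GWASpi implementation: the linear SVM is trained with a Pegasos-style subgradient method, the per-SNP test is a correlation-based score test (n·r², treated as approximately chi-square with 1 degree of freedom), and a Bonferroni correction over the k candidates stands in for the "adequate threshold correction", which the abstract does not specify. All function names, parameters and the toy data are hypothetical.

```python
import math
import random


def train_linear_svm(X, y, lam=0.01, epochs=50, seed=0):
    """Linear SVM (no bias term) via Pegasos-style stochastic subgradient descent.

    X is a list of rows, y a list of labels in {-1, +1}.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    order = list(range(n))
    t = 0
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            t += 1
            eta = 1.0 / (lam * t)  # decaying step size
            margin = y[i] * sum(wj * xj for wj, xj in zip(w, X[i]))
            w = [(1.0 - eta * lam) * wj for wj in w]  # shrink (L2 penalty)
            if margin < 1.0:  # hinge loss is active for this sample
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
    return w


def score_test_p_value(x, y):
    """P-value of n * r^2 against a chi-square distribution with 1 df."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return 1.0  # constant column or phenotype: no evidence of association
    stat = n * sxy * sxy / (sxx * syy)
    # Survival function of chi-square with 1 df: P(Z^2 > stat) = erfc(sqrt(stat / 2)).
    return math.erfc(math.sqrt(stat / 2.0))


def combi_sketch(X, y, k, alpha=0.05):
    """Step 1: keep the k SNPs with largest |SVM weight|. Step 2: test only those."""
    w = train_linear_svm(X, y)
    candidates = sorted(range(len(w)), key=lambda j: -abs(w[j]))[:k]
    hits = []
    for j in candidates:
        p = score_test_p_value([row[j] for row in X], y)
        if p < alpha / k:  # Bonferroni over the k candidates (assumption)
            hits.append((j, p))
    return sorted(hits)


def make_toy_data(n=200, d=20, causal=3, seed=1):
    """Toy genotypes in {0, 1, 2}; the phenotype depends only on one SNP."""
    rng = random.Random(seed)
    X = [[rng.choice([0, 1, 2]) for _ in range(d)] for _ in range(n)]
    y = [1 if row[causal] >= 1 else -1 for row in X]
    return X, y
```

For example, `combi_sketch(*make_toy_data(), k=5)` returns a list of `(snp_index, p_value)` pairs for the candidates that pass the corrected threshold. Restricting the tests to k candidates is what allows a less severe correction than testing every SNP, which is the source of the power gain the abstract reports.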