
    Variable selection via Lasso with high-dimensional proteomic data

    Multiclass classification with high-dimensional data is an active applied topic in both statistics and machine learning, and the classification can be approached in many ways. In this thesis, we review the theory of the Lasso procedure, which estimates model parameters while simultaneously achieving dimension reduction through the sparsity-inducing property of the L1 norm. The elastic net and sparse group lasso penalties are also reviewed. Our data are high-dimensional proteomic measurements (iTRAQ ratios) from breast cancer patients spanning four breast cancer subtypes. We train classifiers with multinomial logistic regression and compare models using false-classification rates obtained from cross-validation.
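    The core mechanism the thesis reviews can be illustrated with a minimal sketch: an L1-penalized multinomial logistic regression on synthetic high-dimensional data (not the thesis's iTRAQ data), showing that the L1 norm drives many coefficients exactly to zero, so estimation and variable selection happen in one step. All data and settings below are illustrative assumptions.

```python
# Sketch: L1-penalized (Lasso-style) multinomial logistic regression with
# p >> n, illustrating the variable selection induced by the L1 norm.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic high-dimensional data: 120 samples, 500 features, 4 classes
X, y = make_classification(n_samples=120, n_features=500, n_informative=20,
                           n_classes=4, n_clusters_per_class=1, random_state=0)

# L1 penalty -> many coefficients shrunk exactly to zero
clf = LogisticRegression(penalty="l1", solver="saga", C=0.5, max_iter=5000)
clf.fit(X, y)

sparsity = np.mean(clf.coef_ == 0)                       # fraction of zeros
cv_error = 1 - cross_val_score(clf, X, y, cv=5).mean()   # false-classification rate
print(f"fraction of zero coefficients: {sparsity:.2f}")
print(f"cross-validated error: {cv_error:.2f}")
```

Varying `C` (the inverse regularization strength) traces out the Lasso path: smaller values select fewer variables at the cost of more shrinkage.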

    An adaptive ensemble learner function via bagging and rank aggregation with applications to high dimensional data.

    An ensemble consists of a set of individual predictors whose predictions are combined. Different classification and regression models tend to work well for different types of data, and it is usually not known in advance which algorithm will be optimal for a given application. In this thesis, an ensemble regression function adapted from Datta et al. (2010) is presented. The ensemble is constructed by combining bagging with rank aggregation, which allows its behavior to adapt to the type of data being analyzed. In the classification setting, results can be optimized with respect to performance measures such as accuracy, sensitivity, specificity, and area under the curve (AUC); in the regression setting, they can be optimized with respect to measures such as mean squared error and mean absolute error. The resulting ensemble classifier and ensemble regressor perform at the level of the best individual classifier or regression model. For complex high-dimensional datasets, it may therefore be advisable to combine several classification or regression algorithms rather than relying on a single one.
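    A hedged sketch of the bagging-plus-rank-aggregation idea follows. On each bootstrap sample, candidate classifiers are scored out-of-bag on a chosen performance measure and ranked; ranks are then aggregated across bootstraps (a simple Borda count here, which is a deliberate simplification of the weighted rank aggregation in Datta et al., 2010). The candidate models and data are our own illustrative choices.

```python
# Sketch: adaptive model choice via bagging + rank aggregation (Borda count).
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

candidates = {
    "logistic": lambda: LogisticRegression(max_iter=1000),
    "tree": lambda: DecisionTreeClassifier(random_state=0),
    "knn": lambda: KNeighborsClassifier(),
}

n_boot = 25
borda = {name: 0 for name in candidates}
for _ in range(n_boot):
    idx = rng.integers(0, len(y), len(y))        # bootstrap sample (with replacement)
    oob = np.setdiff1d(np.arange(len(y)), idx)   # out-of-bag indices for scoring
    scores = {}
    for name, make in candidates.items():
        model = make().fit(X[idx], y[idx])
        scores[name] = model.score(X[oob], y[oob])   # accuracy; could be AUC etc.
    # Rank candidates on this bootstrap: the best model gets the highest rank
    for rank, name in enumerate(sorted(scores, key=scores.get)):
        borda[name] += rank

ranking = sorted(borda, key=borda.get, reverse=True)
print("aggregated ranking (best first):", ranking)
```

Swapping the scoring call for sensitivity, specificity, or AUC changes which model the aggregation favors, which is the "adaptive" aspect the abstract describes.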

    Challenges in the Analysis of Mass-Throughput Data: A Technical Commentary from the Statistical Machine Learning Perspective

    Sound data analysis is critical to the success of modern molecular medicine research that involves the collection and interpretation of mass-throughput data. The novel nature and high dimensionality of such datasets pose a series of nontrivial data analysis problems. This technical commentary discusses the problems of over-fitting, error estimation, the curse of dimensionality, causal versus predictive modeling, integration of heterogeneous types of data, and the lack of standard protocols for data analysis. We attempt to shed light on the nature and causes of these problems and to outline viable methodological approaches for overcoming them.
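    One standard remedy for the over-fitting and error-estimation problems raised above is nested cross-validation: hyperparameters are tuned only in an inner loop, so the outer loop yields an estimate of generalization error that is not biased by the tuning. The sketch below, on invented p >> n data, is our illustration rather than the commentary's own code.

```python
# Sketch: nested cross-validation to avoid optimistic error estimates.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import SVC

# p >> n, as is typical of mass-throughput data
X, y = make_classification(n_samples=150, n_features=1000, n_informative=10,
                           random_state=0)

inner = GridSearchCV(SVC(), {"C": [0.1, 1, 10]}, cv=3)  # inner loop: tuning only
outer_scores = cross_val_score(inner, X, y, cv=5)       # outer loop: error estimate
print(f"nested-CV accuracy: {outer_scores.mean():.2f} "
      f"(+/- {outer_scores.std():.2f})")
```

Tuning `C` on the same folds used to report accuracy would leak information and typically inflate the estimate, which is exactly the over-fitting trap the commentary warns against.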

    Gene Function Prediction from Functional Association Networks Using Kernel Partial Least Squares Regression

    With the growing availability of large-scale biological datasets, automated methods for extracting functionally meaningful information from them are becoming increasingly important. Data on functional associations between genes or proteins, such as co-expression, are often represented as gene or protein networks, and several methods of predicting gene function from these networks have been proposed. However, evaluating the relative performance of these algorithms is not trivial: concerns have been raised over biases in different benchmarking methods and datasets, particularly relating to the non-independence of functional association data and test data. In this paper we propose Compass, a new network-based gene function prediction algorithm that combines a commute-time kernel with partial least squares regression. We compare Compass to GeneMANIA, a leading network-based prediction algorithm, on a number of different benchmarks, and find that Compass outperforms GeneMANIA on them. We also explicitly explore problems associated with the non-independence of functional association data and test data, and find that a benchmark based on the Gene Ontology database, which directly or indirectly incorporates information from other databases, may considerably overestimate the performance of algorithms that exploit functional association data for prediction.

    CMA – a comprehensive Bioconductor package for supervised classification with high dimensional data

    For the last eight years, microarray-based class prediction has been a major topic in statistics, bioinformatics, and biomedical research. Traditional methods often yield unsatisfactory results, or may even be inapplicable, in the p > n setting where the number of predictors far exceeds the number of observations, hence the term “ill-posed problem”. Careful model selection and evaluation that satisfies accepted good-practice standards is a complex task for inexperienced users with limited statistical background, and for statisticians without experience in this area. The multiplicity of available methods for class prediction from high-dimensional data is an additional practical challenge for inexperienced researchers. In this article, we introduce CMA (“Classification for MicroArrays”), a new Bioconductor package that automates variable selection, parameter tuning, classifier construction, and unbiased evaluation of the constructed classifiers for a large number of commonly used methods. With little time and effort, users obtain an overview of the unbiased accuracy of most top-performing classifiers. Furthermore, the standardized evaluation framework underlying CMA can also benefit statistical research for comparison purposes, for instance when a new classifier must be compared to existing approaches. CMA is a user-friendly, comprehensive package for classifier construction and evaluation that implements most common approaches. It is freely available from the Bioconductor website at http://bioconductor.org/packages/2.3/bioc/html/CMA.html
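    CMA itself is an R/Bioconductor package, so the sketch below does not use its API; it shows the analogous workflow (variable selection, parameter tuning, classifier construction, unbiased evaluation) as a scikit-learn pipeline, which is the same discipline the abstract describes: selection and tuning must happen inside each evaluation fold, or the reported accuracy is biased.

```python
# Sketch of a CMA-like workflow in scikit-learn (not CMA's own API):
# selection + tuning nested inside cross-validation for unbiased evaluation.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.pipeline import Pipeline

# Microarray-like p >> n data (synthetic stand-in)
X, y = make_classification(n_samples=100, n_features=2000, n_informative=15,
                           random_state=0)

pipe = Pipeline([
    ("select", SelectKBest(f_classif)),              # variable selection
    ("clf", LogisticRegression(max_iter=2000)),      # classifier construction
])
grid = GridSearchCV(pipe, {"select__k": [50, 200],   # parameter tuning
                           "clf__C": [0.1, 1.0]}, cv=3)

# Selection and tuning are refit inside each outer fold -> unbiased estimate
scores = cross_val_score(grid, X, y, cv=5)
print(f"cross-validated accuracy: {scores.mean():.2f}")
```

Running `SelectKBest` on the full dataset before cross-validating, by contrast, would leak test-fold information into the selected variables, which is the classic pitfall such packages are designed to prevent.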

    Extending principal covariates regression for high-dimensional multi-block data

    This dissertation addresses the challenge of deciphering extensive datasets collected from multiple sources, such as health habits and genetic information, in the context of studying complex issues like depression. A data analysis method known as Principal Covariates Regression (PCovR) provides a strong basis for meeting this challenge. Yet analyzing these intricate datasets is far from straightforward. The data often contain redundant and irrelevant variables, making it difficult to extract meaningful insights. Furthermore, the data may involve different types of outcome variables (for instance, depression could be measured as a score on a depression scale or as a binary diagnosis (yes/no) from a medical professional), adding another layer of complexity. To overcome these obstacles, this dissertation proposes novel adaptations of PCovR. The methods automatically select important variables, distinguish insights originating from a single source from those spanning multiple sources, and accommodate various types of outcome variables. Their effectiveness is demonstrated in predicting outcomes and revealing subtle relationships within data from multiple sources. The dissertation also offers a glimpse of future directions for enhancing PCovR: implications of extending the method to select important variables are critically examined, and an algorithm with the potential to yield optimal results is suggested. In conclusion, this dissertation proposes methods to tackle the complexity of large multi-source data and points toward opportunities for the next line of research.

    Machine Learning and Integrative Analysis of Biomedical Big Data.

    Recent developments in high-throughput technologies have accelerated the accumulation of massive amounts of omics data from multiple sources: genome, epigenome, transcriptome, proteome, metabolome, etc. Traditionally, data from each source (e.g., the genome) are analyzed in isolation using statistical and machine learning (ML) methods. Integrative analysis of multi-omics and clinical data is key to new biomedical discoveries and to advancing precision medicine, but data integration poses new computational challenges and exacerbates those associated with single-omics studies. Specialized computational approaches are required to perform integrative analysis of biomedical data from diverse modalities effectively and efficiently. In this review, we discuss state-of-the-art ML-based approaches for tackling five specific computational challenges associated with integrative analysis: the curse of dimensionality, data heterogeneity, missing data, class imbalance, and scalability.
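    Three of the challenges listed above can be made concrete with a small sketch of "early" (feature-level) integration: two synthetic omics blocks are concatenated, missing measurements are imputed, and class imbalance is countered with class weighting. This is our illustration of common baseline techniques, not code from the review.

```python
# Sketch: early integration of two omics blocks with imputation and
# class weighting (missing data + class imbalance handled explicitly).
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 200
genome = rng.standard_normal((n, 30))                # block 1 (e.g., genomic features)
proteome = rng.standard_normal((n, 20))              # block 2 (e.g., proteomic features)
proteome[rng.random((n, 20)) < 0.1] = np.nan         # ~10% missing measurements
y = (rng.random(n) < 0.15).astype(int)               # imbalanced labels (~15% positive)

X = np.hstack([genome, proteome])                    # early (feature-level) integration
X = SimpleImputer(strategy="mean").fit_transform(X)  # simple missing-data handling

clf = LogisticRegression(class_weight="balanced", max_iter=1000)
scores = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy")
print(f"balanced accuracy: {scores.mean():.2f}")
```

Mean imputation and feature concatenation are deliberately naive baselines; the review's point is precisely that heterogeneous modalities and non-random missingness usually call for more specialized integration methods than this.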