5 research outputs found

    Sparse Proteomics Analysis - A compressed sensing-based approach for feature selection and classification of high-dimensional proteomics mass spectrometry data

    Background: High-throughput proteomics techniques, such as mass spectrometry (MS)-based approaches, produce very high-dimensional data-sets. In a clinical setting one is often interested in how mass spectra differ between patients of different classes, for example spectra from healthy patients vs. spectra from patients having a particular disease. Machine learning algorithms are needed to (a) identify these discriminating features and (b) classify unknown spectra based on this feature set. Since the acquired data are usually noisy, the algorithms should be robust against noise and outliers, while the identified feature set should be as small as possible.

    Results: We present a new algorithm, Sparse Proteomics Analysis (SPA), based on the theory of compressed sensing, that allows us to identify a minimal discriminating set of features from mass spectrometry data-sets. We show (1) how our method performs on artificial and real-world data-sets, (2) that its performance is competitive with standard (and widely used) algorithms for analyzing proteomics data, and (3) that it is robust against random and systematic noise. We further demonstrate the applicability of our algorithm to two previously published clinical data-sets.
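The abstract does not spell out the SPA algorithm itself, so the following is only a rough sketch of the general idea it describes: a sparsity-inducing (l1-penalised) linear classifier applied to standardized spectra, which retains a small set of discriminating m/z features. The use of scikit-learn's LogisticRegression, the synthetic data, and all variable names are assumptions for illustration, not part of the published method.

```python
# Hedged illustration only: an l1-penalised logistic regression as a stand-in
# for compressed-sensing-style sparse feature selection on MS data.
# X (n_spectra x n_mz_bins) and y (class labels) are synthetic placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.poisson(5.0, size=(100, 2000)).astype(float)  # synthetic "spectra"
y = rng.integers(0, 2, size=100)                      # synthetic class labels

X_std = StandardScaler().fit_transform(X)

# The l1 penalty drives most coefficients to exactly zero, leaving a small
# discriminating feature set; C controls how aggressively features are pruned.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
clf.fit(X_std, y)

selected = np.flatnonzero(clf.coef_.ravel())
print(f"{selected.size} of {X.shape[1]} m/z bins retained")
```

In a setup like this, shrinking C prunes the retained feature set further, mirroring the stated goal of a minimal discriminating set that is robust to noisy spectra.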

    Applications of Machine Learning: From Single Cell Biology to Algorithmic Fairness

    It is common practice to obtain answers to complex questions by analyzing large amounts of data. Formal modeling and careful mathematical definitions are essential to extracting relevant answers from data, and establishing a mathematical framework requires deliberate interdisciplinary collaboration between the specialists who provide the questions and the mathematicians who translate them. This dissertation details the results of two of these interdisciplinary collaborations: one in single cell RNA sequencing, and the other in fairness.

    High-throughput microfluidic protocols in single cell RNA sequencing (scRNA-seq) collect integer-valued mRNA counts from many individual cells in a single experiment; this enables high-resolution studies of rare cell types and cell development pathways. ScRNA-seq data are sparse: often 90% of the collected reads are zeros. Specialized methods are required to obtain solutions to biological questions from these sparse, integer-valued data. Determining genetic markers that can identify specific cell populations is one of the major objectives of the analysis of mRNA count data. We introduce RANKCORR, a fast method with robust mathematical underpinnings that performs multi-class marker selection. RANKCORR proceeds by ranking the mRNA count data before linearly separating the ranked data using a small number of genes. Ranking scRNA-seq count data provides a reasonable non-parametric method for analyzing these data; we further include an analysis of the statistical properties of this rank transformation. We compare the performance of RANKCORR to a variety of other marker selection methods. These experiments show that RANKCORR is consistently one of the top-performing marker selection methods on scRNA-seq data, though other methods show similar overall performance. This suggests that the speed of the algorithm is the most important consideration for large data sets. RANKCORR is efficient and able to handle the largest data sets; as such, it is a useful tool for dealing with high-throughput scRNA-seq data.

    The second collaboration combines state-of-the-art machine learning methods with formal definitions of fairness. Machine learning methods have a tendency to preserve or exacerbate biases that exist in data; consequently, the algorithms that influence our daily lives often display biases against certain protected groups. It is both objectionable and often illegal to allow daily decisions (e.g., mortgage approvals, job advertisements) to disadvantage protected groups; a growing body of literature in the field of algorithmic fairness aims to mitigate these issues. We contribute two methods towards this goal. We first introduce a preprocessing method designed to debias the training data. Specifically, the method attempts to remove any variation in the original data that comes from protected group status. This is accomplished by leveraging knowledge of groups that we expect to receive similar outcomes from a fair algorithm. We further present a method for training a classifier (from potentially biased data) that is both accurate and fair using the gradient boosting framework. Gradient boosting is a powerful method for constructing predictive models that can be superior to neural networks on tabular data; the development of a fair gradient boosting method is thus desirable for the adoption of fair methods.
    Moreover, the method that we present is designed to construct predictors that are fair at an individual level; that is, two comparable individuals will be assigned similar results. This is different from most existing fair algorithms, which ensure fairness at a statistical level.

    PhD, Mathematics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/163215/1/ahsvargo_1.pd
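The RANKCORR description above (rank the mRNA counts, then linearly separate the ranked data using a small number of genes) can be illustrated with a minimal sketch. This is not the published RANKCORR implementation: it substitutes an l1-penalised one-vs-rest logistic regression for the paper's own solver, and the synthetic counts, labels, and parameter choices are assumptions made only for the example.

```python
# Minimal sketch of the rank-then-sparsely-separate idea, NOT RANKCORR itself.
import numpy as np
from scipy.stats import rankdata
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
counts = rng.poisson(0.3, size=(500, 1000))   # sparse synthetic scRNA-seq counts
labels = rng.integers(0, 3, size=500)         # synthetic cell-type labels

# Rank-transform within each cell; the many tied zeros share an average rank,
# which is what makes the transform robust to heavy zero inflation.
ranked = rankdata(counts.astype(float), axis=1)

clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
markers = {}
for cls in np.unique(labels):
    # One-vs-rest: sparse linear separation of this cell type from the rest.
    clf.fit(ranked, (labels == cls).astype(int))
    markers[cls] = np.flatnonzero(clf.coef_.ravel())

for cls, genes in markers.items():
    print(f"class {cls}: {genes.size} candidate marker genes")
```

The nonzero coefficients per class play the role of the "small number of genes" mentioned in the abstract; the actual method's selection rule and guarantees are given in the dissertation, not here.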
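The preprocessing idea in the fairness part of the abstract (removing variation in the data that comes from protected-group status) can likewise be sketched with a generic group-mean-centering transform. This is an illustration under that assumption only, not the dissertation's actual method; the function name center_by_group and the synthetic data are invented for the example.

```python
# Hedged illustration of preprocessing-style debiasing: remove the component of
# each feature explained by protected-group membership by centering every
# feature on its group mean. Generic transform, not the dissertation's method.
import numpy as np

def center_by_group(X: np.ndarray, group: np.ndarray) -> np.ndarray:
    """Subtract each protected group's feature means, then add back the
    global means so the overall scale of the data is preserved."""
    X_debiased = X.astype(float).copy()
    global_mean = X_debiased.mean(axis=0)
    for g in np.unique(group):
        mask = group == g
        X_debiased[mask] -= X_debiased[mask].mean(axis=0)
    return X_debiased + global_mean

rng = np.random.default_rng(2)
group = rng.integers(0, 2, size=200)                  # protected attribute
X = rng.normal(size=(200, 5)) + group[:, None] * 0.8  # group-shifted features
X_fair = center_by_group(X, group)
# Group means now coincide, so a downstream model cannot pick up the shift.
print(np.abs(X_fair[group == 0].mean(0) - X_fair[group == 1].mean(0)).max())
```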