27 research outputs found

    MDQC: a new quality assessment method for microarrays based on quality control reports

    Motivation: The process of producing microarray data involves multiple steps, some of which may suffer from technical problems and seriously damage the quality of the data. Thus, it is essential to identify those arrays with low quality. This article addresses two questions: (1) how to assess the quality of a microarray dataset using the measures provided in quality control (QC) reports; (2) how to identify possible sources of the quality problems. Results: We propose a novel multivariate approach to evaluate the quality of an array that examines the ‘Mahalanobis distance’ of its quality attributes from those of other arrays. Thus, we call it Mahalanobis Distance Quality Control (MDQC) and examine different approaches of this method. MDQC flags problematic arrays based on the idea of outlier detection, i.e. it flags those arrays whose quality attributes jointly depart from those of the bulk of the data. Using two case studies, we show that a multivariate analysis gives substantially richer information than analyzing each parameter of the QC report in isolation. Moreover, once the QC report is produced, our quality assessment method is computationally inexpensive and the results can be easily visualized and interpreted. Finally, we show that computing these distances on subsets of the quality measures in the report may increase the method's ability to detect unusual arrays and helps to identify possible reasons for the quality problems. Availability: The library to implement MDQC will soon be available from Bioconductor. Contact: [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.
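The outlier-detection idea in this abstract can be illustrated with a minimal sketch. It assumes the classical sample mean and covariance, whereas the abstract only says that different approaches of the method are examined, so this should not be read as the authors' implementation. The function names, the toy QC values, and the `cutoff` parameter are hypothetical.

```python
def mean_vector(rows):
    """Column-wise mean of a list of equal-length numeric rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def covariance(rows, mean):
    """Sample covariance matrix (denominator n - 1)."""
    n, p = len(rows), len(mean)
    cov = [[0.0] * p for _ in range(p)]
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        for i in range(p):
            for j in range(p):
                cov[i][j] += dev[i] * dev[j] / (n - 1)
    return cov


def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def squared_mahalanobis(rows):
    """Squared Mahalanobis distance of each row from the sample mean."""
    mean = mean_vector(rows)
    cov = covariance(rows, mean)
    distances = []
    for row in rows:
        dev = [x - m for x, m in zip(row, mean)]
        z = solve(cov, dev)  # z = cov^{-1} dev
        distances.append(sum(d * zi for d, zi in zip(dev, z)))
    return distances


def flag_arrays(rows, cutoff):
    """Indices of arrays whose squared distance exceeds a user-chosen cutoff."""
    return [i for i, d2 in enumerate(squared_mahalanobis(rows)) if d2 > cutoff]


# Hypothetical QC report: one row per array, two quality attributes each.
qc_report = [
    [1.0, 2.0],
    [1.2, 2.1],
    [0.9, 1.8],
    [1.1, 2.2],
    [1.0, 1.9],
    [3.0, 0.5],  # an array whose attributes jointly depart from the rest
]
flagged = flag_arrays(qc_report, cutoff=3.0)
```

Classical estimates are themselves pulled toward outliers, so a robust estimator of location and scatter would be a natural substitute for `mean_vector` and `covariance`. Applying the same functions to a subset of the columns corresponds to the abstract's suggestion of computing distances on subsets of the quality measures.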

    Predicting sepsis severity at first clinical presentation: The role of endotypes and mechanistic signatures

    BACKGROUND: Inter-individual variability during sepsis limits appropriate triage of patients. Identifying, at first clinical presentation, gene expression signatures that predict subsequent severity will allow clinicians to identify the most at-risk groups of patients and enable appropriate antibiotic use. METHODS: Blood RNA-Seq and clinical data were collected from 348 patients in four emergency rooms (ER) and one intensive-care-unit (ICU), and 44 healthy controls. Gene expression profiles were analyzed using machine learning and data mining to identify clinically relevant gene signatures reflecting disease severity, organ dysfunction, mortality, and specific endotypes/mechanisms. FINDINGS: Gene expression signatures were obtained that predicted severity/organ dysfunction and mortality in both ER and ICU patients with accuracy/AUC of 77–80%. Network analysis revealed these signatures formed a coherent biological program, with specific but overlapping mechanisms/pathways. Given the heterogeneity of sepsis, we asked if patients could be assorted into discrete groups with distinct mechanisms (endotypes) and varying severity. Patients with early sepsis could be stratified into five distinct and novel mechanistic endotypes, named Neutrophilic-Suppressive/NPS, Inflammatory/INF, Innate-Host-Defense/IHD, Interferon/IFN, and Adaptive/ADA, each based on ∼200 unique gene expression differences, and distinct pathways/mechanisms (e.g., IL6/STAT3 in NPS). Endotypes had varying overall severity with two severe (NPS/INF) and one relatively benign (ADA) groupings, consistent with reanalysis of previous endotype studies. A 40 gene-classification tool (accuracy=96%) and several gene-pairs (accuracy=89–97%) accurately predicted endotype status in both ER and ICU validation cohorts. INTERPRETATION: The severity and endotype signatures indicate that distinct immune signatures precede the onset of severe sepsis and lethality, providing a method to triage early sepsis patients

    PGCA: An algorithm to link protein groups created from MS/MS data

    The quantitation of proteins using shotgun proteomics has gained popularity in the last decades, simplifying sample handling procedures, removing extensive protein separation steps and achieving a relatively high throughput readout. The process starts with the digestion of the protein mixture into peptides, which are then separated by liquid chromatography and sequenced by tandem mass spectrometry (MS/MS). At the end of the workflow, recovering the identity of the proteins originally present in the sample is often a difficult and ambiguous process, because more than one protein identifier may match a set of peptides identified from the MS/MS spectra. To address this identification problem, many MS/MS data processing software tools combine all plausible protein identifiers matching a common set of peptides into a protein group. However, this solution introduces new challenges in studies with multiple experimental runs, which can be characterized by three main factors: i) protein groups’ identifiers are local, i.e., they vary run to run, ii) the composition of each group may change across runs, and iii) the supporting evidence of proteins within each group may also change across runs. Since in general there is no conclusive evidence about the absence of proteins in the groups, protein groups need to be linked across different runs in subsequent statistical analyses. We propose an algorithm, called Protein Group Code Algorithm (PGCA), to link groups from multiple experimental runs by forming global protein groups from connected local groups. The algorithm is computationally inexpensive and enables the connection and analysis of lists of protein groups across runs needed in biomarkers studies. We illustrate the identification problem and the stability of the PGCA mapping using 65 iTRAQ experimental runs. Further, we use two biomarker studies to show how PGCA enables the discovery of relevant candidate protein group markers with similar but non-identical compositions in different runs.
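The linking step can be sketched as a connected-components computation. The sketch below assumes that two local groups are "connected" when they share at least one protein identifier, which is one plausible reading of the abstract and not a confirmed detail of PGCA. The function name, the input layout, and the accessions are hypothetical.

```python
def link_protein_groups(runs):
    """Assign a global group code to every local protein group.

    runs maps run_id -> {local_group_id: set of protein identifiers}.
    Returns {(run_id, local_group_id): global_code}. Local groups that
    share a protein, directly or through a chain of other groups, get
    the same code (assumed definition of "connected").
    """
    parent = {}

    def find(x):
        # Walk to the root, shortening the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    first_seen = {}  # protein identifier -> first local group containing it
    for run_id, groups in runs.items():
        for group_id, proteins in groups.items():
            key = (run_id, group_id)
            parent[key] = key
            for protein in proteins:
                if protein in first_seen:
                    union(first_seen[protein], key)
                else:
                    first_seen[protein] = key

    codes = {}   # root -> global code, numbered in order of first appearance
    result = {}
    for key in parent:
        root = find(key)
        codes.setdefault(root, len(codes) + 1)
        result[key] = codes[root]
    return result


# Hypothetical local groups from three runs; group ids are local to each run.
example_runs = {
    "run1": {1: {"P1", "P2"}, 2: {"P3"}},
    "run2": {1: {"P2", "P4"}, 2: {"P5"}},
    "run3": {7: {"P4"}, 8: {"P3", "P6"}},
}
global_codes = link_protein_groups(example_runs)
```

In this toy input, group 1 of run1, group 1 of run2 and group 7 of run3 end up in one global group even though run1 and run3 share no protein directly, because run2 connects them. The union-find structure keeps the cost close to linear in the total number of protein memberships, in line with the abstract's remark that the algorithm is computationally inexpensive.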

    A computational pipeline for the development of multi-marker bio-signature panels and ensemble classifiers

    Background: Biomarker panels derived separately from genomic and proteomic data and with a variety of computational methods have demonstrated promising classification performance in various diseases. An open question is how to create effective proteo-genomic panels. The framework of ensemble classifiers has been applied successfully in various analytical domains to combine classifiers so that the performance of the ensemble exceeds the performance of individual classifiers. Using blood-based diagnosis of acute renal allograft rejection as a case study, we address the following question in this paper: Can acute rejection classification performance be improved by combining individual genomic and proteomic classifiers in an ensemble? Results: The first part of the paper presents a computational biomarker development pipeline for genomic and proteomic data. The pipeline begins with data acquisition (e.g., from bio-samples to microarray data), quality control, statistical analysis and mining of the data, and finally various forms of validation. The pipeline ensures that the various classifiers to be combined later in an ensemble are diverse and adequate for clinical use. Five mRNA genomic and five proteomic classifiers were developed independently using single time-point blood samples from 11 acute-rejection and 22 non-rejection renal transplant patients. The second part of the paper examines five ensembles ranging in size from two to 10 individual classifiers. Performance of ensembles is characterized by area under the curve (AUC), sensitivity, and specificity, as derived from the probability of acute rejection for individual classifiers in the ensemble in combination with one of two aggregation methods: (1) Average Probability or (2) Vote Threshold. 
One ensemble demonstrated superior performance and was able to improve sensitivity and AUC beyond the best values observed for any of the individual classifiers in the ensemble, while staying within the range of observed specificity. The Vote Threshold aggregation method achieved improved sensitivity for all 5 ensembles, but typically at the cost of decreased specificity. Conclusion: Proteo-genomic biomarker ensemble classifiers show promise in the diagnosis of acute renal allograft rejection and can improve classification performance beyond that of individual genomic or proteomic classifiers alone. Validation of our results in an international multicenter study is currently underway.
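The two aggregation methods named in this abstract can be sketched as follows. The abstract does not give the probability cutoff or the number of votes required, so `cutoff` and `min_votes` below are hypothetical parameters, and the function names are illustrative only.

```python
def average_probability(probabilities, cutoff=0.5):
    """Call rejection when the mean predicted probability reaches the cutoff.

    The cutoff value is an assumption; the abstract does not state one.
    """
    return sum(probabilities) / len(probabilities) >= cutoff


def vote_threshold(probabilities, min_votes=1, cutoff=0.5):
    """Call rejection when at least min_votes classifiers vote for it.

    A classifier votes for rejection when its own probability reaches
    the cutoff. Both parameters are assumptions for illustration.
    """
    votes = sum(1 for p in probabilities if p >= cutoff)
    return votes >= min_votes


# Hypothetical probabilities of acute rejection from three classifiers
# for one patient: one classifier is confident, two are not.
patient = [0.7, 0.2, 0.3]
by_average = average_probability(patient)        # mean is about 0.4
by_single_vote = vote_threshold(patient, min_votes=1)
by_two_votes = vote_threshold(patient, min_votes=2)
```

With a low `min_votes`, a single confident classifier is enough to call rejection, which is consistent with the abstract's observation that Vote Threshold tends to raise sensitivity while lowering specificity.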