325 research outputs found

    An Out-of-Core GPU based dimensionality reduction algorithm for Big Mass Spectrometry Data and its application in bottom-up Proteomics

    Modern high-resolution Mass Spectrometry instruments can generate millions of spectra in a single systems biology experiment. Each spectrum consists of thousands of peaks, but only a small number of peaks actively contribute to the deduction of peptides. Therefore, pre-processing of MS data to detect noisy and non-useful peaks is an active area of research. Most sequential noise-reducing algorithms are impractical to use as a pre-processing step due to their high time complexity. In this paper, we present a GPU-based dimensionality-reduction algorithm, called G-MSR, for MS2 spectra. Our proposed algorithm uses novel data structures which optimize the memory and computational operations inside the GPU. These novel data structures include Binary Spectra and Quantized Indexed Spectra (QIS). The former helps in communicating essential information between CPU and GPU using a minimum amount of data, while the latter enables us to store and process a complex 3-D data structure as a 1-D array while maintaining the integrity of the MS data. Our proposed algorithm also takes into account the limited memory of GPUs and switches between in-core and out-of-core modes based on the size of the input data. G-MSR achieves a peak speed-up of 386x over its sequential counterpart and is shown to process over a million spectra in just 32 seconds. The code for this algorithm is available as GPL open-source software on GitHub at the following link: https://github.com/pcdslab/G-MSR
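The idea of holding nested spectra in a flat 1-D array can be illustrated with a generic offset-indexed layout. This is only a minimal sketch under stated assumptions: the abstract does not describe the actual QIS layout, so the function names `flatten_spectra` and `get_spectrum` and the offsets scheme below are hypothetical and are not taken from G-MSR.

```python
from typing import List, Tuple

# A spectrum is modelled here as a list of (m/z, intensity) peaks.
# This representation is an assumption made for illustration only.
Spectrum = List[Tuple[float, float]]


def flatten_spectra(spectra: List[Spectrum]):
    """Pack variable-length spectra into flat arrays plus an offsets index.

    offsets[k] and offsets[k + 1] delimit the peaks of spectrum k, so the
    nested structure can be recovered without storing nested lists.
    """
    mz: List[float] = []
    intensity: List[float] = []
    offsets: List[int] = [0]
    for spectrum in spectra:
        for m, i in spectrum:
            mz.append(m)
            intensity.append(i)
        offsets.append(len(mz))
    return mz, intensity, offsets


def get_spectrum(mz, intensity, offsets, k: int) -> Spectrum:
    """Recover spectrum k from the flat arrays."""
    start, end = offsets[k], offsets[k + 1]
    return list(zip(mz[start:end], intensity[start:end]))


spectra = [[(100.0, 5.0), (200.0, 1.0)], [], [(150.5, 3.0)]]
mz, intensity, offsets = flatten_spectra(spectra)
```

A flat layout of this kind is a common choice for GPU work because contiguous arrays can be copied to device memory in a single transfer; whether G-MSR does exactly this is not stated in the abstract.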

    MS-REDUCE: An ultrafast technique for reduction of Big Mass Spectrometry Data for high-throughput processing

    Modern proteomics studies utilize high-throughput mass spectrometers which can produce data at an astonishing rate. These big Mass Spectrometry (MS) datasets can easily reach peta-scale level, creating storage and analytic problems for large-scale systems biology studies. Each spectrum consists of thousands of peaks which have to be processed to deduce the peptide. However, only a small percentage of peaks in a spectrum are useful for peptide deduction, as most of the peaks are either noise or not useful for a given spectrum. This redundant processing of non-useful peaks is a bottleneck for streaming high-throughput processing of big MS data. One way to reduce the amount of computation required in a high-throughput environment is to eliminate non-useful peaks. Existing noise-removing algorithms are limited in their data-reduction capability and are compute-intensive, making them unsuitable for big data and high-throughput environments. In this paper we introduce a novel low-complexity technique based on classification, quantization and sampling of MS peaks. We present a novel data-reductive strategy for analysis of Big MS data. Our algorithm, called MS-REDUCE, is capable of eliminating noisy peaks as well as peaks that do not contribute to peptide deduction before any peptide deduction is attempted. Our experiments have shown up to 100x speed-up over existing state-of-the-art noise elimination algorithms while maintaining comparable high-quality matches. Using our approach we were able to process a million spectra in just under an hour on a moderate server. The developed tool and strategy will be available to the wider proteomics and parallel computing community at the authors' webpages after the paper is published

    A hybrid algorithm for Bayesian network structure learning with application to multi-label learning

    We present a novel hybrid algorithm for Bayesian network structure learning, called H2PC. It first reconstructs the skeleton of a Bayesian network and then performs a Bayesian-scoring greedy hill-climbing search to orient the edges. The algorithm is based on divide-and-conquer constraint-based subroutines to learn the local structure around a target variable. We conduct two series of experimental comparisons of H2PC against Max-Min Hill-Climbing (MMHC), which is currently the most powerful state-of-the-art algorithm for Bayesian network structure learning. First, we use eight well-known Bayesian network benchmarks with various data sizes to assess the quality of the learned structure returned by the algorithms. Our extensive experiments show that H2PC outperforms MMHC in terms of goodness of fit to new data and quality of the network structure with respect to the true dependence structure of the data. Second, we investigate H2PC's ability to solve the multi-label learning problem. We provide theoretical results to characterize and identify graphically the so-called minimal label powersets that appear as irreducible factors in the joint distribution under the faithfulness condition. The multi-label learning problem is then decomposed into a series of multi-class classification problems, where each multi-class variable encodes a label powerset. H2PC is shown to compare favorably to MMHC in terms of global classification accuracy over ten multi-label data sets covering different application domains. Overall, our experiments support the conclusions that local structural learning with H2PC in the form of local neighborhood induction is a theoretically well-motivated and empirically effective learning framework that is well suited to multi-label learning. The source code (in R) of H2PC as well as all data sets used for the empirical tests are publicly available. Comment: arXiv admin note: text overlap with arXiv:1101.5184 by other authors
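The step of encoding a label powerset as a single multi-class variable can be sketched as follows. This is a minimal illustration of the plain label powerset transform applied to the whole label set; it does not reproduce the minimal label powersets that the abstract derives from the learned network structure, and the function name `label_powerset_encode` is hypothetical.

```python
from typing import Dict, FrozenSet, Iterable, List, Set, Tuple


def label_powerset_encode(
    label_rows: Iterable[Set[str]],
) -> Tuple[List[int], Dict[FrozenSet[str], int]]:
    """Map each distinct label combination to one multi-class id.

    Class ids are assigned in order of first appearance, so the result is
    deterministic for a given input order.
    """
    class_of: Dict[FrozenSet[str], int] = {}
    encoded: List[int] = []
    for row in label_rows:
        key = frozenset(row)
        if key not in class_of:
            class_of[key] = len(class_of)
        encoded.append(class_of[key])
    return encoded, class_of


# Four examples over the labels {"a", "b"}; the first and third share a label set.
rows = [{"a", "b"}, {"a"}, {"b", "a"}, set()]
encoded, class_of = label_powerset_encode(rows)
```

Any multi-class classifier can then be trained on `encoded`. Applying the same transform separately to each subset of labels, rather than to the full label set, would be the analogue of the decomposition described above.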

    Innovative Approaches To Identify Regulators Of Liver Regeneration

    The mammalian liver possesses a remarkable ability to regenerate after injury to prevent immediate organ failure. However, amid a rising global burden of liver disease, the only curative treatment for patients with end-stage liver disease is transplantation. Elucidating the mechanisms underlying tissue repair and regrowth will enable identification of therapeutic targets to stimulate native liver regeneration, thereby circumventing the great paucity of available transplant organs. Here, utilizing the Fah-/- mouse model of liver repopulation, I applied transcriptomic and epigenomic techniques to investigate the changes occurring as hepatocytes restore organ mass following toxic injury. By labeling ribosomal or nuclear envelope proteins, I performed the first extensive characterization of gene expression and chromatin landscape changes specifically in repopulating hepatocytes in response to injury. Transcriptomic analysis showed that repopulating hepatocytes highly upregulate Slc7a11, a gene that encodes the cystine/glutamate antiporter. I demonstrated that ectopic Slc7a11 expression promotes liver regeneration and Slc7a11 mutation inhibits hepatocyte replication. Integrative bioinformatics analyses of chromatin accessibility revealed dynamic changes at promoters and liver-enriched enhancer regions that correlate with the activation of proliferation-associated genes and the repression of transcripts expressed in mature, quiescent hepatocytes. Furthermore, changes in chromatin accessibility and gene expression are associated with increased promoter binding of CCCTC-binding factor (CTCF) and decreased enhancer occupancy of hepatocyte nuclear factor 4α (HNF4α). In summary, my thesis work identifies Slc7a11 as a potential driver of liver regeneration, and provides insights into the complex crosstalk between chromatin accessibility and transcription factor occupancy to regulate gene expression in repopulating hepatocytes

    Genomics and transcriptomics of the molting gland (Y-organ) in the blackback land crab, Gecarcinus lateralis

    Includes bibliographical references. 2016 Summer. Molting is required for growth and development in crustaceans. In the blackback land crab Gecarcinus lateralis, molting is stimulated by ecdysteroids, hormones produced in the Y-organ (YO). Throughout the molting cycle, the YO demonstrates phenotypic plasticity. The phenotypic plasticity is correlated with the stages of the molt cycle, during which YO ecdysteroid production varies. During intermolt, the longest stage of the molt cycle, the circulating ecdysteroid titers are low and molting is suppressed. In preparation for molting, the YO increases ecdysteroid production during premolt. Circulating ecdysteroids continue to rise, dropping right before ecdysis and remaining low in the subsequent postmolt period. During the molt cycle, the YO's sensitivity to inhibitory cues also varies, which contributes to ecdysteroid fluctuations. To better understand how changes in gene expression modulate the YO's phenotypic plasticity, a YO transcriptome from five molt stages was generated. Using over 5.6 million reads from Illumina, 229,278 contigs were assembled to comprise the reference transcriptome. By comparing expression levels of the transcripts between the molt stages, 13,189 unique differentially expressed contigs were identified in G. lateralis. Based on differential expression, insect hormone biosynthesis and oxidative phosphorylation pathways were enriched, validating the YO transcriptome identity. Using GO enrichment, MAP kinase was identified as a possible candidate gene for regulating YO ecdysteroid synthesis. To complement and validate the transcriptome, claw muscle genomic DNA was sequenced and assembled using 2.6 million reads. 375,152 scaffolds ≥ 500 bp were built, with an N50 of 1,841 bp. Using k-mer frequencies, the genome size was estimated to be 3.07 Gb, similar to that of mammalian vertebrates. The median gene size of G. lateralis was approximated to be 6,300 bp; the disparity between the median estimate and the N50 prohibited further computational analysis. Genome scaffolds were sufficient in length for manual comparison. Alignment of the transcriptome and genome sequences of the Rheb gene showed 100% nucleotide alignment in the open reading frame, and extended the sequence by 7.7 fold, including the identification of four introns. The sequence comparison validated both genome and transcriptome assemblies and extended the gene sequence. Next-generation sequencing provided us with a global perspective of molecular variations within the YO throughout the molt cycle. We hypothesize variations in gene expression regulate YO phenotypic plasticity by varying ecdysteroid production. YO transitions throughout molting are essential for regulation. YO activation and commitment, both corresponding to increased ecdysteroids, are required to induce ecdysis. YO repression, during which circulating ecdysteroid titers are low, is needed to prevent precocious molting. Identifying changes in gene expression and key regulatory elements correlating with variations in YO phenotype will increase our understanding of molt cycle regulation, which is critical for crustacean development, growth, and repair
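For readers unfamiliar with the N50 statistic quoted for the scaffolds, the sketch below computes it using the usual definition: the length L such that scaffolds of length at least L account for at least half of the total assembled bases. The function name `n50` and the toy lengths are illustrative assumptions, not data from the thesis.

```python
from typing import Iterable


def n50(lengths: Iterable[int]) -> int:
    """Return the N50 of a collection of contig or scaffold lengths."""
    ordered = sorted(lengths, reverse=True)
    if not ordered:
        raise ValueError("n50 requires at least one length")
    total = sum(ordered)
    running = 0
    for length in ordered:
        running += length
        # Stop at the first length where the cumulative sum reaches half
        # of the total; integer arithmetic avoids rounding issues.
        if 2 * running >= total:
            return length
    raise AssertionError("unreachable for non-empty input")


toy_lengths = [2, 3, 4, 5, 6]  # total is 20, so half of the total is 10
toy_n50 = n50(toy_lengths)
```

An N50 well below the estimated median gene size, as reported above, means that most genes would span several scaffolds, which is consistent with the stated limits on automated analysis.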

    Profiling of Chronic Myeloid Neoplasms through RNA Sequencing Analysis

    Background: Chronic myeloid leukemia (CML) is a clonal myeloproliferative disorder resulting from the neoplastic transformation of hematopoietic stem cells. A higher proportion of CML patients in Qatar (54%) failed to respond to treatment compared with patients internationally (34%). Classification of the different stages of the disease is done in the clinic but is not sufficient; there is a great need to find markers that can be used to stratify the stages of CML and to monitor patients who are undergoing treatment. Aims and Objectives: For this study, we aimed to identify the disease-specific profile of genes (transcripts) by analyzing RNA sequencing data from patients recruited at two different stages of CML, i.e. Chronic Phase (CP) and Complete Remission (CR). Our objective was to see whether we could identify a disease-specific group of transcripts. Methods: A total of 16 subjects were recruited for the study, which included nine patients in chronic phase (CP), three in complete remission (CR), and four healthy individuals (controls, CNT). RNA extracted from frozen PBMCs or fresh blood was used for RNA sequencing and analysis using Hiseq Illumina 4000. Results and Discussions: We identified a subset of 23 transcripts and classified them into four different groups. We discovered that eight transcripts (MAZ_4, RIT1_3, BUB1_4, CFLAR_5, PARVG_6, SENP5_6, GATS_1, TAOK3_4) were significantly and differentially expressed in the chronic phase patients only, whereas five transcripts segregated the complete remission patients from the rest of the study cohort: HNRNPA3_4, SLC4A7_3, TBC1D4_8, CTC1_1, GRAMD1A_6. The following five transcripts were differentially expressed in both groups of patients compared to the control group: U2SURP_5, SEPT9_2, DPY19L3_8, CYTH4_1, PLXNB2_8. 
The transcripts shared between the CP and CR groups might indicate that study subjects in complete remission did not fully recover and might need closer monitoring. Validating this observation will require longitudinal follow-up of some patients in complete remission, using other diagnostic methods to examine the ratio of the BCR-ABL1 fusion gene and the expression level of the transcripts shared between the CP and CR groups. Conclusion: This study provides sets of promising genes (transcripts) that have the potential to be used to stratify CML patients into complete remission and chronic phase groups. These findings have potential for patient follow-up and for determining the treatment dose more specifically. Furthermore, some of these differentially expressed genes (transcripts) might be potential therapeutic targets or could be used as prognostic biomarkers

    Algorithms for integrated analysis of glycomics and glycoproteomics by LC-MS/MS

    The glycoproteome is an intricate and diverse component of a cell, and it plays a key role in the definition of the interface between that cell and the rest of its world. Methods for studying the glycoproteome have been developed for released glycan glycomics and site-localized bottom-up glycoproteomics using liquid chromatography-coupled mass spectrometry and tandem mass spectrometry (LC-MS/MS), which is itself a complex problem. Algorithms for interpreting these data are necessary to be able to extract biologically meaningful information in a high throughput, automated context. Several existing solutions have been proposed but may be found lacking for larger glycopeptides, for complex samples, different experimental conditions, different instrument vendors, or even because they simply ignore fundamentals of glycobiology. I present a series of open algorithms that approach the problem from an instrument vendor neutral, cross-platform fashion to address these challenges, and integrate key concepts from the underlying biochemical context into the interpretation process. In this work, I created a suite of deisotoping and charge state deconvolution algorithms for processing raw mass spectra at an LC scale from a variety of instrument types. These tools performed better than previously published algorithms by enforcing the underlying chemical model more strictly, while maintaining a higher degree of signal fidelity. From this summarized, vendor-normalized data, I composed a set of algorithms for interpreting glycan profiling experiments that can be used to quantify glycan expression. From this I constructed a graphical method to model the active biosynthetic pathways of the sample glycome and dig deeper into those signals than would be possible from the raw data alone. 
Lastly, I created a glycopeptide database search engine from these components which is capable of identifying the widest array of glycosylation types available, and I demonstrate a learning algorithm that can be used to tune the model to better capture the process of glycopeptide fragmentation under specific experimental conditions, outperforming a simpler model by between 10% and 15%. This approach can be further augmented with sample-wide or site-specific glycome models to increase depth-of-coverage for glycoforms consistent with prior beliefs