1,814 research outputs found

    Artificial Bee Colony With Differential Evolution Algorithm For Feature Extraction And Selection Of Mass Spectrometry Data

    Advances in mass spectrometry techniques for proteomic studies have accelerated the discovery of biomarkers from quantitative proteomic patterns. High-throughput data for a given molecule can give rise to a series of inter-related and overlapping peaks in a mass spectrum. The spectrum also suffers from high dimensionality relative to the small sample size. Several studies have proposed statistical and machine learning techniques such as Principal Component Analysis (PCA), Independent Component Analysis (ICA), and wavelet coefficients to extract potential features. However, none of these methods takes into account the huge number of features relative to the small sample size. This study focused on two stages of mass spectrometry analysis. First, feature extraction methods extract peaks as potential features from which to infer the biological meaning of the data. Shrinkage estimation of covariance was proposed to assemble m/z windows and identify the correlation coefficients among peaks of mass spectrometry data for feature extraction. Second, feature selection techniques search for parsimonious features through a learning model that yields the most accurate results.
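As a rough illustration of the shrinkage idea mentioned in the abstract above, the following Python sketch shrinks the off-diagonal entries of a sample covariance matrix toward zero and then reads the correlation between two peaks from the result. The diagonal shrinkage target, the fixed intensity `lam`, the toy data, and all function names are assumptions made for illustration; the abstract does not say which shrinkage estimator the study uses.

```python
import math

def sample_covariance(rows):
    """Unbiased sample covariance; one spectrum per row, one peak per column."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    return [[sum((r[i] - means[i]) * (r[j] - means[j]) for r in rows) / (n - 1)
             for j in range(p)]
            for i in range(p)]

def shrink_covariance(cov, lam):
    """Shrink toward a diagonal target: variances are kept,
    covariances are scaled by (1 - lam). lam is a hypothetical fixed intensity."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    p = len(cov)
    return [[cov[i][j] if i == j else (1.0 - lam) * cov[i][j] for j in range(p)]
            for i in range(p)]

def correlation(cov, i, j):
    """Correlation coefficient between peaks i and j implied by a covariance matrix."""
    return cov[i][j] / math.sqrt(cov[i][i] * cov[j][j])

# Toy data (hypothetical): three spectra, two perfectly correlated peaks.
spectra = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
cov = sample_covariance(spectra)
shrunk = shrink_covariance(cov, lam=0.5)
```

With few samples and many peaks, the raw sample covariance is noisy, and shrinkage trades a little bias for lower variance. In practice the intensity would be estimated from the data rather than fixed by hand.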

    Inferential stability in systems biology

    The modern biological sciences are fraught with statistical difficulties. Biomolecular stochasticity, experimental noise, and the “large p, small n” problem all contribute to the challenge of data analysis. Nevertheless, we routinely seek to draw robust, meaningful conclusions from observations. In this thesis, we explore methods for assessing the effects of data variability upon downstream inference, in an attempt to quantify and promote the stability of the inferences we make. We start with a review of existing methods for addressing this problem, focusing upon the bootstrap and similar methods. The key requirement for all such approaches is a statistical model that approximates the data generating process. We move on to consider biomarker discovery problems. We present a novel algorithm for proposing putative biomarkers on the strength of both their predictive ability and the stability with which they are selected. In a simulation study, we find our approach to perform favourably in comparison to strategies that select on the basis of predictive performance alone. We then consider the real problem of identifying protein peak biomarkers for HAM/TSP, an inflammatory condition of the central nervous system caused by HTLV-1 infection. We apply our algorithm to a set of SELDI mass spectral data, and identify a number of putative biomarkers. Additional experimental work, together with known results from the literature, provides corroborating evidence for the validity of these putative biomarkers. Having focused on static observations, we then make the natural progression to time course data sets. We propose a (Bayesian) bootstrap approach for such data, and then apply our method in the context of gene network inference and the estimation of parameters in ordinary differential equation models. 
We find that the inferred gene networks are relatively unstable, and demonstrate the importance of finding distributions of ODE parameter estimates, rather than single point estimates.
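The idea of judging candidate biomarkers by how stably they are selected under resampling can be sketched as follows. This is a generic illustration, not the thesis's algorithm: it uses a plain stratified bootstrap (not the Bayesian bootstrap), a simple difference-in-class-means score, and hypothetical function names and toy data.

```python
import random

def mean_difference_scores(rows, labels):
    """Score each feature by the absolute difference between the two class means."""
    scores = []
    for j in range(len(rows[0])):
        a = [r[j] for r, y in zip(rows, labels) if y == 0]
        b = [r[j] for r, y in zip(rows, labels) if y == 1]
        scores.append(abs(sum(a) / len(a) - sum(b) / len(b)))
    return scores

def selection_frequencies(rows, labels, k, n_resamples=200, seed=0):
    """Fraction of bootstrap resamples in which each feature ranks in the top k."""
    rng = random.Random(seed)
    p = len(rows[0])
    by_class = {c: [i for i, y in enumerate(labels) if y == c] for c in (0, 1)}
    counts = [0] * p
    for _ in range(n_resamples):
        # Resample with replacement within each class so both classes stay present.
        idx = []
        for members in by_class.values():
            idx.extend(rng.choices(members, k=len(members)))
        scores = mean_difference_scores([rows[i] for i in idx],
                                        [labels[i] for i in idx])
        top_k = sorted(range(p), key=lambda j: scores[j], reverse=True)[:k]
        for j in top_k:
            counts[j] += 1
    return [c / n_resamples for c in counts]

# Toy data (hypothetical): feature 0 separates the classes, features 1-4 are noise.
data_rng = random.Random(1)
labels = [0] * 10 + [1] * 10
rows = [[10.0 * y + data_rng.uniform(-1, 1)]
        + [data_rng.uniform(-1, 1) for _ in range(4)]
        for y in labels]
freq = selection_frequencies(rows, labels, k=2)
```

A feature that is selected in nearly every resample (here, feature 0) is a more credible candidate than one that reaches the top k only occasionally, even if both look predictive on the full data set.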

    Machine Learning-based Classification of Diffuse Large B-cell Lymphoma Patients by Their Protein Expression Profiles

    Characterization of tumors at the molecular level has improved our knowledge of cancer causation and progression. Proteomic analysis of their signaling pathways promises to enhance our understanding of cancer aberrations at the functional level, but this requires accurate and robust tools. Here, we develop a state-of-the-art quantitative mass spectrometric pipeline to characterize formalin-fixed paraffin-embedded tissues of patients with closely related subtypes of diffuse large B-cell lymphoma. We combined a super-SILAC approach with label-free quantification (hybrid LFQ) to address situations where the protein is absent in the super-SILAC standard but present in the patient samples. Shotgun proteomic analysis on a quadrupole Orbitrap quantified almost 9,000 tumor proteins in 20 patients. The quantitative accuracy of our approach allowed the segregation of diffuse large B-cell lymphoma patients according to their cell of origin using both their global protein expression patterns and the 55-protein signature obtained previously from patient-derived cell lines (Deeb, S. J., D'Souza, R. C., Cox, J., Schmidt-Supprian, M., and Mann, M. (2012) Mol. Cell. Proteomics 11, 77-89). Expression levels of individual segregation-driving proteins as well as categories such as extracellular matrix proteins behaved consistently with known trends between the subtypes. We used machine learning (support vector machines) to extract candidate proteins with the highest segregating power. A panel of four proteins (PALD1, MME, TNFAIP8, and TBC1D4) is predicted to classify patients with low error rates. Highly ranked proteins from the support vector analysis revealed differential expression of core signaling molecules between the subtypes, elucidating aspects of their pathobiology.

    Proteome Profiling of Breast Tumors by Gel Electrophoresis and Nanoscale Electrospray Ionization Mass Spectrometry

    We have conducted proteome-wide analysis of fresh surgery specimens derived from breast cancer patients, using an approach that integrates size-based intact protein fractionation, nanoscale liquid separation of peptides, electrospray ion trap mass spectrometry, and bioinformatics. Through this approach, we have acquired a large amount of peptide fragmentation spectra from size-resolved fractions of the proteomes of several breast tumors, tissue peripheral to the tumor, and samples from patients undergoing noncancer surgery. Label-free quantitation was used to generate protein abundance maps for each proteome and perform comparative analyses. The mass spectrometry data revealed distinct qualitative and quantitative patterns distinguishing the tumors from healthy tissue as well as differences between metastatic and non-metastatic human breast cancers including many established and potential novel candidate protein biomarkers. Selected proteins were evaluated by Western blotting using tumors grouped according to histological grade, size, and receptor expression but differing in nodal status. Immunohistochemical analysis of a wide panel of breast tumors was conducted to assess expression in different types of breast cancers and the cellular distribution of the candidate proteins. These experiments provided further insights and an independent validation of the data obtained by mass spectrometry and revealed the potential of this approach for establishing multimodal markers for early metastasis, therapy outcomes, prognosis, and diagnosis in the future. © 2008 American Chemical Society

    Meta-Analysis of MS-Based Proteomics Studies Indicates Interferon Regulatory Factor 4 and Nucleobindin1 as Potential Prognostic and Drug Resistance Biomarkers in Diffuse Large B Cell Lymphoma

    Funding: Rune Matthiesen is supported by Fundação para a Ciência e a Tecnologia (CEEC position, 2019–2025 investigator). This article is a result of the projects (iNOVA4Health, UIDB/04462/2020), supported by Lisboa Portugal Regional Operational Programme (Lisboa2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF). This work is also funded by FEDER funds through the COMPETE 2020 Programme and National Funds through FCT (Portuguese Foundation for Science and Technology) under the project numbers PTDC/BTM-TEC/30087/2017 and PTDC/BTM-TEC/30088/2017.
    The prognosis of diffuse large B cell lymphoma (DLBCL) is inaccurately predicted using clinical features and immunohistochemistry (IHC) algorithms. Nomination of a panel of molecules as the target for therapy and predicting prognosis in DLBCL is challenging because of the divergences in the results of molecular studies. Mass spectrometry (MS)-based proteomics in the clinic represents an analytical tool with the potential to improve DLBCL diagnosis and prognosis. Previous proteomics studies using MS-based proteomics identified a wide range of proteins. To achieve a consensus, we reviewed MS-based proteomics studies and extracted the most consistently significantly dysregulated proteins. These proteins were then further explored by analyzing data from other omics fields. Among all significantly regulated proteins, interferon regulatory factor 4 (IRF4) was identified as a potential target by proteomics, genomics, and IHC. Moreover, annexinA5 (ANXA5) and nucleobindin1 (NUCB1) were two of the most up-regulated proteins identified in MS studies. Functional enrichment analysis identified the light zone reactions of the germinal center (LZ-GC) together with cytoskeleton locomotion functions as enriched based on consistent, significantly dysregulated proteins. In this study, we suggest IRF4 and NUCB1 proteins as potential biomarkers that deserve further investigation in the field of DLBCL sub-classification and prognosis.

    Biomarker identification in HIV and non-HIV related lymphomas

    DLBCL is the most common lymphoma subtype occurring in older populations as well as in younger HIV-infected patients. The current treatment options for DLBCL are effective for most patients, yet the relapse rate is high. While many biomarkers for DLBCL exist, they are not in clinical use due to low sensitivity and specificity. In addition, these biomarkers have not been studied in the HIV context. Therefore, the identification of new biomarkers for HIV-negative and HIV-positive DLBCL may lead to a better understanding of the disease pathology and better therapeutic design. Initially, differences in the clinicopathological features between HIV-negative and HIV-positive DLBCL patients were determined by conducting a retrospective study of patients treated at GSH. Subsequently, potential protein biomarkers for DLBCL were determined using MALDI imaging mass spectrometry (IMS) and characterised using LCMS. The expression of one of the biomarkers, heat shock protein (Hsp) 70, was confirmed on a separate cohort of samples using immunohistochemistry. Our results indicate that the clinicopathological features for HIV-negative and HIV-positive DLBCL are similar except for median age and frequency of elevated LDH levels. Several clinicopathological factors were prognostic for all DLBCL cases, including age, gender, stage and bone marrow involvement. In addition, tumour extranodal site was a prognostic indicator for the HIV-negative cohort. The biomarkers identified in the study consisted of four protein clusters: glycolytic enzymes, ribosomal proteins, histones and collagen. These proteins could differentiate between control and tumour tissue, and between the DLBCL subtypes in both cohorts. The majority (41/52) of samples in the confirmation cohort were negative for Hsp70 expression. The HIV-positive DLBCL cohort had a higher percentage of cases expressing Hsp70 than its HIV-negative counterpart. The non-GC subtype also frequently overexpressed Hsp70, confirming the MALDI IMS data. Expression of Hsp70 correlated with poor outcome in the HIV-negative cohort. In conclusion, this study identified potential biomarkers for HIV-negative and HIV-positive DLBCL from both clinical and molecular sources. These may be used as diagnostic and prognostic markers complementary to current clinical management for DLBCL.

    Knowledge management overview of feature selection problem in high-dimensional financial data: Cooperative co-evolution and Map Reduce perspectives

    The term big data characterizes the massive amounts of data generated by advanced technologies in different domains using the 4Vs (volume, velocity, variety, and veracity) to indicate the amount of data that can only be processed via computationally intensive analysis, the speed of their creation, the different types of data, and their accuracy. High-dimensional financial data, such as time-series and space-time data, contain a large number of features (variables) while having a small number of samples, and are used to measure various real-time business situations for financial organizations. Such datasets are normally noisy, and complex correlations may exist between their features. Many domains, including finance, lack the analytic tools to mine the data for knowledge discovery because of the high dimensionality. Feature selection is an optimization problem: find a minimal subset of relevant features that maximizes classification accuracy and reduces computation. Traditional statistics-based feature selection approaches are not adequate to deal with the curse of dimensionality associated with big data. Cooperative co-evolution, a meta-heuristic algorithm and a divide-and-conquer approach, decomposes high-dimensional problems into smaller sub-problems. Further, MapReduce, a programming model, offers a ready-to-use distributed, scalable, and fault-tolerant infrastructure for parallelizing the developed algorithm. This article presents a knowledge management overview of evolutionary feature selection approaches, state-of-the-art cooperative co-evolution and MapReduce-based feature selection techniques, and future research directions.
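The divide-and-conquer idea behind cooperative co-evolution can be illustrated with a heavily simplified Python sketch. It assumes a toy fitness function (reward for relevant features, penalty for irrelevant ones), a single mutated candidate per sub-problem instead of a full sub-population, and hypothetical names throughout. It is not any specific algorithm from the article, and it omits the MapReduce parallelization entirely.

```python
import random

def fitness(mask, relevant):
    """Toy objective: +1 per selected relevant feature, -0.5 per selected irrelevant one."""
    return sum((1.0 if i in relevant else -0.5)
               for i, bit in enumerate(mask) if bit)

def cooperative_coevolution(n_features, n_groups, relevant, generations=30, seed=0):
    rng = random.Random(seed)
    # Decompose the feature indices into n_groups disjoint sub-problems.
    indices = list(range(n_features))
    groups = [indices[g::n_groups] for g in range(n_groups)]
    # The context vector holds the current best choice for every sub-problem.
    context = [rng.randint(0, 1) for _ in range(n_features)]
    history = [fitness(context, relevant)]
    for _ in range(generations):
        for group in groups:
            # Mutate only this group's bits; all other groups stay fixed,
            # so the candidate is evaluated in cooperation with them.
            candidate = context[:]
            for i in group:
                if rng.random() < 0.3:
                    candidate[i] = 1 - candidate[i]
            if fitness(candidate, relevant) >= fitness(context, relevant):
                context = candidate
        history.append(fitness(context, relevant))
    return context, history

# Toy run (hypothetical): 40 features, the first 5 are relevant, 4 sub-problems.
relevant = set(range(5))
best_mask, history = cooperative_coevolution(40, 4, relevant)
```

Each sub-problem searches a space of size 2^10 instead of 2^40, which is the point of the decomposition. A real implementation would evaluate fitness with a classifier and would have to deal with features that interact across groups.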

    A Bayesian network approach to feature selection in mass spectrometry data

    One of the key goals of current cancer research is the identification of biologic molecules that allow non-invasive detection of existing cancers or cancer precursors. One way to begin this process of biomarker discovery is by using time-of-flight mass spectroscopy to identify proteins or other molecules in tissue or serum that correlate to certain cancers. However, there are many difficulties associated with the output of such experiments. The distribution of protein abundances in a population is unknown, the mass spectroscopy measurements have high variability, and high correlations between variables cause problems with popular methods of data mining. To mitigate these issues, Bayesian inductive methods, combined with non-model-dependent information theory scoring, are used to find feature sets and build classifiers for mass spectroscopy data from blood serum. Such methods show improvement over existing measures, and naturally incorporate measurement uncertainties. Resulting Bayesian network models are applied to three blood serum data sets: one artificially generated, one from a 2004 leukemia study, and another from a 2007 prostate cancer study. Feature sets obtained appear to show sufficient stability under cross-validation to provide not only biomarker candidates but also families of features for further biochemical analysis.