556 research outputs found

    Feature selection and nearest centroid classification for protein mass spectrometry

    BACKGROUND: The use of mass spectrometry as a proteomics tool is poised to revolutionize early disease diagnosis and biomarker identification. Unfortunately, before standard supervised classification algorithms can be employed, the "curse of dimensionality" needs to be solved. Due to the sheer amount of information contained within the mass spectra, most standard machine learning techniques cannot be directly applied. Instead, feature selection techniques are used to first reduce the dimensionality of the input space and thus enable the subsequent use of classification algorithms. This paper examines feature selection techniques for proteomic mass spectrometry. RESULTS: This study examines the performance of the nearest centroid classifier coupled with the following feature selection algorithms. Student's t-test, the Kolmogorov-Smirnov test, and the P-test are univariate statistics used for filter-based feature ranking. From the wrapper approaches we tested sequential forward selection and a modified version of sequential backward selection. Embedded approaches included shrunken nearest centroid and a novel version of boosting-based feature selection that we developed. In addition, we tested several dimensionality reduction approaches, namely principal component analysis and principal component analysis coupled with linear discriminant analysis. To fairly assess each algorithm, evaluation was done using stratified cross-validation with an internal leave-one-out cross-validation loop for automated feature selection. Comprehensive experiments, conducted on five popular cancer data sets, revealed that the less advocated sequential forward selection and boosted feature selection algorithms produce the most consistent results across all data sets. In contrast, the state-of-the-art performance reported on isolated data sets for several of the studied algorithms does not hold across all data sets. 
CONCLUSION: This study tested a number of popular feature selection methods using the nearest centroid classifier and found that several reportedly state-of-the-art algorithms in fact perform rather poorly when tested via stratified cross-validation. The revealed inconsistencies provide clear evidence that algorithm evaluation should be performed on several data sets using a consistent (i.e., non-randomized, stratified) cross-validation procedure in order for the conclusions to be statistically sound.
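
The evaluation idea in this abstract (filter-based ranking feeding a nearest centroid classifier, scored by stratified cross-validation) can be illustrated with a minimal sketch. It is not the authors' implementation: the Welch form of the t-statistic, the fixed number of selected features, the round-robin fold assignment, and the synthetic data from the hypothetical helper `make_toy_data` are all assumptions made for illustration, and the internal leave-one-out loop that the abstract uses to choose the number of features is omitted. The point it shows is that features are re-ranked inside each training fold, so the held-out samples never influence the selection.

```python
import math
import random
from collections import defaultdict


def welch_t(a, b):
    """Absolute Welch t-statistic between two lists of values."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    va = sum((x - ma) ** 2 for x in a) / (len(a) - 1)
    vb = sum((x - mb) ** 2 for x in b) / (len(b) - 1)
    denom = math.sqrt(va / len(a) + vb / len(b))
    return abs(ma - mb) / denom if denom > 0 else 0.0


def rank_features(X, y, k):
    """Indices of the k features with the largest |t| (labels must be 0/1)."""
    scores = []
    for j in range(len(X[0])):
        a = [row[j] for row, label in zip(X, y) if label == 0]
        b = [row[j] for row, label in zip(X, y) if label == 1]
        scores.append((welch_t(a, b), j))
    return [j for _, j in sorted(scores, reverse=True)[:k]]


def fit_centroids(X, y, features):
    """Per-class mean vector restricted to the selected features."""
    sums = defaultdict(lambda: [0.0] * len(features))
    counts = defaultdict(int)
    for row, label in zip(X, y):
        counts[label] += 1
        for i, j in enumerate(features):
            sums[label][i] += row[j]
    return {c: [s / counts[c] for s in sums[c]] for c in counts}


def predict(centroids, row, features):
    """Label of the centroid closest to `row` in squared Euclidean distance."""
    def dist2(c):
        return sum((row[j] - m) ** 2 for j, m in zip(features, centroids[c]))
    return min(centroids, key=dist2)


def stratified_folds(y, n_folds):
    """Deterministic stratified folds: deal each class's indices round-robin."""
    folds = [[] for _ in range(n_folds)]
    by_class = defaultdict(list)
    for i, label in enumerate(y):
        by_class[label].append(i)
    pos = 0
    for label in sorted(by_class):
        for i in by_class[label]:
            folds[pos % n_folds].append(i)
            pos += 1
    return folds


def cross_validate(X, y, k_features, n_folds=5):
    """Accuracy with feature ranking redone inside every training fold."""
    correct = 0
    for test_idx in stratified_folds(y, n_folds):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        feats = rank_features(Xtr, ytr, k_features)
        cents = fit_centroids(Xtr, ytr, feats)
        correct += sum(predict(cents, X[i], feats) == y[i] for i in test_idx)
    return correct / len(y)


def make_toy_data(n_per_class=20, n_features=50, informative=3, shift=4.0, seed=0):
    """Hypothetical synthetic data: Gaussian noise, one feature shifted in class 1."""
    rng = random.Random(seed)
    X, y = [], []
    for label in (0, 1):
        for _ in range(n_per_class):
            row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
            row[informative] += shift * label
            X.append(row)
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    print("cross-validated accuracy:", cross_validate(X, y, k_features=1))
```

Ranking features once on the full data set and then cross-validating only the classifier would leak information from the test folds, which is one way that optimistic results on isolated data sets can arise.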

    Comparison of Supervised Classification Methods for Protein Profiling in Cancer Diagnosis

    A key challenge in clinical proteomics of cancer is the identification of biomarkers that could allow detection, diagnosis and prognosis of the disease. Recent advances in mass spectrometry and proteomic instrumentation offer a unique chance to rapidly identify these markers. These advances pose considerable challenges, similar to those created by microarray-based investigation, for the discovery of patterns of markers from high-dimensional data that are specific to each pathologic state (e.g. normal vs cancer). We propose a three-step strategy to select important markers from high-dimensional mass spectrometry data using surface-enhanced laser desorption/ionization (SELDI) technology. The first two steps are the selection of the most discriminating biomarkers with a construction of different classifiers. Finally, we compare and validate their performance and robustness using different supervised classification methods such as Support Vector Machine, Linear Discriminant Analysis, Quadratic Discriminant Analysis, Neural Networks, Classification Trees and Boosting Trees. We show that the proposed method is suitable for analysing high-throughput proteomics data and that the combination of logistic regression and Linear Discriminant Analysis outperforms the other methods tested.

    Statistical Analysis of Chemical Sensor Data

    Double Backpropagation with Applications to Robustness and Saliency Map Interpretability

    This thesis concerns work related to double backpropagation, a phenomenon that arises when first-order optimization methods are applied to a neural network loss function that itself contains derivatives. Its connection to robustness and saliency map interpretability is explained.

    A Review on Dimension Reduction Techniques in Data Mining

    Real-world data, such as images and speech signals, are high-dimensional, requiring many dimensions to represent each item. Higher-dimensional data make it harder to detect and exploit the relationships among terms. Dimensionality reduction is a technique for reducing the complexity of analyzing high-dimensional data. Many methodologies are used to find the critical dimensions of a dataset, which significantly reduces the number of dimensions relative to the original input data. Dimensionality reduction methods are of two types: feature extraction and feature selection. Feature extraction is a distinct form of dimensionality reduction that extracts important features from the input dataset. Two approaches are available for dimensionality reduction: supervised and unsupervised. The purpose of this survey is to provide an adequate understanding of the dimensionality reduction techniques that currently exist and to discuss the applicability of each method, which depends on the given set of parameters and varying conditions. This paper surveys the schemes most widely used for dimensionality reduction, mainly on high-dimensional datasets. A comparative analysis of the surveyed methodologies is also presented, on the basis of which the best methodology for a certain type of dataset can be chosen. Keywords: Data Mining, Dimensionality Reduction, Clustering, feature selection; curse of dimensionality; critical dimension

    Genetic Algorithms for Feature Selection and Classification of Complex Chromatographic and Spectroscopic Data

    A basic methodology for analyzing large multivariate chemical data sets based on feature selection is proposed. Each chromatogram or spectrum is represented as a point in a high-dimensional measurement space. A genetic algorithm for feature selection and classification is applied to the data to identify features that optimize the separation of the classes in a plot of the two or three largest principal components of the data. A good principal component plot can only be generated using features whose variance or information is primarily about differences between classes in the data. Hence, feature subsets that maximize the ratio of between-class to within-class variance are selected by the pattern recognition genetic algorithm. Furthermore, the structure of the data set can be explored; for example, new classes can be discovered by simply tuning various parameters of the fitness function of the pattern recognition genetic algorithm. The proposed method has been validated on a wide range of data. A two-step procedure for pattern recognition analysis of spectral data has been developed. First, wavelets are used to denoise and deconvolute spectral bands by decomposing each spectrum into wavelet coefficients, which represent the sample's constituent frequencies. Second, the pattern recognition genetic algorithm is used to identify wavelet coefficients characteristic of the class. This method was employed in several studies involving spectral library searching. In one study, a search pre-filter to detect the presence of carboxylic acids from vapor-phase infrared spectra, which had previously eluded prominent researchers, was successfully formulated and validated. In another study, this same approach was used to develop a pattern recognition assisted infrared library searching technique to determine the model, manufacturer, and year of the vehicle from which a clear coat paint smear originated. 
The pattern recognition genetic algorithm has also been used to develop a potential method to identify molds in indoor environments using volatile organic compounds. A distinct profile indicative of microbial volatile organic compounds was developed from air sampling data that could be readily differentiated from the blank for both high mold count and moderate mold count exposure samples. The utility of the pattern recognition genetic algorithm for discovery of biomarker candidates from genomic and proteomic data sets has also been shown. Chemistry Department
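
The selection criterion mentioned in this abstract, the ratio of between-class to within-class variance, can be made concrete with a small sketch. This is only an assumption-laden illustration: the function name `variance_ratio` is hypothetical, the score pools sums of squares over the chosen features, and the fitness function of the pattern recognition genetic algorithm itself (which scores principal component plots) is not reproduced here.

```python
from collections import defaultdict


def variance_ratio(X, y, subset):
    """Between-class over within-class sum of squares, pooled over `subset`.

    X is a list of samples (lists of floats), y the class labels, and
    `subset` the indices of the features being scored.
    """
    groups = defaultdict(list)
    for row, label in zip(X, y):
        groups[label].append([row[j] for j in subset])
    dims = range(len(subset))
    grand = [sum(row[j] for row in X) / len(X) for j in subset]
    between = within = 0.0
    for rows in groups.values():
        mean = [sum(r[d] for r in rows) / len(rows) for d in dims]
        # Spread of the class mean around the grand mean, weighted by class size.
        between += len(rows) * sum((mean[d] - grand[d]) ** 2 for d in dims)
        # Spread of the samples around their own class mean.
        within += sum((r[d] - mean[d]) ** 2 for r in rows for d in dims)
    return between / within if within > 0 else float("inf")


# Toy data (made up): feature 0 separates the classes, feature 1 does not.
X = [[0.0, 5.0], [1.0, 1.0], [10.0, 4.0], [11.0, 2.0]]
y = [0, 0, 1, 1]
best_single = max(range(2), key=lambda j: variance_ratio(X, y, [j]))
```

A genetic algorithm would evaluate a score of this kind on many candidate subsets and keep the higher-scoring ones; here `best_single` simply picks the better of the two individual features.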

    Computational Methods for the Differential Profiling of Triacylglycerols Using RP-HPLC/APCI-MS

    Reversed-phase liquid chromatography with atmospheric pressure chemical ionization mass spectrometry (RP-HPLC/APCI-MS) was employed for the analysis of natural mixtures of triacylglycerols. An integrated framework for data analysis, including preprocessing, statistical analysis and automated structure identification, was implemented in the R statistical program. Raw data stored as mzXML or mzData files are preprocessed using a series of steps for peak detection, chromatographic alignment, and normalization. Targeted and non-targeted feature selection steps are employed to filter the data for features that are relevant and informative for a particular biological question. Triacylglycerol structures are identified by evaluating relationships between the diacylglycerol fragment ions and protonated molecules observed in APCI mass spectra, and suggested structures are evaluated using a correlation-based score that reflects whether structure-associated ions are concurrently eluting over the retention-time course of the analysis. The algorithm was tested using five soybean oils, and triacylglycerol structure identifications were verified from literature references. We employed the developed methodology to classify plant oils and marine oils by biological source, and also to determine structural differences in triacylglycerols in adipose tissue from mice fed different high-fat diets in studies of diet-induced obesity.

    Ambient ionization - mass spectrometry: Advances toward intrasurgical cancer detection

    My dissertation research has focused on the development of ambient ionization – mass spectrometry (MS) for clinical measurements, specifically intrasurgical cancer detection. The molecular differences between normal and cancerous tissue were detected via direct tissue analysis in vitro by touch spray ionization (TS) or by analyzing sectioned or smeared tissue using desorption electrospray ionization (DESI). The physical form of the tissue, e.g. in vitro sampling, sectioned, or smeared, was inconsequential in differentiating normal from cancerous tissue; however, the spectra acquired by TS and DESI differed due to differences in ionization processes. We envision that TS-MS and DESI-MS could impact diagnostic medicine, for example by providing surgeons with rapid, near real-time information as to tissue disease state, i.e. normal or tumor. Disease state information provided to surgeons about discrete pathologically ambiguous areas, ideally intrasurgically via TS or DESI-MS smear analysis, could improve the completeness of tumor resection while minimizing unintended damage to adjacent tissue. Touch spray ionization was developed for intrasurgical detection of cancer; TS greatly simplifies MS analysis by using the same device for in vivo sampling and subsequent ionization. Frozen tissue sections were sampled and analyzed by TS-MS, providing the ability to differentiate normal tissue from human prostate cancer, via lipid profiles, using multivariate statistics. The next proof-of-concept step for TS-MS was the analysis of human kidney cancer specimens in vitro, immediately following resection. TS-MS analysis of untreated kidney tissue emulated intrasurgical use, e.g. the presence and co-sampling of biofluids such as blood. Regardless, normal renal tissue and kidney cancer were differentiated using lipid profiles and multivariate statistics. Desorption electrospray ionization (DESI) - MS imaging of tissue sections differentiated normal from tumor in all cancers studied. 
DESI-MS imaging of human prostate and human kidney tissue sections was performed to corroborate TS-MS results. Human brain cancer, a major focus of my dissertation research, was studied by imaging tissue sections using DESI-MS to establish the characteristic chemical features, e.g. lipid and metabolite profiles, that distinguish normal brain parenchyma from gliomas and different brain tumors. It was found that information in the negative ion mode lipid profile, positive ion mode lipid profile, and negative ion mode metabolite profile can discriminate between brain parenchyma (grey and white matter) and gliomas, the most common form of malignant brain tumor. Further, the negative mode lipid and metabolite profiles also proved capable of discriminating different types of brain tumors (gliomas, meningiomas, and pituitary tumors), which account for ~80% of all central nervous system tumors. DESI-MS imaging of effaced or otherwise pathologically ambiguous frozen tissue sections offered the ability to determine the underlying brain parenchyma in cancerous samples, something that traditional morphologic evaluation was not able to determine. Further, DESI-MS was able to detect molecular changes resulting from varying amounts of glioma tumor cells present within infiltrated tissues. The tumor cell percentage of these samples was predicted using N-acetyl-aspartic acid, a neurometabolite which was found to decrease in cancerous tissue, and matched well with histopathologic evaluation. The transition from DESI-MS imaging of sectioned tissue to DESI-MS analysis of tissue smears was driven by the time restriction of intrasurgical application. The potential of DESI-MS analysis of smears was first demonstrated on canine non-Hodgkin’s lymphoma fine-needle aspirate smears, which provided sensitivity and specificity values similar to those of tissue section imaging while being technically less demanding and requiring less analysis time. 
DESI-MS imaging of tissue sections established that MS profiles contained sufficient information for diagnosis, whereas DESI-MS analysis of tissue smears made the intrasurgical analysis of human brain tumors feasible. The observed lipid or metabolite profiles were not significantly altered by the physical act of smearing, and their signal intensities were comparable to those of tissue sections. Further, the chemical information obtained from tissue smears was equivalent to that of tissue sections as determined by canonical component analysis. The culmination of my dissertation research was the creation and implementation of an intrasurgical DESI-MS tissue smear analysis method for human gliomas. Preliminary results from the initial intrasurgical cases analyzed using the developed DESI-MS method are discussed.

    Non-invasive, innovative and promising strategy for breast cancer diagnosis based on metabolomic profile of urine, cancer cell lines and tissue

    The work presented in this thesis aimed to establish the metabolomic profile of urine and breast cancer (BC) tissue from BC patients (samples cordially provided by Funchal Hospital), in addition to BC cell lines (MCF-7, MDA-MB-231, T-47D), as a powerful strategy to identify metabolites as potential BC biomarkers, aiding the development of non-invasive approaches for BC diagnosis and management. To achieve the main goal and obtain a deeper and more comprehensive knowledge of the BC metabolome, different analytical platforms were used, namely headspace solid-phase microextraction (HS-SPME) combined with gas chromatography-quadrupole mass spectrometry (GC-qMS) and nuclear magnetic resonance (1H NMR) spectroscopy. The application of multivariate statistical methods, principal component analysis (PCA) and orthogonal partial least squares discriminant analysis (OPLS-DA), to the data matrix obtained from the different target samples made it possible to find a set of highly sensitive and specific metabolites, namely 4-heptanone, acetic acid and glutamine, that could be used as potential biomarkers in BC diagnosis. Significant group separation was observed in the OPLS-DA score plot between BC and CTL, indicating intrinsic metabolic alterations in each group. To attest the robustness of the model, a random permutation test with 1000 permutations was performed with OPLS-DA. The permutation test yielded R2 (goodness of fit) and Q2 (predictive ability) values higher than 0.717 and 0.691, respectively. Several metabolic pathways were dysregulated in BC considering the analytical approaches used. The main pathways included pyruvate, glutamine and sulfur metabolism, indicating that there might be an association between the metabolites arising from the type of biological sample of the same donor used to perform the investigation. 
The integration of data obtained from different analytical platforms (GC-qMS and 1H NMR) for urinary and tissue samples revealed that five metabolites (acetone, 3-hexanone, 4-heptanone, 2-methyl-5-(methylthio)-furan and acetate) were found significant using a dual analytical approach.
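
The random permutation test mentioned in this abstract can be sketched generically as follows. This is an illustrative assumption, not the thesis procedure: the abstract permutes labels around an OPLS-DA model and reports R2 and Q2, whereas the sketch below uses a much simpler statistic (the absolute difference of group means on a single variable), and the names `mean_difference` and `permutation_p_value` are hypothetical. The logic is the same in both cases: shuffle the class labels many times and check how often the shuffled data score at least as well as the real labelling.

```python
import random


def mean_difference(values, labels):
    """Absolute difference between the means of the two label groups (0/1)."""
    a = [v for v, lab in zip(values, labels) if lab == 1]
    b = [v for v, lab in zip(values, labels) if lab == 0]
    return abs(sum(a) / len(a) - sum(b) / len(b))


def permutation_p_value(values, labels, statistic, n_permutations=1000, seed=0):
    """Share of label shufflings whose statistic reaches the observed one.

    The +1 in numerator and denominator counts the observed labelling itself,
    so the estimate can never be exactly zero.
    """
    rng = random.Random(seed)
    observed = statistic(values, labels)
    shuffled = list(labels)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)  # break any real link between values and labels
        if statistic(values, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Made-up intensities for one variable in five control and five case samples.
values = [1.0, 1.2, 0.9, 1.1, 1.0, 3.0, 3.2, 2.9, 3.1, 3.0]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]
p_value = permutation_p_value(values, labels, mean_difference)
```

A small `p_value` indicates that the observed separation is unlikely to arise from an arbitrary assignment of labels, which is the sense in which such a test supports the robustness of a discriminant model.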