40 research outputs found

    Gene Expression Analysis Methods on Microarray Data: A Review

    In recent years, a new type of experiment has been changing the way biologists and other specialists analyze many problems. These are called high-throughput experiments, and their main difference from experiments performed some years ago lies in the quantity of data they produce. Thanks to the technology known generically as microarrays, it is now possible to study, in a single experiment, the behavior of all the genes of an organism under different conditions. The data generated by these experiments may consist of thousands to millions of variables, and they pose many challenges to the scientists who have to analyze them. Many of these challenges are statistical in nature and will be the focus of this review. Many types of microarrays have been developed to answer different biological questions, and some of them will be explained later. For the sake of simplicity, we start with the best known: expression microarrays.

    Pre-processing for noise detection in gene expression classification data

    Due to the imprecise nature of biological experiments, biological data are often redundant and noisy. This may be due to errors that occurred during data collection, such as contamination of laboratory samples. This is the case for gene expression data, where the equipment and tools currently in use frequently produce noisy data. Machine Learning algorithms have been used successfully in gene expression data analysis. Although many Machine Learning algorithms can deal with noise, detecting and removing noisy instances from the training data set can help the induction of the target hypothesis. This paper evaluates the use of distance-based pre-processing techniques for noise detection in gene expression data classification problems. The evaluation analyzes how effectively the investigated techniques remove noisy data, measured by the accuracy that different Machine Learning classifiers obtain on the pre-processed data. São Paulo State Research Foundation (FAPESP)CNP
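
    The abstract above does not name the specific distance-based techniques it evaluates. As a minimal sketch of the general idea, assuming a Wilson-style edited nearest neighbour rule (an assumption, not necessarily the paper's method), an instance is flagged as noisy when its label disagrees with the majority label of its k nearest neighbours. The function name `enn_filter` and the toy data are hypothetical.

```python
import numpy as np

def enn_filter(X, y, k=3):
    """Return a boolean mask that is True for instances kept by the filter.

    Assumption: Euclidean distance and a majority vote over the k nearest
    neighbours (edited nearest neighbour rule); illustrative only.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    # Pairwise Euclidean distances between all instances.
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # an instance is not its own neighbour
    keep = np.ones(len(X), dtype=bool)
    for i in range(len(X)):
        neighbours = np.argsort(dist[i])[:k]
        labels, counts = np.unique(y[neighbours], return_counts=True)
        majority = labels[np.argmax(counts)]
        keep[i] = majority == y[i]
    return keep

# Toy data (hypothetical): two well-separated classes plus one instance
# that sits inside class 0 but carries the label of class 1.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [0.5, 0.5]]
y = [0, 0, 0, 0, 1, 1, 1, 1, 1]
mask = enn_filter(X, y, k=3)
print(mask)  # only the last instance is flagged for removal
```

    A classifier would then be trained on the instances where the mask is True, and its accuracy compared with training on the unfiltered data, which mirrors the evaluation the abstract describes.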

    Bioinformatic analysis and deep learning on large-scale human transcriptomic data: studies on aging, Alzheimer’s neurodegeneration and cancer

    The overall objective of the project was the integrative bioinformatic analysis of multiple proteomic and genomic data sets, combined with associated clinical data, to search for biomarkers and causal polygenic modules in complex diseases: mainly cancer of unknown primary origin, in its various types and subtypes, and neurodegenerative diseases (ND), chiefly Alzheimer's disease, as well as age-related neurodegeneration. In addition, artificial intelligence techniques, specifically deep learning neural networks, were used intensively for the analysis and prognosis of these diseases.

    Data Imputation through the Identification of Local Anomalies

    We introduce a comprehensive statistical framework, in a model-free setting, for a complete treatment of localized data corruptions due to severe noise sources, e.g., an occluder in the case of a visual recording. Within this framework, we propose i) a novel algorithm to efficiently separate, i.e., detect and localize, possible corruptions from a given suspicious data instance and ii) a Maximum A Posteriori (MAP) estimator to impute the corrupted data. As a generalization of the Euclidean distance, we also propose a novel distance measure, which is based on the ranked deviations among the data attributes and is empirically shown to be superior in separating the corruptions. Our algorithm first splits the suspicious instance into parts through a binary partitioning tree in the space of data attributes and iteratively tests those parts to detect local anomalies, using the nominal statistics extracted from an uncorrupted (clean) reference data set. Once each part is labeled as anomalous or normal, the corresponding binary patterns over this tree that characterize corruptions are identified and the affected attributes are imputed. Under a certain conditional independence structure assumed for the binary patterns, we show analytically that the false alarm rate of the introduced algorithm in detecting the corruptions is independent of the data and can be set directly without any parameter tuning. The proposed framework is tested on several well-known machine learning data sets with synthetically generated corruptions and is experimentally shown to produce marked improvements in classification performance, with strong corruption separation capabilities. Our experiments also indicate that the proposed algorithms outperform the typical approaches and are robust to varying training phase conditions.

    Integrative Data Mining and Meta Analysis of Disease-Specific Large-Scale Genomic, Transcriptomic and Proteomic Data

    During the past decades, large-scale microarray technologies have been applied to the fields of genomics, transcriptomics and proteomics. DNA microarrays and mass spectrometry have been used as tools for identifying changes in gene and protein expression and genomic alterations that can be linked to various stages of tumor development. Although these technologies have generated a deluge of data, bioinformatic algorithms still need to be improved to advance the understanding of many fundamental biological questions. In particular, most bioinformatic strategies are optimized for one of these technologies and allow only a one-dimensional view of the biological question. Within this thesis, a bioinformatic tool was developed that combines the multidimensional information that can be obtained when analysing genomic, transcriptomic and proteomic data in an integrative manner. Neuroblastoma is a malignant pediatric tumor of the nervous system. The tumor is characterized by aberration patterns that correlate with patient outcome. aCGH (array comparative genomic hybridization) and DNA microarray gene expression analysis were chosen as appropriate methods to analyse the impact of DNA copy number variations on gene expression in 81 neuroblastoma samples. Within this thesis, a novel bioinformatic strategy was used which identifies chromosomal aberrations that influence the expression of genes located at the same (cis-effects) and also at different (trans-effects) chromosomal positions in neuroblastoma. Sample-specific cis-effects were identified for the paired data by a probe-matching procedure, gene expression discretization and a correlation score in combination with one-dimensional hierarchical clustering. The graphical representation revealed that tumors with an amplification of the oncogene MYCN had a gain of chromosome 17, whereas genes in cis-position were downregulated.
    Simultaneously, a loss of chromosome 1 and a downregulation of the corresponding genes hint at a cross-relationship between chromosomes 17 and 1. A Bayesian network (BN), as a representation of joint probability distributions, was adopted to detect neuroblastoma-specific cis- and trans-effects. The strength of association between aCGH and gene expression data was represented by Markov blankets, which were built up from mutual information. This gave rise to a graphical network that linked DNA copy number changes with genes and also captured gene-gene interactions. This method found chromosomal aberrations on 11q and 17q to have a major impact on neuroblastoma. A prominent trans-effect was identified in a gain of 17q23.2 and an upregulation of CPT1B, which is located at 22q13.33. Further, to identify the effects of gene expression changes on protein expression, the bioinformatic tool was expanded to enable an integration of mass spectrometry and DNA microarray data from a set of 53 patients after lung transplantation. The tool was applied to the early diagnosis of Bronchiolitis Obliterans Syndrome (BOS), which often occurs in the second year after lung transplantation and leads to rejection of the lung transplant. Gene expression profiles were translated into virtual spectra and linked to their potential mass spectrometry peaks. The correlation score between the virtual and real spectra did not exhibit significant patterns in relation to BOS. However, the meta-analysis approach yielded 15 genes that could not be found in the separate analysis of the two data types, such as INSL4, CCL26 and FXYD3. These genes constitute potential biomarkers for the detection of BO
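
    The abstract above uses mutual information to measure the strength of association between aCGH and gene expression data, but it does not give the estimator or the discretization scheme. The following is a minimal sketch, assuming a plug-in estimate on discretized states (-1 for loss or downregulation, 0 for no change, 1 for gain or upregulation). The function name `mutual_information` and the sample values are hypothetical and are not taken from the thesis.

```python
import numpy as np

def mutual_information(a, b):
    """Plug-in mutual information (in bits) between two discrete sequences.

    Assumption: both sequences are already discretized and aligned by
    sample; this is an illustrative estimator, not the thesis's own code.
    """
    a = np.asarray(a)
    b = np.asarray(b)
    mi = 0.0
    for va in np.unique(a):
        for vb in np.unique(b):
            p_ab = np.mean((a == va) & (b == vb))  # joint frequency
            if p_ab > 0:
                p_a = np.mean(a == va)  # marginal frequency of va
                p_b = np.mean(b == vb)  # marginal frequency of vb
                mi += p_ab * np.log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical states for eight samples at one locus.
copy_number = [1, 1, 1, 0, 0, 0, -1, -1]
expr_linked = [1, 1, 1, 0, 0, 0, -1, -1]     # follows the copy number
expr_unlinked = [1, 0, -1, 1, 0, -1, 1, 0]   # little relation to it

print(mutual_information(copy_number, expr_linked))
print(mutual_information(copy_number, expr_unlinked))
```

    A gene whose discretized expression tracks the copy number state scores higher than one that varies independently of it. In a network setting, such pairwise scores could serve as edge weights between a copy number change and a gene, whether the gene lies at the same position (cis) or elsewhere (trans).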