10 research outputs found

    Missing value imputation for microarray gene expression data using histone acetylation information

    Get PDF
    Abstract

    Background: Accurate estimation of missing values in microarray data is an important pre-processing step, because complete datasets are required for many expression profile analyses in bioinformatics. Although several methods have been proposed, their performance is not satisfactory for datasets with high percentages of missing values.

    Results: This paper explores the feasibility of imputing missing values with the help of gene regulatory mechanisms. An imputation framework called the histone acetylation information aided imputation method (HAIimpute) is presented. It incorporates histone acetylation information into the conventional KNN (k-nearest neighbor) and LLS (local least squares) imputation algorithms for the final prediction of the missing values. The experimental results indicate that the use of acetylation information provides significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve on widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE). Moreover, the genes imputed by the HAIimpute methods are more strongly correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, an existing related method that uses functional similarity as external information.

    Conclusion: We demonstrated that using histone acetylation information can greatly improve imputation performance, especially at high percentages of missing values. The idea can be generalized to various imputation methods to improve their performance. Moreover, as more knowledge of gene regulatory mechanisms beyond histone acetylation accumulates, the performance of our approach can be further improved and verified.
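    The building blocks that HAIimpute extends, plain KNN imputation and the NRMSE evaluation metric, can be summarised in a few lines. The sketch below is an illustrative baseline only, not the authors' HAIimpute implementation (which additionally exploits histone acetylation information when selecting neighbour genes); the neighbour count and function names are assumptions.

```python
# Baseline KNN imputation over a genes-x-samples matrix, plus the NRMSE metric
# used to score imputation accuracy. Illustrative only; not the HAIimpute method.
import numpy as np

def knn_impute(X, k=10):
    """Fill NaNs in each gene (row) using the mean of its k most similar genes."""
    X_imp = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        # candidate genes must be fully observed in both the missing and observed columns
        cand = np.where(~np.isnan(X[:, miss]).any(axis=1) &
                        ~np.isnan(X[:, ~miss]).any(axis=1))[0]
        cand = cand[cand != i]
        # Euclidean distance on the columns where gene i is observed
        d = np.sqrt(((X[cand][:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1))
        nn = cand[np.argsort(d)[:k]]
        X_imp[i, miss] = X[nn][:, miss].mean(axis=0)
    return X_imp

def nrmse(true, imputed, mask):
    """Normalized root mean squared error over the artificially masked entries."""
    err = true[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(true[mask])
```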

    Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset

    Get PDF
    Data mining techniques are used to analyse patterns in data sets in order to derive useful information. Grouping data sets into clusters is one of the essential processes for data manipulation, and K-means is one of the most popular and efficient clustering methods. However, K-means clustering has difficulties with high-dimensional data sets in the presence of missing values, and previous studies have shown that high feature dimensionality poses additional problems for K-means clustering. To address the missing value problem, an imputation method is needed to minimise the effect of incomplete high-dimensional data sets on the K-means clustering process. This research studies the effect of imputation algorithms and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for missing value estimation: K-nearest neighbours (KNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). Principal component analysis (PCA) is a dimension reduction method that removes unnecessary attributes of high-dimensional data sets. Hence, a hybrid of PCA with K-means (PCA K-means) is proposed to give better clustering results. The experiments were performed on the Wisconsin Breast Cancer data set. Using the LLS imputation method, the proposed hybrid PCA K-means outperformed standard K-means clustering on the breast cancer data set in terms of clustering accuracy (0.29%) and computing time (95.76%).
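    A minimal sketch of the described pipeline (imputation of missing values, PCA-based dimension reduction, then K-means), assuming the scikit-learn implementations of KNN imputation, PCA and K-means; the neighbour count, number of components and artificial missing rate are illustrative choices, not the settings used in the study.

```python
# Imputation -> PCA -> K-means pipeline on the Wisconsin Breast Cancer data,
# sketched with scikit-learn; parameter values are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_breast_cancer(return_X_y=True)

# Artificially remove 10% of entries to mimic an incomplete data set.
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.10] = np.nan

# 1) KNN imputation of the missing values.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_miss)

# 2) Standardisation and dimensionality reduction with PCA.
X_std = StandardScaler().fit_transform(X_imp)
X_pca = PCA(n_components=2).fit_transform(X_std)

# 3) K-means on the reduced data (PCA K-means) versus the full-dimensional data.
labels_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print("ARI, PCA K-means:", adjusted_rand_score(y, labels_pca))
print("ARI, standard K-means:", adjusted_rand_score(y, labels_std))
```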

    Data analysis tools for mass spectrometry proteomics

    Get PDF
    ABSTRACT Proteins are large biomolecules which consist of amino acid chains. They differ from one another in their amino acid sequences, which are mainly dictated by the nucleotide sequences of their corresponding genes. Proteins fold into specific three-dimensional structures that determine their activity. Because many proteins act as catalysts in biochemical reactions, they are considered the executive molecules of the cell, and their study is therefore fundamental to biotechnology and medicine. Currently the most common method to investigate the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers are used to measure molecular masses, or more specifically, mass-to-charge ratios. Typically the proteins are digested into peptides and their masses are measured by mass spectrometry. The masses are matched against known sequences to obtain peptide identifications, and subsequently the proteins from which the peptides originated are quantified. The data gathered from these experiments contain considerable noise, leading to loss of relevant information and even to wrong conclusions. The noise can be related, for example, to differences in sample preparation or to technical limitations of the analysis equipment. In addition, assumptions regarding the data might be wrong or the chosen statistical methods might not be suitable. Taken together, these issues can lead to irreproducible results, so developing algorithms and computational tools to overcome them is of utmost importance. This work therefore aims to develop new computational tools to address these problems. In this PhD thesis, the performance of existing label-free proteomics methods is evaluated and new statistical data analysis methods are proposed. The tested methods include several widely used normalization methods, which are thoroughly evaluated using multiple gold standard datasets. Various statistical methods for differential expression analysis are also evaluated. Furthermore, new methods to calculate differential expression statistics are developed, and their superior performance compared to the existing methods is demonstrated using a wide set of metrics. The tools are published as open source software packages.

    TIIVISTELMÄ (Finnish abstract, translated): Proteins are large biomolecules composed of amino acid chains. They differ from one another in the order of their amino acids, which is mainly determined by the genes encoding them. In addition, proteins fold into three-dimensional structures that in part define their function. Because proteins act as catalysts in biochemical reactions, they are considered to play a central role in cells, and their study is therefore regarded as important. Currently the most common method for studying protein activity, interactions and functions on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers are used to measure the masses of molecules, or more precisely their mass-to-charge ratios. Typically, proteins are digested into peptides for mass measurement. The masses observed by the mass spectrometer are compared against a database compiled from known protein sequences so that the peptides can be identified. From the peptides, the proteins can in turn be inferred and quantified. The data collected in these experiments normally contain a great deal of noise, which may lead to loss of essential information and, at worst, to incorrect conclusions. This noise can arise, for example, from differences in sample handling or from technical limitations of the measurement instruments. In addition, assumptions about the nature of the data may be incorrect, or statistical models unsuited to the data may be used. At worst, this leads to situations in which the results of a study cannot be reproduced. Developing computational tools and algorithms to prevent these problems is therefore of prime importance for the reliability of research. This work focuses on applications that aim to solve problems arising in this area. The study compares commonly used quantitative proteomics software and the most common data normalization methods, and develops new data analysis tools. The comparisons between methods are performed using several gold standard datasets whose true content is known. The study also compares a number of statistical methods for detecting differences between samples, develops entirely new and effective methods, and demonstrates their better performance relative to earlier methods. All tools developed in the study have been published as open source applications.
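    As an illustration of two of the basic steps discussed in this abstract, normalization of label-free intensities and per-protein differential expression testing, the hedged sketch below uses simple median normalization and a Welch t-test with Benjamini-Hochberg correction; it is not the thesis software, and the group layout and data are hypothetical.

```python
# Illustrative sketch: median normalization of log-transformed protein
# intensities and a simple per-protein differential expression test with
# multiple-testing correction. Not the methods developed in the thesis.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def median_normalize(log_intensities):
    """Shift each sample (column) so that all samples share the same median."""
    col_medians = np.nanmedian(log_intensities, axis=0)
    return log_intensities - col_medians + np.nanmean(col_medians)

def differential_expression(log_intensities, group_a, group_b):
    """Welch t-test per protein (row), Benjamini-Hochberg adjusted."""
    pvals = np.array([
        stats.ttest_ind(row[group_a], row[group_b],
                        equal_var=False, nan_policy='omit').pvalue
        for row in log_intensities
    ])
    _, qvals, _, _ = multipletests(pvals, method='fdr_bh')
    return pvals, qvals

# Toy usage: 100 proteins, two groups of 3 samples each.
rng = np.random.default_rng(1)
data = median_normalize(np.log2(rng.lognormal(10, 1, size=(100, 6))))
p, q = differential_expression(data, group_a=[0, 1, 2], group_b=[3, 4, 5])
```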

    Missing value imputation for microarray gene expression data using histone acetylation information-1

    No full text
    The burst model of missing values. The legends are the same as in Figure 1. The HAIimpute methods are more robust than the GOimpute methods in this case. The knnHAI method outperforms KNN and GOKNN, while llsHAI outperforms LLS and GOLLS in most cases. Copyright information: taken from "Missing value imputation for microarray gene expression data using histone acetylation information", http://www.biomedcentral.com/1471-2105/9/252. BMC Bioinformatics 2008;9:252. Published online 29 May 2008. PMCID: PMC2432074.

    Missing value imputation for microarray gene expression data using histone acetylation information-0

    No full text
    The random model of missing values. The horizontal axis is the range of missing percentages, varying from 1% to 20%. The vertical axis is the NRMSE of 100 independent and random test runs for each method. The knnHAI method outperforms KNN and GOKNN, while llsHAI mostly outperforms LLS and GOLLS. Generally, llsHAI performs best in most cases. Copyright information: taken from "Missing value imputation for microarray gene expression data using histone acetylation information", http://www.biomedcentral.com/1471-2105/9/252. BMC Bioinformatics 2008;9:252. Published online 29 May 2008. PMCID: PMC2432074.

    Enhanced label-free discovery proteomics through improved data analysis and knowledge enrichment

    Get PDF
    Mass spectrometry (MS)-based proteomics has evolved into an important tool applied in fundamental biological research as well as in biomedicine and medical research. The rapid development of the technology has required the establishment of data processing algorithms, protocols and workflows. The successful application of such software tools allows instrumental raw data to mature into biological and medical knowledge. However, as the choice of algorithms is vast, selecting suitable processing tools for various data types and research questions is not trivial. In this thesis, MS data processing related to label-free technology is systematically considered. Essential questions, such as normalization, choice of preprocessing software, missing values and imputation, are reviewed in depth. Considerations related to preprocessing of the raw data are complemented by an exploration of methods for analyzing the processed data into practical knowledge. In particular, longitudinal differential expression is reviewed in detail, and a novel approach well suited for noisy longitudinal high-throughput data with missing values is suggested. Knowledge enrichment through integrated functional enrichment and network analysis is introduced for intuitive and information-rich delivery of the results. Effective visualization of such integrated networks enables fast screening of the results for the most promising candidates (e.g. clusters of co-expressing proteins with disease-related functions) for further validation and research. Finally, conclusions related to preprocessing of the raw data are combined with considerations regarding longitudinal differential expression and integrated knowledge enrichment into guidelines for a potential label-free discovery proteomics workflow. Such a proposed data processing workflow, with practical suggestions for each distinct step, can act as a basis for transforming label-free raw MS data into applicable knowledge.

    (Finnish abstract, translated:) Mass spectrometry (MS)-based proteomics has developed into a powerful tool used in both biological and medical research. The rapid development of the field has given rise to specialized algorithms, protocols and software for data processing. Proper use of these software tools ultimately enables efficient preprocessing, analysis and refinement of the data into biological or medical understanding. Owing to the large number of possible alternatives, however, choosing a suitable software tool is often neither unambiguous nor unproblematic. This thesis examines computational tools related to label-free proteomics. The thesis covers key questions from data normalization to the choice of a suitable preprocessing software and the handling of missing values. In addition to data preprocessing, the statistical downstream analysis of the data is examined, in particular the detection of differential expression in longitudinal studies. The thesis introduces a new method for detecting differential expression that is suited to noisy, high-throughput longitudinal data containing missing values. In addition to the new statistical method, the thesis considers the enrichment of the detected statistical findings into practical understanding through integrated enrichment and network analyses. Effective visualization of such functional networks enables rapid interpretation of the key results and selection of the most interesting findings for further study. Finally, the conclusions concerning data preprocessing and the statistical downstream analysis of longitudinal studies are combined with knowledge enrichment. Based on these considerations, a possible workflow is presented for processing label-free MS proteomics data from raw data into exploitable findings and further into practical biological and medical understanding.
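    For longitudinal differential expression with missing values, a conventional baseline (not the novel approach proposed in the thesis) is to fit a per-protein linear mixed model with a time-by-group interaction; the sketch below assumes a long-format table with hypothetical column names and simply drops missing intensities per protein.

```python
# Baseline longitudinal differential expression: per-protein linear mixed model
# with a time x group interaction, fitted with statsmodels. Illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def longitudinal_de(df):
    """df columns (hypothetical): protein, subject, group, time, intensity (log scale)."""
    results = {}
    for protein, sub in df.dropna(subset=["intensity"]).groupby("protein"):
        # Random intercept per subject; the time:group interaction asks whether
        # the longitudinal trend differs between the groups.
        fit = smf.mixedlm("intensity ~ time * group", sub, groups=sub["subject"]).fit()
        pval = next((p for name, p in fit.pvalues.items()
                     if name.startswith("time:")), np.nan)
        results[protein] = pval
    return pd.Series(results, name="interaction_pvalue")
```

    The resulting per-protein interaction p-values would then be adjusted for multiple testing across proteins, as in the earlier sketch.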

    Statistical modelling of masked gene regulatory pathway changes across microarray studies of interferon gamma activated macrophages

    Get PDF
    Interferon gamma (IFN-γ) regulation of macrophages plays an essential role in innate immunity and in the pathogenicity of viral infections by directing large and small genome-wide changes in the transcriptional program of macrophages. Smaller changes at the transcriptional level are difficult to detect but can have profound biological effects, motivating the hypothesis of this thesis that the responses of macrophages to immune activation by IFN-γ include small quantitative changes that are masked by noise but represent meaningful transcriptional systems in pathways against infection. To test this hypothesis, statistical meta-analysis of microarray studies is investigated as a tool to obtain the necessary increase in analysis sensitivity. Three meta-analysis models (effect size model, Rank Product model, Fisher's sum of logs) and three further modified versions were applied to a heterogeneous set of four microarray studies on the effect of IFN-γ on murine macrophages. Performance assessments include recovery of known biology and are followed by the development of novel biological hypotheses through secondary analysis of the meta-analysis outcomes in the context of independent biological data sources. A separate network analysis of a microarray time course study investigates whether gene sets with coordinated time-dependent relationships can also identify subtle IFN-γ-related transcriptional changes in macrophages that overlap with those identified through meta-analysis. It was found that all meta-analysis models can identify biologically meaningful transcription at enhanced sensitivity levels, with slight performance advantages for a non-parametric model (Rank Product meta-analysis). Meta-analysis yielded consistently regulated genes, hidden in individual microarray studies, related to sterol biosynthesis (Stard3, Pgrmc1, Galnt6, Rab11a, Golga4, Lrp10), implicated in cross-talk between type II and type I interferon or IL-10 signalling (Tbk1, Ikbke, Clic4, Ptpre, Batf), and circadian rhythm (Csnk1e). Further network analysis confirms that the meta-analysis findings are highly concentrated in a distinct immune response cluster of co-expressed genes, and also identifies global expression modularisation in IFN-γ treated macrophages, pointing to Trafd1 as a central anti-correlated node topologically linked to interactions with down-regulated sterol biosynthesis pathway members. The outcomes of this thesis suggest that small transcriptional changes in IFN-γ activated macrophages can be detected by enhancing sensitivity through the combination of multiple microarray studies. Together with the use of bioinformatics resources, independent data sets and network analysis, further validation assigns a potential role for genes with low or variable transcription in linking type II interferon signalling to type I and TLR signalling, as well as to the sterol metabolic network.
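    Of the three meta-analysis models named above, Fisher's sum of logs is the simplest to state: per-study p-values for a gene are combined as X = -2 Σ ln p_i, which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. A minimal sketch with made-up p-values:

```python
# Fisher's sum of logs for combining per-study p-values of one gene.
# The example p-values below are hypothetical, for illustration only.
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi^2 with 2k degrees of freedom."""
    p = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(p))
    combined_p = stats.chi2.sf(statistic, df=2 * len(p))
    return statistic, combined_p

# A gene measured in four independent microarray studies (hypothetical values):
stat, p_comb = fisher_combine([0.04, 0.20, 0.08, 0.15])
print(f"chi2 = {stat:.2f}, combined p = {p_comb:.4f}")
```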