10 research outputs found

    Missing value imputation for microarray gene expression data using histone acetylation information

    Get PDF
    Abstract

    Background: Accurate estimation of missing values in microarray data is an important pre-processing step, because complete datasets are required for many expression profile analyses in bioinformatics. Although several methods have been proposed, their performance is not satisfactory for datasets with high percentages of missing values.

    Results: This paper explores the feasibility of imputing missing values with the help of gene regulatory mechanisms. An imputation framework called the histone acetylation information aided imputation method (HAIimpute) is presented. It incorporates histone acetylation information into the conventional KNN (k-nearest neighbor) and LLS (local least squares) imputation algorithms for the final prediction of the missing values. The experimental results indicate that the use of acetylation information provides significant improvements in microarray imputation accuracy. The HAIimpute methods consistently improve on widely used methods such as KNN and LLS in terms of normalized root mean squared error (NRMSE). Moreover, the genes imputed by the HAIimpute methods are more strongly correlated with the original complete genes in terms of Pearson correlation coefficients. Furthermore, the proposed methods also outperform GOimpute, an existing related method that uses functional similarity as external information.

    Conclusion: We demonstrated that using histone acetylation information can greatly improve imputation performance, especially at high percentages of missing values. The idea can be generalized to various imputation methods to improve their performance. Moreover, as more knowledge of gene regulatory mechanisms beyond histone acetylation accumulates, the performance of our approach can be further improved and verified.
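    The building blocks that HAIimpute extends, plain KNN imputation and the NRMSE evaluation metric, can be summarised in a few lines. The sketch below is an illustrative baseline only, not the authors' HAIimpute implementation (which additionally exploits histone acetylation information when selecting neighbour genes); the neighbour count and function names are assumptions.

```python
# Baseline KNN imputation over a genes-x-samples matrix, plus the NRMSE metric
# used to score imputation accuracy. Illustrative only; not the HAIimpute method.
import numpy as np

def knn_impute(X, k=10):
    """Fill NaNs in each gene (row) using the mean of its k most similar genes."""
    X_imp = X.copy()
    for i in np.where(np.isnan(X).any(axis=1))[0]:
        miss = np.isnan(X[i])
        # candidate genes must be fully observed in both the missing and observed columns
        cand = np.where(~np.isnan(X[:, miss]).any(axis=1) &
                        ~np.isnan(X[:, ~miss]).any(axis=1))[0]
        cand = cand[cand != i]
        # Euclidean distance on the columns where gene i is observed
        d = np.sqrt(((X[cand][:, ~miss] - X[i, ~miss]) ** 2).sum(axis=1))
        nn = cand[np.argsort(d)[:k]]
        X_imp[i, miss] = X[nn][:, miss].mean(axis=0)
    return X_imp

def nrmse(true, imputed, mask):
    """Normalized root mean squared error over the artificially masked entries."""
    err = true[mask] - imputed[mask]
    return np.sqrt(np.mean(err ** 2)) / np.std(true[mask])
```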

    Improved k-means clustering using principal component analysis and imputation methods for breast cancer dataset

    Get PDF
    Data mining techniques are used to analyse patterns in data sets in order to derive useful information. Grouping data sets into clusters is one of the essential processes for data manipulation, and K-means is one of the most popular and efficient clustering methods. However, K-means clustering has difficulties with high-dimensional data sets in the presence of missing values, and previous studies have shown that high feature dimensionality poses additional problems for K-means clustering. To address the missing value problem, an imputation method is needed to minimise the effect of incomplete high-dimensional data sets on the K-means clustering process. This research studies the effect of imputation algorithms and dimensionality reduction techniques on the performance of K-means clustering. Three imputation methods are implemented for missing value estimation: K-nearest neighbours (KNN), local least squares (LLS), and Bayesian principal component analysis (BPCA). Principal component analysis (PCA) is a dimension reduction method that removes unnecessary attributes of high-dimensional data sets. Hence, a hybrid of PCA with K-means (PCA K-means) is proposed to give better clustering results. The experiments were performed on the Wisconsin Breast Cancer data set. Using the LLS imputation method, the proposed hybrid PCA K-means outperformed standard K-means clustering on the breast cancer data set in terms of clustering accuracy (0.29%) and computing time (95.76%).
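    A minimal sketch of the described pipeline (imputation of missing values, PCA-based dimension reduction, then K-means), assuming the scikit-learn implementations of KNN imputation, PCA and K-means; the neighbour count, number of components and artificial missing rate are illustrative choices, not the settings used in the study.

```python
# Imputation -> PCA -> K-means pipeline on the Wisconsin Breast Cancer data,
# sketched with scikit-learn; parameter values are illustrative assumptions.
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

X, y = load_breast_cancer(return_X_y=True)

# Artificially remove 10% of entries to mimic an incomplete data set.
rng = np.random.default_rng(0)
X_miss = X.copy()
X_miss[rng.random(X.shape) < 0.10] = np.nan

# 1) KNN imputation of the missing values.
X_imp = KNNImputer(n_neighbors=5).fit_transform(X_miss)

# 2) Standardisation and dimensionality reduction with PCA.
X_std = StandardScaler().fit_transform(X_imp)
X_pca = PCA(n_components=2).fit_transform(X_std)

# 3) K-means on the reduced data (PCA K-means) versus the full-dimensional data.
labels_pca = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_pca)
labels_std = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X_std)
print("ARI, PCA K-means:", adjusted_rand_score(y, labels_pca))
print("ARI, standard K-means:", adjusted_rand_score(y, labels_std))
```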

    Data analysis tools for mass spectrometry proteomics

    Get PDF
    ABSTRACT Proteins are large biomolecules which consist of amino acid chains. They differ from one another in their amino acid sequences, which are mainly dictated by the nucleotide sequences of their corresponding genes. Proteins fold into specific three-dimensional structures that determine their activity. Because many proteins act as catalysts in biochemical reactions, they are considered the executive molecules of the cell, and their study is therefore fundamental to biotechnology and medicine. Currently the most common method to investigate the activity, interactions, and functions of proteins on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers are used to measure molecular masses, or more specifically, mass-to-charge ratios. Typically the proteins are digested into peptides and their masses are measured by mass spectrometry. The masses are matched against known sequences to obtain peptide identifications, and subsequently the proteins from which the peptides originated are quantified. The data gathered from these experiments contain considerable noise, leading to loss of relevant information and even to wrong conclusions. The noise can be related, for example, to differences in sample preparation or to technical limitations of the analysis equipment. In addition, assumptions regarding the data might be wrong or the chosen statistical methods might not be suitable. Taken together, these issues can lead to irreproducible results, so developing algorithms and computational tools to overcome them is of utmost importance. This work therefore aims to develop new computational tools to address these problems. In this PhD thesis, the performance of existing label-free proteomics methods is evaluated and new statistical data analysis methods are proposed. The tested methods include several widely used normalization methods, which are thoroughly evaluated using multiple gold standard datasets. Various statistical methods for differential expression analysis are also evaluated. Furthermore, new methods to calculate differential expression statistics are developed, and their superior performance compared to the existing methods is demonstrated using a wide set of metrics. The tools are published as open source software packages.

    TIIVISTELMÄ (Finnish abstract, translated): Proteins are large biomolecules composed of amino acid chains. They differ from one another in the order of their amino acids, which is mainly determined by the genes encoding them. In addition, proteins fold into three-dimensional structures that in part define their function. Because proteins act as catalysts in biochemical reactions, they are considered to play a central role in cells, and their study is therefore regarded as important. Currently the most common method for studying protein activity, interactions and functions on a large scale is high-throughput mass spectrometry (MS). Mass spectrometers are used to measure the masses of molecules, or more precisely their mass-to-charge ratios. Typically, proteins are digested into peptides for mass measurement. The masses observed by the mass spectrometer are compared against a database compiled from known protein sequences so that the peptides can be identified. From the peptides, the proteins can in turn be inferred and quantified. The data collected in these experiments normally contain a great deal of noise, which may lead to loss of essential information and, at worst, to incorrect conclusions. This noise can arise, for example, from differences in sample handling or from technical limitations of the measurement instruments. In addition, assumptions about the nature of the data may be incorrect, or statistical models unsuited to the data may be used. At worst, this leads to situations in which the results of a study cannot be reproduced. Developing computational tools and algorithms to prevent these problems is therefore of prime importance for the reliability of research. This work focuses on applications that aim to solve problems arising in this area. The study compares commonly used quantitative proteomics software and the most common data normalization methods, and develops new data analysis tools. The comparisons between methods are performed using several gold standard datasets whose true content is known. The study also compares a number of statistical methods for detecting differences between samples, develops entirely new and effective methods, and demonstrates their better performance relative to earlier methods. All tools developed in the study have been published as open source applications.
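    As an illustration of two of the basic steps discussed in this abstract, normalization of label-free intensities and per-protein differential expression testing, the hedged sketch below uses simple median normalization and a Welch t-test with Benjamini-Hochberg correction; it is not the thesis software, and the group layout and data are hypothetical.

```python
# Illustrative sketch: median normalization of log-transformed protein
# intensities and a simple per-protein differential expression test with
# multiple-testing correction. Not the methods developed in the thesis.
import numpy as np
from scipy import stats
from statsmodels.stats.multitest import multipletests

def median_normalize(log_intensities):
    """Shift each sample (column) so that all samples share the same median."""
    col_medians = np.nanmedian(log_intensities, axis=0)
    return log_intensities - col_medians + np.nanmean(col_medians)

def differential_expression(log_intensities, group_a, group_b):
    """Welch t-test per protein (row), Benjamini-Hochberg adjusted."""
    pvals = np.array([
        stats.ttest_ind(row[group_a], row[group_b],
                        equal_var=False, nan_policy='omit').pvalue
        for row in log_intensities
    ])
    _, qvals, _, _ = multipletests(pvals, method='fdr_bh')
    return pvals, qvals

# Toy usage: 100 proteins, two groups of 3 samples each.
rng = np.random.default_rng(1)
data = median_normalize(np.log2(rng.lognormal(10, 1, size=(100, 6))))
p, q = differential_expression(data, group_a=[0, 1, 2], group_b=[3, 4, 5])
```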

    Missing value imputation for microarray gene expression data using histone acetylation information-1

    No full text
    The burst model of missing values. The legends are the same as in Figure 1. The HAIimpute methods are more robust than the GOimpute methods in this case. The knnHAI method outperforms KNN and GOKNN, while llsHAI outperforms LLS and GOLLS in most cases. Copyright information: taken from "Missing value imputation for microarray gene expression data using histone acetylation information", http://www.biomedcentral.com/1471-2105/9/252. BMC Bioinformatics 2008;9:252. Published online 29 May 2008. PMCID: PMC2432074.

    Missing value imputation for microarray gene expression data using histone acetylation information-0

    No full text
    The random model of missing values. The horizontal axis is the range of missing percentages, varying from 1% to 20%. The vertical axis is the NRMSE of 100 independent and random test runs for each method. The knnHAI method outperforms KNN and GOKNN, while llsHAI mostly outperforms LLS and GOLLS. Generally, llsHAI performs best in most cases. Copyright information: taken from "Missing value imputation for microarray gene expression data using histone acetylation information", http://www.biomedcentral.com/1471-2105/9/252. BMC Bioinformatics 2008;9:252. Published online 29 May 2008. PMCID: PMC2432074.

    Enhanced label-free discovery proteomics through improved data analysis and knowledge enrichment

    Get PDF
    Mass spectrometry (MS)-based proteomics has evolved into an important tool applied in fundamental biological research as well as in biomedicine and medical research. The rapid development of the technology has required the establishment of data processing algorithms, protocols and workflows. The successful application of such software tools allows instrumental raw data to mature into biological and medical knowledge. However, as the choice of algorithms is vast, selecting suitable processing tools for various data types and research questions is not trivial. In this thesis, MS data processing related to label-free technology is systematically considered. Essential questions, such as normalization, choice of preprocessing software, missing values and imputation, are reviewed in depth. Considerations related to preprocessing of the raw data are complemented by an exploration of methods for analyzing the processed data into practical knowledge. In particular, longitudinal differential expression is reviewed in detail, and a novel approach well suited for noisy longitudinal high-throughput data with missing values is suggested. Knowledge enrichment through integrated functional enrichment and network analysis is introduced for intuitive and information-rich delivery of the results. Effective visualization of such integrated networks enables fast screening of the results for the most promising candidates (e.g. clusters of co-expressing proteins with disease-related functions) for further validation and research. Finally, conclusions related to preprocessing of the raw data are combined with considerations regarding longitudinal differential expression and integrated knowledge enrichment into guidelines for a potential label-free discovery proteomics workflow. Such a proposed data processing workflow, with practical suggestions for each distinct step, can act as a basis for transforming label-free raw MS data into applicable knowledge.

    (Finnish abstract, translated:) Mass spectrometry (MS)-based proteomics has developed into a powerful tool used in both biological and medical research. The rapid development of the field has given rise to specialized algorithms, protocols and software for data processing. Proper use of these software tools ultimately enables efficient preprocessing, analysis and refinement of the data into biological or medical understanding. Owing to the large number of possible alternatives, however, choosing a suitable software tool is often neither unambiguous nor unproblematic. This thesis examines computational tools related to label-free proteomics. The thesis covers key questions from data normalization to the choice of a suitable preprocessing software and the handling of missing values. In addition to data preprocessing, the statistical downstream analysis of the data is examined, in particular the detection of differential expression in longitudinal studies. The thesis introduces a new method for detecting differential expression that is suited to noisy, high-throughput longitudinal data containing missing values. In addition to the new statistical method, the thesis considers the enrichment of the detected statistical findings into practical understanding through integrated enrichment and network analyses. Effective visualization of such functional networks enables rapid interpretation of the key results and selection of the most interesting findings for further study. Finally, the conclusions concerning data preprocessing and the statistical downstream analysis of longitudinal studies are combined with knowledge enrichment. Based on these considerations, a possible workflow is presented for processing label-free MS proteomics data from raw data into exploitable findings and further into practical biological and medical understanding.
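    For longitudinal differential expression with missing values, a conventional baseline (not the novel approach proposed in the thesis) is to fit a per-protein linear mixed model with a time-by-group interaction; the sketch below assumes a long-format table with hypothetical column names and simply drops missing intensities per protein.

```python
# Baseline longitudinal differential expression: per-protein linear mixed model
# with a time x group interaction, fitted with statsmodels. Illustrative only.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def longitudinal_de(df):
    """df columns (hypothetical): protein, subject, group, time, intensity (log scale)."""
    results = {}
    for protein, sub in df.dropna(subset=["intensity"]).groupby("protein"):
        # Random intercept per subject; the time:group interaction asks whether
        # the longitudinal trend differs between the groups.
        fit = smf.mixedlm("intensity ~ time * group", sub, groups=sub["subject"]).fit()
        pval = next((p for name, p in fit.pvalues.items()
                     if name.startswith("time:")), np.nan)
        results[protein] = pval
    return pd.Series(results, name="interaction_pvalue")
```

    The resulting per-protein interaction p-values would then be adjusted for multiple testing across proteins, as in the earlier sketch.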

    Statistical modelling of masked gene regulatory pathway changes across microarray studies of interferon gamma activated macrophages

    Get PDF
    Interferon gamma (IFN-γ) regulation of macrophages plays an essential role in innate immunity and in the pathogenicity of viral infections by directing large and small genome-wide changes in the transcriptional program of macrophages. Smaller changes at the transcriptional level are difficult to detect but can have profound biological effects, motivating the hypothesis of this thesis that the responses of macrophages to immune activation by IFN-γ include small quantitative changes that are masked by noise but represent meaningful transcriptional systems in pathways against infection. To test this hypothesis, statistical meta-analysis of microarray studies is investigated as a tool to obtain the necessary increase in analysis sensitivity. Three meta-analysis models (effect size model, Rank Product model, Fisher's sum of logs) and three further modified versions were applied to a heterogeneous set of four microarray studies on the effect of IFN-γ on murine macrophages. Performance assessments include recovery of known biology and are followed by the development of novel biological hypotheses through secondary analysis of the meta-analysis outcomes in the context of independent biological data sources. A separate network analysis of a microarray time course study investigates whether gene sets with coordinated time-dependent relationships can also identify subtle IFN-γ-related transcriptional changes in macrophages that overlap with those identified through meta-analysis. It was found that all meta-analysis models can identify biologically meaningful transcription at enhanced sensitivity levels, with slight performance advantages for a non-parametric model (Rank Product meta-analysis). Meta-analysis yielded consistently regulated genes, hidden in individual microarray studies, related to sterol biosynthesis (Stard3, Pgrmc1, Galnt6, Rab11a, Golga4, Lrp10), implicated in cross-talk between type II and type I interferon or IL-10 signalling (Tbk1, Ikbke, Clic4, Ptpre, Batf), and circadian rhythm (Csnk1e). Further network analysis confirms that the meta-analysis findings are highly concentrated in a distinct immune response cluster of co-expressed genes, and also identifies global expression modularisation in IFN-γ treated macrophages, pointing to Trafd1 as a central anti-correlated node topologically linked to interactions with down-regulated sterol biosynthesis pathway members. The outcomes of this thesis suggest that small transcriptional changes in IFN-γ activated macrophages can be detected by enhancing sensitivity through the combination of multiple microarray studies. Together with the use of bioinformatics resources, independent data sets and network analysis, further validation assigns a potential role for genes with low or variable transcription in linking type II interferon signalling to type I and TLR signalling, as well as to the sterol metabolic network.
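    Of the three meta-analysis models named above, Fisher's sum of logs is the simplest to state: per-study p-values for a gene are combined as X = -2 Σ ln p_i, which follows a chi-squared distribution with 2k degrees of freedom under the null hypothesis. A minimal sketch with made-up p-values:

```python
# Fisher's sum of logs for combining per-study p-values of one gene.
# The example p-values below are hypothetical, for illustration only.
import numpy as np
from scipy import stats

def fisher_combine(pvalues):
    """Fisher's method: X = -2 * sum(ln p_i) ~ chi^2 with 2k degrees of freedom."""
    p = np.asarray(pvalues, dtype=float)
    statistic = -2.0 * np.sum(np.log(p))
    combined_p = stats.chi2.sf(statistic, df=2 * len(p))
    return statistic, combined_p

# A gene measured in four independent microarray studies (hypothetical values):
stat, p_comb = fisher_combine([0.04, 0.20, 0.08, 0.15])
print(f"chi2 = {stat:.2f}, combined p = {p_comb:.4f}")
```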