
Tumor classification and marker gene prediction by feature selection and fuzzy c-means clustering using microarray data

Author: Bø Trond Hellem
Hovig Eivind
Jonassen Inge
Myklebost Ola
Wang Junbai
Publication venue: BioMed Central
Publication date: 01/01/2003

BACKGROUND: Using DNA microarrays, we have developed two novel models for tumor classification and target gene prediction. First, gene expression profiles are summarized by optimally selected Self-Organizing Maps (SOMs), followed by tumor sample classification by Fuzzy C-means clustering. Then, marker genes are predicted by either manual feature selection (visualizing the weighted/mean SOM component plane) or automatic feature selection (by pair-wise Fisher's linear discriminant). RESULTS: The proposed models were tested on four published datasets: (1) leukemia, (2) colon cancer, (3) brain tumors, and (4) NCI cancer cell lines. The models gave class predictions with markedly reduced error rates compared to other class prediction approaches, and the results also emphasized the importance of feature selection in microarray data analysis. CONCLUSIONS: Our models identify marker genes with predictive potential, often better than other available methods in the literature. The models are potentially useful for medical diagnostics and may reveal some insights into cancer classification. Additionally, we illustrated two limitations in tumor classification from microarray data related to the biology underlying the data, in terms of (1) the class size of data, and (2) the internal structure of classes. These limitations are not specific to the classification models used.
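
The abstract above names Fuzzy C-means clustering as the sample classification step. The following is a minimal sketch of the standard fuzzy c-means update rules in plain Python, not the authors' implementation. The function name `fuzzy_c_means`, the fuzzifier `m=2.0`, the random initialisation, and the toy data are all assumptions made for illustration; the abstract does not specify any of them.

```python
import math
import random


def fuzzy_c_means(points, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Return (centers, memberships) for a list of equal-length numeric tuples.

    Hypothetical helper for illustration; m is the fuzzifier (m > 1).
    """
    rng = random.Random(seed)
    n = len(points)
    dim = len(points[0])

    # Random initial memberships, each row normalised to sum to 1.
    u = []
    for _ in range(n):
        row = [rng.random() + 1e-9 for _ in range(n_clusters)]
        total = sum(row)
        u.append([v / total for v in row])

    centers = []
    for _ in range(n_iter):
        # Centre update: weighted mean of the points with weights u_ik ** m.
        centers = []
        for k in range(n_clusters):
            weights = [u[i][k] ** m for i in range(n)]
            wsum = sum(weights)
            centers.append(tuple(
                sum(w * p[d] for w, p in zip(weights, points)) / wsum
                for d in range(dim)
            ))

        # Membership update: u_ik is proportional to d_ik ** (-2 / (m - 1)).
        new_u = []
        for p in points:
            dists = [math.dist(p, c) for c in centers]
            if min(dists) == 0.0:
                # The point coincides with a centre: full membership there.
                hit = dists.index(0.0)
                row = [1.0 if k == hit else 0.0 for k in range(n_clusters)]
            else:
                inv = [d ** (-2.0 / (m - 1.0)) for d in dists]
                total = sum(inv)
                row = [v / total for v in inv]
            new_u.append(row)

        # Stop once the largest membership change falls below the tolerance.
        change = max(abs(a - b)
                     for ra, rb in zip(u, new_u)
                     for a, b in zip(ra, rb))
        u = new_u
        if change < tol:
            break
    return centers, u


# Toy "samples" with two features each (made-up numbers, not microarray data).
samples = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
           (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
centers, memberships = fuzzy_c_means(samples, n_clusters=2)
hard_labels = [row.index(max(row)) for row in memberships]
```

Each row of `memberships` gives one sample's degree of belonging to every cluster, so a sample lying between two classes keeps a graded assignment instead of a single hard label. Taking the largest membership per row, as in `hard_labels`, recovers a crisp classification when one is needed.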

NORA - Norwegian Open Research Archives

M-CGH: Analysing microarray-based CGH experiments

Author: Kresse Stine H
Meza-Zepeda Leonardo A
Myklebost Ola
Wang Junbai
Publication venue: BioMed Central
Publication date: 01/01/2004

BACKGROUND: Microarray-based comparative genomic hybridisation (array CGH) is a technique by which variation in relative copy numbers between two genomes can be analysed by competitive hybridisation to DNA microarrays. This technology has most commonly been used to detect chromosomal amplifications and deletions in cancer. Dedicated tools are needed to analyse the results of such experiments; these tools should include appropriate visualisation and take into consideration the physical relation in the genome between the probes on the array. RESULTS: M-CGH is a MATLAB toolbox with a graphical user interface designed specifically for the analysis of array CGH experiments, with multiple approaches to ratio normalization. Specifically, the distributions of three classes of DNA copy numbers (gains, normal and losses) can be estimated using a maximum likelihood method. Amplicon boundaries are computed by either the fuzzy K-nearest neighbour method or a wavelet approach. The program also allows linking each genomic clone with the corresponding genomic information in the Ensembl database. CONCLUSIONS: M-CGH, which encompasses the basic tools needed for analysing array CGH experiments, is freely available for academics, and does not require any other MATLAB toolbox.

NORA - Norwegian Open Research Archives

Clustering of the SOM easily reveals distinct gene expression patterns: results of a reanalysis of lymphoma study

Author: Aasheim Hans Christian
Delabie Jan
Myklebost Ola
Smeland Erlend
Wang Junbai
Publication venue: BioMed Central
Publication date: 01/01/2002

BACKGROUND: A method to evaluate and analyze the massive data generated by series of microarray experiments is of utmost importance to reveal the hidden patterns of gene expression. Because of the complexity and the high dimensionality of microarray gene expression profiles, the dimensional reduction of raw expression data and the feature selection necessary for tasks such as the classification of disease samples remain a challenge. To solve the problem, we propose a two-level analysis. First, a self-organizing map (SOM) is used. SOM is a vector quantization method that simplifies and reduces the dimensionality of the original measurements and visualizes individual tumor samples in a SOM component plane. Next, hierarchical clustering and K-means clustering are used to identify patterns of gene expression useful for classification of samples. RESULTS: We tested the two-level analysis on public data from diffuse large B-cell lymphomas. The analysis easily distinguished major gene expression patterns without the need for supervision: a germinal center-related, a proliferation, an inflammatory and a plasma cell differentiation-related gene expression pattern. The first three patterns matched the patterns described in the original publication using supervised clustering analysis, whereas the fourth one was novel. CONCLUSIONS: Our study shows that by using SOM as an intermediate step to analyze genome-wide gene expression data, the gene expression patterns can more easily be revealed. The "expression display" by the SOM component plane summarises the complicated data in a way that allows the clinician to evaluate the classification options rather than giving a fixed diagnosis.

NORA - Norwegian Open Research Archives

BayesPI-BAR2: A New Python Package for Predicting Functional Non-coding Mutations in Cancer Patient Cohorts

Author: Jan Delabie
Junbai Wang
Kirill Batmanov
Publication venue: Frontiers Media SA
Publication date: 01/04/2019

Most somatic mutations in cancer occur outside of gene coding regions. These mutations may disrupt gene regulation by affecting protein-DNA interaction. A study of these disruptions is important in understanding tumorigenesis. However, current computational tools process DNA sequence variants individually when predicting the effect on protein-DNA binding. Thus, it is a daunting task to identify functional regulatory disturbances among thousands of mutations in a patient. Previously, we have reported and validated a pipeline for identifying functional non-coding somatic mutations in cancer patient cohorts, by integrating diverse information such as gene expression, spatial distribution of the mutations, and a biophysical model for estimating protein binding affinity. Here, we present a new user-friendly Python package, BayesPI-BAR2, based on the proposed pipeline for integrative whole-genome sequence analysis. This may be the first prediction package that considers information from both multiple mutations and multiple patients. It is evaluated in follicular lymphoma and skin cancer patients, by focusing on sequence variants in gene promoter regions. BayesPI-BAR2 is a useful tool for predicting functional non-coding mutations in whole genome sequencing data: it allows identification of novel transcription factors (TFs) whose binding is altered by non-coding mutations in cancer. The BayesPI-BAR2 program can analyze multiple datasets of genome-wide mutations at once and generate concise, easily interpretable reports for potentially affected gene regulatory sites. The package is freely available at http://folk.uio.no/junbaiw/BayesPI-BAR2/

Characterizing a collective and dynamic component of chromatin immunoprecipitation enrichment profiles in yeast

Author: Bussemaker Harmen J.
Wang Junbai
Ward Lucas
Publication venue: Columbia University Libraries/Information Services
Publication date: 01/01/2014

Background: Recent chromatin immunoprecipitation (ChIP) experiments in fly, mouse, and human have revealed the existence of high-occupancy target (HOT) regions or “hotspots” that show enrichment across many assayed DNA-binding proteins. Similar co-enrichment observed in yeast so far has been treated as artifactual, and has not been fully characterized. Results: Here we reanalyze ChIP data from both array-based and sequencing-based experiments to show that in the yeast S. cerevisiae, the collective enrichment phenomenon is strongly associated with proximity to noncoding RNA genes and with nucleosome depletion. DNA sequence motifs that confer binding affinity for the proteins are largely absent from these hotspots, suggesting that protein-protein interactions play a prominent role. The hotspots are condition-specific, suggesting that they reflect a chromatin state or protein state, and are not a static feature of underlying sequence. Additionally, only a subset of all assayed factors is associated with these loci, suggesting that the co-enrichment cannot be simply explained by a chromatin state that is universally more prone to immunoprecipitation. Conclusions: Together, our results suggest that the co-enrichment patterns observed in yeast represent transcription factor co-occupancy. More generally, they make clear that great caution must be used when interpreting ChIP enrichment profiles for individual factors in isolation, as they will include factor-specific as well as collective contributions.

Columbia University Academic Commons

arXiv.org e-Print Archive

Application of new probabilistic graphical models in the genetic regulatory networks studies

Author: Anderson
Bar-Joseph
Chiang
Chickering
Cox
de la Fuente
Edwards
Friedman
Futcher
Geiger
Hartemink
Jan Delabie
Jong
Junbai Wang
Kikuchi
Lee
Leo Wang-Kit Cheung
Li
Meek
Qian
Rangel
Roberts
Rung
Segal
Somogyi
Spirtes
Steffen
Toh
Troyanskaya
Wang
Wu
Yeung
Yu
Zhang
Zhou
Publication venue: Elsevier BV
Publication date: 31/12/2005

This paper introduces two new probabilistic graphical models for reconstruction of genetic regulatory networks using DNA microarray data. One is an Independence Graph (IG) model with either a forward or a backward search algorithm, and the other is a Gaussian Network (GN) model with a novel greedy search method. The performance of both models was evaluated on four MAPK pathways in yeast and three simulated data sets. Generally, an IG model provides a sparse graph, whereas a GN model produces a dense graph in which more information about gene-gene interactions is preserved. Additionally, we found two key limitations in the prediction of genetic regulatory networks using DNA microarray data: the first is the sufficiency of sample size, and the second is that the complexity of network structures may not be captured without additional data at the protein level. Those limitations are present in all prediction methods that use only DNA microarray data. Comment: 38 pages, 3 figures.

Elsevier - Publisher Connector

Genome-wide analysis uncovers high frequency, strong differential chromosomal interactions and their associated epigenetic patterns in E2-mediated gene regulation

Author: Hang-Kai Hsu
Jeffrey Parvin
Junbai Wang
Kun Huang
Pei-Yin Hsu
Tim H-M Huang
Victor X Jin
Xun Lan
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2013

Computational study of associations between histone modification and protein-DNA binding in yeast genome by integrating diverse information

Author: A Boorsma
A Pekowska
AJ Saldanha
AP Wolffe
B Li
B van Steensel
BC Foat
C Jiang
C Moorman
C Stark
CB Millar
CL Peterson
CT Harbison
D Mackay
D Ucar
DE Schones
DK Pokholok
F Gao
F Robert
GJ Filion
H Pham
HK Tsai
I Nabney
J Mellor
J Wang
J. Ernst
Junbai Wang
Jung-Shin Lee
KJ Won
Kyoung-Jae Won
LD Ward
MB Eisen
MD Shahbazian
ML Bulyk
ND Heintzman
R Gordan
R. Karlic
RH Morse
S Henikoff
S Mahony
SE Hanlon
SK Kurdistani
SL Schreiber
T Kouzarides
TL Bailey
W Lee
X Guo
XS Liu
Publication venue: BioMed Central
Publication date: 01/01/2011

Background: In parallel with the rapid development of high-throughput technologies, in vivo (in vitro) experiments for genome-wide identification of protein-DNA interactions have been developed. Nevertheless, a few questions remain in the field, such as how to distinguish true protein-DNA binding (functional binding) from non-specific protein-DNA binding (non-functional binding). Previous studies tackled the problem by integrated analysis of multiple available sources. However, few systematic studies have been carried out to examine the possible relationships between histone modification and protein-DNA binding. Here, this issue was investigated by using publicly available histone modification data in yeast. Results: Two separate histone modification datasets were studied, at both the open reading frame (ORF) and the promoter region of binding targets for 37 yeast transcription factors. Both results revealed a distinct histone modification pattern between the functional protein-DNA binding sites and non-functional ones for almost half of all TFs tested. This difference is much stronger at the ORF than at the promoter region. In addition, a protein-histone modification interaction pathway can only be inferred from the functional protein binding targets. Conclusions: Overall, the results suggest that histone modification information can be used to distinguish the functional protein-DNA binding from the non-functional, and that the regulation of various proteins is controlled by the modification of different histone lysines, such as the protein-specific histone modification levels.

Helsebibliotekets Research Archive

BayesPI - a new model to study protein-DNA interactions: a case study of condition-specific protein binding parameters for Yeast transcription factors

Author: A Delaunay
A Tanay
A Yarragudi
AE Tsong
AR Borneman
B Alberts
BC Foat
BE Bernstein
C Moorman
CK Lee
CT Harbison
CY Chen
D Das
D Mackay
DC Raitt
DS Fields
E Aurell
E Wingender
F Gao
F Ozsolak
G Tuteja
GL Bond
HG Roider
HJ Bussemaker
HK Tsai
I Nabney
J Deckert
J Lee
J Wang
J Zeitlinger
JB Kinney
JM Bland
JM Cherry
Junbai Wang
K Murphy
KD MacIsaac
L Jen-Jacobson
L Narlikar
L Segal
M Djordjevic
MJ Buck
ML Bulyk
Morigen
MP Ryan
O Sertil
OG Berg
PV Benos
Q Zhou
RD Kornberg
RF Lascaris
RH Morse
S Ghaemmaghami
S Keles
SF Gull
TE Cheatham 3rd
TK Man
TN Mavrich
U Gerland
VB Zhurkin
W Gorner
W Lee
WK Olson
X Liu
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: We have incorporated Bayesian model regularization with biophysical modeling of protein-DNA interactions and of genome-wide nucleosome positioning to study protein-DNA interactions, using a high-throughput dataset. The newly developed method (BayesPI) includes the estimation of transcription factor (TF) binding energy matrices, the computation of the binding affinity of a TF target site, and the corresponding chemical potential. Results: The method was successfully tested on synthetic ChIP-chip datasets and real yeast ChIP-chip experiments. Subsequently, it was used to estimate condition-specific and species-specific protein-DNA interactions for several yeast TFs. Conclusion: The results revealed that the modification of the protein binding parameters and the variation of the individual nucleotide affinity in either recognition or flanking sequences occurred under different stresses and in different species. The findings suggest that such modifications may be adaptive and play roles in the formation of the environment-specific binding patterns of yeast TFs and in the divergence of TF binding sites across the related yeast species.
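
The abstract above refers to a binding energy matrix, the binding affinity of a target site, and a chemical potential. As a rough illustration of how these three quantities can fit together, the sketch below uses an additive energy matrix and a Fermi-Dirac-style occupancy, a form commonly used in biophysical models of TF binding. This functional form, the helper names `binding_energy` and `occupancy`, and every number in `toy_matrix` are assumptions for illustration; the abstract does not give the exact equations, and this is not the BayesPI code.

```python
import math


def binding_energy(site, energy_matrix):
    """Sum per-position energy contributions for a DNA site (additive model).

    Energies are in units of k_B*T; lower means stronger binding.
    """
    if len(site) != len(energy_matrix):
        raise ValueError("site length must match the number of matrix columns")
    return sum(column[base] for base, column in zip(site, energy_matrix))


def occupancy(site, energy_matrix, mu):
    """Assumed Fermi-Dirac-style binding probability: 1 / (1 + exp(E - mu))."""
    energy = binding_energy(site, energy_matrix)
    return 1.0 / (1.0 + math.exp(energy - mu))


# Hypothetical 3-bp energy matrix (made-up values; consensus site is "ACG").
toy_matrix = [
    {"A": 0.0, "C": 2.0, "G": 2.0, "T": 1.5},
    {"A": 1.0, "C": 0.0, "G": 2.5, "T": 2.0},
    {"A": 2.0, "C": 1.5, "G": 0.0, "T": 1.0},
]

mu = 0.0  # hypothetical chemical potential, reflecting TF concentration
consensus_occupancy = occupancy("ACG", toy_matrix, mu)  # E = 0.0
mismatch_occupancy = occupancy("TCG", toy_matrix, mu)   # E = 1.5
```

In this form, raising `mu` (for example, a higher effective TF concentration under a given condition) increases the occupancy of every site, while changes to individual matrix entries alter only the sites that contain the affected base at that position.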