The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise
Abstract: In many real-world classification problems, class-conditional classification noise (CCC-Noise) degrades the performance of a classifier that is built naively by ignoring it. In this paper, we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distributions of the misclassification error rates of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR degrades less under CCC-Noise than NDA does. Under CCC-Noise, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of the two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise than NDA.
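For orientation, the setting described in this abstract can be written out in standard notation. The symbols below (the flip probabilities \gamma_k, the means \mu_k, the common covariance \Sigma, and the distance \Delta) are assumptions chosen for illustration; the paper's own notation may differ. The last line is the textbook Bayes error for two equal-prior normal populations with a common covariance matrix, and it indicates why a small \Delta corresponds to a harder problem.

```latex
% Class-conditional label noise: the observed label \tilde{y} differs from
% the true class y with a probability that depends only on the true class.
\[
  \gamma_k = \Pr(\tilde{y} \neq k \mid y = k), \qquad k = 1, 2 .
\]
% Two multivariate normal populations with a common covariance matrix.
\[
  x \mid y = k \;\sim\; N_p(\mu_k, \Sigma), \qquad k = 1, 2 .
\]
% Mahalanobis distance between the two populations.
\[
  \Delta^2 = (\mu_1 - \mu_2)^{\top} \Sigma^{-1} (\mu_1 - \mu_2) .
\]
% Bayes (optimal) misclassification rate with equal priors and no label noise.
\[
  \text{error}_{\text{Bayes}} = \Phi\!\left(-\tfrac{\Delta}{2}\right) .
\]
```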
Tree-Based Position Weight Matrix Approach to Model Transcription Factor Binding Site Profiles
Most position weight matrix (PWM) based bioinformatics methods developed to predict transcription factor binding sites (TFBSs) assume that each nucleotide in the sequence motif contributes independently to the interaction between protein and DNA sequence, which usually produces many false-positive predictions. The increasing availability of TF enrichment profiles from recent ChIP-Seq methodology facilitates the investigation of dependency structure and the accurate prediction of TFBSs. We develop a novel Tree-based PWM (TPWM) approach to accurately model the interaction between a TF and its binding site. The whole tree-structured PWM can be considered a mixture of different conditional PWMs. We propose a discriminative approach, called TPD (TPWM-based Discriminative Approach), to construct the TPWM from ChIP-Seq data with a pre-existing PWM. To achieve the maximum discriminative power between the positive and negative datasets, the cutoff value is determined based on the Matthews correlation coefficient (MCC). The resulting TPWMs are evaluated with respect to accuracy on extensive synthetic datasets. We then apply our TPWM discriminative approach to several real ChIP-Seq datasets to refine the current TFBS models stored in the TRANSFAC database. Experiments on both the simulated and real ChIP-Seq data show that the proposed method, starting from an existing PWM, has consistently better performance than existing tools in detecting TFBSs. The improved accuracy is the result of modelling the complete dependency structure of the motifs and achieving a better true positive rate. The findings could lead to a better understanding of the mechanisms of TF-DNA interactions.
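The MCC-based cutoff selection mentioned in this abstract can be illustrated with a minimal sketch. It assumes that each sequence in the positive and negative sets already has a single numeric score (for example from a PWM scan) and that a sequence is predicted as a binding site when its score is at least the cutoff. The function names `mcc` and `best_cutoff` and the example scores are hypothetical and are not taken from the paper.

```python
import math


def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # By convention, return 0 when any marginal count is zero.
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


def best_cutoff(pos_scores, neg_scores):
    """Return (cutoff, mcc) maximizing MCC over all observed scores.

    A sequence is called positive when its score >= cutoff.
    Ties are resolved in favour of the lowest cutoff.
    """
    best_c, best_m = None, -1.0
    for c in sorted(set(pos_scores) | set(neg_scores)):
        tp = sum(s >= c for s in pos_scores)
        fn = len(pos_scores) - tp
        fp = sum(s >= c for s in neg_scores)
        tn = len(neg_scores) - fp
        m = mcc(tp, fp, tn, fn)
        if m > best_m:
            best_c, best_m = c, m
    return best_c, best_m


# Hypothetical scores for bound (positive) and unbound (negative) sequences.
cutoff, value = best_cutoff([0.9, 0.8, 0.7], [0.3, 0.2, 0.1])
print(cutoff, value)
```

Scanning only the observed scores is sufficient here because the confusion matrix can change only at those values.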
Identifying the substrate proteins of U-box E3s E4B and CHIP by orthogonal ubiquitin transfer
E3 ubiquitin (UB) ligases E4B and carboxyl terminus of Hsc70-interacting protein (CHIP) use a common U-box motif to transfer UB from E1 and E2 enzymes to their substrate proteins and regulate diverse cellular processes. To profile their ubiquitination targets in the cell, we used phage display to engineer E2-E4B and E2-CHIP pairs that were free of cross-reactivity with the native UB transfer cascades. We then used the engineered E2-E3 pairs to construct "orthogonal UB transfer (OUT)" cascades so that a mutant UB (xUB) could be exclusively used by the engineered E4B or CHIP to label their substrate proteins. Purification of xUB-conjugated proteins followed by proteomics analysis enabled the identification of hundreds of potential substrates of E4B and CHIP in human embryonic kidney 293 cells. Kinase MAPK3 (mitogen-activated protein kinase 3), methyltransferase PRMT1 (protein arginine N-methyltransferase 1), and phosphatase PPP3CA (protein phosphatase 3 catalytic subunit alpha) were identified as the shared substrates of the two E3s. Phosphatase PGAM5 (phosphoglycerate mutase 5) and deubiquitinase OTUB1 (ovarian tumor domain containing ubiquitin aldehyde binding protein 1) were confirmed as E4B substrates, and β-catenin and CDK4 (cyclin-dependent kinase 4) were confirmed as CHIP substrates. On the basis of the CHIP-CDK4 circuit identified by OUT, we revealed that CHIP signals CDK4 degradation in response to endoplasmic reticulum stress.
Distinct mechanisms control genome recognition by p53 at its target genes linked to different cell fates.
The tumor suppressor p53 integrates stress response pathways by selectively engaging one of several potential transcriptomes, thereby triggering cell fate decisions (e.g., cell cycle arrest, apoptosis). Foundational to this process is the binding of tetrameric p53 to 20-bp response elements (REs) in the genome (RRRCWWGYYY-N(0-13)-RRRCWWGYYY). In general, REs at cell cycle arrest targets (e.g., p21) are of higher affinity than those at apoptosis targets (e.g., BAX). However, the RE sequence code underlying selectivity remains undeciphered. Here, we identify molecular mechanisms mediating p53 binding to high- and low-affinity REs by showing that key determinants of the code are embedded in the DNA shape. We further demonstrate that differences in minor/major groove widths, encoded by G/C or A/T bp content at positions 3, 8, 13, and 18 in the RE, determine distinct p53 DNA-binding modes by inducing different Arg248 and Lys120 conformations and interactions. The predictive capacity of this code was confirmed in vivo using genome editing at the BAX RE to interconvert the DNA-binding modes, transcription pattern, and cell fate outcome.
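The consensus quoted in this abstract uses IUPAC nucleotide codes (R = A or G, W = A or T, Y = C or T, N = any base), with two 10-bp half-sites separated by a spacer of 0 to 13 bases. As a rough sketch, the consensus can be turned into a regular expression and used to scan a sequence. The function name `find_p53_res` and the example sequence are hypothetical. Strict consensus matching is a simplification that ignores the mismatches found in many real REs, and it is not the DNA-shape analysis the abstract describes.

```python
import re

# One decameric half-site RRRCWWGYYY written with explicit base classes.
HALF_SITE = "[AG]{3}C[AT]{2}G[CT]{3}"

# Two half-sites separated by a spacer of 0 to 13 arbitrary bases.
P53_RE = re.compile(HALF_SITE + "[ACGT]{0,13}" + HALF_SITE)


def find_p53_res(seq):
    """Return (start, matched_sequence) for non-overlapping consensus matches."""
    seq = seq.upper()
    return [(m.start(), m.group(0)) for m in P53_RE.finditer(seq)]


# Hypothetical sequence: two perfect half-sites with no spacer, flanked by TT/AA.
print(find_p53_res("ttGGGCATGTCCGGGCATGTCCaa"))
```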
The efficiency of logistic regression compared to normal discriminant analysis under class-conditional classification noise
In many real-world classification problems, class-conditional classification noise (CCC-Noise) degrades the performance of a classifier that is built naively by ignoring it. In this paper, we investigate the impact of CCC-Noise on the quality of a popular generative classifier, normal discriminant analysis (NDA), and its corresponding discriminative classifier, logistic regression (LR). We consider the problem of two multivariate normal populations having a common covariance matrix. We compare the asymptotic distributions of the misclassification error rates of these two classifiers under CCC-Noise. We show that when the noise level is low, the asymptotic error rates of both procedures are only slightly affected. We also show that LR degrades less under CCC-Noise than NDA does. Under CCC-Noise, the Mahalanobis distance between the populations plays a vital role in determining the relative performance of the two procedures. In particular, when this distance is small, LR tends to be more tolerant of CCC-Noise than NDA.
Keywords: Class noise; Misclassification rate; Misspecified model; Asymptotic distribution