62 research outputs found

    Modular Algorithms for Biomolecular Network Alignment

    Comparative analysis of biomolecular networks constructed using measurements from different conditions, tissues, and organisms offers a powerful approach to understanding the structure, function, dynamics, and evolution of complex biological systems. The rapidly advancing field of systems biology aims to understand these systems in terms of the underlying networks of interactions among the large number of molecular participants involved, including genes, proteins, and metabolites. In particular, the comparative analysis of network models representing biomolecular interactions in different species or tissues offers an important tool for identifying conserved modules, predicting functions of specific genes or proteins, and studying the evolution of biological processes, among other applications. The primary focus of this dissertation is on the biomolecular network alignment problem: given two or more network models, the problem is to optimally match the nodes and links in one network with the nodes and links of the other. The Biomolecular Network Alignment (BiNA) Toolkit developed as part of this dissertation provides a set of efficient (in terms of running time complexity) and accurate (in terms of various evaluation criteria discussed in the literature) network alignment algorithms for biomolecular networks. BiNA is scalable, user-friendly, modular, and extensible for performing alignments on diverse types of biomolecular networks. The algorithm is applicable to undirected graphs in their (1) weighted and unweighted and (2) labeled and unlabeled variations, and has been applied to align multiple networks ranging from hundreds of nodes with a few thousand edges to tens of thousands of nodes with millions of edges.
The dissertation provides various applications of network comparison tools, including how results from such alignments have been utilized to (1) construct phylogenetic trees based on protein-protein interaction networks, and (2) find biochemical pathways involved in ligand recognition in B cells.

    Leveraging existing data sets to generate new insights into Alzheimer's disease biology in specific patient subsets

    To generate new insights into the biology of Alzheimer's Disease (AD), we developed methods to combine and reuse a wide variety of existing data sets in new ways. We first identified genes consistently associated with AD in each of four separate expression studies, and confirmed this result using a fifth study. We next developed algorithms to search hundreds of thousands of Gene Expression Omnibus (GEO) data sets, identifying a link between an AD-associated gene (NEUROD6) and gender. We therefore stratified patients by gender along with APOE4 status, and analyzed multiple SNP data sets to identify variants associated with AD. SNPs in either the region of NEUROD6 or SNAP25 were significantly associated with AD, in APOE4+ females and APOE4+ males, respectively. We developed algorithms to search Connectivity Map (CMAP) data for medicines that modulate AD-associated genes, identifying hypotheses that warrant further investigation for treating specific AD patient subsets. In contrast to other methods, this approach focused on integrating multiple gene expression datasets across platforms in order to achieve a robust intersection of disease-affected genes, and then leveraging these results in combination with genetic studies in order to prioritize potential genes for targeted therapy.

    Protein-RNA interface residue prediction using machine learning: an assessment of the state of the art

    Background: RNA molecules play diverse functional and structural roles in cells. They function as messengers for transferring genetic information from DNA to proteins, as the primary genetic material in many viruses, as catalysts (ribozymes) important for protein synthesis and RNA processing, and as essential and ubiquitous regulators of gene expression in living organisms. Many of these functions depend on precisely orchestrated interactions between RNA molecules and specific proteins in cells. Understanding the molecular mechanisms by which proteins recognize and bind RNA is essential for comprehending the functional implications of these interactions, but the recognition 'code' that mediates interactions between proteins and RNA is not yet understood. Success in deciphering this code would dramatically impact the development of new therapeutic strategies for intervening in devastating diseases such as AIDS and cancer. Because of the high cost of experimental determination of protein-RNA interfaces, there is an increasing reliance on statistical machine learning methods for training predictors of RNA-binding residues in proteins. However, because of differences in the choice of datasets, performance measures, and data representations used, it has been difficult to obtain an accurate assessment of the current state of the art in protein-RNA interface prediction. Results: We provide a review of published approaches for predicting RNA-binding residues in proteins and a systematic comparison and critical assessment of protein-RNA interface residue predictors trained using these approaches on three carefully curated non-redundant datasets. We directly compare two widely used machine learning algorithms (Naïve Bayes (NB) and Support Vector Machine (SVM)) using three different data representations in which features are encoded using either sequence- or structure-based windows. 
Our results show that (i) Sequence-based classifiers that use a position-specific scoring matrix (PSSM)-based representation (PSSMSeq) outperform those that use an amino acid identity based representation (IDSeq) or a smoothed PSSM (SmoPSSMSeq); (ii) Structure-based classifiers that use smoothed PSSM representation (SmoPSSMStr) outperform those that use PSSM (PSSMStr) as well as sequence identity based representation (IDStr). PSSMSeq classifiers, when tested on an independent test set of 44 proteins, achieve performance that is comparable to that of three state-of-the-art structure-based predictors (including those that exploit geometric features) in terms of Matthews Correlation Coefficient (MCC), although the structure-based methods achieve substantially higher Specificity (albeit at the expense of Sensitivity) compared to sequence-based methods. We also find that the expected performance of the classifiers on a residue level can be markedly different from that on a protein level. Our experiments show that the classifiers trained on three different non-redundant protein-RNA interface datasets achieve comparable cross-validation performance. However, we find that the results are significantly affected by differences in the distance threshold used to define interface residues. Conclusions: Our results demonstrate that protein-RNA interface residue predictors that use a PSSM-based encoding of sequence windows outperform classifiers that use other encodings of sequence windows. While structure-based methods that exploit geometric features can yield significant increases in the Specificity of protein-RNA interface residue predictions, such increases are offset by decreases in Sensitivity. 
These results underscore the importance of comparing alternative methods using rigorous statistical procedures, multiple performance measures, and datasets that are constructed based on several alternative definitions of interface residues and redundancy cutoffs, and of including evaluations on independent test sets in the comparisons.
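
The trade-off between Sensitivity, Specificity, and MCC mentioned in this abstract can be illustrated with a small sketch. It is not taken from the paper: the function name `binary_metrics` and the example counts are hypothetical, and Specificity is assumed here to mean the true negative rate TN / (TN + FP), which may differ from the definition the authors use.

```python
import math


def binary_metrics(tp, fp, tn, fn):
    """Return (sensitivity, specificity, mcc) for a binary confusion matrix.

    Assumption: specificity is the true negative rate TN / (TN + FP).
    Some interface-prediction papers define it differently.
    """
    # Fraction of true interface residues that were predicted as interface.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    # Fraction of non-interface residues that were predicted as non-interface.
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    # Matthews Correlation Coefficient; defined as 0 when a marginal is empty.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sensitivity, specificity, mcc


# Hypothetical residue-level counts, for illustration only.
sens, spec, mcc = binary_metrics(tp=40, fp=10, tn=45, fn=5)
print(f"sensitivity={sens:.3f} specificity={spec:.3f} mcc={mcc:.3f}")
```

Because MCC uses all four cells of the confusion matrix, two predictors can reach a similar MCC while sitting at different points on the Sensitivity/Specificity trade-off.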

    Cereblon gene expression and correlation with clinical outcomes in patients with relapsed/refractory multiple myeloma treated with pomalidomide: an analysis of STRATUS

    We analyzed gene expression levels of CRBN, cMYC, IRF4, BLIMP1, and XBP1 in 224 patients with multiple myeloma treated with pomalidomide and low-dose dexamethasone in the STRATUS study (ClinicalTrials.gov: NCT01712789; EudraCT number: 2012-001888-78). Clinical responses were observed at all CRBN expression levels. A trend in progression-free survival (PFS; p = .038) and a potential trend in overall survival (OS; p = .059) favoring high CRBN expressers were observed; however, no notable difference in overall response rate (ORR) was observed. ORR (30%), median PFS (17.7 weeks), and median OS (52.3 weeks) in low-CRBN expressers were comparable to those in the STRATUS intent-to-treat population (ORR, 33%; median PFS, 20.0 weeks; median OS, 51.7 weeks). A trend in ORR (p = .050) favoring higher cMYC expressers was observed with no notable difference in PFS or OS. This analysis does not support exploring CRBN as a biomarker for selecting patients for pomalidomide therapy. The authors received editorial assistance from William Ho, PhD, and Peter J Simon, PhD, funded by Celgene Corporation.

    Detection of gene orthology from gene co-expression and protein interaction networks

    Background: Ortholog detection methods present a powerful approach for finding genes that participate in similar biological processes across different organisms, extending our understanding of interactions between genes across different pathways, and understanding the evolution of gene families. Results: We exploit features derived from the alignment of protein-protein interaction networks and gene-coexpression networks to reconstruct KEGG orthologs, using the decision tree, Naive-Bayes, and Support Vector Machine classification algorithms. The protein-protein interaction networks, for Drosophila melanogaster, Saccharomyces cerevisiae, Mus musculus, and Homo sapiens, were extracted from the DIP repository; the gene coexpression networks, for Mus musculus, Homo sapiens, and Sus scrofa, were extracted from NCBI's Gene Expression Omnibus. Conclusions: The performance of our classifiers in reconstructing KEGG orthologs is compared against a basic reciprocal BLAST hit approach. We provide implementations of the resulting algorithms as part of BiNA, an open source biomolecular network alignment toolkit.
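
The reciprocal BLAST hit baseline named in this abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes similarity scores (for example, BLAST bit scores) have already been collected into dictionaries, and the function names and gene identifiers are hypothetical.

```python
def best_hits(scores):
    """Map each query to its single highest-scoring subject.

    scores: dict mapping (query, subject) -> similarity score.
    Ties keep the first pair encountered.
    """
    best = {}
    for (query, subject), score in scores.items():
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Return sorted (a, b) pairs where a's best hit is b and b's best hit is a."""
    forward = best_hits(a_vs_b)
    reverse = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if reverse.get(b) == a)


# Hypothetical scores between genes of species A and species B.
a_vs_b = {("a1", "b1"): 200, ("a1", "b2"): 50, ("a2", "b1"): 120, ("a2", "b2"): 90}
b_vs_a = {("b1", "a1"): 190, ("b1", "a2"): 110, ("b2", "a2"): 95}
print(reciprocal_best_hits(a_vs_b, b_vs_a))
```

In this toy input, `a2` is not paired with anything: its best hit `b1` prefers `a1`, so the pair is not reciprocal. This is the kind of case where network-derived features might add information beyond sequence similarity alone.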

    A high-risk, Double-Hit, group of newly diagnosed myeloma identified by genomic analysis

    Patients with newly diagnosed multiple myeloma (NDMM) with high-risk disease are in need of new treatment strategies to improve outcomes. Multiple clinical, cytogenetic, or gene expression features have been used to identify high-risk patients, each of which has significant weaknesses. Inclusion of molecular features into risk stratification could resolve the current challenges. In a genome-wide analysis of the largest set of molecular and clinical data established to date from NDMM, as part of the Myeloma Genome Project, we have defined DNA drivers of aggressive clinical behavior. Whole-genome and exome data from 1273 NDMM patients identified genetic factors that contribute significantly to progression-free survival (PFS) and overall survival (OS) (cumulative R² = 18.4% and 25.2%, respectively). When DNA drivers and clinical data were integrated into a Cox model using 784 patients with ISS, age, PFS, OS, and genomic data, the model had a cumulative R² of 34.3% for PFS and 46.5% for OS. A high-risk subgroup was defined by recursive partitioning using either a) bi-allelic TP53 inactivation or b) amplification (≥4 copies) of CKS1B (1q21) on the background of International Staging System III, comprising 6.1% of the population (median PFS = 15.4 months; OS = 20.7 months), and was validated in an independent dataset. Double-Hit patients have a dire prognosis despite modern therapies and should be considered for novel therapeutic approaches.
