26 research outputs found

    A Multiobjective Evolutionary Conceptual Clustering Methodology for Gene Annotation Within Structural Databases: A Case of Study on the Gene Ontology Database

    Current tools and techniques for examining the content of large databases are often hampered by their inability to support searches based on criteria that are meaningful to their users. These shortcomings are particularly evident in data banks storing representations of structural data such as biological networks. Conceptual clustering techniques have proven appropriate for uncovering relationships between features that characterize objects in structural data. However, typical conceptual clustering approaches normally recover the most obvious relations but fail to discover the less frequent but more informative underlying data associations. The combination of evolutionary algorithms with multiobjective and multimodal optimization techniques constitutes a suitable tool for solving this problem. We propose a novel conceptual clustering methodology termed evolutionary multiobjective conceptual clustering (EMO-CC), relying on the NSGA-II multiobjective (MO) genetic algorithm. We apply this methodology to identify conceptual models in structural databases generated from gene ontologies. These models can explain and predict phenotypes in the immunoinflammatory response problem, similar to those provided by gene expression or other genetic markers. The analysis of these results reveals that our approach uncovers cohesive clusters, even those comprising a small number of observations explained by several features, which allows describing objects and their interactions from different perspectives and at different levels of detail. Funding: Ministerio de Ciencia y Tecnología TIC-2003-00877; Ministerio de Ciencia y Tecnología BIO2004-0270E; Ministerio de Ciencia y Tecnología TIN2006-1287
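
As a rough illustration of the multiobjective idea behind NSGA-II, the sketch below filters candidate clusters down to the non-dominated (Pareto) set. It is a minimal sketch, not the EMO-CC implementation: the two objectives shown (a specificity score and a cluster size, both to be maximized) and the function names are assumptions made for the example.

```python
from typing import List, Sequence


def dominates(a: Sequence[float], b: Sequence[float]) -> bool:
    """Return True if `a` Pareto-dominates `b` (all objectives maximized).

    `a` dominates `b` when it is at least as good on every objective
    and strictly better on at least one.
    """
    at_least_as_good = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return at_least_as_good and strictly_better


def pareto_front(points: List[Sequence[float]]) -> List[Sequence[float]]:
    """Keep only the points that no other point dominates."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical candidate clusters scored as (specificity, size).
candidates = [(0.9, 2), (0.5, 5), (0.4, 4), (0.9, 1)]
front = pareto_front(candidates)  # [(0.9, 2), (0.5, 5)]
```

Keeping the whole front, rather than a single best score, is what lets small but specific clusters survive alongside large, generic ones.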

    Possibilistic Approach to Biclustering: An Application to Oligonucleotide Microarray Data Analysis

    The important research objective of identifying genes with similar behavior with respect to different conditions has recently been tackled with biclustering techniques. In this paper we introduce a new approach to the biclustering problem using the Possibilistic Clustering paradigm. The proposed Possibilistic Biclustering algorithm finds one bicluster at a time, assigning a membership to the bicluster for each gene and for each condition. The biclustering problem, in which one seeks to maximize the size of the bicluster while minimizing the residual, is cast as the optimization of a suitable functional. We applied the algorithm to the Yeast database, obtaining fast convergence and good-quality solutions. We discuss the effects of parameter tuning and the sensitivity of the method to parameter values. Comparisons with other methods from the literature are also presented.
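
The abstract does not define the residual. A common choice in the biclustering literature is the mean squared residue of Cheng and Church, and the sketch below assumes that definition; the function name and the toy matrix are hypothetical. It computes the crisp (non-possibilistic) residue of a chosen set of rows and columns, without the per-gene and per-condition memberships the paper describes.

```python
from typing import List, Sequence


def mean_squared_residue(
    matrix: Sequence[Sequence[float]], rows: List[int], cols: List[int]
) -> float:
    """Mean squared residue of the submatrix selected by `rows` and `cols`.

    Each entry's residue is a_ij - (row mean) - (column mean) + (overall mean),
    all taken within the submatrix. A perfectly additive bicluster scores 0.
    """
    sub = [[matrix[i][j] for j in cols] for i in rows]
    n_rows, n_cols = len(rows), len(cols)
    row_means = [sum(r) / n_cols for r in sub]
    col_means = [sum(sub[i][j] for i in range(n_rows)) / n_rows for j in range(n_cols)]
    overall = sum(row_means) / n_rows
    total = sum(
        (sub[i][j] - row_means[i] - col_means[j] + overall) ** 2
        for i in range(n_rows)
        for j in range(n_cols)
    )
    return total / (n_rows * n_cols)


# Toy expression matrix: rows are genes, columns are conditions.
data = [[1, 2, 3], [2, 3, 4], [5, 0, 9]]
coherent = mean_squared_residue(data, rows=[0, 1], cols=[0, 1, 2])
with_outlier = mean_squared_residue(data, rows=[0, 1, 2], cols=[0, 1, 2])
```

Adding the third gene enlarges the bicluster but raises the residue, which is the trade-off the functional has to balance.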

    Systematic analysis of genome-wide fitness data in yeast reveals novel gene function and drug action

    The relationship between co-fitness and co-inhibition of genes in chemicogenomic yeast screens provides insights into gene function and drug target prediction

    Redundancy-aware learning of protein structure-function relationships

    The protein kinases are a large family of enzymes that play a fundamental role in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residues, referred to here as substructures, have been shown to be informative of inhibitor selectivity. This thesis introduces two fundamental approaches for the comparative analysis of substructure similarity and demonstrates the importance of each method on a variety of large protein structure datasets for multiple biological applications. The Family-wise Alignment of SubStructural Templates Framework (The FASST Framework) provides an unsupervised learning approach for identifying substructure clusterings. The substructure clusterings identified by FASST allow for the automatic evaluation of substructure variability, the identification of distinct structural conformations and the selection of anomalous outlier structures within large structure datasets. These clusterings are shown to be capable of identifying biologically meaningful structure trends among a diverse number of protein families. The FASST Live visualization and analysis platform provides multiple comparative analysis pipelines and allows the user to interactively explore the substructure clusterings computed by FASST. The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method provides a supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. The ability of CCORPS to identify structural features predictive of functional divergence among families of homologous enzymes is demonstrated across 48 distinct protein families. 
The CCORPS method is further demonstrated to generalize to the very difficult problem of predicting protein kinase inhibitor affinity. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding ability of 12 of the 38 kinase inhibitors studied, while only having overall poor predictive ability for 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown to not be rare among kinases. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors. Importantly, both The FASST Framework and CCORPS implement a redundancy-aware approach to dealing with structure overrepresentation that allows for the incorporation of all available structure data. As shown in this thesis, surprising structural variability exists even among structure datasets consisting of a single protein sequence. By incorporating the full variety of structural conformations within the analysis, the methods presented here provide a richer view of the variability of large protein structure datasets

    Large-scale prediction of adverse drug reactions using chemical, biological, and phenotypic properties of drugs

    Objective: Adverse drug reaction (ADR) is one of the major causes of failure in drug development. Severe ADRs that go undetected until the post-marketing phase of a drug often lead to patient morbidity. Accurate prediction of potential ADRs is required in the entire life cycle of a drug, including early stages of drug design, different phases of clinical trials, and post-marketing surveillance. Methods: Many studies have utilized either chemical structures or molecular pathways of the drugs to predict ADRs. Here, the authors propose a machine-learning-based approach for ADR prediction by integrating the phenotypic characteristics of a drug, including indications and other known ADRs, with the drug's chemical structures and biological properties, including protein targets and pathway information. A large-scale study was conducted to predict 1385 known ADRs of 832 approved drugs, and five machine-learning algorithms for this task were compared. Results: This evaluation, based on a fivefold cross-validation, showed that the support vector machine algorithm outperformed the others. Of the three types of information, phenotypic data were the most informative for ADR prediction. When biological and phenotypic features were added to the baseline chemical information, the ADR prediction model achieved significant improvements in area under the curve (from 0.9054 to 0.9524), precision (from 43.37% to 66.17%), and recall (from 49.25% to 63.06%). Most importantly, the proposed model successfully predicted the ADRs associated with withdrawal of rofecoxib and cerivastatin. Conclusion: The results suggest that phenotypic information on drugs is valuable for ADR prediction. 
Moreover, they demonstrate that different models that combine chemical, biological, or phenotypic information can be built from approved drugs, and they have the potential to detect clinically important ADRs in both preclinical and post-marketing phases. This study was supported in part by grants from the NHLBI 5U19HL065962 and the NCI R01CA141307. ML is supported by the NLM training grant 3T15LM007450-08S1. JS is partially supported by the 2010 NARSAD Young Investigator Award. ZZ is partially supported by the 2009 NARSAD Maltz Investigator Award. MM is supported by a Veterans Administration HSR&D Career Development Award (CDA-08-020).
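
For readers less familiar with the three metrics quoted above, the following sketch shows how precision, recall, and area under the ROC curve can be computed for one binary ADR label. It is a generic illustration, not the authors' evaluation code; the function names and the toy labels and scores are assumptions.

```python
from typing import List, Tuple


def precision_recall(labels: List[int], predictions: List[int]) -> Tuple[float, float]:
    """Precision and recall for binary labels and binary predictions (1 = has ADR)."""
    tp = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 1)
    fp = sum(1 for y, p in zip(labels, predictions) if y == 0 and p == 1)
    fn = sum(1 for y, p in zip(labels, predictions) if y == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def auc(labels: List[int], scores: List[float]) -> float:
    """Area under the ROC curve as the fraction of (positive, negative) pairs
    in which the positive example receives the higher score; ties count 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example: four drugs, one ADR, classifier scores thresholded at 0.5.
labels = [1, 0, 1, 0]
scores = [0.9, 0.8, 0.4, 0.1]
predictions = [int(s >= 0.5) for s in scores]
```

AUC depends only on the ranking of the scores, whereas precision and recall depend on the chosen threshold, which is why the abstract reports all three.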

    Development of mathematical methods for modeling biological systems

    New approaches for clustering high dimensional data

    Clustering is one of the most effective methods for analyzing datasets that contain a large number of objects with numerous attributes. Clustering seeks to identify groups, or clusters, of similar objects. In low dimensional space, the similarity between objects is often evaluated by summing the difference across all of their attributes. High dimensional data, however, may contain irrelevant attributes which mask the existence of clusters. The discovery of groups of objects that are highly similar within some subsets of relevant attributes becomes an important but challenging task. My thesis focuses on various models and algorithms for this task. We first present a flexible clustering model, namely OP-Cluster (Order Preserving Cluster). Under this model, two objects are similar on a subset of attributes if the values of these two objects induce the same relative ordering of these attributes. The OP-Clustering algorithm has proven useful for identifying co-regulated genes in gene expression data. We also propose a semi-supervised approach to discover biologically meaningful OP-Clusters by incorporating existing gene function classifications into the clustering process. This semi-supervised algorithm yields only OP-Clusters that are significantly enriched by genes from specific functional categories. Real datasets are often noisy. We propose a noise-tolerant clustering algorithm for mining frequently occurring itemsets. This algorithm is called approximate frequent itemsets (AFI). Both the theoretical and experimental results demonstrate that our AFI mining algorithm has higher recoverability of real clusters than other existing itemset mining approaches. Pair-wise dissimilarities are often derived from original data to reduce the complexities of high dimensional data. Traditional clustering algorithms that take pair-wise dissimilarities as input often generate disjoint clusters. 
It is well known that the classification model represented by disjoint clusters is inconsistent with many real classifications, such as gene function classifications. We develop a Poclustering algorithm, which generates overlapping clusters from pair-wise dissimilarities. We prove that by allowing overlapping clusters, Poclustering fully preserves the information of any dissimilarity matrix, while traditional partitioning algorithms may cause significant information loss.
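
The order-preserving similarity described above can be illustrated with a short check: two objects match on a subset of attributes when sorting those attributes by value gives the same sequence for both. This is a minimal sketch of that definition only, not the OP-Clustering mining algorithm; the function names and expression values are hypothetical, and ties are broken by attribute position, which is an assumption.

```python
from typing import List, Sequence, Tuple


def induced_order(values: Sequence[float], attrs: List[int]) -> Tuple[int, ...]:
    """Attributes in `attrs` listed from smallest to largest value.

    `sorted` is stable, so tied values keep their order in `attrs`
    (an assumption; the abstract does not say how ties are handled).
    """
    return tuple(sorted(attrs, key=lambda a: values[a]))


def order_preserving(
    x: Sequence[float], y: Sequence[float], attrs: List[int]
) -> bool:
    """True if `x` and `y` induce the same relative ordering on `attrs`."""
    return induced_order(x, attrs) == induced_order(y, attrs)


# Hypothetical expression profiles over four conditions.
gene_a = [10, 40, 20, 90]
gene_b = [1, 5, 2, 7]    # very different magnitudes, same ups and downs as gene_a
gene_c = [9, 3, 5, 1]    # opposite trend

all_conditions = [0, 1, 2, 3]
same_ab = order_preserving(gene_a, gene_b, all_conditions)  # True
same_ac = order_preserving(gene_a, gene_c, all_conditions)  # False
```

Because only the ordering matters, genes with very different absolute expression levels can still fall in the same cluster.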

    Société Francophone de Classification (SFC) Actes des 26èmes Rencontres

    The proceedings of the meeting of the Société Francophone de Classification (SFC, http://www.sfc-classification.net/) contain all the contributions presented at the meeting held from 3 to 5 September 2019 at the Inria Nancy Grand Est/LORIA research centre in Nancy. Classification in all its forms, mathematical, computational (machine learning, data mining, knowledge discovery, ...), and statistical, is the theme of the meeting. The aim is to illustrate the different facets of classification that reflect the interests of researchers in the field, drawn from mathematics and computer science.

    Molecular Bronchiolitis Obliterans Syndrome Risk Monitoring: A Systems-Based Approach

    The combination of high-throughput omics (i.e. genomics or proteomics) and machine learning offers new possibilities for clinical diagnostics and the detection of biomarkers. One disease for which no reliable prognostic marker has yet been found is bronchiolitis obliterans (BO), a clinical manifestation of chronic rejection after lung transplantation. BO is the major limiting factor for long-term survival after lung transplantation, and manifests as a chronic bronchiolar inflammation accompanied by progressive sub-mucosal fibrosis leading to gradual obliteration of the bronchiolar lumen. The resulting reduction in forced expiratory volume in one second (FEV1) is defined as the bronchiolitis obliterans syndrome (BOS). As chronic lung transplant failure occurs more frequently than in other organ transplants, molecular markers for early BO and BOS detection are urgently required to adapt the patient's immunosuppressive regimen when airway damage is minimal. To achieve this goal, gene expression in bronchial epithelial cells (microarray analysis) and the proteome in bronchoalveolar lavage fluid (BALF) (mass spectrometry profiling) were monitored. Analysis of the obtained data sets was performed using novel and established methods from the fields of machine learning and statistics. This thesis also introduces a novel clustering algorithm, developed to address one problem in the analysis of gene expression microarrays: the unsupervised discovery of stable and biologically relevant patient subgroups. The algorithm focuses on the discovery of a set of patient clusters defined by the consistent up- and down-regulation of a subset of genes. Cluster stability is assessed using a bootstrap resampling scheme, which makes it possible to rank the genes in accordance with their clusterwise importance. 
The algorithm was applied to a publicly available B-cell lymphoma microarray data set and compared to other commonly used clustering algorithms.
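
The abstract gives no details of the bootstrap scheme, so the sketch below shows only one generic way to assess cluster stability by resampling: redraw the patients with replacement, recluster, and record how often each pair of patients lands in the same cluster. The function names, the threshold-based toy clustering, and the single expression value per patient are all assumptions for illustration, not the thesis algorithm.

```python
import random
from itertools import combinations
from typing import Callable, Dict, List, Tuple


def co_clustering_stability(
    values: List[float],
    cluster_fn: Callable[[List[float]], List[int]],
    n_boot: int = 200,
    seed: int = 0,
) -> Dict[Tuple[int, int], float]:
    """For each pair of samples, the fraction of bootstrap replicates
    (among those containing both) in which they share a cluster label."""
    rng = random.Random(seed)
    n = len(values)
    together = {pair: 0 for pair in combinations(range(n), 2)}
    seen = {pair: 0 for pair in together}
    for _ in range(n_boot):
        # Draw n sample indices with replacement and recluster the replicate.
        idx = [rng.randrange(n) for _ in range(n)]
        labels = cluster_fn([values[i] for i in idx])
        # Assumes duplicated samples receive the same label, which holds
        # for clusterings that depend only on a sample's value.
        label_of = dict(zip(idx, labels))
        for a, b in combinations(sorted(set(idx)), 2):
            seen[(a, b)] += 1
            together[(a, b)] += int(label_of[a] == label_of[b])
    return {pair: together[pair] / seen[pair] for pair in together if seen[pair]}


def toy_threshold_clustering(sample: List[float]) -> List[int]:
    """Hypothetical stand-in for a real clustering algorithm."""
    return [int(v > 1.0) for v in sample]


# Six hypothetical patients: three low and three high expressers.
expression = [0.1, 0.2, 0.15, 5.0, 5.2, 5.1]
stability = co_clustering_stability(expression, toy_threshold_clustering)
```

Pairs with stability near 1 are reliably grouped together across replicates, and pairs near 0 are reliably separated; intermediate values flag patients whose assignment depends on the particular sample drawn.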