383 research outputs found

    The hippocampal formation from a machine learning perspective

    Nowadays, sensor devices can generate huge amounts of data in short periods of time. In many situations, the data collected by different sensors reflects a specific phenomenon but is presented in very different types and formats. In these cases, it is hard to determine how these distinct types of data relate to each other or jointly indicate a certain condition. In this context, it is important to develop systems capable of analysing such data in the smallest amount of time possible to produce valid information from the information collected. The brain is a biological organ capable of exactly this kind of processing: it analyses the information delivered by the senses, which reflects specific phenomena. Inside the brain there is a structure called the Hippocampus, situated in the Temporal Lobe. Its main function is to analyse the sensorial data encoded by the Entorhinal Cortex in order to create new memories. Since the Hippocampus has evolved over a long period to perform these tasks, it is important to understand how it works and to model it, i.e. to define a set of computer algorithms that approximates its mechanisms. Since the removal of the Hippocampus from a patient suffering from seizures, the scientific community has believed that the Hippocampus is crucial for memory formation and spatial navigation: without it, it would not be possible to memorize places or events that happened at a specific time or place. Such functionality is achieved with the help of a set of cells called Grid Cells, present in the Entorhinal Cortex, together with Place Cells, Head Direction Cells and Boundary Vector Cells. The combined information analysed by these cells allows the unique identification of places or events.
The main objective of the work developed in this Thesis consists of describing the main biological mechanisms present in the Hippocampus and defining computer models that can simulate all, or at least the most critical, functions of both the Hippocampus and the Entorhinal Cortex.
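As an illustration of the kind of computational model pursued here, the periodic firing of a Grid Cell is often idealised as a sum of three plane waves whose directions are 60 degrees apart, producing a hexagonal firing pattern. The sketch below is this standard textbook model, not code from the Thesis; the spacing and orientation parameters are illustrative assumptions.

```python
import math

def grid_cell_rate(x, y, spacing=1.0, orientation=0.0):
    """Idealised grid-cell firing rate: a sum of three plane waves whose
    directions are 60 degrees apart (a common textbook model)."""
    k = 4 * math.pi / (math.sqrt(3) * spacing)  # wave number for the chosen field spacing
    total = 0.0
    for i in range(3):
        theta = orientation + i * math.pi / 3  # three wave axes, 60 degrees apart
        total += math.cos(k * (x * math.cos(theta) + y * math.sin(theta)))
    return total / 3.0  # normalised so a firing-field centre gives 1.0

# The rate peaks at the hexagonal lattice points, e.g. the origin:
print(grid_cell_rate(0.0, 0.0))  # → 1.0
```

Summing responses of several such cells with different spacings and orientations is what allows a population of Grid Cells to identify a location uniquely.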

    A treatment of stereochemistry in computer aided organic synthesis

    This thesis describes the author’s contributions to a new stereochemical processing module constructed for the ARChem retrosynthesis program. The purpose of the module is to add the ability to perform enantioselective and diastereoselective retrosynthetic disconnections and to generate appropriate precursor molecules. The module uses evidence-based rules generated from a large database of literature reactions. Chapter 1 provides an introduction and critical review of the published body of work on computer-aided synthesis design. The role of computer perception of key structural features (rings, functional groups, etc.) and the construction and use of reaction transforms for generating precursors is discussed. Emphasis is also given to the application of strategies in retrosynthetic analysis. The availability of large reaction databases has enabled a new generation of retrosynthesis design programs that use automatically generated transforms assembled from published reactions. A brief description of the transform generation method employed by ARChem is given. Chapter 2 describes the algorithms devised by the author for handling the computer recognition and representation of the stereochemical features found in molecule and reaction scheme diagrams. The approach is generalised and uses flexible recognition patterns to transform information found in chemical diagrams into concise stereo descriptors for computer processing. An algorithm for efficiently comparing and classifying pairs of stereo descriptors is described. This algorithm is central to solving the stereochemical constraints in a variety of substructure matching problems addressed in Chapter 3. The concise representation of reactions and transform rules as hyperstructure graphs is described. Chapter 3 is concerned with the efficient and reliable detection of stereochemical symmetry in molecules, reactions, and rules.
A novel symmetry perception algorithm, based on a constraint satisfaction problem (CSP) solver, is described. The use of a CSP solver to implement an isomorph-free matching algorithm for stereochemical substructure matching is detailed. The prime function of this algorithm is to seek out unique retron locations in target molecules and then to generate precursor molecules without duplications due to symmetry. Novel algorithms are described for classifying asymmetric, pseudo-asymmetric and symmetric stereocentres; meso, centro, and C2-symmetric molecules; and the stereotopicity of trigonal (sp2) centres. Chapter 4 introduces and formalises the annotated structural language used to create both retrosynthetic rules and the patterns used for functional group recognition. A novel functional group recognition package is described, along with its use to detect important electronic features such as electron-withdrawing or electron-donating groups and leaving groups. The functional groups and electronic features are used as constraints in retron rules to improve transform relevance. Chapter 5 details the approach taken to design detailed stereoselective and substrate-controlled transforms from organised hierarchies of rules. The rules employ a rich set of constraint annotations that concisely describe the keying retrons. The application of the transforms to collating evidence-based scoring parameters from published reaction examples is described. A survey of available reaction databases and techniques for mining stereoselective reactions is presented. A data mining tool was developed for finding the best reputable stereoselective reaction types for coding as transforms. For various reasons, it was not possible during the research period to fully integrate this work with the ARChem program. Instead, Chapter 6 introduces a novel one-step retrosynthesis module to test the developed transforms.
The retrosynthesis algorithms use the organisation of the transform rule hierarchy to efficiently locate the best retron matches using all applicable stereoselective transforms. This module was tested using a small set of selected target molecules, and the generated routes were ranked using a series of measured parameters, including: stereocentre clearance and bond cleavage; example reputation; estimated stereoselectivity with reliability; and evidence of tolerated functional groups. In addition, a method for detecting regioselectivity issues is presented. This work presents a number of algorithms using common set- and graph-theory operations and notations. Appendix A lists the set theory symbols and their meanings. Appendix B summarises and defines the common graph theory terminology used throughout this thesis.
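At its core, the symmetry perception task of Chapter 3 amounts to finding the automorphisms of a molecular graph. The toy sketch below brute-forces the automorphisms of a tiny ring graph to illustrate the underlying problem only; the module described above uses a CSP solver with constraint propagation and pruning, which this brute-force version does not attempt.

```python
from itertools import permutations

def automorphisms(adj):
    """Brute-force the automorphism group of a small undirected graph given as
    an adjacency set per node. A toy stand-in for CSP-based symmetry
    perception; real molecules need a proper CSP solver with pruning."""
    nodes = sorted(adj)
    autos = []
    for perm in permutations(nodes):
        mapping = dict(zip(nodes, perm))
        # Keep the permutation only if it maps every edge onto an edge.
        if all(mapping[v] in adj[mapping[u]] for u in nodes for v in adj[u]):
            autos.append(mapping)
    return autos

# A 4-membered ring has 8 automorphisms (the dihedral group of the square):
ring = {0: {1, 3}, 1: {0, 2}, 2: {1, 3}, 3: {0, 2}}
print(len(automorphisms(ring)))  # → 8
```

Each automorphism found corresponds to a symmetry-equivalent retron placement that a matcher must avoid generating twice.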

    Data Enrichment for Data Mining Applied to Bioinformatics and Cheminformatics Domains

    Increasingly complex problems are being addressed in the life sciences. Acquiring all the data that may be related to the problem in question is paramount. Equally important is knowing how the data relate to each other and to the problem itself. On the other hand, there are large amounts of data and information available on the Web. Researchers are already using Data Mining and Machine Learning as valuable tools in their research, although the usual procedure is to look for information based on induction models. So far, despite the great successes already achieved using Data Mining and Machine Learning, it is not easy to integrate this vast amount of available information into the inductive process with propositional algorithms. Our main motivation is to address the problem of integrating domain information into the inductive process of propositional Data Mining and Machine Learning techniques by enriching the training data to be used in inductive logic programming (ILP) systems. Propositional machine learning algorithms are very dependent on data attributes, and it is still hard to identify which attributes are most suitable for a particular task. It is also hard to extract relevant information from the enormous quantity of data available. We concentrate the available data and derive features that ILP algorithms can use to induce descriptions and solve the problems. We are creating a web platform to obtain information relevant to Bioinformatics (particularly Genomics) and Cheminformatics problems. It fetches data from public repositories of genomic, protein and chemical data. After the data enrichment, Prolog systems use inductive logic programming to induce rules and solve specific Bioinformatics and Cheminformatics case studies.
To assess the impact of the data enrichment on ILP, we compare with the results obtained by solving the same cases using propositional algorithms.
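The dependence of propositional learners on flat attributes can be made concrete by "propositionalising" relational background facts into an attribute table. The snippet below is a toy illustration of that contrast; the predicate and protein names are invented, and the platform described above feeds richer relational data to Prolog ILP systems rather than flattening it this way.

```python
def propositionalise(facts, entities, predicates):
    """Turn relational background facts into one flat 0/1 attribute vector per
    entity, the representation a propositional learner requires. A toy
    illustration of the enrichment idea, not the thesis's pipeline."""
    return {
        e: {p: int((p, e) in facts) for p in predicates}
        for e in entities
    }

# Hypothetical enriched facts about two proteins:
facts = {("has_domain_kinase", "p1"), ("binds_atp", "p1"), ("has_domain_kinase", "p2")}
rows = propositionalise(facts, ["p1", "p2"], ["has_domain_kinase", "binds_atp"])
print(rows["p2"])  # → {'has_domain_kinase': 1, 'binds_atp': 0}
```

An ILP system, by contrast, can consume the `facts` relation directly and induce rules that quantify over entities, which is why the enrichment targets relational representations.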

    On Computable Protein Functions

    Proteins are biological machines that perform the majority of functions necessary for life. Nature has evolved many different proteins, each of which performs a subset of an organism’s functional repertoire. One aim of biology is to solve the sparse, high-dimensional problem of annotating all proteins with their true functions. Experimental characterisation remains the gold standard for assigning function, but is a major bottleneck due to resource scarcity. In this thesis, we develop a variety of computational methods to predict protein function, reduce the functional search space for proteins, and guide the design of experimental studies. Our methods take two distinct approaches: protein-centric methods that predict the functions of a given protein, and function-centric methods that predict which proteins perform a given function. We applied our methods to help solve a number of open problems in biology. First, we identified new proteins involved in the progression of Alzheimer’s disease using proteomics data from the brains of a fly model of the disease. Second, we predicted novel plastic hydrolase enzymes in a large data set of 1.1 billion protein sequences from metagenomes. Finally, we optimised a neural network method that extracts a small number of informative features from protein networks, which we used to predict functions of fission yeast proteins.

    RUPEE: A Big Data Approach to Indexing and Searching Protein Structures

    Title from PDF of title page viewed July 7, 2021. Yugyung Lee. Vita. Includes bibliographical references (pages 149-158). Thesis (Ph.D.)--School of Computing and Engineering and Department of Mathematics and Statistics, University of Missouri--Kansas City, 2021.
Given the close relationship between protein structure and function, protein structure searches have long played an established role in bioinformatics. Despite their maturity, existing protein structure searches either compromise the quality of results to obtain faster response times or suffer from longer response times to provide better-quality results. Existing protein structure searches that focus on faster response times often use sequence clustering or depend on other simplifying assumptions not based on structure alone. In the case of sequence clustering, strong structure similarities are often hidden behind cluster representatives. Existing protein structure searches that focus on better-quality results often perform full pairwise structure alignments of the query structure against every available structure in the searched database, which can take as long as a full day to complete. The poor response times of these protein structure searches prevent the easy and efficient exploration of relationships between protein structures, which is the norm in other areas of inquiry. To address these trade-offs between faster response times and quality results, we have developed RUPEE, a fast and accurate purely geometric protein structure search combining a novel approach to encoding sequences of torsion angles with established techniques from information retrieval and big data. RUPEE can compare the query structure to every available structure in the searched database with fast response times. To accomplish this, we first introduce a new polar plot of torsion angles to help identify separable regions of torsion angles and derive a simple encoding of torsion angles based on the identified regions.
Then, we introduce Run Position Encoding, a heuristic for encoding sequences of torsion angles that increases the specificity of our encoding within regular secondary structures, alpha-helices and beta-strands. Once we have a linear encoding of protein structures based on their torsion angles, we use min-hashing and locality-sensitive hashing, established techniques from information retrieval and big data, to compare the query structure to every available structure in the searched database with fast response times. Moreover, because RUPEE is a purely geometric protein structure search, it does not depend on protein sequences, nor on other simplifying assumptions not based on structure alone. As such, RUPEE can be used effectively to search on protein structures with low sequence and structure similarity to known structures, such as the predicted structures that result from protein structure prediction algorithms. Comparing our results to the mTM-align, SSM, CATHEDRAL, and VAST protein structure searches, RUPEE has set a new bar for protein structure searches: it produces better-quality results than the best available protein structure searches and does so with the fastest response times.
Contents: Introduction -- Encoding Torsion Angles -- Indexing Protein Structures -- Searching Protein Structures -- Results and Evaluation -- Using RUPEE -- Conclusion -- Appendix A. Benchmarks of Known Protein Structures -- Appendix B. Benchmarks of Protein Structure Prediction.
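The min-hashing step can be sketched independently of RUPEE's encoding: each structure's linear code is broken into shingles, and the fraction of matching MinHash signature slots estimates the Jaccard similarity between the shingle sets, without comparing the sets directly. The snippet below is a generic MinHash illustration with invented three-letter codes, not RUPEE's actual encoding or index.

```python
import hashlib

def minhash_signature(shingles, num_hashes=32):
    """MinHash signature of a set of string shingles: for each of num_hashes
    seeded hash functions, keep the minimum hash value over the set."""
    sig = []
    for seed in range(num_hashes):
        best = min(
            int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
            for s in shingles
        )
        sig.append(best)
    return sig

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)

# Two similar hypothetical torsion-angle codes sharing 3 of 5 distinct shingles
# (true Jaccard 0.6); the MinHash estimate approximates this:
a = {"HHE", "HEE", "EEH", "EHH"}
b = {"HHE", "HEE", "EEH", "EHL"}
sim = estimated_jaccard(minhash_signature(a), minhash_signature(b))
```

Locality-sensitive hashing then bands these signatures so that only structures likely to be similar are ever compared in full, which is what makes whole-database searches fast.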

    Improving the resolution of interaction maps: A middleground between high-resolution complexes and genome-wide interactomes

    Protein-protein interactions are ubiquitous in Biology and therefore central to understanding living organisms. In recent years, large-scale studies have been undertaken to describe, at least partially, protein-protein interaction maps, or interactomes, for a number of relevant organisms, including human. Although the analysis of interaction networks is proving useful, current interactomes provide a blurry and coarse-grained picture of the molecular machinery: unless the structure of the protein complex is known, the molecular details of the interaction are missing, and sometimes it is not even possible to know whether the interaction between the proteins is direct (a physical interaction) or part of a functional, not necessarily direct, association. Unfortunately, the determination of the structure of protein complexes cannot keep pace with the discovery of new protein-protein interactions, resulting in a large, and increasing, gap between the number of complexes that are thought to exist and the number for which 3D structures are available. The aim of the thesis was to tackle this problem by implementing computational approaches to derive structural models of protein complexes and thus reduce this existing gap. Over the course of the thesis, a novel modelling algorithm to predict the structure of protein complexes, V-D2OCK, was implemented. This new algorithm combines structure-based prediction of protein binding sites, by means of novel algorithms developed over the course of the thesis (VORFFIP and M-VORFFIP), with data-driven docking and energy minimization. This algorithm was used to improve the coverage and structural content of the human interactome, compiled from different sources of interactomic data to ensure the most comprehensive interactome.
Finally, the human interactome and structural models were compiled in a database, V-D2OCK DB, which offers easy and user-friendly access to the human interactome, including a bespoke graphical molecular viewer to facilitate the analysis of the structural models of protein complexes. Furthermore, new organisms, in addition to human, were included, providing a useful resource for the study of all known interactomes.

    Data Mining

    The availability of big data due to computerization and automation has generated an urgent need for new techniques to analyze and convert big data into useful information and knowledge. Data mining is a promising and leading-edge technology for mining large volumes of data, looking for hidden information, and aiding knowledge discovery. It can be used for characterization, classification, discrimination, anomaly detection, association, clustering, trend or evolution prediction, and much more in fields such as science, medicine, economics, engineering, computing, and even business analytics. This book presents basic concepts, ideas, and research in data mining.

    High Performance Computing Techniques to Better Understand Protein Conformational Space

    This thesis presents an amalgamation of high-performance computing techniques to gain better insight into protein molecular dynamics. Key aspects of protein function and dynamics can be learned from a protein's conformational space. Datasets that represent the complex nuances of a protein molecule are high dimensional, so efficient dimensionality reduction becomes indispensable for their analysis. Dimensionality reduction forms a formidable portion of this work, and its application has been explored for other datasets as well. The work begins with the parallelization of a known non-linear feature reduction algorithm called Isomap. The code for the algorithm was re-written in C, with portions of it parallelized using OpenMP. Next, a novel data instance reduction method was devised that evaluates the information content offered by each data point, which ultimately allows the dataset to be truncated to far fewer data points to evaluate. Once a framework has been established to reduce the number of variables representing a dataset, the work is extended to explore algebraic topology techniques to extract meaningful information from these datasets. This step is the one that helps in sampling the conformations of interest of a protein molecule. The method employs hierarchical clustering to identify classes within a molecule; thereafter, algebraic topology is used to analyze these classes. Finally, the work concludes by presenting an approach to the open problem of protein folding: a Monte Carlo-based tree search algorithm is put forth to simulate the pathway that a certain protein conformation undertakes to reach another conformation. The dissertation, in its entirety, offers solutions to a few problems that hinder progress on the vast problem of understanding protein dynamics. The motion of a protein molecule is guided by changes in its energy profile.
In the course of this motion, the molecule gradually slips from one energy class to another. Structurally, this switch is transient, spanning milliseconds or less, and hence is difficult to capture solely by work in wet laboratories.
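The Isomap stage that dominates the parallelization effort can be sketched in a few lines: connect each point to its k nearest neighbours, then compute shortest paths over that graph so that distances follow the data manifold rather than cutting across it. The plain-Python toy below illustrates the idea only; the implementation described above is in C with OpenMP and handles far larger distance matrices.

```python
def geodesic_distances(dist, k=2):
    """Isomap's first two steps on a precomputed distance matrix: keep each
    point's k nearest neighbours, then run Floyd-Warshall to obtain graph
    geodesics. A plain-Python sketch of the idea, not the C/OpenMP code."""
    n = len(dist)
    INF = float("inf")
    g = [[INF] * n for _ in range(n)]
    for i in range(n):
        g[i][i] = 0.0
        # Skip index 0 of the sort order: that is the point itself (distance 0).
        for j in sorted(range(n), key=lambda j: dist[i][j])[1 : k + 1]:
            g[i][j] = g[j][i] = dist[i][j]  # symmetrise the kNN graph
    for m in range(n):  # Floyd-Warshall all-pairs shortest paths
        for i in range(n):
            for j in range(n):
                if g[i][m] + g[m][j] < g[i][j]:
                    g[i][j] = g[i][m] + g[m][j]
    return g

# Four collinear points: the geodesic from the first to the last must walk
# the chain of neighbour edges.
pts = [0.0, 1.0, 2.0, 3.0]
dist = [[abs(p - q) for q in pts] for p in pts]
g = geodesic_distances(dist, k=2)
print(g[0][3])  # → 3.0
```

The Floyd-Warshall triple loop is the O(n³) hotspot, which is why parallelizing it with OpenMP pays off on large conformational datasets.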

    UOW Research Report 1994
