5,050 research outputs found

    Comparative Microbial Modules Resource: Generation and Visualization of Multi-species Biclusters

    The increasing abundance of large-scale, high-throughput datasets for many closely related organisms provides opportunities for comparative analysis via the simultaneous biclustering of datasets from multiple species. These analyses require a reformulation of how to organize multi-species datasets and visualize the results of comparative genomics analyses. Recently, we developed a method, multi-species cMonkey, which integrates heterogeneous high-throughput data types from multiple species to identify conserved regulatory modules. Here we present an integrated data visualization system, built upon the Gaggle, enabling exploration of our method's results (available at http://meatwad.bio.nyu.edu/cmmr.html). The system can also be used to explore other comparative genomics datasets and outputs from other data analysis procedures, such as results from other multiple-species clustering programs or from independent clustering of different single-species datasets. We provide an example use of our system for two bacteria, Escherichia coli and Salmonella Typhimurium. We illustrate the use of our system by exploring conserved biclusters involved in nitrogen metabolism, uncovering a putative function for yjjI, a currently uncharacterized gene that we predict to be involved in nitrogen assimilation.

    Graph Theory and Networks in Biology

    In this paper, we present a survey of the use of graph theoretical techniques in Biology. In particular, we discuss recent work on identifying and modelling the structure of bio-molecular networks, as well as the application of centrality measures to interaction networks and research on the hierarchical structure of such networks and network motifs. Work on the link between structural network properties and dynamics is also described, with emphasis on synchronization and disease propagation. Comment: 52 pages, 5 figures, Survey Paper

    Improving Graph Neural Network Expressivity via Subgraph Isomorphism Counting

    While Graph Neural Networks (GNNs) have achieved remarkable results in a variety of applications, recent studies exposed important shortcomings in their ability to capture the structure of the underlying graph. It has been shown that the expressive power of standard GNNs is bounded by the Weisfeiler-Leman (WL) graph isomorphism test, from which they inherit proven limitations such as the inability to detect and count graph substructures. On the other hand, there is significant empirical evidence, e.g. in network science and bioinformatics, that substructures are often intimately related to downstream tasks. To this end, we propose "Graph Substructure Networks" (GSN), a topologically-aware message passing scheme based on substructure encoding. We theoretically analyse the expressive power of our architecture, showing that it is strictly more expressive than the WL test, and provide sufficient conditions for universality. Importantly, we do not attempt to adhere to the WL hierarchy; this allows us to retain multiple attractive properties of standard GNNs such as locality and linear network complexity, while being able to disambiguate even hard instances of graph isomorphism. We perform an extensive experimental evaluation on graph classification and regression tasks and obtain state-of-the-art results in diverse real-world settings including molecular graphs and social networks. The code is publicly available at https://github.com/gbouritsas/graph-substructure-networks
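
The limitation described in this abstract can be illustrated with a small sketch. It is not the GSN implementation from the linked repository; the function names (`from_edges`, `wl_histogram`, `triangle_counts`) are hypothetical and chosen for this example. It assumes undirected graphs stored as adjacency sets, runs plain 1-WL colour refinement, and computes per-node triangle counts, the kind of substructure count that could be attached to nodes as an extra structural feature.

```python
from collections import Counter
from itertools import combinations


def from_edges(edges):
    """Build an undirected adjacency-set representation from an edge list."""
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    return adj


def wl_histogram(adj, rounds=3):
    """1-WL colour refinement; returns the multiset of final node colours.

    Colours are nested tuples (own colour, sorted neighbour colours), so the
    histograms of two different graphs can be compared directly.
    """
    colors = {v: () for v in adj}
    for _ in range(rounds):
        colors = {
            v: (colors[v], tuple(sorted(colors[u] for u in adj[v])))
            for v in adj
        }
    return Counter(colors.values())


def triangle_counts(adj):
    """For every node, count the triangles that contain it."""
    counts = {v: 0 for v in adj}
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            if b in adj[a]:  # a and b are both neighbours of v and adjacent
                counts[v] += 1
    return counts


# Two 2-regular graphs on six nodes: one 6-cycle and two disjoint triangles.
hexagon = from_edges([(i, (i + 1) % 6) for i in range(6)])
two_triangles = from_edges([(0, 1), (1, 2), (2, 0), (3, 4), (4, 5), (5, 3)])

same_under_wl = wl_histogram(hexagon) == wl_histogram(two_triangles)
hexagon_triangles = triangle_counts(hexagon)
two_triangles_triangles = triangle_counts(two_triangles)
```

Both graphs are regular with the same degree and size, so colour refinement gives every node the same colour in each of them, while the triangle counts differ on every node. That difference is the extra signal a substructure-aware scheme has available.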

    Simulation Intelligence: Towards a New Generation of Scientific Methods

    The original "Seven Motifs" set forth a roadmap of essential methods for the field of scientific computing, where a motif is an algorithmic method that captures a pattern of computation and data movement. We present the "Nine Motifs of Simulation Intelligence", a roadmap for the development and integration of the essential algorithms necessary for a merger of scientific computing, scientific simulation, and artificial intelligence. We call this merger simulation intelligence (SI), for short. We argue the motifs of simulation intelligence are interconnected and interdependent, much like the components within the layers of an operating system. Using this metaphor, we explore the nature of each layer of the simulation intelligence operating system stack (SI-stack) and the motifs therein: (1) Multi-physics and multi-scale modeling; (2) Surrogate modeling and emulation; (3) Simulation-based inference; (4) Causal modeling and inference; (5) Agent-based modeling; (6) Probabilistic programming; (7) Differentiable programming; (8) Open-ended optimization; (9) Machine programming. We believe coordinated efforts between motifs offer immense opportunity to accelerate scientific discovery, from solving inverse problems in synthetic biology and climate science, to directing nuclear energy experiments and predicting emergent behavior in socioeconomic settings. We elaborate on each layer of the SI-stack, detailing the state-of-the-art methods, presenting examples to highlight challenges and opportunities, and advocating for specific ways to advance the motifs and the synergies from their combinations. Advancing and integrating these technologies can enable a robust and efficient hypothesis-simulation-analysis type of scientific method, which we introduce with several use-cases for human-machine teaming and automated science.

    Deep Learning-Based Simultaneous Modeling of the Hierarchical Structure of GPCR Protein Families in a Single Metric Space

    Thesis (Master's) -- Seoul National University Graduate School: College of Engineering, Department of Computer Science and Engineering, 2019. 8. 김선. G protein-coupled receptors (GPCRs) belong to diverse families of proteins that can be defined at multiple levels. Computational modeling of GPCR families from their sequences has been performed separately at each level of family, sub-family, and sub-subfamily. However, these approaches ignore relationships between classes because they process the information in the sequences with a group of disconnected models. In this work, we propose a deep learning network that simultaneously learns representations in the GPCR hierarchy with a unified model, together with a loss term that expresses hierarchical relations as distances in a single embedding space. The model introduces a method to learn and construct shared representations across the hierarchies of the protein family. In extensive experiments, we showed that hierarchical relations between sequences are successfully captured by our model in both technical and biological aspects. First, we showed that phylogenetic information in the sequences can be inferred from the embedding vectors by constructing a phylogenetic tree with a hierarchical clustering algorithm and by quantitatively comparing the clustering results with the real label information. Second, inspection of the embedding vectors is demonstrated to be an effective first step toward an analysis of GPCR proteins: biologically significant sequence features can be revealed by applying multiple sequence alignment to the clustering results on the embedding vectors. Our work showed that simultaneous modeling of protein families with multiple hierarchies is possible.
    Contents: Abstract. Chapter I. Introduction: 1.1 Background; 1.2 Motivation. Chapter II. Methods: 2.1 Data Preparation (2.1.1 Dataset; 2.1.2 Data representation); 2.2 Model architecture (2.2.1 Feature extractor with CNN; 2.2.2 Embedding layer; 2.2.3 Output layer); 2.3 Loss function (2.3.1 Softmax loss; 2.3.2 Center loss; 2.3.3 Overall loss); 2.4 Training procedure; 2.5 Evaluation metric (2.5.1 Silhouette score; 2.5.2 Adjusted mutual information score). Chapter III. Results: 3.1 Evaluation on hierarchical structure (3.1.1 Preservation of distances; 3.1.2 Phylogenetic tree reconstruction; 3.1.3 Quantitative evaluation on clustering results); 3.2 Sequence analysis with embedding vectors (3.2.1 Technical analysis; 3.2.2 Biological analysis); 3.3 Classification accuracy. Chapter IV. Conclusion. References.
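
The idea of expressing hierarchical relations as distances in one embedding space can be sketched with a center-style loss applied at several label levels. The thesis's actual loss is not given in this listing, so everything below is an assumption: the commonly used center-loss form (half the summed squared distance of each embedding to its class center), centers taken as plain class means rather than learned parameters, and a hypothetical weighted sum over hierarchy levels. The names `center_loss` and `hierarchical_center_loss` are illustrative only.

```python
def center_loss(embeddings, labels):
    """Half the summed squared distance of each embedding to its class mean.

    Assumption: class centers are the per-class means of the given embeddings;
    in training they are usually learned and updated per mini-batch.
    """
    groups = {}
    for x, y in zip(embeddings, labels):
        groups.setdefault(y, []).append(x)
    centers = {
        y: [sum(col) / len(xs) for col in zip(*xs)]
        for y, xs in groups.items()
    }
    return 0.5 * sum(
        sum((xi - ci) ** 2 for xi, ci in zip(x, centers[y]))
        for x, y in zip(embeddings, labels)
    )


def hierarchical_center_loss(embeddings, labels_per_level, weights):
    """Hypothetical combination: weighted sum of center losses, one per level."""
    return sum(
        w * center_loss(embeddings, labels)
        for labels, w in zip(labels_per_level, weights)
    )


# Toy data: four 2-D embeddings, two families, four sub-families.
embeddings = [(0.0, 0.0), (0.0, 2.0), (10.0, 0.0), (10.0, 2.0)]
family = ["A", "A", "B", "B"]
subfamily = ["A1", "A2", "B1", "B2"]

family_loss = center_loss(embeddings, family)
subfamily_loss = center_loss(embeddings, subfamily)
total = hierarchical_center_loss(embeddings, [family, subfamily], [1.0, 0.5])
```

In this toy case the family-level term pulls members of a family toward a shared center, while the sub-family term is zero because every sub-family has a single member. Summing the terms lets coarse and fine groupings shape the same space.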

    Temporal Changes in Local Topology of an Email-Based Social Network

    The dynamics of complex social networks has become a research area of growing importance. Knowledge about temporal changes in network topology and characteristics is crucial in networked communication systems in which accurate predictions are important. The local network topology can be described by means of network motifs, which are small subgraphs usually containing from 3 to 7 nodes. They were shown to be useful for creating profiles that reveal several properties of the network. In this paper, the time-varying characteristics of social networks, such as the number of nodes and edges as well as clustering coefficients and different centrality measures, are investigated. At the same time, the analysis of three-node motifs (triads) was used to track the temporal changes in the structure of a large social network derived from e-mail communication between university employees. We have shown that temporal changes in local connection patterns of the social network are indeed correlated with changes in the clustering coefficient and in the values of various centrality measures, and that they are detectable by means of motif analysis. Together with robust sampling, network motifs can provide an appealing way to monitor and assess temporal changes in large social networks.
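
The link between triad counts and the clustering coefficient can be made concrete with a short sketch. It is a simplification, not the paper's procedure: it assumes undirected snapshots given as edge lists (e-mail data is naturally directed, and the full triad census distinguishes more types), and the function name `triad_profile` and the toy snapshots are hypothetical.

```python
from itertools import combinations


def triad_profile(edges):
    """Count open and closed three-node motifs in one undirected snapshot.

    Returns the number of open triads (paths u-v-w with u and w not linked),
    closed triads (triangles), and the global clustering coefficient
    (transitivity) = 3 * triangles / connected triples.
    """
    adj = {}
    for u, v in edges:
        if u == v:
            continue  # ignore self-loops
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)

    triples = 0         # neighbour pairs centred at some node
    closed_triples = 0  # of those, pairs that are themselves linked
    for v, nbrs in adj.items():
        for a, b in combinations(nbrs, 2):
            triples += 1
            if b in adj[a]:
                closed_triples += 1

    return {
        "open": triples - closed_triples,
        "closed": closed_triples // 3,  # each triangle is seen from 3 centres
        "transitivity": closed_triples / triples if triples else 0.0,
    }


# Two hypothetical snapshots of the same network at consecutive time windows.
snapshot_1 = [("a", "b"), ("b", "c"), ("c", "d")]
snapshot_2 = snapshot_1 + [("a", "c")]

profiles = [triad_profile(s) for s in (snapshot_1, snapshot_2)]
```

Adding the single edge a-c closes one triad, which moves the transitivity from 0 to 0.6. A change in local motif counts between windows therefore shows up directly in the clustering coefficient.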

    The Eighth Central European Conference "Chemistry towards Biology": snapshot

    The Eighth Central European Conference "Chemistry towards Biology" was held in Brno, Czech Republic, on 28 August – 1 September 2016 to bring together experts in biology, chemistry and design of bioactive compounds; promote the exchange of scientific results, methods and ideas; and encourage cooperation between researchers from all over the world. The topics of the conference covered "Chemistry towards Biology", meaning that the event welcomed chemists working on biology-related problems, biologists using chemical methods, and students and other researchers of the respective areas that fall within the common scope of chemistry and biology. The authors of this manuscript are plenary speakers and other participants of the symposium and members of their research teams. The following summary highlights the major points/topics of the meeting.

    Remote Homology Detection of Protein Sequences

    The classification of protein sequences using string kernels provides valuable insights for protein function prediction. Almost all string kernels are based on patterns that are not independent, and therefore the associated scores are obtained using a set of redundant features. In this talk we will discuss how a class of patterns, called Irredundant, is specifically designed to address this issue. Loosely speaking, the set of Irredundant patterns is the smallest class of independent patterns that can describe all patterns in a string. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that Irredundant Class outperforms most of the string kernel methods previously proposed, and that it achieves results as good as the current state-of-the-art methods with fewer patterns. Unfortunately, we show that the information carried by the irredundant patterns cannot be easily interpreted, so alternative notions are needed.
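
For readers unfamiliar with string kernels, here is a minimal sketch of one standard example, the k-spectrum kernel, which scores two sequences by the k-mers they share. It is not the Irredundant Class method from the abstract; it only illustrates the kind of overlapping, non-independent pattern features the abstract refers to. The function names and toy strings are chosen for this example.

```python
from collections import Counter
from math import sqrt


def spectrum(seq, k):
    """Count every length-k substring (k-mer) of seq, overlaps included."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(s, t, k=3):
    """Inner product of the two k-mer count vectors."""
    fs, ft = spectrum(s, k), spectrum(t, k)
    return sum(count * ft[kmer] for kmer, count in fs.items())


def normalized_kernel(s, t, k=3):
    """Cosine-normalised kernel, so that a sequence scores 1 against itself."""
    denom = sqrt(spectrum_kernel(s, s, k) * spectrum_kernel(t, t, k))
    return spectrum_kernel(s, t, k) / denom if denom else 0.0


features = spectrum("ABABA", 2)                # AB and BA each occur twice
raw_score = spectrum_kernel("ABABA", "ABAB", 2)
self_score = normalized_kernel("ABABA", "ABABA", 2)
```

Because consecutive k-mers overlap in k-1 characters, seeing one k-mer constrains which k-mers can follow it, so the features are redundant by construction. That redundancy is the issue the Irredundant patterns are described as addressing.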

    A knowledge discovery object model API for Java

    BACKGROUND: Biological data resources have become heterogeneous and derive from multiple sources. This introduces challenges in the management and utilization of this data in software development. Although efforts are underway to create a standard format for the transmission and storage of biological data, this objective has yet to be fully realized. RESULTS: This work describes an application programming interface (API) that provides a framework for developing an effective biological knowledge ontology for Java-based software projects. The API provides a robust framework for the data acquisition and management needs of an ontology implementation. In addition, the API contains classes to assist in creating GUIs to represent this data visually. CONCLUSIONS: The Knowledge Discovery Object Model (KDOM) API is particularly useful for medium to large applications, or for a number of smaller software projects with common characteristics or objectives. KDOM can be coupled effectively with other biologically relevant APIs and classes. Source code, libraries, documentation and examples are available at