1,497 research outputs found

    On-Chip Living-Cell Microarrays for Network Biology

    Computing Similarity between a Pair of Trajectories

    With recent advances in sensing and tracking technology, trajectory data is becoming increasingly pervasive and analysis of trajectory data is becoming exceedingly important. A fundamental problem in analyzing trajectory data is that of identifying common patterns between pairs or among groups of trajectories. In this paper, we consider the problem of identifying similar portions between a pair of trajectories, each observed as a sequence of points sampled from it. We present new measures of trajectory similarity, both local and global, between a pair of trajectories to distinguish between similar and dissimilar portions. Our model is robust under noise and outliers, it does not make any assumptions on the sampling rates of either trajectory, and it works even if they are partially observed. Additionally, the model also yields a scalar similarity score which can be used to rank multiple pairs of trajectories according to similarity, e.g. in clustering applications. We also present efficient algorithms for computing the similarity under our measures; the worst-case running time is quadratic in the number of sample points. Finally, we present an extensive experimental study evaluating the effectiveness of our approach on real datasets, comparing it with earlier approaches, and illustrating many issues that arise in trajectory data. Our experiments show that our approach is highly accurate in distinguishing similar and dissimilar portions as compared to earlier methods, even with sparse sampling.

    The AFLOW Fleet for Materials Discovery

    The traditional paradigm for materials discovery has been recently expanded to incorporate substantial data driven research. With the intent to accelerate the development and the deployment of new technologies, the AFLOW Fleet for computational materials design automates high-throughput first principles calculations, and provides tools for data verification and dissemination for a broad community of users. AFLOW incorporates different computational modules to robustly determine thermodynamic stability, electronic band structures, vibrational dispersions, thermo-mechanical properties and more. The AFLOW data repository is publicly accessible online at aflow.org, with more than 1.7 million materials entries and a panoply of queryable computed properties. Tools to programmatically search and process the data, as well as to perform online machine learning predictions, are also available. Comment: 14 pages, 8 figures

    A Comprehensive Survey on Graph Summarization with Graph Neural Networks

    As large-scale graphs become more widespread, more and more computational challenges with extracting, processing, and interpreting large graph data are being exposed. It is therefore natural to search for ways to summarize these expansive graphs while preserving their key characteristics. In the past, most graph summarization techniques sought to capture the most important part of a graph statistically. However, today, the high dimensionality and complexity of modern graph data are making deep learning techniques more popular. Hence, this paper presents a comprehensive survey of progress in deep learning summarization techniques that rely on graph neural networks (GNNs). Our investigation includes a review of the current state-of-the-art approaches, including recurrent GNNs, convolutional GNNs, graph autoencoders, and graph attention networks. A new burgeoning line of research is also discussed where graph reinforcement learning is being used to evaluate and improve the quality of graph summaries. Additionally, the survey provides details of benchmark datasets, evaluation metrics, and open-source tools that are often employed in experimentation settings, along with a discussion on the practical uses of graph summarization in different fields. Finally, the survey concludes with a number of open research challenges to motivate further study in this area. Comment: 20 pages, 4 figures, 3 tables, Journal of IEEE Transactions on Artificial Intelligence

    Integrating Fuzzy Decisioning Models With Relational Database Constructs

    Human learning and classification is a nebulous area in computer science. Classic decisioning problems can be solved given enough time and computational power, but discrete algorithms cannot easily solve fuzzy problems. Fuzzy decisioning can resolve more real-world fuzzy problems, but existing algorithms are often slow, cumbersome and unable to give responses within a reasonable timeframe to anything other than predetermined, small dataset problems. We have developed a database-integrated highly scalable solution to training and using fuzzy decision models on large datasets. The Fuzzy Decision Tree algorithm is the integration of the Quinlan ID3 decision-tree algorithm together with fuzzy set theory and fuzzy logic. In existing research, when applied to the microRNA prediction problem, Fuzzy Decision Tree outperformed other machine learning algorithms including Random Forest, C4.5, SVM and Knn. In this research, we propose that the effectiveness with which large dataset fuzzy decisions can be resolved via the Fuzzy Decision Tree algorithm is significantly improved when using a relational database as the storage unit for the fuzzy ID3 objects, versus traditional storage objects. Furthermore, it is demonstrated that pre-processing certain pieces of the decisioning within the database layer can lead to much swifter membership determinations, especially on Big Data datasets. The proposed algorithm uses the concepts inherent to databases: separated schemas, indexing, partitioning, pipe-and-filter transformations, preprocessing data, materialized and regular views, etc., to present a model with a potential to learn from itself. 
Further, this work presents a general application model to re-architect Big Data applications in order to efficiently present decisioned results: lowering the volume of data being handled by the application itself, and significantly decreasing response wait times while allowing the flexibility and permanence of a standard relational SQL database, supplying optimal user satisfaction in today's Data Analytics world. We experimentally demonstrate the effectiveness of our approach.

    Large-Scale Analysis of Protein-Ligand Binding Sites using the Binding MOAD Database.

    Current structure-based drug design (SBDD) methods require understanding of general trends of protein-ligand interactions. Informative descriptors of ligand-binding sites provide powerful heuristics to improve SBDD methods designed to infer function from protein structure. These descriptors must have a solid statistical foundation for assessing general trends in large sets of protein-ligand complexes. This dissertation focuses on mining the Binding MOAD database of highly curated protein-ligand complexes to determine frequently observed patterns of binding-site composition. An extension to Binding MOAD’s framework is developed to store structural details of binding sites and facilitate large-scale analysis. This thesis uses the framework to address three topics. It first describes a strategy for determining over-representation of amino acids within ligand-binding sites, comparing the trends of residue propensity for binding sites of biologically relevant ligands to those of spurious molecules with no known function. To determine the significance of these trends and to provide guidelines for residue-propensity studies, the effect of the data set size on the variation in propensity values is evaluated. Next, binding-site residue propensities are applied to improve the performance of a geometry-based, binding-site prediction algorithm. Propensity-based scores are found to perform comparably to the native score in successfully ranking correct predictions. For large proteins, propensity-based and consensus scores improve the scoring success. Finally, current protein-ligand scoring functions are evaluated using a new criterion: the ability to discern biologically relevant ligands from “opportunistic binders,” molecules present in crystal structures due to their high concentrations in the crystallization medium. Four different scoring functions are evaluated against a diverse benchmark set.
All are found to perform well for ranking biologically relevant sites over spurious ones, and all performed best when penalties for torsional strain of ligands were included. The final chapter describes a structural alignment method, termed HwRMSD, which can align proteins of very low sequence homology based on their structural similarity using a weighted structure superposition. The overall aims of the dissertation are to collect high-quality binding-site composition data within the largest available set of protein-ligand complexes and to evaluate the appropriate applications of this data to emerging methods for computational proteomics. Ph.D., Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/91400/1/nickolay_1.pd

    A Data Mining Approach To Gen Dynamic Behavioral Process

    The texts presented in this special section highlight certain pitfalls related to the establishment and management of protected areas. Identifying the territory, complying with international agreements on biological diversity, and above all the acceptance by local communities of a conservation area in their living environment appear to be factors that slow down biodiversity protection processes. Social acceptability..

    Development and Applications of Similarity Measures for Spatial-Temporal Event and Setting Sequences

    Similarity or distance measures between data objects are applied frequently in many fields or domains such as geography, environmental science, biology, economics, computer science, linguistics, logic, business analytics, and statistics, among others. One area where similarity measures are particularly important is in the analysis of spatiotemporal event sequences and associated environs or settings. This dissertation focuses on developing a framework of modeling, representation, and new similarity measure construction for sequences of spatiotemporal events and corresponding settings, which can be applied to different event data types and used in different areas of data science. The first core part of this dissertation presents a matrix-based spatiotemporal event sequence representation that unifies punctual and interval-based representation of events. This framework supports different event data types and provides support for data mining and sequence classification and clustering. The similarity measure is based on the modified Jaccard index with temporal order constraints and accommodates different event data types. This approach is demonstrated through simulated data examples and the performance of the similarity measures is evaluated with a k-nearest neighbor algorithm (k-NN) classification test on synthetic datasets. These similarity measures are incorporated into a clustering method and successfully demonstrate the usefulness in a case study analysis of event sequences extracted from space time series of a water quality monitoring system. This dissertation further proposes a new similarity measure for event setting sequences, which involve the space and time in which events occur. While similarity measures for spatiotemporal event sequences have been studied, the settings and setting sequences have not yet been considered. 
While modeling event setting sequences, spatial and temporal scales are considered to define the bounds of the setting and incorporate dynamic variables along with static variables. Using a matrix-based representation and an extended Jaccard index, new similarity measures are developed to allow for the use of all variable data types. With these similarity measures coupled with other multivariate statistical analysis approaches, results from a case study involving setting sequences and pollution event sequences associated with the same monitoring stations support the hypothesis that more similar spatial-temporal settings or setting sequences may generate more similar events or event sequences. To test the scalability of the STES similarity measure on a larger dataset and in an extended application in a different field, this dissertation compares and contrasts the prospective space-time scan statistic with the STES similarity approach for identifying COVID-19 hotspots. The COVID-19 pandemic has highlighted the importance of detecting hotspots or clusters of COVID-19 to provide decision makers at various levels with better information for managing distribution of human and technical resources as the outbreak in the USA continues to grow. The prospective space-time scan statistic has been used to help identify emerging disease clusters, yet results from this approach can encounter strategic limitations imposed by the spatial constraints of the scanning window. The STES-based approach adapted for this pandemic context computes the similarity of evolving normalized COVID-19 daily cases by county and clusters these to identify counties with similarly evolving COVID-19 case histories. This dissertation analyzes the spread of COVID-19 within the continental US through four periods beginning from late January 2020 using the COVID-19 datasets maintained by the Johns Hopkins University Center for Systems Science and Engineering (CSSE).
Results of the two approaches can complement each other and, taken together, can aid in tracking the progression of the pandemic. Overall, the dissertation highlights the importance of developing similarity measures for analyzing spatiotemporal event sequences and associated settings, which can be applied to different event data types and used for data mining, sequence classification, and clustering.
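
As a point of reference for the Jaccard-based measures mentioned in this abstract, the sketch below computes the plain Jaccard index between two event sequences treated as sets of event labels. It is only an illustration: the dissertation's modified index with temporal order constraints is not specified in the abstract and is not reproduced here, and the function name `jaccard_index` and the example labels are hypothetical.

```python
def jaccard_index(a, b):
    """Plain Jaccard index |A ∩ B| / |A ∪ B| of two collections, treated as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    if not union:
        # Convention assumed here: two empty sets count as identical.
        return 1.0
    return len(set_a & set_b) / len(union)


# Hypothetical event labels, for illustration only.
seq_1 = ["algal_bloom", "low_oxygen", "high_turbidity"]
seq_2 = ["low_oxygen", "high_turbidity", "high_nitrate"]

similarity = jaccard_index(seq_1, seq_2)
print(similarity)  # 2 shared labels out of 4 distinct labels -> 0.5
```

Because the plain index ignores ordering, two sequences with the same events in a different order score 1.0; an order-aware variant such as the one the abstract describes would have to restrict which matches count.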

    Data integration, pathway analysis and mining for systems biology

    Post-genomic molecular biology embodies high-throughput experimental techniques and hence is a data-rich field. The goal of this thesis is to develop bioinformatics methods to utilise publicly available data in order to produce knowledge and to aid mining of newly generated data. As an example of knowledge or hypothesis generation, consider function prediction of biological molecules. Assignment of protein function is a non-trivial task owing to the fact that the same protein may be involved in different biological processes, depending on the state of the biological system and protein localisation. The function of a gene or a gene product may be provided as a textual description in a gene or protein annotation database. Such textual descriptions fall short of providing the contextual meaning of the gene function. Therefore, we need ways to represent the meaning in a formal way. Here we apply a data integration approach to provide a rich representation that enables context-sensitive mining of biological data in terms of integrated networks and conceptual spaces. Context-sensitive gene function annotation follows naturally from this framework, as a particular application. Next, knowledge that is already publicly available can be used to aid mining of new experimental data. We developed an integrative bioinformatics method that utilises publicly available knowledge of protein-protein interactions, metabolic networks and transcriptional regulatory networks to analyse transcriptomics data and predict altered biological processes. We applied this method to a study of the dynamic response of Saccharomyces cerevisiae to oxidative stress. The application of our method revealed dynamically altered biological functions in response to oxidative stress, which were validated by comprehensive in vivo metabolomics experiments. The results provided in this thesis indicate that integration of heterogeneous biological data facilitates advanced mining of the data.
The methods can be applied for gaining insight into functions of genes, gene products and other molecules, as well as for offering functional interpretation to transcriptomics and metabolomics experiments.

    Multi-field Visualisation via Trait-induced Merge Trees

    In this work, we propose trait-based merge trees, a generalization of merge trees to feature level sets, targeting the analysis of tensor fields or general multi-variate data. For this, we employ the notion of traits defined in attribute space as introduced in the feature level sets framework. The resulting distance field in attribute space induces a scalar field in the spatial domain that serves as input for topological data analysis. The leaves in the merge tree represent those areas in the input data that are closest to the defined trait and thus most closely resemble the defined feature. Hence, the merge tree yields a hierarchy of features that allows for querying the most relevant and persistent features. The presented method includes different query methods for the tree which enable the highlighting of different aspects. We demonstrate the cross-application capabilities of this approach with three case studies from different domains.
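
The step in which a trait in attribute space induces a scalar field over the spatial domain can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the trait is modeled as a finite set of points in attribute space, distance is Euclidean, and the names `trait_distance_field`, `attributes`, and `trait` are hypothetical. The merge tree computation itself is not shown.

```python
import math


def trait_distance_field(attributes, trait_points):
    """For each spatial sample, return the Euclidean distance in attribute
    space from its attribute vector to the nearest trait point."""
    return [
        min(math.dist(attr, t) for t in trait_points)
        for attr in attributes
    ]


# Hypothetical data: one 2-D attribute vector per spatial sample.
attributes = [(0.0, 0.0), (3.0, 4.0), (1.0, 1.0)]
# Assumed trait: a single point in attribute space.
trait = [(0.0, 0.0)]

field = trait_distance_field(attributes, trait)
print(field)
```

Samples with small values in `field` are those closest to the trait, which matches the abstract's description of merge tree leaves as the areas that most closely resemble the defined feature.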