Computing Similarity between a Pair of Trajectories
With recent advances in sensing and tracking technology, trajectory data is becoming increasingly pervasive, and its analysis increasingly important. A fundamental problem in analyzing trajectory data is
that of identifying common patterns between pairs or among groups of
trajectories. In this paper, we consider the problem of identifying similar
portions between a pair of trajectories, each observed as a sequence of points
sampled from it.
We present new measures of trajectory similarity --- both local and global
--- between a pair of trajectories to distinguish between similar and
dissimilar portions. Our model is robust to noise and outliers, makes no assumptions about the sampling rate of either trajectory, and works even if the trajectories are only partially observed. Additionally, the model yields a
scalar similarity score which can be used to rank multiple pairs of
trajectories according to similarity, e.g. in clustering applications. We also
present efficient algorithms for computing the similarity under our measures;
the worst-case running time is quadratic in the number of sample points.
Finally, we present an extensive experimental study evaluating the effectiveness of our approach on real datasets, comparing it with earlier approaches, and illustrating many issues that arise in trajectory data. Our experiments show that our approach is highly accurate in distinguishing similar and dissimilar portions compared to earlier methods, even with sparse sampling.
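As a point of reference for the quadratic worst case mentioned above, the classical discrete Fréchet distance between two sampled curves is computed by an O(nm) dynamic program. The sketch below illustrates that regime only; it is not the paper's own local/global measures.

```python
import math

def discrete_frechet(P, Q):
    """Discrete Frechet distance between point sequences P and Q.

    Classic O(len(P) * len(Q)) dynamic program (Eiter & Mannila),
    shown here only to illustrate the quadratic-time regime of
    trajectory-similarity computation, not the paper's own measures.
    """
    n, m = len(P), len(Q)
    d = lambda i, j: math.dist(P[i], Q[j])  # Euclidean point distance
    # ca[i][j] = coupling distance for prefixes P[:i+1], Q[:j+1]
    ca = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            if i == 0 and j == 0:
                ca[i][j] = d(0, 0)
            elif i == 0:
                ca[i][j] = max(ca[0][j - 1], d(0, j))
            elif j == 0:
                ca[i][j] = max(ca[i - 1][0], d(i, 0))
            else:
                ca[i][j] = max(
                    min(ca[i - 1][j], ca[i - 1][j - 1], ca[i][j - 1]),
                    d(i, j))
    return ca[n - 1][m - 1]
```

For two parallel polylines one unit apart, the coupling never has to stretch, so the distance is exactly 1.0.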
The AFLOW Fleet for Materials Discovery
The traditional paradigm for materials discovery has recently been expanded to incorporate substantial data-driven research. With the intent to accelerate
the development and the deployment of new technologies, the AFLOW Fleet for
computational materials design automates high-throughput first principles
calculations, and provides tools for data verification and dissemination for a
broad community of users. AFLOW incorporates different computational modules to
robustly determine thermodynamic stability, electronic band structures,
vibrational dispersions, thermo-mechanical properties and more. The AFLOW data
repository is publicly accessible online at aflow.org, with more than 1.7
million materials entries and a panoply of queryable computed properties. Tools
to programmatically search and process the data, as well as to perform online
machine learning predictions, are also available.
Comment: 14 pages, 8 figures
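Programmatic access to the repository is typically done by composing a query string in AFLOW's AFLUX search language. The base URL, property keywords, and directive syntax below are assumptions for illustration; consult aflow.org for the current API.

```python
# Sketch of composing an AFLUX-style query for the AFLOW repository.
# Endpoint and keywords are assumed for illustration only.
BASE = "https://aflow.org/API/aflux/"  # assumed endpoint

def aflux_query(properties, filters, paging=(1, 10)):
    """Compose an AFLUX-style matchbook: filters as keyword(condition),
    bare keywords to request properties, plus a $paging directive."""
    parts = [f"{k}({v})" for k, v in filters.items()]
    parts += list(properties)
    parts.append(f"$paging({paging[0]},{paging[1]})")
    return BASE + "?" + ",".join(parts)

url = aflux_query(
    properties=["Egap", "enthalpy_formation_atom"],
    filters={"species": "Si"},
)
```

The resulting URL can then be fetched with any HTTP client; no request is issued here.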
A Comprehensive Survey on Graph Summarization with Graph Neural Networks
As large-scale graphs become more widespread, more and more computational challenges in extracting, processing, and interpreting large graph data are being exposed. It is therefore natural to search for ways to summarize these
expansive graphs while preserving their key characteristics. In the past, most
graph summarization techniques sought to capture the most important part of a
graph statistically. However, today, the high dimensionality and complexity of
modern graph data are making deep learning techniques more popular. Hence, this
paper presents a comprehensive survey of progress in deep learning
summarization techniques that rely on graph neural networks (GNNs). Our
investigation includes a review of the current state-of-the-art approaches,
including recurrent GNNs, convolutional GNNs, graph autoencoders, and graph
attention networks. A new burgeoning line of research is also discussed where
graph reinforcement learning is being used to evaluate and improve the quality
of graph summaries. Additionally, the survey provides details of benchmark
datasets, evaluation metrics, and open-source tools that are often employed in experimental settings, along with a discussion on the practical uses of graph summarization in different fields. Finally, the survey concludes with a number of open research challenges to motivate further study in this area.
Comment: 20 pages, 4 figures, 3 tables, Journal of IEEE Transactions on Artificial Intelligence
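One of the GNN families the survey covers, the graph autoencoder, can be sketched in a few lines: a single GCN layer encodes nodes, and an inner-product decoder reconstructs edge probabilities. The weights below are random and untrained; this is a sketch of the architecture only, not a usable model.

```python
# Minimal forward pass of a graph autoencoder (GAE): one GCN layer
# with symmetric normalisation, inner-product decoder. Untrained.
import numpy as np

def gae_forward(A, X, W):
    """A: (n,n) adjacency, X: (n,f) features, W: (f,d) weights."""
    n = A.shape[0]
    A_hat = A + np.eye(n)                    # add self-loops
    d = A_hat.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))   # D^(-1/2)
    A_norm = D_inv_sqrt @ A_hat @ D_inv_sqrt # normalised adjacency
    Z = np.maximum(A_norm @ X @ W, 0.0)      # GCN layer with ReLU
    logits = Z @ Z.T                         # inner-product decoder
    A_rec = 1.0 / (1.0 + np.exp(-logits))    # edge probabilities
    return Z, A_rec

rng = np.random.default_rng(0)
A = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
X = np.eye(3)                                # one-hot node features
W = rng.standard_normal((3, 2))
Z, A_rec = gae_forward(A, X, W)
```

Summarization-oriented variants train Z so that the low-dimensional embedding, rather than the full graph, preserves the reconstruction.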
Integrating Fuzzy Decisioning Models With Relational Database Constructs
Human learning and classification are nebulous areas in computer science. Classic decisioning problems can be solved given enough time and computational power, but discrete algorithms cannot easily solve fuzzy problems. Fuzzy decisioning can resolve more real-world fuzzy problems, but existing algorithms are often slow, cumbersome, and unable to give responses within a reasonable timeframe to anything other than predetermined, small-dataset problems. We have developed a database-integrated, highly scalable solution to training and using fuzzy decision models on large datasets. The Fuzzy Decision Tree algorithm integrates Quinlan's ID3 decision-tree algorithm with fuzzy set theory and fuzzy logic. In existing research, when applied to the microRNA prediction problem, Fuzzy Decision Tree outperformed other machine learning algorithms including Random Forest, C4.5, SVM, and k-NN. In this research, we propose that the effectiveness with which large-dataset fuzzy decisions can be resolved via the Fuzzy Decision Tree algorithm is significantly improved when using a relational database as the storage unit for the fuzzy ID3 objects, versus traditional storage objects. Furthermore, it is demonstrated that pre-processing certain pieces of the decisioning within the database layer can lead to much swifter membership determinations, especially on Big Data datasets. The proposed algorithm uses concepts inherent to databases: separated schemas, indexing, partitioning, pipe-and-filter transformations, preprocessing data, materialized and regular views, etc., to present a model with a potential to learn from itself.
Further, this work presents a general application model to re-architect Big Data applications in order to efficiently present decisioned results: lowering the volume of data handled by the application itself and significantly decreasing response wait times, while allowing the flexibility and permanence of a standard relational SQL database, supplying optimal user satisfaction in today's Data Analytics world. We experimentally demonstrate the effectiveness of our approach.
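The core idea of pushing membership determination into the database layer can be sketched with a triangular membership function whose degrees are materialized as a column, so later decisioning becomes a lookup instead of a recomputation. The table and column names are illustrative, not the dissertation's schema.

```python
# Sketch: precompute fuzzy membership degrees inside the database so
# that decisioning queries become simple lookups. Schema is invented.
import sqlite3

def tri_membership(x, a, b, c):
    """Triangular fuzzy membership with support [a, c] and peak b."""
    if x <= a or x >= c:
        return 0.0
    return (x - a) / (b - a) if x <= b else (c - x) / (c - b)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, value REAL)")
conn.executemany("INSERT INTO samples (value) VALUES (?)",
                 [(v,) for v in (0.2, 0.5, 0.9)])

# Materialise membership in a 'medium' fuzzy set (support 0..1, peak 0.5)
conn.execute("ALTER TABLE samples ADD COLUMN mu_medium REAL")
for sid, v in conn.execute("SELECT id, value FROM samples").fetchall():
    conn.execute("UPDATE samples SET mu_medium = ? WHERE id = ?",
                 (tri_membership(v, 0.0, 0.5, 1.0), sid))
conn.commit()

# Membership determination is now a plain (indexable) column read
rows = dict(conn.execute("SELECT value, mu_medium FROM samples"))
```

In a production setting the same precomputation could live in a materialized view or be maintained by triggers, per the database features the abstract lists.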
Large-Scale Analysis of Protein-Ligand Binding Sites using the Binding MOAD Database.
Current structure-based drug design (SBDD) methods require understanding of general trends of protein-ligand interactions. Informative descriptors of ligand-binding sites provide powerful heuristics to improve SBDD methods designed to infer function from protein structure. These descriptors must have a solid statistical foundation for assessing general trends in large sets of protein-ligand complexes. This dissertation focuses on mining the Binding MOAD database of highly curated protein-ligand complexes to determine frequently observed patterns of binding-site composition. An extension to Binding MOAD’s framework is developed to store structural details of binding sites and facilitate large-scale analysis. This thesis uses the framework to address three topics. It first describes a strategy for determining over-representation of amino acids within ligand-binding sites, comparing the trends of residue propensity for binding sites of biologically relevant ligands to those of spurious molecules with no known function. To determine the significance of these trends and to provide guidelines for residue-propensity studies, the effect of the data set size on the variation in propensity values is evaluated. Next, binding-site residue propensities are applied to improve the performance of a geometry-based, binding-site prediction algorithm. Propensity-based scores are found to perform comparably to the native score in successfully ranking correct predictions. For large proteins, propensity-based and consensus scores improve the scoring success. Finally, current protein-ligand scoring functions are evaluated using a new criterion: the ability to discern biologically relevant ligands from “opportunistic binders,” molecules present in crystal structures due to their high concentrations in the crystallization medium. Four different scoring functions are evaluated against a diverse benchmark set.
All are found to perform well for ranking biologically relevant sites over spurious ones, and all performed best when penalties for torsional strain of ligands were included. The final chapter describes a structural alignment method, termed HwRMSD, which can align proteins of very low sequence homology based on their structural similarity using a weighted structure superposition. The overall aims of the dissertation are to collect high-quality binding-site composition data within the largest available set of protein-ligand complexes and to evaluate the appropriate applications of this data to emerging methods for computational proteomics.
Ph.D. Bioinformatics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/91400/1/nickolay_1.pd
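The residue-propensity quantity at the heart of the first chapter is the frequency of a residue type in binding sites divided by its frequency in the full data set. The counts below are made up for illustration; the formula is the standard composition ratio.

```python
# Illustrative binding-site residue propensity: P(aa | site) / P(aa | all).
# Values above 1 mean the residue is over-represented in binding sites.
from collections import Counter

def propensities(site_residues, all_residues):
    site, full = Counter(site_residues), Counter(all_residues)
    n_site, n_full = len(site_residues), len(all_residues)
    return {aa: (site[aa] / n_site) / (full[aa] / n_full)
            for aa in site}

site = list("WWHYG")             # residues lining binding sites (invented)
background = list("WWHYGAAAAG")  # residues in the full structures (invented)
p = propensities(site, background)
```

With real data, the dissertation's point about data-set size applies: propensity estimates from small counts carry large variance and need significance testing.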
A Data Mining Approach To Gen Dynamic Behavioral Process
The texts presented in this special section highlight certain pitfalls related to the establishment and management of protected areas. The identification of the territory, compliance with international agreements on biological diversity, and above all the acceptance by local communities of a conservation area within their living environment appear to be factors that slow down biodiversity protection processes. Social acceptability...
Development and Applications of Similarity Measures for Spatial-Temporal Event and Setting Sequences
Similarity or distance measures between data objects are applied frequently in many fields or domains such as geography, environmental science, biology, economics, computer science, linguistics, logic, business analytics, and statistics, among others. One area where similarity measures are particularly important is in the analysis of spatiotemporal event sequences and associated environs or settings. This dissertation focuses on developing a framework of modeling, representation, and new similarity measure construction for sequences of spatiotemporal events and corresponding settings, which can be applied to different event data types and used in different areas of data science. The first core part of this dissertation presents a matrix-based spatiotemporal event sequence representation that unifies punctual and interval-based representation of events. This framework supports different event data types and provides support for data mining and sequence classification and clustering. The similarity measure is based on the modified Jaccard index with temporal order constraints and accommodates different event data types. This approach is demonstrated through simulated data examples, and the performance of the similarity measures is evaluated with a k-nearest neighbor (k-NN) classification test on synthetic datasets. These similarity measures are incorporated into a clustering method, and their usefulness is demonstrated in a case study analysis of event sequences extracted from space-time series of a water quality monitoring system. This dissertation further proposes a new similarity measure for event setting sequences, which involve the space and time in which events occur. While similarity measures for spatiotemporal event sequences have been studied, the settings and setting sequences have not yet been considered.
In modeling event setting sequences, spatial and temporal scales are considered to define the bounds of the setting, and dynamic variables are incorporated along with static variables. Using a matrix-based representation and an extended Jaccard index, new similarity measures are developed to allow for the use of all variable data types. With these similarity measures coupled with other multivariate statistical analysis approaches, results from a case study involving setting sequences and pollution event sequences associated with the same monitoring stations support the hypothesis that more similar spatial-temporal settings or setting sequences may generate more similar events or event sequences. To test the scalability of the STES similarity measure on a larger dataset and its extended application in different fields, this dissertation compares and contrasts the prospective space-time scan statistic with the STES similarity approach for identifying COVID-19 hotspots. The COVID-19 pandemic has highlighted the importance of detecting hotspots or clusters of COVID-19 to provide decision makers at various levels with better information for managing the distribution of human and technical resources as the outbreak in the USA continues to grow. The prospective space-time scan statistic has been used to help identify emerging disease clusters, yet results from this approach can encounter strategic limitations imposed by the spatial constraints of the scanning window. The STES-based approach adapted for this pandemic context computes the similarity of evolving normalized COVID-19 daily cases by county and clusters these to identify counties with similarly evolving COVID-19 case histories. This dissertation analyzes the spread of COVID-19 within the continental US through four periods beginning from late January 2020, using the COVID-19 datasets maintained by the Johns Hopkins University Center for Systems Science and Engineering (CSSE).
Results of the two approaches can complement each other and, taken together, can aid in tracking the progression of the pandemic. Overall, the dissertation highlights the importance of developing similarity measures for analyzing spatiotemporal event sequences and associated settings, which can be applied to different event data types and used for data mining, sequence classification, and clustering.
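The matrix-based core of the approach can be sketched as follows: each sequence becomes a binary event-type-by-time-bin matrix, and two sequences are compared with the Jaccard index over matrix cells. The dissertation's measure additionally imposes temporal order constraints and handles richer data types; this shows only the basic Jaccard step.

```python
# Jaccard index over two equal-shape binary matrices, where rows are
# event types and columns are time bins. A simplified sketch of the
# matrix-based event-sequence comparison described in the abstract.
def jaccard_matrix(M1, M2):
    inter = union = 0
    for r1, r2 in zip(M1, M2):
        for a, b in zip(r1, r2):
            inter += 1 if (a and b) else 0
            union += 1 if (a or b) else 0
    return inter / union if union else 1.0

# rows = event types, columns = time bins (invented toy sequences)
A = [[1, 0, 1],
     [0, 1, 0]]
B = [[1, 1, 1],
     [0, 1, 0]]
sim = jaccard_matrix(A, B)
```

Here the two sequences share three of four occupied cells, so the similarity is 0.75; identical sequences score 1.0.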
Data integration, pathway analysis and mining for systems biology
Post-genomic molecular biology embodies high-throughput experimental techniques and hence is a data-rich field. The goal of this thesis is to develop bioinformatics methods to utilise publicly available data in order to produce knowledge and to aid mining of newly generated data. As an example of knowledge or hypothesis generation, consider function prediction of biological molecules. Assignment of protein function is a non-trivial task because the same protein may be involved in different biological processes, depending on the state of the biological system and protein localisation. The function of a gene or a gene product may be provided as a textual description in a gene or protein annotation database. Such textual descriptions fall short of providing the contextual meaning of the gene function. Therefore, we need ways to represent the meaning in a formal way. Here we apply a data integration approach to provide a rich representation that enables context-sensitive mining of biological data in terms of integrated networks and conceptual spaces. Context-sensitive gene function annotation follows naturally from this framework, as a particular application. Next, knowledge that is already publicly available can be used to aid mining of new experimental data. We developed an integrative bioinformatics method that utilises publicly available knowledge of protein-protein interactions, metabolic networks and transcriptional regulatory networks to analyse transcriptomics data and predict altered biological processes. We applied this method to a study of the dynamic response of Saccharomyces cerevisiae to oxidative stress. The application of our method revealed dynamically altered biological functions in response to oxidative stress, which were validated by comprehensive in vivo metabolomics experiments. The results provided in this thesis indicate that integration of heterogeneous biological data facilitates advanced mining of the data.
The methods can be applied to gain insight into the functions of genes, gene products and other molecules, as well as to offer functional interpretation of transcriptomics and metabolomics experiments.
Multi-field Visualisation via Trait-induced Merge Trees
In this work, we propose trait-based merge trees, a generalization of merge trees to feature level sets, targeting the analysis of tensor fields or general multivariate data. For this, we employ the notion of traits defined in
attribute space as introduced in the feature level sets framework. The
resulting distance field in attribute space induces a scalar field in the
spatial domain that serves as input for topological data analysis. The leaves
in the merge tree represent those areas in the input data that are closest to
the defined trait and thus most closely resemble the defined feature. Hence,
the merge tree yields a hierarchy of features that allows for querying the most
relevant and persistent features. The presented method includes different query methods for the tree, which enable highlighting different aspects. We demonstrate the cross-application capabilities of this approach with three case studies from different domains.
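The topological machinery underneath can be illustrated on a 1-D scalar field: sweep the function values upward and track connected components of the sublevel set with a union-find structure. Leaves appear at local minima (the points closest to the trait in the abstract's setting), and internal nodes record where two components merge. This is a generic merge-tree sketch, not the paper's trait-induced construction.

```python
# Sublevel-set merge events on a 1-D scalar field via union-find.
def merge_events(values):
    """Return (leaves, merges): indices of local minima, and the
    function values at which two components merge."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    parent = {}

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    leaves, merges = [], []
    for i in order:  # activate samples from lowest to highest value
        parent[i] = i
        roots = {find(j) for j in (i - 1, i + 1) if j in parent}
        if not roots:
            leaves.append(i)          # new component: a leaf of the tree
        elif len(roots) == 2:
            merges.append(values[i])  # two components join: internal node
        for r in roots:
            parent[r] = i
    return sorted(leaves), merges

field = [3.0, 1.0, 2.0, 0.0, 4.0]  # toy distance-to-trait field
leaves, merges = merge_events(field)
```

For the toy field there are two leaves (the minima at indices 1 and 3), which merge at value 2.0; in the paper's setting the field is the trait-induced distance in attribute space mapped back to the spatial domain.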