8 research outputs found

    Integration of Heterogeneous Databases: Discovery of Meta-Information and Maintenance of Schema-Restructuring Views

    Get PDF
    In today\u27s networked world, information is widely distributed across many independent databases in heterogeneous formats. Integrating such information is a difficult task and has been adressed by several projects. However, previous integration solutions, such as the EVE-Project, have several shortcomings. Database contents and structure change frequently, and users often have incomplete information about the data content and structure of the databases they use. When information from several such insufficiently described sources is to be extracted and integrated, two problems have to be solved: How can we discover the structure and contents of and interrelationships among unknown databases, and how can we provide durable integration views over several such databases? In this dissertation, we have developed solutions for those key problems in information integration. The first part of the dissertation addresses the fact that knowledge about the interrelationships between databases is essential for any attempt at solving the information integration problem. We are presenting an algorithm called FIND2 based on the clique-finding problem in graphs and k-uniform hypergraphs to discover redundancy relationships between two relations. Furthermore, the algorithm is enhanced by heuristics that significantly reduce the search space when necessary. Extensive experimental studies on the algorithm both with and without heuristics illustrate its effectiveness on a variety of real-world data sets. The second part of the dissertation addresses the durable view problem and presents the first algorithm for incremental view maintenance in schema-restructuring views. Such views are essential for the integration of heterogeneous databases. They are typically defined in schema-restructuring query languages like SchemaSQL, which can transform schema into data and vice versa, making traditional view maintenance based on differential queries impossible. Based on an existing algebra for SchemaSQL, we present an update propagation algorithm that propagates updates along the query algebra tree and prove its correctness. We also propose optimizations on our algorithm and present experimental results showing its benefits over view recomputation

    FAC-PIN: An efficient and fast agglomerative clustering algorithm for protein interaction networks to predict protein complexes and functional modules

    Get PDF
    Proteins are known to interact with each other to perform specific living organism functions by forming functional modules or protein complexes. Many community detection methods have been devised for the discovery of functional modules or protein complexes in protein interaction networks. One common problem in current agglomerative community detection approaches is that vertices with just one neighbor are often classified as separated clusters, which does not make sense for module or complex identification. In this thesis, we propose a new agglomerative algorithm, FAC-PIN, based on a local premetric of relative vertex-to-vertex clustering value. Our proposed FAC-PIN method is applied to PINs from different species for validating functional modules and protein complexes generated from FAC-PIN with experimentally verified functional modules and complexes respectively. The preliminary computational results show that FAC-PIN can discover functional modules and protein complexes from PINs more accurately. As well as we have also compared the computational times for different species with HC-PIN and CNM algorithms. Our algorithm outperforms two algorithms. Our FAC-PIN algorithm is faster and accurate algorithm which is the current state-of-the-art agglomerative approach to complex prediction and functional module identification

    Navigating Diverse Datasets in the Face of Uncertainty

    Get PDF
    When exploring big volumes of data, one of the challenging aspects is their diversity of origin. Multiple files that have not yet been ingested into a database system may contain information of interest to a researcher, who must curate, understand and sieve their content before being able to extract knowledge. Performance is one of the greatest difficulties in exploring these datasets. On the one hand, examining non-indexed, unprocessed files can be inefficient. On the other hand, any processing before its understanding introduces latency and potentially un- necessary work if the chosen schema matches poorly the data. We have surveyed the state-of-the-art and, fortunately, there exist multiple proposal of solutions to handle data in-situ performantly. Another major difficulty is matching files from multiple origins since their schema and layout may not be compatible or properly documented. Most surveyed solutions overlook this problem, especially for numeric, uncertain data, as is typical in fields like astronomy. The main objective of our research is to assist data scientists during the exploration of unprocessed, numerical, raw data distributed across multiple files based solely on its intrinsic distribution. In this thesis, we first introduce the concept of Equally-Distributed Dependencies, which provides the foundations to match this kind of dataset. We propose PresQ, a novel algorithm that finds quasi-cliques on hypergraphs based on their expected statistical properties. The probabilistic approach of PresQ can be successfully exploited to mine EDD between diverse datasets when the underlying populations can be assumed to be the same. Finally, we propose a two-sample statistical test based on Self-Organizing Maps (SOM). This method can outperform, in terms of power, other classifier-based two- sample tests, being in some cases comparable to kernel-based methods, with the advantage of being interpretable. Both PresQ and the SOM-based statistical test can provide insights that drive serendipitous discoveries

    Quality and interestingness of association rules derived from data mining of relational and semi-structured data

    Get PDF
    Deriving useful and interesting rules from a data mining system are essential and important tasks. Problems such as the discovery of random and coincidental patterns or patterns with no significant values, and the generation of a large volume of rules from a database commonly occur. Works on sustaining the interestingness of rules generated by data mining algorithms are actively and constantly being examined and developed. As the data mining techniques are data-driven, it is beneficial to affirm the rules using a statistical approach. It is important to establish the ways in which the existing statistical measures and constraint parameters can be effectively utilized and the sequence of their usage.In this thesis, a systematic way to evaluate the association rules discovered from frequent, closed and maximal itemset mining algorithms; and frequent subtree mining algorithm including the rules based on induced, embedded and disconnected subtrees is presented. With reference to the frequent subtree mining, in addition a new direction is explored based on utilizing the DSM approach capable of preserving all information from tree-structured database in a flat data format, consequently enabling the direct application of a wider range of data mining analysis/techniques to tree-structured data. Implications of this approach were investigated and it was found that basing rules on disconnected subtrees, can be useful in terms of increasing the accuracy and the coverage rate of the rule set.A strategy that combines data mining and statistical measurement techniques such as sampling, redundancy and contradictive checks, correlation and regression analysis to evaluate the rules is developed. This framework is then applied to real-world datasets that represent diverse characteristics of data/items. Empirical results show that with a proper combination of data mining and statistical analysis, the proposed framework is capable of eliminating a large number of non-significant, redundant and contradictive rules while preserving relatively valuable high accuracy rules. Moreover, the results reveal the important characteristics and differences between mining frequent, closed or maximal itemsets; and mining frequent subtree including the rules based on induced, embedded and disconnected subtrees; as well as the impact of confidence measure for the prediction and classification task

    Mining climate data for shire level wheat yield predictions in Western Australia

    Get PDF
    Climate change and the reduction of available agricultural land are two of the most important factors that affect global food production especially in terms of wheat stores. An ever increasing world population places a huge demand on these resources. Consequently, there is a dire need to optimise food production. Estimations of crop yield for the South West agricultural region of Western Australia have usually been based on statistical analyses by the Department of Agriculture and Food in Western Australia. Their estimations involve a system of crop planting recommendations and yield prediction tools based on crop variety trials. However, many crop failures arise from adherence to these crop recommendations by farmers that were contrary to the reported estimations. Consequently, the Department has sought to investigate new avenues for analyses that improve their estimations and recommendations. This thesis explores a new approach in the way analyses are carried out. This is done through the introduction of new methods of analyses such as data mining and online analytical processing in the strategy. Additionally, this research attempts to provide a better understanding of the effects of both gradual variation parameters such as soil type, and continuous variation parameters such as rainfall and temperature, on the wheat yields. The ultimate aim of the research is to enhance the prediction efficiency of wheat yields. The task was formidable due to the complex and dichotomous mixture of gradual and continuous variability data that required successive information transformations. It necessitated the progressive moulding of the data into useful information, practical knowledge and effective industry practices. Ultimately, this new direction is to improve the crop predictions and to thereby reduce crop failures. The research journey involved data exploration, grappling with the complexity of Geographic Information System (GIS), discovering and learning data compatible software tools, and forging an effective processing method through an iterative cycle of action research experimentation. A series of trials was conducted to determine the combined effects of rainfall and temperature variations on wheat crop yields. These experiments specifically related to the South Western Agricultural region of Western Australia. The study focused on wheat producing shires within the study area. The investigations involved a combination of macro and micro analyses techniques for visual data mining and data mining classification techniques, respectively. The research activities revealed that wheat yield was most dependent upon rainfall and temperature. In addition, it showed that rainfall cyclically affected the temperature and soil type due to the moisture retention of crop growing locations. Results from the regression analyses, showed that the statistical prediction of wheat yields from historical data, may be enhanced by data mining techniques including classification. The main contribution to knowledge as a consequence of this research was the provision of an alternate and supplementary method of wheat crop prediction within the study area. Another contribution was the division of the study area into a GIS surface grid of 100 hectare cells upon which the interpolated data was projected. Furthermore, the proposed framework within this thesis offers other researchers, with similarly structured complex data, the benefits of a general processing pathway to enable them to navigate their own investigations through variegated analytical exploration spaces. In addition, it offers insights and suggestions for future directions in other contextual research explorations

    Semantics-based language models for information retrieval and text mining

    Get PDF
    The language modeling approach centers on the issue of estimating an accurate model by choosing appropriate language models as well as smoothing techniques. In the thesis, we propose a novel context-sensitive semantic smoothing method referred to as a topic signature language model. It extracts explicit topic signatures from a document and then statistically maps them into individual words in the vocabulary. In order to support the new language model, we developed two automated algorithms to extract multiword phrases and ontological concepts, respectively, and an EM-based algorithm to learn semantic mapping knowledge from co-occurrence data. The topic signature language model is applied to three applications: information retrieval, text classification, and text clustering. The evaluations on news collection and biomedical literature prove the effectiveness of the topic signature language model.In the experiment of information retrieval, the topic signature language model consistently outperforms the baseline two-stage language model as well as the context-insensitive semantic smoothing method in all configurations. It also beats the state-of-the-art Okapi models in all configurations. In the experiment of text classification, when the size of training documents is small, the Bayesian classifier with semantic smoothing not only outperforms the classifiers with background smoothing and Laplace smoothing, but it also beats the active learning classifiers and SVM classifiers. On the task of clustering, whether or not the dataset to cluster is small, the model-based k-means with semantic smoothing performs significantly better than both the model-based k-means with background smoothing and Laplace smoothing. It is also superior to the spherical k-means in terms of effectiveness.In addition, we empirically prove that, within the framework of topic signature language models, the semantic knowledge learned from one collection could be effectively applied to other collections. In the thesis, we also compare three types of topic signatures (i.e., words, multiword phrases, and ontological concepts), with respect to their effectiveness and efficiency for semantic smoothing. In general, it is more expensive to extract multiword phrases and ontological concepts than individual words, but semantic mapping based on multiword phrases and ontological concepts are more effective in handling data sparsity than on individual words.Ph.D., Information Science and Technology -- Drexel University, 200

    LIPIcs, Volume 261, ICALP 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 261, ICALP 2023, Complete Volum

    LIPIcs, Volume 258, SoCG 2023, Complete Volume

    Get PDF
    LIPIcs, Volume 258, SoCG 2023, Complete Volum
    corecore