883 research outputs found

    Pattern Mining and Events Discovery in Molecular Dynamics Simulations Data

    The molecular dynamics simulation method is widely used to calculate and understand a wide range of properties of materials. Much research effort has focused on simulation techniques, but relatively little work has been done on methods for analyzing the simulation results. Large-scale simulations usually generate massive amounts of data, which makes manual analysis infeasible, particularly when it is necessary to look into the details of the simulation results. In this dissertation, we propose a system that uses computational methods to automatically analyze simulation data, which represent atomic position-time series. The system automatically identifies the micro-level events (such as bond formation and breaking) that are connected to large movements of the atoms, which are considered relevant to the diffusion properties of the material. The challenge is to discover the interesting atomic activities that are key to understanding the macro-level (bulk) properties of the material. Furthermore, simply mining the structure graph of a material (the graph where the constituent atoms form nodes and the bonds between the atoms form edges) offers little help in this scenario. It is the patterns among the atomic dynamics that may be good candidates for underlying mechanisms. We propose an event-graph model of the atomic dynamics and a graph mining algorithm to discover popular subgraphs in the event graph. We also analyze such patterns in the primitive ring mining process and calculate the distributions of primitive rings during large and normal movements of atoms. Because the event graph is a directed acyclic graph, our mining algorithm uses a new graph encoding scheme based on topological sorting. This encoding scheme also ensures that our algorithm enumerates candidate subgraphs without any duplication. Our experiments using simulation data of silica liquid show the effectiveness of the proposed mining system.
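
The abstract above relies on topological sorting of a directed acyclic event graph. The following is a minimal sketch of that building block only, not the dissertation's encoding scheme, which the abstract does not spell out. It uses Kahn's algorithm with a deterministic tie-break so that the same graph always yields the same node order. The event names (`break_1`, `form_2`, and so on) are hypothetical.

```python
import heapq
from collections import defaultdict


def topological_order(edges):
    """Return the nodes of a DAG in topological order (Kahn's algorithm).

    Ties are broken by always taking the smallest available node, so the
    same graph always produces the same order.
    """
    successors = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for src, dst in edges:
        successors[src].append(dst)
        indegree[dst] += 1
        nodes.update((src, dst))

    # Start from every node that has no incoming edge.
    ready = [n for n in nodes if indegree[n] == 0]
    heapq.heapify(ready)

    order = []
    while ready:
        node = heapq.heappop(ready)
        order.append(node)
        for nxt in successors[node]:
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                heapq.heappush(ready, nxt)

    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")
    return order


# Hypothetical event graph: an edge (u, v) means event u precedes event v.
events = [
    ("break_1", "form_2"),
    ("break_1", "form_3"),
    ("form_2", "jump_4"),
    ("form_3", "jump_4"),
]
print(topological_order(events))
```

A fixed node order of this kind is one common starting point for giving each subgraph a single canonical form, which is what lets a miner avoid enumerating the same candidate twice.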

    Data Mining Using the Crossing Minimization Paradigm

    Our ability and capacity to generate, record and store multi-dimensional, apparently unstructured data is increasing rapidly, while the cost of data storage is going down. The recorded data is not perfect, as noise is introduced into it from different sources. Some of the basic forms of noise are incorrectly recorded values and missing values. The formal study of discovering useful hidden information in data is called data mining. Because of the size and complexity of the problem, practical data mining problems are best attempted using automatic means. Data mining can be categorized into two types: supervised learning (classification) and unsupervised learning (clustering). Clustering only the records in a database (or data matrix) gives a global view of the data and is called one-way clustering. For a detailed analysis or a local view, biclustering (also called co-clustering or two-way clustering) is required, which involves the simultaneous clustering of the records and the attributes. In this dissertation, a novel, fast, white-noise-tolerant data mining solution is proposed based on the Crossing Minimization (CM) paradigm; the solution works for one-way as well as two-way clustering for discovering overlapping biclusters. For decades, the CM paradigm has been used in graph drawing and VLSI (Very Large Scale Integration) circuit design to reduce wire length and congestion. The utility of the proposed technique is demonstrated by comparing it with other biclustering techniques using simulated noisy data as well as real data from agriculture, biology and other domains. Two other interesting and hard problems addressed in this dissertation are (i) the Minimum Attribute Subset Selection (MASS) problem and (ii) the Bandwidth Minimization (BWM) problem for sparse matrices. The proposed CM technique is demonstrated to provide very convincing results on these problems using real public domain data.
Pakistan is the fourth largest supplier of cotton in the world. An apparent anomaly was observed during 1989-97 between cotton yield and pesticide consumption in Pakistan, showing unexpected periods of negative correlation. By applying the indigenous CM technique for one-way clustering to real Agro-Met data (2001-2002), a possible explanation of the anomaly is presented in this thesis.
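
To illustrate how crossing minimization can expose clusters in a data matrix, here is a small sketch using the barycenter heuristic, a standard crossing-reduction method from graph drawing. It is an assumption that this resembles the dissertation's technique; the abstract does not name the heuristic it uses. The matrix is treated as a bipartite graph (rows on one side, columns on the other, an edge for each 1), and both sides are reordered until the order stops changing. The function names and the toy matrix are hypothetical.

```python
def barycenter_reorder(matrix, sweeps=10):
    """Reorder rows and columns of a 0/1 matrix with the barycenter heuristic.

    Each column is placed at the mean position of the rows it touches, then
    each row at the mean position of its columns, until the order is stable.
    """
    n_rows, n_cols = len(matrix), len(matrix[0])
    row_order = list(range(n_rows))
    col_order = list(range(n_cols))

    def mean_position(neighbours, position, fallback):
        hits = [position[n] for n in neighbours]
        return sum(hits) / len(hits) if hits else fallback

    for _ in range(sweeps):
        row_pos = {r: i for i, r in enumerate(row_order)}
        new_cols = sorted(
            col_order,
            key=lambda c: mean_position(
                [r for r in range(n_rows) if matrix[r][c]], row_pos, n_rows
            ),
        )
        col_pos = {c: i for i, c in enumerate(new_cols)}
        new_rows = sorted(
            row_order,
            key=lambda r: mean_position(
                [c for c in range(n_cols) if matrix[r][c]], col_pos, n_cols
            ),
        )
        if new_rows == row_order and new_cols == col_order:
            break
        row_order, col_order = new_rows, new_cols
    return row_order, col_order


def count_crossings(matrix, row_order, col_order):
    """Count pairs of edges that cross in the two-layer drawing."""
    edges = [
        (i, j)
        for i, r in enumerate(row_order)
        for j, c in enumerate(col_order)
        if matrix[r][c]
    ]
    return sum(
        1
        for a in range(len(edges))
        for b in range(a + 1, len(edges))
        if (edges[a][0] - edges[b][0]) * (edges[a][1] - edges[b][1]) < 0
    )


# Toy matrix with two interleaved biclusters: rows {0, 2} and rows {1, 3}.
data = [
    [1, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 0],
    [0, 1, 0, 1],
]
rows, cols = barycenter_reorder(data)
before = count_crossings(data, [0, 1, 2, 3], [0, 1, 2, 3])
after = count_crossings(data, rows, cols)
print(rows, cols, before, after)
```

After reordering, rows and columns that share entries sit next to each other, so the biclusters appear as dense blocks along the diagonal of the permuted matrix.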

    Data analytics 2016: proceedings of the fifth international conference on data analytics

    Systematic assessment of protein interaction data using graph topology approaches

    Ph.D., Doctor of Philosophy

    Associative pattern mining for supervised learning

    The Internet era has revolutionized computational sciences and automated data collection techniques, made large amounts of previously inaccessible data available and, consequently, broadened the scope of exploratory computing research. As a result, data mining, which is still an emerging field of research, has gained importance because of its ability to analyze and discover previously unknown, hidden, and useful knowledge from these large amounts of data. One aspect of data mining, known as frequent pattern mining, has recently gained importance due to its ability to find associative relationships among the parts of data, thereby aiding a type of supervised learning known as associative learning. The purpose of this dissertation is two-fold: to develop and demonstrate supervised associative learning in non-temporal data for multi-class classification, and to develop a new frequent pattern mining algorithm for time-varying (temporal) data that alleviates the current issues in analyzing this data for knowledge discovery. In order to use associative relationships for classification, we have to algorithmically learn their discriminatory power. While it is well known that multiple sets of features work better for classification, we claim that the isomorphic relationships among the features work even better and, therefore, can be used as higher-order features. To validate this claim, we exploit these relationships as input features for classification instead of using the underlying raw features. The next part of this dissertation focuses on building a new classifier using associative relationships as a basis for the multi-class classification problem. Most existing associative classifiers represent the instances from a class in a row-based format, wherein one row represents the features of one instance, and extract association rules from the entire dataset.
The rules formed in this way are known as class-constrained rules, as they have class labels on the right side of the rules. We argue that this class-constrained representation schema lacks important information that is necessary for multi-class classification. Further, most existing works use either the intra-class or the inter-class importance of the association rules, both of which offer empirical benefits. We hypothesize that both intra-class and inter-class variations are important for fast and accurate multi-class classification. We also present a novel weighted association rule-based classification mechanism that uses frequent relationships among raw features from an instance as the basis for classifying the instance into one of the many classes. The relationships are weighted according to both their intra-class and inter-class importance. The final part of this dissertation concentrates on mining time-varying data. This problem is known as inter-transaction association rule mining in the data mining field. Most existing work transforms the time-varying data into a static format and then uses multiple scans over the new data to extract patterns. We present a unique index-based algorithmic framework for inter-transaction association rule mining. Our proposed technique requires only one scan of the original database. Further, the proposed technique can also provide the location information of each extracted pattern. We use mathematical induction to prove that the new representation scheme captures all underlying frequent relationships.
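
The idea of an index that is built in one scan and also reports where each pattern occurs can be sketched as follows. This is an illustrative assumption, not the dissertation's framework: the abstract does not describe its index structure. Here the index maps each item to the set of transaction positions containing it, and an inter-transaction pattern is a list of hypothetical `(item, offset)` pairs, meaning the item occurs `offset` transactions after the pattern's start.

```python
from collections import defaultdict


def build_index(transactions):
    """Map each item to the positions of the transactions that contain it.

    The database is read exactly once.
    """
    index = defaultdict(set)
    for pos, items in enumerate(transactions):
        for item in items:
            index[item].add(pos)
    return index


def pattern_locations(index, pattern):
    """Return the start positions at which every (item, offset) pair holds.

    Each item's positions are shifted back by its offset; a start position
    matches when it survives the intersection over all pairs.
    """
    if not pattern:
        return []
    shifted = [
        {pos - offset for pos in index.get(item, set()) if pos - offset >= 0}
        for item, offset in pattern
    ]
    return sorted(set.intersection(*shifted))


# Hypothetical time-ordered database: one set of items per time step.
database = [{"a"}, {"b"}, {"a", "c"}, {"b"}, {"c"}]
index = build_index(database)

# "a now, b one step later" starts at positions 0 and 2.
locations = pattern_locations(index, [("a", 0), ("b", 1)])
print(locations, len(locations) / len(database))
```

Because support is computed from the index alone, no further pass over the transactions is needed, and the returned positions double as the location information for each pattern.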

    Structure discovery techniques for circuit design and process model visualization

    Graphs are one of the most widely used abstractions in many fields of knowledge because of the ease and flexibility with which they can represent relationships between objects. The pervasiveness of graphs in many disciplines means that huge amounts of data are available in graph form, creating many opportunities to extract useful structure from these graphs and gain insight into the data. In this thesis we introduce a series of techniques to resolve well-known challenges in the areas of digital circuit design and process mining. The underlying idea that ties all the approaches together is discovering structures in graphs. We show how many problems of practical importance in these areas can be solved using both common and novel structure mining approaches. In the area of digital circuit design, this thesis proposes automatically discovering frequent, repetitive structures in a circuit netlist in order to improve the quality of physical planning. These structures can be used during floorplanning to produce regular designs, which are known to be highly efficient and economical. At the same time, detecting these repeating structures can exponentially reduce the total design time. The second focus of this thesis is the visualization of process models. Process mining is a recent area of research that centers on studying the behavior of real-life systems and their interactions with the environment. Complicated process models, however, hamper this goal. By discovering the important structures in these models, we propose a series of methods that can derive visualization-friendly process models with minimal loss in accuracy. In addition, combining the areas of circuit design and process mining, this thesis opens the area of specification mining in asynchronous circuits. Instead of the usual design flow, which involves synthesizing circuits from specifications, our proposal discovers specifications from implemented circuits.
This area opens many opportunities for verification and re-synthesis of asynchronous circuits. The proposed methods have been tested using real-life benchmarks, and the quality of the results has been compared to the state of the art.