
    Constraint-based Sequential Pattern Mining with Decision Diagrams

    Constrained sequential pattern mining aims to identify frequent patterns in a sequential database of items while observing constraints defined over the item attributes. We introduce novel techniques for constraint-based sequential pattern mining that rely on a multi-valued decision diagram (MDD) representation of the database. Specifically, our representation can accommodate multiple item attributes and various constraint types, including a number of non-monotone constraints. To evaluate the applicability of our approach, we develop an MDD-based prefix-projection algorithm and compare its performance against a typical generate-and-check variant, as well as a state-of-the-art constraint-based sequential pattern mining algorithm. Results show that our approach is competitive with or superior to these other methods in terms of scalability and efficiency. Comment: AAAI201
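As a rough illustration of the prefix-projection idea the abstract mentions (not the paper's MDD-based variant, which represents the database as a decision diagram), a minimal PrefixSpan-style miner over sequences of single items might look like:

```python
def prefix_span(db, min_sup, prefix=()):
    """Minimal prefix-projection miner.

    db is a list of item sequences; returns a dict mapping each
    frequent sequential pattern (a tuple) to its support count.
    """
    # Count, per sequence, which items could extend the current prefix.
    counts = {}
    for seq in db:
        for item in set(seq):  # set(): count each item once per sequence
            counts[item] = counts.get(item, 0) + 1

    patterns = {}
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        new_prefix = prefix + (item,)
        patterns[new_prefix] = sup
        # Project the database: keep only the suffix after the first
        # occurrence of the extending item in each matching sequence.
        projected = [seq[seq.index(item) + 1:] for seq in db if item in seq]
        patterns.update(prefix_span(projected, min_sup, new_prefix))
    return patterns
```

A generate-and-check variant would instead enumerate candidate patterns and scan the whole database for each; projection avoids that by shrinking the database as the prefix grows.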

    Emerging Chemical Patterns for Virtual Screening and Knowledge Discovery

    The adaptation and evaluation of contemporary data mining methods for chemical and biological problems is one of the major areas of research in chemoinformatics. Currently, large databases containing millions of small organic compounds are publicly available, and the need for advanced methods to analyze these data increases. Most methods used in chemoinformatics, e.g. quantitative structure activity relationship (QSAR) modeling, decision trees and similarity searching, depend on the availability of large high-quality training data sets. However, in biological settings, the availability of these training sets is rather limited. This is especially true for early stages of drug discovery projects, where typically only a few active molecules are available. The ability of chemoinformatic methods to generalize from small training sets and accurately predict compound properties such as activity, ADME or toxicity is thus crucially important. Additionally, biological data such as results from high-throughput screening (HTS) campaigns are heavily biased towards inactive compounds. This bias presents an additional challenge for the adaptation of data mining methods and distinguishes chemoinformatics data from the standard benchmark scenarios in the data mining community. Even if a highly accurate classifier were available, it would still be necessary to evaluate the predictions experimentally. These experiments are both costly and time-consuming, and the need to optimize resources has driven the development of integrated screening protocols which try to minimize experimental effort while still reaching high hit rates of active compounds. This integration, termed “sequential screening”, benefits from the complementary nature of experimental HTS and computational virtual screening (VS) methods. In this thesis, a current data mining framework based on class-specific nominal combinations of attributes (emerging patterns) is adapted to chemoinformatic problems and thoroughly evaluated.
Combining emerging pattern methodology and the well-known notion of chemical descriptors, emerging chemical patterns (ECP) are defined as class-specific descriptor value range combinations. Each pattern can be thought of as a region in chemical space which is dominated by compounds from one class only. Based on chemical patterns, several experiments are presented which evaluate the performance of pattern-based knowledge mining, property prediction, compound ranking and sequential screening. ECP-based classification is implemented and evaluated on four activity classes for the prediction of compound potency levels. Compared to decision trees and a Bayesian binary QSAR method, ECP-based classification produces high accuracy in positive and negative classes even on the basis of very small training sets, a result especially valuable for chemoinformatic problems. The simple nature of ECPs as class-specific descriptor value range combinations makes them easily interpretable. This is used to relate ECPs to changes in the interaction network of protein-ligand complexes when the binding conformation is replaced by a computer-modeled conformation in a knowledge mining experiment. ECPs capture well-known energetic differences between binding and energy-minimized conformations and additionally provide new insight into these differences in a class-level analysis. Finally, the integration of ECPs and HTS is evaluated in simulated lead-optimization and sequential screening experiments. The high accuracy on very small training sets is exploited to design an iterative simulated lead optimization experiment based on experimental evaluation of randomly selected small training sets. In each iteration, all compounds predicted to be weakly active are removed and the remaining compound set is enriched with highly potent compounds.
On this basis, a simulated sequential screening experiment shows that ECP-based ranking recovers 19% of available compounds while reducing the “experimental” effort to 0.2%. These findings illustrate the potential of sequential screening protocols and will hopefully increase the popularity of this relatively new methodology.
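As a rough illustration of the emerging-pattern idea the thesis builds on (class-specific combinations of discretized descriptor values whose support grows sharply from one class to the other), a brute-force sketch with hypothetical descriptor items such as "logP:high" follows; real ECP miners use far more efficient border-based search:

```python
from itertools import combinations

def support(pattern, dataset):
    """Fraction of records (frozensets of items) containing the pattern."""
    return sum(1 for rec in dataset if pattern <= rec) / len(dataset)

def emerging_patterns(pos, neg, min_growth=5.0, max_size=2):
    """Find item combinations whose support grows by at least
    min_growth from the negative class to the positive class.

    pos/neg are lists of frozensets of discretized descriptor items.
    """
    items = set().union(*pos)
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(sorted(items), size):
            pattern = frozenset(combo)
            sp, sn = support(pattern, pos), support(pattern, neg)
            # Growth rate: support ratio; infinite for jumping patterns
            # (present in the positive class, absent in the negative one).
            growth = float("inf") if sn == 0 and sp > 0 else (sp / sn if sn else 0)
            if growth >= min_growth:
                found.append((pattern, growth))
    return found
```

Patterns with infinite growth rate ("jumping" emerging patterns) dominate one class completely, matching the description of ECPs as regions of chemical space occupied by a single class.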

    Enumeration of Unrooted Trees Using the Frontier Method

    A Zero-Suppressed BDD (ZDD) is a data structure that represents families of sets compactly, and it is particularly efficient for sparse set families. ZDDs can often compress enormous set families effectively and support a variety of efficient operations on set data, and they have been applied in a wide range of fields, including VLSI design, constraint satisfaction problems for social infrastructure such as power grids and road networks, data mining, and genome data analysis. In recent years, the frontier method, a technique that uses ZDDs to enumerate graph structures satisfying various constraints, has attracted particular attention. The frontier method is a generalization of SIMPATH, an algorithm proposed by Knuth for enumerating the paths between two vertices of a graph, and it achieves substantial speedups over conventional approaches. In this work, we propose an algorithm that uses the frontier method to enumerate non-isomorphic unrooted trees. The proposed algorithm enumerates all unrooted trees with a given number of vertices n, outputting each isomorphism class only once. For the enumeration of unrooted trees, Nakano and Uno have proposed an algorithm whose computation time is constant per tree. If isomorphic duplicates are allowed, the task can easily be realized by enumerating spanning trees with the frontier method. Our method adopts the idea used in Nakano and Uno's algorithm for avoiding duplicate isomorphic trees, and thereby enumerates all non-isomorphic unrooted trees with a ZDD. By applying the frontier method, rather than searching the solution space and outputting solutions one by one as conventional approaches do, all solutions are represented and output compactly as a ZDD. Moreover, since the resulting solutions are represented in indexed form as a ZDD, solutions satisfying particular conditions can easily be extracted. The University of Electro-Communications, 201
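The ZDD structure described above can be sketched minimally: nodes are (variable, lo-child, hi-child) triples, terminals are 0 and 1, and the zero-suppression rule drops any node whose hi-edge leads to terminal 0, which is what makes sparse families cheap. This toy builder is only an illustration of the data structure, not the frontier method itself:

```python
class ZDD:
    """Toy ZDD: terminals are the ints 0 and 1; internal nodes are
    (var, lo, hi) tuples shared through a cache."""

    def __init__(self):
        self.cache = {}  # node-sharing table (uniqueness of nodes)

    def node(self, var, lo, hi):
        if hi == 0:          # zero-suppression rule: skip the node
            return lo
        key = (var, lo, hi)
        return self.cache.setdefault(key, key)

    def from_sets(self, family, universe):
        """Build a ZDD for a family of frozensets over an ordered universe."""
        if not universe:
            return 1 if frozenset() in family else 0
        var, rest = universe[0], universe[1:]
        lo = self.from_sets({s for s in family if var not in s}, rest)
        hi = self.from_sets({s - {var} for s in family if var in s}, rest)
        return self.node(var, lo, hi)

    def count(self, n):
        """Number of sets represented below node n."""
        if n in (0, 1):
            return n
        _, lo, hi = n
        return self.count(lo) + self.count(hi)
```

The indexed-output property mentioned in the abstract comes from this shape: once the solutions are a ZDD, counting or filtering them is a traversal of the diagram rather than a scan of an explicit solution list.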

    Strong compound-risk factors: Efficient discovery through emerging patterns and contrast sets

    Odds ratio (OR), relative risk (RR, also called risk ratio), and absolute risk reduction (ARR, also called risk difference) are biostatistical measures that are widely used for identifying significant risk factors in dichotomous groups of subjects. In the past, they have often been used to assess simple risk factors. In this paper, we introduce the concept of compound-risk factors to broaden the applicability of these statistical tests for assessing factor interplays. We observe that compound-risk factors with a high risk ratio or a large risk difference have a one-to-one correspondence to strong emerging patterns or strong contrast sets, two types of patterns that have been extensively studied in the data mining field. Such a relationship has been unknown to researchers in the past, and efficient algorithms for discovering strong compound-risk factors have been lacking. In this paper, we propose a theoretical framework and a new algorithm that unify the discovery of compound-risk factors that have a strong OR, risk ratio, or risk difference. Our method guarantees that all patterns meeting a given test threshold can be efficiently discovered. Our contribution thus represents the first of its kind in linking risk ratios and ORs to pattern mining algorithms, making it possible to find compound-risk factors in large-scale data sets. In addition, we show that using compound-risk factors can improve classification accuracy in probabilistic learning algorithms on several disease data sets, because these compound-risk factors capture the interdependency between important data attributes. © 2007 IEEE
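The three measures are computed from a standard 2x2 contingency table of exposure against outcome; a minimal sketch:

```python
def risk_measures(a, b, c, d):
    """Return (odds ratio, relative risk, absolute risk reduction)
    for a 2x2 table:
        a = exposed & diseased      b = exposed & healthy
        c = unexposed & diseased    d = unexposed & healthy
    """
    risk_exposed = a / (a + b)
    risk_unexposed = c / (c + d)
    odds_ratio = (a * d) / (b * c)
    relative_risk = risk_exposed / risk_unexposed      # risk ratio
    arr = risk_exposed - risk_unexposed                # risk difference
    return odds_ratio, relative_risk, arr

# Example: 30/100 exposed subjects diseased vs. 10/100 unexposed.
or_, rr, arr = risk_measures(30, 70, 10, 90)
print(or_, rr, arr)  # OR ≈ 3.857, RR = 3.0, ARR = 0.2
```

A compound-risk factor in the paper's sense simply conditions the "exposed" row on a conjunction of attributes instead of a single one; the cell counts, and hence the formulas, are unchanged.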

    Mining time-series data using discriminative subsequences

    Time-series data is abundant, and must be analysed to extract usable knowledge. Local-shape-based methods offer improved performance for many problems, and a comprehensible way of understanding both data and models. For time-series classification, we transform the data into a local-shape space using a shapelet transform. A shapelet is a time-series subsequence that is discriminative of the class of the original series. We use a heterogeneous ensemble classifier on the transformed data. The accuracy of our method is significantly better than the time-series classification benchmark (1-nearest-neighbour with dynamic time-warping distance), and significantly better than the previous best shapelet-based classifiers. We use two methods to increase interpretability. First, we cluster the shapelets using a novel, parameterless clustering method based on Minimum Description Length, reducing dimensionality and removing duplicate shapelets. Second, we transform the shapelet data into binary data reflecting the presence or absence of particular shapelets, a representation that is straightforward to interpret and understand. We supplement the ensemble classifier with partial classification. We generate rule sets on the binary-shapelet data, improving performance on certain classes, and revealing the relationship between the shapelets and the class label. To aid interpretability, we use a novel algorithm, BruteSuppression, that can substantially reduce the size of a rule set without negatively affecting performance, leading to a more compact, comprehensible model. Finally, we propose three novel algorithms for unsupervised mining of approximately repeated patterns in time-series data, testing their performance in terms of speed and accuracy on synthetic data, and on a real-world electricity-consumption device-disambiguation problem. We show that individual devices can be found automatically and in an unsupervised manner using a local-shape-based approach.
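The shapelet transform described above maps each series to its distances from a set of shapelets; a minimal sketch using the standard sliding-window minimum Euclidean distance (without the normalisation and early-abandoning refinements real implementations use):

```python
import math

def shapelet_distance(series, shapelet):
    """Minimum Euclidean distance between the shapelet and any
    equal-length subsequence of the series (sliding window)."""
    m = len(shapelet)
    best = math.inf
    for i in range(len(series) - m + 1):
        d = sum((series[i + j] - shapelet[j]) ** 2 for j in range(m))
        best = min(best, d)
    return math.sqrt(best)

def shapelet_transform(dataset, shapelets):
    """Each series becomes a fixed-length vector of distances to each
    shapelet; any standard classifier can then be trained on it."""
    return [[shapelet_distance(s, sh) for sh in shapelets] for s in dataset]
```

The binary representation mentioned in the abstract would then threshold each distance, recording only whether a shapelet is present in (i.e. close enough to) a series.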

    A Comprehensive Survey of Deep Learning in Remote Sensing: Theories, Tools and Challenges for the Community

    In recent years, deep learning (DL), a re-branding of neural networks (NNs), has risen to the top in numerous areas, namely computer vision (CV), speech recognition, natural language processing, etc. Whereas remote sensing (RS) possesses a number of unique challenges, primarily related to sensors and applications, RS inevitably draws from many of the same theories as CV, e.g., statistics, fusion, and machine learning, to name a few. This means that the RS community should be aware of, if not at the leading edge of, advancements like DL. Herein, we provide the most comprehensive survey of state-of-the-art RS DL research. We also review recent new developments in the DL field that can be used in DL for RS. Namely, we focus on theories, tools and challenges for the RS community. Specifically, we focus on unsolved challenges and opportunities as they relate to (i) inadequate data sets, (ii) human-understandable solutions for modelling physical phenomena, (iii) Big Data, (iv) non-traditional heterogeneous data sources, (v) DL architectures and learning algorithms for spectral, spatial and temporal data, (vi) transfer learning, (vii) an improved theoretical understanding of DL systems, (viii) high barriers to entry, and (ix) training and optimizing DL models. Comment: 64 pages, 411 references. To appear in Journal of Applied Remote Sensing