Shapley and Banzhaf Vectors of a Formal Concept

Author: Ignatov Dmitry I.
Publication venue: CEUR-WS.org
Publication date: 01/01/2020

We propose the usage of two power indices from cooperative game theory and public choice theory for ranking attributes of closed sets, namely intents of formal concepts (or closed itemsets). The introduced indices are related to extensional concept stability and based on counting generators, especially those that contain a selected attribute. The introduction of such indices is motivated by the so-called interpretable machine learning, which supposes that we do not only have the class membership decision of a trained model for a particular object, but also a set of attributes (in the form of JSM-hypotheses or other patterns) along with individual importance of their single attributes (or more complex constituent elements). We characterise computation of Shapley and Banzhaf values of a formal concept in terms of minimal generators and their order filters, provide the reader with their properties important for computation purposes, and show experimental results
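As a rough illustration of the indices described in this abstract, the sketch below computes Shapley and Banzhaf values for the attributes of one intent by brute-force enumeration. It assumes a simple game in which a coalition of attributes has value 1 exactly when its closure equals the intent (that is, when it is a generator) and 0 otherwise. The toy context, the function names, and this choice of characteristic function are illustrative assumptions, not the paper's implementation.

```python
from itertools import combinations
from math import factorial

# Toy formal context (hypothetical data): object -> set of attributes.
CONTEXT = {
    "g1": {"a", "b", "c"},
    "g2": {"a", "b", "c"},
    "g3": {"a", "d"},
    "g4": {"b", "d"},
}
ALL_ATTRIBUTES = set().union(*CONTEXT.values())


def closure(attrs):
    """Return attrs'': the attributes shared by all objects having every attribute in attrs."""
    extent = [g for g, row in CONTEXT.items() if attrs <= row]
    if not extent:
        return set(ALL_ATTRIBUTES)
    return set.intersection(*(CONTEXT[g] for g in extent))


def value(coalition, intent):
    """Assumed characteristic function: 1 if the coalition generates the intent, else 0."""
    return 1 if closure(set(coalition)) == intent else 0


def shapley_and_banzhaf(intent):
    """Enumerate all coalitions; exponential in len(intent), so only for small intents."""
    n = len(intent)
    shapley, banzhaf = {}, {}
    for m in intent:
        others = sorted(intent - {m})
        sh = bz = 0.0
        for k in range(n):
            for s in combinations(others, k):
                # Marginal contribution of attribute m to coalition s.
                marginal = value(set(s) | {m}, intent) - value(s, intent)
                sh += factorial(k) * factorial(n - k - 1) / factorial(n) * marginal
                bz += marginal / 2 ** (n - 1)
        shapley[m], banzhaf[m] = sh, bz
    return shapley, banzhaf


INTENT = {"a", "b", "c"}  # a closed set in the toy context
shapley, banzhaf = shapley_and_banzhaf(INTENT)
```

In this toy context the attribute `c` alone generates the intent, so it receives the largest share under both indices. The Shapley values sum to 1 because the full intent is itself a generator.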

Berner Fachhochschule: ARBOR

Visual Landmark Recognition from Internet Photo Collections: A Large-Scale Evaluation

Author: Leibe Bastian
Weyand Tobias
Publication venue: 'Elsevier BV'
Publication date: 18/09/2014

The task of a visual landmark recognition system is to identify photographed buildings or objects in query photos and to provide the user with relevant information on them. With their increasing coverage of the world's landmark buildings and objects, Internet photo collections are now being used as a source for building such systems in a fully automatic fashion. This process typically consists of three steps: clustering large amounts of images by the objects they depict; determining object names from user-provided tags; and building a robust, compact, and efficient recognition index. To this date, however, there is little empirical information on how well current approaches for those steps perform in a large-scale open-set mining and recognition task. Furthermore, there is little empirical information on how recognition performance varies for different types of landmark objects and where there is still potential for improvement. With this paper, we intend to fill these gaps. Using a dataset of 500k images from Paris, we analyze each component of the landmark recognition pipeline in order to answer the following questions: How many and what kinds of objects can be discovered automatically? How can we best use the resulting image clusters to recognize the object in a query? How can the object be efficiently represented in memory for recognition? How reliably can semantic information be extracted? And finally: What are the limiting factors in the resulting pipeline from query to semantics? We evaluate how different choices of methods and parameters for the individual pipeline steps affect overall system performance and examine their effects for different query categories such as buildings, paintings or sculptures

arXiv.org e-Print Archive

Publikationsserver der RWTH Aachen University

Using Rule Induction to Elucidate Co-Occurrence Patterns in Microbial Data

Author: Thurimella Kumar
Publication venue: CU Scholar
Publication date: 10/04/2013

Hundreds of studies have addressed whether the presence or absence of certain bacteria is linked with a particular phenotype. However, it is plausible that the causative agent (or the consequence) of a given phenotype is not a single type of microbe, but groups of them, perhaps in specific combinations. Rule Induction is a commonly used machine learning method to infer structure within observational data, and build rules to represent these structures. In this thesis I introduce the application of a method, Rule Induction, to infer co-occurrence patterns in microbial data. First, I benchmark the methods within Rule Induction, to assess how rules are generated with regard to several parameters such as table density, support and confidence. I then subsample data over multiple iterations to understand how robust the produced rules are to variation due to sampling. Next, I provide insight into different biological variables and examine their effect on the rules produced. I compare 16S rRNA regions, specifically the V1-3 and V3-5 regions. I compare different sequencing technologies, specifically 454 and Illumina. Finally, I compare time points, specifically looking over a time frame of 400 days. Within all these comparisons I aim to understand the differences, but more importantly what is conserved when these samples are stratified by these variables in terms of the generated rules. Finally, I explore Rule Induction using two microbial datasets, and compare the rules to already-known associations. The first dataset I interpret identifies a correlation between HIV and the Gut Microbiome. The second dataset distinguishes the Gut Microbiome over varying geographical locations. I link each of the rules produced from each dataset with taxonomic information and consolidate those rules to give rise to the underlying structure within the biological data
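The support and confidence parameters mentioned in this abstract can be illustrated with a minimal sketch over a presence/absence table. The sample table, the taxa chosen, and the function names below are hypothetical and are not drawn from the thesis.

```python
# Hypothetical presence/absence table: sample -> set of taxa observed in it.
SAMPLES = {
    "s1": {"Bacteroides", "Prevotella"},
    "s2": {"Bacteroides", "Prevotella", "Roseburia"},
    "s3": {"Bacteroides"},
    "s4": {"Roseburia"},
}


def support(itemset):
    """Fraction of samples that contain every taxon in itemset."""
    hits = sum(1 for taxa in SAMPLES.values() if itemset <= taxa)
    return hits / len(SAMPLES)


def confidence(antecedent, consequent):
    """Estimate of P(consequent | antecedent): support(A and C together) / support(A)."""
    base = support(antecedent)
    return support(antecedent | consequent) / base if base else 0.0


# Rule "Bacteroides -> Prevotella" on the toy table.
rule_support = support({"Bacteroides", "Prevotella"})
rule_confidence = confidence({"Bacteroides"}, {"Prevotella"})
```

Here the two taxa co-occur in two of four samples, and two of the three samples containing the antecedent also contain the consequent.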

CU Scholar Institutional Repository

Interactive Data Exploration with Smart Drill-Down

Author: Garcia-Molina Hector
Joglekar Manas
Parameswaran Aditya
Publication venue
Publication date: 01/05/2016

We present smart drill-down, an operator for interactively exploring a relational table to discover and summarize "interesting" groups of tuples. Each group of tuples is described by a rule. For instance, the rule $(a, b, \star, 1000)$ tells us that there are a thousand tuples with value $a$ in the first column and $b$ in the second column (and any value in the third column). Smart drill-down presents an analyst with a list of rules that together describe interesting aspects of the table. The analyst can tailor the definition of interesting, and can interactively apply smart drill-down on an existing rule to explore that part of the table. We demonstrate that the underlying optimization problems are NP-hard, and describe an algorithm for finding the approximately optimal list of rules to display when the user uses a smart drill-down, and a dynamic sampling scheme for efficiently interacting with large tables. Finally, we perform experiments on real datasets on our experimental prototype to demonstrate the usefulness of smart drill-down and study the performance of our algorithms
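The rule notation in this abstract (fixed values in some columns, a star for "any value", and a trailing tuple count) can be sketched as follows. This only shows how one rule's count could be obtained from a small in-memory table. The table contents and helper names are hypothetical, and the sketch does not attempt the paper's rule-selection algorithm or sampling scheme.

```python
WILDCARD = "*"  # plays the role of the star symbol in the rule notation

# Hypothetical three-column table.
TABLE = [
    ("a", "b", "x"),
    ("a", "b", "y"),
    ("a", "c", "x"),
    ("d", "b", "x"),
]


def matches(row, pattern):
    """A row matches when every non-wildcard position of the pattern equals the row value."""
    return all(p == WILDCARD or p == v for p, v in zip(pattern, row))


def rule_with_count(pattern, table):
    """Return the pattern extended with the number of matching tuples."""
    count = sum(1 for row in table if matches(row, pattern))
    return (*pattern, count)


rule = rule_with_count(("a", "b", WILDCARD), TABLE)
```

On this toy table the resulting rule is `("a", "b", "*", 2)`, since two tuples have `a` in the first column and `b` in the second.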

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California

Efficient Mining of Subsample-Stable Graph Patterns

Author: Buzmakov Aleksey
Kuznetsov Sergei
Napoli Amedeo
Publication venue: HAL CCSD
Publication date: 18/11/2017

A scalable method for mining graph patterns stable under subsampling is proposed. The existing subsample stability and robustness measures are not antimonotonic according to definitions known so far. We study a broader notion of antimonotonicity for graph patterns, so that measures of subsample stability become antimonotonic. Then we propose gSOFIA for mining the most subsample-stable graph patterns. The experiments on numerous graph datasets show that gSOFIA is very efficient for discovering subsample-stable graph patterns

INRIA a CCSD electronic archive server

Scalable Multi-label Classification

Author: Geoff Holmes
Jesse Read
Supervised Bernhard Pfahringer
Publication venue: 'University of Waikato'
Publication date: 01/01/2010

Multi-label classification is relevant to many domains, such as text, image and other media, and bioinformatics. Researchers have already noticed that in multi-label data, correlations exist between labels, and a variety of approaches, drawing inspiration from many spheres of machine learning, have been able to model these correlations. However, data sources from the real world are growing ever larger and the multi-label task is particularly sensitive to this due to the complexity associated with multiple labels and the correlations between them. Consequently, many methods do not scale up to large problems. This thesis deals with scalable multi-label classification: methods which exhibit high predictive performance, but are also able to scale up to larger problems. The first major contribution is the pruned sets method, which is able to model label correlations directly for high predictive performance, but reduces overfitting and complexity over related methods by pruning and subsampling label sets, and can thus scale up to larger datasets. The second major contribution is the classifier chains method, which models correlations with a chain of binary classifiers. The use of binary models allows for scalability to even larger datasets. Pruned sets and classifier chains are robust with respect to both the variety and scale of data that they can deal with, and can be incorporated into other methods. In an ensemble scheme, these methods are able to compete with state-of-the-art methods in terms of predictive performance as well as scale up to large datasets of hundreds of thousands of training examples. This thesis also puts a special emphasis on multi-label evaluation; introducing a new evaluation measure and studying threshold calibration. 
With one of the largest and most varied collections of multi-label datasets in the literature, extensive experimental evaluation shows the advantage of these methods, both in terms of predictive performance, and computational efficiency and scalability

CiteSeerX

Research Commons@Waikato

Fast Generation of Best Interval Patterns for Nonmonotonic Constraints

Author: A Buzmakov
B Ganter
C Roth
F Moerchen
GI Webb
H Yao
J Cao
J Vreeken
N Pasquier
N Tatti
SO Kuznetsov
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 16/06/2015

In pattern mining, the main challenge is the exponential explosion of the set of patterns. Typically, to solve this problem, a constraint for pattern selection is introduced. One of the first constraints proposed in pattern mining is support (frequency) of a pattern in a dataset. Frequency is an anti-monotonic function, i.e., given an infrequent pattern, all its superpatterns are not frequent. However, many other constraints for pattern selection are neither monotonic nor anti-monotonic, which makes it difficult to generate patterns satisfying these constraints. In this paper we introduce the notion of "generalized monotonicity" and the Sofia algorithm, which allow generating best patterns in polynomial time for some nonmonotonic constraints modulo constraint computation and pattern extension operations. In particular, this algorithm is polynomial for data on itemsets and interval tuples. In this paper we consider stability and delta-measure, which are nonmonotonic constraints, and apply them to interval tuple datasets. In the experiments, we compute best interval tuple patterns w.r.t. these measures and show the advantage of our approach over postfiltering approaches
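The anti-monotonicity of support stated in this abstract can be checked exhaustively on a small itemset dataset. The sketch below is only an illustration with hypothetical transactions and function names. It does not implement the Sofia algorithm or the stability and delta-measure constraints.

```python
from itertools import combinations

# Hypothetical transaction dataset.
TRANSACTIONS = [
    {"a", "b", "c"},
    {"a", "b"},
    {"a", "c"},
    {"b", "d"},
]


def support(pattern):
    """Number of transactions that contain every item of the pattern."""
    return sum(1 for t in TRANSACTIONS if pattern <= t)


def is_antimonotonic_on(items):
    """Check support(sub) >= support(sup) for every pair sub <= sup of itemsets over items."""
    patterns = [
        frozenset(c)
        for k in range(len(items) + 1)
        for c in combinations(sorted(items), k)
    ]
    return all(
        support(p) >= support(q)
        for p in patterns
        for q in patterns
        if p <= q
    )


ITEMS = {"a", "b", "c", "d"}
holds = is_antimonotonic_on(ITEMS)
```

Because adding items can only shrink the set of matching transactions, the check succeeds. This is the property that lets a miner discard every superpattern of an infrequent pattern.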

arXiv.org e-Print Archive

Crossref

INRIA a CCSD electronic archive server