13,502 research outputs found
Subjectively Interesting Subgroup Discovery on Real-valued Targets
Deriving insights from high-dimensional data is one of the core problems in
data mining. The difficulty mainly stems from the fact that there are
exponentially many variable combinations to potentially consider, and there are
infinitely many if we consider weighted combinations, even for linear
combinations. Hence, an obvious question is whether we can automate the search
for interesting patterns and visualizations. In this paper, we consider the
setting where a user wants to learn as efficiently as possible about
real-valued attributes. For example, to understand the distribution of crime
rates in different geographic areas in terms of other (numerical, ordinal
and/or categorical) variables that describe the areas. We introduce a method to
find subgroups in the data that are maximally informative (in the formal
Information Theoretic sense) with respect to a single or set of real-valued
target attributes. The subgroup descriptions are in terms of a succinct set of
arbitrarily-typed other attributes. The approach is based on the Subjective
Interestingness framework FORSIED to enable the use of prior knowledge when
finding most informative non-redundant patterns, and hence the method also
supports iterative data mining.Comment: 12 pages, 10 figures, 2 tables, conference submissio
Finding Statistically Significant Interactions between Continuous Features
The search for higher-order feature interactions that are statistically
significantly associated with a class variable is of high relevance in fields
such as Genetics or Healthcare, but the combinatorial explosion of the
candidate space makes this problem extremely challenging in terms of
computational efficiency and proper correction for multiple testing. While
recent progress has been made regarding this challenge for binary features, we
here present the first solution for continuous features. We propose an
algorithm which overcomes the combinatorial explosion of the search space of
higher-order interactions by deriving a lower bound on the p-value for each
interaction, which enables us to massively prune interactions that can never
reach significance and to thereby gain more statistical power. In our
experiments, our approach efficiently detects all significant interactions in a
variety of synthetic and real-world datasets.Comment: 13 pages, 5 figures, 2 tables, accepted to the 28th International
Joint Conference on Artificial Intelligence (IJCAI 2019
Anytime Subgroup Discovery in Numerical Domains with Guarantees
International audienceSubgroup discovery is the task of discovering patterns that accurately discriminate a class label from the others. Existing approaches can uncover such patterns either through an exhaustive or an approximate exploration of the pattern search space. However, an exhaustive exploration is generally unfeasible whereas approximate approaches do not provide guarantees bounding the error of the best pattern quality nor the exploration progression ("How far are we of an exhaustive search"). We design here an algorithm for mining numerical data with three key properties w.r.t. the state of the art: (i) It yields progressively interval patterns whose quality improves over time; (ii) It can be interrupted anytime and always gives a guarantee bounding the error on the top pattern quality and (iii) It always bounds a distance to the exhaustive exploration. After reporting experimentations showing the effectiveness of our method, we discuss its generalization to other kinds of patterns
Report on the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2)
This technical report records and discusses the Second Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE2). The report includes a description of the alternative, experimental submission and review process, two workshop keynote presentations, a series of lightning talks, a discussion on sustainability, and five discussions from the topic areas of exploring sustainability; software development experiences; credit & incentives; reproducibility & reuse & sharing; and code testing & code review. For each topic, the report includes a list of tangible actions that were proposed and that would lead to potential change. The workshop recognized that reliance on scientific software is pervasive in all areas of world-leading research today. The workshop participants then proceeded to explore different perspectives on the concept of sustainability. Key enablers and barriers of sustainable scientific software were identified from their experiences. In addition, recommendations with new requirements such as software credit files and software prize frameworks were outlined for improving practices in sustainable software engineering. There was also broad consensus that formal training in software development or engineering was rare among the practitioners. Significant strides need to be made in building a sense of community via training in software and technical practices, on increasing their size and scope, and on better integrating them directly into graduate education programs. Finally, journals can define and publish policies to improve reproducibility, whereas reviewers can insist that authors provide sufficient information and access to data and software to allow them reproduce the results in the paper. Hence a list of criteria is compiled for journals to provide to reviewers so as to make it easier to review software submitted for publication as a “Software Paper.
Large sets of consecutive Maass forms and fluctuations in the Weyl remainder
We explore an algorithm which systematically finds all discrete eigenvalues
of an analytic eigenvalue problem. The algorithm is more simple and elementary
as could be expected before. It consists of Hejhal's identity, linearisation,
and Turing bounds. Using the algorithm, we compute more than one hundredsixty
thousand consecutive eigenvalues of the Laplacian on the modular surface, and
investigate the asymptotic and statistic properties of the fluctuations in the
Weyl remainder. We summarize the findings in two conjectures. One is on the
maximum size of the Weyl remainder, and the other is on the distribution of a
suitably scaled version of the Weyl remainder.Comment: A version with higher resolution figures can be downloaded from
http://www.maths.bris.ac.uk/~mahlt/research/T2012a.pd
- …