12,159 research outputs found

An application of minimum description length clustering to partitioning learning curves

Author: Lee M.
Navarro D.
Publication venue: 'Institute of Electrical and Electronics Engineers (IEEE)'
Publication date: 01/01/2005

© Copyright 2005 IEEE. We apply a Minimum Description Length–based clustering technique to the problem of partitioning a set of learning curves. The goal is to partition experimental data collected from different sources into groups of sources that are statistically the same. We solve this problem by defining statistical models for the data-generating processes, then partitioning them using the Normalized Maximum Likelihood criterion. Unlike many alternative model selection methods, this approach is optimal (in a minimax coding sense) for data of any sample size. We present an application of the method to the cognitive modeling problem of partitioning human learning curves for different categorization tasks.

Crossref

Adelaide Research & Scholarship

Squarepants in a Tree: Sum of Subtree Clustering and Hyperbolic Pants Decomposition

Author: Alstrup S.
Aluru S.
Bern M. W.
David Eppstein
Erickson J.
Saitou N.
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/02/2008

We provide efficient constant-factor approximation algorithms for the problems of finding a hierarchical clustering of a point set in any metric space, minimizing the sum of minimum spanning tree lengths within each cluster, and in the hyperbolic or Euclidean planes, minimizing the sum of cluster perimeters. Our algorithms for the hyperbolic and Euclidean planes can also be used to provide a pants decomposition, that is, a set of disjoint simple closed curves partitioning the plane minus the input points into subsets with exactly three boundary components, with approximately minimum total length. In the Euclidean case, these curves are squares; in the hyperbolic case, they combine our Euclidean square pants decomposition with our tree clustering method for general metric spaces. Comment: 22 pages, 14 figures. This version replaces the proof of what is now Lemma 5.2, as the previous proof was erroneous.
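To make the clustering objective in this abstract concrete, the following is a minimal sketch that only evaluates the cost (the sum of minimum spanning tree lengths over a given family of clusters) for points in the Euclidean plane. It is not the paper's approximation algorithm; the function names `mst_length` and `sum_of_mst_lengths` and the sample points are assumptions made for illustration.

```python
import math

def mst_length(points):
    """Total edge length of a Euclidean minimum spanning tree (Prim's algorithm, O(n^2))."""
    n = len(points)
    if n < 2:
        return 0.0
    in_tree = [False] * n
    best = [math.inf] * n  # cheapest known edge from the tree to each point
    best[0] = 0.0          # start the tree at the first point
    total = 0.0
    for _ in range(n):
        # Pick the point outside the tree that is closest to it.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        total += best[u]
        # Relax the connection cost of every remaining point.
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v] = d
    return total

def sum_of_mst_lengths(clusters):
    """Objective value: sum of MST lengths over the given clusters (hypothetical helper)."""
    return sum(mst_length(cluster) for cluster in clusters)

# Illustrative data only: two clusters in the plane.
clusters = [[(0, 0), (0, 1), (1, 1)], [(5, 5), (5, 7)]]
print(sum_of_mst_lengths(clusters))  # 4.0
```

For a hierarchical clustering, the same helper could be applied to the list of all clusters in the hierarchy; finding a hierarchy that approximately minimizes this value is the part the paper addresses.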

arXiv.org e-Print Archive

Crossref

A Multiscale Approach for Statistical Characterization of Functional Images

Author: Antoniadis Anestis
Bigot Jérémie
Von Sachs Rainer
Publication venue: 'Informa UK Limited'
Publication date: 01/01/2009

Increasingly, scientific studies yield functional image data, in which the observed data consist of sets of curves recorded on the pixels of the image. Examples include temporal brain response intensities measured by fMRI and NMR frequency spectra measured at each pixel. This article presents a new methodology for improving the characterization of pixels in functional imaging, formulated as a spatial curve clustering problem. Our method operates on curves as a unit. It is nonparametric and involves multiple stages: (i) wavelet thresholding, aggregation, and Neyman truncation to effectively reduce dimensionality; (ii) clustering based on an extended EM algorithm; and (iii) multiscale penalized dyadic partitioning to create a spatial segmentation. We motivate the different stages with theoretical considerations and arguments, and illustrate the overall procedure on simulated and real datasets. Our method appears to offer substantial improvements over monoscale pixel-wise methods. An appendix giving some theoretical justifications of the methodology, along with computer code, documentation, and the dataset, is available in the online supplements.

Scientific Publications of the University of Toulouse II Le Mirail

Hal - Université Grenoble Alpes

Open Archive Toulouse Archive Ouverte

HAL-INSA Toulouse

DIAL UCLouvain

Hal-Diderot

Towards an Efficient Discovery of the Topological Representative Subgraphs

Author: Dhifli Wajdi
Moussaoui Mohamed
Nguifo Engelbert Mephu
Saidi Rabie
Publication date: 01/01/2013

With the emergence of graph databases, the task of frequent subgraph discovery has been extensively addressed. Although the approaches proposed in the literature have made this task feasible, the number of discovered frequent subgraphs is still too high for them to be used efficiently in any further exploration. Feature selection for graph data is a way to reduce the high number of frequent subgraphs based on exact or approximate structural similarity. However, current structural similarity strategies are not efficient enough in many real-world applications, and the combinatorial nature of graphs makes them computationally very costly. In order to select a smaller yet structurally irredundant set of subgraphs, we propose a novel approach that mines the top-k topological representative subgraphs among the frequent ones. Our approach allows detecting hidden structural similarities that existing approaches are unable to detect, such as the density or the diameter of the subgraph. In addition, it can be easily extended using any user-defined structural or topological attributes depending on the sought properties. Empirical studies on real and synthetic graph datasets show that our approach is fast and scalable.
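The two topological attributes named in this abstract, density and diameter, can be computed for a small subgraph as sketched below. This is an illustrative sketch only: the adjacency-dictionary representation, the function names, and the example graphs are assumptions, and the paper's top-k selection procedure itself is not shown here.

```python
from collections import deque

def density(adj):
    """Edge density of an undirected simple graph: 2m / (n (n - 1))."""
    n = len(adj)
    if n < 2:
        return 0.0
    m = sum(len(neighbours) for neighbours in adj.values()) // 2  # each edge counted twice
    return 2 * m / (n * (n - 1))

def diameter(adj):
    """Longest shortest-path length (in edges) of a connected graph, via BFS from every node."""
    longest = 0
    for source in adj:
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        if len(dist) != len(adj):
            raise ValueError("graph is not connected")
        longest = max(longest, max(dist.values()))
    return longest

def attribute_vector(adj):
    """Hypothetical description of a subgraph by (density, diameter)."""
    return (density(adj), diameter(adj))

# Illustrative subgraphs: a 3-node path and a triangle.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
triangle = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}}
print(attribute_vector(path), attribute_vector(triangle))
```

Subgraphs with similar attribute vectors could then be grouped and one representative kept per group, which is one plausible reading of how such attributes reduce a large set of frequent subgraphs.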

arXiv.org e-Print Archive

HAL Clermont Université

Proceedings of the 2011 New York Workshop on Computer, Earth and Space Science

Author: Naud Catherine
Way Michael J.
Publication date: 11/04/2011

The purpose of the New York Workshop on Computer, Earth and Space Sciences is to bring together the New York area's finest Astronomers, Statisticians, Computer Scientists, Space and Earth Scientists to explore potential synergies between their respective fields. The 2011 edition (CESS2011) was a great success, and we would like to thank all of the presenters and participants for attending. This year was also special as it included authors from the upcoming book titled "Advances in Machine Learning and Data Mining for Astronomy". Over two days, the latest advanced techniques used to analyze the vast amounts of information now available for the understanding of our universe and our planet were presented. These proceedings attempt to provide a small window into the current state of research in this vast interdisciplinary field, and we'd like to thank the speakers who spent the time to contribute to this volume. Comment: Author lists modified. 82 pages. Workshop Proceedings from CESS 2011 in New York City, Goddard Institute for Space Studies.

arXiv.org e-Print Archive

CERN Document Server

Multivariate Approaches to Classification in Extragalactic Astronomy

Author: Chattopadhyay Asis Kumar
Fraix-Burnet Didier
Thuillard Marc
Publication venue: 'Frontiers Media SA'
Publication date: 01/01/2015

Clustering objects into synthetic groups is a natural activity of any science. Astrophysics is not an exception and is now facing a deluge of data. For galaxies, the one-century-old Hubble classification and the Hubble tuning fork are still largely in use, together with numerous mono- or bivariate classifications most often made by eye. However, a classification must be driven by the data, and sophisticated multivariate statistical tools are used more and more often. In this paper we review these different approaches in order to situate them in the general context of unsupervised and supervised learning. We emphasize the astrophysical outcomes of these studies to show that multivariate analyses provide an obvious path toward a renewal of our classification of galaxies and are invaluable tools to investigate the physics and evolution of galaxies. Comment: Open Access paper. http://www.frontiersin.org/milky_way_and_galaxies/10.3389/fspas.2015.00003/abstract. DOI: 10.3389/fspas.2015.00003

arXiv.org e-Print Archive

Hal - Université Grenoble Alpes

Frontiers - Publisher Connector

HAL Descartes

HAL-INSU

HAL Université de Savoie

Recent Advances in Graph Partitioning

Author: A Buluç
A Felner
A George
A Lisser
A Pothen
A Trifunović
AB Kahng
AE Feldmann
AH Land
AJ Soper
B Brandfass
B Hendrickson
B Junker
B Monien
B Peng
BW Kernighan
C Aykanat
C Chevalier
C Farhat
C Lanczos
C Walshaw
CE Bichot
CE Ferreira
D Delling
D Drake
D Luxen
D Ron
D Wagner
DA Papa
DE Drake Vinkemeier
E Jeannot
E Rolland
F Comellas
F Glover
F Pellegrini
F Schulz
FT Leighton
G Even
G Karypis
G Zumbusch
H Li
H Meyerhenke
HD Simon
I Moulitsas
I Safro
J Chen
J Cong
J Fietz
J Hromkovič
J Hungershöfer
J Maue
J Shalf
JR Gilbert
K Andreev
K Lang
K Schloegel
KS Camilus
L Brunetta
L Grady
L Lovász
LA Sanchis
LR Ford
M Armbruster
M Bader
M Birn
M Fiedler
M Jerrum
M Newman
M Sellmann
M Zhou
MR Garey
N Sensen
O Goldschmidt
P Chardaire
P Galinier
P Korosec
P Sanders
R Diekmann
R Glantz
R Preis
RD Williams
S Arora
S Huang
S Lafon
S Lloyd
S Pettie
SE Karisch
SY Chan
T Bui
T Kieritz
U Benlic
U Feige
V Osipov
WE Donath
WW Hager
X Sui
Y Low
YM Kim
Ü Çatalyürek
Publication date: 03/02/2015

We survey recent trends in practical algorithms for balanced graph partitioning, together with applications and future research directions.

arXiv.org e-Print Archive

Crossref

eScholarship - University of California

FlashProfile: A Framework for Synthesizing Data Profiles

Author: Gulwani Sumit
Jain Prateek
Millstein Todd
Padhi Saswat
Perelman Daniel
Polozov Oleksandr
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 24/10/2018

We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ~0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 2018
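As a rough illustration of what grouping strings by syntactic format means, the sketch below assigns each string a coarse character-class signature and groups strings that share one. This corresponds to the fixed, pre-defined pattern style that the abstract attributes to prior techniques, not to FlashProfile's synthesis approach; the class symbols (`D`, `L`, `S`), the function names, and the sample strings are assumptions made for illustration.

```python
from collections import defaultdict
from itertools import groupby

RUN_CLASSES = {"D", "L", "S"}  # digit, letter, whitespace (assumed class symbols)

def char_class(ch):
    """Map a character to a coarse class; punctuation is kept literally."""
    if ch.isdigit():
        return "D"
    if ch.isalpha():
        return "L"
    if ch.isspace():
        return "S"
    return ch

def signature(s):
    """Collapse runs of the same class, e.g. '2018-10-24' becomes 'D+-D+-D+'."""
    parts = []
    for cls, run in groupby(char_class(ch) for ch in s):
        if cls in RUN_CLASSES:
            parts.append(cls + "+")
        else:
            parts.extend(cls for _ in run)  # keep each punctuation character
    return "".join(parts)

def profile(strings):
    """Group strings by signature; each key acts as a pattern describing its group."""
    groups = defaultdict(list)
    for s in strings:
        groups[signature(s)].append(s)
    return dict(groups)

# Illustrative input: dates in two syntactic formats.
for pattern, members in profile(["2018-10-24", "2005-01-01", "24 Oct 2018"]).items():
    print(pattern, members)
```

A fixed signature like this gives no control over granularity: every distinct signature becomes its own group, which is the limitation the abstract says a requested number of clusters is meant to address.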

arXiv.org e-Print Archive

eScholarship - University of California