Nonparametric Feature Extraction from Dendrograms
We propose nonparametric feature extraction from dendrograms. Minimax
distance measures correspond to building a dendrogram with the single
linkage criterion and defining specific forms of a level function and a
distance function over it. We therefore extend this method to arbitrary
dendrograms. We develop a generalized framework wherein different distance
measures can be inferred from different types of dendrograms, level functions
and distance functions. Via an appropriate embedding, we compute a vector-based
representation of the inferred distances, in order to enable many numerical
machine learning algorithms to employ such distances. Then, to address the
model selection problem, we study the aggregation of different dendrogram-based
distances respectively in solution space and in representation space in the
spirit of deep representations. In the first approach, for example for the
clustering problem, we build a graph with positive and negative edge weights
according to the consistency of the clustering labels of different objects
among different solutions, in the context of ensemble methods. Then, we use an
efficient variant of correlation clustering to produce the final clusters. In
the second approach, we investigate the sequential combination of different
distances and features in the spirit of multi-layered
architectures to obtain the final features. Finally, we demonstrate the
effectiveness of our approach via several numerical studies.
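The link between Minimax distances and single linkage can be illustrated with a minimal sketch. It assumes the level function is simply the merge height: components are merged in order of increasing edge weight, and the weight at which two points first share a component is taken as their Minimax distance. The function name `minimax_distances` and the toy data are illustrative assumptions, not the authors' implementation.

```python
from itertools import combinations

def minimax_distances(dist):
    """Pairwise Minimax distances from a symmetric distance matrix.

    Components are merged in order of increasing edge weight (single
    linkage). The weight at which two points first fall into the same
    component is recorded as their Minimax distance.
    """
    n = len(dist)
    members = {i: [i] for i in range(n)}   # component id -> member points
    comp = list(range(n))                  # point -> component id
    mm = [[0.0] * n for _ in range(n)]
    edges = sorted((dist[i][j], i, j) for i, j in combinations(range(n), 2))
    for w, i, j in edges:
        a, b = comp[i], comp[j]
        if a == b:
            continue  # already joined at a lower level
        # Every cross pair of the two components first meets at level w.
        for p in members[a]:
            for q in members[b]:
                mm[p][q] = mm[q][p] = w
        # Merge the smaller component into the larger one.
        if len(members[a]) < len(members[b]):
            a, b = b, a
        for q in members[b]:
            comp[q] = a
        members[a].extend(members.pop(b))
    return mm

# Toy data (assumption): four points on a line at 0, 1, 2 and 10.
points = [0.0, 1.0, 2.0, 10.0]
dist = [[abs(p - q) for q in points] for p in points]
mm = minimax_distances(dist)
```

In this toy case the direct distance between the first and third points is 2, but the path through the second point has a largest hop of 1, so their Minimax distance is smaller than their direct distance.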
FlashProfile: A Framework for Synthesizing Data Profiles
We address the problem of learning a syntactic profile for a collection of
strings, i.e. a set of regex-like patterns that succinctly describe the
syntactic variations in the strings. Real-world datasets, typically curated
from multiple sources, often contain data in various syntactic formats. Thus,
any data processing task is preceded by the critical step of data format
identification. However, manual inspection of data to identify the different
formats is infeasible in standard big-data scenarios.
Prior techniques are restricted to a small set of pre-defined patterns (e.g.
digits, letters, words, etc.), and provide no control over granularity of
profiles. We define syntactic profiling as a problem of clustering strings
based on syntactic similarity, followed by identifying patterns that succinctly
describe each cluster. We present a technique for synthesizing such profiles
over a given language of patterns, that also allows for interactive refinement
by requesting a desired number of clusters.
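To make the two steps of the problem statement concrete (group strings by syntactic similarity, then describe each group with a pattern), here is a deliberately naive sketch. It uses a fixed set of coarse token classes, which is the kind of pre-defined-pattern baseline the abstract contrasts itself with. It is not FlashProfile's synthesis technique, and `TOKEN_CLASSES`, `signature` and `profile` are hypothetical names.

```python
import re
from collections import defaultdict

# Hypothetical coarse token classes; an assumption for illustration only.
TOKEN_CLASSES = [
    ("Digit+", re.compile(r"[0-9]+")),
    ("Upper+", re.compile(r"[A-Z]+")),
    ("Lower+", re.compile(r"[a-z]+")),
    ("Space+", re.compile(r"\s+")),
]

def signature(s):
    """Map a string to a tuple of coarse tokens; other characters stay literal."""
    out, pos = [], 0
    while pos < len(s):
        for name, rx in TOKEN_CLASSES:
            m = rx.match(s, pos)
            if m:
                out.append(name)
                pos = m.end()
                break
        else:
            # No class matched: keep the character itself as a literal token.
            out.append(repr(s[pos]))
            pos += 1
    return tuple(out)

def profile(strings):
    """Group strings that share the same coarse signature."""
    groups = defaultdict(list)
    for s in strings:
        groups[signature(s)].append(s)
    return dict(groups)

# Toy column (assumption) mixing three syntactic formats.
data = ["1998-03", "2004-11", "Mar 1998", "Nov 2004", "n/a"]
clusters = profile(data)
```

Because the token classes are fixed, this baseline offers no control over granularity; a user cannot ask it for a coarser or finer set of clusters, which is the limitation the abstract attributes to prior techniques.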
Using a state-of-the-art inductive synthesis framework, PROSE, we have
implemented our technique as FlashProfile. Across tasks over large
real datasets, we observe a median profiling time of only s.
Furthermore, we show that access to syntactic profiles may allow for more
accurate synthesis of programs, i.e. using fewer examples, in
programming-by-example (PBE) workflows such as FlashFill.
Comment: 28 pages, SPLASH (OOPSLA) 201
Learning representations from dendrograms
We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with the single linkage criterion and defining specific forms of a level function and a distance function over it. We therefore extend this method to arbitrary dendrograms. We develop a generalized framework wherein different distance measures and representations can be inferred from different types of dendrograms, level functions and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances so that many numerical machine learning algorithms can employ them. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances in solution space and in representation space, respectively, in the spirit of deep representations. In the first approach, taking the clustering problem as an example, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the combination of different distances and features sequentially in the spirit of multi-layered architectures to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
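The solution-space aggregation can be sketched as follows, under stated assumptions. Each pair of objects gets +1 for every solution that puts them in the same cluster and -1 for every solution that separates them. The abstract does not specify which correlation clustering variant is used, so the sketch substitutes the simple randomized pivot heuristic; `signed_weights` and `pivot_clustering` are hypothetical names.

```python
import random

def signed_weights(labelings):
    """Sum +1 when a pair shares a label in a solution and -1 when it does not."""
    n = len(labelings[0])
    w = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(i + 1, n):
                s = 1 if labels[i] == labels[j] else -1
                w[i][j] += s
                w[j][i] += s
    return w

def pivot_clustering(w, seed=0):
    """Pivot heuristic for correlation clustering (a swapped-in stand-in).

    Repeatedly pick an unassigned pivot and group it with every
    unassigned node that has a positive edge to it.
    """
    rng = random.Random(seed)
    remaining = list(range(len(w)))
    rng.shuffle(remaining)
    clusters = []
    while remaining:
        pivot = remaining.pop(0)
        cluster = [pivot] + [v for v in remaining if w[pivot][v] > 0]
        remaining = [v for v in remaining if w[pivot][v] <= 0]
        clusters.append(sorted(cluster))
    return sorted(clusters)

# Toy ensemble (assumption): three clusterings of five objects; the
# second solution disagrees with the others about object 2.
labelings = [
    [0, 0, 0, 1, 1],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
w = signed_weights(labelings)
final = pivot_clustering(w)
```

In this toy ensemble the majority vote keeps object 2 with objects 0 and 1, since its edges to them sum to +1 while its edges to objects 3 and 4 sum to -1.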
Investigation of methods for machine learning associations between genetic variations and phenotype
The relationship between genetics and phenotype is a complex one that remains poorly understood. Many factors contribute to the relationship between genetic variations and differences in phenotype. An improved understanding of the genetic underpinnings of various phenotypes can help us make important advances in testing for, preventing, treating, and curing a number of diseases and disorders.
The recent popularization of direct-to-consumer sequencing services, coupled with consumers releasing their genetic information for public use, has led to an unprecedented level of access to genetic information. Crowd-sourcing the problem of developing robust genome-wide association techniques for ever larger amounts of data is a promising trend.
This thesis explores candidate methods for data mining one such public genetic data repository, openSNP, for correlated genotypes and phenotypes. Particular care is given to data clean-up and the steps required to preprocess public data for machine learning. The preprocessing methods are detailed in such a way that they may be applied to other genetic data repositories that already exist, for example the Personal Genome Project, as well as genetic data repositories that may become available in the future. Following data clean-up, a number of machine learning techniques are investigated, applied, and assessed for their utility in such a big-data problem. No single machine learning approach was found to be sufficient; the combination of imbalanced phenotype response classes and an underdetermined system led to a difficult machine learning challenge. Additional techniques must be explored or developed in order to make such genome-wide association studies possible and meaningful.
Light-weight ontologies for scrutable user modelling
This thesis is concerned with the ways light-weight ontologies can support scrutability for large user models and the user modelling process. It explores the role that light-weight ontologies can play, and how they can be exploited, for the purpose of creating and maintaining large, scrutable user models consisting of hundreds of components. We address problems in four key areas: ontology creation, metadata annotation, creation and maintenance of large user models, and user model visualisation, with the goal of providing a simple and adaptable approach that maintains scrutability. Each of these key areas presents a number of challenges that we address. Our solution is the development of a toolkit, LOSUM, which consists of a number of tools to support the user modelling process. It incorporates light-weight ontologies to fulfill a number of roles: aiding in metadata creation, providing structure for large user model visualisation, and serving as a means to reason across granularities in the user model. In conjunction with this, LOSUM also features a novel visualisation tool, SIV, which performs a dual role of ontology and user model visualisation, supporting the process of ontology creation, metadata annotation, and user model visualisation. We evaluated our approach at each stage with small user studies, and conducted a large scale integrative evaluation of these approaches together in an authentic learning context with 114 students, of whom 77 had exposure to their learner models through SIV. The results showed that students could use the interface and understand the process of user model construction. The flexibility and adaptability of the toolkit have also been demonstrated in its deployment in several other application areas.
Skeletal camera network embedded structure-from-motion for 3D scene reconstruction from UAV images
Structure-from-Motion (SfM) techniques have been widely used for 3D scene reconstruction from multi-view images. However, due to the large computational costs of SfM methods, processing highly overlapping images, e.g. images from unmanned aerial vehicles (UAV), is a major challenge. This paper embeds a novel skeletal camera network (SCN) into SfM to enable efficient 3D scene reconstruction from a large set of UAV images. First, the flight control data are used within a weighted graph to construct a topologically connected camera network (TCN) to determine the spatial connections between UAV images. Second, the TCN is refined using a novel hierarchical degree bounded maximum spanning tree to generate an SCN, which contains a subset of edges from the TCN and ensures that each image is involved in at least a 3-view configuration. Third, the SCN is embedded into the SfM to produce a novel SCN-SfM method, which performs tie-point matching only for the image pairs that are actually connected. The proposed method was applied in three experiments with images from two fixed-wing UAVs and an octocopter UAV, respectively. In addition, the SCN-SfM method was compared to three other methods for image connectivity determination. The comparison shows a significant reduction in the number of matched images when our method is used, which leads to lower computational cost. At the same time, the achieved scene completeness and geometric accuracy are comparable.
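The idea of thinning a weighted camera graph with a degree-bounded maximum spanning tree can be sketched with a plain Kruskal-style greedy heuristic. This is not the paper's hierarchical construction and does not enforce its 3-view guarantee. The function name, the edge weights and the degree bound are assumptions chosen for illustration.

```python
def degree_bounded_max_spanning_forest(n, edges, max_degree):
    """Greedy sketch: take edges by decreasing weight, skipping any edge
    that would close a cycle or push a node past max_degree.

    Returns the kept edges as (u, v, weight). The result may be a forest
    rather than a spanning tree if the degree bound blocks every
    remaining connecting edge.
    """
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    degree = [0] * n
    kept = []
    for w, u, v in sorted(edges, reverse=True):
        if degree[u] >= max_degree or degree[v] >= max_degree:
            continue
        ru, rv = find(u), find(v)
        if ru == rv:
            continue  # edge would close a cycle
        parent[ru] = rv
        degree[u] += 1
        degree[v] += 1
        kept.append((u, v, w))
    return kept

# Hypothetical overlap weights between 5 images (higher = more overlap),
# given as (weight, image_a, image_b).
edges = [
    (0.9, 0, 1), (0.8, 0, 2), (0.7, 0, 3), (0.6, 0, 4),
    (0.5, 1, 2), (0.4, 2, 3), (0.3, 3, 4),
]
skeleton = degree_bounded_max_spanning_forest(5, edges, max_degree=2)
```

With the degree bound set to 2, image 0 keeps only its two strongest links, and the remaining images are connected through weaker edges. Tie-point matching would then run on 4 pairs instead of all 7 candidate pairs in this toy graph.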