2,311 research outputs found

    Nonparametric Feature Extraction from Dendrograms

    We propose nonparametric feature extraction from dendrograms. Minimax distance measures correspond to building a dendrogram with the single-linkage criterion and defining specific forms of a level function and a distance function over it. We therefore extend this method to arbitrary dendrograms. We develop a generalized framework in which different distance measures can be inferred from different types of dendrograms, level functions, and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances so that many numerical machine learning algorithms can employ them. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances in solution space and in representation space, respectively, in the spirit of deep representations. In the first approach, taking clustering as an example, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the sequential combination of different distances and features, in the spirit of multi-layered architectures, to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
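The link between Minimax distances and single linkage mentioned in this abstract can be illustrated with a small sketch. The Minimax distance between two objects is the smallest achievable "largest hop" over all paths connecting them, which matches the level at which single linkage first merges them. The function name `minimax_distances` and the toy data are assumptions made for illustration; this is a generic Floyd-Warshall-style computation, not the authors' framework.

```python
def minimax_distances(dist):
    """Return the matrix of Minimax (bottleneck) path distances.

    dist is a symmetric list-of-lists of pairwise base distances.
    mm[i][j] is the minimum, over all paths from i to j, of the
    largest edge weight on the path.
    """
    n = len(dist)
    mm = [row[:] for row in dist]  # copy so the input is not modified
    for k in range(n):
        for i in range(n):
            for j in range(n):
                # Best path through k is limited by its larger half.
                via_k = max(mm[i][k], mm[k][j])
                if via_k < mm[i][j]:
                    mm[i][j] = via_k
    return mm


# Toy example (hypothetical data): four points on a line.
points = [0.0, 1.0, 2.0, 10.0]
base = [[abs(a - b) for b in points] for a in points]
mm = minimax_distances(base)
```

In this toy example the points 0, 1, and 2 form a chain with hops of length 1, so their mutual Minimax distances collapse to 1, while reaching the point at 10 always requires a hop of length 8.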

    FlashProfile: A Framework for Synthesizing Data Profiles

    We address the problem of learning a syntactic profile for a collection of strings, i.e. a set of regex-like patterns that succinctly describe the syntactic variations in the strings. Real-world datasets, typically curated from multiple sources, often contain data in various syntactic formats. Thus, any data processing task is preceded by the critical step of data format identification. However, manual inspection of data to identify the different formats is infeasible in standard big-data scenarios. Prior techniques are restricted to a small set of pre-defined patterns (e.g. digits, letters, words, etc.), and provide no control over granularity of profiles. We define syntactic profiling as a problem of clustering strings based on syntactic similarity, followed by identifying patterns that succinctly describe each cluster. We present a technique for synthesizing such profiles over a given language of patterns, that also allows for interactive refinement by requesting a desired number of clusters. Using a state-of-the-art inductive synthesis framework, PROSE, we have implemented our technique as FlashProfile. Across 153 tasks over 75 large real datasets, we observe a median profiling time of only ~0.7 s. Furthermore, we show that access to syntactic profiles may allow for more accurate synthesis of programs, i.e. using fewer examples, in programming-by-example (PBE) workflows such as FlashFill. Comment: 28 pages, SPLASH (OOPSLA) 201
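To make the idea of a syntactic profile concrete, here is a minimal sketch that groups strings by a coarse pattern built from a few fixed character classes. This is deliberately the simple, fixed-pattern style of profiling that the abstract describes as prior work; it is not FlashProfile's synthesis algorithm and does not use PROSE. The names `coarse_pattern` and `syntactic_profile` and the sample strings are hypothetical.

```python
import re
from collections import defaultdict

# Tokenizer: runs of ASCII digits, runs of ASCII letters, runs of
# whitespace, or any other single character.
TOKEN = re.compile(
    r"(?P<digits>[0-9]+)|(?P<letters>[A-Za-z]+)|(?P<space>\s+)|(?P<other>.)",
    re.DOTALL,
)
CLASS_PATTERN = {"digits": "[0-9]+", "letters": "[A-Za-z]+", "space": r"\s+"}


def coarse_pattern(s):
    """Map a string to a regex built from fixed character classes."""
    parts = []
    for m in TOKEN.finditer(s):
        kind = m.lastgroup  # name of the alternative that matched
        # Known classes generalize; everything else stays a literal.
        parts.append(CLASS_PATTERN.get(kind) or re.escape(m.group()))
    return "".join(parts)


def syntactic_profile(strings):
    """Cluster strings by their coarse pattern: pattern -> members."""
    clusters = defaultdict(list)
    for s in strings:
        clusters[coarse_pattern(s)].append(s)
    return dict(clusters)


# Hypothetical sample column mixing two date formats.
sample = ["2019-01-02", "Jan 5", "2020-11-30", "Feb 12"]
profile = syntactic_profile(sample)
```

Each key of `profile` is a regex that fully matches every string in its cluster. Unlike the approach in the abstract, this sketch offers no control over the granularity or the number of clusters.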

    Learning representations from dendrograms

    We propose unsupervised representation learning and feature extraction from dendrograms. The commonly used Minimax distance measures correspond to building a dendrogram with the single-linkage criterion and defining specific forms of a level function and a distance function over it. We therefore extend this method to arbitrary dendrograms. We develop a generalized framework in which different distance measures and representations can be inferred from different types of dendrograms, level functions, and distance functions. Via an appropriate embedding, we compute a vector-based representation of the inferred distances so that many numerical machine learning algorithms can employ them. Then, to address the model selection problem, we study the aggregation of different dendrogram-based distances in solution space and in representation space, respectively, in the spirit of deep representations. In the first approach, taking clustering as an example, we build a graph with positive and negative edge weights according to the consistency of the clustering labels of different objects among different solutions, in the context of ensemble methods. Then, we use an efficient variant of correlation clustering to produce the final clusters. In the second approach, we investigate the sequential combination of different distances and features, in the spirit of multi-layered architectures, to obtain the final features. Finally, we demonstrate the effectiveness of our approach via several numerical studies.
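The solution-space aggregation described in this abstract can be sketched as follows. Each pair of objects receives +1 for every clustering solution that puts them together and -1 for every solution that separates them. The abstract does not specify which correlation clustering variant is used, so the greedy pivot heuristic below is only a stand-in, and the names `consensus_weights` and `pivot_clustering` are hypothetical.

```python
def consensus_weights(labelings):
    """Build a signed weight matrix from several clustering solutions.

    labelings is a list of label lists, one per solution, all of the
    same length. W[i][j] counts agreements minus disagreements.
    """
    n = len(labelings[0])
    W = [[0] * n for _ in range(n)]
    for labels in labelings:
        for i in range(n):
            for j in range(i + 1, n):
                w = 1 if labels[i] == labels[j] else -1
                W[i][j] += w
                W[j][i] += w
    return W


def pivot_clustering(W):
    """Greedy pivot heuristic for correlation clustering (stand-in).

    Takes the first unassigned object as pivot and groups it with
    every unassigned object it shares a positive weight with.
    """
    unassigned = list(range(len(W)))
    clusters = []
    while unassigned:
        pivot = unassigned[0]
        cluster = [pivot] + [j for j in unassigned[1:] if W[pivot][j] > 0]
        clusters.append(cluster)
        unassigned = [j for j in unassigned if j not in cluster]
    return clusters


# Three hypothetical clustering solutions over four objects.
labelings = [[0, 0, 1, 1], [0, 0, 0, 1], [1, 1, 2, 2]]
W = consensus_weights(labelings)
clusters = pivot_clustering(W)
```

Label values need not be aligned across solutions, since only agreement within each solution is counted.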

    Investigation of methods for machine learning associations between genetic variations and phenotype

    The relationship between genetics and phenotype is a complex one that remains poorly understood. Many factors contribute to the relationship between genetic variations and differences in phenotype. An improved understanding of the genetic underpinnings of various phenotypes can help us make important advances in testing for, preventing, treating, and curing a number of diseases and disorders. The recent popularization of direct-to-consumer sequencing services, coupled with consumers releasing their genetic information for public use, has led to an unprecedented level of access to genetic information. Crowd-sourcing the problem of developing robust genome-wide association techniques for ever-larger amounts of data is a promising trend. This thesis explores likely methods for mining one such public genetic data repository, openSNP, for correlated genotypes and phenotypes. Particular care is given to data clean-up and the steps required to preprocess public data for machine learning. The preprocessing methods are detailed in such a way that they may be applied to other genetic data repositories that already exist, for example the Personal Genome Project, as well as genetic data repositories that may become available in the future. Following data clean-up, a number of machine learning techniques are investigated, applied, and assessed for their utility in such a big-data problem. No single machine learning approach was found to be sufficient; the combination of imbalanced phenotype response classes and an underdetermined system led to a difficult machine learning challenge. Additional techniques must be explored or developed in order to make such genome-wide association studies possible and meaningful.

    Light-weight ontologies for scrutable user modelling

    This thesis is concerned with the ways light-weight ontologies can support scrutability for large user models and the user modelling process. It explores the role that light-weight ontologies can play, and how they can be exploited, for the purpose of creating and maintaining large, scrutable user models consisting of hundreds of components. We address problems in four key areas: ontology creation, metadata annotation, creation and maintenance of large user models, and user model visualisation, with the goal of providing a simple and adaptable approach that maintains scrutability. Each of these key areas presents a number of challenges that we address. Our solution is the development of a toolkit, LOSUM, which consists of a number of tools to support the user modelling process. It incorporates light-weight ontologies to fulfill a number of roles: aiding in metadata creation, providing structure for large user model visualisation, and serving as a means to reason across granularities in the user model. In conjunction with this, LOSUM also features a novel visualisation tool, SIV, which performs a dual role of ontology and user model visualisation, supporting the process of ontology creation, metadata annotation, and user model visualisation. We evaluated our approach at each stage with small user studies, and conducted a large-scale integrative evaluation of these approaches together in an authentic learning context with 114 students, of whom 77 had exposure to their learner models through SIV. The results showed that students could use the interface and understand the process of user model construction. The flexibility and adaptability of the toolkit have also been demonstrated in its deployment in several other application areas.

    Skeletal camera network embedded structure-from-motion for 3D scene reconstruction from UAV images

    Structure-from-Motion (SfM) techniques have been widely used for 3D scene reconstruction from multi-view images. However, due to the large computational cost of SfM methods, processing highly overlapping images, e.g. images from unmanned aerial vehicles (UAVs), remains a major challenge. This paper embeds a novel skeletal camera network (SCN) into SfM to enable efficient 3D scene reconstruction from a large set of UAV images. First, the flight control data are used within a weighted graph to construct a topologically connected camera network (TCN) to determine the spatial connections between UAV images. Second, the TCN is refined using a novel hierarchical degree bounded maximum spanning tree to generate an SCN, which contains a subset of edges from the TCN and ensures that each image is involved in at least a 3-view configuration. Third, the SCN is embedded into the SfM to produce a novel SCN-SfM method, which allows performing tie-point matching only for the actually connected image pairs. The proposed method was applied in three experiments with images from two fixed-wing UAVs and an octocopter UAV, respectively. In addition, the SCN-SfM method was compared to three other methods for image connectivity determination. The comparison shows a significant reduction in the number of matched images when our method is used, which leads to lower computational cost. At the same time, the achieved scene completeness and geometric accuracy are comparable.
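The edge-pruning idea in this abstract can be illustrated with a plain maximum spanning tree computed by Kruskal's algorithm. This is a simplified stand-in: the paper's hierarchical degree bounded variant and its 3-view guarantee are not specified in the abstract and are not reproduced here. The function name `maximum_spanning_tree`, the edge weights, and the image indices are assumptions for illustration.

```python
def maximum_spanning_tree(n, edges):
    """Kruskal's algorithm on edges sorted by descending weight.

    n is the number of images (nodes); edges is a list of
    (weight, u, v) tuples, where a higher weight means a stronger
    connection between images u and v.
    """
    parent = list(range(n))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges, reverse=True):
        ru, rv = find(u), find(v)
        if ru != rv:  # keep the edge only if it joins two components
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


# Hypothetical connection graph over four images with overlap scores.
edges = [(0.9, 0, 1), (0.8, 1, 2), (0.2, 0, 2), (0.7, 2, 3), (0.1, 1, 3)]
tree = maximum_spanning_tree(4, edges)
```

In this toy graph, pairwise matching would be restricted to the 3 retained edges instead of all 5 candidate pairs. A plain spanning tree keeps every image connected but does not by itself ensure the 3-view configurations the paper requires.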