
    An approach to clustering biological phenotypes

    Recently emerging approaches to high-throughput phenotyping have become important tools in unraveling the biological basis of agronomically and medically important phenotypes. These experiments produce very large sets of either low- or high-dimensional data. Finding clusters in the entire space of high-dimensional data (HDD) is a challenging task, because the relative difference between the distances from any object to its nearest and farthest neighbors converges to zero with increasing dimensionality. Additionally, real data may not be mathematically well behaved. Finally, many clusters are expected on biological grounds to be "natural": that is, to have irregular, overlapping boundaries in different subsets of the dimensions. More precisely, the natural clusters of the data could differ in shape, size, density, and dimensionality, and they might not be disjoint. In principle, such data could be clustered after applying dimension reduction methods. However, these methods convert many dimensions into a smaller set of derived dimensions, which makes the clustering results difficult to interpret and may also lead to a significant loss of information. Another possible approach is to find subspaces (subsets of dimensions) in the entire data space of the HDD. However, the existing subspace methods do not discover natural clusters. Therefore, in this dissertation I propose a novel data preprocessing method, demonstrate that a group of phenotypes are interdependent, and propose a novel density-based subspace clustering algorithm for high-dimensional data, called Dynamic Locally Density Adaptive Scalable Subspace Clustering (DynaDASC). This algorithm is relatively locally density adaptive, scalable, dynamic, and nonmetric in nature, and discovers natural clusters. Dr. Toni Kazic, Dissertation Supervisor. Includes vita. Includes bibliographical references (pages 62-73).
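The distance-concentration effect mentioned in the abstract above can be illustrated with a small simulation. This is only a sketch under stated assumptions: the points are drawn uniformly at random from the unit hypercube, the distance is Euclidean, and the function name `relative_contrast` is hypothetical. None of this comes from the dissertation, and it does not implement DynaDASC.

```python
import math
import random


def relative_contrast(n_points, n_dims, seed=0):
    """Return (d_max - d_min) / d_min for one random query point.

    Distances are Euclidean, measured from the query to n_points points
    drawn uniformly from the unit hypercube (an illustrative assumption).
    """
    rng = random.Random(seed)
    query = [rng.random() for _ in range(n_dims)]
    dists = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(n_dims)]
        dists.append(math.dist(query, point))
    d_min, d_max = min(dists), max(dists)
    return (d_max - d_min) / d_min


if __name__ == "__main__":
    # The contrast shrinks as the number of dimensions grows.
    for dims in (2, 10, 100, 1000):
        print(dims, round(relative_contrast(500, dims), 3))
```

As the number of dimensions grows, the nearest and farthest neighbors of the query end up at almost the same distance, which is why clustering in the full space becomes unreliable.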

    Characterizing low-dimensional phenotypes by clustering

    A person's height is influenced by many factors, such as her parents' heights and how well she was nourished as a child. The features of an organism that can be seen are called phenotypes. Phenotypes that are influenced by multiple genes or the environment, that have many different features, and that vary smoothly over many different values are called complex phenotypes. These features are called dimensions. Many agronomically valuable phenotypes are complex, such as how much grain can be produced by a field of corn. To understand these phenotypes well enough to improve yield, we have to be able to recognize when two phenotypes differ and by how much they differ. One way to recognize such differences is a computational technique called clustering, which groups similar things together by some criterion. For example, different varieties of corn have different yields, and the way their yields respond to different amounts of fertilizer also differs. Say two varieties of corn maximize their yields at different fertilizer amounts. Once we can compare yields and amounts of fertilizer across corn varieties, we can recognize such similarities and differences. Phenotypes described by only a few dimensions can be impossible to cluster reliably if the numbers for the different dimensions are not comparable to each other. This thesis demonstrates a novel way to make the dimensions comparable to each other and applies the method to 90 different varieties of corn grown under nine different combinations of water and fertilizer.
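To see why dimensions that are not comparable can defeat clustering, the sketch below rescales each dimension to zero mean and unit standard deviation before computing distances. This is ordinary z-score standardization, swapped in purely for illustration; it is not the novel method of the thesis, which the abstract does not describe. The function name `standardize` and the toy values are hypothetical and are not data from the thesis.

```python
import math
import statistics


def standardize(rows):
    """Rescale each column to mean 0 and population standard deviation 1."""
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    stdevs = [statistics.pstdev(col) for col in columns]
    return [
        tuple((v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stdevs))
        for row in rows
    ]


# Hypothetical toy values, NOT from the thesis:
# (grain yield, fertilizer amount at which yield peaks), in arbitrary units.
varieties = [(9000.0, 120.0), (9100.0, 200.0), (7000.0, 125.0), (7100.0, 195.0)]

# Raw distances are dominated by the dimension with the larger numbers.
raw_01 = math.dist(varieties[0], varieties[1])  # differ mostly in fertilizer
raw_02 = math.dist(varieties[0], varieties[2])  # differ mostly in yield

# After standardization both dimensions contribute on a similar scale.
scaled = standardize(varieties)
scaled_01 = math.dist(scaled[0], scaled[1])
scaled_02 = math.dist(scaled[0], scaled[2])

if __name__ == "__main__":
    print("raw:   ", round(raw_01, 1), round(raw_02, 1))
    print("scaled:", round(scaled_01, 3), round(scaled_02, 3))
```

In the raw numbers, the first and third varieties look far apart only because yield is recorded on a larger scale than fertilizer. After rescaling, a difference in either dimension carries similar weight.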