9 research outputs found

    Visualizing dimensionality reduction of systems biology data

One of the challenges in analyzing high-dimensional expression data is the detection of important biological signals. A common approach is to apply a dimension reduction method, such as principal component analysis. Typically, after applying such a method, the data are projected and visualized in the new coordinate system using scatter plots or profile plots. These methods give good results if the data have certain properties that become visible in the new coordinate system and were hard to detect in the original one. Often, however, a single method does not suffice to capture all important signals, so several methods addressing different aspects of the data need to be applied. We have developed a framework for linear and non-linear dimension reduction methods within our visual analytics pipeline SpRay. This includes measures that assist the interpretation of the factorization result. Different visualizations of these measures can be combined with functional annotations that support the interpretation of the results. We show an application to high-resolution time series microarray data in the antibiotic-producing organism Streptomyces coelicolor, as well as to microarray data measuring expression of cells with normal karyotype and cells with trisomies of human chromosomes 13 and 21.
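As a concrete illustration of the generic workflow described above, the sketch below projects a synthetic expression matrix with PCA and draws the resulting scatter plot; the data shapes and variable names are illustrative stand-ins, not part of SpRay itself.

```python
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5000))   # 200 samples x 5000 genes (synthetic)

pca = PCA(n_components=2)
Z = pca.fit_transform(X)           # project into the new coordinate system

plt.scatter(Z[:, 0], Z[:, 1], s=10)
plt.xlabel(f"PC1 ({pca.explained_variance_ratio_[0]:.1%} variance)")
plt.ylabel(f"PC2 ({pca.explained_variance_ratio_[1]:.1%} variance)")
plt.show()
```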

    Local linear embedded regression in the quantitative analysis of glucose in near infrared spectra

This paper investigates the use of Local Linear Embedded Regression (LLER) for the quantitative analysis of glucose from near infrared spectra. The performance of the LLER model is evaluated and compared with the regression techniques Principal Component Regression (PCR), Partial Least Squares Regression (PLSR), and Support Vector Regression (SVR), both with and without pre-processing. The prediction capability of the proposed model has been validated by predicting the glucose concentration in an aqueous solution composed of three components (urea, triacetin, and glucose). The results show that the LLER method offers improvements over PCR, PLSR, and SVR.
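Since LLER as described is not available off the shelf, the following sketch approximates the idea by chaining scikit-learn's locally linear embedding with an ordinary linear regression, alongside PCR and PLSR baselines; the synthetic spectra and all parameter choices are assumptions, not the paper's setup.

```python
import numpy as np
from sklearn.manifold import LocallyLinearEmbedding
from sklearn.linear_model import LinearRegression
from sklearn.cross_decomposition import PLSRegression
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
X = rng.normal(size=(150, 400))    # synthetic stand-in for NIR spectra
y = X[:, :5].sum(axis=1) + rng.normal(scale=0.1, size=150)  # synthetic target

# "LLER"-style: nonlinear embedding followed by a linear fit
lle = LocallyLinearEmbedding(n_neighbors=12, n_components=10)
Z = lle.fit_transform(X)
ller = LinearRegression().fit(Z, y)

# Baselines from the comparison: PCR and PLSR
pcr = make_pipeline(PCA(n_components=10), LinearRegression()).fit(X, y)
plsr = PLSRegression(n_components=10).fit(X, y)
```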

    SLICER: inferring branched, nonlinear cellular trajectories from single cell RNA-seq data

Accuracy of trajectory reconstruction using a subset of cells. (a) Graph showing how similar the SLICER trajectory is when computed using a random subset of lung cells. The blue bars show the similarity in cell ordering (units are percent sorted with respect to the trajectory constructed from all cells). The orange bars show the similarity in branch assignments (percentage of cells assigned to the same branch as in the trajectory constructed from all cells). The values shown were obtained by averaging the results from five subsampled datasets for each percentage (80%, 60%, 40%, and 20%). (b) Order preservation and branch identity values computed as in panel (a), but for datasets sampled from the neural stem cell dataset. (PDF 106 kb)
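A hedged sketch of the subsampling check this caption describes might look like the following; `fit_trajectory` is a hypothetical stand-in for a SLICER run, and the similarity measures used here (Spearman rank correlation for ordering, label agreement for branches) are assumptions rather than the paper's exact "percent sorted" metric.

```python
import numpy as np
from scipy.stats import spearmanr

def subsample_agreement(X, fit_trajectory, fractions=(0.8, 0.6, 0.4, 0.2),
                        n_repeats=5, seed=0):
    """Average ordering/branch agreement between full and subsampled runs."""
    rng = np.random.default_rng(seed)
    order_full, branch_full = fit_trajectory(X)   # hypothetical SLICER wrapper
    results = {}
    for frac in fractions:
        order_scores, branch_scores = [], []
        for _ in range(n_repeats):
            idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
            order_sub, branch_sub = fit_trajectory(X[idx])
            # ordering agreement on the shared cells (rank correlation)
            order_scores.append(spearmanr(order_full[idx], order_sub).correlation)
            # branch agreement, assuming branch labels are matched across runs
            branch_scores.append(np.mean(branch_full[idx] == branch_sub))
        results[frac] = (np.mean(order_scores), np.mean(branch_scores))
    return results
```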

    Task-driven learned hyperspectral data reduction using end-to-end supervised deep learning

An important challenge in hyperspectral imaging tasks is coping with the large number of spectral bins. Common spectral data reduction methods do not take prior knowledge about the task into account, so sparsely occurring features that may be essential for the imaging task may not be preserved in the reduction step. Convolutional neural network (CNN) approaches are capable of learning the features relevant to a particular imaging task, but applying them directly to the spectral input data is constrained by computational cost. We propose a novel supervised deep learning approach that combines data reduction and image analysis in an end-to-end architecture. In our approach, the neural network component that performs the reduction is trained such that the image features most relevant to the task are preserved in the reduction step. Results for two convolutional neural network architectures and two types of generated datasets show that the proposed Data Reduction CNN (DRCNN) approach can produce more accurate results than existing popular data reduction methods and can be used in a wide range of problem settings. Integrating knowledge about the task allows for greater compression and higher accuracy than standard data reduction methods.
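The end-to-end idea can be sketched in a few lines of PyTorch: a learned 1x1 convolution compresses the spectral axis before a small task network, and both parts are trained jointly so that the reduction preserves task-relevant features. The layer sizes below are illustrative guesses, not the DRCNN architecture from the paper.

```python
import torch
import torch.nn as nn

class DataReductionCNN(nn.Module):
    def __init__(self, n_bins=200, n_reduced=2, n_classes=2):
        super().__init__()
        # learned spectral reduction: n_bins channels -> n_reduced channels
        self.reduce = nn.Conv2d(n_bins, n_reduced, kernel_size=1)
        # small task network operating on the reduced representation
        self.task = nn.Sequential(
            nn.Conv2d(n_reduced, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(16, n_classes),
        )

    def forward(self, x):              # x: (batch, n_bins, H, W)
        return self.task(self.reduce(x))

model = DataReductionCNN()
x = torch.randn(4, 200, 32, 32)        # a batch of hyperspectral patches
logits = model(x)                      # reduction and analysis in one pass
```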

    Exploiting geometry, topology, and optimization for knowledge discovery in big data

In this dissertation, we consider several topics united by the theme of topological and geometric data analysis. First, we consider an application in landscape ecology, using a well-known vector quantization algorithm to characterize and segment the color content of natural imagery. Color information in an image may be viewed naturally as clusters of pixels with similar attributes. The inherent structure and distribution of these clusters serves to quantize the information in the image and provides a basis for classification. A friendly graphical user interface called the Biological Landscape Organizer and Semi-supervised Segmenting Machine (BLOSSM) was developed to aid in this classification. We consider four choices of color space and five metrics in which to analyze our data, and compare the results. Second, we present a novel topologically driven clustering algorithm that blends Locally Linear Embedding (LLE) and vector quantization by mapping color information to a lower-dimensional space, identifying distinct color regions, and classifying pixels together based on both a proximity measure and color content. These techniques permit a significant reduction in color resolution while maintaining the visually important features of images. Third, we develop a novel algorithm, which we call Sparse LLE, that produces sparse local reconstructions by adding a data-weighted 1-norm regularization term to the objective function of the LLE optimization problem. This formulation has proven effective at automatically determining an appropriate number of nearest neighbors for each data point. We explore various optimization techniques, namely primal-dual interior point algorithms, to solve this problem, and compare their computational complexity. Fourth, we present a novel algorithm that determines the boundary of a data set, i.e., the vertices of a convex hull encasing a point cloud, in any dimension by solving a quadratic optimization problem. Each point is written as a linear combination of its nearest neighbors, and the coefficients are penalized if they do not form a convex combination; the points that cannot be represented this way are the vertices of the convex hull containing the data. Finally, we exploit the relatively new tool from topological data analysis, persistent homology, and consider the use of vector bundles to re-embed data in order to improve the topological signal of a data set, by embedding points sampled from a projective variety into successive Grassmannians.
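To make the Sparse LLE idea concrete, the sketch below finds each point's local reconstruction weights with an L1-penalized least-squares fit over a generous neighbor set, so the regularizer itself prunes neighbors; a plain Lasso stands in for the dissertation's data-weighted 1-norm formulation, and all parameter values are illustrative.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.neighbors import NearestNeighbors

def sparse_lle_weights(X, max_neighbors=20, alpha=0.01):
    """L1-regularized local reconstruction weights (hypothetical sketch)."""
    nbrs = NearestNeighbors(n_neighbors=max_neighbors + 1).fit(X)
    _, idx = nbrs.kneighbors(X)
    W = np.zeros((len(X), len(X)))
    for i, neighbors in enumerate(idx[:, 1:]):      # drop the point itself
        # Reconstruct x_i from its neighbors; the L1 penalty zeroes out
        # unneeded neighbors, choosing an effective k per point. LLE's
        # sum-to-one constraint is dropped here for simplicity.
        lasso = Lasso(alpha=alpha, fit_intercept=False, max_iter=5000)
        lasso.fit(X[neighbors].T, X[i])
        W[i, neighbors] = lasso.coef_
    return W

rng = np.random.default_rng(0)
W = sparse_lle_weights(rng.normal(size=(100, 8)))
```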

    Selection of the Optimal Parameter Value for the Locally Linear Embedding Algorithm

The locally linear embedding (LLE) algorithm has recently emerged as a promising technique for nonlinear dimensionality reduction of high-dimensional data. One of its advantages over many similar methods is that only one parameter has to be defined, but no guidance has yet been given on how to choose it. We propose a hierarchical method for the automatic selection of an optimal parameter value. Our approach is experimentally verified on two large data sets of real-world images and applied to the visualization of multidimensional data.
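One common way to automate the choice of LLE's single parameter, the neighbor count k, is to sweep candidate values and keep the one whose embedding best preserves pairwise distances, as measured by residual variance (1 − ρ²); the brute-force sketch below follows that recipe and omits the hierarchical refinement proposed in the paper.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.manifold import LocallyLinearEmbedding

def select_k(X, k_candidates=range(4, 21), n_components=2):
    """Pick the k whose embedding minimizes residual variance."""
    d_high = pdist(X)                              # high-dimensional distances
    best_k, best_residual = None, np.inf
    for k in k_candidates:
        Y = LocallyLinearEmbedding(n_neighbors=k,
                                   n_components=n_components).fit_transform(X)
        rho = np.corrcoef(d_high, pdist(Y))[0, 1]  # distance correlation
        residual = 1 - rho ** 2                    # residual variance
        if residual < best_residual:
            best_k, best_residual = k, residual
    return best_k
```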

    Computational Methods for Inferring Transcriptome Dynamics

The sequencing of the human genome paved the way for a new type of medicine, in which a molecular-level, cell-by-cell understanding of the genomic control system informs diagnosis and treatment. A key experimental approach for achieving such understanding is measuring gene expression dynamics across a range of cell types and biological conditions. The raw outputs of these experiments are millions of short DNA sequences, and computational methods are required to draw scientific conclusions from such experimental data. In this dissertation, I present computational methods that address some of the challenges involved in inferring dynamic transcriptome changes. My work focuses on two types of challenges: (1) discovering important biological variation within a population of single cells and (2) robustly extracting information from sequencing reads. Three of the methods are designed to identify biologically relevant differences among a heterogeneous mixture of cells. SingleSplice uses a statistical model to detect true biological variation in alternative splicing within a population of single cells. SLICER elucidates transcriptome changes during a sequential biological process by positing the process as a nonlinear manifold embedded in high-dimensional gene expression space. MATCHER uses manifold alignment to infer what multiple types of single-cell measurements obtained from different individual cells would look like if they were performed simultaneously on the same cell. These methods gave insight into several important biological systems, including embryonic stem cells and cardiac fibroblasts undergoing reprogramming. To enable study of the pseudogene ceRNA effect, I developed a computational method for robustly computing pseudogene expression levels in the presence of the high sequence similarity that confounds sequencing read alignment. AppEnD, an algorithm for detecting untemplated additions, allowed the study of transcript modifications during RNA degradation.