28 research outputs found

    Visual Exploration And Information Analytics Of High-Dimensional Medical Images

    Data visualization has transformed how we analyze increasingly large and complex data sets. Advanced visual tools logically represent data in a way that communicates the most important information inherent within it and culminate the analysis with an insightful conclusion. Automated analysis disciplines - such as data mining, machine learning, and statistics - have traditionally been the most dominant fields for data analysis. They have been complemented by a near-ubiquitous adoption of specialized hardware and software environments that handle the storage, retrieval, and pre- and postprocessing of digital data. The addition of interactive visualization tools allows an active human participant in the model creation process. The advantage is a data-driven approach where the constraints and assumptions of the model can be explored and chosen based on human insight and confirmed on demand by the analytic system. This translates to a better understanding of data and more effective knowledge discovery. This trend has become very popular across various domains, including but not limited to machine learning, simulation, computer vision, genetics, stock markets, data mining, and geography. In this dissertation, we highlight the role of visualization within the context of medical image analysis in the field of neuroimaging. The analysis of brain images has uncovered amazing traits about the brain's underlying dynamics. Multiple image modalities capture qualitatively different internal brain mechanisms and abstract them within the information space of that modality. Computational studies based on these modalities help correlate the high-level brain function measurements with abnormal human behavior. These functional maps are easily projected in the physical space through accurate 3-D brain reconstructions and visualized in excellent detail from different anatomical vantage points. 
Statistical models built for comparative analysis across subject groups test for significant variance within the features and localize abnormal behaviors contextualizing the high-level brain activity. Currently, the task of identifying the features is based on empirical evidence, and preparing data for testing is time-consuming. Correlations among features are usually ignored due to lack of insight. With a multitude of features available and with new emerging modalities appearing, the process of identifying the salient features and their interdependencies becomes more difficult to perceive. This limits the analysis to certain discernible features, which restricts human judgment about the most important process governing the symptom and hinders prediction. These shortcomings can be addressed using an analytical system that leverages data-driven techniques for guiding the user toward discovering relevant hypotheses. The research contributions within this dissertation encompass multidisciplinary fields of study, including geometry processing, computer vision, and 3-D visualization. However, the principal achievement of this research is the design and development of an interactive system for multimodality integration of medical images. The research proceeds in several stages, each necessary to reach the desired goal. The stages are briefly described as follows: First, we develop a rigorous geometry computation framework for brain surface matching. The brain is a highly convoluted structure of closed topology. Surface parameterization explicitly captures the non-Euclidean geometry of the cortical surface and helps derive a more accurate registration of brain surfaces. We describe a technique based on conformal parameterization that creates a bijective mapping to the canonical domain, where surface operations can be performed with improved efficiency and feasibility. 
Subdividing the brain into a finite set of anatomical elements provides the structural basis for a categorical division of anatomical view points and a spatial context for statistical analysis. We present statistically significant results of our analysis into functional and morphological features for a variety of brain disorders. Second, we design and develop an intelligent and interactive system for visual analysis of brain disorders by utilizing the complete feature space across all modalities. Each subdivided anatomical unit is specialized by a vector of features that overlap within that element. The analytical framework provides the necessary interactivity for exploration of salient features and discovering relevant hypotheses. It provides visualization tools for confirming model results and an easy-to-use interface for manipulating parameters for feature selection and filtering. It provides coordinated display views for visualizing multiple features across multiple subject groups, visual representations for highlighting interdependencies and correlations between features, and an efficient data-management solution for maintaining provenance and issuing formal data queries to the back end

    Analysis of the singular value decomposition as a tool for processing microarray expression data

    We give two informative derivations of a spectral algorithm for clustering and partitioning a bi-partite graph. In the first case we begin with a discrete optimization problem that relaxes into a tractable continuous analogue. In the second case we use the power method to derive an iterative interpretation of the algorithm. Both versions reveal a natural approach for re-scaling the edge weights and help to explain the performance of the algorithm in the presence of outliers. Our motivation for this work is in the analysis of microarray data from bioinformatics, and we give some numerical results for a publicly available acute leukemia data set
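
The abstract above describes a spectral algorithm that re-scales the edge weights of a bipartite graph and partitions it. The following is a minimal sketch of one common scheme of that kind, not necessarily the exact algorithm derived in the paper: the weight matrix is normalized by row and column sums, and the second pair of singular vectors is split by sign. The function name `spectral_bipartition` and the toy matrix are assumptions made for illustration, and the sketch assumes every row and column has a positive sum.

```python
import numpy as np

def spectral_bipartition(A):
    """Split the rows and columns of a nonnegative matrix into two co-clusters.

    A[i, j] is the weight of the edge between row vertex i and column vertex j.
    Assumes every row and every column has a positive sum.
    """
    A = np.asarray(A, dtype=float)
    d_rows = A.sum(axis=1)   # degrees of the row vertices
    d_cols = A.sum(axis=0)   # degrees of the column vertices

    # Re-scale the edge weights: A_n = D_rows^{-1/2} A D_cols^{-1/2}.
    A_n = A / np.sqrt(np.outer(d_rows, d_cols))

    # Singular values come back in descending order; the first pair of
    # singular vectors is trivial, the second pair carries the partition.
    U, s, Vt = np.linalg.svd(A_n, full_matrices=False)
    u2 = U[:, 1] / np.sqrt(d_rows)
    v2 = Vt[1, :] / np.sqrt(d_cols)

    # Rows and columns with the same sign fall into the same co-cluster.
    return u2 >= 0, v2 >= 0

# Toy data (hypothetical): two strong blocks joined by weak background edges.
A = np.full((5, 5), 0.1)
A[:3, :2] = 5.0   # rows 0-2 interact strongly with columns 0-1
A[3:, 2:] = 5.0   # rows 3-4 interact strongly with columns 2-4
row_side, col_side = spectral_bipartition(A)
```

In this toy case the sign split groups rows 0-2 with columns 0-1 and rows 3-4 with columns 2-4. Which group is labelled `True` depends on the arbitrary sign of the singular vectors.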

    A survey of kernel and spectral methods for clustering

    Clustering algorithms are a useful tool to explore data structures and have been employed in many disciplines. The focus of this paper is the partitioning clustering problem with a special interest in two recent approaches: kernel and spectral methods. The aim of this paper is to present a survey of kernel and spectral clustering methods, two approaches able to produce nonlinear separating hypersurfaces between clusters. The presented kernel clustering methods are the kernel version of many classical clustering algorithms, e.g., K-means, SOM and neural gas. Spectral clustering arises from concepts in spectral graph theory, and the clustering problem is configured as a graph cut problem where an appropriate objective function has to be optimized. An explicit proof of the fact that these two paradigms have the same objective is reported, since it has been proven that these two seemingly different approaches have the same mathematical foundation. Besides, fuzzy kernel clustering methods are presented as extensions of the kernel K-means clustering algorithm. (C) 2007 Pattern Recognition Society. Published by Elsevier Ltd. All rights reserved
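
As an illustration of what a "kernel version" of K-means looks like, here is a minimal sketch under stated assumptions: the RBF kernel, the function names `rbf_kernel` and `kernel_kmeans`, the toy data, and the fixed initial labelling are all choices made for this example, not taken from the survey. The key step is that the squared distance from a point to a cluster mean in feature space can be written using kernel values only: K_ii - (2/|C|) sum_{j in C} K_ij + (1/|C|^2) sum_{j,l in C} K_jl.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """Gram matrix K[i, j] = exp(-gamma * ||x_i - x_j||^2)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def kernel_kmeans(K, init_labels, n_clusters, max_iter=100):
    """Kernel K-means on a precomputed Gram matrix K."""
    labels = np.asarray(init_labels).copy()
    n = K.shape[0]
    for _ in range(max_iter):
        dist = np.full((n, n_clusters), np.inf)
        for c in range(n_clusters):
            members = labels == c
            size = members.sum()
            if size == 0:
                continue  # an empty cluster attracts no points
            # Feature-space distance to the cluster mean, via kernel values.
            within = K[np.ix_(members, members)].sum() / size ** 2
            cross = K[:, members].sum(axis=1) / size
            dist[:, c] = np.diag(K) - 2.0 * cross + within
        new_labels = dist.argmin(axis=1)
        if np.array_equal(new_labels, labels):
            break  # assignments are stable
        labels = new_labels
    return labels

# Toy data (hypothetical): two tight groups on a line, deliberately
# started from an alternating, wrong labelling.
X = np.array([[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]])
labels = kernel_kmeans(rbf_kernel(X, gamma=1.0), [0, 1, 0, 1, 0, 1], n_clusters=2)
```

Like ordinary K-means, this procedure only finds a local optimum, so in practice the result depends on the initial labelling.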

    A Local Ensemble Method for Graph-Based Collaborative Filtering Using Spectral Co-Clustering

    Thesis (Master's) -- Seoul National University Graduate School: College of Natural Sciences, Interdisciplinary Program in Computational Science, 2022. 8. 강명주. The importance of personalized recommendation systems is growing as the world becomes more complex and individualized. Among various recommendation systems, Neural Graph Collaborative Filtering (NGCF) and its variants treat the user-item set as a bipartite graph and learn the interactions between users and items without using their unique features. While these approaches, which use only collaborative signals, have achieved state-of-the-art performance, they still have the disadvantage of abandoning feature similarity among users and items. To tackle this problem, we adopt unsupervised community detection from the bipartite graph structure to enhance the collaborative signal for a graph-based recommendation system. Co-clustering algorithms segment the user-item matrix into small groups. Each local CF captures a strong correlation among these local user-item subsets, while the original incidence matrix is also used to analyze global interaction between groups. Finally, our Local-Ensemble Graph Collaborative Filtering (LEGCF) aggregates all local and global collaborative information. As the proposed approach can flexibly combine various co-clustering and collaborative filtering methods, one of the most straightforward variants, Spectral Co-Clustering with NGCF, can enhance the overall performance.
This thesis studies a method for improving recommendation performance by ensembling the subgraphs produced by spectral co-clustering of a graph-based collaborative filtering model for recommender systems. The basic collaborative-filtering recommender built on Graph Neural Networks (GNN) constructs its embeddings using only user-item interaction information, without any prior information about users or items. It therefore cannot exploit the tendencies of particular user groups that could be inferred from prior information about users and items. Spectral co-clustering, by contrast, repeatedly applies singular value decomposition to partition a bipartite graph into subgraphs that contain information from both domains. When a recommendation data set is spectrally co-clustered, particular user groups and item groups can be separated from the full data, and the resulting groups have high data density and strong interaction signals. Applying collaborative filtering to the partitioned group data therefore improves recommendation ability within that group, regardless of the data set or the type of collaborative filtering model.
Furthermore, the partitioned subgraphs are each collaboratively filtered and then ensembled: local embeddings, which capture group-level interaction signals, are combined with a global embedding that covers the whole data set to form the final embedding. Applying the spectral co-clustering ensemble to six data sets and three collaborative filtering models improved recommendation ability regardless of model type. For some data sets, however, there was almost no improvement, which the analysis attributes to the data already being well partitioned, so that spectral co-clustering could not improve recommendation performance. In contrast, when the data are evenly distributed and the basic collaborative filtering model has difficulty analyzing interaction signals, the spectral co-clustering ensemble improved recommendation ability for all collaborative filtering models.
Contents: 1 Introduction; 2 Preliminaries; 2.1 Spectral Co-Clustering; 2.1.1 Bipartite Graph Partitioning; 2.1.2 Optimization; 2.2 Bayesian Personalized Ranking (BPR) Loss; 2.2.1 Implicit Data; 2.2.2 Personalized Total Ranking; 2.2.3 Bayesian Personalized Ranking; 3 Proposed Method; 3.1 Dataset; 3.2 Spectral Co-Clustering; 3.3 Local-Ensemble model; 4 Experimental Result; 4.1 Evaluation Metric; 4.2 Result Analysis; 5 Conclusion; References; Abstract (in Korean)
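
The thesis outline lists the Bayesian Personalized Ranking (BPR) loss among its preliminaries. As a rough sketch of that loss in its commonly stated form, the code below averages -ln sigmoid(x_ui - x_uj) over (user, positive item, negative item) triplets. Scoring by an inner product of embeddings, the function name `bpr_loss`, the omission of the usual regularization term, and the toy embeddings are all assumptions for illustration, not details taken from the thesis.

```python
import numpy as np

def bpr_loss(user_emb, item_emb, triplets):
    """Mean BPR loss over (user, positive item, negative item) index triplets.

    Scores are inner products of embeddings (an assumption of this sketch);
    the regularization term of the full BPR objective is left out.
    """
    u, i, j = np.asarray(triplets).T
    x_ui = np.sum(user_emb[u] * item_emb[i], axis=1)  # score of the observed item
    x_uj = np.sum(user_emb[u] * item_emb[j], axis=1)  # score of the unobserved item
    # -ln(sigmoid(x)) equals ln(1 + exp(-x)); logaddexp evaluates it stably.
    return float(np.mean(np.logaddexp(0.0, -(x_ui - x_uj))))

# Toy embeddings (hypothetical): user 0 prefers item 0, user 1 prefers item 1.
user_emb = np.array([[1.0, 0.0], [0.0, 1.0]])
item_emb = np.array([[2.0, 0.0], [0.0, 2.0]])

good = bpr_loss(user_emb, item_emb, [(0, 0, 1), (1, 1, 0)])  # positives ranked first
bad = bpr_loss(user_emb, item_emb, [(0, 1, 0), (1, 0, 1)])   # positives ranked last
tied = bpr_loss(np.zeros((2, 2)), item_emb, [(0, 0, 1)])     # equal scores
```

The loss is small when the observed item outscores the unobserved one, large when the order is reversed, and equal to ln 2 when the two scores tie.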

    A Computational Framework for Learning from Complex Data: Formulations, Algorithms, and Applications

    Many real-world processes are dynamically changing over time. As a consequence, the observed complex data generated by these processes also evolve smoothly. For example, in computational biology, the expression data matrices are evolving, since gene expression controls are deployed sequentially during development in many biological processes. Investigations into the spatial and temporal gene expression dynamics are essential for understanding the regulatory biology governing development. In this dissertation, I mainly focus on two types of complex data: genome-wide spatial gene expression patterns in the model organism fruit fly and Allen Brain Atlas mouse brain data. I provide a framework to explore spatiotemporal regulation of gene expression during development. I develop an evolutionary co-clustering formulation to identify co-expressed domains and the associated genes simultaneously over different temporal stages using a mesh-generation pipeline. I also propose to employ deep convolutional neural networks as a multi-layer feature extractor to generate generic representations for gene expression pattern in situ hybridization (ISH) images. Furthermore, I employ the multi-task learning method to fine-tune the pre-trained models with labeled ISH images. My proposed computational methods are evaluated using synthetic data sets and real biological data sets, including the gene expression data from the fruit fly BDGP data sets and the Allen Developing Mouse Brain Atlas, in comparison with existing baseline methods. Experimental results indicate that the proposed representations, formulations, and methods are efficient and effective in annotating and analyzing the large-scale biological data sets

    Development of Biclustering Techniques for Gene Expression Data Modeling and Mining

    The next-generation sequencing technologies can generate large-scale biological data with higher resolution, better accuracy, and lower technical variation than the array-based counterparts. RNA sequencing (RNA-Seq) can generate genome-scale gene expression data in biological samples at a given moment, facilitating a better understanding of cell functions at genetic and cellular levels. The abundance of gene expression datasets provides an opportunity to identify genes with similar expression patterns across multiple conditions, i.e., co-expression gene modules (CEMs). Genome-scale identification of CEMs can be modeled and solved by biclustering, a two-dimensional data mining technique that allows clustering of rows and columns in a gene expression matrix simultaneously. Compared with traditional clustering that targets global patterns, biclustering can predict local patterns. This unique feature makes biclustering very useful when applied to big gene expression data, since genes that participate in a cellular process are only active in specific conditions and thus are usually co-expressed under a subset of all conditions. The combination of biclustering and large-scale gene expression data holds promising potential for condition-specific functional pathway/network analysis. However, existing biclustering tools do not perform satisfactorily on high-resolution RNA-Seq data, mainly due to the lack of (i) a consideration of the high sparsity of RNA-Seq data, especially for scRNA-Seq data, and (ii) an understanding of the underlying transcriptional regulation signals of the observed gene expression values. QUBIC2, a novel biclustering algorithm, is designed for large-scale bulk RNA-Seq and single-cell RNA-Seq (scRNA-Seq) data analysis. 
Critical novelties of the algorithm include (i) a truncated model to handle the unreliable quantification of genes with low or moderate expression; (ii) the Gaussian mixture distribution and an information-divergence objective function to capture shared transcriptional regulation signals among a set of genes; (iii) a Dual strategy to expand the core biclusters, aiming to save dropouts from the background; and (iv) a statistical framework to evaluate the significance of all the identified biclusters. Method validation on comprehensive data sets suggests that QUBIC2 had superior performance in functional module detection and cell type classification. The applications to temporal and spatial data demonstrated that QUBIC2 could derive meaningful biological information from scRNA-Seq data. Also presented in this dissertation is QUBICR. This R package is characterized by an average 82% improvement in efficiency compared to the source C code of QUBIC. It provides a comprehensive set of functions to facilitate biclustering-based biological studies, including the discretization of expression data, query-based biclustering, bicluster expansion, bicluster comparison, heatmap visualization of any identified biclusters, and co-expression network elucidation. In the end, a systematic summary is provided regarding the primary applications of biclustering for biological data and more advanced applications for biomedical data. It will assist researchers in effectively analyzing their big data and generating valuable biological knowledge and novel insights with higher efficiency

    Interactive Data Exploration of Distributed Raw Files: A Systematic Mapping Study

    When exploring large amounts of data without a clear target, providing an interactive experience becomes really difficult, since this tentative inspection usually defeats any early decision on data structures or indexing strategies. This is also true in the physics domain, specifically in high-energy physics, where the huge volume of data generated by the detectors is normally explored via C++ code using batch processing, which introduces considerable latency. An interactive tool, when integrated into the existing data management systems, can add great value to the usability of these platforms. Here, we intend to review the current state of the art of interactive data exploration, aiming at satisfying three requirements: access to raw data files, stored in a distributed environment, and with a reasonably low latency. This paper follows the guidelines for systematic mapping studies, which are well suited for gathering and classifying available studies. We summarize the results after classifying the 242 papers that passed our inclusion criteria. While there are many proposed solutions that tackle the problem in different manners, there is little evidence available about their implementation in practice. Almost all of the solutions found by this paper cover a subset of our requirements, with only one partially satisfying the three. The solutions for data exploration abound. It is an active research area and, considering the continuous growth of data volume and variety, the problem is only going to become harder. There is a niche for research on a solution that covers our requirements, and the required building blocks are there

    Multi-species integrative biclustering

    We describe an algorithm, multi-species cMonkey, for the simultaneous biclustering of heterogeneous multiple-species data collections and apply the algorithm to a group of bacteria containing Bacillus subtilis, Bacillus anthracis, and Listeria monocytogenes. The algorithm reveals evolutionary insights into the surprisingly high degree of conservation of regulatory modules across these three species and allows data and insights from well-studied organisms to complement the analysis of related but less well studied organisms