5 research outputs found

    CubanSea: Cluster-Based Visualization of Search Results


    Clustering of Large, High-Dimensional and Uncertain Datasets in Astronomy

    Data volumes are growing steadily in many IT-related fields. Scientific, and in particular astronomical, datasets exhibit complex properties such as uncertainties, a high number of dimensions, and an enormous number of data instances. For example, astronomical datasets comprise several million data instances, each with several thousand dimensions reflecting the number of independent properties or components. These orders of magnitude in dimensionality and data volume, combined with uncertainties, show that automated analyses of such datasets with acceptable analysis time, and thus acceptable computational complexity, are necessary. Clustering methods offer one analysis methodology for examining similarities within a dataset. Current methods, however, integrate only individual aspects of these complex datasets, in some cases with non-linear computational complexity with respect to a growing number of data instances and dimensions. This dissertation outlines the individual challenges of processing complex data in a clustering method. It furthermore presents a novel, parameterizable approach for processing large and complex datasets, called Fractal Similarity Measures, which processes the data in log-linear analysis time. A likewise presented uncertain sorting method for high-dimensional data provides the grids required by the initialization procedure. Using the new concept of the fractal similarity measure, or fractal information value, the method analyzes candidate clusters and data instances for similarity. To demonstrate the functionality and efficiency of the algorithm, this thesis evaluates the method on a synthetic dataset and on a real dataset from astronomy.
    Processing the real dataset requires the given spectral data to be comparable, which is why a further method for preprocessing spectral data based on the Hadoop framework is presented. The dissertation also presents clustering results for the real dataset that are qualitatively comparable to results produced manually by domain experts.
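The abstract does not specify how the fractal similarity measure is computed, so the following is only a generic illustration of the kind of quantity fractal clustering approaches build on: a box-counting estimate of the fractal dimension of a point set, using the same grid idea the abstract mentions for initialization. The function name and scale choices are assumptions, not the thesis's method.

```python
import numpy as np

def box_counting_dimension(points, scales):
    """Estimate the fractal (box-counting) dimension of a point set.

    For each grid scale r, count the occupied boxes N(r); the dimension
    estimate is the slope of log N(r) versus log(1/r).
    """
    points = np.asarray(points, dtype=float)
    # Normalize to the unit hypercube so every scale covers the data.
    mins, maxs = points.min(axis=0), points.max(axis=0)
    span = np.where(maxs > mins, maxs - mins, 1.0)
    unit = (points - mins) / span
    counts = []
    for r in scales:
        # Assign each point to a grid cell of side length r.
        cells = np.floor(unit / r).astype(int)
        counts.append(len({tuple(c) for c in cells}))
    # Linear fit of log N(r) against log(1/r) yields the dimension.
    slope, _ = np.polyfit(np.log(1.0 / np.asarray(scales)), np.log(counts), 1)
    return slope

# Sanity check: points filling a unit square have dimension close to 2.
rng = np.random.default_rng(0)
square = rng.random((20000, 2))
dim = box_counting_dimension(square, scales=[1/4, 1/8, 1/16, 1/32])
```

A cluster whose points occupy a lower-dimensional structure (e.g. a line through 2-D space) yields a correspondingly lower estimate, which is what makes such measures usable as a similarity criterion.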

    Fuzzy cluster analysis: methods for exploring data with missing values and classified data

    Keywords: fuzzy cluster analysis, fuzzy data analysis. Dissertation, University of Magdeburg, Faculty of Computer Science, 2002, by Heiko Tim
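The listing gives no detail on the dissertation's missing-value extensions, so as background only, here is a minimal sketch of standard fuzzy c-means, the base algorithm of fuzzy cluster analysis, assuming complete data; the fuzzifier m and iteration count are conventional defaults, not values from the thesis.

```python
import numpy as np

def fuzzy_c_means(X, c, m=2.0, n_iter=100, seed=0):
    """Plain fuzzy c-means: each point receives graded memberships in all
    c clusters (rows of U sum to 1) instead of a hard assignment."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Random initial membership matrix U with rows summing to 1.
    U = rng.random((n, c))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        W = U ** m
        # Cluster centers are membership-weighted means of the data.
        centers = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distances from every point to every center.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)
        # Membership update: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U = inv / inv.sum(axis=1, keepdims=True)
    return centers, U

# Two well-separated blobs: memberships should end up near-crisp.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 0.2, (50, 2)), rng.normal(5, 0.2, (50, 2))])
centers, U = fuzzy_c_means(X, c=2)
```

Missing-value variants typically modify the distance computation (e.g. restricting it to observed coordinates), which is exactly the step marked by `d2` above.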

    Analysing functional genomics data using novel ensemble, consensus and data fusion techniques

    Motivation: A rapid technological development in the biosciences and in computer science in the last decade has enabled the analysis of high-dimensional biological datasets on standard desktop computers. However, in spite of these technical advances, common properties of the new high-throughput experimental data, like small sample sizes in relation to the number of features, high noise levels and outliers, also pose novel challenges. Ensemble and consensus machine learning techniques and data integration methods can alleviate these issues, but often provide overly complex models which lack generalization capability and interpretability. The goal of this thesis was therefore to develop new approaches to combine algorithms and large-scale biological datasets, including novel approaches to integrate analysis types from different domains (e.g. statistics, topological network analysis, machine learning and text mining), to exploit their synergies in a manner that provides compact and interpretable models for inferring new biological knowledge. Main results: The main contributions of the doctoral project are new ensemble, consensus and cross-domain bioinformatics algorithms, and new analysis pipelines combining these techniques within a general framework. This framework is designed to enable the integrative analysis of both large-scale gene and protein expression data (including the tools ArrayMining, Top-scoring pathway pairs and RNAnalyze) and general gene and protein sets (including the tools TopoGSA, EnrichNet and PathExpand), by combining algorithms for different statistical learning tasks (feature selection, classification and clustering) in a modular fashion. Ensemble and consensus analysis techniques employed within the modules are redesigned such that the compactness and interpretability of the resulting models is optimized in addition to the predictive accuracy and robustness.
    The framework was applied to real-world biomedical problems, with a focus on cancer biology, providing the following main results:
    (1) the identification of a novel tumour marker gene in collaboration with the Nottingham Queens Medical Centre, facilitating the distinction between two clinically important breast cancer subtypes (framework tool: ArrayMining);
    (2) the prediction of novel candidate disease genes for Alzheimer’s disease and pancreatic cancer using an integrative analysis of cellular pathway definitions and protein interaction data (framework tool: PathExpand, in collaboration with the Spanish National Cancer Centre);
    (3) the prioritization of associations between disease-related processes and other cellular pathways using a new rule-based classification method integrating gene expression data and pathway definitions (framework tool: Top-scoring pathway pairs);
    (4) the discovery of topological similarities between differentially expressed genes in cancers and cellular pathway definitions mapped to a molecular interaction network (framework tool: TopoGSA, in collaboration with the Spanish National Cancer Centre).
    In summary, the framework combines the synergies of multiple cross-domain analysis techniques within a single, easy-to-use software framework and has provided new biological insights in a wide variety of practical settings.
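The abstract's specific ensemble and consensus algorithms are not reproduced in this listing. As a generic illustration of consensus clustering, the following sketches the widely used co-association (evidence accumulation) idea: count how often each pair of items lands in the same cluster across an ensemble of base clusterings, then merge strongly co-associated items. The function name, the threshold, and the connected-components merging are illustrative simplifications, not the thesis's pipeline.

```python
import numpy as np

def coassociation_consensus(labelings, threshold=0.5):
    """Consensus clustering from an ensemble of base labelings.

    Builds the co-association matrix co[i, j] = fraction of base
    clusterings that put items i and j in the same cluster, then merges
    items via connected components of the thresholded graph (a stand-in
    for the hierarchical merging often used in practice)."""
    labelings = np.asarray(labelings)
    r, n = labelings.shape
    co = np.zeros((n, n))
    for labels in labelings:
        co += labels[:, None] == labels[None, :]
    co /= r
    # Flood-fill connected components over edges with co >= threshold.
    consensus = -np.ones(n, dtype=int)
    next_id = 0
    for i in range(n):
        if consensus[i] == -1:
            stack = [i]
            consensus[i] = next_id
            while stack:
                j = stack.pop()
                for k in np.nonzero(co[j] >= threshold)[0]:
                    if consensus[k] == -1:
                        consensus[k] = next_id
                        stack.append(k)
            next_id += 1
    return co, consensus

# Three base clusterings that mostly agree on {0,1,2} vs {3,4},
# with one dissenting run; the consensus recovers the majority split.
runs = [[0, 0, 0, 1, 1],
        [1, 1, 1, 0, 0],
        [0, 0, 1, 1, 1]]
co, consensus = coassociation_consensus(runs)
```

Note that the co-association matrix is invariant to label permutations across runs (run 2 uses swapped labels above), which is what makes it a convenient basis for combining independently produced clusterings.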
