2,410 research outputs found

    Differential modeling for cancer microarray data

    Capturing the changes between two biological phenotypes is a crucial task in understanding the mechanisms of various diseases. Most existing computational approaches test the change in the expression level of each gene individually. In this work, we propose novel computational approaches to identify the differential genes between two phenotypes. These approaches aim to quantitatively characterize the differences between two phenotypes and can provide better insights into and understanding of various diseases. The purpose of this thesis is three-fold. Firstly, we review the state-of-the-art approaches for differential analysis of gene expression data. Secondly, we propose a novel differential network analysis approach that is composed of two algorithms, namely DiffRank and DiffSubNet, to identify differential hubs and differential subnetworks, respectively. In this approach, two datasets are represented as two networks, and the problem of identifying differential genes is then transformed into the problem of comparing two networks to identify the most differential network components. Studying such networks can provide valuable knowledge about the data. The DiffRank algorithm ranks the nodes of two networks based on their differential behavior using two novel differential measures for each node: differential connectivity and differential betweenness centrality. These measures are propagated through the network and are optimized to capture the local and global structural changes between two networks. We then integrate the results of this algorithm into the proposed differential subnetwork algorithm, DiffSubNet, which aims to identify sets of differentially connected nodes.
We demonstrated the effectiveness of these algorithms on synthetic datasets and real-world applications and showed that they identified meaningful and valuable information compared to some of the baseline methods that can be used for such a task. Thirdly, we propose a novel differential co-clustering approach to efficiently find arbitrarily positioned differential (or discriminative) co-clusters in large datasets. The goal of this approach is to discover a distinguishing set of gene patterns that are highly correlated in a subset of the samples (subspace co-expressions) in one phenotype but not in the other. This approach is useful when the biological samples are assumed to be heterogeneous or to have multiple subtypes. To achieve this goal, we propose a novel co-clustering algorithm, Ranking-based Arbitrarily Positioned Overlapping Co-Clustering (RAPOCC), to efficiently extract significant co-clusters. This algorithm optimizes a novel ranking-based objective function to find arbitrarily positioned co-clusters, and it can extract large and overlapping co-clusters containing both positively and negatively correlated genes. We then extend this algorithm to discover discriminative co-clusters by incorporating the class information into the co-cluster search process. The resulting algorithm, Discriminative RAPOCC (Di-RAPOCC), efficiently extracts discriminative co-clusters from labeled datasets. We also characterize the discriminative co-clusters and propose three novel measures that can be used to evaluate the performance of any discriminative subspace algorithm. We evaluated the proposed algorithms on several synthetic and real gene expression datasets, and our experimental results showed that the proposed algorithms outperformed several existing algorithms available in the literature.
The shift from single-gene analysis to differential gene network analysis and differential co-clustering can play a crucial role in the future analysis of gene expression and can help in understanding the mechanisms of various diseases.
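
The abstract above names differential connectivity as one of the measures used to rank nodes but does not define it. The following is a minimal sketch under a simplifying assumption: a node's differential connectivity is taken to be the sum of absolute edge-weight differences at that node between the two networks. The function names, the edge representation, and the toy data are hypothetical, and the sketch omits the propagation and optimization steps the abstract mentions, so it should not be read as the DiffRank algorithm itself.

```python
def differential_connectivity(net_a, net_b):
    """Sum of absolute edge-weight differences at each node.

    net_a and net_b map an undirected edge, given as a frozenset of two
    node names, to a weight. An edge missing from a network counts as 0.
    This definition is an assumption made for illustration only.
    """
    scores = {}
    for edge in set(net_a) | set(net_b):
        diff = abs(net_a.get(edge, 0.0) - net_b.get(edge, 0.0))
        for node in edge:
            scores[node] = scores.get(node, 0.0) + diff
    return scores


def rank_nodes(scores):
    """Order nodes from most to least differential (ties broken by name)."""
    return sorted(scores, key=lambda n: (-scores[n], n))


# Toy data: gene "g1" is connected to g2 and g3 in one phenotype only.
healthy = {
    frozenset({"g1", "g2"}): 1.0,
    frozenset({"g1", "g3"}): 1.0,
    frozenset({"g2", "g3"}): 1.0,
}
disease = {frozenset({"g2", "g3"}): 1.0}

scores = differential_connectivity(healthy, disease)
ranking = rank_nodes(scores)
print(ranking)
```

In this toy case "g1" loses both of its edges and therefore ranks first, which is the kind of differential hub the approach is meant to surface.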

    Correlation Clustering

    Knowledge Discovery in Databases (KDD) is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data. The core step of the KDD process is the application of a data mining algorithm in order to produce a particular enumeration of patterns and relationships in large databases. Clustering is one of the major data mining techniques and aims at grouping the data objects into meaningful classes (clusters) such that the similarity of objects within clusters is maximized and the similarity of objects from different clusters is minimized. This can serve to group customers with similar interests, or to group genes with related functionalities. A current challenge for clustering techniques is posed by high-dimensional feature spaces. Due to modern facilities of data collection, real data sets usually contain many features. These features are often noisy or correlated with each other. However, since these effects are relevant to differing degrees in different parts of the data set, irrelevant features cannot be discarded in advance. The selection of relevant features must therefore be integrated into the data mining technique. For about 10 years, specialized clustering approaches have been developed to cope with problems in high-dimensional data better than classic clustering approaches. Often, however, problems of very different natures are not distinguished from one another. A main objective of this thesis is therefore a systematic classification of the diverse approaches developed in recent years according to their task definition, their basic strategy, and their algorithmic approach. We discern as main categories the search for clusters (i) w.r.t. closeness of objects in axis-parallel subspaces, (ii) w.r.t. common behavior (patterns) of objects in axis-parallel subspaces, and (iii) w.r.t. closeness of objects in arbitrarily oriented subspaces (so-called correlation clusters).
For the third category, the remaining parts of the thesis describe novel approaches. A first approach is the adaptation of density-based clustering to the problem of correlation clustering. The starting point here is the first density-based approach in this field, the algorithm 4C. Subsequently, enhancements and variations of this approach are discussed that allow for more robust, more efficient, or more effective behavior, or that even find hierarchies of correlation clusters and the corresponding subspaces. The density-based approach to correlation clustering, however, is fundamentally unable to solve some issues because it requires an analysis of local neighborhoods, which is problematic in high-dimensional data. Therefore, a novel method is proposed that tackles the correlation clustering problem with a global approach. In addition, a method is proposed to derive models for correlation clusters to allow for an interpretation of the clusters and to facilitate more thorough analysis in the corresponding domain science. Finally, possible applications of these models are proposed and discussed.
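
To illustrate what "closeness of objects in arbitrarily oriented subspaces" means, the sketch below builds a toy 2-D point set lying on a line that is not parallel to either axis and inspects the eigenvalues of its covariance matrix. This is only an illustration of the notion of a correlation cluster, assuming a plain covariance analysis of a single group of points; it is not the 4C algorithm or any other method from the thesis, and the helper names are hypothetical.

```python
import math


def covariance_2d(points):
    """Population covariance entries (sxx, sxy, syy) of 2-D points."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    sxx = sum((x - mx) ** 2 for x, _ in points) / n
    syy = sum((y - my) ** 2 for _, y in points) / n
    sxy = sum((x - mx) * (y - my) for x, y in points) / n
    return sxx, sxy, syy


def eigenvalues_2d(sxx, sxy, syy):
    """Eigenvalues of the symmetric matrix [[sxx, sxy], [sxy, syy]]."""
    mean = (sxx + syy) / 2
    radius = math.hypot((sxx - syy) / 2, sxy)
    return mean + radius, mean - radius  # largest first


# Points on the line y = 2x + 1: spread out along both axes,
# yet confined to a one-dimensional, arbitrarily oriented subspace.
points = [(x, 2 * x + 1) for x in range(10)]
sxx, sxy, syy = covariance_2d(points)
large, small = eigenvalues_2d(sxx, sxy, syy)
print(sxx, syy, large, small)
```

Both axis-parallel variances are large, so no single feature can be dropped, but the smallest eigenvalue is numerically zero: the points are close together only in a rotated subspace, which is the situation category (iii) describes.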

    A conceptual framework and taxonomy of techniques for analyzing movement

    Movement data link together space, time, and objects positioned in space and time. They hold valuable and multifaceted information about moving objects, properties of space and time, and events and processes occurring in space and time. We present a conceptual framework that describes in a systematic and comprehensive way the possible types of information that can be extracted from movement data and, on this basis, defines the respective types of analytical tasks. Tasks are distinguished according to the type of information they target and according to the level of analysis, which may be elementary (i.e. addressing specific elements of a set) or synoptic (i.e. addressing a set or subsets). We also present a taxonomy of generic analytic techniques, in which the types of tasks are linked to the corresponding classes of techniques that can support fulfilling them. We include techniques from several research fields: visualization and visual analytics, geographic information science, database technology, and data mining. We expect the taxonomy to be valuable for analysts and researchers. Analysts will receive guidance in choosing suitable analytic techniques for their data and tasks. Researchers will learn what approaches exist in different fields and can compare or relate them to the approaches they are going to undertake.

    Data mining using concepts of independence, unimodality and homophily

    With the widespread use of information technologies, more and more complex data is generated and collected every day. Such complex data varies in structure, size, type and format, e.g. time series, texts, images, videos and graphs. Complex data is often high-dimensional and heterogeneous, which makes separating the wheat (knowledge) from the chaff (noise) more difficult. Clustering is a main mode of knowledge discovery from complex data; it groups objects in such a way that intra-group objects are more similar than inter-group objects. Traditional clustering methods such as k-means, Expectation-Maximization clustering (EM), DBSCAN and spectral clustering are either deceived by "the curse of dimensionality" or spoiled by heterogeneous information. So how can complex data be explored effectively? In some cases, only partial information about the complex data is available. For example, in social networks, not every user provides profile information such as personal interests. Can we leverage the limited user information and the friendship network wisely to infer the likely labels of the unlabeled users so that advertisers can target their advertising accurately? This is the problem of learning from labeled and unlabeled data, which the literature refers to as semi-supervised classification. To gain insights into these problems, this thesis focuses on developing clustering and semi-supervised classification methods that are driven by the concepts of independence, unimodality and homophily. The proposed methods leverage techniques from diverse areas, such as statistics, information theory, graph theory, signal processing, optimization and machine learning. Specifically, this thesis develops four methods: FUSE, ISAAC, UNCut, and wvGN. FUSE and ISAAC are clustering techniques to discover statistically independent patterns from high-dimensional numerical data.
UNCut is a clustering technique to discover unimodal clusters in attributed graphs in which not all the attributes are relevant to the graph structure. wvGN is a semi-supervised classification technique that uses the theory of homophily to infer the labels of the unlabeled vertices in graphs. We have verified our clustering and semi-supervised classification methods on various synthetic and real-world data sets. The results are superior to those of state-of-the-art methods.
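
The social-network question raised in this abstract rests on homophily: connected users tend to share labels. The sketch below shows the idea with a simple synchronous majority vote among already-labeled neighbors. It is a generic illustration under that assumption and is not the wvGN method, whose details the abstract does not give; the function name, the tie-breaking rule, and the toy friendship graph are all hypothetical.

```python
from collections import Counter


def propagate_labels(adjacency, seed_labels, max_rounds=10):
    """Give each unlabeled node the majority label of its labeled neighbors.

    Updates are applied synchronously at the end of each round, and ties
    are broken alphabetically by label so that the result is deterministic.
    """
    labels = dict(seed_labels)
    for _ in range(max_rounds):
        updates = {}
        for node in sorted(adjacency):
            if node in labels:
                continue
            votes = Counter(labels[n] for n in adjacency[node] if n in labels)
            if votes:
                updates[node] = min(votes, key=lambda lab: (-votes[lab], lab))
        if not updates:
            break
        labels.update(updates)
    return labels


# Toy friendship network: two tight groups joined by the edge cat-dan.
friends = {
    "ann": ["bob", "cat"],
    "bob": ["ann", "cat"],
    "cat": ["ann", "bob", "dan"],
    "dan": ["cat", "eve", "fay"],
    "eve": ["dan", "fay"],
    "fay": ["dan", "eve"],
}
seeds = {"ann": "sports", "bob": "sports", "eve": "music", "fay": "music"}

inferred = propagate_labels(friends, seeds)
print(inferred["cat"], inferred["dan"])
```

Here "cat" and "dan" provide no profile information, yet each inherits the interest that dominates among their friends.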

    On the topology of network fine structures

    Multi-relational dynamics are ubiquitous in many complex systems, such as transportation, social, and biological systems. This thesis studies the two mathematical objects that encapsulate these relationships: multiplexes and interval graphs. The former is the modern outlook in Network Science for generalizing the edges in graphs, while the latter was popularized during the 1960s in Graph Theory. Although multiplexes and interval graphs are nearly 50 years apart, their motivations are similar, and it is worthwhile to investigate their structural connections and properties. This thesis looks into these mathematical objects and presents their connections. For example, we look at the community structures in multiplexes and learn how unstable the detection algorithms are, which can lead researchers to the wrong conclusions. Thus it is important to make the formalism precise, and this thesis shows that the complexity of interval graphs is an indicator of this precision. However, this measure of complexity is a computationally hard problem in Graph Theory, so we use a heuristic strategy from Network Science to tackle the problem. One of the main contributions of this thesis is the compilation of the disparate literature on these mathematical objects. The novelty of this contribution lies in using statistical tools from population biology to estimate the completeness of this thesis's bibliography. The same tools can also be used as a framework for researchers to quantify the comprehensiveness of their preliminary investigations. From the large body of multiplex research, the thesis focuses on the statistical properties of the projection of multiplexes (the reduction of a multi-relational system to a single-relationship network). This is important because projection is often used as the baseline for many relevant algorithms, and its topology offers insight into the dynamics of the system.
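
The projection described above, which reduces a multi-relational system to a single-relationship network, can be sketched as follows. The sketch assumes one common convention: each layer is a list of undirected edges, and the projected edge weight counts the layers in which the edge appears. The abstract does not say which projection the thesis analyzes, so the function name, the weighting, and the toy transport layers are hypothetical.

```python
def project_multiplex(layers):
    """Collapse a multiplex into one weighted undirected graph.

    layers maps a layer name to a list of (u, v) edges. The result maps
    each edge, stored as a frozenset of its endpoints, to the number of
    layers that contain it. Duplicate edges within a layer count once.
    """
    projected = {}
    for edges in layers.values():
        for key in {frozenset(edge) for edge in edges}:
            projected[key] = projected.get(key, 0) + 1
    return projected


# Toy transport multiplex with two relationship types.
layers = {
    "bus": [("a", "b"), ("b", "c")],
    "rail": [("b", "a"), ("c", "d")],
}

projected = project_multiplex(layers)
print(sorted((sorted(edge), weight) for edge, weight in projected.items()))
```

The projected graph keeps every connection but no longer records which relationship produced it, which is why its topology serves only as a baseline for multiplex-aware methods.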