
    M3G: Maximum Margin Microarray Gridding

    Abstract
    Background: Complementary DNA (cDNA) microarrays are a well-established technology for studying gene expression. A microarray image is obtained by laser scanning a hybridized cDNA microarray, which consists of thousands of spots, representing chains of cDNA sequences, arranged in a two-dimensional array. The separation of the spots into distinct cells is widely known as microarray image gridding.
    Methods: In this paper we propose M3G, a novel method for automatic gridding of cDNA microarray images based on maximizing the margin between the rows and the columns of the spots. First the rotation of the microarray image is estimated, and a pre-processing algorithm is then applied for rough spot detection. To diminish the effect of artefacts, only a subset of the detected spots is selected, by matching the distribution of spot sizes to the normal distribution. A set of grid lines is then placed on the image to separate each pair of consecutive rows and columns of the selected spots. The optimal positioning of the lines is determined by maximizing the margin between these rows and columns using a maximum margin linear classifier, effectively facilitating the localization of the spots.
    Results: The experimental evaluation was based on a reference set of microarray images containing more than two million spots in total. The results show that M3G outperforms state-of-the-art methods, demonstrating robustness in the presence of noise and artefacts. More than 98% of the spots reside completely inside their respective grid cells, and the mean distance between spot center and grid cell center is 1.2 pixels.
    Conclusions: The proposed method performs highly accurate gridding in the presence of noise and artefacts, while taking the rotation of the input image into account. It thus offers the potential of perfect gridding for the vast majority of the spots.
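Projected onto one axis, the maximum-margin separation described above reduces to a midpoint rule: the optimal grid line between two consecutive, separable rows of spot centres lies halfway between the two closest spots (the 1-D analogue of the support vectors). A minimal sketch with invented coordinates; the actual M3G pipeline works on detected spot centres after rotation estimation:

```python
def max_margin_line(row_a, row_b):
    """Return the coordinate of the separating grid line that maximises
    the margin between two 1-D point sets (assumed linearly separable)."""
    lo, hi = (row_a, row_b) if max(row_a) < min(row_b) else (row_b, row_a)
    support_lo, support_hi = max(lo), min(hi)   # the two closest spots
    return (support_lo + support_hi) / 2.0      # midpoint = max-margin line

row1 = [10.2, 11.0, 9.8, 10.5]   # y-coordinates of spots in one row (invented)
row2 = [20.1, 19.5, 20.8, 19.9]  # y-coordinates of the next row (invented)
print(max_margin_line(row1, row2))  # 15.25
```

The same computation, applied per pair of consecutive rows and columns, yields the full grid.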

    Algorithmic Techniques in Gene Expression Processing. From Imputation to Visualization

    The amount of biological data has grown exponentially in recent decades. Modern biotechnologies, such as microarrays and next-generation sequencing, are capable of producing massive amounts of biomedical data in a single experiment. As the amount of data is rapidly growing, there is an urgent need for reliable computational methods for analyzing and visualizing it. This thesis addresses this need by studying how to efficiently and reliably analyze and visualize high-dimensional data, especially data obtained from gene expression microarray experiments. First, we study ways to improve the quality of microarray data by replacing (imputing) the missing data entries with estimated values. Missing value imputation is commonly used to make the original incomplete data complete, thus making it easier to analyze with statistical and computational methods. Our novel approach was to use curated external biological information as a guide for the missing value imputation. Secondly, we studied the effect of missing value imputation on downstream data analysis methods such as clustering. We compared multiple recent imputation algorithms on eight publicly available microarray data sets. We observed that missing value imputation is indeed a rational way to improve the quality of biological data. The research revealed differences between the clustering results obtained with different imputation methods. On most data sets, the simple and fast k-NN imputation was good enough, but there was also a need for more advanced imputation methods, such as Bayesian Principal Component Analysis (BPCA). Finally, we studied the visualization of biological network data. Biological interaction networks are examples of the outcome of multiple biological experiments, such as those using gene microarray techniques. Such networks are typically very large and highly connected, so there is a need for fast algorithms that produce visually pleasing layouts. We developed a computationally efficient way to produce layouts of large biological interaction networks. The algorithm uses multilevel optimization within the regular force-directed graph layout algorithm.
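The k-NN imputation mentioned above can be sketched as follows. The toy expression matrix, the choice of Euclidean distance over co-observed columns, and `None` as the missing-value marker are illustrative assumptions, not the thesis's exact procedure:

```python
import math

def knn_impute(matrix, k=2):
    """Minimal k-NN imputation sketch: fill each missing entry (None) with
    the average of that column over the k rows most similar to the target
    row (Euclidean distance over columns observed in both rows)."""
    def dist(a, b):
        shared = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
        if not shared:
            return float("inf")
        return math.sqrt(sum((x - y) ** 2 for x, y in shared) / len(shared))

    filled = [row[:] for row in matrix]
    for i, row in enumerate(matrix):
        for j, val in enumerate(row):
            if val is None:
                # rank the other rows that observe column j by similarity
                donors = sorted(
                    (r for r in matrix if r is not row and r[j] is not None),
                    key=lambda r: dist(row, r),
                )[:k]
                filled[i][j] = sum(r[j] for r in donors) / len(donors)
    return filled

genes = [[1.0, 2.0, 3.0],   # toy gene-expression rows (invented values)
         [1.1, None, 3.1],
         [0.9, 2.1, 2.9],
         [5.0, 8.0, 9.0]]
print(knn_impute(genes, k=2)[1][1])  # ≈ 2.05, average of the two nearest rows
```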

    Whole-transcriptome analysis of [PSI+] budding yeast via cDNA microarrays

    Introduction: Prions of yeast present a novel analytical challenge, in terms of both initial characterization and in vitro manipulation, as models for human disease research. At present, few robust analysis strategies have been successfully implemented that enable the efficient study of prion behavior in vivo. This study sought to evaluate the use of conventional dual-channel cDNA microarrays for the surveillance of transcriptomic regulation patterns by the [PSI+] yeast prion relative to an identical prion-deficient yeast variant, [psi-]. Methods: A data analysis and normalization workflow was developed and applied to cDNA array images, yielding quality-regulated expression ratios for a subset of genes exhibiting statistical congruence across multiple experimental repetitions and nested hybridization events. The significant gene list was analyzed using classical analytical approaches, including several clustering-based methods and singular value decomposition. To add biological meaning to the differential expression data in hand, functional annotation using the Gene Ontology as well as several pathway-mapping approaches was conducted. Finally, the expression patterns observed were queried against all publicly curated microarray data for S. cerevisiae in order to discover similar expression behavior across a vast array of experimental conditions. Results: These data collectively implicate a low level of overall genomic regulation as a result of the [PSI+] state: the maximum statistically significant degree of differential expression was less than ±1 log2(FC) in all cases. Notwithstanding, the [PSI+] differential expression was localized to several specific classes of structural elements and cellular functions, implying that under homeostatic conditions significant up- or down-regulation is likely unnecessary, but possible in those specific systems should environmental conditions warrant it.
    As a result of these findings, additional work on this system should include controlled insult to both yeast variants under differing environmental conditions, to promote a potential [PSI+] regulatory response, coupled with co-surveillance of these conditions using transcriptomic and proteomic analysis methodologies.
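The dual-channel expression ratios discussed above are conventionally summarised as log2 fold changes of the two channel intensities, and the abstract's ±1 log2(FC) bound is then a simple threshold on their magnitude. A minimal sketch; the gene names and intensities below are invented for illustration and are not data from the study:

```python
import math

def log2_fold_change(cy5, cy3):
    """log2 ratio of the two channel intensities for one spot."""
    return math.log2(cy5 / cy3)

# Hypothetical (Cy5, Cy3) intensity pairs, one per gene.
spots = {"SUP35": (1800.0, 1500.0), "HSP104": (900.0, 2100.0)}
ratios = {gene: log2_fold_change(a, b) for gene, (a, b) in spots.items()}

# Flag genes whose regulation exceeds the |log2 FC| >= 1 bound.
significant = {g: r for g, r in ratios.items() if abs(r) >= 1.0}
print(sorted(significant))  # ['HSP104']
```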

    Towards Data-Driven Large Scale Scientific Visualization and Exploration

    Technological advances have enabled us to acquire extremely large datasets, but it remains a challenge to store, process, and extract information from them. This dissertation builds upon recent advances in machine learning, visualization, and user interactions to facilitate exploration of large-scale scientific datasets. First, we use data-driven approaches to computationally identify regions of interest in the datasets. Second, we use visual presentation for effective user comprehension. Third, we provide interactions for human users to integrate domain knowledge and semantic information into this exploration process. Our research shows how to extract, visualize, and explore informative regions in very large 2D landscape images, 3D volumetric datasets, high-dimensional volumetric mouse brain datasets with thousands of spatially-mapped gene expression profiles, and geospatial trajectories that evolve over time. The contributions of this dissertation include: (1) We introduce a sliding-window saliency model that discovers regions of user interest in very large images; (2) We develop visual segmentation of intensity-gradient histograms to identify meaningful components from volumetric datasets; (3) We extract boundary surfaces from a wealth of volumetric gene expression mouse brain profiles to personalize the reference brain atlas; (4) We show how to efficiently cluster geospatial trajectories by mapping each sequence of locations to a high-dimensional point with the kernel distance framework. We aim to discover patterns, relationships, and anomalies that would lead to new scientific, engineering, and medical advances. This work represents one of the first steps toward better visual understanding of large-scale scientific data by combining machine learning and human intelligence.
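The trajectory-clustering idea in contribution (4) rests on embedding each variable-length sequence of locations as a fixed-length vector so that ordinary vector-space methods apply. A hedged sketch using a coarse occupancy histogram as the embedding; the grid resolution, extent, and trajectories are invented, and the kernel-distance framework in the dissertation is more sophisticated than this:

```python
import math

def embed(trajectory, grid=4, extent=100.0):
    """Map a trajectory of (x, y) points to a normalised occupancy
    histogram over a grid x grid partition of [0, extent)^2."""
    hist = [0.0] * (grid * grid)
    for x, y in trajectory:
        cx = min(int(x / extent * grid), grid - 1)
        cy = min(int(y / extent * grid), grid - 1)
        hist[cy * grid + cx] += 1.0
    total = sum(hist)
    return [h / total for h in hist]

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

walk_a = [(5, 5), (10, 8), (15, 12)]     # stays in the lower-left corner
walk_b = [(6, 4), (12, 9), (14, 14)]     # a similar route
walk_c = [(90, 90), (85, 95), (99, 80)]  # opposite corner
e = [embed(w) for w in (walk_a, walk_b, walk_c)]
print(dist(e[0], e[1]) < dist(e[0], e[2]))  # True: similar routes embed nearby
```

Once trajectories are points, any standard clustering algorithm can group them.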

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments benefits from knowledge about the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free-text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the databases used (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources.
    Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to specify relation types further, e.g., as an activating, direct physical, or gene regulatory relation. Part II deals with gene expression data analysis: Gene expression data needs to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variances between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed, which is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b).
    Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between Parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters which contain genes that are relevant for the respective experiment, together with literature information that supports interpretation. Finally, Chapter 11 presents ideas on how the described methods can contribute to current research, and possible future directions.
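The dictionary-based recognition of Chapter 4 can be sketched at its simplest as a lookup of surface forms against a curated dictionary that maps synonyms to unique database identifiers. The entries, identifiers, and whitespace tokenisation below are invented for illustration; the actual systems handle far richer name variation:

```python
# Hypothetical dictionary: surface form (lower-cased) -> database identifier.
dictionary = {
    "tp53": "GeneID:7157",
    "p53": "GeneID:7157",   # synonym mapped to the same identifier
    "brca1": "GeneID:672",
}

def tag_genes(text):
    """Return (mention, identifier) pairs found in the text by
    case-insensitive dictionary lookup over whitespace tokens."""
    hits = []
    for token in text.replace(",", " ").replace(".", " ").split():
        ident = dictionary.get(token.lower())
        if ident:
            hits.append((token, ident))
    return hits

print(tag_genes("Mutations in TP53 and BRCA1 were studied."))
# [('TP53', 'GeneID:7157'), ('BRCA1', 'GeneID:672')]
```

Mapping mentions to identifiers is what allows text-derived relations to be joined with expression data in Part III.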

    Multimodal Image Fusion and Its Applications.

    Image fusion integrates images of different modalities to provide comprehensive information about the image content, increasing interpretation capabilities and producing more reliable results. Combining multi-modal images has several advantages, including improved geometric corrections, complementary data for improved classification, and enhanced features for analysis. This thesis develops the image fusion idea in the context of two domains: material microscopy and biomedical imaging. The proposed methods include image modeling, image indexing, image segmentation, and image registration. The common theme behind all proposed methods is the use of complementary information from multi-modal images to achieve better registration, feature extraction, and detection performance. In material microscopy, we propose an anomaly-driven image fusion framework to perform the task of material microscopy image analysis and anomaly detection. This framework is based on a probabilistic model that enables us to index, process and characterize the data with systematic and well-developed statistical tools. In biomedical imaging, we focus on the multi-modal registration problem for functional MRI (fMRI) brain images, which improves the performance of brain activation detection.
    PhD, Electrical Engineering: Systems, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/120701/1/yuhuic_1.pd

    Searching and mining in enriched geo-spatial data

    The emergence of new data collection mechanisms in geo-spatial applications, paired with a heightened tendency of users to volunteer information, provides an ever-increasing flow of data of high volume and complex nature, often associated with inherent uncertainty. Such mechanisms include crowdsourcing, automated knowledge inference, tracking, and social media data repositories. Data bearing additional information from multiple sources, such as probability distributions, text or numerical attributes, social context, or multimedia content, can be called multi-enriched. Searching and mining this abundance of information holds many challenges if all of the data's potential is to be unlocked. This thesis addresses several major issues arising in that field, namely path queries using multi-enriched data, trend mining in social media data, and handling uncertainty in geo-spatial data. In all cases, the developed methods have made significant contributions and have appeared in or been accepted at renowned international peer-reviewed venues. A common use of geo-spatial data is path queries in road networks, where traditional methods optimise results based on absolute and often singular metrics, i.e., finding the shortest paths based on distance or the best trade-off between distance and travel time. Integrating additional aspects such as qualitative or social data, by enriching the data model with knowledge derived from sources like those mentioned above, allows for queries that fit a broader scope of needs or preferences. This thesis presents two implementations of incorporating multi-enriched data into road networks. In one case, a range of qualitative data sources is evaluated to gain knowledge about user preferences, which is subsequently matched with locations represented in a road network and integrated into its components. Several methods are presented for highly customisable path queries that incorporate a wide spectrum of data.
    In a second case, a framework is described for resource distribution with reappearance in road networks, serving one or more clients and resulting in paths that provide maximum gain based on a probabilistic evaluation of available resources. Applications for this include finding parking spots. Social media trends are an emerging research area, giving insight into user sentiment and important topics. Such trends consist of bursts of messages concerning a certain topic within a time frame, significantly deviating from the average appearance frequency of the same topic. By investigating the dissemination of such trends in space and time, this thesis presents methods to classify trend archetypes and predict the future dissemination of a trend. Processing and querying uncertain data is particularly demanding, given the additional knowledge required to yield results with probabilistic guarantees. Since such knowledge is not always available, and queries do not easily scale to larger datasets due to the #P-complete nature of the problem, many existing approaches reduce the data to a deterministic representation of its underlying model to eliminate uncertainty. However, data uncertainty can also provide valuable insight into the nature of the data that cannot be represented in a deterministic manner. This thesis presents techniques for clustering and query processing over uncertain data that take the additional information from uncertainty models into account while preserving scalability via a sampling-based approach; previous approaches could provide only one of the two.
    The given solutions enable the application of various existing clustering techniques or query types to a framework that manages the uncertainty.
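One concrete way to realise the preference-aware path queries described above is to let each road-network edge carry several enriched attributes and blend them into a single cost via a user-supplied weight vector before running a standard shortest-path search. A hedged sketch; the graph, attribute names (`dist`, `scenery`), and weights are invented, and the thesis's query machinery is considerably richer:

```python
import heapq

def best_path(graph, src, dst, weights):
    """Dijkstra over a graph whose edges carry attribute dicts; the user's
    weight vector blends the attributes into one non-negative edge cost.
    graph: node -> list of (neighbour, attribute dict)."""
    pq, seen = [(0.0, src, [src])], set()
    while pq:
        cost, node, path = heapq.heappop(pq)
        if node == dst:
            return cost, path
        if node in seen:
            continue
        seen.add(node)
        for nxt, attrs in graph.get(node, []):
            step = sum(weights[k] * attrs[k] for k in weights)
            heapq.heappush(pq, (cost + step, nxt, path + [nxt]))
    return None

# Invented toy network with two enriched attributes per edge.
g = {"A": [("B", {"dist": 2.0, "scenery": 5.0}), ("C", {"dist": 1.0, "scenery": 9.0})],
     "B": [("D", {"dist": 2.0, "scenery": 5.0})],
     "C": [("D", {"dist": 4.0, "scenery": 1.0})]}

# A purely distance-minimising user ignores the scenery attribute.
print(best_path(g, "A", "D", {"dist": 1.0, "scenery": 0.0})[1])  # ['A', 'B', 'D']
```

Changing the weight vector changes which route wins, which is the essence of a customisable path query.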

    Lossless compression algorithms for microarray images and whole genome alignments

    Doctoral programme in Informatics. Nowadays, in the 21st century, the never-ending expansion of information is a major global concern. The pace at which storage and communication resources are evolving is not fast enough to compensate for this tendency. To overcome this issue, sophisticated and efficient compression tools are required. The goal of compression is to represent information with as few bits as possible. There are two kinds of compression, lossy and lossless. In lossless compression, information loss is not tolerated, so the decoded information is exactly the same as the encoded information. On the other hand, in lossy compression some loss is acceptable. In this work we focused on lossless methods. The goal of this thesis was to create lossless compression tools for two types of data. The first type is known in the literature as microarray images. These images have 16 bits per pixel and a high spatial resolution. The other data type is commonly called Whole Genome Alignments (WGA), in particular as applied to MAF files. Regarding the microarray images, we improved existing microarray-specific methods by using pre-processing techniques (segmentation and bitplane reduction). Moreover, we also developed a compression method based on pixel value estimates and a mixture of finite-context models. Furthermore, an approach based on binary-tree decomposition was also considered. Two compression tools were developed to compress MAF files. The first is based on a mixture of finite-context models and arithmetic coding, where only the DNA bases and alignment gaps are considered. The second tool, designated MAFCO, is a complete compression tool that can handle all the information found in MAF files. MAFCO relies on several finite-context models and allows parallel compression/decompression of MAF files.
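A finite-context model of the kind used above estimates P(symbol | previous k symbols) from counts; an arithmetic coder then turns those conditional probabilities into a bitstream (the coder itself is omitted here). A minimal order-k sketch with an invented DNA sequence and Laplace smoothing as an assumed smoothing choice:

```python
from collections import Counter, defaultdict

def train(seq, k=2, alphabet="ACGT-"):
    """Build an order-k finite-context model over seq and return a
    Laplace-smoothed conditional probability function."""
    counts = defaultdict(Counter)
    for i in range(k, len(seq)):
        counts[seq[i - k:i]][seq[i]] += 1  # context -> next-symbol counts

    def prob(context, symbol):
        c = counts[context]
        return (c[symbol] + 1) / (sum(c.values()) + len(alphabet))

    return prob

p = train("ACGTACGTACGA", k=2)  # invented sequence; '-' would mark gaps
print(p("AC", "G") > p("AC", "T"))  # True: "G" nearly always follows "AC"
```

Mixing several such models of different orders, as the tools above do, lets the coder adapt to both local and long-range regularities.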

    Advances in knowledge discovery and data mining Part II

    19th Pacific-Asia Conference, PAKDD 2015, Ho Chi Minh City, Vietnam, May 19-22, 2015, Proceedings, Part II

    Contextual Analysis of Gene Expression Data

    While measurement of gene expression using microarrays has become a standard high-throughput method in molecular biology, the analysis of gene expression data is still a very active area of research in bioinformatics and statistics. Despite some issues in the quality and reproducibility of microarray and derived data, they are still considered one of the most promising experimental techniques for understanding complex molecular mechanisms. This work approaches the problem of expression data analysis using contextual information. While all analyses must be based on sound statistical data processing, it is also important to include biological knowledge to arrive at biologically interpretable results. After an introduction and some biological background, chapter 2 reviews some standard methods for the analysis of microarray data, including normalization, computation of differentially expressed genes, and clustering. The first source of context information used to aid in the interpretation of the data is functional annotation of genes. Such information is often represented using ontologies such as the Gene Ontology (GO). GO annotations are provided by many gene and protein databases and have been used to find functional groups that are significantly enriched in differentially expressed or otherwise conspicuous genes. In gene clustering approaches, functional annotations have been used to find enriched functional classes within each cluster. In chapter 3, a clustering method for the samples of an expression data set is described that uses GO annotations during the clustering process in order to find functional classes that imply a particularly strong separation of the samples. The resulting clusters can be interpreted more easily in terms of GO classes. The clustering method was developed in joint work with Henning Redestig. More complex biological information, covering interactions between biological objects, is contained in networks.
    Such networks can be obtained from public databases of metabolic pathways, signaling cascades, transcription factor binding sites, or high-throughput measurements for the detection of protein-protein interactions, such as yeast two-hybrid experiments. Furthermore, networks can be inferred using literature mining approaches or network inference from expression data. The information contained in such networks is very heterogeneous with respect to the type, quality, and completeness of the contained data. ToPNet, a software tool for the interactive analysis of networks and gene expression data, has been developed in cooperation with Daniel Hanisch. The basic analysis and visualization methods as well as some important concepts of this tool are described in chapter 4. In order to access the heterogeneous data represented as networks with annotated experimental data and functions, it is important to provide advanced querying functionality. Pathway queries allow the formulation of network templates that can include functional annotations as well as expression data. The pathway search algorithm finds all instances of the template in a given network. To do so, a special case of the well-known subgraph isomorphism problem has to be solved. Although the algorithm has exponential running time in the worst case, some implementation tricks make it run fast enough for practical purposes. Often a pathway query has many matching instances, and it is important to assess the statistical significance of the individual instances with respect to expression data or other criteria. In chapter 5, the pathway query language and the pathway search algorithm are described in detail and some theoretical properties are derived. Furthermore, some scoring methods that have been implemented are described. The possibility of combining different scoring schemes for different parts of the query results in very flexible scoring capabilities.
In chapter 6, some applications of the methods are described, using public data sets as well as data sets from research projects. On the basis of the well studied public data sets, it is demonstrated that the methods yield biologically meaningful results. The other analyses show how new hypotheses can be generated in more complex biological systems, but the validation of these hypotheses can only be provided by new experiments. Finally, an outlook is given on how the presented methods can contribute to ongoing research efforts in the area of expression data analysis, their applicability to other types of data (such as proteomics data) and their possible extensions.Während die Messung von RNA-Konzentrationen mittels Microarrays eine Standardtechnik zur genomweiten Bestimmung von Genexpressionswerten geworden ist, ist die Analyse der dabei gewonnenen Daten immer noch ein Gebiet äußerst aktiver Forschung. Trotz einiger Probleme bezüglich der Reproduzierbarkeit von Microarray- und davon abgeleiteten Daten werden diese als eine der vielversprechendsten Technologien zur Aufklärung komplexer molekularer Mechanismen angesehen. Diese Arbeit beschäftigt sich mit dem Problem der Expressionsdatenanalyse mit Hilfe von Kontextinformationen. Alle Analysen müssen auf solider Statistik beruhen, aber es ist außerdem wichtig, biologisches Wissen einzubeziehen, um biologisch interpretierbare Ergebnisse zu erhalten. Nach einer Einleitung und einigem biologischen Hintergrund werden in Kapitel 2 einige Standardmethoden zur Analyse von Expressionsdaten vorgestellt, wie z.B. Normalisierung, Berechnung differenziell exprimierter Gene sowie Clustering. Die erste Quelle von Kontextinformationen, die zur besseren Interpretation der Daten herangezogen wird, ist funktionale Annotation von Genen. Solche Informationen werden oft mit Hilfe von Ontologien wie z.B. der Gene Ontology dargestellt. 
GO annotations are provided by many gene and protein databases and are used, among other things, to find functions that are significantly enriched in differentially expressed or otherwise conspicuous genes. In clustering methods, functional annotations are used to identify functions enriched in the resulting clusters. Chapter 3 presents a new clustering method for samples in expression data sets that uses GO annotations during clustering in order to find functions by which the expression data can be separated particularly clearly. The resulting clusters can be interpreted more easily with the help of the GO annotations. The clustering method was developed in collaboration with Henning Redestig. More complex biological information, which also comprises the interactions between biological objects, is contained in networks. Such networks can be obtained from public databases of metabolic pathways, signaling cascades, and transcription factor binding sites, but also from high-throughput experiments such as the yeast two-hybrid method. Furthermore, networks can be obtained through automatic mining of the scientific literature or through inference from expression data. The information contained in such networks is very heterogeneous with respect to the type, the quality, and the completeness of the data. ToPNet, a software tool for the interactive analysis of networks and gene expression data, was developed together with Daniel Hanisch. The basic analysis and visualization methods as well as some important concepts of this tool are described in chapter 4. In order to access the heterogeneous data represented as networks with functional annotations and expression data, it is important to provide flexible and powerful querying functionality.
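The enrichment question raised above is conventionally answered with a one-sided hypergeometric (Fisher) test: given N annotated genes of which K carry a GO term, how surprising is it to see k term-carriers among n differentially expressed genes? A minimal sketch, with illustrative names not taken from the thesis:

```python
from math import comb

def go_enrichment_pvalue(N, K, n, k):
    """P(X >= k) for X ~ Hypergeometric(N, K, n): probability of drawing
    at least k genes carrying the GO term when n genes are drawn without
    replacement from N genes, K of which carry the term."""
    total = comb(N, n)
    return sum(
        comb(K, i) * comb(N - K, n - i)
        for i in range(k, min(K, n) + 1)
    ) / total

# Hypothetical example: 20 annotated genes, 5 carry the term,
# 4 of 5 differentially expressed genes carry it -> p ~= 0.0049.
p = go_enrichment_pvalue(20, 5, 5, 4)
```

In practice such a p-value is computed per GO term and must then be corrected for testing many terms at once (e.g. Bonferroni or FDR control), a step omitted from this sketch.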
Pathway queries allow the formulation of network templates that can include functional annotations as well as expression data. The pathway search algorithm finds all instances of the template in a given network. To do so, a special case of the well-known subgraph isomorphism problem has to be solved. Although the algorithm has worst-case running time exponential in the size of the template, some implementation tricks make it run fast enough for practical applications. Often a pathway query has many instances, so it is important to assess the statistical significance of the individual instances with respect to expression data or other criteria. Chapter 5 presents the pathway query language and the pathway search algorithm in detail and derives some theoretical properties. Furthermore, some scoring methods that have been implemented are described. The possibility of scoring different parts of the query with different scoring methods and combining them into an overall score allows very flexible assessment of the instances. Chapter 6 describes some applications of the presented methods, based on public data sets as well as data sets from research projects. Using the well-studied public data sets, it is shown that the methods yield biologically meaningful results. The other analyses show how new hypotheses can be generated in more complex biological systems, although these hypotheses could only be validated by further biological experiments. Finally, an outlook is given on what the presented methods can contribute to ongoing research in the area of expression data analysis, how they can be applied to other types of data, and which extensions are conceivable and desirable.
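The idea of combining different scoring schemes for different parts of a query into one overall score can be sketched as a weighted combination of per-instance scorers. This is an assumed, simplified interface, not ToPNet's API; the scorer functions below are purely illustrative.

```python
def combined_score(instance, scorers):
    """Combine several scoring schemes into one overall score.
    instance: a matched query instance (here: node -> log-ratio);
    scorers: list of (weight, scoring_function) pairs."""
    return sum(weight * score(instance) for weight, score in scorers)

# Illustrative scorers: mean absolute log-ratio of the matched genes,
# and a penalty favoring compact instances.
expression_score = lambda inst: sum(abs(v) for v in inst.values()) / len(inst)
compactness_score = lambda inst: 1.0 / len(inst)

# Hypothetical two-gene instance: 1.0 * 1.5 + 0.5 * 0.5 = 1.75
overall = combined_score(
    {"A": 2.0, "B": -1.0},
    [(1.0, expression_score), (0.5, compactness_score)],
)
```

Because each scorer is an independent function, schemes can be swapped or reweighted per query part without touching the search itself, which mirrors the flexibility claimed above.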