52 research outputs found

    Métodos de clustering en datos de expresión génica

    Clustering is an old data analysis problem that has been studied extensively over the last decades. However, no single algorithm provides a satisfactory result for every data set, and several problems related to cluster analysis remain unsolved. In this monograph we study some of these problems as they commonly appear in practice, and test how our approaches to them perform when applied to gene expression data analysis, where clustering is widely used. Different clustering algorithms often lead to different results, and to make sense of them it is important to understand how the clusters from one analysis relate to those from another. We present a comparison method to find and visualize many-to-many relationships between two clusterings, either two flat clusterings or a flat and a hierarchical clustering. The similarities between clusters are represented by a weighted bipartite graph, where the nodes are the clusters and each edge weight is the number of elements shared by the two connected nodes. To visualize the relationships between clusterings, the number of edge crossings is minimized. When comparing a hierarchical and a flat clustering, we use a criterion based either on graph layout aesthetics or on mutual information to decide where to cut the hierarchical tree. Since iterative methods are sensitive to their initial parameters, we have developed two refinement algorithms, based on the notion of data depth, that improve this initial state. One of these algorithms looks for initial points in the original data space, while the second, using the bootstrap technique, selects the initial seeds in a new space of bootstrap centroids. This second approach also makes it possible to construct a soft clustering of the data, which assigns to each point a probability of belonging to each cluster, so that a single point may partially belong to more than one cluster.
The number of clusters underlying a data set is also usually unknown. Using ideas from the clustering comparison method described above and from the data depth concept, we present three procedures to estimate the number of real groups. The first two methods consist essentially of sampling pairs of clusterings from a population and successively comparing them to find a consensus on the number of clusters; the third looks for representative subsets of the clusters whose diameter is used to estimate the optimal number of real groups. The extensive study we carried out on simulated and real gene expression data shows that the techniques presented here are useful and efficient. The results obtained with real data make sense not only from a statistical point of view; they have also proven to be biologically meaningful.
Cluster analysis is an old problem that has been revived in recent decades. In this work we address some problems that arise in practice. To understand the different results produced by different algorithms, it is important to study the relationship between clusters obtained from different analyses; we therefore present a graph-based comparison method for visualizing relationships between hierarchical or non-hierarchical clusterings, using an aesthetics or mutual-information criterion to cut the dendrograms in the hierarchical case. We develop two algorithms that refine the initial state of iterative clustering methods, using the concepts of depth and the bootstrap. This also makes it possible to develop a soft clustering algorithm that assigns to each point probabilities of membership in the clusters. To determine the number of groups in a data set (usually unknown), we have used ideas from the comparison method and the concept of depth, developing three estimation techniques.
We have carried out an extensive study of all the proposed methods on simulated data and on gene expression data, and have shown that the techniques developed in this work are useful and efficient, from both a statistical and a biological point of view
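
The weighted bipartite graph described in this abstract can be illustrated with a minimal Python sketch. It assumes two flat clusterings given as label lists over the same elements; the function names `overlap_edges` and `mutual_information` are hypothetical, and the mutual information shown is the standard definition, which may differ from the exact criterion used in the thesis. Edge-crossing minimization is not covered.

```python
import math
from collections import Counter


def overlap_edges(labels_a, labels_b):
    """Edges of the bipartite graph: {(cluster_a, cluster_b): shared elements}."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same elements")
    # Element i belongs to cluster labels_a[i] and to cluster labels_b[i],
    # so counting label pairs gives the weight of each edge.
    return dict(Counter(zip(labels_a, labels_b)))


def mutual_information(labels_a, labels_b):
    """Standard mutual information (in nats) between two flat clusterings."""
    n = len(labels_a)
    size_a = Counter(labels_a)
    size_b = Counter(labels_b)
    edges = overlap_edges(labels_a, labels_b)
    return sum(
        (w / n) * math.log(w * n / (size_a[a] * size_b[b]))
        for (a, b), w in edges.items()
    )


# Toy data (illustrative only): eight elements clustered in two ways.
labels_a = [0, 0, 0, 1, 1, 2, 2, 2]
labels_b = ["x", "x", "y", "y", "y", "z", "z", "x"]
edges = overlap_edges(labels_a, labels_b)
for (a, b), weight in sorted(edges.items(), key=str):
    print(f"cluster {a} -- cluster {b}: {weight} shared elements")
print("mutual information:", mutual_information(labels_a, labels_b))
```

A many-to-many relationship shows up here as a cluster with more than one outgoing edge, for example cluster 0 of the first clustering, which is split between clusters x and y of the second.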

    Global trends in coronavirus research at the time of Covid-19: A general bibliometric approach and content analysis using SciMAT

    Covid-19 represents the greatest challenge facing mankind today. In December 2019, several cases of pneumonia of unknown etiology were reported from China. This coronavirus infection subsequently identified as Covid-19 aroused worldwide concern. As a result, the scientific community has focused attention on Covid-19, as revealed by recent research reported in literature based on a holistic approach. In this regard, this study conducts a bibliometric analysis of coronavirus research in the literature with an emphasis on Covid-19 disease, using as a reference the publications in the Web of Science Core Collection from 1970 to 2020. This research analyzes 12,571 publications from 1970 to (April 18) 2020 by applying advanced bibliometric techniques in SciMAT bibliometric analysis software. The current research therefore provides a complete conceptual analysis of the main coronavirus types and strains in the literature by quantifying the main bibliometric performance indicators, identifying the main authors, organizations, countries, sources, and research areas, and evaluating the development of this field. Furthermore, a science map is constructed to understand the corresponding intellectual structure and main research lines (themes). SciMAT thereby offers a complete approach to the field and evaluates the main performance indicators related to coronavirus, with a focus on Covid-19. Finally, this research serves as a framework to strengthen existing research lines and develop new ones, establishing synergistic relationships that were not visible without the maps generated herein

    Similitud funcional de genes basada en conocimiento biológico

    Programa de Doctorado en Tecnología e Ingeniería del Software. Over the last few years, our knowledge of biological processes in living organisms has greatly expanded in both quantity and resolution, mostly thanks to the introduction of high-throughput sequencing technology. Making sense of this vast amount of biological data through methods such as machine learning is therefore critical to gaining further insight into the molecular mechanisms behind fundamental biological processes. This work aims to establish the quality of new genetic models based on actual biological data. First, a tool is developed for analyzing the coherence of a group of genes according to their common role in metabolic processes. This tool allows the evaluation and validation of gene sets obtained through any clustering technique. Additionally, a novel measure of the functional similarity of a group of genes is introduced. This measure, called GFD, is based on the Gene Ontology, and it assigns a numerical value to a gene set for each of the three GO ontologies. Specifically, GFD computes the similarity based only on the most common and specific functionality of the genes. GFD compares favorably against the most relevant existing measures. Our approach is especially relevant in the study of genes that are involved in several functions. Universidad Pablo de Olavide de Sevilla. Departamento de Deporte e Informática. Postprint

    Systematic Analysis of the Factors Contributing to the Variation and Change of the Microbiome

    Understanding changes and trends in biomedical knowledge is crucial for individuals, groups, and institutions, as biomedicine improves people’s lives, supports national economies, and facilitates innovation. However, as knowledge changes, what evidence illustrates those changes? In the case of the microbiome, a multi-dimensional concept from biomedicine, there are significant increases in publications, citations, funding, collaborations, and other explanatory variables or contextual factors. What is observed in the microbiome, as in any historical evolution of a scientific field or of scientific knowledge, is that these changes are related to changes in knowledge; what is not understood is how to measure and track changes in knowledge. This investigation highlights how contextual factors from the language and social context of the microbiome are related to changes in the usage, meaning, and scientific knowledge of the microbiome. Two interconnected studies, integrating qualitative and quantitative evidence, that examine the variation and change of the microbiome are presented. First, the concepts microbiome, metagenome, and metabolome are compared to determine the boundaries of the microbiome concept in relation to other concepts whose boundaries have been cited as overlapping. A collection of publications, or corpus, for each concept is presented, with a focus on how to create, collect, curate, and analyze large data collections. This study concludes with suggestions on how to analyze biomedical concepts using a hybrid approach that combines results from the larger language context and from individual words. Second, the results of a systematic review that describes the variation and change of microbiome research, funding, and knowledge are examined. A corpus of approximately 28,000 articles on the microbiome is characterized, and a spectrum of microbiome interpretations is suggested based on differences related to context.
The collective results suggest that the microbiome is a separate concept from the metagenome and the metabolome, and that the variation and change of the microbiome concept were influenced by contextual factors. These results provide insight into how concepts with extensive resources behave within biomedicine and suggest that the microbiome is possibly representative of conceptual change or a preview of new dynamics within science that are expected in the future. Dissertation/Thesis. Doctoral Dissertation Biology 201

    Design and analysis of clustering algorithms for numerical, categorical and mixed data

    In recent times, several machine learning techniques have been applied successfully to discover useful knowledge from data. Cluster analysis, which aims at finding similar subgroups in a large heterogeneous collection of records, is one of the most useful and popular of the available data mining techniques. The purpose of this research is to design and analyse clustering algorithms for numerical, categorical and mixed data sets. Most clustering algorithms are limited to either numerical or categorical attributes. Data sets with mixed types of attributes are common in real life, so designing and analysing clustering algorithms for mixed data sets is timely. Determining the optimal solution to the clustering problem is NP-hard. Therefore, it is necessary to find solutions that are regarded as “good enough” quickly. Similarity is a fundamental concept for the definition of a cluster. It is very common to calculate the similarity or dissimilarity between two features using a distance measure. Attributes with large ranges implicitly contribute more to the metric than attributes with small ranges. There are only a few papers devoted specifically to normalisation methods. Usually data is scaled to unit range. This does not guarantee equal average contributions of all features to the similarity measure. For that reason, a main part of this thesis is devoted to normalisation
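
The point about attribute ranges can be illustrated with a minimal Python sketch of scaling to unit range (min-max scaling). The helper names and the toy records are assumptions made for illustration; this is the common baseline the abstract mentions, not the normalisation developed in the thesis.

```python
import math


def min_max_scale(rows):
    """Rescale every column of a numeric table to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(col) for col in columns]
    spans = [max(col) - min(col) for col in columns]
    return [
        # A constant column has span 0; map it to 0.0 to avoid dividing by zero.
        [(v - lo) / span if span else 0.0 for v, lo, span in zip(row, lows, spans)]
        for row in rows
    ]


def euclidean(p, q):
    """Plain Euclidean distance between two equally long vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


# Hypothetical records: (age in years, income in currency units).
records = [[30, 20000], [60, 21000], [31, 90000]]
scaled = min_max_scale(records)

# Before scaling, income dominates the distance because of its large range.
print("raw:   ", euclidean(records[0], records[1]), euclidean(records[0], records[2]))
# After scaling, both attributes span [0, 1] and contribute on the same scale.
print("scaled:", euclidean(scaled[0], scaled[1]), euclidean(scaled[0], scaled[2]))
```

Unit-range scaling equalises the ranges only. As the abstract notes, it does not by itself make the average contribution of each feature to the distance equal, since that also depends on how the values are distributed within the range.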

    Desarrollo de algoritmos bioinformáticos para estudios de genómica funcional: aplicaciones en cáncer

    This doctoral thesis falls within the fields of Bioinformatics and Computational Biology, and of Functional Genomics and Cancer Genomics. The fundamental objective of Functional Genomics is to understand how the genome works as a whole by analyzing the activity of all its genes and the multiple factors that regulate or influence their expression, as well as other related biomolecular entities. The systematic collection of information and data from global, large-scale experimental genomic technologies provides a starting point for unraveling the activity of the genome and the behavior of living systems associated with it. Within this framework, the work of this thesis has been the development and application of several bioinformatics algorithms for analyzing data from human samples of cancer patients obtained with various high-density genomic platforms, together with their integration and interpretation to discover the genes and biological processes altered in these diseases. Specifically, data were analyzed from the major types of acute and chronic leukemia (ALL, AML, CLL, CML), from metastatic colorectal cancer (CRC), and from primary brain tumors of the glioblastoma multiforme (GBM) type.
The specific results obtained, stated briefly, are: (1) development of a multiclass classifier to differentiate pathological subtypes based on global expression profiles (geNetClassifier); (2) development of a method for the quantitative analysis of genomic DNA copy number alterations (CNA) and the detection of breakpoints in the genome, applied to cancer samples; (3) development of a method for the integrated analysis of genomic copy number (CN) alterations and transcriptomic alterations of gene expression (GE); (4) development of an algorithm and a web application for functional biological analysis based on the multiple reciprocal association of genes and biological terms derived from different annotation spaces

    Prototype based clustering in high-dimensional feature spaces

    ...In this thesis I investigate the “curse of dimensionality” through the notion of distance concentration. I show that, in the data model, this effect can be described by the pairwise covariance coefficients of the marginal distributions. In addition, I compare 10 prototype-based clustering algorithms using 800,000 clustering results obtained on artificially generated data sets. I explore how and why clustering algorithms are affected by the number of features. Using the clustering results, I also examine how well 5 of the most popular cluster quality measures estimate the actual cluster quality. Magdeburg, Univ., Fak. für Informatik, Diss., 2015, by Roland Winkle
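
Distance concentration can be made concrete with a small Python simulation. This is only a sketch under simplifying assumptions that are not taken from the thesis: features are drawn independently and uniformly from [0, 1], distances are Euclidean, and the function name `relative_contrast` is hypothetical. It measures how much the farthest point differs from the nearest one, relative to the nearest, as the number of features grows.

```python
import math
import random


def relative_contrast(dim, n_points=200, seed=0):
    """(max distance - min distance) / min distance from one query point."""
    rng = random.Random(seed)  # fixed seed so repeated runs give the same numbers
    query = [rng.random() for _ in range(dim)]
    distances = []
    for _ in range(n_points):
        point = [rng.random() for _ in range(dim)]
        distances.append(math.sqrt(sum((p - q) ** 2 for p, q in zip(point, query))))
    return (max(distances) - min(distances)) / min(distances)


# Assumed setting: independent uniform features, increasing dimensionality.
contrasts = {dim: relative_contrast(dim) for dim in (2, 20, 500)}
for dim, value in contrasts.items():
    print(f"{dim:>4} features: relative contrast = {value:.3f}")
```

With independent features the contrast shrinks as the dimension grows, so nearest and farthest neighbours become hard to tell apart. The abstract's statement about pairwise covariance coefficients suggests that dependence between features changes this behaviour, which the sketch does not model.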

    ENDOMET database – A means to identify novel diagnostic and prognostic tools for endometriosis

    Endometriosis is a common, benign, hormone-dependent inflammatory gynecological disease that affects women of reproductive age and has a considerable economic impact on healthcare systems. Symptoms include intense menstrual pain, persistent pelvic pain, and infertility. It is defined by the presence of endometrium-like tissue in ectopic locations outside the uterine cavity and by inflammation in the peritoneal cavity. Endometriosis has a multifactorial etiology, which despite extensive research is still poorly understood. The diagnostic delay from the onset of the disease to a conclusive diagnosis is 7–12 years. There is no known cure, although symptoms can be improved with hormonal medications (which often have multiple side effects and prevent pregnancy) or through surgery, which carries its own risks. Current non-invasive diagnostic tools are not sufficiently dependable, and a definite diagnosis is achieved through laparoscopy or laparotomy. This study was based on two prospective cohorts: the ENDOMET study, including 137 endometriosis patients scheduled for surgery and 62 healthy women, and PROENDO, which included 138 endometriosis patients and 33 healthy women. Our long-term goal with the current study was to support the discovery of innovative tools for the efficient diagnosis of endometriosis, as well as tools to further understand the etiology and pathogenesis of the disease. We set about achieving this goal by creating a database, EndometDB, based on a relational data model and implemented in PostgreSQL. The database allows, for example, the exploration of global genome-wide expression patterns in the peritoneum, endometrium, and endometriosis lesions of endometriosis patients, as well as in the peritoneum and endometrium of healthy control women of reproductive age.
The data collected in the EndometDB was also used for the development and validation of a symptom- and biomarker-based predictive model designed for risk evaluation and early prediction of endometriosis without invasive diagnostic methods. Using the data in the EndometDB, we discovered that, compared with the eutopic endometrium, the WNT signaling pathway is one of the molecular pathways that undergo strong changes in endometriosis. We then evaluated the potential role of secreted frizzled-related protein 2 (SFRP-2, a WNT signaling pathway modulator) in improving endometriosis lesion border detection. SFRP-2 expression visualizes the lesion better than previously used markers and can be used to better define lesion size and to confirm that the surgical excision of the lesions is complete.
ENDOMET database – A means to identify a novel diagnostic and prognostic tool for endometriosis. Endometriosis is a common, benign, hormone-dependent inflammatory gynecological disease of women of reproductive age that places a considerable burden on the healthcare system. Its symptoms include severe menstrual pain, persistent pelvic pain, and infertility. The disease is defined as the presence of endometrium-like tissue outside the uterus, together with the associated inflammation of the peritoneum. The etiology of endometriosis is multifactorial and, despite extensive research, still poorly understood. The time from the onset of the disease to a final diagnosis is often as long as 7–12 years. There is no known cure, but symptoms can be relieved with, for example, hormonal medications (which often have many side effects and prevent pregnancy) or with surgery, which carries its own known risks. Current non-invasive diagnostic tools are not reliable enough to identify the disease, and a definite diagnosis of endometriosis is reached through laparoscopy or laparotomy.
This study was based on two prospective cohorts: the ENDOMET study, with 137 endometriosis patients and 62 healthy women, and the PROENDO study, with 138 endometriosis patients and 33 healthy women. Our long-term goal in this study was to find new tools for diagnosing endometriosis and to understand its etiology and pathogenesis. In the first phase we created the EndometDB database using PostgreSQL. This database, partly released for open use, makes it possible to study the genome, for example the expression of all known genes in the peritoneum, the endometrium, and the endometriosis lesions of endometriosis patients. The data collected in the EndometDB database was used to develop and validate a symptom- and biomarker-based predictive model. The model produces a risk assessment for the early prediction of endometriosis without laparoscopy. Using the data in the EndometDB database, we observed strong changes in gene expression in endometriosis tissue, particularly in genes involved in the regulation of the WNT signaling pathway. The key finding was that expression of the SFRP-2 protein was markedly elevated in endometriosis tissue, and that immunohistochemical staining of the SFRP-2 protein distinguishes endometriosis tissue from healthy tissue better than previously used markers. The method can therefore be used to determine the extent of the diseased tissue and, where necessary, to show that surgery has removed all of it