
    Feature Selection for Big Visual Data: Overview and Challenges

    International Conference on Image Analysis and Recognition (ICIAR 2018), Póvoa de Varzim, Portugal

    05441 Abstracts Collection -- Managing and Mining Genome Information: Frontiers in Bioinformatics

    From 30.10.05 to 04.11.05, the Dagstuhl Seminar 05441 "Managing and Mining Genome Information: Frontiers in Bioinformatics" was held in the International Conference and Research Center (IBFI), Schloss Dagstuhl. During the seminar, several participants presented their current research, and ongoing work and open problems were discussed. Abstracts of the presentations given during the seminar, as well as abstracts of seminar results and ideas, are put together in this paper. The first section describes the seminar topics and goals in general. Links to extended abstracts or full papers are provided, if available

    Text Mining and Gene Expression Analysis Towards Combined Interpretation of High Throughput Data

    Microarrays can capture gene expression activity for thousands of genes simultaneously and thus make it possible to analyze cell physiology and disease processes at the molecular level. The interpretation of microarray gene expression experiments benefits from knowledge about the analyzed genes and proteins and the biochemical networks in which they play a role. The trend is towards the development of data analysis methods that integrate diverse data types. Currently, the most comprehensive biomedical knowledge source is a large repository of free-text articles. Text mining makes it possible to automatically extract and use information from texts. This thesis addresses two key aspects, biomedical text mining and gene expression data analysis, with the focus on providing high-quality methods and data that contribute to the development of integrated analysis approaches. The work is structured in three parts. Each part begins by providing the relevant background, and each chapter describes the developed methods as well as applications and results. Part I deals with biomedical text mining: Chapter 2 summarizes the relevant background of text mining; it describes text mining fundamentals, important text mining tasks, applications and particularities of text mining in the biomedical domain, and evaluation issues. In Chapter 3, a method for generating high-quality gene and protein name dictionaries is described. The analysis of the generated dictionaries revealed important properties of individual nomenclatures and the databases used (Fundel and Zimmer, 2006). The dictionaries are publicly available via a Wiki, a web service, and several client applications (Szugat et al., 2005). In Chapter 4, methods for the dictionary-based recognition of gene and protein names in texts and their mapping onto unique database identifiers are described. These methods make it possible to extract information from texts and to integrate text-derived information with data from other sources. Three named entity identification systems have been set up, two of them building upon the previously existing tool ProMiner (Hanisch et al., 2003). All of them have shown very good performance in the BioCreAtIvE challenges (Fundel et al., 2005a; Hanisch et al., 2005; Fundel and Zimmer, 2007). In Chapter 5, a new method for relation extraction (Fundel et al., 2007) is presented. It was applied to the largest collection of biomedical literature abstracts, and thus a comprehensive network of human gene and protein relations has been generated. A classification approach (Küffner et al., 2006) can be used to further specify relation types, e.g. as activating, direct physical, or gene-regulatory relations. Part II deals with gene expression data analysis: Gene expression data need to be processed so that differentially expressed genes can be identified. Gene expression data processing consists of several sequential steps. Two important steps are normalization, which aims at removing systematic variance between measurements, and quantification of differential expression by p-value and fold change determination. Numerous methods exist for these tasks. Chapter 6 describes the relevant background of gene expression data analysis; it presents the biological and technical principles of microarrays and gives an overview of the most relevant data processing steps. Finally, it provides a short introduction to osteoarthritis, which is the focus of the analyzed gene expression data sets. In Chapter 7, quality criteria for the selection of normalization methods are described, and a method for the identification of differentially expressed genes is proposed that is appropriate for data with large intensity variances between spots representing the same gene (Fundel et al., 2005b). Furthermore, a system is described that selects an appropriate combination of feature selection method and classifier, and thus identifies genes which lead to good classification results and show consistent behavior in different sample subgroups (Davis et al., 2006). The analysis of several gene expression data sets dealing with osteoarthritis is described in Chapter 8. This chapter contains the biomedical analysis of relevant disease processes and distinct disease stages (Aigner et al., 2006a), and a comparison of various microarray platforms and osteoarthritis models. Part III deals with integrated approaches and thus provides the connection between Parts I and II: Chapter 9 gives an overview of different types of integrated data analysis approaches, with a focus on approaches that integrate gene expression data with manually compiled data, large-scale networks, or text mining. In Chapter 10, a method for the identification of genes which are consistently regulated and have a coherent literature background (Küffner et al., 2005) is described. This method indicates how gene and protein name identification and gene expression data can be integrated to return clusters that contain genes relevant to the respective experiment, together with literature information that supports interpretation. Finally, Chapter 11 presents ideas on how the described methods can contribute to current research, along with possible future directions
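    As an illustration of the quantification step named above (p-value and fold change determination), the sketch below computes per-gene log2 fold changes and two-sample t-test p-values from normalized intensity matrices. The array shapes, thresholds and simulated data are illustrative assumptions, not the pipeline actually used in the thesis.

    import numpy as np
    from scipy import stats

    def differential_expression(control, disease):
        """control, disease: (n_samples, n_genes) arrays of normalized log2 intensities."""
        log2_fc = disease.mean(axis=0) - control.mean(axis=0)    # difference of log2 means = log2 fold change
        _, p_values = stats.ttest_ind(disease, control, axis=0)  # per-gene two-sample t-test
        return log2_fc, p_values

    # Toy example: 10 control vs. 10 disease samples, 500 genes, 20 genes spiked up
    rng = np.random.default_rng(0)
    control = rng.normal(8.0, 1.0, size=(10, 500))
    disease = rng.normal(8.0, 1.0, size=(10, 500))
    disease[:, :20] += 1.5
    fc, p = differential_expression(control, disease)
    print(np.sum((np.abs(fc) > 1.0) & (p < 0.05)))               # crude count of "differentially expressed" genes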

    The potential of text mining in data integration and network biology for plant research: a case study on Arabidopsis

    Despite the availability of various data repositories for plant research, a wealth of information currently remains hidden within the biomolecular literature. Text mining provides the necessary means to retrieve these data through automated processing of texts. However, only recently has advanced text mining methodology been implemented with sufficient computational power to process texts at a large scale. In this study, we assess the potential of large-scale text mining for plant biology research in general and for network biology in particular, using a state-of-the-art text mining system applied to all PubMed abstracts and PubMed Central full texts. We present an extensive evaluation of the textual data for Arabidopsis thaliana, assessing the overall accuracy of this new resource for use in plant network analyses. Furthermore, we combine text mining information with both protein-protein and regulatory interactions from experimental databases. Clusters of tightly connected genes are delineated from the resulting network, illustrating how such an integrative approach is essential to grasp the current knowledge available for Arabidopsis and to uncover gene information through guilt by association. All large-scale data sets, as well as the manually curated textual data, are made publicly available, thereby stimulating the application of text mining data in future plant biology studies
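    As a rough sketch of the guilt-by-association idea mentioned above, the snippet below scores unannotated genes by the fraction of their network neighbours that already carry an annotation of interest. The adjacency dictionary and the Arabidopsis-style gene identifiers are illustrative assumptions, not the study's actual integrated network or clustering procedure.

    def guilt_by_association(network, annotated):
        """Score each unannotated gene by the fraction of its neighbours that are annotated."""
        scores = {}
        for gene, neighbours in network.items():
            if gene in annotated or not neighbours:
                continue
            scores[gene] = len(neighbours & annotated) / len(neighbours)
        return scores

    # Hypothetical mini-network (undirected adjacency sets) and a seed annotation set
    network = {
        "AT1G01010": {"AT1G01020", "AT1G01030"},
        "AT1G01020": {"AT1G01010", "AT1G01030", "AT1G01040"},
        "AT1G01030": {"AT1G01010", "AT1G01020"},
        "AT1G01040": {"AT1G01020"},
    }
    print(guilt_by_association(network, annotated={"AT1G01010", "AT1G01030"}))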

    Heterogeneous biomedical database integration using a hybrid strategy: a p53 cancer research database.

    Complex problems in life science research give rise to multidisciplinary collaboration, and hence to the need for heterogeneous database integration. The tumor suppressor p53 is mutated in close to 50% of human cancers, and a small drug-like molecule with the ability to restore native function to cancerous p53 mutants is a long-held goal of cancer treatment. The Cancer Research DataBase (CRDB) was designed in support of a project to find such small molecules. As a cancer informatics project, the CRDB involved small-molecule data, computational docking results, functional assays, and protein structure data. As an example of the hybrid strategy for data integration, it combined the mediation and data warehousing approaches. This paper uses the CRDB to illustrate the hybrid strategy as a viable approach to heterogeneous data integration in biomedicine, and provides a design method for those considering similar systems. More efficient data sharing implies increased productivity and, hopefully, improved chances of success in cancer research. (Code and database schemas are freely downloadable from http://www.igb.uci.edu/research/research.html.)
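    The hybrid strategy can be pictured with a minimal sketch, assuming a toy schema: frequently queried assay results are warehoused locally (SQLite here), while molecule records are fetched on demand from a wrapped remote source and joined by a mediator at query time. The table layout, field names and the fetch_remote_molecules stand-in are hypothetical, not the actual CRDB design.

    import sqlite3

    def fetch_remote_molecules(ids):
        """Stand-in for a mediated remote source (e.g. a wrapped web service)."""
        catalog = {"MOL-1": {"id": "MOL-1", "smiles": "CCO"},
                   "MOL-2": {"id": "MOL-2", "smiles": "c1ccccc1"}}
        return [catalog[i] for i in ids if i in catalog]

    # Local warehouse: materialized assay results
    warehouse = sqlite3.connect(":memory:")
    warehouse.execute("CREATE TABLE assay(molecule_id TEXT, p53_mutant TEXT, activity REAL)")
    warehouse.executemany("INSERT INTO assay VALUES (?, ?, ?)",
                          [("MOL-1", "R175H", 0.82), ("MOL-2", "R175H", 0.11)])

    def mediated_query(min_activity):
        """Join warehoused assay rows with molecule records fetched from the remote source."""
        rows = warehouse.execute(
            "SELECT molecule_id, p53_mutant, activity FROM assay WHERE activity >= ?",
            (min_activity,)).fetchall()
        remote = {m["id"]: m for m in fetch_remote_molecules([r[0] for r in rows])}
        return [{"molecule": remote[mid], "mutant": mut, "activity": act} for mid, mut, act in rows]

    print(mediated_query(0.5))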

    Discovering lesser known molecular players and mechanistic patterns in Alzheimer's disease using an integrative disease modelling approach

    Convergence of exponentially advancing technologies is driving medical research with life-changing discoveries. In contrast, the repeated failure of high-profile drugs to battle Alzheimer's disease (AD) has made it one of the least successful therapeutic areas. This failure pattern has provoked researchers to re-examine their beliefs about Alzheimer's aetiology. The growing realisation that Amyloid-β and tau are not 'the' but rather 'one of the' factors necessitates the reassessment of pre-existing data to add new perspectives. To enable a holistic view of the disease, integrative modelling approaches are emerging as a powerful technique. Combining data at different scales and modes can considerably increase the predictive power of an integrative model by filling biological knowledge gaps. However, the reliability of the derived hypotheses largely depends on the completeness, quality, consistency, and context-specificity of the data. Thus, there is a need for agile methods and approaches that efficiently interrogate and utilise existing public data. This thesis presents the development of novel approaches and methods that address intrinsic issues of data integration and analysis in AD research. It aims to prioritise lesser-known AD candidates using highly curated and precise knowledge derived from integrated data, with much of the emphasis put on quality, reliability, and context-specificity. This thesis work showcases the benefit of integrating well-curated and disease-specific heterogeneous data in a semantic web-based framework for mining actionable knowledge. Furthermore, it introduces the challenges encountered while harvesting information from literature and transcriptomic resources. State-of-the-art text-mining methodology is developed to extract miRNAs and their regulatory roles in diseases and genes from the biomedical literature. To enable meta-analysis of biologically related transcriptomic data, a highly curated metadata database has been developed, which explicates annotations specific to human and animal models. Finally, to corroborate common mechanistic patterns, embedded with novel candidates, across large-scale AD transcriptomic data, a new approach to generating gene regulatory networks has been developed. The work presented here has demonstrated its capability to identify testable mechanistic hypotheses containing previously unknown or emerging knowledge from public data in two major publicly funded projects on Alzheimer's disease, Parkinson's disease, and epilepsy

    Semantic web data warehousing for caGrid

    The National Cancer Institute (NCI) is developing caGrid as a means for sharing cancer-related data and services. As more data sets become available on caGrid, we need effective ways of accessing and integrating this information. Although the data models exposed on caGrid are semantically well annotated, it is currently up to the caGrid client to infer relationships between the different models and their classes. In this paper, we present a Semantic Web-based data warehouse (Corvus) for creating relationships among caGrid models. This is accomplished through the transformation of semantically-annotated caBIG® Unified Modeling Language (UML) information models into Web Ontology Language (OWL) ontologies that preserve those semantics. We demonstrate the validity of the approach by Semantic Extraction, Transformation and Loading (SETL) of data from two caGrid data sources, caTissue and caArray, as well as alignment and query of those sources in Corvus. We argue that semantic integration is necessary for integration of data from distributed web services and that Corvus is a useful way of accomplishing this. Our approach is generalizable and of broad utility to researchers facing similar integration challenges
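    A minimal sketch of the model-to-ontology idea, assuming a toy caBIG-style UML model: classes and their attributes are rendered as owl:Class and owl:DatatypeProperty triples with rdflib. The namespace and the small model dictionary are invented for illustration and do not reflect Corvus' actual transformation rules.

    from rdflib import Graph, Literal, Namespace
    from rdflib.namespace import OWL, RDF, RDFS, XSD

    CA = Namespace("http://example.org/cagrid/")   # hypothetical namespace

    # Hypothetical, highly simplified UML model: class name -> {attribute name: datatype}
    uml_model = {
        "Specimen": {"specimenType": XSD.string},
        "Array": {"platform": XSD.string},
    }

    g = Graph()
    for class_name, attributes in uml_model.items():
        cls = CA[class_name]
        g.add((cls, RDF.type, OWL.Class))
        g.add((cls, RDFS.label, Literal(class_name)))
        for attr, datatype in attributes.items():
            prop = CA[attr]
            g.add((prop, RDF.type, OWL.DatatypeProperty))
            g.add((prop, RDFS.domain, cls))
            g.add((prop, RDFS.range, datatype))

    print(g.serialize(format="turtle"))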

    Biomarker lists stability in genomic studies: analysis and improvement by prior biological knowledge integration into the learning process

    The analysis of high-throughput sequencing, microarray and mass spectrometry data has proven extremely helpful for identifying the genes and proteins, called biomarkers, that help answer both diagnostic/prognostic and functional questions. In this context, robustness of the results is critical both for understanding the biological mechanisms underlying diseases and for gaining sufficient reliability for clinical/pharmaceutical applications. Recently, different studies have shown that the lists of identified biomarkers are poorly reproducible, making the validation of biomarkers as robust predictors of a disease still an open issue. The reasons for these differences are attributable both to the dimensions of the data (few subjects with respect to the number of features) and to the heterogeneity of complex diseases, which are characterized by alterations of multiple regulatory pathways and of the interplay between different genes and the environment. Typically, in an experimental design, the data to analyze come from different subjects and different phenotypes (e.g. normal and pathological). The most widely used methodologies for identifying genes related to a disease from microarray data are based on computing differential gene expression between phenotypes by univariate statistical tests. Such an approach provides information on the effect of specific genes as independent features, whereas it is now recognized that the interplay among weakly up/down-regulated genes, although not significantly differentially expressed, might be extremely important for characterizing a disease state. Machine learning algorithms are, in principle, able to identify multivariate nonlinear combinations of features and thus can select a more complete set of experimentally relevant features. In this context, supervised classification methods are often used to select biomarkers, and different methods, such as discriminant analysis, random forests and support vector machines, have been used, especially in cancer studies. Although high accuracy is often achieved in classification approaches, the reproducibility of biomarker lists remains an open issue, since many possible sets of biological features (i.e. genes or proteins) can be considered equally relevant in terms of prediction; it is therefore possible, in principle, to lack stability even while achieving the best accuracy. This thesis is a study of several computational aspects related to biomarker discovery in genomic studies, from the classification and feature selection strategies to the type and reliability of the biological information used, proposing new approaches able to cope with the problem of the reproducibility of biomarker lists. The study has highlighted that, although reasonable and comparable classification accuracy can be achieved by different methods, further developments are necessary to achieve robust biomarker list stability, because of the high number of features and the high correlation among them. In particular, this thesis proposes two different approaches to improve biomarker list stability by using prior information related to biological interplay and functional correlation among the analyzed features. Both approaches were able to improve biomarker selection. The first approach, using prior information to divide the application of the method into different subproblems, improves the interpretability of results and offers an alternative way to assess list reproducibility. The second, integrating prior information into the kernel function of the learning algorithm, improves list stability. Finally, the interpretability of results is strongly affected by the quality of the available biological information, and the analysis of heterogeneities performed on the Gene Ontology database has revealed the importance of providing new methods able to verify the reliability of the biological properties assigned to a specific feature, distinguishing missing or less specific information from possible inconsistencies among the annotations. These aspects will be explored in more depth in the future, as new sequencing technologies will monitor an increasing number of features and the number of functional annotations from genomic databases will grow considerably in the coming years.
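    One common way to quantify the list-stability problem discussed above is to repeat feature selection on bootstrap resamples and average the pairwise overlap of the resulting top-k gene lists. The sketch below does this with a simple t-statistic ranking and Jaccard overlap; the selection rule, the value of k and the simulated data are illustrative assumptions, not the selection strategies or prior-knowledge kernels developed in the thesis.

    import numpy as np
    from itertools import combinations
    from scipy import stats

    def top_k_genes(X, y, k):
        """Rank genes by absolute t-statistic between the two classes and keep the top k."""
        t, _ = stats.ttest_ind(X[y == 1], X[y == 0], axis=0)
        return set(np.argsort(-np.abs(t))[:k])

    def list_stability(X, y, k=20, n_boot=30, seed=0):
        """Mean pairwise Jaccard overlap of top-k lists selected on bootstrap resamples."""
        rng = np.random.default_rng(seed)
        lists = []
        for _ in range(n_boot):
            idx = rng.choice(len(y), size=len(y), replace=True)
            lists.append(top_k_genes(X[idx], y[idx], k))
        return float(np.mean([len(a & b) / len(a | b) for a, b in combinations(lists, 2)]))

    # Toy data: 60 samples, 1000 genes, 15 genes truly differential
    rng = np.random.default_rng(1)
    X = rng.normal(size=(60, 1000))
    y = np.array([0] * 30 + [1] * 30)
    X[y == 1, :15] += 1.0
    print(list_stability(X, y))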

    A semantic web framework to integrate cancer omics data with biological knowledge

    BACKGROUND: The RDF triple provides a simple linguistic means of describing limitless types of information. Triples can be flexibly combined into a unified data source we call a semantic model. Semantic models open new possibilities for the integration of variegated biological data. We use Semantic Web technology to explicate high throughput clinical data in the context of fundamental biological knowledge. We have extended Corvus, a data warehouse which provides a uniform interface to various forms of Omics data, by providing a SPARQL endpoint. With the querying and reasoning tools made possible by the Semantic Web, we were able to explore quantitative semantic models retrieved from Corvus in the light of systematic biological knowledge. RESULTS: For this paper, we merged semantic models containing genomic, transcriptomic and epigenomic data from melanoma samples with two semantic models of functional data: one containing Gene Ontology (GO) data, the other regulatory networks constructed from transcription factor binding information. These two semantic models were created in an ad hoc manner but support a common interface for integration with the quantitative semantic models. Such combined semantic models allow us to pose significant translational medicine questions. Here, we study the interplay between a cell's molecular state and its response to anti-cancer therapy by exploring the resistance of cancer cells to Decitabine, a demethylating agent. CONCLUSIONS: We were able to generate a testable hypothesis to explain how Decitabine fights cancer, namely that it targets apoptosis-related gene promoters predominantly in Decitabine-sensitive cell lines, thus conveying its cytotoxic effect by activating the apoptosis pathway. Our research provides a framework whereby similar hypotheses can be developed easily
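    In the spirit of the merged semantic models described above, the sketch below combines a toy quantitative graph with a toy functional graph and asks, via SPARQL, for genes that both carry an expression measurement and are annotated with the apoptotic-process GO term. The namespace, predicates and values are invented for illustration and are not Corvus' actual vocabulary.

    from rdflib import Graph, Literal, Namespace

    EX = Namespace("http://example.org/omics/")            # hypothetical vocabulary
    GO = Namespace("http://purl.obolibrary.org/obo/GO_")

    quantitative = Graph()                                 # toy quantitative semantic model
    quantitative.add((EX["BCL2"], EX.measuredIn, EX["melanoma_sample_1"]))
    quantitative.add((EX["BCL2"], EX.log2FoldChange, Literal(-1.8)))

    functional = Graph()                                   # toy functional semantic model
    functional.add((EX["BCL2"], EX.annotatedWith, GO["0006915"]))   # GO:0006915, apoptotic process

    merged = Graph()                                       # merge by copying triples (graph union)
    for source in (quantitative, functional):
        for triple in source:
            merged.add(triple)

    query = """
    PREFIX ex: <http://example.org/omics/>
    SELECT ?gene ?fc WHERE {
      ?gene ex:annotatedWith <http://purl.obolibrary.org/obo/GO_0006915> ;
            ex:log2FoldChange ?fc .
    }"""
    for gene, fc in merged.query(query):
        print(gene, fc)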

    GREAT: gene regulation evaluation tool

    Master's thesis. Tecnologias de Informação aplicadas às Ciências Biológicas e Médicas, Universidade de Lisboa, Faculdade de Ciências, 2009. A correct understanding of how biological systems work depends on the study of the mechanisms that regulate gene expression. These mechanisms control when and for how long the information encoded in a gene is used, and they can act at several stages of the gene expression process. In the present work, the stage under analysis is transcription, in which the DNA sequence of a gene is transformed into an RNA sequence that will later give rise to a protein. Transcriptional regulation centres on the action of a class of regulatory proteins called transcription factors. These bind to the DNA strand in the region near the start of a gene (the promoter region), promoting or inhibiting the binding of the protein responsible for the transcription process. Transcription factors are specific for short DNA sequences (called binding motifs) that are present in the promoter regions of the genes they regulate. A gene can be regulated by different transcription factors; a transcription factor can regulate different genes; and two transcription factors can have identical binding motifs. The regulation of the genes that encode transcription factors is itself regulated, through a series of mechanisms that include interaction with other transcription factors. Knowledge of how genes and proteins interact allows the creation of models that represent how the system in question (whether a biological process or a cell) behaves. These models can be represented as gene regulatory networks which, although they may differ structurally, share the same elementary components: vertices represent genes (or the proteins they encode) and edges represent individual molecular reactions, such as the protein interactions through which the products of one gene affect those of another. Representing gene regulations as gene regulatory networks facilitates, among other things, the discovery of groups of genes that, being co-regulated, participate in the same biological process. As mentioned above, transcription factors can be regulated by other transcription factors, which means that there are two types of regulation: direct and indirect. Direct regulations concern transcription factor-gene pairs in which the expression of the gene is regulated by the transcription factor in the pair; indirect regulations concern pairs in which the expression of the gene is regulated by a transcription factor whose own expression is regulated by the transcription factor in the pair. There are two types of experimental methods for identifying gene regulations: direct methods, which identify direct regulations; and indirect methods, which identify regulations without being able to distinguish between direct and indirect ones. Direct methods assess the physical binding of the transcription factor to the gene, whereas indirect methods assess changes in gene expression patterns caused by the influence of transcription factors (that is, if the action of a given transcription factor is removed, which genes show altered transcription, and how strongly). Of the four methods described below, the first two are direct and the last two indirect: ChIP (chromatin immunoprecipitation), a technique used to investigate in vivo interactions between DNA and proteins [1,2]; ChIP-chip, an adaptation of the former performed at genomic scale, in which a microarray representative of an organism's complete genome is exposed to a given transcription factor, allowing the identification of all the genes it regulates [3]; microarrays, which allow large-scale assessment of changes in gene expression, considering an organism's complete genome or only a metabolic pathway [4]; and proteomics, an approach comprising several methods that identify the genes regulated by a given transcription factor by studying the expression level of the proteins encoded by those genes [5]. Existing knowledge about gene regulations is available essentially in the literature. Although a large number of public biological databases currently exist, the vast majority contain data on biological entities but not, explicitly, on gene regulations. In order to make existing data on gene regulations in Saccharomyces cerevisiae available to the scientific community, a Portuguese database named Yeastract was created, maintained by manual curation of the scientific literature. Given the growing number of papers published nowadays, the development of automatic tools that assist the manual curation process is extremely important. In the specific case of Yeastract, the need arose for a tool to help identify scientific papers describing gene regulations in S. cerevisiae. This tool comprises two components: a first one that identifies transcription factors in paper abstracts and checks whether the abstracts contain descriptions of gene regulations; and a second one that evaluates whether the putative regulations contained in a paper correspond to regulations that are valid from a biological point of view. This second component was named GREAT (Gene Regulation EvAluation Tool) and is the goal of my work. The tool I developed takes as input a list of papers in whose abstracts transcription factors have been identified and, to validate the regulations, exploits data obtained exclusively from publicly accessible biological databases. These data are used to evaluate the following aspects: participation of the gene and the transcription factor in the same biological process; existence of the transcription factor binding site in the gene's promoter region; and the experimental method with which the regulation was identified. The result for each of these aspects is used by a machine learning method, regression trees or model trees, to compute a confidence score assigned to each potential regulation. Papers containing regulations with high scores will be manually curated to extract the gene regulations. A first prototype of GREAT was successfully implemented; however, from a biological point of view, the results obtained were not satisfactory, so a detailed analysis of the data used was carried out. This analysis revealed important issues, essentially related to the insufficiency of the available data, and identified measures that can be implemented in the current prototype to resolve the problems encountered
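    A minimal sketch of the scoring step described above, assuming a toy encoding: a regression tree maps the three evidence features (shared biological process, binding site present in the promoter, directness of the experimental method) to a confidence score for a candidate regulation. The feature encoding and training pairs are invented for illustration, not Yeastract's curated data or GREAT's actual models.

    from sklearn.tree import DecisionTreeRegressor

    # Hypothetical features per candidate regulation:
    # [same_biological_process, binding_site_in_promoter, direct_experimental_method]
    X_train = [[1, 1, 1], [1, 0, 1], [0, 1, 0], [0, 0, 0], [1, 1, 0], [0, 0, 1]]
    y_train = [0.95, 0.70, 0.40, 0.05, 0.80, 0.30]        # hypothetical curator-assigned confidences

    model = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X_train, y_train)

    candidate = [[1, 1, 0]]           # shared GO process and promoter motif, but only indirect evidence
    print(model.predict(candidate))   # a high score would flag the paper for manual curation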