    Gene List significance at-a-glance with GeneValorization

    Motivation: High-throughput technologies provide fundamental information concerning thousands of genes. Many research laboratories use one or more of these technologies daily and end up with lists of genes. Assessing the originality of the results obtained includes being aware of the number of publications available concerning individual or multiple genes and accessing information about these publications. Faced with the exponential growth in the number of available publications and in the number of genes involved in a study, this task is becoming particularly difficult to achieve.

    A fast computational framework for genome-wide association studies with neuroimaging data

    In the last few years, it has become possible to acquire high-dimensional neuroimaging and genetic data on relatively large cohorts of subjects, which provides novel means to understand the large between-subject variability observed in brain organization. Genetic association studies aim at unveiling correlations between the genetic variants and the numerous phenotypes extracted from brain images and thus face a dire multiple comparisons issue. While these statistics can be accumulated across the brain volume for the sake of sensitivity, the significance of the resulting summary statistics can only be assessed through permutations. Fortunately, the increase in computational power can be exploited, but this requires designing new parallel algorithms. The MapReduce framework coupled with efficient algorithms makes it possible to deliver a scalable analysis tool that deals with high-dimensional data and thousands of permutations in a few hours. On a real functional MRI dataset, this tool shows promising results, with a genetic variant that survives the very strict correction for multiple testing.

    ProDiGe: Prioritization Of Disease Genes with multitask machine learning from positive and unlabeled examples

    Background: Elucidating the genetic basis of human diseases is a central goal of genetics and molecular biology. While traditional linkage analysis and modern high-throughput techniques often provide long lists of tens or hundreds of disease gene candidates, the identification of disease genes among the candidates remains time-consuming and expensive. Efficient computational methods are therefore needed to prioritize genes within the list of candidates, by exploiting the wealth of information available about the genes in various databases. Results: We propose ProDiGe, a novel algorithm for Prioritization of Disease Genes. ProDiGe implements a novel machine learning strategy based on learning from positive and unlabeled examples, which makes it possible to integrate various sources of information about the genes, to share information about known disease genes across diseases, and to perform genome-wide searches for new disease genes. Experiments on real data show that ProDiGe outperforms state-of-the-art methods for the prioritization of genes in human diseases. Conclusions: ProDiGe implements a new machine learning paradigm for gene prioritization, which could help the identification of new disease genes. It is freely available at http://cbio.ensmp.fr/prodige.

    Extracting knowledge from documents related with invasive fungal infections in iron overload context

    Master's dissertation in Bioinformatics. Invasive fungal infections caused by Candida are associated with high mortality and morbidity rates in hospitalized patients. Iron plays a major role in these infections, as they are exacerbated under iron overload conditions. In this context, it is important to understand the association between iron levels and invasive fungal infections, as it can serve as an indicator of the severity of the disease and can eventually help establish measures to improve treatment efficacy. Nowadays, manually inferring these associations from biomedical documents is a time-consuming task, due to the large volume of available scientific text data. As such, these tasks naturally benefit from the Biomedical Text Mining field, which includes a wide variety of methods for automatic extraction of high-quality information from biomedical text documents. In this work, relevant documents related to iron overload and fungal infections were retrieved from PubMed to build a corpus. Then, both Named Entity Recognition and Relation Extraction processes were executed using the @Note text mining tool. Finally, relevant sentences were manually extracted and a curated dataset with documents containing those sentences was created. Since the number of publications obtained about Candida and iron overload was very low, the analysis was made taking into account all fungi. A total of 15 publications were considered relevant and 168 relevant associations were extracted. Although associations of iron levels with both severity of infection and treatment efficacy were not extracted, it was possible to conclude that, in many cases, iron overload is a predictor for fungal infections, and patients’ iron levels highly affect treatment efficacy.
The Biomedical Text Mining process described in the present thesis enabled the creation of a dataset of relevant biomedical publications containing interesting associations between fungal infections, drugs and associated diseases in a clinical context of iron overload. In the future, this process could be improved, especially regarding dictionaries, in order to obtain a higher number of relevant publications.

    Data Integration in the Life Sciences: Scientific Workflows, Provenance, and Ranking

    Biological research is a science which derives its findings from the proper analysis of experiments. Today, a large variety of experiments are carried out in hundreds of labs around the world, and their results are reported in a myriad of different databases, websites, publications etc., using different formats, conventions, and schemas. Providing uniform access to these diverse and distributed databases is the aim of data integration solutions, which have been designed and implemented within the bioinformatics community for more than 20 years. However, the perception of the problem of data integration research in the life sciences has changed: while early approaches concentrated on handling schema-dependent queries over heterogeneous and distributed databases, current research emphasizes instances rather than schemas, tries to place the human back into the loop, and intertwines data integration and data analysis. Transparency (providing users with the illusion that they are using a centralized database and thus completely hiding the original databases) was one of the main goals of federated databases. It is not a target anymore. Instead, users want to know exactly which data from which source was used in which way in studies (Provenance). The old model of "first integrate, then analyze" is replaced by a new, process-oriented paradigm: "integration is analysis - and analysis is integration". This paradigm change gives rise to some important research trends. First, the process of integration itself, i.e., the integration workflow, is becoming a research topic in its own right. Scientific workflows actually implement the paradigm "integration is analysis". A second trend is the growing importance of sensible ranking, because data sets keep growing and it becomes increasingly difficult for the biologist user to distinguish relevant data from large and noisy data sets.
This HDR thesis outlines my contributions to the field of data integration in the life sciences. More precisely, my work takes place in the two contexts mentioned above, namely, scientific workflows and biological data ranking. The reported results were obtained from 2005 to late 2014, first as a postdoctoral fellow at the University of Pennsylvania (Dec 2005 to Aug 2007) and then as an Associate Professor at Université Paris-Sud (LRI, UMR CNRS 8623, Bioinformatics team) and Inria (Saclay-Ile-de-France, AMIB team 2009-2014).