11 research outputs found

    Tools for Large-scale Genomic Analysis and Gene Expression Outlier Modeling for Precision Therapeutics

    In terms of data acquisition, storage, and distribution, genomics data will soon become the largest “big data” domain in science and, as such, needs appropriate tools to process the ever-increasing amount of genomic data so researchers can leverage the power afforded by such enormous datasets. I present my work on Toil: a portable, open-source workflow system that supports contemporary workflow definition languages and can securely, reproducibly, and efficiently run scientific workflows at large scale. Yet efficient computation is only one component of enabling scientific research, as data are not always accessible to the researchers who can use them. Data barriers hinder scientific progress and stymie research collaboration by denying access to large amounts of biomedical information, owing to the need for patient privacy and the potential liability of data stewards. Research institutions and consortia should therefore prioritize making large datasets open-access so that research teams can develop novel therapeutics and garner valuable insight into a wide variety of diseases. One research group that benefits from such large open-access datasets is Treehouse, a pediatric cancer research group that investigates the role of RNA-seq in therapeutics. However, Treehouse also needs methods to extract rare pediatric cancer data from information silos. Treehouse uses RNA-seq to identify target drug candidates by comparing gene expression in individual patients against its own public compendium, which combines multiple open-access datasets comprising thousands of pediatric samples. I discuss a solution for extracting data from information silos that uses portable, reproducible software to produce anonymized secondary output that can be sent back to the researcher for analysis. This computation-to-data method also addresses the logistical difficulty of securely sharing and storing large amounts of primary sequence data.
    Finally, I propose a robust Bayesian statistical framework for detecting gene expression outliers in single samples. It leverages all available data to produce a consensus background distribution for each gene of interest, without requiring the researcher to manually select a comparison set, and provides posterior predictive p-values to quantify over- or under-expression.
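The posterior predictive p-value idea can be illustrated with a minimal sketch. This assumes a Normal likelihood with a conjugate Normal-inverse-gamma prior over one gene's background expression; the model form, the prior values, and the function name are illustrative assumptions, not the framework's actual implementation or its consensus background construction.

```python
# Hedged sketch: Monte Carlo posterior predictive p-value for one gene's
# expression in a single sample, against a background compendium.
# Assumes a Normal likelihood with a Normal-inverse-gamma conjugate prior.
import numpy as np

def posterior_predictive_pvalue(background, observed, draws=20000, seed=0):
    """Upper-tail posterior predictive p-value (small => over-expressed)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(background, dtype=float)
    n, xbar, s2 = len(x), x.mean(), x.var(ddof=1)
    # Weakly informative prior hyperparameters; standard conjugate updates.
    mu0, k0, a0, b0 = 0.0, 1e-3, 1e-3, 1e-3
    kn = k0 + n
    mun = (k0 * mu0 + n * xbar) / kn
    an = a0 + n / 2
    bn = b0 + 0.5 * (n - 1) * s2 + 0.5 * k0 * n * (xbar - mu0) ** 2 / kn
    # Draw (mu, sigma^2) from the posterior, then replicate observations.
    sigma2 = 1.0 / rng.gamma(an, 1.0 / bn, size=draws)
    mu = rng.normal(mun, np.sqrt(sigma2 / kn))
    x_rep = rng.normal(mu, np.sqrt(sigma2))
    return (x_rep >= observed).mean()

# Toy background: 200 samples of a gene expressed around 5.0 (log-scale units).
background = np.random.default_rng(1).normal(5.0, 1.0, size=200)
p_over = posterior_predictive_pvalue(background, observed=9.0)
```

An observation far above the background distribution yields a small upper-tail p-value, flagging candidate over-expression.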

    Contextual Analysis of Large-Scale Biomedical Associations for the Elucidation and Prioritization of Genes and their Roles in Complex Disease

    Vast amounts of biomedical associations are easily accessible in public resources, spanning gene-disease associations, tissue-specific gene expression, gene function and pathway annotations, and many other data types. Despite this mass of data, the information most relevant to the study of a particular disease remains loosely coupled and difficult to incorporate into ongoing research. Current public databases are difficult to navigate and interoperate poorly due to the plethora of interfaces and the varying biomedical concept identifiers in use. Because no coherent display of data within a specific problem domain is available, finding the latent relationships associated with a disease of interest is impractical. This research describes a method for extracting the contextual relationships embedded within associations relevant to a disease of interest. After the method is applied to a small test data set, a large-scale integrated association network is constructed and a network propagation technique is applied to it, helping uncover more distant latent relationships. Together these methods are adept at uncovering highly relevant relationships without any a priori knowledge of the disease of interest. The combined contextual search and relevance methods power a tool that makes pertinent biomedical associations easier to find, easier to assimilate into ongoing work, and more prominent than in currently available databases. Increasing the accessibility of current information is an important component of understanding high-throughput experimental results and surviving the data deluge.
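Random walk with restart is one common network propagation technique; the sketch below assumes it purely for illustration, since the abstract does not specify the exact propagation method. The toy five-node association network and all parameter values are assumptions.

```python
# Hedged sketch: network propagation via random walk with restart (RWR)
# over a small undirected association network, seeded at one gene.
import numpy as np

def random_walk_with_restart(adj, seeds, restart=0.3, tol=1e-9, max_iter=1000):
    """Propagate seed scores over a column-normalized adjacency matrix."""
    W = np.asarray(adj, dtype=float)
    W = W / W.sum(axis=0, keepdims=True)       # column-stochastic transitions
    p0 = np.zeros(W.shape[0])
    p0[list(seeds)] = 1.0 / len(seeds)         # restart distribution on seeds
    p = p0.copy()
    for _ in range(max_iter):
        p_next = (1 - restart) * W @ p + restart * p0
        if np.abs(p_next - p).sum() < tol:     # stop at the stationary vector
            break
        p = p_next
    return p

# Toy path-shaped network: node 0 (seed) - 1 - 2 - 3 - 4.
adj = np.array([[0, 1, 0, 0, 0],
                [1, 0, 1, 0, 0],
                [0, 1, 0, 1, 0],
                [0, 0, 1, 0, 1],
                [0, 0, 0, 1, 0]], dtype=float)
scores = random_walk_with_restart(adj, seeds=[0])
```

Scores decay with network distance from the seed, which is what lets propagation rank more distant, latent relationships without enumerating them explicitly.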

    Charting the single-cell landscape of colorectal cancer stem cell polarisation

    The colonic epithelium is regulated by cell-intrinsic and cell-extrinsic cues, both in homeostatic tissue and in colorectal cancer (CRC), where the tumour microenvironment closely interacts with the mutated epithelium. Our understanding of how these cues polarise colonic stem cell (CSC) states remains incomplete: charting the interaction between intrinsic and stromal cues requires a systematic study that is still missing from the literature. In this work I present my efforts towards computationally studying colonic stem cell polarisation at single-cell resolution. Leveraging the scalability of organoid models, my colleagues and I dissected the heterocellular CRC organoid system presented in Qin & Cardoso Rodriguez et al. using single-cell omic analyses, resolving complex interaction and polarisation processes. First, I identified bottlenecks in common mass cytometry (MC) analysis workflows that would benefit from increased accessibility or automation, designing the CyGNAL pipeline and developing a cell-state classifier to address these points, respectively. I then used single-cell RNA sequencing (scRNA-seq) data to reveal a shared landscape of CSC polarisation, wherein stromal cues polarise the epithelium towards a slow-cycling revival CSC (revCSC) state and oncogenic mutations trap cells in a hyper-proliferative CSC (proCSC) state. I then developed a method to visualise single-cell differentiation using a novel valley-ridge (VR) score, which can generate data-driven Waddington-like landscapes that recapitulate the differentiation dynamics of the colonic epithelium. Finally, I explored an approach to holistic inter- and intra-cellular communication analysis by incorporating literature information as a directed knowledge graph (KG), showing that low-dimensional representations of the graph retain biological information and that projected cellular profiles recapitulate their transcriptomes.
    These results reveal a polarisation landscape in which CRC epithelia are trapped in a proCSC state refractory to stromal cues. More broadly, they show the importance of joint collaborative wet- and dry-lab work, which is central to closing gaps in the method space and to generating a comprehensive analysis of heterocellular signalling in cancer.
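As a toy illustration of the knowledge-graph idea, a small directed graph of literature-derived interactions can be given a low-dimensional representation via a truncated SVD of its adjacency matrix. The node names, the tiny edge list, and the choice of SVD are all assumptions for illustration, not the KG construction or embedding method used in this work.

```python
# Hedged sketch: embed a tiny directed interaction graph in k dimensions
# using a truncated SVD of its adjacency matrix.
import numpy as np

# Hypothetical signalling chain: ligand -> receptor -> transcription factor -> gene.
edges = [("ligand_A", "receptor_B"), ("receptor_B", "tf_C"), ("tf_C", "gene_D")]
nodes = sorted({n for e in edges for n in e})
idx = {n: i for i, n in enumerate(nodes)}

A = np.zeros((len(nodes), len(nodes)))
for src, dst in edges:
    A[idx[src], idx[dst]] = 1.0            # directed edge src -> dst

U, S, Vt = np.linalg.svd(A)                # singular values in descending order
k = 2
embedding = U[:, :k] * S[:k]               # k-dimensional node representations
```

Cellular profiles could then, in principle, be projected onto the same low-dimensional space for comparison against their transcriptomes.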

    Network-driven strategies to integrate and exploit biomedical data

    In the quest to understand complex biological systems, the scientific community has been delving into protein, chemical, and disease biology, populating biomedical databases with a wealth of data and knowledge. The field of biomedicine has now entered a Big Data era, in which computation-driven research can benefit greatly from existing knowledge to better understand and characterize biological and chemical entities. Yet the heterogeneity and complexity of biomedical data call for proper integration and representation of this knowledge, so that it can be exploited effectively and efficiently. In this thesis, we aim to develop new strategies to leverage current biomedical knowledge so that meaningful information can be extracted and fused into downstream applications. To this end, we have capitalized on network analysis algorithms to integrate and exploit biomedical data in a wide variety of scenarios, providing a better understanding of pharmaco-omics experiments while helping accelerate the drug discovery process. More specifically, we have (i) devised an approach to identify functional gene sets associated with drug response mechanisms of action, (ii) created a resource of biomedical descriptors able to anticipate cellular drug response and identify new drug repurposing opportunities, (iii) designed a tool to annotate biomedical support for a given set of experimental observations, and (iv) reviewed different chemical and biological descriptors relevant to drug discovery, illustrating how they can be used to provide solutions to current challenges in biomedicine.

    Adaptive assembly of genomes and metagenomes by message passing

    Generally speaking, current processes now produce more data than a human can assimilate. Big data, when properly analysed, improve our understanding of the processes operating inside systems and, in consequence, drive their improvement. Analysing deoxyribonucleic acid (DNA) sequences allows us to better understand living beings, for instance through systems biology, and DNA sequencing contributes to the discovery of knowledge in genetics and other fields. High-throughput DNA sequencers are massively parallel instruments that produce unprecedented volumes of data. Computing infrastructures, such as supercomputers and cloud computing, are also massively parallel by virtue of their distributed nature. However, computers understand neither French nor English; they must be programmed. Software systems that analyse genomic data on supercomputers must likewise be massively parallel. The message passing interface makes it possible to create such software, and a granular design interleaves communication and computation within the processes of a computing system, so that results are produced quickly from the data. Herein, the software RayPlatform, Ray (including the workflows Ray Meta and Ray Communities for metagenomics) and Ray Cloud Browser are presented. The main application of this family of products is the adaptive assembly and profiling of genomes and metagenomes by message passing.
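The granular message-passing pattern can be sketched in miniature: k-mers extracted from reads are routed, by hash, to the rank that owns them, and each rank counts the k-mers that arrive in its mailbox. The two-rank setup with in-process queues is an illustrative assumption standing in for a real MPI runtime, not Ray's implementation.

```python
# Hedged sketch: route k-mer messages to owning ranks, then count on arrival.
from collections import defaultdict
from queue import Queue

NUM_RANKS, K = 2, 3

mailboxes = [Queue() for _ in range(NUM_RANKS)]
counts = [defaultdict(int) for _ in range(NUM_RANKS)]

def owner(kmer):
    return hash(kmer) % NUM_RANKS              # rank that owns this k-mer

def send_kmers(read):
    for i in range(len(read) - K + 1):         # communication: emit messages
        kmer = read[i:i + K]
        mailboxes[owner(kmer)].put(kmer)

def drain(rank):
    while not mailboxes[rank].empty():         # computation: process inbox
        counts[rank][mailboxes[rank].get()] += 1

for read in ["ACGTAC", "CGTACG"]:
    send_kmers(read)
for rank in range(NUM_RANKS):
    drain(rank)

total = sum(sum(c.values()) for c in counts)   # messages processed overall
```

Because ownership is a pure function of the k-mer, every rank agrees on where each message belongs without any central coordinator, which is what lets the pattern scale out.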

    In silico dynamic optimisation studies for batch/fed-batch mammalian cell suspension cultures producing biopharmaceuticals

    Mammalian cell cultures are valuable for the synthesis of therapeutic proteins and antibodies, and are commonly cultivated in bioindustry as large-scale suspension fed-batch cultures. The structure and regulatory responses of mammalian cells are complex, making them challenging to model for practical process optimisation. The adjustable degrees of freedom in cell cultures can be continuous variables as well as binary-type variables, and the binary-type variables may be irreversible, as in cell-cycle arrest. The main aim of this study was to develop a general model for mammalian cell cultures that uses extracellular variables and captures the major changes in cellular responses between batch and fed-batch cultures. Model development started with a simple model of a hybridoma cell culture built from first-principles equations. Growth kinetics were linked only to glucose and glutamine, and the cell population was divided into three cell-cycle phases to study cell-cycle arrest. Although this model successfully optimised a combination of continuous and binary-irreversible degrees of freedom simultaneously, it showed deficiencies in predicting growth rates during the death phase of fed-batch cultures. The growth kinetics were therefore further related to amino acid concentrations and to cellular responses to high versus low glutamine and glucose concentrations, based on a Chinese hamster ovary cell line for which amino acid data were available. The resulting model contains 192 parameters and 26 measured cell culture variables. Most of the sensitive parameters could be identified using the Sobol' method of global sensitivity analysis. The model captures the main trends of the key variables and can be used to search for the optimal working range of the controllable variables, but uncertainties in the sensitive model parameters caused non-negligible variations in the model-based optimisation results.
    It is recommended to couple such off-line optimisation with on-line measurements of a few major variables to tackle the real-time uncertainty of this complex cell culture system.
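A minimal sketch of the kind of first-principles kinetics described above: a batch culture with growth depending on glucose and glutamine through a dual-Monod term, integrated with explicit Euler steps. The dual-Monod form, all parameter values, and the variable names are illustrative assumptions, not the thesis's 192-parameter model.

```python
# Hedged sketch: batch mammalian cell culture with dual-Monod growth
# on glucose (G, mM) and glutamine (Q, mM); X is viable cells per litre.
mu_max, Kg, Kq = 0.04, 1.0, 0.3      # 1/h and mM half-saturation (assumed)
Yg, Yq = 1.0e8, 4.0e8                # cells produced per mmol substrate (assumed)

def step(X, G, Q, dt=0.1):
    """One explicit-Euler step of cells and substrates."""
    mu = mu_max * (G / (Kg + G)) * (Q / (Kq + Q))   # dual-Monod growth rate
    dX = mu * X
    return (X + dX * dt,
            max(G - (dX / Yg) * dt, 0.0),           # glucose consumed with growth
            max(Q - (dX / Yq) * dt, 0.0))           # glutamine consumed with growth

X, G, Q = 2.0e8, 25.0, 4.0           # initial cells/L, glucose, glutamine
for _ in range(int(120 / 0.1)):      # simulate 120 h of batch culture
    X, G, Q = step(X, G, Q)
```

With these assumed yields, glutamine is the limiting substrate: growth stalls once it is exhausted while glucose remains, which is the sort of batch-phase behaviour such models are built to capture before fed-batch feeding strategies are optimised.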

    Handbook of Stemmatology

    Stemmatology studies aspects of textual criticism that use genealogical methods. This handbook is the first to cover the entire field, encompassing both theoretical and practical aspects, ranging from traditional to digital methods. Authors from all the disciplines involved examine topics such as the material aspects of text traditions, methods of traditional textual criticism and their genesis, and modern digital approaches used in the field.

    July 21, 2007 (Pages 3353-4040)
