31 research outputs found

    Deep Learning and Random Forest-Based Augmentation of sRNA Expression Profiles

    The lack of well-structured annotations in a growing amount of RNA expression data complicates data interoperability and reusability. Commonly used text mining methods extract annotations from existing unstructured data descriptions and often provide inaccurate output that requires manual curation. Automatic data-based augmentation (generation of annotations on the basis of expression data) can considerably improve annotation quality but has not been well studied. We formulate automatic augmentation of small RNA-seq expression data as a classification problem and investigate deep learning (DL) and random forest (RF) approaches to solve it. We generate tissue and sex annotations from small RNA-seq expression data for tissues and cell lines of Homo sapiens. We validate our approach on 4243 annotated small RNA-seq samples from the Small RNA Expression Atlas (SEA) database. The average prediction accuracy is 98% (DL) for tissue groups, 96.5% (DL) for tissues, and 77% (DL) for sex. The "one dataset out" average accuracy for tissue group prediction is 83% (DL) and 59% (RF). On average, DL provides better results than RF and considerably improves classification performance for 'unseen' datasets.
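
The "one dataset out" evaluation mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the sample layout (`dataset`, `label`, `x`), the toy values, and the nearest-centroid stand-in classifier are all assumptions made for illustration, where the paper uses deep learning and random forest models. The point is only that every sample from one dataset is held out together, so the model is scored on a dataset it has never seen.

```python
from collections import defaultdict


def leave_one_dataset_out(samples):
    """Yield (held_out_id, train, test), holding out one whole dataset per fold."""
    by_dataset = defaultdict(list)
    for s in samples:
        by_dataset[s["dataset"]].append(s)
    for held_out in sorted(by_dataset):
        train = [s for d, group in by_dataset.items() if d != held_out for s in group]
        yield held_out, train, by_dataset[held_out]


def fit_centroids(train):
    """Hypothetical stand-in classifier: mean expression vector per label."""
    sums, counts = {}, defaultdict(int)
    for s in train:
        vec = sums.setdefault(s["label"], [0.0] * len(s["x"]))
        for i, v in enumerate(s["x"]):
            vec[i] += v
        counts[s["label"]] += 1
    return {lab: [v / counts[lab] for v in vec] for lab, vec in sums.items()}


def predict(centroids, x):
    """Return the label whose centroid is closest to x (squared Euclidean)."""
    def dist(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda lab: dist(centroids[lab]))


def average_accuracy(samples):
    """Mean of per-fold accuracies across all held-out datasets."""
    accs = []
    for _, train, test in leave_one_dataset_out(samples):
        centroids = fit_centroids(train)
        correct = sum(predict(centroids, s["x"]) == s["label"] for s in test)
        accs.append(correct / len(test))
    return sum(accs) / len(accs)


# Toy data (invented): three datasets, two tissue labels, two expression features.
toy = [
    {"dataset": "A", "label": "brain", "x": [9.0, 1.0]},
    {"dataset": "A", "label": "liver", "x": [1.0, 9.0]},
    {"dataset": "B", "label": "brain", "x": [8.0, 2.0]},
    {"dataset": "B", "label": "liver", "x": [2.0, 8.0]},
    {"dataset": "C", "label": "brain", "x": [7.0, 3.0]},
    {"dataset": "C", "label": "liver", "x": [3.0, 7.0]},
]
print(average_accuracy(toy))
```

The toy data is perfectly separable, so every fold is classified correctly. On real data, the gap between this score and a random-split score indicates how much a model depends on dataset-specific effects, which is the contrast the abstract draws between DL and RF.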

    Computationally intensive, distributed and decentralised machine learning: from theory to applications

    Machine learning (ML) is currently one of the most important research fields, spanning computer science, statistics, pattern recognition, data mining, and predictive analytics. It plays a central role in automatic data processing and analysis in numerous research domains owing to widely distributed and geographically scattered data sources, powerful computing clouds, and high digitisation requirements. However, aspects such as the accuracy of methods, data privacy, and model explainability remain challenging and require additional research. Therefore, it is necessary to analyse centralised and distributed data processing architectures, and to create novel computationally intensive explainable and privacy-preserving ML methods, to investigate their properties, to propose distributed versions of prospective ML baseline methods, and to evaluate and apply these in various applications. This thesis addresses the theoretical and practical aspects of state-of-the-art ML methods. The contributions of this thesis are threefold. In Chapter 2, novel non-distributed, centralised, computationally intensive ML methods are proposed, their properties are investigated, and state-of-the-art ML methods are applied to real-world data from two domains, namely transportation and bioinformatics. Moreover, algorithms for ‘black-box’ model interpretability are presented. Decentralised ML methods are considered in Chapter 3. First, we investigate data processing as a preliminary step in data-driven, agent-based decision-making. Thereafter, we propose novel decentralised ML algorithms that are based on the collaboration of the local models of agents. Within this context, we consider various regression models. Finally, the explainability of multiagent decision-making is addressed. In Chapter 4, we investigate distributed centralised ML methods. 
We propose a distributed parallelisation algorithm for semi-parametric and non-parametric regression types, and implement these in the computational environment and data structures of Apache Spark. Scalability, speed-up, and goodness-of-fit experiments using real-world data demonstrate the excellent performance of the proposed methods. Moreover, the federated deep-learning approach enables us to address the data privacy challenges caused by processing distributed private data sources to solve the travel-time prediction problem. Finally, we propose an explainability strategy to interpret the influence of the input variables on this federated deep-learning application. This thesis is based on the contribution made by 11 papers to the theoretical and practical aspects of state-of-the-art and proposed ML methods. We successfully address the stated challenges with various data processing architectures, validate the proposed approaches in diverse scenarios from the transportation and bioinformatics domains, and demonstrate their effectiveness in scalability, speed-up, and goodness-of-fit experiments with real-world data. However, substantial future research is required to address the stated challenges and to identify novel issues in ML. Thus, it is necessary to advance the theoretical part by creating novel ML methods and investigating their properties, as well as to contribute to the application part by using state-of-the-art ML methods and their combinations, and interpreting their results for different problem settings.

    Regulatory role of small RNAs and RNA-binding proteins in carbon metabolism and collective behaviour of Vibrio cholerae

    The importance of small regulatory RNAs (sRNAs) has been recognized across all domains of life. Originally considered “non-coding RNAs,” several bacterial sRNAs have been found to encode functional proteins that are under 50 amino acids long. This group of regulators is called dual-function regulators. To date, only five such regulators have been characterized in bacteria. In the primary study, the first dual-function RNA of Vibrio cholerae was discovered and characterized. The pathogen colonizes and infects the upper intestines by producing two key virulence determinants: toxin co-regulated pilus (TCP) and cholera toxin (CT). While all the known sRNAs of V. cholerae act directly or indirectly to regulate the production of TCP, the sRNA VqmR is the only known direct repressor of CT production to date. Therefore, a forward genetic screen was employed to score for CT repression. This screen identified another promising candidate called Vcr082. Interestingly, Vcr082 also encodes a 29-amino-acid ORF and hence was renamed VcdRP, for V. cholerae dual RNA regulator and protein, reflecting its two roles. The dual regulator is controlled by the global transcription factor of carbon utilization, cAMP-CRP. The riboregulatory component is conserved at the 3’ end of the dual regulator. By employing a conserved stretch of four cytosines, VcdR base-pairs with and represses mRNAs that encode transporters that import PTS sugars. Additionally, VcdR downregulates the phosphocarrier proteins PtsH and PtsI that are involved in the phospho-relay during glycolysis. The small protein, VcdP, exerts its regulatory role by interacting with and accelerating the activity of the citrate synthase enzyme, opening the gateway into the TCA cycle. In this way, both VcdR and VcdP act to block sugar uptake and modulate the flux through the TCA cycle, thereby striking a balance to maintain overall carbon metabolism in V. cholerae.
The diverse environments that V. cholerae inhabits necessitate that the organism rapidly perceives changes in its external environment and appropriately tailors its gene expression. To achieve this, the bacteria employ quorum sensing (QS) to communicate and coordinate a suitable response. While this mechanism of census taking was documented early on in several marine bacteria, more recent studies have identified additional QS systems in V. cholerae. Similarly, while biofilm formation has been extensively studied, the transition into and subsequent dispersal from biofilms was only documented recently. These gaps prompted further investigation of the QS pathway. Therefore, in the second study, a forward genetic screen in a V. cholerae mutant library was employed to score for an altered QS phenotypic transition. This screen identified a novel RNA-binding protein called MbrA (membrane-bound RNA-binding protein A). This protein localizes to the membrane and contains two trans-membrane domains at the N-terminus and a conserved RNA recognition motif-type RNA-binding domain located towards the C-terminus. MbrA is activated by the global transcription factor cAMP-CRP, and a subsequent transcriptome analysis revealed its role in the regulation of motility genes and the flagellar assembly complex in V. cholerae.

    Étude du rôle et de l'importance de petits ARN non-codants dans la relation hôte-pathogènes

    For several decades now, it has been known that only a small fraction of the genome is made up of protein-coding sequences and that the majority of non-coding DNA, historically considered "junk", carries out important biological functions. With this new paradigm, our perception of gene expression and regulation has shifted from a protein-centered view to a more RNA-centered view in both prokaryotes and eukaryotes. The advent of high-throughput sequencing techniques such as RNA sequencing (RNA-Seq) has strongly contributed to the demystification of this "non-coding" part of RNA. Besides the triumvirate of tRNA, rRNA, and mRNA genes, genomes harbor numerous loci that encode small non-canonical regulatory RNAs distributed in many classes. MicroRNAs (miRNAs) and tRNA-derived fragments (tRFs) are the two most abundant classes of small non-coding RNAs (ncRNAs) and share similarities in their mechanisms. Although often overshadowed by proteins, they are at the heart of post-transcriptional regulation and are emerging players in the host-pathogen relationship. This thesis addresses the host-pathogen relationship from the perspective of small ncRNAs through two complementary projects (#1 and #2) that aim to provide new insights and perspectives.
In project #1, we worked with the Ebola RNA virus (EBOV), a pathogen known to cause a deadly hemorrhagic fever, which has been responsible for several epidemics in Africa and remains a global public health concern to this day. By combining RNA-Seq, quantitative PCR and computational analyses, we obtained the first detailed miRNA transcriptome (miRNome) of a human liver cell line infected with one of three EBOV variant strains: Mayinga, Makona or Reston. During EBOV infection, only 1/5 of the host miRNome is differentially expressed over time, with specific modulation of miR-122-5p, miR-148a-3p and miR-21-5p. The data obtained highlight, beyond the previously known clinical manifestations, substantial new differences between the strains with respect to their effect on the miRNome. In a second phase, using the same approach, we discovered, characterized and validated two viral miRNAs encoded by the EBOV genomes (Mayinga and Makona). These two viral miRNAs can potentially target genes involved in the hemorrhagic phenotype, the regulation of viral replication and the modulation of host immune defense. Project #2 was developed in a context where we had fortuitously discovered the existence of RNA species smaller than 16 nt (called here vsRNA for very small RNA) that were functional in eukaryotes but were often removed from sequencing datasets because they were considered "degradation products". We extended our RNA-Seq analysis to bacteria in order to characterize vsRNAs from Escherichia coli K-12 MG1655 and five other bacterial strains. The study is completed by the analysis of Outer Membrane Vesicles (OMVs) produced by E. coli K-12 MG1655 because of their critical roles in enhancing survival, regulating microbial interactions and promoting pathogenesis. The results show the existence of diverse and highly abundant vsRNAs with tRFs as a major biotype, especially those derived from isoleucine tRNA (Ile-tRF).
As a proof of concept of the functionality of these tRF-like vsRNAs, we studied in detail the highly abundant and thermodynamically stable Ile-tRF. Our analyses show that it is selectively modulated by environmental stress and can be transferred via OMVs (where it is particularly enriched) to human HCT116 cells, where it promotes the expression of mRNAs encoding members of the MAP-kinase family. Our study is the first in E. coli to report the existence of abundant tRFs found in vesicles (OMVs) with potential functions in the host. Ile-tRF is also the first functional 13 nt tRF reported in bacteria.

    Summer Research Fellowship Project Descriptions 2018

    A summary of research done by Smith College’s 2018 Summer Research Fellowship (SURF) Program participants. Since its start in 1967, SURF has been a cornerstone of Smith’s science education. Participants are supervised by faculty mentor-advisors drawn from the Clark Science Center and connected to its eighteen science, mathematics, and engineering departments and programs and associated centers and units. At summer’s end, SURF participants were asked to summarize their research experiences for this publication.

    Similarity reasoning for local surface analysis and recognition

    This thesis addresses the similarity assessment of digital shapes, contributing to the analysis of surface characteristics that are independent of the global shape but are crucial to identify a model as belonging to the same manufacture, the same origin/culture or the same typology (color, common decorations, common feature elements, compatible style elements, etc.). To tackle this problem, the interpretation of the local surface properties is crucial. We go beyond the retrieval of models or surface patches in a collection of models and address the recognition of geometric patterns across digital models with different overall shapes. To address this challenging problem, the use of both engineered and learning-based descriptions is investigated, building one of the first contributions towards the localization and identification of geometric patterns on digital surfaces. Finally, the recognition of patterns adds a further perspective to the exploration of (large) 3D data collections, especially in the cultural heritage domain. Our work contributes to the definition of methods able to locally characterize geometric and colorimetric surface decorations. Moreover, we showcase our benchmarking activity carried out in recent years on the identification of geometric features and the retrieval of digital models completely characterized by geometric or colorimetric patterns.

    Le modèle algue brune pour l'analyse fonctionnelle et évolutive du déterminisme sexuel

    Genetic sex determination mechanisms, which are controlled by non-recombinant chromosome regions or sex chromosomes, have emerged independently and repeatedly across several eukaryotic lineages. Most of the knowledge acquired in this area has been obtained for a limited number of eukaryotic groups. The availability of a model organism for the brown algae, Ectocarpus, whose genome has been sequenced, allows the development of tools to study these mechanisms in a lineage that is phylogenetically distant from classically studied models. One of the first challenges was to identify the sex chromosomes in Ectocarpus and to carry out a comparative analysis of these genomic structures. Analysis of gene expression in males and females at different stages of the life cycle then allowed the identification of differentially expressed genes. The functions and molecular evolution of these sex-biased genes were then studied. The large amount of data generated during the course of these analyses allowed the establishment of a new version of the genome assembly and a refined structural and functional annotation of both coding and non-coding genes in Ectocarpus. This work made a significant contribution to knowledge in the field of functional and evolutionary analysis of sex determination in brown algae and significantly updated the genomic resources available for the model organism Ectocarpus.

    University of South Alabama College of Medicine Annual Report for 2017-2018

    This Annual Report of the College of Medicine catalogues accomplishments of our faculty, students, residents, fellows and staff in teaching, research, scholarly and community service during the 2017-2018 fiscal year.