thesis

From gene to function: using new technologies for solving old problems.

Abstract

Recent advances in DNA sequencing have transformed both genomics and proteomics, making it possible to generate gigabases of genome and transcriptome sequence data at a substantially lower cost than was possible just ten years ago. In recent years, many high-throughput technologies have been developed to interrogate various aspects of cellular processes, including sequence and structural variation, the transcriptome, the epigenome, the proteome and the interactome. These Next Generation Sequencing (NGS) experimental technologies are more mature and accessible than the computational tools available to individual researchers to move, store, analyse and present data in a user-friendly and reproducible fashion. My research is set in this context and focuses on the analysis of data produced by NGS technologies and on the development of new tools that address the problems arising during NGS data analysis. To this end, my group and I have addressed several open biomedical problems in collaboration with different research groups at Sapienza University. Some of these experiments have already yielded interesting results, but above all they have provided the occasion and starting point for developing new tools that improve crucial steps of the analysis, solve problems arising from the complexity of the system and make the results easier for researchers to understand. Examples are IsomirT, a tool for small RNA-Seq analysis and isomiR identification; Phagotto, a tool for analysing deep sequencing data derived from phage-displayed libraries; and FIDEA, a web server for the functional interpretation of differential expression analysis. Recent reports have demonstrated that individual microRNAs can be heterogeneous in length and/or sequence, producing multiple mature variants that have been dubbed isomiRs.
IsomirT improves and simplifies the search for isomiRs, starting directly from the results of a miRNA-sequencing experiment. Using it, we observed the behaviour of isomiRs in different cell types and in different biological replicates. Our results indicate that the distribution of the microRNA variants is similar among replicates and different among cells/tissues, suggesting that isomiRs have a functional role in the cell. The use of NGS technologies for the analysis of antibody-selected sequences, both with phage display libraries and with in vitro selection processes, is becoming increasingly popular. Using these technologies, the experimental group headed by Prof. Felici has introduced a new experimental pipeline, named PROFILER, aimed at significantly empowering the analysis of antigen-specific libraries. A key step in exploiting this idea has been the development of a new tool, Phagotto, for processing and analysing the data derived from sequencing. PROFILER, in combination with Phagotto, seems ideally suited to streamline and guide rational antigen design, adjuvant selection and quality control of newly produced vaccines. The publicly available web server FIDEA allows experimentalists to obtain a functional interpretation of the results of a differential expression analysis and to test their hypotheses quickly and easily. The tool performs an enrichment analysis, i.e. an analysis of specific properties that are distributed in a non-random fashion among the up-regulated and down-regulated genes, taken both together and separately. It has been shown to be very useful and is heavily used by scientists all over the world: more than 1500 requests for analysis were submitted to the server in six months. Furthermore, during the course of the PhD I implemented pipelines to speed up and optimize protocols for NGS data analysis and applied them to biomedical projects.
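To make the idea of enrichment analysis mentioned above for FIDEA more concrete, the following is a minimal sketch of one common way to test whether an annotation is over-represented in a gene list: a one-sided hypergeometric test. The abstract does not state which statistic FIDEA uses, so the choice of test, the function names and the toy gene identifiers are all assumptions made for illustration, and no multiple-testing correction is applied.

```python
from math import comb


def hypergeom_pvalue(n_background, n_annotated, n_selected, n_overlap):
    """P(X >= n_overlap) when n_selected genes are drawn without replacement
    from n_background genes, n_annotated of which carry the annotation."""
    upper = min(n_annotated, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_annotated, k) * comb(n_background - n_annotated, n_selected - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


def enrichment(annotation, background, gene_list):
    """Return {term: p-value} for every annotation term (hypothetical helper)."""
    background = set(background)
    selected = set(gene_list) & background
    results = {}
    for term, members in annotation.items():
        members = set(members) & background
        overlap = len(members & selected)
        results[term] = hypergeom_pvalue(
            len(background), len(members), len(selected), overlap
        )
    return results


# Toy data (entirely made up): 20 background genes, one annotation term.
background = [f"g{i}" for i in range(20)]
annotation = {"toy_pathway": ["g0", "g1", "g2", "g3", "g4"]}
up_regulated = ["g0", "g1", "g2", "g3", "g10"]
down_regulated = ["g11", "g12", "g13", "g14", "g15"]

# Up- and down-regulated genes are tested separately and together.
p_up = enrichment(annotation, background, up_regulated)["toy_pathway"]
p_down = enrichment(annotation, background, down_regulated)["toy_pathway"]
p_both = enrichment(annotation, background, up_regulated + down_regulated)["toy_pathway"]
print(p_up, p_down, p_both)
```

In this toy case the term is enriched only among the up-regulated genes, and pooling the two lists dilutes the signal, which illustrates why testing the lists both together and separately can be informative.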
Not all proteins have a complete functional annotation, which raises the problem of predicting the function of proteins with partial or no functional annotation. This can be done either by exploiting the 3D structure of the protein or by inferring the function directly from the sequence. A real challenge, however, is assessing the accuracy of existing methods. In this context, the help that critical assessment experiments can give is essential. We have had the opportunity to be involved, as assessors, in the worldwide experiment CASP (Critical Assessment of protein Structure Prediction). In particular, we are involved in the assessment of residue-residue contact prediction, in which the participating groups provide a list of predicted contacts between residues that can hopefully be used as constraints to fold the protein. We proposed and implemented new methodologies to understand which methods work better and where future efforts should be focused.
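As an illustration of what assessing a list of predicted contacts can involve, the sketch below computes the precision of the top-ranked long-range predictions, a measure commonly used when evaluating contact predictors. It is not the new methodology referred to above: the function name, the input layout, the sequence-separation threshold and the L/5 cut-off are assumptions chosen for the example.

```python
def contact_precision(predicted, native_contacts, seq_length,
                      min_separation=24, fraction=5):
    """Precision of the top seq_length // fraction long-range predictions.

    predicted       -- list of (i, j, score) tuples, higher score = more confident
    native_contacts -- set of (i, j) pairs with i < j observed in the structure
    """
    # Keep only pairs far apart in sequence and normalise them to i < j.
    long_range = [
        (min(i, j), max(i, j), score)
        for i, j, score in predicted
        if abs(i - j) >= min_separation
    ]
    # Rank by confidence, most confident first.
    long_range.sort(key=lambda c: c[2], reverse=True)
    top = long_range[: max(1, seq_length // fraction)]
    if not top:
        return 0.0
    hits = sum((i, j) in native_contacts for i, j, _ in top)
    return hits / len(top)


# Toy example (made-up numbers): a 20-residue protein, separation >= 6.
predicted = [
    (1, 10, 0.90), (2, 15, 0.80), (3, 5, 0.95),
    (4, 18, 0.70), (5, 19, 0.60), (6, 20, 0.50),
]
native = {(1, 10), (4, 18), (6, 20)}
precision = contact_precision(predicted, native, seq_length=20, min_separation=6)
print(precision)
```

Here the pair (3, 5) is discarded as too close in sequence, and two of the four top-ranked remaining predictions are true contacts.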
