
    Parallelisation of EST clustering

    Master of Science - Science. The field of bioinformatics has been developing steadily, with computational problems related to biology taking on increased importance as further advances are sought. The large data sets involved in computational biology have dictated a search for good, fast approximations to computationally complex problems. This research aims to improve a method used to discover and understand genes, which are small subsequences of DNA. A difficulty arises because genes contain parts we know to be functional and other parts we assume are non-functional, as their functions have not been determined. Isolating the functional parts requires the use of natural biological processes which perform this separation. However, these processes cannot read long sequences, forcing biologists to break a long sequence into a large number of small sequences and then read these. This creates the computational difficulty of categorizing the short fragments according to gene membership. Expressed Sequence Tag (EST) Clustering is a technique used to facilitate the identification of expressed genes by grouping together similar fragments, on the assumption that they belong to the same gene. The aim of this research was to investigate the usefulness of distributed memory parallelisation for the Expressed Sequence Tag Clustering problem. This was investigated empirically, with a distributed system tested for speed against a sequential one. It was found that distributed memory parallelisation can be very effective in this domain. The results showed a super-linear speedup for up to 100 processors; higher numbers were not tested but are likely to produce further speedups. The system was able to cluster 500,000 ESTs in 641 minutes using 101 processors.
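    The grouping step described in the abstract above can be illustrated with a minimal sequential sketch. It is not the thesis's method: the similarity measure (number of shared k-mers), the parameters `k` and `min_shared`, and the function names are all assumptions chosen for illustration. Fragments are linked when they share enough k-mers, and linked fragments are merged with a union-find structure (single-linkage clustering).

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(seqs, k=4, min_shared=3):
    """Single-linkage clustering (illustrative assumption, not the thesis's
    algorithm): two sequences are linked if they share >= min_shared k-mers."""
    parent = list(range(len(seqs)))

    def find(i):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in seqs]
    # Quadratic all-pairs comparison: the expensive part of the problem.
    for a, b in combinations(range(len(seqs)), 2):
        if len(profiles[a] & profiles[b]) >= min_shared:
            parent[find(a)] = find(b)

    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())


if __name__ == "__main__":
    ests = ["ACGTACGTTG", "CGTACGTTGA", "TTTTGGGGCC", "TTGGGGCCAA"]
    print(cluster_ests(ests))
```

    The all-pairs loop grows quadratically with the number of fragments, which is one plausible reason such a workload is a candidate for distribution across processors; how the thesis actually partitions the work is not stated in the abstract.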

    Molecular solutions for double and partial digest problems in polynomial time

    A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments. The double and partial digest problems are two intractable problems that arise when constructing physical maps of DNA molecules. Several approaches, including exponential and heuristic algorithms, have been proposed to tackle these problems. In this paper we present polynomial-time molecular algorithms for both problems. To this end, a molecular model similar to the Adleman and Lipton model is presented. The operations used are simple and are performed in polynomial time. Our algorithms are computationally simulated.
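    For orientation, the double digest problem can be stated as follows: given the multisets of fragment lengths produced by cutting the same DNA molecule with enzyme A, with enzyme B, and with both together, find orderings of the A and B fragments whose combined cut sites reproduce the joint multiset. The sketch below is a conventional brute-force search of the exponential kind the abstract mentions, not the paper's molecular algorithm; the function names are hypothetical.

```python
from itertools import permutations


def cut_sites(fragments):
    """Positions of the internal cuts implied by an ordering of fragments."""
    sites, pos = set(), 0
    for length in fragments[:-1]:
        pos += length
        sites.add(pos)
    return sites


def double_digest(a, b, ab):
    """Exhaustive search (exponential time, illustrative only).

    Returns one pair of orderings (order_a, order_b) whose merged cut
    sites yield the multiset ab, or None if no such pair exists.
    """
    total = sum(a)
    if sum(b) != total or sum(ab) != total:
        return None
    target = sorted(ab)
    for pa in set(permutations(a)):
        sites_a = cut_sites(pa)
        for pb in set(permutations(b)):
            sites = sorted(sites_a | cut_sites(pb) | {0, total})
            lengths = sorted(y - x for x, y in zip(sites, sites[1:]))
            if lengths == target:
                return list(pa), list(pb)
    return None


if __name__ == "__main__":
    print(double_digest([2, 3, 5], [4, 6], [1, 2, 3, 4]))
```

    The number of orderings grows factorially with the number of fragments, which is why this approach is only practical for very small inputs.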

    Alternative splicing detection across different tissues in cork oak

    Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2017. Cork oak (Quercus suber L.) forests are unique and emblematic resources for Portugal, with high economic, ecological and social significance. 
The recent availability of the cork oak genome sequence provided an important contribution to reinvigorate research in fundamental topics such as cork development and plant improvement, and to promote the competitiveness of the cork industry. Yet further analysis is required to add detail to the genome structural annotation, namely at the transcript level, also taking into account alternative splicing. Alternative splicing (AS) is a process used during gene expression to yield different transcript variants and protein products from a single gene. In the present study, we analyzed sixteen RNA-seq libraries prepared from four cork oak tissues (leaf, xylem, phellem and inner bark) in order to predict new AS forms for the already predicted genes and improve the genome structural annotation. A bioinformatics pipeline was defined to test the performance of HISAT2 and STAR for read mapping against the reference genome, and of Cufflinks and StringTie for transcript assembly. STAR yielded higher mapping rates (84.22% to 86.86%) for the cork oak datasets than HISAT2 (73.88% to 76.55%), and the STAR mapping data was therefore selected for transcript assembly. StringTie was globally more conservative than Cufflinks in this step, generating fewer novel transcripts but with better support in terms of read coverage per base. A further optimization step was performed using StringTie in order to improve annotation precision and reduce false positives. The final transcript annotation was selected from this optimization step, predicting 7,958 novel transcripts (8% of total transcripts in the new annotation), 5,453 of which were novel isoforms of genes in the reference annotation. This new annotation was used as the reference to estimate transcript abundance in each tissue and to perform differential expression analysis. 
Approximately 16% of all intron-containing genes expressed in the four tissues were alternatively spliced, and the main event found in the four cork oak tissues was alternative acceptor site, followed by intron retention. Transcript clusters showing differential expression among the four tissues were identified, and functional enrichment analysis confirmed the main biological processes expected for each tissue: transcripts highly expressed in leaves and xylem were mostly related to photosynthesis and transport, respectively; transcripts highly expressed in the periderm (phellem and inner bark) showed an enrichment in functional categories related to the synthesis of suberin and other components of cork cell walls. These tissue-specific clusters also showed an enrichment in transcripts involved in the response to stress (biotic or abiotic). In the periderm, however, this enrichment was mostly observed in inner bark samples, while phellem samples showed an enrichment in transcripts related to secondary metabolism. This thesis allowed the definition of a standard workflow that can be used to study alternative splicing in cork oak and for further analysis on the new, improved genome version that will be available soon.