2,246 research outputs found
Parallelisation of EST clustering
Master of Science - Science
The field of bioinformatics has been developing steadily, with computational problems related to biology taking on increased importance as further advances are sought. The large data sets involved in computational biology have driven a search for good, fast approximations to computationally complex problems. This research aims to improve a method used to discover and understand genes, which are small subsequences of DNA. A difficulty arises because genes contain parts we know to be functional and other parts we assume are non-functional, as their functions have not been determined. Isolating the functional parts requires the use of natural biological processes that perform this separation. However, these processes cannot read long sequences, forcing biologists to break a long sequence into a large number of small sequences and then read these. This creates the computational difficulty of categorizing the short fragments according to gene membership.
Expressed Sequence Tag (EST) Clustering is a technique used to facilitate the identification of expressed genes by grouping together similar fragments, on the assumption that they belong to the same gene.
The aim of this research was to investigate the usefulness of distributed memory parallelisation for the Expressed Sequence Tag Clustering problem. This was investigated empirically, with a distributed system tested for speed against a sequential one. It was found that distributed memory parallelisation can be very effective in this domain.
The results showed a super-linear speedup for up to 100 processors; higher processor counts were not tested but are likely to produce further speedups. The system was able to cluster 500,000 ESTs in 641 minutes using 101 processors.
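As a rough illustration of the clustering step described in this abstract, the following Python sketch groups fragments that share enough short subsequences (k-mers). It is a minimal sequential sketch, not the system built in the thesis: the function names, the k-mer similarity test, and the `k` and `min_shared` thresholds are all assumptions made for illustration.

```python
from itertools import combinations


def kmers(seq, k):
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def cluster_ests(ests, k=8, min_shared=3):
    """Single-linkage clustering of fragments (hypothetical similarity rule).

    Two fragments are placed in the same cluster when they share at least
    `min_shared` distinct k-mers. Returns clusters as sorted lists of indices.
    """
    parent = list(range(len(ests)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    profiles = [kmers(s, k) for s in ests]
    # All-pairs comparison: quadratic in the number of fragments.
    for i, j in combinations(range(len(ests)), 2):
        if len(profiles[i] & profiles[j]) >= min_shared:
            parent[find(i)] = find(j)

    clusters = {}
    for i in range(len(ests)):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())
```

The quadratic all-pairs loop dominates the running time, so it is the natural candidate for dividing among processors in a distributed-memory setting; how the thesis actually partitions the work is not stated in the abstract.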
Molecular solutions for double and partial digest problems in polynomial time
A fundamental problem in computational biology is the construction of physical maps of chromosomes from hybridization experiments between unique probes and clones of chromosome fragments. The double digest and partial digest problems are two intractable problems that arise when constructing physical maps of DNA molecules. Several approaches, including exponential algorithms and heuristic algorithms, have been proposed to tackle these problems. In this paper we present two polynomial-time molecular algorithms for these problems. To this end, we present a molecular model similar to the Adleman and Lipton model. The operations used are simple and are performed in polynomial time. We have simulated our algorithms computationally.
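For readers unfamiliar with the partial digest problem, the sketch below shows a conventional backtracking solver: given the multiset of all pairwise distances between restriction sites, it recovers one consistent set of site positions. This is the classical sequential approach with exponential worst-case running time, not the molecular algorithm proposed in the paper; the function name and input convention are assumptions for illustration.

```python
from collections import Counter


def partial_digest(distances):
    """Return one sorted list of positions whose pairwise distances
    equal the multiset `distances`, or None if no such list exists."""
    remaining = Counter(distances)
    width = max(remaining)          # largest distance spans the two end points
    remaining[width] -= 1
    points = [0, width]

    def place():
        if sum(remaining.values()) == 0:
            return sorted(points)
        largest = max(d for d, c in remaining.items() if c > 0)
        # The largest unexplained distance must be measured from one end,
        # so the new point sits at `largest` or at `width - largest`.
        for candidate in (largest, width - largest):
            needed = Counter(abs(candidate - p) for p in points)
            if all(remaining[d] >= c for d, c in needed.items()):
                remaining.subtract(needed)
                points.append(candidate)
                result = place()
                if result is not None:
                    return result
                # Undo the placement and try the other option.
                points.pop()
                remaining.update(needed)
        return None

    return place()
```

A solution and its mirror image produce the same distances, so the solver returns whichever of the two it reaches first.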
Skeleton Structures and Origami Design
In this dissertation we study problems related to polygonal skeleton structures that have applications to computational origami. The two main structures studied are the straight skeleton of a simple polygon (and its generalizations to planar straight-line graphs) and the universal molecule of a Lang polygon. This work builds on results completed jointly with my advisor Ileana Streinu.
Skeleton structures are used in many computational geometry algorithms. Examples include the medial axis, with applications in shape analysis, optical character recognition, and surface reconstruction; and the Voronoi diagram, with a wide array of applications including geographic information systems (GIS), point location data structures, and motion planning.
The straight skeleton, studied in this work, has applications in origami design, polygon interpolation, biomedical imaging, and terrain modeling, to name just a few. Though the straight skeleton has been well studied in the computational geometry literature for over 20 years, there still exists a significant gap between the fastest algorithms for constructing it and the known lower bounds.
One contribution of this thesis is an efficient algorithm for computing the straight skeleton of a polygon, polygon with holes, or a planar straight-line graph given a secondary structure called the induced motorcycle graph.
The universal molecule is a generalization of the straight skeleton to certain convex polygons that have a particular relationship to a metric tree. It is used in Robert Lang's seminal TreeMaker method for origami design. Informally, the universal molecule is a subdivision of a polygon (or polygonal sheet of paper) that allows the polygon to be "folded" into a particular 3D shape with certain tree-like properties. One open problem is whether the universal molecule can be rigidly folded: given the initial flat state and a particular desired final "folded" state, is there a continuous motion between the two states that maintains the faces of the subdivision as rigid panels? A partial characterization is known: for a certain measure-zero class of universal molecules there always exists such a folding motion. Another open problem is to remove the restriction of the universal molecule to convex polygons. This is of practical importance since the TreeMaker method sometimes fails to produce an output on valid input due to the convexity restriction, and extending the universal molecule to non-convex polygons would allow TreeMaker to work on all valid inputs. One further interesting problem is the development of faster algorithms for computing the universal molecule.
In this thesis we make the following contributions to the study of the universal molecule. We first characterize the tree-like family of surfaces that are foldable from universal molecules. In order to do this we define a new family of surfaces we call Lang surfaces and prove that a restricted class of these surfaces is equivalent to the universal molecules. Next, we develop and compare efficient implementations for computing the universal molecule. Then, by investigating properties of broader classes of Lang surfaces, we arrive at a generalization of the universal molecule from convex polygons in the plane to non-convex polygons in arbitrary flat surfaces. This generalization is of both practical and theoretical interest. The practical interest is that this work removes the case from Lang's TreeMaker method that causes TreeMaker to fail to produce output in the presence of non-convex polygons. The theoretical interest comes from the fact that our generalization encompasses more than just those surfaces that can be cut out of a sheet of paper, and pertains to polygons that cannot be laid flat in the plane without self-intersections. Finally, we identify a large class of universal molecules that are not foldable by rigid folding motions. This makes progress towards a complete characterization of the foldability of the universal molecule.
Alternative splicing detection across different tissues in cork oak
Master's thesis, Bioinformatics and Computational Biology (Bioinformatics), Universidade de Lisboa, Faculdade de Ciências, 2017
Cork oak (Quercus suber L.) forests are unique and emblematic resources for Portugal, with high economic, ecological and social significance.
The recent availability of the cork oak genome sequence has provided an important contribution to reinvigorating research on fundamental topics such as cork development and plant improvement, and to promoting the competitiveness of the cork industry. Yet further analysis is required to add detail to the structural annotation of the genome, namely at the transcript level, including the prediction of alternative splicing events. Alternative splicing (AS) is a process used during gene expression to yield different transcript variants (isoforms) and protein products from a single gene. In the present study, we analyzed sixteen RNA-seq libraries prepared from four cork oak tissues (leaf, xylem, phellem and inner bark) in order to predict new AS forms for already predicted genes and to improve the structural annotation of the genome. A bioinformatics pipeline was defined to test the performance of HISAT2 and STAR for read mapping against the reference genome, and of Cufflinks and StringTie for transcript assembly. STAR yielded higher mapping rates (84.22% to 86.86%) for the cork oak datasets than HISAT2 (73.88% to 76.55%), and the STAR mapping data were therefore selected for transcript assembly. StringTie was globally more conservative than Cufflinks in this step, generating fewer novel transcripts but with better support from per-base read coverage. A further optimization step was performed using StringTie in order to improve annotation precision and reduce false positives. The final transcript annotation was selected from this optimization step; it predicts 7,958 novel transcripts (8% of the total transcripts in the new annotation), 5,453 of which are novel isoforms of genes in the reference annotation. This new annotation was used as the reference to estimate transcript abundance in each tissue and to perform differential expression analysis.
Approximately 16% of all intron-containing genes expressed in the four tissues were alternatively spliced, and the main event found in the four cork oak tissues was alternative acceptor site, followed by intron retention. Transcript clusters showing differential expression among the four tissues were identified, and functional enrichment analysis confirmed the main biological processes expected for each tissue: transcripts highly expressed in leaves and xylem were mostly related to photosynthesis and transport, respectively; transcripts highly expressed in the periderm (phellem and inner bark) showed an enrichment in functional categories related to the synthesis of suberin and other components of cork cell walls. These tissue-specific clusters also showed an enrichment in transcripts involved in the response to stress (biotic or abiotic). Within the periderm, however, this enrichment was mostly observed in inner bark samples, while phellem samples showed an enrichment in transcripts related to secondary metabolism. This thesis allowed the definition of a standard workflow that can be used to study alternative splicing in cork oak and for further analysis of the new, improved genome version that will be available soon.
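To make the two most frequent event types named in this abstract concrete, the sketch below compares a reference isoform with an alternative isoform of the same gene and labels each reference intron. It is a deliberately simplified toy, not the thesis's workflow: it assumes plus-strand genes, half-open `(start, end)` exon coordinates, and hypothetical function names; the abstract does not say which tool classified the events.

```python
def introns(exons):
    """Introns as (donor, acceptor) gaps between consecutive sorted exons."""
    exons = sorted(exons)
    return [(a_end, b_start) for (_, a_end), (b_start, _) in zip(exons, exons[1:])]


def classify_events(ref_exons, alt_exons):
    """Label each reference intron by how the alternative isoform treats it.

    intron_retention:     an alternative exon covers the whole intron.
    alternative_acceptor: an alternative intron starts at the same donor
                          but ends at a different acceptor (plus strand).
    Reference introns matching neither rule are not reported.
    """
    events = []
    alt_introns = introns(alt_exons)
    for donor, acceptor in introns(ref_exons):
        if any(s <= donor and acceptor <= e for s, e in alt_exons):
            events.append(("intron_retention", donor, acceptor))
        elif any(d == donor and a != acceptor for d, a in alt_introns):
            events.append(("alternative_acceptor", donor, acceptor))
    return events


# Hypothetical coordinates, for illustration only.
reference = [(100, 200), (300, 400), (500, 600)]
alternative = [(100, 200), (320, 600)]
events = classify_events(reference, alternative)
```

In this made-up example the alternative isoform shifts the acceptor of the first intron from 300 to 320 and reads straight through the second intron, so one event of each type is reported.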