41 research outputs found

    Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs

    Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify the best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (ii) and (iii)]. First, computing sequence similarity data [step (i)] is a computationally intensive task for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree) based on k-mer counts to rapidly compare all pairs of sequences in a large protein sequence set and identify putative homologs. Second, the availability of complex genomes containing large gene families, in which complex evolutionary events such as duplications are prevalent, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2
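
The sort-join idea named in this abstract can be illustrated with a minimal sketch. This is not the afree implementation: the function names, the choice of k = 3, and the `min_shared` threshold are all assumptions made for illustration. The sketch emits one (k-mer, sequence id) record per distinct k-mer, sorts the records so that identical k-mers become adjacent, and then joins within each group to count how many k-mers each pair of sequences shares.

```python
from collections import Counter
from itertools import combinations, groupby


def shared_kmer_counts(sequences, k=3):
    """Count distinct k-mers shared by each pair of sequences (sort-join sketch)."""
    # Emit one (k-mer, sequence id) record per distinct k-mer in each sequence.
    records = []
    for seq_id, seq in sequences.items():
        kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        records.extend((kmer, seq_id) for kmer in kmers)

    # Sort so that records carrying the same k-mer become adjacent.
    records.sort()

    # Join: every pair of sequences inside one k-mer group shares that k-mer.
    counts = Counter()
    for _, group in groupby(records, key=lambda rec: rec[0]):
        ids = [seq_id for _, seq_id in group]
        for a, b in combinations(ids, 2):
            counts[(a, b)] += 1
    return counts


def putative_homologs(sequences, k=3, min_shared=2):
    """Return pairs sharing at least min_shared k-mers (hypothetical threshold)."""
    counts = shared_kmer_counts(sequences, k)
    return {pair for pair, n in counts.items() if n >= min_shared}


if __name__ == "__main__":
    toy = {"p1": "MKTAYIAK", "p2": "MKTAYLAK", "p3": "GGGGGGGG"}
    print(putative_homologs(toy))
```

Because the join only visits pairs that co-occur in at least one k-mer group, unrelated sequences are never compared directly, which is the main appeal of this style of all-against-all comparison.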

    Molecular Dynamics Study of Supercoiled DNA Minicircles Tightly Bent and Supercoiled DNA in Atomistic Resolution

    Towards a complete understanding of the DNA response to superhelical stress, sequence-dependent structural disruptions in supercoiled DNA minicircles of ~100 base pairs were examined through a series of atomistic MD simulations. The results showed how some subtle structural characteristics of DNA affect defect formation, including flexibility at the base-pair step level and anisotropy, for which dynamic information is available only from atomistic MD simulations. Longer supercoiled DNA minicircles (240-340 bp) adopt writhed conformations. Writhe can be calculated by a Gauss integral performed along the DNA central axis path. A new mathematical definition of the DNA central axis path was developed for more exact writhe calculation. Finally, an atomistic representation of supercoiled 336 base pair minicircles was provided by fitting the DNA structure obtained from explicitly solvated MD simulations into density maps from electron cryo-tomography. The structural data were analysed and provided a reasonable explanation for the mechanism of sequence-specific binding of the enzyme topoisomerase 1B to negatively supercoiled DNA.
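
The Gauss integral for writhe mentioned in this abstract can be sketched numerically. The example below assumes the central axis is already available as a closed polyline of 3D points; it does not reproduce the new axis definition developed in the thesis, and the midpoint-rule discretisation and function names are illustrative choices only. It approximates Wr = (1/4π) ∮∮ (dr1 × dr2) · (r1 − r2) / |r1 − r2|³ by summing over all pairs of distinct segments.

```python
import math


def writhe(points):
    """Approximate the writhe of a closed polyline with a midpoint-rule Gauss integral."""
    n = len(points)
    # Segment vectors and segment midpoints of the closed polyline.
    segs, mids = [], []
    for i in range(n):
        p, q = points[i], points[(i + 1) % n]
        segs.append((q[0] - p[0], q[1] - p[1], q[2] - p[2]))
        mids.append(((p[0] + q[0]) / 2, (p[1] + q[1]) / 2, (p[2] + q[2]) / 2))

    total = 0.0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip the singular self-term
            a, b = segs[i], segs[j]
            d = (mids[i][0] - mids[j][0],
                 mids[i][1] - mids[j][1],
                 mids[i][2] - mids[j][2])
            cross = (a[1] * b[2] - a[2] * b[1],
                     a[2] * b[0] - a[0] * b[2],
                     a[0] * b[1] - a[1] * b[0])
            dist = math.sqrt(d[0] ** 2 + d[1] ** 2 + d[2] ** 2)
            total += (cross[0] * d[0] + cross[1] * d[1] + cross[2] * d[2]) / dist ** 3
    return total / (4 * math.pi)


def planar_circle(n=100):
    """Closed planar circle; its writhe should vanish."""
    return [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n), 0.0)
            for i in range(n)]


if __name__ == "__main__":
    print(writhe(planar_circle()))
```

A planar curve has zero writhe, which gives a quick sanity check; mirroring a non-planar curve should flip the sign of the result. The double loop is quadratic in the number of points, so a real analysis of long trajectories would likely need a vectorised or analytic per-segment formulation.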

    Modelling the Extensionally Driven Transitions of DNA

    Empirical measurements on DNA under tension show a jump by a factor of ≈ 1.5–1.7 in the relative extension at an applied force of ≈ 65–70 pN, indicating a structural transition. The still ambiguously characterised stretched ‘phase’ is known as S-DNA. Using atomistic and coarse-grained Monte Carlo simulations we study DNA over-stretching in the presence of organic salts Ethidium Bromide (EtBr) and Arginine (an amino acid present in the RecA binding cleft). We present planar-stacked triplet disproportionated DNA as a solution phase of the double helix under tension, and dub it ‘Σ DNA’, with the three right-facing points of the Σ character serving as a mnemonic for the three grouped bases. Like unstretched Watson-Crick base paired DNA structures, the structure of the Σ phase is linked to function: the partitioning of bases into codons of three base-pairs each is the first stage of operation of recombinase enzymes such as RecA, facilitating alignment of homologous or near-homologous sequences for genetic exchange or repair. By showing that this process does not require any very sophisticated manipulation of the DNA, we position it as potentially appearing as an early step in the development of life, and correlate the postulated sequence of incorporation of amino acids (GADV then GADVESPLIT and then the full 20-residue set of canonical amino acids) into molecular biology with the ease of Σ-formation for sequences including the associated codons. To further investigate the dependence of stretching behaviour on the concentration of intercalating salt molecules, we present a physically motivated coarse-grained force-field for DNA under tension and use it to qualitatively reproduce regimes of force-extension behaviour which are not atomistically accessible.

    Geometric modeling, simulation, and visualization methods for plasmid DNA molecules

    Plasmid DNA molecules are a special type of DNA molecule used, among other applications, in DNA vaccination and gene therapy. In their natural state, these molecules have a closed-circular conformation and are supercoiled. The production of plasmid DNA using bacteria as hosts requires a purification step in which the plasmid DNA molecules are separated from the DNA of the host and other contaminants. This purification process, and all the physical and chemical variations involved, such as temperature changes, may affect the conformation of the plasmid DNA molecules by uncoiling or even opening them, which makes them useless for therapeutic applications. For that reason, researchers are always searching for new purification techniques that maximize the amount of supercoiled plasmid DNA produced. Computer simulations and 3D visualization of plasmid DNA can bring many advantages because they allow researchers to see what can happen to the molecules under certain conditions. It was therefore necessary to develop reliable and accurate geometric models specific to plasmid DNA simulations. This dissertation presents a new assembling algorithm for B-DNA specifically developed for plasmid DNA assembling. This new assembling algorithm is completely adaptive in the sense that it allows researchers to assemble any plasmid DNA base-pair sequence along any arbitrary conformation that fits the length of the plasmid DNA molecule. This is especially suitable for plasmid DNA simulations, where conformations are generated by simulation procedures and the given base-pair sequence must be assembled over that conformation, which cannot be done by conventional predictive DNA assembling methods. Unlike traditional molecular visualization methods that are based on the atomic structure, this new assembling algorithm uses color-coded 3D molecular surfaces of the nucleotides as the building blocks for DNA assembling. 
This new approach not only reduces the number of graphical objects and, consequently, makes the rendering faster, but also makes it easier to visually identify the nucleotides in the DNA strands. The algorithm used to triangulate the molecular surfaces of the nucleotide building blocks is also a novelty presented as part of this dissertation. This new triangulation algorithm for Gaussian molecular surfaces introduces a new mechanism that divides the atomic structure of molecules into boxes and spheres. This new space division method is faster because it confines the local calculation of the molecular surface to a specific region of influence of the atomic structure, ignoring atoms that do not influence the triangulation of the molecular surface in that region. This new method also guarantees the continuity of the molecular surface. Since the aim of this dissertation is to present a complete set of methods for plasmid DNA visualization and simulation, a new deformation algorithm for plasmid DNA Monte Carlo simulations is also proposed. This new deformation algorithm uses a 3D polyline to represent the plasmid DNA conformation and performs small deformations on that polyline, preserving segment lengths and connectivity. Experiments were performed to compare this new deformation method with deformation methods traditionally used in Monte Carlo plasmid DNA simulations. These experiments showed that the new method is more efficient in the sense that its trial acceptance ratio is higher and it converges sooner and faster to the elastic energy equilibrium state of the plasmid DNA molecule. In sum, this dissertation successfully presents an end-to-end set of models and algorithms for plasmid DNA geometric modelling, visualization and simulation.
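
To make the constraint on polyline deformations concrete, here is a minimal sketch of a crankshaft rotation, a conventional trial move in Monte Carlo simulations of circular chains. It is not the dissertation's new deformation algorithm, which the abstract does not specify; the function names and the representation of the plasmid as a closed list of 3D tuples are assumptions. The move rotates the vertices strictly between two chosen vertices about the axis joining them, which leaves every segment length and the chain connectivity unchanged.

```python
import math


def rotate_about_axis(p, origin, axis, angle):
    """Rotate point p about the line through origin along unit vector axis (Rodrigues' formula)."""
    v = [p[i] - origin[i] for i in range(3)]
    c, s = math.cos(angle), math.sin(angle)
    dot = sum(axis[i] * v[i] for i in range(3))
    cross = (axis[1] * v[2] - axis[2] * v[1],
             axis[2] * v[0] - axis[0] * v[2],
             axis[0] * v[1] - axis[1] * v[0])
    return tuple(origin[i] + v[i] * c + cross[i] * s + axis[i] * dot * (1 - c)
                 for i in range(3))


def crankshaft_move(polyline, i, j, angle):
    """Return a copy of a closed polyline with vertices strictly between i and j
    (i < j) rotated by angle about the axis through vertices i and j."""
    a, b = polyline[i], polyline[j]
    axis = [b[k] - a[k] for k in range(3)]
    norm = math.sqrt(sum(x * x for x in axis))
    if norm == 0.0:
        return list(polyline)  # degenerate axis: leave the chain unchanged
    axis = [x / norm for x in axis]
    moved = list(polyline)
    for k in range(i + 1, j):
        moved[k] = rotate_about_axis(polyline[k], a, axis, angle)
    return moved


def segment_lengths(polyline):
    """Lengths of all segments of the closed polyline, including the closing one."""
    n = len(polyline)
    return [math.dist(polyline[k], polyline[(k + 1) % n]) for k in range(n)]
```

In a Metropolis scheme, a move like this would be proposed with a random pair of vertices and a small random angle, then accepted or rejected according to the change in elastic energy; the energy model is outside the scope of this sketch.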

    Comparison of existing aneurysm models and their path forward

    The two most important aneurysm types are cerebral aneurysms (CA) and abdominal aortic aneurysms (AAA), together accounting for over 80% of all fatal aneurysm incidences. To minimise aneurysm-related deaths, clinicians require tools to accurately estimate rupture risk. For both aneurysm types, the current state-of-the-art tools to evaluate rupture risk are identified and evaluated in terms of clinical applicability. We perform a comprehensive literature review using the Web of Science database. Identified records (3127) are clustered by modelling approach and aneurysm location in a meta-analysis to quantify scientific relevance and to extract modelling patterns, and are further assessed according to PRISMA guidelines (179 full text screens). Besides general differences and similarities of CA and AAA, we identify and systematically evaluate four major modelling approaches to aneurysm rupture risk: two deterministic approaches (finite element analysis and computational fluid dynamics) and two stochastic approaches (machine learning, and assessment tools and dimensionless parameters). The latter score highest in the evaluation of their potential as clinical applications for rupture prediction, due to their readiness level and user friendliness. Deterministic approaches are less likely to be applied in a clinical environment because of their high model complexity. Because deterministic approaches consider the underlying mechanisms of aneurysm rupture, they are better able than stochastic approaches to account for unusual patient-specific characteristics. We show that increased interdisciplinary exchange between specialists can boost comprehension of this disease and help design tools for a clinical environment. By combining deterministic and stochastic models, the advantages of both approaches can improve accessibility for clinicians and prediction quality for rupture risk. Comment: 46 pages, 5 figures

    Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement

    Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems still face several challenges due to the complexity of the word alignment problem, insufficient training data, difficulty learning statistics correctly, translation divergences, and lack of a means for incremental incorporation of linguistic knowledge. This thesis presents two new frameworks to improve existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced. The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually-crafted, linguistically-motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on the initial alignment and linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them. The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network-based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model to learn how to combine the input alignments effectively. 
The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow an easy integration of linguistic knowledge into statistical models without the need for large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments in an external application: phrase-based MT.

    Computational Geometric and Algebraic Topology

    Computational topology is a young, emerging field of mathematics that seeks out practical algorithmic methods for solving complex and fundamental problems in geometry and topology. It draws on a wide variety of techniques from across pure mathematics (including topology, differential geometry, combinatorics, algebra, and discrete geometry), as well as applied mathematics and theoretical computer science. In turn, solutions to these problems have a wide-ranging impact: already they have enabled significant progress in the core area of geometric topology, introduced new methods in applied mathematics, and yielded new insights into the role that topology has to play in fundamental problems surrounding computational complexity. At least three significant branches have emerged in computational topology: algorithmic 3-manifold and knot theory; persistent homology; and surfaces and graph embeddings. These branches have emerged largely independently. However, it is clear that they have much to offer each other. The goal of this workshop was to be the first significant step to bring these three areas together, to share ideas in depth, and to pool our expertise in approaching some of the major open problems in the field.