Efficient large-scale protein sequence comparison and gene matching to identify orthologs and co-orthologs
Broadly, computational ortholog assignment is a three-step process: (i) identify all putative homologs between the genomes, (ii) identify gene anchors and (iii) link anchors to identify the best gene matches given their order and context. In this article, we engineer two methods to improve two important aspects of this pipeline [specifically steps (i) and (iii)]. First, computing sequence similarity data [step (i)] is computationally intensive for large sequence sets, creating a bottleneck in the ortholog assignment pipeline. We have designed a fast and highly scalable sort-join method (afree), based on k-mer counts, to rapidly compare all pairs of sequences in a large protein sequence set and identify putative homologs. Second, the availability of complex genomes containing large gene families, with a prevalence of complex evolutionary events such as duplications, has made the task of assigning orthologs and co-orthologs difficult. Here, we have developed an iterative graph matching strategy in which the best gene assignments are identified at each iteration, resulting in a set of orthologs and co-orthologs. We find that the afree algorithm is faster than existing methods and maintains high accuracy in identifying similar genes. The iterative graph matching strategy also showed high accuracy in identifying complex gene relationships. Standalone afree is available from http://vbc.med.monash.edu.au/~kmahmood/afree. EGM2, the complete ortholog assignment pipeline (including afree and the iterative graph matching method), is available from http://vbc.med.monash.edu.au/~kmahmood/EGM2
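The sort-join idea behind a k-mer-count comparison can be illustrated with a toy sketch (this is not the published afree implementation, and all names here are illustrative): decompose every protein into overlapping k-mers, index which sequences contain each k-mer, and then join on identical k-mers, so that only sequence pairs that actually share at least one k-mer are ever compared.

```python
from collections import Counter
from itertools import combinations

def kmers(seq, k=3):
    """All overlapping k-mers of a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def shared_kmer_counts(seqs, k=3):
    """Sort-join-style pairwise shared-k-mer counts.

    Instead of comparing every pair of sequences directly, build an
    index (k-mer -> sequences containing it, with multiplicities) and
    join on identical k-mers, so only pairs that share a k-mer are
    ever touched.  seqs is a dict of {sequence_id: protein_string}.
    """
    index = {}
    for sid, seq in seqs.items():
        for km, n in Counter(kmers(seq, k)).items():
            index.setdefault(km, []).append((sid, n))
    counts = Counter()
    for km, hits in index.items():
        for (a, na), (b, nb) in combinations(hits, 2):
            counts[(a, b)] += min(na, nb)
    return counts

seqs = {"g1": "MKVLAAGG", "g2": "MKVLSAGG", "g3": "PQRSTWYC"}
print(shared_kmer_counts(seqs, k=3))
```

Pairs with a shared-k-mer count above a chosen threshold would then be passed on as putative homologs; the real method additionally sorts the index for scalability on large sequence sets.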
Molecular Dynamics Study of Supercoiled DNA Minicircles: Tightly Bent and Supercoiled DNA in Atomistic Resolution
Towards a complete understanding of the DNA response to superhelical stress, sequence-dependent structural disruptions in ~100 base pair supercoiled DNA minicircles were examined through a series of atomistic MD simulations. The results showed the effects of subtle structural characteristics of DNA on defect formation, including flexibility and anisotropy at the base-pair step level, dynamic information that is available only from atomistic MD simulations. Longer supercoiled DNA minicircles (240-340 bp) adopt writhed conformations. Writhe can be calculated by a Gauss integral performed along the path of the DNA central axis, and a new mathematical definition of this central-axis path was developed for a more exact writhe calculation. Finally, an atomistic representation of supercoiled 336 base pair minicircles was provided by fitting DNA structures obtained from explicitly solvated MD simulations into density maps from electron cryo-tomography. The structural data were analysed and provide a plausible explanation for the mechanism of the sequence-specific binding of the enzyme topoisomerase 1B to negatively supercoiled DNA.
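The Gauss-integral writhe calculation can be sketched as a discretized double sum over segment pairs, Wr ≈ (1/4π) Σ_{i≠j} ((t_i × t_j) · (r_i − r_j)) / |r_i − r_j|³ Δs_i Δs_j, with r the segment midpoints and t the unit tangents. The following minimal Python sketch uses that midpoint approximation (not the exact central-axis definition developed in the work above). A planar circle has writhe zero, which makes a convenient sanity check.

```python
import numpy as np

def writhe(points):
    """Approximate writhe of a closed polygonal curve given as an
    (N, 3) array of vertices, via the discretized Gauss double integral
    over segment midpoints and unit tangents.  Production codes use
    exact segment-pair formulas; this is the simplest approximation.
    """
    pts = np.asarray(points, dtype=float)
    seg = np.roll(pts, -1, axis=0) - pts          # segment vectors (closed)
    ds = np.linalg.norm(seg, axis=1)              # segment lengths
    t = seg / ds[:, None]                         # unit tangents
    mid = pts + 0.5 * seg                         # segment midpoints
    n = len(pts)
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            r = mid[i] - mid[j]
            d = np.linalg.norm(r)
            total += np.dot(np.cross(t[i], t[j]), r) / d**3 * ds[i] * ds[j]
    # Each unordered pair counted once, so divide by 2*pi instead of 4*pi.
    return total / (2.0 * np.pi)

# A planar circle: every triple product vanishes in-plane, so Wr = 0.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
circle = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)
print(round(writhe(circle), 6))  # -> 0.0
```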
Modelling the Extensionally Driven Transitions of DNA
Empirical measurements on DNA under tension show a jump by a factor of ≈ 1.5 − 1.7 in the relative extension at an applied force of ≈ 65 − 70 pN, indicating a structural transition. The still ambiguously characterised stretched ‘phase’ is known as S-DNA. Using atomistic and coarse-grained Monte Carlo simulations we study DNA over-stretching in the presence of the organic salts Ethidium Bromide (EtBr) and Arginine (an amino acid present in the RecA binding cleft). We present planar-stacked, triplet-disproportionated DNA as a solution phase of the double helix under tension, and dub it ‘Σ DNA’, with the three right-facing points of the Σ character serving as a mnemonic for the three grouped bases. Like unstretched Watson-Crick base-paired DNA structures, the structure of the Σ phase is linked to function: the partitioning of bases into codons of three base pairs each is the first stage of operation of recombinase enzymes such as RecA, facilitating alignment of homologous or near-homologous sequences for genetic exchange or repair. By showing that this process does not require any very sophisticated manipulation of the DNA, we position it as potentially appearing as an early step in the development of life, and correlate the postulated sequence of incorporation of amino acids (GADV, then GADVESPLIT, and then the full 20-residue set of canonical amino acids) into molecular biology with the ease of Σ-formation for sequences including the associated codons. To further investigate the dependence of stretching behaviour on the concentration of intercalating salt molecules, we present a physically motivated coarse-grained force-field for DNA under tension and use it to qualitatively reproduce regimes of force-extension behaviour which are not atomistically accessible.
Geometric modeling, simulation, and visualization methods for plasmid DNA molecules
Plasmid DNA molecules are a special type of DNA molecule used, among other applications, in DNA vaccination and gene therapy. In their natural state, these molecules present a closed-circular conformation and are supercoiled. The production of plasmid DNA using bacteria as hosts implies a purification step in which the plasmid DNA molecules are separated from the DNA of the host and other contaminants. This purification process, and all the physical and chemical variations involved, such as temperature changes, may affect the conformation of the plasmid DNA molecules by uncoiling or even opening them, which makes them useless for therapeutic applications. Because of that, researchers are always searching for new purification techniques that maximize the amount of supercoiled plasmid DNA produced. Computer simulations and 3D visualization of plasmid DNA can bring many advantages because they allow researchers to actually see what can happen to the molecules under certain conditions. In this sense, it was necessary to develop reliable and accurate geometric models specific to plasmid DNA simulations.

This dissertation presents a new assembly algorithm for B-DNA specifically developed for plasmid DNA assembly. The algorithm is completely adaptive in the sense that it allows researchers to assemble any plasmid DNA base-pair sequence along any arbitrary conformation that fits the length of the plasmid DNA molecule. This is especially suitable for plasmid DNA simulations, where conformations are generated by simulation procedures and the given base-pair sequence must be assembled over each conformation, something that cannot be done by conventional predictive DNA assembly methods. Unlike traditional molecular visualization methods, which are based on the atomic structure, this assembly algorithm uses color-coded 3D molecular surfaces of the nucleotides as the building blocks for DNA assembly. This approach not only reduces the number of graphical objects, and consequently makes rendering faster, but also makes it easier to visually identify the nucleotides in the DNA strands.

The algorithm used to triangulate the molecular surfaces of the nucleotide building blocks is also a novelty presented as part of this dissertation. This new triangulation algorithm for Gaussian molecular surfaces introduces a mechanism that divides the atomic structure of molecules into boxes and spheres. This space-division method is faster because it confines the local calculation of the molecular surface to a specific region of influence of the atomic structure, ignoring atoms that do not influence the triangulation of the molecular surface in that region. The method also guarantees the continuity of the molecular surface.

Since the aim of this dissertation is to present a complete set of methods for plasmid DNA visualization and simulation, a new deformation algorithm for plasmid DNA Monte Carlo simulations is also proposed. This deformation algorithm uses a 3D polyline to represent the plasmid DNA conformation and performs small deformations on that polyline, keeping the segment lengths and connectivity. Experiments were performed to compare this new deformation method with the deformation methods traditionally used in Monte Carlo plasmid DNA simulations. These experiments showed that the new method is more efficient, in the sense that its trial acceptance ratio is higher and it converges sooner and faster to the elastic energy equilibrium state of the plasmid DNA molecule. In sum, this dissertation presents an end-to-end set of models and algorithms for plasmid DNA geometric modelling, visualization and simulation.
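One standard way to deform a polyline while preserving segment lengths and connectivity, in the spirit of the deformation algorithm described above, is a crankshaft move: rotate the vertices between two chosen anchor points about the axis through those anchors. The sketch below is illustrative only (the dissertation's actual move set may differ), but it demonstrates the length-preservation property such moves rely on.

```python
import numpy as np

def crankshaft(points, i, j, angle):
    """Rotate the vertices strictly between indices i and j about the
    axis through points[i] and points[j] (Rodrigues' rotation formula).
    Because both anchors lie on the rotation axis, every segment length
    and the chain connectivity are preserved exactly, keeping the
    polyline a valid inextensible model of the plasmid backbone.
    """
    pts = np.asarray(points, dtype=float).copy()
    axis = pts[j] - pts[i]
    axis /= np.linalg.norm(axis)
    c, s = np.cos(angle), np.sin(angle)
    for k in range(i + 1, j):
        v = pts[k] - pts[i]
        pts[k] = pts[i] + (v * c
                           + np.cross(axis, v) * s
                           + axis * np.dot(axis, v) * (1.0 - c))
    return pts

def segment_lengths(pts):
    """Lengths of all segments of a closed polyline."""
    return np.linalg.norm(np.roll(pts, -1, axis=0) - pts, axis=1)

# A closed 40-vertex ring, deformed between vertices 5 and 15.
theta = np.linspace(0.0, 2.0 * np.pi, 40, endpoint=False)
ring = np.stack([np.cos(theta), np.sin(theta), np.zeros_like(theta)], axis=1)
moved = crankshaft(ring, 5, 15, 0.3)
print(np.allclose(segment_lengths(ring), segment_lengths(moved)))  # -> True
```

In a Monte Carlo loop, each such trial move would be accepted or rejected against the elastic energy of the new conformation (e.g. with a Metropolis criterion).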
Comparison of existing aneurysm models and their path forward
The two most important aneurysm types are cerebral aneurysms (CA) and abdominal aortic aneurysms (AAA), together accounting for over 80% of all fatal aneurysm incidences. To minimise aneurysm-related deaths, clinicians require tools to accurately estimate rupture risk. For both aneurysm types, the current state-of-the-art tools to evaluate rupture risk are identified and evaluated in terms of clinical applicability. We perform a comprehensive literature review using the Web of Science database. Identified records (3127) are clustered by modelling approach and aneurysm location in a meta-analysis to quantify scientific relevance and to extract modelling patterns, and are further assessed according to PRISMA guidelines (179 full-text screens). Besides general differences and similarities of CA and AAA, we identify and systematically evaluate four major modelling approaches to aneurysm rupture risk: finite element analysis and computational fluid dynamics as deterministic approaches, and machine learning, and assessment tools and dimensionless parameters as stochastic approaches. The latter score highest in the evaluation of their potential as clinical applications for rupture prediction, due to their readiness level and user friendliness. Deterministic approaches are less likely to be applied in a clinical environment because of their high model complexity. Because deterministic approaches consider the underlying mechanisms of aneurysm rupture, however, they have an improved capability to account for unusual patient-specific characteristics compared to stochastic approaches. We show that increased interdisciplinary exchange between specialists can boost comprehension of this disease and help design tools for a clinical environment. By combining deterministic and stochastic models, the advantages of both approaches can improve accessibility for clinicians and prediction quality for rupture risk.
Combining Linguistic and Machine Learning Techniques for Word Alignment Improvement
Alignment of words, i.e., detection of corresponding units between two sentences that are translations of each other, has been shown to be crucial for the success of many NLP applications such as statistical machine translation (MT), construction of bilingual lexicons, word-sense disambiguation, and projection of resources between languages. With the availability of large parallel texts, statistical word alignment systems have proven to be quite successful on many language pairs. However, these systems are still faced with several challenges due to the complexity of the word alignment problem, lack of enough training data, difficulty learning statistics correctly, translation divergences, and lack of a means for incremental incorporation of linguistic knowledge.
This thesis presents two new frameworks to improve existing word alignments using supervised learning techniques. In the first framework, two rule-based approaches are introduced. The first approach, Divergence Unraveling for Statistical MT (DUSTer), specifically targets translation divergences and corrects the alignment links related to them using a set of manually-crafted, linguistically-motivated rules. In the second approach, Alignment Link Projection (ALP), the rules are generated automatically by adapting transformation-based error-driven learning to the word alignment problem. By conditioning the rules on initial alignment and linguistic properties of the words, ALP manages to categorize the errors of the initial system and correct them.
The second framework, Multi-Align, is an alignment combination framework based on classifier ensembles. The thesis presents a neural-network based implementation of Multi-Align, called NeurAlign. By treating individual alignments as classifiers, NeurAlign builds an additional model to learn how to combine the input alignments effectively.
The evaluations show that the proposed techniques yield significant improvements (up to 40% relative error reduction) over existing word alignment systems on four different language pairs, even with limited manually annotated data. Moreover, all three systems allow an easy integration of linguistic knowledge into statistical models without the need for large modifications to existing systems. Finally, the improvements are analyzed using various measures, including the impact of improved word alignments in an external application: phrase-based MT.
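The ensemble idea behind Multi-Align can be illustrated with a far simpler stand-in than NeurAlign: treat each aligner's output as a set of (source, target) link pairs and keep every link that a sufficient fraction of the aligners vote for. The names, data, and threshold below are purely illustrative; NeurAlign itself learns the combination with a neural network conditioned on linguistic features rather than using fixed voting.

```python
from collections import Counter

def combine_alignments(alignments, threshold=0.5):
    """Majority-vote combination of word alignments.

    Each alignment is a set of (source_index, target_index) links
    produced by one aligner; a link is kept if at least `threshold`
    of the aligners propose it.  A toy stand-in for learned combiners,
    which can additionally weight aligners by word context.
    """
    votes = Counter(link for a in alignments for link in a)
    need = threshold * len(alignments)
    return {link for link, n in votes.items() if n >= need}

# Three hypothetical aligners that disagree on a few links.
a1 = {(0, 0), (1, 1), (2, 2)}
a2 = {(0, 0), (1, 2), (2, 2)}
a3 = {(0, 0), (1, 1), (2, 3)}
print(sorted(combine_alignments([a1, a2, a3])))  # -> [(0, 0), (1, 1), (2, 2)]
```

Raising the threshold trades recall for precision: with `threshold=1.0` only the unanimous link (0, 0) survives.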
Computational Geometric and Algebraic Topology
Computational topology is a young, emerging field of mathematics that seeks out practical algorithmic methods for solving complex and fundamental problems in geometry and topology. It draws on a wide variety of techniques from across pure mathematics (including topology, differential geometry, combinatorics, algebra, and discrete geometry), as well as applied mathematics and theoretical computer science. In turn, solutions to these problems have a wide-ranging impact: already they have enabled significant progress in the core area of geometric topology, introduced new methods in applied mathematics, and yielded new insights into the role that topology has to play in fundamental problems surrounding computational complexity.
At least three significant branches have emerged in computational topology: algorithmic 3-manifold and knot theory, persistent homology, and surface and graph embeddings. These branches have emerged largely independently; however, it is clear that they have much to offer each other. The goal of this workshop was to take the first significant step to bring these three areas together, to share ideas in depth, and to pool our expertise in approaching some of the major open problems in the field.