A Surrogate Function for One-Dimensional Phylogenetic Likelihoods
© The Author 2017. Published by Oxford University Press on behalf of the Society for Molecular Biology and Evolution. All rights reserved. For permissions, please e-mail: [email protected]. Phylogenetics has seen a steady increase in data set size and substitution model complexity, both of which require increasing amounts of computational power to compute likelihoods. This motivates strategies to approximate the likelihood functions used for branch length optimization and Bayesian sampling. In this article, we develop an approximation to the 1D likelihood function as parametrized by a single branch length. Our method uses a four-parameter surrogate function abstracted from the simplest phylogenetic likelihood function, the binary symmetric model. We show that it offers a surrogate that can be fit over a variety of branch lengths, that it is applicable to a wide variety of models and trees, and that it can be used effectively as a proposal mechanism for Bayesian sampling. The method is implemented as a stand-alone open-source C library for calling from phylogenetics algorithms; it has proven essential for good performance of our online phylogenetic algorithm sts.
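As a rough illustration of what a surrogate "abstracted from the binary symmetric model" might look like, the sketch below evaluates a log-likelihood curve of that shape as a function of branch length. The abstract does not state the functional form, so the parametrization here is an assumption: the names `c` (weight on unchanged sites), `m` (weight on changed sites), `r` (rate) and `b` (branch-length offset) are hypothetical, and the code is not taken from the authors' C library.

```python
import math


def surrogate_loglik(t, c, m, r, b=0.0):
    """Binary-symmetric-style log-likelihood at branch length t.

    Assumed form (hypothetical parameter names):
        c * log((1 + x) / 2) + m * log((1 - x) / 2),  x = exp(-r * (t + b))
    Requires t + b > 0 so that both logarithms are defined.
    """
    x = math.exp(-r * (t + b))
    return c * math.log((1.0 + x) / 2.0) + m * math.log((1.0 - x) / 2.0)


def surrogate_argmax(c, m, r, b=0.0):
    """Branch length maximizing the assumed surrogate, clamped to t >= 0.

    Setting the derivative to zero gives exp(-r * (t + b)) = (c - m) / (c + m),
    which has a finite solution only when c > m > 0.
    """
    if not (c > m > 0):
        raise ValueError("a finite interior maximum needs c > m > 0")
    t_star = -math.log((c - m) / (c + m)) / r - b
    return max(0.0, t_star)


# Example with made-up values: 90 unchanged sites, 10 changed sites.
t_hat = surrogate_argmax(c=90.0, m=10.0, r=2.0)
peak = surrogate_loglik(t_hat, c=90.0, m=10.0, r=2.0)
```

Because the maximizer has a closed form under this assumed shape, a fitted curve of this kind is cheap to optimize or sample from, which is consistent with the abstract's description of using it for branch length optimization and as a proposal mechanism.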
Computational analysis of genetic variation
High-throughput sequencers are generating increasingly detailed catalogues of genetic variation both in human disease and within the larger population. To effectively utilise this rich data set for maximum research benefit, as a discipline we require robust, flexible, and reproducible analysis pipelines capable of accurately detecting and prioritising variants. While data-specific computational algorithms aimed at deriving accurate data from these technologies have reached maturity, two major challenges remain in order to realise the goals of elucidating the underlying genetic causes of disease as a means of developing custom treatment options. The first challenge is the creation of high-throughput variant detection pipelines able to reliably detect sample variation from a variety of sequence data types. Such a system needs to be scalable, flexible, robust, highly automated, and able to support reproducible analyses across both default and custom variant detection workflows. The second challenge is the effective prioritisation of the huge number of variants detected in each sample, a task required to reduce the large search space for causal variants down to variant lists suitable for manual interrogation. This thesis presents six publications describing components of the larger informatics framework I have developed over the last four years to address these challenges, a framework designed from the outset to effectively manage and process large data sets, with the end goal of utilising computational analysis of sequence data to further understand the relationship between genetic variation and human disease. The first publication, “Reliably detecting clinically important variants requires both combined variant calls and optimized filtering strategies”, describes a variant detection strategy designed to minimize false negative variants, as is desired when utilising patient variation data in the clinic. The next four publications describe custom workflows developed for detecting variants in sequence data from different sample types, namely paired cancer samples (“Tumour procurement, DNA extraction, coverage analysis and optimisation of mutation-calling algorithms for human melanoma genomes”), pedigrees (“Reducing the search space for causal genetic variants with VASP: Variant Analysis of Sequenced Pedigrees”), mixed cell populations containing ultra-rare mutations (“DeepSNVMiner: A sequence analysis tool to detect emergent, rare mutations in sub-sets of cell populations”) and mouse exome data containing ENU mutations (“Massively parallel sequencing of the mouse exome to accurately identify rare, induced mutations: an immediate source for thousands of new mouse models”). The last publication, “Comparison of predicted and actual consequences of missense mutations”, focuses on the validation of computational tools that predict the functional impact of missense mutations, and further attempts to explain why many missense mutations predicted to be damaging do not result in an observable phenotype as might be expected. Collectively, these publications detail efforts to reliably detect and prioritise variants across a wide variety of data types, efforts all based around the significant underlying software framework I have developed to better elucidate the link between genetic variation and disease.