This thesis describes the development and application of machine learning-based
methods for the prediction of alpha-helical transmembrane protein
structure from sequence alone. It is divided into six chapters.
Chapter 1 provides an introduction to membrane structure and dynamics,
membrane protein classes and families, and membrane protein structure prediction.
Chapter 2 describes a topological study of the transmembrane protein
CLN3 using a consensus of bioinformatic approaches constrained by experimental
data. Mutations in CLN3 can cause juvenile neuronal ceroid
lipofuscinosis, or Batten disease, an inherited neurodegenerative lysosomal
storage disease affecting children; such studies are therefore important
for directing further experimental work into this incurable illness.
Chapter 3 explores the possibility of using biologically meaningful signatures
described as regular expressions to influence the assignment of inside
and outside loop locations during transmembrane topology prediction. Using
this approach, it was possible to modify a recent topology prediction method,
yielding a 6% improvement in prediction accuracy on a standard data set.
Chapter 4 describes the development of a novel support vector machine-based
topology predictor that integrates both signal peptide and re-entrant helix prediction,
benchmarked with full cross-validation on a novel data set of sequences with
known crystal structures. The method achieves state-of-the-art performance in predicting
topology and discriminating between globular and transmembrane proteins.
We also present the results of applying these tools to a number of complete genomes.
Chapter 5 describes a novel approach to predicting lipid exposure, residue
contacts, helix-helix interactions and finally the optimal helical packing
arrangement of transmembrane proteins. It is based on two support vector
machine classifiers that predict per-residue lipid exposure and residue contacts,
which are used to determine helix-helix interaction with up to 65%
accuracy. The method is also able to discriminate native from decoy helical
packing arrangements with up to 70% accuracy. Finally, a force-directed
algorithm is employed to construct the optimal helical packing arrangement;
this approach demonstrates success for proteins containing up to 13
transmembrane helices.
The final chapter summarises the major contributions of this thesis to biology
and discusses future perspectives for transmembrane protein structure prediction.