The ability of machines to recognise and interpret human action and gesture from standard video footage has wide-ranging applications for control, analysis and security. However, in many scenarios the use of commercial motion capture systems is undesirable or infeasible (e.g. intelligent surveillance). In particular, commercial systems are restricted by their dependence on markers and on multiple cameras that must be synchronised and calibrated by hand. The aim of this thesis is to develop methods that relax these constraints in order to bring inexpensive, off-the-shelf motion capture several steps closer to reality. In doing so, we demonstrate that image projections of important anatomical landmarks on the body (specifically, joint centre projections) can be recovered automatically from image data. One approach exploits geometric methods developed in the field of Structure From Motion (SFM), whereby point features on the surface of an articulated body impose constraints on the hidden joint locations, even for a single view. An alternative approach explores Machine Learning to employ context-specific
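
To make the geometric idea concrete, the following is a minimal sketch of one way tracked surface points can constrain a hidden joint centre, assuming an affine camera model and at least four non-coplanar tracked points on each rigid segment; the notation is illustrative rather than taken from the thesis itself.

\[
\mathbf{X}_{J,t} \;=\; \sum_i a_i\,\mathbf{X}^{(1)}_{i,t} \;=\; \sum_k b_k\,\mathbf{X}^{(2)}_{k,t},
\qquad \sum_i a_i \;=\; \sum_k b_k \;=\; 1
\]
% The joint centre is rigid with respect to both adjoining segments, so in every
% frame t it is the same fixed affine combination of each segment's 3D surface points.

\[
\mathbf{x}_{J,t} \;=\; \sum_i a_i\,\mathbf{x}^{(1)}_{i,t} \;=\; \sum_k b_k\,\mathbf{x}^{(2)}_{k,t}
\]
% An affine camera preserves affine combinations, so the same weights relate the
% observed 2D point tracks x to the unobserved joint projection x_J. Stacking
% sum_i a_i x^(1)_{i,t} - sum_k b_k x^(2)_{k,t} = 0 over all frames, together with
% the unit-sum conditions, gives a linear system in the weights (a, b) that can be
% solved by least squares, after which the joint projection is recovered as the
% weighted combination of the tracks.

Under these assumptions the problem reduces to linear algebra on measured tracks from a single view, which is consistent with the claim that no markers or multi-camera calibration are required.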