
    APES: Approximate Posterior Ensemble Sampler

    This paper proposes a novel approach to generating samples from target distributions that are difficult to sample from with Markov Chain Monte Carlo (MCMC) methods. Traditional MCMC algorithms often converge slowly because it is difficult to find proposals suited to the problem at hand. To address this issue, the paper introduces the Approximate Posterior Ensemble Sampler (APES), which employs kernel density estimation and radial basis interpolation to build an adaptive proposal, leading to fast convergence of the chains. The algorithm's scalability to higher dimensions makes it a practical solution for complex problems. The method constructs an approximate posterior that closely matches the target distribution and is easy to sample from, resulting in smaller autocorrelation times and a higher acceptance probability for the chain. We compare the performance of the APES algorithm with that of the affine-invariant ensemble sampler using the stretch move in various contexts, demonstrating the efficiency of the proposed method: on the Rosenbrock function, for instance, APES achieved an autocorrelation time 140 times smaller than the affine-invariant ensemble sampler. With new cosmological surveys set to contend with many new systematics, which will require many new nuisance parameters in the models, this method offers a practical solution for the upcoming era of cosmological analyses. Comment: 15 pages, 6 figures, 7 tables.
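    The core idea, an ensemble whose proposal is a density estimate built from a complementary group of walkers, can be illustrated with a minimal sketch. This is not the authors' implementation: it uses a plain Gaussian KDE rather than the paper's KDE plus radial basis interpolation, and all function and variable names are illustrative.

```python
# Minimal sketch of a KDE-based independence proposal inside an ensemble
# sampler (illustrative only; not the APES reference implementation).
import numpy as np
from scipy.stats import gaussian_kde

def kde_ensemble_step(walkers, log_target, rng):
    """Update the second half of `walkers` with proposals drawn from a KDE
    fitted to the first half, accepted via a Metropolis-Hastings ratio."""
    n, d = walkers.shape
    half = n // 2
    fixed, moving = walkers[:half], walkers[half:]

    # Approximate posterior estimated from the complementary group.
    kde = gaussian_kde(fixed.T)

    # Independent proposals drawn from the approximation.
    proposals = kde.resample(moving.shape[0], seed=rng).T

    # MH acceptance with the proposal-density correction.
    log_alpha = (log_target(proposals) - log_target(moving)
                 + kde.logpdf(moving.T) - kde.logpdf(proposals.T))
    accept = np.log(rng.uniform(size=moving.shape[0])) < log_alpha
    moving[accept] = proposals[accept]
    return walkers

# Toy usage: sample a 2-D standard Gaussian target.
rng = np.random.default_rng(0)
log_target = lambda x: -0.5 * np.sum(x**2, axis=-1)
walkers = rng.normal(size=(64, 2))
for _ in range(500):
    walkers = kde_ensemble_step(walkers, log_target, rng)
    walkers = np.roll(walkers, 32, axis=0)  # alternate which half moves
```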

    Visualisation of bioinformatics datasets

    Analysing the molecular polymorphism and interactions of DNA, RNA and proteins is of fundamental importance in biology. Predicting the functions of polymorphic molecules is important for designing more effective medicines. Analysing major histocompatibility complex (MHC) polymorphism is important for mate choice, epitope-based vaccine design and transplantation rejection, among other applications. Most existing exploratory approaches cannot analyse these datasets because of the large number of molecules and the high number of descriptors per molecule. This thesis develops novel methods for data projection that explore high-dimensional biological datasets by visualising them in a low-dimensional space. With increasing dimensionality, some existing data visualisation methods such as generative topographic mapping (GTM) become computationally intractable. We propose variants of these methods that use log-transformations at certain steps of the expectation maximisation (EM) based parameter learning process to make them tractable for high-dimensional datasets. We demonstrate these variants on both synthetic data and an electrostatic potential dataset of MHC class-I. We also extend a latent trait model (LTM), suitable for visualising high-dimensional discrete data, to estimate feature saliency as an integrated part of the parameter learning process of the visualisation model. This LTM variant not only gives better visualisations by adapting the projection to feature relevance, but also helps users assess the significance of each feature. Another problem that has received little attention in the literature is the visualisation of mixed-type data. We combine GTM and LTM in a principled way, using an appropriate noise model for each data type, to visualise mixed-type data in a single plot; we call this model the generalised GTM (GGTM). We further extend GGTM to estimate feature saliencies while training the visualisation model, giving GGTM with feature saliency (GGTM-FS). We demonstrate the effectiveness of these models on both synthetic and real datasets. We evaluate visualisation quality using a distance distortion measure and rank-based measures: trustworthiness, continuity, and mean relative rank errors with respect to data space and latent space. Where labels are known, we also use KL divergence and nearest-neighbour classification error to quantify the separation between classes. We demonstrate the efficacy of these models on both synthetic and real biological datasets, with a main focus on the MHC class-I dataset.
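    The log-transformation mentioned above is, in essence, a way of keeping the EM responsibilities numerically stable when Gaussian likelihoods in many dimensions would otherwise underflow. The sketch below illustrates that general technique with a log-sum-exp normalised E-step for a GTM-like mixture; variable names and the exact update are illustrative, not the thesis' code.

```python
# Illustrative log-domain E-step for a GTM-style Gaussian mixture.
import numpy as np
from scipy.special import logsumexp

def log_responsibilities(X, centres, beta):
    """X: (N, D) data; centres: (K, D) mixture centres mapped from the latent
    grid; beta: inverse noise variance. Returns log responsibilities (N, K)."""
    D = X.shape[1]
    # Squared distances between every data point and every centre.
    sq = ((X[:, None, :] - centres[None, :, :]) ** 2).sum(-1)        # (N, K)
    log_p = 0.5 * D * np.log(beta / (2 * np.pi)) - 0.5 * beta * sq   # log N(x | centre)
    # Normalise across centres in the log domain instead of exponentiating first.
    return log_p - logsumexp(log_p, axis=1, keepdims=True)

# Toy usage: responsibilities for random high-dimensional data.
rng = np.random.default_rng(0)
X, centres = rng.normal(size=(200, 500)), rng.normal(size=(25, 500))
R = np.exp(log_responsibilities(X, centres, beta=1.0))  # exponentiate only at the end
```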

    The discovery of new functional oxides using combinatorial techniques and advanced data mining algorithms

    Electroceramic materials research is a wide-ranging field driven by device applications. For many years, the demand for new materials was addressed largely through serial processing and analysis of samples, often similar in composition to those already characterised. The Functional Oxide Discovery project (FOXD) is a combinatorial materials discovery project combining high-throughput synthesis and characterisation with advanced data mining to develop novel materials. Dielectric ceramics are of interest for use in telecommunications equipment; oxygen ion conductors are examined for use in fuel cell cathodes. Both applications are subject to ever-increasing industry demands, and materials designs capable of meeting the stringent requirements are urgently needed. The London University Search Instrument (LUSI) is a combinatorial robot employed for materials synthesis. Ceramic samples are produced automatically using an ink-jet printer which mixes and prints inks onto alumina slides. The slides are transferred to a furnace for sintering and transported to other locations for analysis. Production and analysis data are stored in the project database, which forms a valuable resource detailing the progress of the project and providing a basis for data mining. Materials design is a two-stage process. The first stage, forward prediction, is accomplished using an artificial neural network, a Baconian, inductive technique. In the second stage, the artificial neural network is inverted using a genetic algorithm; the network's prediction, the stoichiometry and the prediction reliability form the objectives for the genetic algorithm, which yields a selection of materials designs. The full potential of this approach is realised through the manufacture and characterisation of the designed materials. The resulting data improve the prediction algorithms, permitting iterative improvement of the designs and the discovery of completely new materials.
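    The inversion step can be pictured as a simple search loop: a genetic algorithm mutates candidate compositions and keeps those the trained forward model scores highly. The sketch below uses a stand-in scoring function in place of the FOXD neural network and a single objective rather than the project's multi-objective formulation; all names are illustrative.

```python
# Illustrative genetic-algorithm "inversion" of a forward property predictor.
import numpy as np

rng = np.random.default_rng(1)

def predicted_property(compositions):
    # Stand-in for the trained ANN forward prediction (assumed, not FOXD's model).
    return -np.sum((compositions - 0.25) ** 2, axis=1)

def normalise(pop):
    # Keep each candidate a valid stoichiometry: non-negative, summing to 1.
    pop = np.clip(pop, 1e-6, None)
    return pop / pop.sum(axis=1, keepdims=True)

pop = normalise(rng.random((100, 4)))            # 100 candidates, 4 components
for generation in range(200):
    fitness = predicted_property(pop)
    parents = pop[np.argsort(fitness)[-50:]]      # keep the better half
    children = parents[rng.integers(0, 50, 50)]   # clone parents
    children += rng.normal(scale=0.02, size=children.shape)  # mutate
    pop = normalise(np.vstack([parents, children]))

best_design = pop[np.argmax(predicted_property(pop))]
```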

    Machine Learning for Hand Gesture Classification from Surface Electromyography Signals

    Classifying hand gestures from surface electromyography (sEMG) is a process with applications in human-machine interaction, rehabilitation and prosthetic control. Reductions in the cost and increases in the availability of the necessary hardware over recent years have made sEMG a more viable solution for hand gesture classification. The research challenge is the development of processes to robustly and accurately predict the current gesture from incoming sEMG data. This thesis presents a set of methods, techniques and designs that improve both the evaluation of, and the performance on, the classification problem as a whole, and together set a new baseline for achievable classification performance. Evaluation is improved through careful choice of metrics and the design of cross-validation techniques that account for data bias caused by common experimental protocols. A landmark study is re-evaluated with these improved techniques, and it is shown that data augmentation can significantly improve performance with conventional classification methods. A novel neural network architecture and supporting improvements are presented that further improve performance; the network is then refined so that it achieves similar performance with far fewer parameters than competing designs. Supporting techniques such as subject adaptation and smoothing algorithms are then explored to improve overall performance and to provide more nuanced trade-offs between aspects of performance such as incurred latency and prediction smoothness. A new study is presented comparing the performance potential of medical-grade electrodes and a low-cost commercial alternative, showing that for a modest-sized gesture set the latter can compete. The data are also used to explore data labelling in experimental design and to evaluate the numerous aspects of performance that must be traded off.
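    One common source of the evaluation bias mentioned above is letting windows from the same recording appear in both training and test folds. A minimal sketch of a bias-aware split is shown below; it groups folds by repetition (or subject) identifier. The data, classifier choice and array names are illustrative stand-ins, not the thesis' evaluation pipeline.

```python
# Illustrative grouped cross-validation for windowed sEMG features.
import numpy as np
from sklearn.model_selection import GroupKFold
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import balanced_accuracy_score

# X: windowed sEMG features, y: gesture labels, groups: repetition/subject id.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 32))
y = rng.integers(0, 6, size=600)
groups = np.repeat(np.arange(10), 60)

scores = []
for train_idx, test_idx in GroupKFold(n_splits=5).split(X, y, groups):
    clf = RandomForestClassifier(n_estimators=100, random_state=0)
    clf.fit(X[train_idx], y[train_idx])
    scores.append(balanced_accuracy_score(y[test_idx], clf.predict(X[test_idx])))

print(f"balanced accuracy: {np.mean(scores):.3f} +/- {np.std(scores):.3f}")
```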

    Challenges in biomedical data science: data-driven solutions to clinical questions

    Data are influencing every aspect of our lives, from our work activities to our spare time and even our health. Medical diagnosis and treatment, in particular, are often supported by quantitative measures and observations such as laboratory tests, medical imaging or genetic analysis. In medicine, as well as in several other scientific domains, the amount of data involved in each decision-making process has become overwhelming: the complexity of the phenomena under investigation and the scale of modern data collections have long surpassed what human analysis and insight alone can handle.

    Learning-based methods for planning and control of humanoid robots

    Humans and robots are increasingly likely to coexist. The anthropomorphic nature of humanoid robots facilitates physical human-robot interaction and makes social human-robot interaction more natural. Moreover, it makes humanoids ideal candidates for many applications involving tasks and environments designed for humans. Whatever the application, a ubiquitous requirement for a humanoid is to possess proper locomotion skills. Despite long-standing research, humanoid locomotion is still far from being a trivial task. A common approach is to decompose its complexity by means of a model-based hierarchical control architecture, in which simplified models of the humanoid are employed in some of the architectural layers to cope with computational constraints. At the same time, the redundancy of the humanoid with respect to the locomotion task, as well as the closeness of that task to human locomotion, suggests a data-driven approach that learns it directly from experience. This thesis investigates the application of learning-based techniques to the planning and control of humanoid locomotion. In particular, both deep reinforcement learning and deep supervised learning are used to address humanoid locomotion tasks of increasing complexity. First, we employ deep reinforcement learning to study the spontaneous emergence of balancing and push-recovery strategies, which are essential prerequisites for more complex locomotion tasks. Then, using motion capture data collected from human subjects, we employ deep supervised learning to shape the robot walking trajectories towards improved human-likeness. The proposed approaches are validated on real and simulated humanoid robots, specifically on two versions of the iCub humanoid: iCub v2.7 and iCub v3.
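    The supervised-learning stage can be pictured, at its simplest, as regression from a gait state to a target posture learned from motion-capture pairs. The sketch below uses random tensors as a stand-in for mocap-derived training data, and the network sizes and names are purely illustrative; it is not the thesis' trajectory-shaping architecture.

```python
# Illustrative supervised regression from gait state to next joint posture.
import torch
from torch import nn

# Stand-in for mocap-derived training pairs: current gait state -> next posture.
states = torch.randn(2048, 12)      # e.g. base velocity, phase, contact flags
postures = torch.randn(2048, 23)    # e.g. target joint positions

model = nn.Sequential(nn.Linear(12, 128), nn.ReLU(),
                      nn.Linear(128, 128), nn.ReLU(),
                      nn.Linear(128, 23))
optimiser = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimiser.zero_grad()
    loss = loss_fn(model(states), postures)
    loss.backward()
    optimiser.step()
```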

    Gaussian Process in Computational Biology: Covariance Functions for Transcriptomics

    In machine learning, Gaussian process models are a widely used family of stochastic processes for modelling data observed over time, space or both. Gaussian process models are nonparametric, meaning that they are defined on an infinite-dimensional parameter space, which is typically treated as the set of all possible solutions for a given learning problem. A Gaussian process is a distribution over functions, and its covariance function determines the properties of the function samples drawn from the process. Once the decision to model with a Gaussian process has been made, the choice of covariance function is a central step in modelling. In molecular biology and genetics, a transcription factor is a protein that binds to specific DNA sequences and controls the flow of genetic information from DNA to mRNA. To develop models of cellular processes, quantitative estimation of the regulatory relationship between transcription factors and genes is a basic requirement. This estimation is complicated for various reasons: many transcription factors' activities, and their own transcription levels, are post-transcriptionally modified, and the transcription factors' expression levels are often low and noisy. It is therefore useful to infer the activity of the transcription factors from the expression levels of their target genes. Here we develop a Gaussian-process-based nonparametric regression model to infer the exact transcription factor activities from a combination of mRNA expression levels and DNA-protein binding measurements. Clustering of gene expression time series gives insight into which genes may be co-regulated, allowing us to discern the activity of pathways in a given microarray experiment. Of particular interest is how a given group of genes varies across conditions or genetic backgrounds. In this thesis, we develop a new clustering method that allows each cluster to be parametrised according to the behaviour of its genes across conditions, whether correlated or anti-correlated. By specifying the correlation between such genes, we gain more information within the cluster about how the genes interrelate. Our study shows the effectiveness of sharing information between replicates and different model conditions when modelling gene expression time series.
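    To make the role of the covariance function concrete, the sketch below shows textbook Gaussian process regression with a squared-exponential kernel applied to a noisy time series. This is generic GP code under assumed hyperparameters, not the covariance functions developed in the thesis for transcriptomics.

```python
# Generic GP regression with a squared-exponential covariance function.
import numpy as np

def rbf_kernel(x1, x2, lengthscale=1.0, variance=1.0):
    # k(x, x') = sigma^2 * exp(-(x - x')^2 / (2 * l^2))
    d = x1[:, None] - x2[None, :]
    return variance * np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_test, noise=0.1):
    K = rbf_kernel(x_train, x_train) + noise**2 * np.eye(len(x_train))
    K_s = rbf_kernel(x_train, x_test)
    K_ss = rbf_kernel(x_test, x_test)
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                 # posterior mean at test points
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v                 # posterior covariance at test points
    return mean, cov

# Toy usage: smooth a noisy expression-like time series.
t = np.linspace(0, 10, 15)
y = np.sin(t) + 0.1 * np.random.default_rng(0).normal(size=t.size)
mu, cov = gp_posterior(t, y, np.linspace(0, 10, 100))
```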