
    Statistical Analysis of Spherical Data: Clustering, Feature Selection and Applications

    In light of interdisciplinary applications, the data to be studied and analyzed have grown in volume and changed in their intrinsic structure and type. In practice, the diversity of resources generating these data has imposed several challenges on decision makers in identifying informative data with respect to time, model capability, scalability, and knowledge discovery. It is therefore highly desirable to extract patterns of interest that support data management decisions. Clustering, among other machine learning approaches, is an important data engineering technique that enables the automatic discovery of clusters of similar objects and the subsequent assignment of new, unseen objects to appropriate clusters. In this context, the majority of current research does not fully address the true structure and nature of the data for the particular application at hand. In contrast to most previous work, ours focuses on the modeling and classification of spherical data, which arise naturally in many data mining and knowledge discovery applications. In this thesis we therefore propose several estimation and feature selection frameworks based on the Langevin distribution, devoted to spherical patterns in offline and online settings. We first formulate a unified probabilistic framework in which we build probabilistic kernels, based on the Fisher score and information divergences, from finite Langevin mixtures for Support Vector Machines. We are motivated by the observation that blending generative and discriminative approaches succeeds by exploiting the distinct characteristics of each to construct a complementary system combining the best of both.
Given the high demand for compact and accurate statistical models that adjust automatically to dynamic changes, we next propose probabilistic frameworks for high-dimensional spherical data modeling based on finite Langevin mixtures that allow simultaneous clustering and feature selection in offline and online settings. To this end, we adopt finite mixture models, which have long relied heavily on deterministic learning approaches such as maximum likelihood estimation. Despite their successful use across a wide spectrum of areas, these approaches have several drawbacks, as we discuss in this thesis. An alternative is Bayesian inference, which naturally addresses data uncertainty while ensuring good generalization. We therefore also propose a Bayesian approach for finite Langevin mixture model estimation and selection. When data change dynamically and grow drastically, a finite mixture is not always a feasible solution. In contrast to the previous approaches, which assume an unknown but finite number of mixture components, we finally propose a nonparametric Bayesian approach that assumes an infinite number of components. We further enhance this model by simultaneously detecting informative features during clustering. Through extensive empirical experiments, we demonstrate the merits of the proposed learning frameworks on diverse high-dimensional datasets and challenging real-world applications.
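
The Langevin distribution on the unit hypersphere is also known as the von Mises-Fisher distribution, and a hard-assignment analogue of fitting a finite mixture of such distributions is spherical k-means. The sketch below is a generic illustration of that idea, not the thesis's actual framework; the function name and toy data are hypothetical:

```python
import numpy as np

def spherical_kmeans(X, k, iters=25):
    """Hard-assignment clustering of directions on the unit sphere --
    the k-means analogue of a von Mises-Fisher (Langevin) mixture with
    a shared, large concentration parameter."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    # greedy farthest-point initialisation keeps the sketch deterministic
    centers = [X[0]]
    for _ in range(1, k):
        sims = np.max(np.stack([X @ c for c in centers]), axis=0)
        centers.append(X[int(np.argmin(sims))])
    centers = np.array(centers)
    for _ in range(iters):
        labels = np.argmax(X @ centers.T, axis=1)   # assign to nearest mean direction
        for j in range(k):
            members = X[labels == j]
            if len(members):
                m = members.sum(axis=0)
                centers[j] = m / np.linalg.norm(m)  # renormalised mean direction
    return labels, centers

# illustrative toy data: two tight bundles of directions (hypothetical)
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal([5.0, 0.0, 0.0], 0.4, (30, 3)),
                 rng.normal([-5.0, 0.0, 0.0], 0.4, (30, 3))])
labels, centers = spherical_kmeans(pts, 2)
```

The M-step renormalises the cluster sum, which is exactly the maximum-likelihood mean direction of a von Mises-Fisher component; estimating per-component concentrations and feature saliencies, as in the thesis, goes beyond this sketch.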

    Exploring QCD matter in extreme conditions with Machine Learning

    In recent years, machine learning has emerged as a powerful computational tool and a novel problem-solving perspective for physics, offering new avenues for studying the properties of strongly interacting QCD matter under extreme conditions. This review article provides an overview of the current state of this intersection of fields, focusing on the application of machine learning to theoretical studies in high-energy nuclear physics. It covers diverse aspects, including heavy-ion collisions, lattice field theory, and neutron stars, and discusses how machine learning can be used to explore and facilitate the physics goals of understanding QCD matter. The review also surveys common methodology, ranging from data-driven to physics-driven perspectives. We conclude by discussing the challenges and future prospects of machine learning applications in high-energy nuclear physics, underscoring the importance of incorporating physics priors into the purely data-driven learning toolbox. This review highlights the critical role of machine learning as a valuable computational paradigm for advancing physics exploration in high-energy nuclear physics. Comment: 146 pages, 53 figures

    Molecular Dynamics Simulation

    Condensed matter systems, ranging from simple fluids and solids to complex multicomponent materials and even biological matter, are governed by well-understood laws of physics, within the formal theoretical framework of quantum theory and statistical mechanics. On the relevant scales of length and time, the appropriate ‘first-principles’ description needs only the Schrödinger equation together with Gibbs averaging over the relevant statistical ensemble. However, this program cannot be carried out straightforwardly: dealing with electron correlations is still a challenge for the methods of quantum chemistry. Similarly, standard statistical mechanics makes precise explicit statements only about the properties of systems for which the many-body problem can be effectively reduced to one of independent particles or quasi-particles. [...]
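
At the classical level of this ‘first-principles’ program, the workhorse of molecular dynamics is a time-reversible, symplectic integrator. A minimal velocity-Verlet sketch follows; it is a textbook illustration, and the 1D harmonic test system is hypothetical, not from the text:

```python
import numpy as np

def velocity_verlet(x, v, force, mass, dt, steps):
    """Velocity-Verlet integration of Newton's equations -- the standard
    integrator in classical MD, conserving energy to O(dt^2)."""
    f = force(x)
    traj = [x]
    for _ in range(steps):
        x = x + v * dt + 0.5 * (f / mass) * dt ** 2   # position update
        f_new = force(x)                              # force at the new position
        v = v + 0.5 * (f + f_new) / mass * dt         # combined velocity half-kicks
        f = f_new
        traj.append(x)
    return np.array(traj), v

# hypothetical test system: 1D harmonic oscillator (k = m = 1), exact x(t) = cos(t)
traj, v_end = velocity_verlet(1.0, 0.0, lambda x: -x, 1.0, 0.01, 1000)
energy = 0.5 * v_end ** 2 + 0.5 * traj[-1] ** 2       # should stay near 0.5
```

For the oscillator the total energy drifts only within a bounded band, which is why symplectic schemes like this dominate production MD codes.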

    Bayesian inference for stochastic processes

    This thesis builds upon two strands of recent research on Bayesian inference for stochastic processes. First, it introduces a new residual-bridge proposal for approximately simulating conditioned diffusions, formed by applying the modified diffusion bridge approximation of Durham and Gallant (2002) to the difference between the true diffusion and a second, approximate diffusion driven by the same Brownian motion. This new proposal attempts to account for non-constant volatilities and can therefore yield efficiency gains over recently proposed residual-bridge constructs (Whitaker et al., 2017) when the volatility varies considerably, as is often the case for larger inter-observation times and for time-inhomogeneous volatilities. These gains in efficiency are illustrated via a simulation study for three diffusions: the Birth-Death (BD) diffusion, the Lotka-Volterra (LV) diffusion, and a diffusion corresponding to a simple model of gene expression (GE). Second, the thesis introduces two new classes of Markov chain Monte Carlo samplers, the Exchangeable Sampler and the Exchangeable Particle Gibbs Sampler, which, at each iteration, use exchangeability to simulate multiple weighted proposals whose weights indicate how likely the chain is to move to each proposal. By generalising the Independence Sampler and the Particle Gibbs Sampler, respectively, these samplers allow the locality of moves to be controlled by a scaling parameter that can be tuned to optimise the mixing of the resulting MCMC procedure, while still benefiting from the increase in acceptance probability that typically comes with using multiple proposals. These samplers can lead to chains with better mixing properties and, therefore, to MCMC estimators with smaller variances than their corresponding algorithms based on independent proposals.
This improvement in mixing is illustrated numerically for both samplers through simulation studies, and theoretically for the Exchangeable Sampler through a result stating that, under certain conditions, the Exchangeable Sampler is geometrically ergodic even when the importance weights are unbounded and, hence, in scenarios where the Independence Sampler cannot be geometrically ergodic. To guide the practical implementation of such samplers, this thesis derives asymptotic expected squared-jump-distance results for the Exchangeable Sampler and the Exchangeable Particle Gibbs Sampler. Moreover, simulation studies demonstrate numerically how the theory plays out in practice when d is finite.
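
The Independence Sampler that the Exchangeable Sampler generalises can be sketched in a few lines. This is a generic textbook implementation, not the thesis's code, and the target/proposal pair in the usage is illustrative:

```python
import numpy as np

def independence_sampler(logpi, q_sample, q_logpdf, n, x0, seed=0):
    """Independence Metropolis-Hastings: each proposal is drawn afresh from q,
    ignoring the current state; acceptance compares importance weights pi/q
    (normalising constants cancel, so unnormalised log-densities suffice)."""
    rng = np.random.default_rng(seed)
    x = x0
    lw = logpi(x) - q_logpdf(x)              # current log importance weight
    chain = np.empty(n)
    for i in range(n):
        y = q_sample(rng)
        lw_y = logpi(y) - q_logpdf(y)
        if np.log(rng.uniform()) < lw_y - lw:
            x, lw = y, lw_y                  # accept: move to the proposal
        chain[i] = x
    return chain

# illustrative target N(2, 1) with a wider N(0, 2^2) proposal
chain = independence_sampler(lambda x: -0.5 * (x - 2.0) ** 2,
                             lambda rng: rng.normal(0.0, 2.0),
                             lambda x: -x * x / 8.0,
                             20000, 0.0, seed=3)
```

Because proposals ignore the current state, mixing hinges on the importance weights being well behaved, which is precisely the failure mode the Exchangeable Sampler's geometric-ergodicity result addresses.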

    Advancing Molecular Simulations of Crystal Nucleation: Applications to Clathrate Hydrates

    Crystallization is a fundamental physical phenomenon with broad impacts in science and engineering. Nonetheless, the mechanisms of crystallization in many systems remain incompletely understood. Molecular dynamics (MD) simulation is a powerful computational technique that is, in principle, well suited to offer insight into these mechanisms. Unfortunately, the waiting time required to observe crystal nucleation in simulated systems often falls far beyond the limits of modern MD simulations. This rare-event problem is the primary barrier to simulation studies of crystallization in complex systems. This dissertation takes a combined approach to advance such studies. First, we apply existing tools to a challenging problem: clathrate hydrate nucleation. We then use methods development, software development, and machine learning to address the specific challenges posed by the rare-event problem. Clathrate hydrate formation is an exemplar of crystallization in complex systems. Nucleation of clathrate hydrates generally occurs in systems with interfaces, and even homogeneous hydrate nucleation is inherently a multicomponent process. We address two aspects of clathrate hydrate nucleation that are not well studied. The first is the effect of interfaces on clathrate hydrate nucleation. Interfaces are common in hydrate systems, yet few studies probe their effects on nucleation. We find that nucleation occurs through a homogeneous mechanism near model hydrophobic and hydrophilic surfaces; the only effect of the surfaces is a partitioning of guest molecules that results in their aggregation at the hydrophobic surface. The second is the effect of guest solubility in water on the homogeneous nucleation mechanism.
Experiments show that soluble guests act as strong promoters of hydrate formation, but the molecular mechanisms of this effect are unclear. We apply forward flux sampling (FFS) and a committor analysis to identify good approximations of the reaction coordinate for homogeneous nucleation of hydrates formed from a water-soluble guest molecule. Our results suggest that the nucleation mechanism for hydrates formed from water-soluble guest molecules may differ from that for hydrates formed from sparingly soluble guest molecules. FFS studies of crystal nucleation can require hundreds of thousands of individual MD simulations. For complex systems, these simulations easily generate terabytes of intermediate data. Furthermore, each simulation must be completed, analyzed, and individually processed based upon the behavior of the system. The scale of these calculations thus quickly exceeds the practical limits of traditional scripting tools (e.g., bash). To apply FFS to clathrate hydrate nucleation, we developed a software package, SAFFIRE, which automates and manages FFS with a user-friendly interface. It is compatible with any simulation software and analysis codes. Because SAFFIRE is built on the Hadoop framework, it scales easily to tens or hundreds of nodes and can be deployed on commodity computing clusters such as the Palmetto cluster at Clemson University or XSEDE resources. Studying crystal nucleation in simulations generally requires selecting an order parameter for advanced sampling a priori. This is particularly challenging because one of the very goals of a study may be to elucidate the nucleation mechanism, and thus to identify order parameters that describe the nucleation process well. Furthermore, despite its many strengths, FFS is somewhat more sensitive to the choice of order parameter than some other advanced sampling methods.
To address these challenges, we develop a new method, contour forward flux sampling (cFFS), to perform FFS with multiple order parameters simultaneously. cFFS places nonlinear interfaces on the fly from the collective progress of the simulations, without prior knowledge of the energy landscape or of an appropriate combination of order parameters, and thus allows testing multiple prospective order parameters on the fly. Order parameters clearly play a key role in simulation studies of crystal nucleation, yet developing new ones is difficult and time-consuming. Using ideas from computer vision, we adapt a type of neural network called a PointNet to identify local structural environments (e.g., crystalline environments) in molecular simulations. Our approach requires no system-specific feature engineering and operates on the raw output of the simulations, i.e., atomic positions. We demonstrate the method on crystal structure identification in Lennard-Jones, water, and mesophase systems; it can even predict the crystal phases of atoms near external interfaces. We further demonstrate its versatility by identifying surface hydrophobicity based solely upon the positions and orientations of nearby water molecules. These results suggest the approach will be broadly applicable to many types of local structure in simulations. In sum, we address several interdependent challenges in studying crystallization in molecular simulations by combining software development, method development, and machine learning. While motivated by specific challenges identified during studies of clathrate hydrate nucleation, these contributions extend the applicability of molecular simulations to crystal nucleation in a broad variety of systems. The next step of the development cycle is to apply these methods to complex systems to motivate further improvements.
We believe that continued integration of software, methods, and machine learning will prove a fruitful framework for improving molecular simulations of crystal nucleation.
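
The core FFS identity, that the probability of a rare transition factorises into a product of interface-to-interface crossing probabilities, can be illustrated on a toy system with a known exact answer. The sketch below uses a biased 1D random walk as a stand-in (not a hydrate system); the gambler's-ruin formula supplies the exact reference value:

```python
import numpy as np

def p_reach(rng, start, target, floor, p_up, trials):
    """Fraction of trial walks from `start` that hit `target` before `floor`."""
    hits = 0
    for _ in range(trials):
        x = start
        while floor < x < target:
            x += 1 if rng.uniform() < p_up else -1
        hits += (x == target)
    return hits / trials

def ffs_estimate(p_up=0.4, floor=-5, top=5, trials=4000, seed=0):
    """Stage the rare transition through interfaces at 1, 2, ..., top and
    multiply the conditional crossing probabilities (the FFS factorisation)."""
    rng = np.random.default_rng(seed)
    prob = 1.0
    for lam in range(0, top):
        prob *= p_reach(rng, lam, lam + 1, floor, p_up, trials)
    return prob

est = ffs_estimate()
# gambler's-ruin exact answer for comparison, with r = q/p
r = 0.6 / 0.4
exact = (1 - r ** 5) / (1 - r ** 10)
```

Because the walk is Markovian in its position alone, restarting each stage exactly at the interface is legitimate here; in a real MD application one must instead harvest full configurations at each interface, which is the bookkeeping SAFFIRE automates.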

    Knowledge Augmented Machine Learning with Applications in Autonomous Driving: A Survey

    The existence of representative datasets is a prerequisite for many successful artificial intelligence and machine learning models. However, the subsequent application of these models often involves scenarios that are inadequately represented in the training data. The reasons for this are manifold, ranging from time and cost constraints to ethical considerations. As a consequence, the reliable use of these models, especially in safety-critical applications, is a huge challenge. Leveraging additional, already existing sources of knowledge is key to overcoming the limitations of purely data-driven approaches and, ultimately, to increasing the generalization capability of these models. Furthermore, predictions that conform with knowledge are crucial for making trustworthy and safe decisions even in underrepresented scenarios. This work provides an overview of existing techniques and methods in the literature that combine data-driven models with existing knowledge. The identified approaches are structured according to the categories of integration, extraction, and conformity. Special attention is given to applications in the field of autonomous driving.

    Untangling hotel industry’s inefficiency: An SFA approach applied to a renowned Portuguese hotel chain

    The present paper explores the technical efficiency of four hotels of the Teixeira Duarte Group, a renowned Portuguese hotel chain. An efficiency ranking of these four hotel units, located in Portugal, is established using Stochastic Frontier Analysis. This methodology discriminates between measurement error and systematic inefficiencies in the estimation process, enabling investigation of the main causes of inefficiency. Several suggestions for efficiency improvement are offered for each hotel studied.
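
Stochastic Frontier Analysis separates the error into symmetric noise v and a one-sided inefficiency term u. A minimal sketch of the classic normal/half-normal specification (Aigner, Lovell & Schmidt, 1977), fitted by maximum likelihood on simulated data; the data-generating values are hypothetical, not the paper's estimates:

```python
import numpy as np
from scipy import optimize, stats

def sfa_loglik(params, y, X):
    """Log-likelihood of the normal/half-normal stochastic frontier model
    y = X b + v - u, with v ~ N(0, sv^2) and u ~ |N(0, su^2)|."""
    b, (lsv, lsu) = params[:-2], params[-2:]
    sv, su = np.exp(lsv), np.exp(lsu)            # log-parameterised for positivity
    sigma, lam = np.hypot(sv, su), su / sv
    eps = y - X @ b
    return np.sum(np.log(2.0) - np.log(sigma)
                  + stats.norm.logpdf(eps / sigma)
                  + stats.norm.logcdf(-eps * lam / sigma))

def fit_sfa(y, X):
    """MLE starting from OLS; Nelder-Mead keeps the sketch dependency-light."""
    b_ols = np.linalg.lstsq(X, y, rcond=None)[0]
    start = np.concatenate([b_ols, [np.log(0.5), np.log(0.5)]])
    res = optimize.minimize(lambda p: -sfa_loglik(p, y, X), start,
                            method="Nelder-Mead", options={"maxiter": 20000})
    return res.x

# hypothetical production frontier: y = 1 + 0.5 x + v - u
rng = np.random.default_rng(0)
n = 2000
x1 = rng.uniform(0.0, 2.0, n)
X = np.column_stack([np.ones(n), x1])
y = 1.0 + 0.5 * x1 + rng.normal(0.0, 0.3, n) - np.abs(rng.normal(0.0, 0.6, n))
est = fit_sfa(y, X)
```

The one-sided term is what lets SFA distinguish systematic inefficiency from measurement noise; per-unit efficiency scores would then follow from the conditional expectation of u given the composed residual.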

    Coarse-grained modeling for molecular discovery: Applications to cardiolipin-selectivity

    The development of novel materials is pivotal for addressing global challenges such as sustainability, technological progress, and advancements in medical technology. Traditionally, developing or designing new molecules was a resource-intensive endeavor, often reliant on serendipity. Given the vast space of chemically feasible drug-like molecules, estimated at between 10^6 and 10^100 compounds, traditional in vitro techniques fall short. Consequently, in silico tools such as virtual screening and molecular modeling have gained increasing recognition. However, the computational cost and limited precision of the molecular models used still constrain computational molecular design. This thesis aims to enhance the molecular design process by integrating multiscale modeling and free energy calculations. Employing a coarse-grained model allowed us to traverse a significant portion of chemical space efficiently and to reduce the sampling time required by molecular dynamics simulations. The physics-informed nature of the applied Martini force field and its level of retained structural detail make the model a suitable starting point for the focused learning of molecular properties. We applied our proposed approach to a cardiolipin bilayer, a relevant and challenging problem that permits reasonable comparison to experimental measurements. We identified promising molecules with defined properties within the resolution limit of a coarse-grained representation. Furthermore, we were able to bridge the gap from in silico predictions to in vitro and in vivo experiments, supporting the validity of the theoretical concept. The findings underscore the potential of multiscale modeling and free-energy calculations in enhancing molecular discovery and design, and offer a promising direction for future research.
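
Free-energy calculations of this kind rest on estimators such as Zwanzig's free-energy perturbation identity, dF = -kT ln < exp(-dU/kT) >_0, averaged over samples from the reference state. A minimal sketch on harmonic toy potentials with a known analytic answer (the example system is illustrative, not the cardiolipin setup):

```python
import numpy as np

def fep_delta_f(samples, du, kT=1.0):
    """Zwanzig free-energy perturbation estimate:
    dF = -kT * ln < exp(-dU/kT) >_0, over reference-state samples."""
    return -kT * np.log(np.mean(np.exp(-du(samples) / kT)))

# toy states: harmonic wells U_i = 0.5 * k_i * x^2, so dF = (kT/2) ln(k1/k0)
rng = np.random.default_rng(0)
kT, k0, k1 = 1.0, 1.0, 2.0
x = rng.normal(0.0, np.sqrt(kT / k0), 200_000)   # exact samples from state 0
est = fep_delta_f(x, lambda x: 0.5 * (k1 - k0) * x ** 2, kT)
```

The exponential average converges only when the two states' configuration distributions overlap well; production workflows therefore typically stage the transformation through intermediate states or use bidirectional estimators such as BAR.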