721 research outputs found

    On the Suitability of Genetic-Based Algorithms for Data Mining

    The goal of data mining is to extract knowledge from large databases. A database may be considered as a search space consisting of an enormous number of elements, and a mining algorithm as a search strategy. In general, an exhaustive search of the space is infeasible. Therefore, efficient search strategies are of vital importance. Search strategies based on genetic algorithms have been applied successfully in a wide range of applications. We focus on the suitability of genetic-based algorithms for data mining. We discuss the design and implementation of a genetic-based algorithm for data mining and illustrate its potential.

    Scalable Deep Traffic Flow Neural Networks for Urban Traffic Congestion Prediction

    Tracking congestion throughout the road network is a critical component of intelligent transportation network management systems. Understanding how traffic flows, and predicting short-term congestion caused by rush hour or incidents, can help such systems manage traffic effectively and direct it to the most appropriate detours. Many current traffic flow prediction systems rely on a central processing component, where the prediction is carried out by aggregating the information gathered from all measuring stations. However, centralized systems are not scalable and fail to provide real-time feedback to the system, whereas in a decentralized scheme each node is responsible for predicting its own short-term congestion based on the current local measurements at neighboring nodes. We propose a decentralized deep learning-based method where each node accurately predicts its own congestion state in real time based on the congestion state of the neighboring stations. Moreover, historical data from the deployment site is not required, which makes the proposed method more suitable for newly installed stations. In order to achieve higher performance, we introduce a regularized Euclidean loss function that favors high congestion samples over low congestion samples to avoid the impact of the unbalanced training dataset. A novel dataset for this purpose is designed based on the traffic data obtained from traffic control stations in northern California. Extensive experiments conducted on the designed benchmark demonstrate successful congestion prediction.
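
The abstract does not give the form of its loss function, so the following is only a minimal sketch of the general idea: a squared (Euclidean) loss in which samples with a high congestion label receive a larger weight. The function name, the `high_weight` and `threshold` parameters, and the assumption that congestion labels lie in [0, 1] are all hypothetical, and the regularization term mentioned in the abstract is omitted.

```python
def weighted_euclidean_loss(predicted, observed, high_weight=3.0, threshold=0.5):
    """Mean weighted squared error (illustrative, not the paper's loss).

    Samples whose observed congestion is at or above `threshold` are
    multiplied by `high_weight`, so errors on the rarer high-congestion
    samples cost more than errors on low-congestion samples.
    """
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one sample is required")

    total = 0.0
    for p, y in zip(predicted, observed):
        weight = high_weight if y >= threshold else 1.0
        total += weight * (p - y) ** 2
    # The factor 1/2 follows the common convention for Euclidean loss.
    return total / (2 * len(observed))
```

With these assumed defaults, missing a fully congested sample (`observed = 1.0`, `predicted = 0.0`) costs three times as much as a false alarm of the same size, which is one simple way to counteract a dataset dominated by low-congestion samples.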

    Parallel sampling of decomposable graphs using Markov chain on junction trees

    Bayesian inference for undirected graphical models is mostly restricted to the class of decomposable graphs, as they enjoy a rich set of properties making them amenable to high-dimensional problems. While parameter inference is straightforward in this setup, inferring the underlying graph is a challenge driven by the computational difficulty in exploring the space of decomposable graphs. This work makes two contributions to address this problem. First, we provide sufficient and necessary conditions for when multi-edge perturbations maintain decomposability of the graph. Using these, we characterize a simple class of partitions that efficiently classify all edge perturbations by whether they maintain decomposability. Second, we propose a new parallel non-reversible Markov chain Monte Carlo sampler for distributions over junction tree representations of the graph, where at every step, all edge perturbations within a partition are executed simultaneously. Through simulations, we demonstrate the efficiency of our new edge perturbation conditions and class of partitions. We find that our parallel sampler yields improved mixing properties in comparison to the single-move variate, and outperforms current methods. The implementation of our work is available in a Python package. Comment: 20 pages, 10 figures, with appendix and supplementary material
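
To make the notion of a decomposability-preserving perturbation concrete, here is a small brute-force sketch. It is not the paper's junction-tree conditions or its Python package: it simply applies a set of edge additions and removals and then re-tests the whole graph, using the standard fact that a graph is decomposable (chordal) exactly when its vertices can be removed one at a time such that each removed vertex has neighbors forming a clique. The function names are hypothetical.

```python
from itertools import combinations


def is_decomposable(vertices, edges):
    """Return True if the undirected graph is chordal (decomposable)."""
    adj = {v: set() for v in vertices}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)

    remaining = set(vertices)
    while remaining:
        simplicial = None
        for v in remaining:
            nbrs = adj[v] & remaining
            # v is simplicial if its remaining neighbors are pairwise adjacent.
            if all(b in adj[a] for a, b in combinations(nbrs, 2)):
                simplicial = v
                break
        if simplicial is None:
            return False  # no simplicial vertex left: a chordless cycle exists
        remaining.remove(simplicial)
    return True


def keeps_decomposability(vertices, edges, added=(), removed=()):
    """Apply a multi-edge perturbation and re-test decomposability."""
    current = {frozenset(e) for e in edges}
    current |= {frozenset(e) for e in added}
    current -= {frozenset(e) for e in removed}
    return is_decomposable(vertices, [tuple(e) for e in current])
```

For example, closing the path 1-2-3-4 with the edge (4, 1) creates a chordless 4-cycle and breaks decomposability, whereas adding (4, 1) together with the chord (1, 3) preserves it. Re-testing the full graph like this costs far more than a local condition would, which is the kind of overhead that conditions on the junction tree are meant to avoid.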

    A hierarchical Bayesian model for predicting ecological interactions using scaled evolutionary relationships

    Identifying undocumented or potential future interactions among species is a challenge facing modern ecologists. Recent link prediction methods rely on trait data; however, large species interaction databases are typically sparse and covariates are limited to only a fraction of species. On the other hand, evolutionary relationships, encoded as phylogenetic trees, can act as proxies for underlying traits and historical patterns of parasite sharing among hosts. We show that using a network-based conditional model, phylogenetic information provides strong predictive power in a recently published global database of host-parasite interactions. By scaling the phylogeny using an evolutionary model, our method allows for biological interpretation often missing from latent variable models. To further improve on the phylogeny-only model, we combine a hierarchical Bayesian latent score framework for bipartite graphs that accounts for the number of interactions per species with the host dependence informed by phylogeny. Combining the two information sources yields significant improvement in predictive accuracy over each of the submodels alone. As many interaction networks are constructed from presence-only data, we extend the model by integrating a correction mechanism for missing interactions, which proves valuable in reducing uncertainty in unobserved interactions. Comment: To appear in the Annals of Applied Statistics

    A Skew-Normal Copula-Driven Generalized Linear Mixed Model for Longitudinal Data

    Using the advancements of Arellano-Valle et al. [2005], which characterize the likelihood function of a linear mixed model (LMM) under a skew-normal distribution for the random effects, this thesis attempts to construct a copula-driven generalized linear mixed model (GLMM). Assuming a multivariate distribution from the exponential family for the response variable and a skew-normal copula, we derive a complete characterization of the general likelihood function. For estimation, we apply a Monte Carlo expectation maximization (MC-EM) algorithm. Some special cases are discussed, in particular the exponential and gamma distributions. Simulations with multiple link functions are shown alongside a real data example from the Framingham Heart Study.