602 research outputs found

    Complex Data Imputation by Auto-Encoders and Convolutional Neural Networks—A Case Study on Genome Gap-Filling

    Missing data imputation has been a hot topic over the past decade, and many state-of-the-art works have proposed novel, interesting solutions that have been applied in a variety of fields. In the same period, the successful results achieved by deep learning techniques have opened the way to their application to difficult problems where human skill cannot provide a reliable solution. Not surprisingly, some deep learners, mainly exploiting encoder-decoder architectures, have also been designed and applied to the task of missing data imputation. However, most of the proposed imputation techniques have not been designed to tackle “complex data”, that is, high-dimensional data belonging to datasets with huge cardinality and describing complex problems. Specifically, they often need critical parameters to be set manually, or exploit complex architectures and/or training phases that make their computational load impracticable. In this paper, after clustering the state-of-the-art imputation techniques into three broad categories, we briefly review the most representative methods and then describe our data imputation proposals, which exploit deep learning techniques specifically designed to handle complex data. Comparative tests on genome sequences show that our deep learning imputers outperform the state-of-the-art KNN-imputation method when filling gaps in human genome sequences.
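
    The comparison in this abstract, deep autoencoder imputers versus KNN imputation, can be illustrated with a rough sketch. The code below is not the authors' architecture: it fills randomly masked entries of a toy numeric matrix once with scikit-learn's KNNImputer and once with a small autoencoder in PyTorch trained with a masked reconstruction loss; the layer sizes, missingness rate, and training settings are illustrative assumptions.

```python
import numpy as np
import torch
import torch.nn as nn
from sklearn.impute import KNNImputer

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20)).astype(np.float32)      # toy data matrix
mask = rng.random(X.shape) < 0.1                        # 10% entries missing at random
X_missing = X.copy()
X_missing[mask] = np.nan

# Baseline: KNN imputation
X_knn = KNNImputer(n_neighbors=5).fit_transform(X_missing)

# Autoencoder imputation: train on mean-filled data, loss only on observed cells
col_means = np.nanmean(X_missing, axis=0)
X_filled = np.where(mask, col_means, X_missing).astype(np.float32)

ae = nn.Sequential(nn.Linear(20, 8), nn.ReLU(), nn.Linear(8, 20))
opt = torch.optim.Adam(ae.parameters(), lr=1e-3)
x = torch.from_numpy(X_filled)
observed = torch.from_numpy(~mask)

for _ in range(200):
    opt.zero_grad()
    recon = ae(x)
    loss = ((recon - x)[observed] ** 2).mean()          # ignore mean-filled cells in the loss
    loss.backward()
    opt.step()

# Replace only the missing entries with the autoencoder's reconstruction
X_ae = np.where(mask, ae(x).detach().numpy(), X_missing)
```

    The design point mirrored here is that the autoencoder's loss is computed only on observed cells, so the network is never trained to reproduce the placeholder values it is supposed to replace.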

    Design and validation of novel methods for long-term road traffic forecasting

    Road traffic management is a critical aspect of the design and planning of complex urban transport networks, for which vehicle flow forecasting is an essential component. As a testimony of its paramount relevance in transport planning and logistics, thousands of scientific research works have covered the traffic forecasting topic during the last 50 years. In the beginning, most approaches relied on autoregressive models and other analysis methods suited for time series data. During the last two decades, the development of new technology, platforms and techniques for massive data processing under the Big Data umbrella, the availability of data from multiple sources fostered by the Open Data philosophy, and an ever-growing need of decision makers for accurate traffic predictions have shifted the spotlight to data-driven procedures. Even in this convenient context, with an abundance of open data to experiment with and advanced techniques to exploit them, most predictive models reported in the literature aim for short-term forecasts, and their performance degrades when the prediction horizon is increased. Long-term forecasting strategies are scarcer, and are commonly based on the detection of patterns and the assignment of new observations to them. These approaches can perform reasonably well unless an unexpected event provokes unpredictable changes, or the allocation to a pattern is inaccurate. The core of the work in this Thesis has revolved around data-driven traffic forecasting, ultimately pursuing long-term forecasts. This has broadly entailed a deep analysis and understanding of the state of the art, and dealing with incompleteness of data, among other lesser issues. Besides, the second part of this dissertation presents an application outlook of the developed techniques, providing methods and unexpected insights into the local impact of traffic on pollution. The obtained results reveal that the impact of vehicular emissions on pollution levels is overshadowed
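
    The pattern-based long-term strategy mentioned in the abstract (detect typical profiles, then assign new observations to one of them) can be sketched with a simple clustering step; the daily profile length, number of clusters, and assignment rule below are illustrative assumptions rather than the thesis's actual method.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
# Toy data: 200 historical days of hourly vehicle counts (24 values per day)
base = np.sin(np.linspace(0, 2 * np.pi, 24)) * 50 + 100
days = base + rng.normal(0, 10, size=(200, 24))

# 1) Detect typical daily patterns by clustering historical profiles
kmeans = KMeans(n_clusters=4, n_init=10, random_state=0).fit(days)

# 2) Long-term forecast for a new day: match its first observed hours to the
#    closest pattern and use that pattern's remaining hours as the prediction
def long_term_forecast(observed_morning, horizon_start=8):
    dists = np.linalg.norm(
        kmeans.cluster_centers_[:, :horizon_start] - observed_morning, axis=1)
    pattern = kmeans.cluster_centers_[np.argmin(dists)]
    return pattern[horizon_start:]            # forecast for the rest of the day

forecast = long_term_forecast(days[0, :8])
```

    A misassigned pattern, or an unexpected event that pushes the day away from every historical profile, breaks this scheme, which is exactly the failure mode the abstract points out.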

    Microarray Data Mining and Gene Regulatory Network Analysis

    The novel molecular biological technology, microarray, makes it feasible to obtain quantitative measurements of the expression of thousands of genes present in a biological sample simultaneously. Genome-wide expression data generated from this technology are promising to uncover implicit, previously unknown biological knowledge. In this study, several problems in microarray data mining were investigated, including feature (gene) selection, classifier gene identification, generation of a reference genetic interaction network for non-model organisms, and gene regulatory network reconstruction using time-series gene expression data. Most existing computational models employed to infer gene regulatory networks suffer either from low accuracy or from high computational complexity. To overcome these limitations, the following strategies were proposed to integrate bioinformatics data mining techniques with existing gene regulatory network (GRN) inference algorithms, enabling the discovery of novel biological knowledge. An integrated statistical and machine learning (ISML) pipeline was developed for feature selection and classifier gene identification to address the curse of dimensionality and the huge search space. Using the selected classifier genes as seeds, a scale-up technique was applied to search through major databases of genetic interaction networks, metabolic pathways, etc. By curating relevant genes and blasting genomic sequences of non-model organisms against well-studied genetic model organisms, a reference gene regulatory network for less-studied organisms was built and used both as prior knowledge and for model validation in GRN reconstruction. Networks of gene interactions were inferred using a Dynamic Bayesian Network (DBN) approach and were analyzed to elucidate the dynamics caused by perturbations. Our proposed pipelines were applied to investigate molecular mechanisms for chemical-induced reversible neurotoxicity.
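
    As a hedged illustration of the ISML idea of chaining a statistical filter with a machine-learning ranking for classifier-gene identification (the actual pipeline, thresholds, and models in the study differ), the sketch below uses scikit-learn's ANOVA F-test filter followed by random-forest importances on synthetic expression data.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
# Toy expression matrix: 60 samples x 2000 genes, binary phenotype labels
X = rng.normal(size=(60, 2000))
y = rng.integers(0, 2, size=60)

# Step 1 (statistical filter): keep the 100 genes with the strongest ANOVA F-score
filt = SelectKBest(f_classif, k=100).fit(X, y)
X_filtered = filt.transform(X)
kept = np.flatnonzero(filt.get_support())

# Step 2 (machine-learning ranking): rank the surviving genes by random-forest
# importance and keep the top 10 as candidate classifier genes
forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_filtered, y)
top10 = kept[np.argsort(forest.feature_importances_)[::-1][:10]]
print("Candidate classifier genes (column indices):", top10)
```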

    Modeling Time-Series and Spatial Data for Recommendations and Other Applications

    With the research directions described in this thesis, we seek to address the critical challenges in designing recommender systems that can understand the dynamics of continuous-time event sequences (CTES). We follow a ground-up approach: first, we address the problems that may arise due to the poor quality of CTES data being fed into a recommender system; later, we handle the task of designing accurate recommender systems. To improve the quality of the CTES data, we address the fundamental problem of overcoming missing events in temporal sequences. Moreover, to provide accurate sequence modeling frameworks, we design solutions for points-of-interest (POI) recommendation, i.e., models that can handle the spatial mobility data of users across POI check-ins and recommend candidate locations for the next check-in. Lastly, we highlight that the capabilities of the proposed models can have applications beyond recommender systems, and we extend their abilities to design solutions for large-scale CTES retrieval and human activity prediction. A significant part of this thesis uses the idea of modeling the underlying distribution of CTES via neural marked temporal point processes (MTPP). Traditional MTPP models are stochastic processes that utilize a fixed formulation to capture the generative mechanism of a sequence of discrete events localized in continuous time. In contrast, neural MTPP models combine the underlying ideas from the point process literature with modern deep learning architectures. The ability of deep learning models to act as accurate function approximators has led to a significant gain in the predictive prowess of neural MTPP models. In this thesis, we utilize and present several neural network-based enhancements to current MTPP frameworks for the aforementioned real-world applications. Comment: Ph.D. Thesis (2022).
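
    A minimal sketch of a neural MTPP in the spirit described here is given below: a GRU encodes the marked event history, the conditional intensity takes an exponential form with a closed-form integral (RMTPP-style), and marks are predicted with a categorical head. The hidden size, intensity parameterization, and toy data are assumptions for illustration, not the thesis's models.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralMTPP(nn.Module):
    """Minimal RMTPP-style marked temporal point process (illustrative sketch)."""
    def __init__(self, num_marks, hidden=32):
        super().__init__()
        self.mark_emb = nn.Embedding(num_marks, hidden)
        self.rnn = nn.GRU(hidden + 1, hidden, batch_first=True)
        self.mark_head = nn.Linear(hidden, num_marks)
        self.v = nn.Linear(hidden, 1)               # history influence on the intensity
        self.w = nn.Parameter(torch.tensor(0.1))    # elapsed-time term of the intensity

    def encode(self, marks, gaps):
        # marks: (B, T) int64 event types; gaps: (B, T) float time since previous event
        x = torch.cat([self.mark_emb(marks), gaps.unsqueeze(-1)], dim=-1)
        h, _ = self.rnn(x)
        return h

    def neg_log_likelihood(self, marks, gaps):
        h = self.encode(marks[:, :-1], gaps[:, :-1])   # history up to event i
        next_gap = gaps[:, 1:]                         # Delta t to event i+1
        next_mark = marks[:, 1:]
        a = self.v(h).squeeze(-1)
        # log intensity at the next event: a + w * Delta t
        log_lambda = a + self.w * next_gap
        # closed-form integral of exp(a + w*s) over [0, Delta t]
        integral = (torch.exp(a + self.w * next_gap) - torch.exp(a)) / self.w
        time_nll = -(log_lambda - integral)
        mark_nll = F.cross_entropy(
            self.mark_head(h).transpose(1, 2), next_mark, reduction="none")
        return (time_nll + mark_nll).mean()

# Example usage on random toy sequences
marks = torch.randint(0, 10, (4, 50))
gaps = torch.rand(4, 50)
model = NeuralMTPP(num_marks=10)
loss = model.neg_log_likelihood(marks, gaps)
loss.backward()
```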

    Social Choice for Partial Preferences Using Imputation

    Within the field of multiagent systems, the area of computational social choice considers the problems arising when decisions must be made collectively by a group of agents. Usually such systems collect a ranking of the alternatives from each member of the group in turn, and aggregate these individual rankings to arrive at a collective decision. However, when there are many alternatives to consider, individual agents may be unwilling, or unable, to rank all of them, leading to decisions that must be made on the basis of incomplete information. While earlier approaches attempt to work with the provided rankings by making assumptions about the nature of the missing information, this can lead to undesirable outcomes when the assumptions do not hold, and is ill-suited to certain problem domains. In this thesis, we propose a new approach that uses machine learning algorithms (both conventional and purpose-built) to generate plausible completions (imputations) of each agent’s ranking on the basis of the partial ranking the agent provided, in a way that reflects the agents’ true preferences. We show that the combination of existing social choice functions with certain classes of imputation algorithms, which forms the core of our proposed solution, is equivalent to a form of social choice. Our system then undergoes an extensive empirical validation under 40 different test conditions, involving more than 50,000 group decision problems generated from real-world electoral data, and is found to outperform existing competitors significantly, leading to better group decisions overall. Detailed empirical findings are also used to characterize the behaviour of the system, and illustrate the circumstances in which it is most advantageous. A general testbed for comparing solutions using real-world and artificial data (Prefmine) is then described, in conjunction with results that justify its design decisions. We move on to propose a new machine learning algorithm intended specifically to learn and impute the preferences of agents, and validate its effectiveness. This Markov-Tree approach is demonstrated to be superior to imputation using conventional machine learning, and has a simple interpretation that characterizes the problems on which it will perform well. Later chapters contain an axiomatic validation of both of our new approaches, as well as techniques for mitigating their manipulability. The thesis concludes with a discussion of the applicability of its contributions, both for multiagent systems and for settings involving human elections. In all, we reveal an interesting connection between machine learning and computational social choice, and introduce a testbed which facilitates future research efforts on computational social choice for partial preferences, by allowing empirical comparisons between competing approaches to be conducted easily, accurately, and quickly. Perhaps most importantly, we offer an important and effective new direction for enabling group decision making when preferences are not completely specified, using imputation methods.
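
    The core structure described here, impute each agent's partial ranking to a complete one and then apply an ordinary social choice function, can be shown with a toy sketch. The imputer below is a deliberately trivial placeholder (append unranked alternatives in a fixed order) rather than the thesis's learned Markov-Tree or conventional machine-learning imputers, and Borda count stands in for the aggregation step.

```python
from collections import defaultdict

# Toy profile: each agent ranks only some of the alternatives (best first)
alternatives = ["a", "b", "c", "d"]
partial_rankings = [
    ["a", "c"],              # agent 1 ranked only a and c
    ["b", "a", "d"],
    ["d", "b", "c", "a"],
]

def impute(ranking, alternatives):
    """Placeholder imputer: append unranked alternatives in a fixed order.
    The thesis instead learns plausible completions from the agents' data."""
    return ranking + [x for x in alternatives if x not in ranking]

def borda(rankings, alternatives):
    """Standard Borda count over complete rankings."""
    scores = defaultdict(int)
    m = len(alternatives)
    for r in rankings:
        for position, alt in enumerate(r):
            scores[alt] += m - 1 - position
    return max(alternatives, key=lambda x: scores[x])

complete = [impute(r, alternatives) for r in partial_rankings]
print("Winner after imputation:", borda(complete, alternatives))
```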

    Computationally Comparing Biological Networks and Reconstructing Their Evolution

    Biological networks, such as protein-protein interaction, regulatory, or metabolic networks, provide information about biological function, beyond what can be gleaned from sequence alone. Unfortunately, most computational problems associated with these networks are NP-hard. In this dissertation, we develop algorithms to tackle numerous fundamental problems in the study of biological networks. First, we present a system for classifying the binding affinity of peptides to a diverse array of immunoglobulin antibodies. Computational approaches to this problem are integral to virtual screening and modern drug discovery. Our system is based on an ensemble of support vector machines and exhibits state-of-the-art performance. It placed 1st in the 2010 DREAM5 competition. Second, we investigate the problem of biological network alignment. Aligning the biological networks of different species allows for the discovery of shared structures and conserved pathways. We introduce an original procedure for network alignment based on a novel topological node signature. The pairwise global alignments of biological networks produced by our procedure, when evaluated under multiple metrics, are both more accurate and more robust to noise than those of previous work. Next, we explore the problem of ancestral network reconstruction. Knowing the state of ancestral networks allows us to examine how biological pathways have evolved, and how pathways in extant species have diverged from those of their common ancestor. We describe a novel framework for representing the evolutionary histories of biological networks and present efficient algorithms for reconstructing either a single parsimonious evolutionary history, or an ensemble of near-optimal histories. Under multiple models of network evolution, our approaches are effective at inferring the ancestral network interactions. Additionally, the ensemble approach is robust to noisy input, and can be used to impute missing interactions in experimental data. Finally, we introduce a framework, GrowCode, for learning network growth models. While previous work focuses on developing growth models manually, or on procedures for learning parameters for existing models, GrowCode learns fundamentally new growth models that match target networks in a flexible and user-defined way. We show that models learned by GrowCode produce networks whose target properties match those of real-world networks more closely than existing models.
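
    The network-alignment step based on topological node signatures can be illustrated with a toy sketch: the signature below is just a node's degree plus its neighbours' sorted degrees (not the dissertation's novel signature), and the pairwise global alignment is a simple greedy matching on signature distance over networkx graphs.

```python
import networkx as nx
import numpy as np

def node_signature(G, v):
    """Toy topological signature: the node's degree plus the sorted degrees
    of its neighbours (stand-in for the dissertation's richer signature)."""
    return [G.degree(v)] + sorted(G.degree(u) for u in G.neighbors(v))

def signature_distance(s1, s2):
    # Pad the shorter signature with zeros, then take the Euclidean distance
    n = max(len(s1), len(s2))
    a = np.pad(np.array(s1, dtype=float), (0, n - len(s1)))
    b = np.pad(np.array(s2, dtype=float), (0, n - len(s2)))
    return float(np.linalg.norm(a - b))

def greedy_align(G1, G2):
    """Greedily pair nodes with the most similar signatures (one-to-one)."""
    pairs = sorted(
        ((signature_distance(node_signature(G1, u), node_signature(G2, v)), u, v)
         for u in G1 for v in G2),
        key=lambda t: t[0])
    used1, used2, alignment = set(), set(), {}
    for _, u, v in pairs:
        if u not in used1 and v not in used2:
            alignment[u] = v
            used1.add(u)
            used2.add(v)
    return alignment

G1 = nx.karate_club_graph()
G2 = nx.double_edge_swap(G1.copy(), nswap=5, seed=0)   # noisy copy of G1
alignment = greedy_align(G1, G2)
print(len(alignment), "node pairs aligned; e.g.", list(alignment.items())[:3])
```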