
    Neural Computing for Event Log Quality Improvement

    An event log is a vital input for process mining tasks such as process discovery, conformance checking, or enhancement. Like any other data, the initial event logs can be too coarse, resulting in severe data mining mistakes. Traditional statistical reconstruction methods work poorly on event logs because of the complex interrelations among attributes, events, and cases. As such, machine learning approaches appear more suitable for reconstructing or repairing event logs. However, there is very limited work on exploiting neural networks for this task. This thesis focuses on two issues that may arise in coarse event logs: incorrect attribute values and missing attribute values. We explore the application of different kinds of autoencoders to the task of reconstructing event logs, since this architecture suits unsupervised problems such as the ones we consider. When repairing an event log, in fact, one cannot assume that a training set with true labels is available for model training. We also propose techniques for preprocessing event log data and training the models. To provide insight into how feasible and applicable our work is, we have carried out experiments on real-life datasets. Regarding the first issue, we train autoencoders in a purely unsupervised manner to detect anomalies without using any prior knowledge of the domain, focusing on algorithms that can capture both the general pattern and the sequential aspect of the data. To solve the second issue, we develop models that not only learn the representation and underlying true distribution of the data but are also able to generate realistic and reliable output that has the characteristics of the logs.
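
    As a concrete illustration of the first issue, below is a minimal sketch of unsupervised anomaly detection with a dense autoencoder on one-hot encoded, fixed-length traces; the placeholder data, layer sizes, and 95th-percentile threshold are illustrative assumptions, not the thesis's actual models.

```python
# Minimal sketch: flag event-log cases with high reconstruction error as
# anomalous, assuming traces are one-hot encoded and padded to a fixed length.
import torch
import torch.nn as nn

def make_autoencoder(input_dim: int, hidden_dim: int = 16) -> nn.Module:
    return nn.Sequential(
        nn.Linear(input_dim, hidden_dim), nn.ReLU(),    # encoder
        nn.Linear(hidden_dim, input_dim), nn.Sigmoid()  # decoder
    )

def fit(model: nn.Module, X: torch.Tensor, epochs: int = 200, lr: float = 1e-3):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(X), X)  # unsupervised: reconstruct the input itself
        loss.backward()
        opt.step()
    return model

# X: (n_cases, seq_len * n_activities); random placeholder standing in for a log
X = torch.randint(0, 2, (500, 40)).float()
model = fit(make_autoencoder(X.shape[1]), X)

# Per-case reconstruction error; cases in the upper tail are flagged anomalous.
with torch.no_grad():
    err = ((model(X) - X) ** 2).mean(dim=1)
threshold = torch.quantile(err, 0.95)  # purely data-driven, no domain labels
anomalous = err > threshold
```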

    Data Mining and Machine Learning in Astronomy

    We review the current state of data mining and machine learning in astronomy. 'Data mining' can have a somewhat mixed connotation from the point of view of a researcher in this field. If used correctly, it can be a powerful approach, holding the potential to fully exploit the exponentially increasing amount of available data and promising great scientific advance. However, if misused, it can be little more than the black-box application of complex computing algorithms that gives little physical insight and provides questionable results. Here, we give an overview of the entire data mining process, from data collection through to the interpretation of results. We cover common machine learning algorithms, such as artificial neural networks and support vector machines; applications from a broad range of astronomy, emphasizing those where data mining techniques directly resulted in improved science; and important current and future directions, including probability density functions, parallel algorithms, petascale computing, and the time domain. We conclude that, so long as one carefully selects an appropriate algorithm and is guided by the astronomical problem at hand, data mining can be very much a powerful tool, and not a questionable black box. (Comment: Published in IJMPD. 61 pages, uses ws-ijmpd.cls. Several extra figures, some minor additions to the text.)
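
    As an illustration of the guided, rather than black-box, use of such algorithms, the sketch below shows a standard support vector machine workflow with cross-validated hyperparameter selection; the synthetic data and parameter grid are placeholders and are not taken from the review.

```python
# Hedged sketch: an SVM classifier (e.g., for star/galaxy separation from
# tabular features), with hyperparameters chosen by cross-validation rather
# than accepted blindly. Synthetic data stands in for a real catalog.
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=0)

search = GridSearchCV(SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10]}, cv=5)
search.fit(X_tr, y_tr)
print(search.best_params_, search.score(X_te, y_te))  # held-out accuracy
```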

    Big Data Preprocessing for Multivariate Time Series Forecast

    Big data platforms ease the collection and organization of large datasets of varying content. A downside is the heavy preprocessing required before their data can be analyzed with conventional techniques. Time series data in particular is challenging to transform from the platform-provided raw format into the tables of feature and target values required by supervised machine learning models. This thesis presents an experiment in preprocessing a collection of multivariate time series extracted from a data platform and forecasting it with machine learning models such as neural networks and support vector machines. Techniques from the data preprocessing and time series analysis literature are applied, and custom solutions are also developed, such as a log-level-based target variable and value-distribution-based feature elimination. No significant forecasting accuracy is achieved, which indicates the difficulty of modelling big data. The suspected reason is inadequate validation of model parameters and preprocessing decisions, which would require more testing than the available resources allow.
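
    A minimal sketch of the windowing step described above, assuming a pandas DataFrame of numeric series; the column names, lag count, and horizon are illustrative, not the thesis's actual pipeline.

```python
# Turn a multivariate time series into the (features, target) table that a
# supervised model expects: each row holds the last `lags` values of every
# variable, paired with the target `horizon` steps ahead.
import numpy as np
import pandas as pd

def make_supervised(df: pd.DataFrame, target: str, lags: int = 3, horizon: int = 1):
    cols = {}
    for col in df.columns:
        for k in range(1, lags + 1):
            cols[f"{col}_lag{k}"] = df[col].shift(k)  # lagged feature columns
    X = pd.DataFrame(cols)
    y = df[target].shift(-horizon)                    # future target value
    valid = X.notna().all(axis=1) & y.notna()         # drop edge rows
    return X[valid], y[valid]

ts = pd.DataFrame({"load": np.random.rand(100), "temp": np.random.rand(100)})
X, y = make_supervised(ts, target="load")
```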

    Interval Temporal Random Forests with an Application to COVID-19 Diagnosis

    Symbolic learning is the logic-based approach to machine learning. Its mission is to provide algorithms and methodologies for extracting logical information from data and expressing it in an interpretable way. In the context of temporal data, interval temporal logic has recently been proposed as a suitable tool for symbolic learning, specifically via the design of an interval temporal logic decision tree extraction algorithm. Building on that work, we study here its natural generalization to interval temporal random forests, mimicking the corresponding schema at the propositional level. Interval temporal random forests turn out to be a high-performing multivariate time series classification method that, despite the introduction of a functional component, remains logically interpretable to some extent. We apply this method to the problem of diagnosing COVID-19 based on the time series that emerge from cough and breath recordings of positive versus negative subjects. Our experiments show that our models achieve very high accuracies and sensitivities, often superior to those achieved by classical methods on the same data. Although other recent approaches to the same problem (based on different and more numerous data) show even better statistical results, our solution is the first logic-based, interpretable, and explainable one.
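
    The sketch below is not the interval temporal logic forest itself but a simplified propositional stand-in: each series is summarized over fixed intervals and the flattened statistics feed an ordinary random forest. It conveys only the pipeline shape; the symbolic, interpretable component of the actual method is absent, and all data here are placeholders.

```python
# Simplified stand-in: per-interval summary statistics of each channel,
# flattened into a feature vector for a plain (propositional) random forest.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def interval_features(X: np.ndarray, n_intervals: int = 4) -> np.ndarray:
    # X: (n_samples, n_channels, series_len)
    feats = []
    for chunk in np.array_split(X, n_intervals, axis=2):
        feats += [chunk.mean(axis=2), chunk.min(axis=2), chunk.max(axis=2)]
    return np.concatenate(feats, axis=1)  # (n_samples, n_intervals * 3 * n_channels)

X = np.random.randn(200, 2, 64)     # placeholder cough/breath channels
y = np.random.randint(0, 2, 200)    # placeholder positive/negative labels
clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(interval_features(X), y)
```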

    Coping with new Challenges in Clustering and Biomedical Imaging

    The last years have seen a tremendous increase of data acquisition in different scientific fields such as molecular biology, bioinformatics, and biomedicine. Novel methods are therefore needed for the automatic processing and analysis of this large amount of data. Data mining is the process of applying methods like clustering or classification to large databases in order to uncover hidden patterns. Clustering is the task of partitioning the points of a data set into distinct groups so as to maximize the intra-cluster similarity and minimize the inter-cluster similarity. In contrast to unsupervised learning tasks like clustering, classification is a supervised learning problem that aims at predicting the group membership of data objects on the basis of rules learned from a training set where the group membership is known.

    Specialized methods have been proposed for hierarchical and partitioning clustering, but these methods suffer from several drawbacks. In the first part of this work, new clustering methods are proposed that address these problems. ITCH (Information-Theoretic Cluster Hierarchies) is a hierarchical clustering method based on a hierarchical variant of the Minimum Description Length (MDL) principle, which finds hierarchies of clusters without requiring input parameters. As ITCH may converge only to a local optimum, we propose GACH (Genetic Algorithm for Finding Cluster Hierarchies), which combines the benefits of genetic algorithms with information theory so that the search space is explored more effectively. Furthermore, we propose INTEGRATE, a novel clustering method for data with mixed numerical and categorical attributes. Supported by the MDL principle, our method integrates the information provided by heterogeneous numerical and categorical attributes and thus naturally balances the influence of both sources of information. A competitive evaluation illustrates that INTEGRATE is more effective than existing clustering methods for mixed-type data.

    Besides clustering methods for single data objects, we provide a solution for clustering different data sets that are represented by their skylines. The skyline operator is a well-established database primitive for finding database objects that minimize two or more attributes with an unknown weighting between these attributes. In this thesis, we define a similarity measure, called SkyDist, for comparing the skylines of different data sets, which can directly be integrated into data mining tasks such as clustering or classification. The experiments show that SkyDist in combination with different clustering algorithms can give useful insights in many applications.
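
    A minimal sketch of the skyline operator that SkyDist builds on, assuming all attributes are to be minimized; the quadratic scan and the hotel example are illustrative only.

```python
# Skyline: keep the points that no other point dominates, where q dominates p
# if q is <= p in every attribute and strictly < in at least one. O(n^2) scan
# for clarity; real database implementations are far more efficient.
import numpy as np

def skyline(points: np.ndarray) -> np.ndarray:
    keep = []
    for i, p in enumerate(points):
        dominated = any(
            np.all(q <= p) and np.any(q < p)
            for j, q in enumerate(points) if j != i
        )
        if not dominated:
            keep.append(i)
    return points[keep]

# e.g., hotels described by (price, distance_to_beach), both to be minimized
hotels = np.array([[50, 8], [80, 2], [60, 5], [90, 9]])
print(skyline(hotels))  # drops [90, 9], which [50, 8] dominates
```
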
    In the second part, we focus on the analysis of high-resolution magnetic resonance images (MRI) that are clinically relevant and may allow for the early detection and diagnosis of several diseases. In particular, we propose a framework for the classification of Alzheimer's disease in MR images that combines the data mining steps of feature selection, clustering, and classification. As a result, a set of highly selective features discriminating patients with Alzheimer's disease from healthy people has been identified. However, the analysis of the high-dimensional MR images is extremely time-consuming, so we developed JGrid, a scalable distributed computing solution designed to allow for a large-scale analysis of MRI and thus an optimized prediction of diagnosis.

    In another study, we apply efficient algorithms for motif discovery to task-fMRI scans in order to identify patterns in the brain that are characteristic of patients with somatoform pain disorder. We find groups of brain compartments that occur frequently within the brain networks and discriminate well between healthy and diseased people.

    Map Based Visualization of Product Catalogs

    Traditionally, recommender systems present recommendations to the user in lists. In content- and knowledge-based recommendation systems these lists are often sorted by some notion of similarity with a query, an ideal product specification, or a sample product. However, a lot of information is lost this way, since even two similar products can differ from the query on completely different sets of product characteristics. With a two-dimensional, that is, map-based, representation of the recommendations, this information can be retained: recommendations that are similar to each other can be positioned in the same area of the map. In both science and industry, an increasing number of two-dimensional graphical interfaces have been introduced in recent years. However, some of them lack a sound scientific foundation, while other approaches are not applicable in a recommendation setting. In this chapter, we describe a framework that has a solid scientific foundation (using state-of-the-art statistical models) and is specifically designed to work with e-commerce product catalogs. The basis of the framework is the Product Catalog Map interface, built on multidimensional scaling. We also show another type of interface, based on nonlinear principal components analysis, which provides an easy way to constrain the space based on specific characteristic values. Then, we discuss some advanced issues. First, we discuss how the product catalog interface can be adapted to better fit the users' notion of the importance of attributes using click-stream analysis. Second, we show a user interface that combines recommendation by proposing with the map-based approach. Finally, we show how these methods can be applied to a real e-commerce product catalog of MP3 players.
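
    A minimal sketch of the map-based idea using classical multidimensional scaling, assuming a small numeric feature matrix; the product attributes are illustrative, and a real catalog would need careful handling of mixed-type characteristics.

```python
# Embed products in 2D so that map distance approximates dissimilarity in
# product characteristics; similar products end up in the same map region.
import numpy as np
from sklearn.manifold import MDS
from sklearn.preprocessing import StandardScaler

# MP3 players described by (price, storage_gb, battery_hours)
products = np.array([
    [49,  4, 10],
    [99, 16, 20],
    [79,  8, 15],
    [199, 32, 30],
])
X = StandardScaler().fit_transform(products)  # put attributes on equal footing

mds = MDS(n_components=2, random_state=0)
coords = mds.fit_transform(X)  # one (x, y) map position per product
print(coords)
```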

    Hierarchical Knowledge-Gradient for Sequential Sampling

    We consider the problem of selecting the best of a finite but very large set of alternatives. Each alternative may be characterized by a multi-dimensional vector and has independent normal rewards. This problem arises in various settings such as (i) ranking and selection, (ii) simulation optimization, where the unknown mean of each alternative is estimated from stochastic simulation output, and (iii) approximate dynamic programming, where we need to estimate values based on Monte Carlo simulation. We use a Bayesian probability model for the unknown reward of each alternative and follow a fully sequential sampling policy called the knowledge-gradient policy. This policy myopically optimizes the expected increment in the value of sampling information in each time period. Because the number of alternatives is large, we propose a hierarchical aggregation technique that uses the common features shared by alternatives to learn about many alternatives from even a single measurement, thus greatly reducing the measurement effort required. We demonstrate how this hierarchical knowledge-gradient policy can be applied to efficiently maximize a continuous function, and we prove that the policy finds a globally optimal alternative in the limit.
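
    A hedged sketch of the basic, non-hierarchical knowledge-gradient factor for independent normal rewards, using the standard closed form nu_i = sigma_tilde_i * f(-|mu_i - max_{j != i} mu_j| / sigma_tilde_i) with f(z) = z * Phi(z) + phi(z); the paper's hierarchical aggregation over shared features is not reproduced here.

```python
# Compute the knowledge-gradient value of measuring each alternative once,
# given independent normal posteriors and known measurement noise.
import numpy as np
from scipy.stats import norm

def knowledge_gradient(mu, sigma, noise_std):
    """mu, sigma: posterior mean/std per alternative; returns KG value per alternative."""
    mu, sigma = np.asarray(mu, float), np.asarray(sigma, float)
    # Std dev of the change in the posterior mean after one noisy measurement:
    # sigma_tilde = sigma^2 / sqrt(sigma^2 + noise^2).
    sigma_tilde = sigma**2 / np.sqrt(sigma**2 + noise_std**2)
    kg = np.empty_like(mu)
    for i in range(len(mu)):
        best_other = np.max(np.delete(mu, i))
        z = -abs(mu[i] - best_other) / sigma_tilde[i]
        kg[i] = sigma_tilde[i] * (z * norm.cdf(z) + norm.pdf(z))
    return kg

mu = [0.0, 0.3, 0.25]       # illustrative posterior means
sigma = [1.0, 0.5, 0.8]     # illustrative posterior std devs
scores = knowledge_gradient(mu, sigma, noise_std=1.0)
print(np.argmax(scores))    # the alternative the KG policy would sample next
```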