
    Tie-breaking in Hoeffding trees

    A thorough examination of the performance of Hoeffding trees, the state of the art in classification for data streams, on a range of datasets reveals that tie breaking, an essential but supposedly rare procedure, is employed much more often than expected. Testing with a lightweight method for handling continuous attributes, we find that the excessive invocation of tie breaking causes performance to degrade significantly on complex and noisy data. Investigating ways to reduce the number of tie breaks, we propose an adaptive method that overcomes the problem without significantly affecting performance on simpler datasets.
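
    As a rough sketch (not the paper's implementation), the standard Hoeffding-tree split decision with tie breaking can be written as follows; the names, the gain values, and the tie threshold tau are illustrative assumptions:

```python
import math

def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    # Epsilon such that, with probability 1 - delta, the observed mean
    # of n values in [0, value_range] is within epsilon of the true mean.
    return math.sqrt((value_range ** 2) * math.log(1.0 / delta) / (2.0 * n))

def should_split(g_best: float, g_second: float, value_range: float,
                 delta: float, n: int, tau: float) -> bool:
    # Split when the best attribute clearly beats the runner-up, or
    # break the tie once the bound shrinks below the tie threshold tau.
    eps = hoeffding_bound(value_range, delta, n)
    return (g_best - g_second > eps) or (eps < tau)
```

    The second disjunct is the tie-breaking rule the abstract refers to: on noisy data the gain difference stays small, so splits happen almost exclusively through `eps < tau`.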

    Propositionalisation of multiple sequence alignments using probabilistic models

    Multiple sequence alignments play a central role in Bioinformatics. Most alignment representations are designed to facilitate knowledge extraction by human experts. Additionally, statistical models such as Profile Hidden Markov Models are used as representations; they offer the advantage of providing sound, probabilistic scores. The basic idea we present in this paper is to use the structure of a Profile Hidden Markov Model for propositionalisation. This way we get a simple, extendable representation of multiple sequence alignments which facilitates further analysis by machine learning algorithms.
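
    A minimal sketch of the propositionalisation idea, under the simplifying assumption that each match state of the profile model corresponds to one alignment column: every sequence becomes one fixed-length feature vector, with one attribute per match column (the function name and the column indices are illustrative, not from the paper):

```python
def propositionalise_alignment(aligned_seqs, match_columns):
    # One propositional feature per assumed match state: the residue
    # each aligned sequence places in that column ('-' for a gap).
    return [[seq[c] for c in match_columns] for seq in aligned_seqs]
```

    A real Profile HMM would instead contribute probabilistic scores per state, but the fixed-length, attribute-value shape of the output is the same.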

    An application of data mining to fruit and vegetable sample identification using Gas Chromatography-Mass Spectrometry

    One of the uses of Gas Chromatography-Mass Spectrometry (GC-MS) is in the detection of pesticide residues in fruit and vegetables. In a high-throughput laboratory there is the potential for sample swaps or mislabelling, as once a sample has been pre-processed to be injected into the GC-MS analyser, it is no longer distinguishable by eye. Possible consequences of such mistakes can be the destruction of large amounts of actually safe produce, or pesticide-contaminated produce reaching the consumer. For the purposes of food safety and traceability, it can also be extremely valuable to know the source (country of origin) of a food product. This can help uncover fraudulent attempts to sell food originating from countries deemed unsafe. In this study, we use the workflow environment ADAMS to examine whether we can determine the fruit/vegetable, and the country of origin of a sample, from a GC-MS chromatogram. A workflow is used to generate data sets using different data pre-processing methods and data representations from a database of over 8000 GC-MS chromatograms, consisting of more than 100 types of fruit and vegetables from more than 120 countries. A variety of classification algorithms are evaluated using the WEKA data mining workbench. We demonstrate excellent results, both for the determination of fruit/vegetable type and for the country of origin, using a histogram of ion counts, and Classification by Regression using Random Regression Forest with PLS-transformed data.
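
    The "histogram of ion counts" representation mentioned above can be sketched roughly as follows; the bin count, intensity ceiling, and function name are assumptions for illustration, not the paper's parameters:

```python
def ion_count_histogram(intensities, n_bins=50, max_intensity=1e6):
    # Fixed-length histogram of ion intensities: a simple,
    # alignment-free feature vector for a GC-MS chromatogram.
    hist = [0] * n_bins
    for x in intensities:
        b = min(int(n_bins * x / max_intensity), n_bins - 1)
        hist[b] += 1
    return hist
```

    Because the vector length is fixed regardless of chromatogram length, the output can be fed directly to standard WEKA classifiers.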

    Multi-label classification using ensembles of pruned sets

    This paper presents a Pruned Sets method (PS) for multi-label classification. It is centred on the concept of treating sets of labels as single labels. This allows the classification process to inherently take into account correlations between labels. By pruning these sets, PS focuses only on the most important correlations, which reduces complexity and improves accuracy. By combining pruned sets in an ensemble scheme (EPS), new label sets can be formed to adapt to irregular or complex data. The results from experimental evaluation on a variety of multi-label datasets show that [E]PS can achieve better performance and train much faster than other multi-label methods.
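
    The core pruning step can be sketched as follows (a simplification: the full PS method also reintroduces frequent subsets of pruned combinations, which is omitted here; the parameter name `p` follows the usual label-powerset convention, not necessarily the paper's notation):

```python
from collections import Counter

def prune_label_sets(label_sets, p):
    # Label-powerset view: each distinct label combination is one class.
    # Keep only combinations seen more than p times in the training data,
    # so the classifier focuses on the most important label correlations.
    counts = Counter(frozenset(s) for s in label_sets)
    return {s for s, c in counts.items() if c > p}
```

    Training examples whose combination was pruned would then be rewritten in terms of the surviving subsets rather than discarded outright.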

    Millions of random rules

    In this paper we report on work in progress based on the induction of vast numbers of almost random rules. This work tries to combine and explore ideas from both Random Forests and Stochastic Discrimination. We describe a fast algorithm for generating almost random rules and study its performance. Rules are generated in such a way that all training examples are each covered by roughly the same number of rules. Rules themselves usually have a clear majority class among the examples they cover, but they are limited in terms of neither minimal coverage nor minimal purity. A preliminary experimental evaluation indicates promising results for both predictive accuracy and speed of induction, but at the expense of both large memory consumption and slow prediction. Finally, we discuss various directions for our future research.
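
    One plausible way to generate such an almost random rule, given only the abstract's description, is to seed it on a random training example so that it is guaranteed non-empty; everything concrete here (seeding strategy, equality tests, the number of conditions) is an illustrative assumption rather than the paper's algorithm:

```python
import random

def random_rule(X, y, n_conditions=2, rng=random):
    # Pick a random seed example and build a conjunction of randomly
    # chosen attribute tests that it satisfies; no minimum coverage or
    # purity is enforced. The rule predicts the majority class of the
    # training examples it covers (at least the seed itself).
    seed = rng.randrange(len(X))
    attrs = rng.sample(range(len(X[0])), n_conditions)
    conditions = [(a, X[seed][a]) for a in attrs]
    covered = [y[i] for i, x in enumerate(X)
               if all(x[a] == v for a, v in conditions)]
    majority = max(set(covered), key=covered.count)
    return conditions, majority
```

    Repeating this millions of times and voting the resulting rules is what connects the idea to Random Forests and Stochastic Discrimination.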

    Predicting polycyclic aromatic hydrocarbon concentrations in soil and water samples

    Polycyclic Aromatic Hydrocarbons (PAHs) are compounds found in the environment that can be harmful to humans. They are typically formed due to incomplete combustion, and as such remain after burning coal, oil, petrol, diesel, wood, household waste and so forth. Testing laboratories routinely screen soil and water samples taken from potentially contaminated sites for PAHs using Gas Chromatography Mass Spectrometry (GC-MS). A GC-MS device produces a chromatogram which is processed by an analyst to determine the concentrations of PAH compounds of interest. In this paper we investigate the application of data mining techniques to PAH chromatograms in order to provide reliable prediction of compound concentrations. A workflow engine with an easy-to-use graphical user interface is at the heart of processing the data. This engine allows a domain expert to set up workflows that can load the data, preprocess it in parallel in various ways, and convert it into data suitable for data mining toolkits. The generated output can then be evaluated using different data mining techniques, to determine the impact of preprocessing steps on the performance of the generated models and to pick the best approach. Encouraging results for predicting PAH compound concentrations, in terms of correlation coefficients and root-mean-squared error, are demonstrated.

    Morbidity and mortality in coeliac disease.

    Coeliac disease is a small intestinal immune-mediated enteropathy precipitated by exposure to gluten, a protein complex in the cereals wheat, barley and rye, in genetically susceptible people. Once considered an uncommon disorder restricted to children of European descent, it is now known to be one of the most common chronic diseases encountered in the Western world, with a serological prevalence of 1% that can be diagnosed at any age. Since it is so common, much co-morbidity comprising malignant and non-malignant conditions will occur in association with it. Malignant complications, particularly lymphoma, were first described over 50 years ago, but the natural history and how commonly these occurred were unknown until relatively recently. Similarly, many non-malignant conditions were known to occur, but initially the risks were unclear. It was not until the frequency of coeliac disease could be determined accurately in the community, and population-based studies of morbidity and mortality in coeliac disease patients carried out in defined cohorts, that these questions could be answered. My research into these aspects of coeliac disease began in 1971, and a body of 35 of my publications spanning the years 1974 to 2018 on the morbidity and mortality of the disorder is presented in this thesis. I have introduced my research findings at many international and national meetings, and these data have been influential in shaping the research agenda of other workers. One of my papers (Publication 9), published in Gut in 1989, was the most cited of all papers which appeared in the journal for that year. To date it has been cited 1122 times. Information exists for 24 papers presented here, and for these the total number of citations stands at 3,887. This excludes references to book chapters. Anecdotal evidence indicates frequent mentions in lectures and clinical practice.

    Batch-Incremental Learning for Mining Data Streams

    The data stream model for data mining places harsh restrictions on a learning algorithm. First, a model must be induced incrementally. Second, processing time for instances must keep up with their speed of arrival. Third, a model may only use a constant amount of memory, and must be ready for prediction at any point in time. We attempt to overcome these restrictions by presenting a data stream classification algorithm where the data is split into a stream of disjoint batches. Single batches of data can be processed one after the other by any standard non-incremental learning algorithm. Our approach uses ensembles of decision trees. These tree ensembles are iteratively merged into a single interpretable model of constant maximal size. Using benchmark datasets, the algorithm is evaluated for accuracy against state-of-the-art algorithms that make use of the entire dataset.
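
    The batch-incremental wrapper idea can be sketched in a few lines; the generator shape and the `train` callback are illustrative assumptions (the paper's iterative merging of tree ensembles into one bounded model is not shown here):

```python
def batch_incremental(stream, batch_size, train):
    # Consume a stream in disjoint, fixed-size batches. Each full
    # batch is handed to a standard non-incremental learner via
    # train(batch), yielding one model per batch; a trailing partial
    # batch is simply not trained on.
    batch = []
    for instance in stream:
        batch.append(instance)
        if len(batch) == batch_size:
            yield train(batch)
            batch = []
```

    Keeping each batch small and discarding it after training is what lets a batch learner satisfy the constant-memory constraint of the stream setting.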

    Mining data streams using option trees (revised edition, 2004)

    The data stream model for data mining places harsh restrictions on a learning algorithm. A model must be induced following the briefest interrogation of the data, must use only available memory, and must update itself over time within these constraints. Additionally, the model must be usable for data mining at any point in time. This paper describes a data stream classification algorithm using an ensemble of option trees. The ensemble of trees is induced by boosting and iteratively combined into a single interpretable model. The algorithm is evaluated using benchmark datasets for accuracy against state-of-the-art algorithms that make use of the entire dataset.

    Cache Hierarchy Inspired Compression: a Novel Architecture for Data Streams

    We present an architecture for data streams based on structures typically found in web cache hierarchies. The main idea is to build a meta-level analyser from a number of levels constructed over time from a data stream. We present the general architecture for such a system and an application to classification. This architecture is an instance of the general wrapper idea, allowing us to reuse standard batch learning algorithms in an inherently incremental learning environment. By artificially generating data sources we demonstrate that a hierarchy containing a mixture of models is able to adapt over time to the source of the data. In these experiments the hierarchies use an elementary performance-based replacement policy and unweighted voting for making classification decisions.
    • ā€¦
    corecore