334 research outputs found

    A Taxonomy of Big Data for Optimal Predictive Machine Learning and Data Mining

    Big data comes in various ways, types, shapes, forms and sizes. Indeed, almost all areas of science, technology, medicine, public health, economics, business, linguistics and social science are bombarded by ever-increasing flows of data begging to be analyzed efficiently and effectively. In this paper, we propose a rough idea of a possible taxonomy of big data, along with some of the most commonly used tools for handling each particular category of bigness. The dimensionality p of the input space and the sample size n are usually the main ingredients in the characterization of data bigness. The specific statistical machine learning technique used to handle a particular big data set will depend on which category of the bigness taxonomy it falls into. Large p, small n data sets, for instance, require a different set of tools from the large n, small p variety. Among other tools, we discuss Preprocessing, Standardization, Imputation, Projection, Regularization, Penalization, Compression, Reduction, Selection, Kernelization, Hybridization, Parallelization, Aggregation, Randomization, Replication and Sequentialization. Indeed, it is important to emphasize right away that the so-called no free lunch theorem applies here, in the sense that there is no universally superior method that outperforms all other methods on all categories of bigness. It is also important to stress that simplicity, in the sense of Ockham's razor non-plurality principle of parsimony, tends to reign supreme when it comes to massive data. We conclude with a comparison of the predictive performance of some of the most commonly used methods on a few data sets. (18 pages, 2 figures, 3 tables)
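    The large p, small n versus large n, small p distinction can be made concrete with a minimal sketch. The regime thresholds and the particular scikit-learn estimators below are illustrative assumptions, not recommendations taken from the paper.

    # Illustrative sketch: picking a tool by where a data set falls in an (n, p)
    # "bigness" taxonomy. Thresholds and method choices are assumptions only.
    from sklearn.linear_model import LassoCV, LinearRegression, SGDRegressor

    def pick_regressor(n: int, p: int):
        """Return an estimator suited to the (n, p) regime of the data."""
        if p >= n:                        # large p, small n: penalization / selection
            return LassoCV(cv=5)
        if n > 100_000 and p < 100:       # large n, small p: sequential, scalable fitting
            return SGDRegressor(penalty="l2")
        return LinearRegression()         # moderate n and p: ordinary least squares

    # A 50-sample, 500-feature problem gets a sparsity-inducing model.
    print(type(pick_regressor(n=50, p=500)).__name__)   # -> LassoCV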

    Majority Voting by Independent Classifiers Can Increase Error Rates

    The technique of "majority voting" of classifiers is used in machine learning with the aim of constructing a new combined classification rule that has better characteristics than any of a given set of rules. The "Condorcet Jury Theorem" is often cited, incorrectly, as support for a claim that this practice leads to an improved classifier (i.e., one with smaller error probabilities) when the given classifiers are sufficiently good and are uncorrelated. We specifically address the case of two-category classification, and argue that a correct claim can be made for independent (not just uncorrelated) classification errors (not the classifiers themselves), and offer an example demonstrating that the common claim is false. Supplementary materials for this article are available online.
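    A small simulation makes the point concrete. The error probabilities below are illustrative assumptions rather than the paper's example: classifier A errs independently 10% of the time, while B and C err 40% of the time with exactly coinciding errors, so the majority vote inherits the 40% error rate and is worse than A alone.

    # Illustrative simulation: with dependent errors, majority voting can be
    # worse than the best single classifier. All probabilities are assumed.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 100_000
    err_a = rng.random(n) < 0.10          # A errs independently with probability 0.1
    shared = rng.random(n) < 0.40         # B and C err with probability 0.4,
    err_b, err_c = shared, shared.copy()  # and their errors coincide exactly

    # The majority vote is wrong whenever at least two of the three are wrong.
    err_vote = (err_a.astype(int) + err_b + err_c) >= 2

    print(f"best single classifier error: {err_a.mean():.3f}")     # ~0.10
    print(f"majority vote error:          {err_vote.mean():.3f}")  # ~0.40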

    Improving acoustic vehicle classification by information fusion

    We present an information fusion approach for ground vehicle classification based on the emitted acoustic signal. Many acoustic factors can contribute to the classification accuracy of working ground vehicles. Classification relying on a single feature set may lose some useful information if its underlying sound production model is not comprehensive. To improve classification accuracy, we consider an information fusion diagram, in which various aspects of an acoustic signature are taken into account and emphasized separately by two different feature extraction methods. The first set of features aims to represent internal sound production, and a number of harmonic components are extracted to characterize the factors related to the vehicle's resonance. The second set of features is extracted based on a computationally efficient discriminatory analysis, and a group of key frequency components is selected by mutual information, accounting for the sound production from the vehicle's exterior parts. In correspondence with this structure, we further put forward a modified Bayesian fusion algorithm, which takes advantage of matching each specific feature set with its favored classifier. To assess the proposed approach, experiments are carried out based on a data set containing acoustic signals from different types of vehicles. Results indicate that the fusion approach can effectively increase classification accuracy compared to that achieved using each individual feature set alone. The Bayesian-based decision-level fusion is also found to outperform a feature-level fusion approach.
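    The abstract does not spell out the modified Bayesian fusion rule, so the sketch below shows only a generic decision-level combination of per-class posteriors from two classifiers under a conditional-independence assumption; the function name and the toy numbers are illustrative.

    # Generic decision-level fusion sketch (not the paper's modified Bayesian
    # rule): combine per-class posteriors from two classifiers via Bayes' rule
    # under a conditional-independence assumption.
    import numpy as np

    def fuse_posteriors(p1: np.ndarray, p2: np.ndarray, prior: np.ndarray) -> np.ndarray:
        """p1, p2: (n_samples, n_classes) posteriors from the two classifiers;
        prior: (n_classes,) class priors. Returns fused, renormalized posteriors."""
        fused = p1 * p2 / prior            # posterior is proportional to p1 * p2 / prior
        return fused / fused.sum(axis=1, keepdims=True)

    # Toy example: the harmonic-feature and key-frequency-feature classifiers disagree.
    p_harmonic = np.array([[0.7, 0.3]])
    p_spectral = np.array([[0.4, 0.6]])
    print(fuse_posteriors(p_harmonic, p_spectral, prior=np.array([0.5, 0.5])))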

    Specializing for predicting obesity and its co-morbidities

    We present specializing, a method for combining classifiers for multi-class classification. Specializing trains one specialist classifier per class and utilizes each specialist to distinguish that class from all others in a one-versus-all manner. It then supplements the specialist classifiers with a catch-all classifier that performs multi-class classification across all classes. We refer to the resulting combined classifier as a specializing classifier. We develop specializing to classify 16 diseases based on discharge summaries. For each discharge summary, we aim to predict whether each disease is present, absent, or questionable in the patient, or unmentioned in the discharge summary. We treat the classification of each disease as an independent multi-class classification task. For each disease, we develop one specialist classifier for each of the present, absent, questionable, and unmentioned classes; we supplement these specialist classifiers with a catch-all classifier that encompasses all of the classes for that disease. We evaluate specializing on each of the 16 diseases and show that it improves significantly over voting and stacking when used for multi-class classification on our data.
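    A minimal sketch of the specialist-plus-catch-all structure is given below. The abstract does not state how specialist and catch-all outputs are combined, so the equal-weight averaging used here, as well as the synthetic data and logistic-regression base learners, are assumptions.

    # Sketch of a "specializing" combination: one one-versus-all specialist per
    # class plus a catch-all multi-class classifier; the averaging rule is assumed.
    import numpy as np
    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression

    X, y = make_classification(n_samples=600, n_features=20, n_informative=10,
                               n_classes=4, n_clusters_per_class=1, random_state=0)
    classes = np.unique(y)

    # One specialist per class, trained to separate that class from all others.
    specialists = {c: LogisticRegression(max_iter=1000).fit(X, (y == c).astype(int))
                   for c in classes}
    # Catch-all classifier over all classes at once.
    catch_all = LogisticRegression(max_iter=1000).fit(X, y)

    def predict_specializing(X_new):
        spec = np.column_stack([specialists[c].predict_proba(X_new)[:, 1] for c in classes])
        spec = spec / spec.sum(axis=1, keepdims=True)         # renormalize specialist scores
        combined = 0.5 * spec + 0.5 * catch_all.predict_proba(X_new)
        return classes[np.argmax(combined, axis=1)]

    print(predict_specializing(X[:5]), y[:5])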

    Neonatal seizure detection based on single-channel EEG: instrumentation and algorithms

    Seizure activity in the perinatal period, which constitutes the most common neurological emergency in the neonate, can cause brain disorders later in life, or even death, depending on its severity. This issue remains unsolved to date, despite several attempts to tackle it using numerous methods. Therefore, a method is still needed that can enable neonatal cerebral activity monitoring to identify those at risk. Currently, electroencephalography (EEG) and amplitude-integrated EEG (aEEG) have been exploited for the identification of seizures in neonates; however, both lack automation. EEG and aEEG are mainly visually analysed, requiring a specific skill set and, as a result, the presence of an expert on a 24/7 basis, which is not feasible. Additionally, EEG devices employed in neonatal intensive care units (NICU) are mainly designed around adults, meaning that their design specifications are not neonate-specific, including their size, owing to the multi-channel requirement for adults (the minimum requirement for adults is ≥ 32 channels, while the gold standard in neonates is 10); they are bulky and occupy significant space in the NICU. This thesis addresses the challenge of reliably, efficiently and effectively detecting seizures in the neonatal brain in a fully automated manner. Two novel instruments and two novel neonatal seizure detection algorithms (SDAs) are presented. The first instrument, named PANACEA, is a high-performance, wireless, wearable and portable multi-instrument, able to record neonatal EEG as well as a plethora of (bio)signals. Despite its high-performance characteristics and ability to record EEG, this device is mostly suggested for the concurrent monitoring of other vital biosignals, such as electrocardiogram (ECG) and respiration, which provide vital information about a neonate's medical condition. The two aforementioned biosignals constitute two of the most important artefacts in the EEG, and their concurrent acquisition benefits the SDA by providing information to an artefact removal algorithm. The second instrument, called neoEEG Board, is an ultra-low-noise, wireless, portable and high-precision neonatal EEG recording instrument. It is able to detect and record minute signals (< 10 nVp), enabling cerebral activity monitoring even from lower layers in the cortex. The neoEEG Board accommodates 8 inputs, each equipped with a patent-pending tunable filter topology, which allows passband formation based on the application. Both the PANACEA and the neoEEG Board are able to host low- to middle-complexity SDAs, and they can operate continuously for at least 8 hours on three AA batteries. Along with PANACEA and the neoEEG Board, two novel neonatal SDAs have been developed. The first one, termed G prime-smoothed (G′_s), is an on-line, automated, patient-specific, single-feature, single-channel EEG-based SDA. The G′_s SDA is enabled by the invention of a novel feature, termed G prime (G′), which can be characterised as an energy operator. The trace that G′_s creates can also be used as a visualisation tool because of its distinct change in the presence of a seizure. Finally, the second SDA is machine learning (ML)-based and uses numerous features and a support vector machine (SVM) classifier. It can be characterised as automated, on-line and patient-independent, and, similarly to G′_s, it makes use of single-channel EEG. The proposed neonatal SDA introduces the use of the Hilbert-Huang transform (HHT) in the field of neonatal seizure detection. The HHT analyses the non-linear and non-stationary EEG signal, providing information about the signal as it evolves. Through the use of the HHT, novel features, such as the per-intrinsic-mode-function (IMF) (0-3 Hz) sub-band power, were also employed. Detection rates of this novel neonatal SDA are comparable to those of multi-channel SDAs.
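    A minimal sketch of a window-based, single-channel, ML-style detector in this spirit is given below. It substitutes a simple 0-3 Hz Welch band-power feature for the thesis's HHT/per-IMF features, and it runs on a synthetic signal with made-up labels; the sampling rate, window length and classifier settings are all assumptions.

    # Window-based EEG seizure detection sketch: 0-3 Hz band power + SVM.
    # Synthetic data only; stands in for the HHT-based features of the thesis.
    import numpy as np
    from scipy.signal import welch
    from sklearn.svm import SVC

    fs = 256                           # sampling rate in Hz (assumed)
    win = 4 * fs                       # 4-second analysis windows
    rng = np.random.default_rng(0)

    def band_power(segment, lo=0.0, hi=3.0):
        """Mean spectral power of one EEG window in the lo-hi Hz band."""
        f, pxx = welch(segment, fs=fs, nperseg=win // 2)
        return pxx[(f >= lo) & (f <= hi)].mean()

    # Synthetic single-channel EEG: noise, plus slow high-amplitude activity
    # added to the windows labelled as "seizure".
    t = np.arange(win) / fs
    labels = rng.integers(0, 2, 200)
    windows = [rng.normal(0, 1, win) + lab * 3 * np.sin(2 * np.pi * 2 * t) for lab in labels]

    features = np.array([[band_power(w)] for w in windows])
    clf = SVC(kernel="rbf").fit(features[:150], labels[:150])
    print("held-out accuracy:", clf.score(features[150:], labels[150:]))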

    Using Formal Methods for Autonomous Systems: Five Recipes for Formal Verification

    Formal Methods are mathematically based techniques for software design and engineering, which enable the unambiguous description of and reasoning about a system's behaviour. Autonomous systems use software to make decisions without human control, are often embedded in a robotic system, are often safety-critical, and are increasingly being introduced into everyday settings. Autonomous systems need robust development and verification methods, but formal methods practitioners are often asked: Why use Formal Methods for Autonomous Systems? To answer this question, this position paper describes five recipes for formally verifying aspects of an autonomous system, collected from the literature. The recipes are examples of how Formal Methods can be an effective tool for the development and verification of autonomous systems. During design, they enable unambiguous description of requirements; in development, formal specifications can be verified against requirements; software components may be synthesised from verified specifications; and behaviour can be monitored at runtime and compared to its original specification. Modern Formal Methods often include highly automated tool support, which enables exhaustive checking of a system's state space. This paper argues that Formal Methods are a powerful tool for the repertoire of development techniques for safe autonomous systems, alongside other robust software engineering techniques. (Accepted at the Journal of Risk and Reliability)
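    Of these recipes, runtime monitoring is the easiest to illustrate in a few lines. The sketch below checks an execution trace against a simple safety property ("after an obstacle is detected, brake within two events"); the events and the property are assumptions, not taken from the paper.

    # Runtime-monitoring sketch: check a finite trace against a simple safety
    # property. Events and the property are illustrative assumptions.
    def monitor(trace, deadline=2):
        """True iff every 'obstacle' event is followed by a 'brake' within `deadline` events."""
        pending = None                       # events remaining in which to brake, or None
        for event in trace:
            if pending is not None:
                if event == "brake":
                    pending = None           # obligation discharged
                else:
                    pending -= 1
                    if pending == 0:
                        return False         # deadline missed: property violated
            if event == "obstacle":
                pending = deadline           # new obligation: brake within `deadline` events
        return pending is None               # an unresolved obligation at end counts as a violation

    print(monitor(["move", "obstacle", "brake", "move"]))          # True: braked in time
    print(monitor(["move", "obstacle", "move", "move", "move"]))   # False: deadline missed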

    Comparison of classification algorithms to predict outcomes of feedlot cattle identified and treated for Bovine Respiratory Disease

    Bovine respiratory disease (BRD) continues to be the primary cause of morbidity and mortality in feedyard cattle. Accurate identification of those animals that will not finish the production cycle normally following initial treatment for BRD would provide feedyard managers with opportunities to more effectively manage those animals. Our objectives were to assess the ability of different classification algorithms to accurately predict an individual calf's outcome based on data available at first identification of and treatment for BRD, and also to identify characteristics of calves for which predictive models performed well, as gauged by accuracy. Data from 23 feedyards in multiple geographic locations within the U.S. from 2000 to 2009, representing over one million animals, were analyzed to identify animals clinically diagnosed with BRD and treated with an antimicrobial. These data were analyzed both as a single dataset and as multiple datasets based on individual feedyards, and were partitioned into training, testing, and validation datasets. Classifiers were trained and optimized to identify calves that did not finish the production cycle with their cohort. Following classifier training, accuracy was evaluated using validation data. Analysis was also done to identify sub-groups of calves within populations where classifiers performed better compared to other sub-groups. Accuracy of individual classifiers varied by dataset. The accuracy of the best performing classifier by dataset ranged from a low of 63% in one dataset up to 95% in a different dataset. Sub-groups of calves were identified within some datasets where the accuracy of a classifier was greater than 98%; however, these accuracies must be interpreted in relation to the prevalence of the class of interest within those populations. We found that by pairing the correct classifier with the data available, accurate predictions could be made that would provide feedlot managers with valuable information.
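    The caveat about prevalence can be made concrete with a small sketch: with an assumed 2% rate of animals not finishing the cycle, a trivial rule that predicts "finished" for every calf already scores about 98% accuracy, so a classifier's headline accuracy must be read against that baseline. The numbers below are illustrative, not the study's.

    # Illustrative baseline: with a rare class of interest, high accuracy is
    # cheap. All numbers are assumed, not taken from the study.
    import numpy as np

    rng = np.random.default_rng(0)
    n = 10_000
    prevalence = 0.02                              # assumed share of calves not finishing the cycle
    y = rng.random(n) < prevalence                 # True = did not finish the production cycle

    # Trivial rule: predict "finished normally" for every animal.
    baseline_accuracy = np.mean(~y)                # equals 1 - prevalence of the class of interest
    print(f"majority-class baseline accuracy: {baseline_accuracy:.3f}")   # ~0.98
    # A classifier reporting 98% accuracy on such a sub-group adds little unless
    # it also recovers the rare "did not finish" animals (e.g., check its recall).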