1,099 research outputs found

    MRQAR: A generic MapReduce framework to discover quantitative association rules in big data problems

    Get PDF
    Many algorithms have emerged to address the discovery of quantitative association rules from datasets in the last years. However, this task is becoming a challenge because the processing power of most existing techniques is not enough to handle the large amount of data generated nowadays. These vast amounts of data are known as Big Data. A number of previous studies have been focused on mining boolean or nominal association rules from Big Data problems, nevertheless, the data in real-world applications usually consist of quantitative values and designing data mining algorithms able to extract quantitative association rules presents a challenge to workers in this research field. In spite of the fact that we can find classical methods to discover boolean or nominal association rules in the most well-known repositories of Big Data algorithms, such repositories do not provide methods to discover quantitative association rules. Indeed, no methodologies have been proposed in the literature without prior discretization in Big Data. Hence, this work proposes MRQAR, a new generic parallel framework to discover quantitative association rules in large amounts of data, designed following the MapReduce paradigm using Apache Spark. MRQAR performs an incremental learning able to run any sequential quantitative association rule algorithm in Big Data problems without needing to redesign such algorithms. As a case study, we have integrated the multiobjective evolutionary algorithm MOPNAR into MRQAR to validate the generic MapReduce framework proposed in this work. The results obtained in the experimental study performed on five Big Data problems prove the capability of MRQAR to obtain reduced set of high quality rules in reasonable time.Ministerio de Economía y Competitividad TIN2017-89517-PMinisterio de Economía y Competitividad TIN2014-55894-C2-1-RMinisterio de Economía y Competitividad TIN2017-88209-C2-2-

    Rule-based preprocessing for data stream mining using complex event processing

    Get PDF
    Data preprocessing is known to be essential to produce accurate data from which mining methods are able to extract valuable knowledge. When data constantly arrives from one or more sources, preprocessing techniques need to be adapted to efficiently handle these data streams. To help domain experts to define and execute preprocessing tasks for data streams, this paper proposes the use of active rule-based systems and, more specifically, complex event processing (CEP) languages and engines. The main contribution of our approach is the formulation of preprocessing procedures as event detection rules, expressed in an SQL-like language, that provide domain experts a simple way to manipulate temporal data. This idea is materialized into a publicly available solution that integrates a CEP engine with a library for online data mining. To evaluate our approach, we present three practical scenarios in which CEP rules preprocess data streams with the aim of adding temporal information, transforming features and handling missing values. Experiments show how CEP rules provide an effective language to express preprocessing tasks in a modular and high-level manner, without significant time and memory overheads. The resulting data streams do not only help improving the predictive accuracy of classification algorithms, but also allow reducing the complexity of the decision models and the time needed for learning in some cases

    Knowledge Generation with Rule Induction in Cancer Omics

    Get PDF
    The explosion of omics data availability in cancer research has boosted the knowledge of the molecular basis of cancer, although the strategies for its definitive resolution are still not well established. The complexity of cancer biology, given by the high heterogeneity of cancer cells, leads to the development of pharmacoresistance for many patients, hampering the efficacy of therapeutic approaches. Machine learning techniques have been implemented to extract knowledge from cancer omics data in order to address fundamental issues in cancer research, as well as the classification of clinically relevant sub-groups of patients and for the identification of biomarkers for disease risk and prognosis. Rule induction algorithms are a group of pattern discovery approaches that represents discovered relationships in the form of human readable associative rules. The application of such techniques to the modern plethora of collected cancer omics data can effectively boost our understanding of cancer-related mechanisms. In fact, the capability of these methods to extract a huge amount of human readable knowledge will eventually help to uncover unknown relationships between molecular attributes and the malignant phenotype. In this review, we describe applications and strategies for the usage of rule induction approaches in cancer omics data analysis. In particular, we explore the canonical applications and the future challenges and opportunities posed by multi-omics integration problems.Peer reviewe

    ur-CAIM: Improved CAIM Discretization for Unbalanced and Balanced Data

    Get PDF
    Supervised discretization is one of basic data preprocessing techniques used in data mining. CAIM (Class- Attribute InterdependenceMaximization) is a discretization algorithm of data for which the classes are known. However, new arising challenges such as the presence of unbalanced data sets, call for new algorithms capable of handling them, in addition to balanced data. This paper presents a new discretization algorithm named ur-CAIM, which improves on the CAIM algorithm in three important ways. First, it generates more flexible discretization schemes while producing a small number of intervals. Second, the quality of the intervals is improved based on the data classes distribution, which leads to better classification performance on balanced and, especially, unbalanced data. Third, the runtime of the algorithm is lower than CAIM’s. The algorithm has been designed free-parameter and it self-adapts to the problem complexity and the data class distribution. The ur-CAIM was compared with 9 well-known discretization methods on 28 balanced, and 70 unbalanced data sets. The results obtained were contrasted through non-parametric statistical tests, which show that our proposal outperforms CAIM and many of the other methods on both types of data but especially on unbalanced data, which is its significant advantage

    Intelligent network intrusion detection using an evolutionary computation approach

    Get PDF
    With the enormous growth of users\u27 reliance on the Internet, the need for secure and reliable computer networks also increases. Availability of effective automatic tools for carrying out different types of network attacks raises the need for effective intrusion detection systems. Generally, a comprehensive defence mechanism consists of three phases, namely, preparation, detection and reaction. In the preparation phase, network administrators aim to find and fix security vulnerabilities (e.g., insecure protocol and vulnerable computer systems or firewalls), that can be exploited to launch attacks. Although the preparation phase increases the level of security in a network, this will never completely remove the threat of network attacks. A good security mechanism requires an Intrusion Detection System (IDS) in order to monitor security breaches when the prevention schemes in the preparation phase are bypassed. To be able to react to network attacks as fast as possible, an automatic detection system is of paramount importance. The later an attack is detected, the less time network administrators have to update their signatures and reconfigure their detection and remediation systems. An IDS is a tool for monitoring the system with the aim of detecting and alerting intrusive activities in networks. These tools are classified into two major categories of signature-based and anomaly-based. A signature-based IDS stores the signature of known attacks in a database and discovers occurrences of attacks by monitoring and comparing each communication in the network against the database of signatures. On the other hand, mechanisms that deploy anomaly detection have a model of normal behaviour of system and any significant deviation from this model is reported as anomaly. This thesis aims at addressing the major issues in the process of developing signature based IDSs. These are: i) their dependency on experts to create signatures, ii) the complexity of their models, iii) the inflexibility of their models, and iv) their inability to adapt to the changes in the real environment and detect new attacks. To meet the requirements of a good IDS, computational intelligence methods have attracted considerable interest from the research community. This thesis explores a solution to automatically generate compact rulesets for network intrusion detection utilising evolutionary computation techniques. The proposed framework is called ESR-NID (Evolving Statistical Rulesets for Network Intrusion Detection). Using an interval-based structure, this method can be deployed for any continuous-valued input data. Therefore, by choosing appropriate statistical measures (i.e. continuous-valued features) of network trafc as the input to ESRNID, it can effectively detect varied types of attacks since it is not dependent on the signatures of network packets. In ESR-NID, several innovations in the genetic algorithm were developed to keep the ruleset small. A two-stage evaluation component in the evolutionary process takes the cooperation of rules into consideration and results into very compact, easily understood rulesets. The effectiveness of this approach is evaluated against several sources of data for both detection of normal and abnormal behaviour. The results are found to be comparable to those achieved using other machine learning methods from both categories of GA-based and non-GA-based methods. One of the significant advantages of ESR-NIS is that it can be tailored to specific problem domains and the characteristics of the dataset by the use of different fitness and performance functions. This makes the system a more flexible model compared to other learning techniques. Additionally, an IDS must adapt itself to the changing environment with the least amount of configurations. ESR-NID uses an incremental learning approach as new flow of traffic become available. The incremental learning approach benefits from less required storage because it only keeps the generated rules in its database. This is in contrast to the infinitely growing size of repository of raw training data required for traditional learning

    Exploring the Concept of the Digital Educator During COVID-19

    Get PDF
    T In many machine learning classification problems, datasets are usually of high dimensionality and therefore require efficient and effective methods for identifying the relative importance of their attributes, eliminating the redundant and irrelevant ones. Due to the huge size of the search space of the possible solutions, the attribute subset evaluation feature selection methods are not very suitable, so in these scenarios feature ranking methods are used. Most of the feature ranking methods described in the literature are univariate methods, which do not detect interactions between factors. In this paper, we propose two new multivariate feature ranking methods based on pairwise correlation and pairwise consistency, which have been applied for cancer gene expression and genotype-tissue expression classification tasks using public datasets. We statistically proved that the proposed methods outperform the state-of-the-art feature ranking methods Clustering Variation, Chi Squared, Correlation, Information Gain, ReliefF and Significance, as well as other feature selection methods for attribute subset evaluation based on correlation and consistency with the multi-objective evolutionary search strategy, and with the embedded feature selection methods C4.5 and LASSO. The proposed methods have been implemented on the WEKA platform for public use, making all the results reported in this paper repeatable and replicabl

    Stock market prediction using weighted inter-transaction class association rule mining and evolutionary algorithm

    Get PDF
    Evolutionary computation and data mining are two fascinating fields that have attracted many researchers. This paper proposes a new rule mining method, named genetic network programming (GNP), to solve the prediction problem using the evolutionary algorithm. Compared with the conventional association rule methods that do not consider the weight factor, the proposed algorithm provides many advantages in financial prediction, since it can discover relationships among the attributes of different transactions. Experimental results on data from the New York Exchange Market show that the new method outperforms other conventional models in terms of both accuracy and profitability, and the proposed method can establish more important and accurate rules than the conventional methods. The results confirmed the effectiveness of the proposed data mining method in financial prediction
    corecore