
    On Influence of Representations of Discretized Data on Performance of a Decision System

    When discretization is used for preprocessing datasets in a decision system, different representations of data can be taken into consideration. The typical approach is to use the data as returned by the discretizer, namely as nominal values. In specific cases, however, such a form of data cannot be utilized by subsequent modules of the decision system. A possible solution is then to convert the nominal data back into a numerical form. The paper presents a comparison of such approaches applied to different classifiers in the stylometry domain.
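
    A minimal sketch of the two representations discussed above, assuming an equal-frequency discretizer and an arbitrary benchmark dataset (the wine data) as stand-ins; the bin count and classifiers are illustrative choices, not the paper's setup.

```python
# Illustrative comparison of nominal vs. re-numericised representations of
# discretised data; dataset, bin count, and classifiers are assumptions.
import numpy as np
from sklearn.datasets import load_wine
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import CategoricalNB, GaussianNB
from sklearn.preprocessing import KBinsDiscretizer

X, y = load_wine(return_X_y=True)

# Discretise each continuous feature into 5 equal-frequency bins.
disc = KBinsDiscretizer(n_bins=5, encode="ordinal", strategy="quantile")
X_nom = disc.fit_transform(X).astype(int)  # bin indices treated as nominal values

# Representation 1: nominal categories, used directly by a categorical model.
nominal_acc = cross_val_score(CategoricalNB(min_categories=5), X_nom, y, cv=5).mean()

# Representation 2: convert the bins back to numbers (bin midpoints) for a
# module that only accepts numerical input.
edges = disc.bin_edges_
X_num = np.column_stack([
    (edges[j][X_nom[:, j]] + edges[j][X_nom[:, j] + 1]) / 2
    for j in range(X.shape[1])
])
numeric_acc = cross_val_score(GaussianNB(), X_num, y, cv=5).mean()

print(f"nominal: {nominal_acc:.3f}  numerical: {numeric_acc:.3f}")
```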

    Differential diagnosis of Erythemato-Squamous Diseases using classification and regression tree

    Introduction: Differential diagnosis of Erythemato-Squamous Diseases (ESD) is a major challenge in the field of dermatology. The ESD are placed into six different classes. Data mining is the process of detecting hidden patterns; in the case of ESD, it can help us predict the disease. Different algorithms have been developed for this purpose. Objective: We aimed to use the Classification and Regression Tree (CART) to predict the differential diagnosis of ESD. Methods: We used the Cross Industry Standard Process for Data Mining (CRISP-DM) methodology. For this purpose, the dermatology data set was obtained from the UCI machine learning repository. The Clementine 12.0 software from IBM was used for modelling. In order to evaluate the model, we calculated its accuracy, sensitivity and specificity. Results: The proposed model had an accuracy of 94.84 (standard deviation: 24.42) for correct prediction of the ESD class. Conclusions: Results indicated that using this classifier could be useful, but it is strongly recommended that a combination of machine learning methods be considered, as it could be more useful for prediction of ESD. © 2016 Keivan Maghooli, Mostafa Langarizadeh, Leila Shahmoradi, Mahdi Habibi-koolaee, Mohamad Jebraeily, and Hamid Bouraghi
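
    As a hedged sketch of the general workflow (a CART tree evaluated on the UCI dermatology data), the snippet below uses scikit-learn rather than Clementine; the local file name and column layout follow the public UCI record and are assumptions here.

```python
# CART-style decision tree on the UCI dermatology data (6 ESD classes).
# "dermatology.data" is assumed to be a local copy of the UCI file:
# 34 attributes followed by the class label, with '?' marking missing ages.
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("dermatology.data", header=None, na_values="?")
X = df.iloc[:, :-1].fillna(df.iloc[:, :-1].median())  # impute missing ages
y = df.iloc[:, -1]

cart = DecisionTreeClassifier(criterion="gini", random_state=0)
scores = cross_val_score(cart, X, y, cv=10, scoring="accuracy")
print(f"10-fold accuracy: {scores.mean():.3f} +/- {scores.std():.3f}")
```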

    Messaging Forensic Framework for Cybercrime Investigation

    Online predators, botmasters, and terrorists abuse the Internet and associated web technologies by conducting illegitimate activities such as bullying, phishing, and threatening. These activities often involve online messages between a criminal and a victim, or between criminals themselves. The forensic analysis of online messages to collect empirical evidence that can be used to prosecute cybercriminals in a court of law is one way to minimize most cybercrimes. The challenge is to develop innovative tools and techniques to precisely analyze large volumes of suspicious online messages. We develop a forensic analysis framework to help an investigator analyze the textual content of online messages with two main objectives. First, we apply our novel authorship analysis techniques for collecting patterns of authorial attributes to address the problem of anonymity in online communication. Second, we apply the proposed knowledge discovery and semantic analysis techniques for identifying criminal networks and their illegal activities. The focus of the framework is to collect credible, intuitive, and interpretable evidence for both technical and non-technical professional experts, including law enforcement personnel and jury members. To evaluate our proposed methods, we share our collaborative work with a local law enforcement agency. The experimental results on real-life data suggest that the presented forensic analysis framework is effective for cybercrime investigation.
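
    The snippet below is only a generic stylometric baseline for the authorship-analysis step, assuming character n-gram features and a linear classifier; it is not the framework's own technique, and the toy messages and author labels are invented.

```python
# Generic authorship-attribution baseline: character n-gram features capture
# authorial habits (spelling, punctuation, abbreviations) in short messages.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

messages = [
    "meet me at noon, usual place",
    "u there? reply asap!!",
    "the package arrives tomorrow evening",
    "why u ignoring me??",
]
authors = ["A", "B", "A", "B"]

model = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),
    LinearSVC(),
)
model.fit(messages, authors)
print(model.predict(["pls reply asap u promised"]))  # likely attributed to author "B"
```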

    A fast supervised density-based discretization algorithm for classification tasks in the medical domain

    Discretization is a preprocessing technique for converting continuous features into categorical ones. This step is essential for algorithms that cannot handle continuous data as input. In addition, in the big data era, it is important for a discretizer to be able to discretize data efficiently. In this paper, a new supervised density-based discretization (DBAD) algorithm is proposed, which satisfies these requirements. For the evaluation of the algorithm, 11 datasets that cover a wide range of datasets in the medical domain were used. The proposed algorithm was tested against three state-of-the-art discretizers using three classifiers with different characteristics. A parallel version of the algorithm was evaluated using two synthetic big datasets. In the majority of the performed tests, the algorithm was found to perform statistically similarly to or better than the three discretization algorithms it was compared to. Additionally, the algorithm was faster than the other discretizers in all of the performed tests. Finally, the parallel version of DBAD shows almost linear speedup for a Message Passing Interface (MPI) implementation (9.64× for 10 nodes), while a hybrid MPI/OpenMP implementation improves execution time by 35.3× for 10 nodes and 6 threads per node.
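
    A rough sketch of the general idea behind supervised, density-based cut-point selection is given below: estimate a per-class density for a feature and cut wherever the most probable class changes. It is an illustration of the concept only, not the DBAD algorithm itself, and the synthetic data are invented.

```python
# Supervised, density-based cut-point selection (concept sketch, not DBAD).
import numpy as np
from scipy.stats import gaussian_kde

def density_cut_points(x, y, grid_size=200):
    grid = np.linspace(x.min(), x.max(), grid_size)
    classes = np.unique(y)
    # Per-class density estimates, weighted by class priors.
    densities = np.vstack([
        gaussian_kde(x[y == c])(grid) * np.mean(y == c) for c in classes
    ])
    dominant = densities.argmax(axis=0)
    # Place a cut point wherever the dominant class changes along the grid.
    change = np.where(np.diff(dominant) != 0)[0]
    return (grid[change] + grid[change + 1]) / 2

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0, 1, 200), rng.normal(4, 1, 200)])
y = np.array([0] * 200 + [1] * 200)
print(density_cut_points(x, y))  # roughly one cut between the two class modes
```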

    Condition attributes, properties of decision rules, and discretisation: analysis of relations and dependencies

    When mining of input data is focused on rule induction, the knowledge discovered through exploration of existing patterns is stored as combinations of conditions on attributes included in rule premises, leading to specific decisions. Through their properties, such as lengths, supports, and cardinalities of rule sets, the inferred rules characterise relations detected among variables. The paper presents research dedicated to the analysis of these dependencies, considered in the context of various discretisation methods applied to input data from the stylometric domain. For induction of decision rules from data, the Classical Rough Set Approach was employed. Next, based on rule properties, several factors were proposed and evaluated, reflecting characteristics of the available condition attributes. They allowed us to observe how variables and rule sets changed depending on the applied discretisation algorithms.
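
    The toy snippet below illustrates two of the rule properties mentioned above, length (number of conditions in the premise) and support (objects matching both premise and decision); the data and rules are invented examples, not output of the Classical Rough Set Approach.

```python
# Computing length and support for simple decision rules (toy example).
import pandas as pd

data = pd.DataFrame({
    "attr1": ["low", "low", "high", "high", "low"],
    "attr2": ["a", "b", "a", "b", "a"],
    "decision": ["yes", "no", "yes", "yes", "yes"],
})

# A rule is a premise (conditions on attributes) plus a decision value.
rules = [
    ({"attr1": "low", "attr2": "a"}, "yes"),
    ({"attr1": "high"}, "yes"),
]

for premise, decision in rules:
    match = pd.Series(True, index=data.index)
    for attr, value in premise.items():
        match &= data[attr] == value
    support = int((match & (data["decision"] == decision)).sum())
    print(f"length={len(premise)}  support={support}  {premise} => {decision}")
```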

    Interpreting communities based on the evolution of a dynamic attributed network

    Many methods have been proposed to detect communities, not only in plain, but also in attributed, directed, or even dynamic complex networks. From the modeling point of view, to be of some utility, the community structure must be characterized relative to the properties of the studied system. However, most existing works focus on the detection of communities, and only very few try to tackle this interpretation problem. Moreover, the existing approaches are limited either by the type of data they handle or by the nature of the results they output. In this work, we see the interpretation of communities as a problem independent from the detection process, consisting in identifying the most characteristic features of communities. We give a formal definition of this problem and propose a method to solve it. To this aim, we first define a sequence-based representation of networks, combining temporal information, community structure, topological measures, and nodal attributes. We then describe how to identify the most emerging sequential patterns of this dataset and use them to characterize the communities. We study the performance of our method on artificially generated dynamic attributed networks. We also empirically validate our framework on real-world systems: a DBLP network of scientific collaborations and a LastFM network of social and musical interactions.
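
    As a rough sketch of the emerging-pattern idea, the snippet below scores a sequential pattern by its growth rate: support inside a community divided by support outside it. The sequences, pattern, and community labels are invented placeholders, not the paper's sequence-based network representation.

```python
# Growth-rate score for a sequential pattern with respect to one community.
def is_subsequence(pattern, sequence):
    it = iter(sequence)
    return all(item in it for item in pattern)

def growth_rate(pattern, sequences, labels, community):
    inside = [s for s, l in zip(sequences, labels) if l == community]
    outside = [s for s, l in zip(sequences, labels) if l != community]
    sup_in = sum(is_subsequence(pattern, s) for s in inside) / len(inside)
    sup_out = sum(is_subsequence(pattern, s) for s in outside) / len(outside)
    return sup_in / sup_out if sup_out > 0 else float("inf")

sequences = [["join", "post", "cite"], ["join", "cite"], ["post"], ["join", "post"]]
labels = ["c1", "c1", "c2", "c2"]
# A high (or infinite) growth rate marks the pattern as characteristic of "c1".
print(growth_rate(["join", "cite"], sequences, labels, "c1"))
```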

    Pathway to Future Symbiotic Creativity

    This report presents a comprehensive view of our vision of the development path of human-machine symbiotic art creation. We propose a classification of creative systems with a hierarchy of 5 classes, showing the pathway of creativity evolving from a human-mimicking artist (Turing Artist) to a Machine Artist in its own right. We begin with an overview of the limitations of Turing Artists, then focus on the top two levels of the hierarchy, Machine Artists, emphasizing machine-human communication in art creation. In art creation, it is necessary for machines to understand humans' mental states, including desires, appreciation, and emotions; humans also need to understand machines' creative capabilities and limitations. The rapid development of immersive environments and their further evolution into the new concept of the metaverse enable symbiotic art creation through unprecedented flexibility of bi-directional communication between artists and art manifestation environments. By examining the latest sensor and XR technologies, we illustrate a novel way of collecting art data to constitute the base of a new form of human-machine bidirectional communication and understanding in art creation. Based on such communication and understanding mechanisms, we propose a novel framework for building future Machine Artists, which comes with the philosophy that a human-compatible AI system should be based on the "human-in-the-loop" principle rather than the traditional "end-to-end" dogma. By proposing a new form of inverse reinforcement learning model, we outline the platform design for machine artists, demonstrate its functions, and showcase some examples of technologies we have developed. We also provide a systematic exposition of the ecosystem for an AI-based symbiotic art form and community, with an economic model built on NFT technology. Ethical issues for the development of machine artists are also discussed.

    Plurigaussian Simulation of rocktypes using data from a gold mine in Western Australia

    Stochastic simulation of rocktypes, or the geometry of the geology, is a major area of continuing research as earth scientists seek a better understanding of an orebody as a precursor to the assignment of continuous rock properties, allowing more economically appropriate decisions regarding mine planning. This thesis analyses the suitability of particular geostatistical rocktype modelling algorithms when applied to the five rocktypes evident in drill hole data from the Big Bell gold mine near Cue, Western Australia. The background of the geostatistical theory is considered, in particular the concept of the random function model and the link between the categorical statistics determined from the drill hole data and the three models used for estimation and simulation. The commonly applied indicator kriging (IK) and sequential indicator simulation (SIS) algorithms are compared in a non-sedimentary gold deposit environment to the more computationally demanding and more complex plurigaussian simulation (PGS). Comparisons between the three models are made by examining global and regional rocktype (lithotype) proportions of the outputs of the models, both visually and empirically. The models are validated by considering the contacts which occur in reality between different lithotypes and the proportion of contacts in each model which do not conform to this reality. This 'inadmissible contact' ratio measures the short-range validity of the estimation and simulation techniques. Finally, cores taken from the output of the models are compared to the drill hole data in terms of transition proportions between the twenty-five possible transitions for the five lithotypes. Inadmissible contacts were at a minimum with PGS, and the visual and empirical natures of the PGS output were closely linked to the reality of the drill hole data. Whilst each model produced similar 3D images, PGS was a realistic balance between the clustering effect produced by IK and the fine mosaic effect from SIS. The PGS output numerically outperformed the other two models in terms of admissible contacts and connectivity, most closely matching the drill hole data. All results indicate that, whilst demanding to implement, PGS produces the most adequate model of the study region.
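
    A small sketch of the validation statistic described above is given below: proportions of transitions between lithotype codes along a core. The lithotype sequence is an invented example, not data from the Big Bell deposit.

```python
# Transition proportions between lithotype codes along a (toy) drill core.
import numpy as np

def transition_proportions(lithotypes, n_types):
    counts = np.zeros((n_types, n_types))
    for a, b in zip(lithotypes[:-1], lithotypes[1:]):
        counts[a, b] += 1
    # Normalise each row to get proportions of transitions out of that lithotype.
    row_sums = counts.sum(axis=1, keepdims=True)
    return np.divide(counts, row_sums, out=np.zeros_like(counts), where=row_sums > 0)

core = [0, 0, 1, 1, 1, 2, 2, 0, 3, 3, 4]  # lithotype codes down the hole
print(transition_proportions(core, n_types=5))
```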

    EDM 2011: 4th International Conference on Educational Data Mining: Eindhoven, July 6-8, 2011: proceedings


    Gene Subset Selection Approaches Based on Linear Separability

    We address the concept of linear separability of gene expression data sets with respect to two classes, which has recently been studied in the literature. The problem is to efficiently find all pairs of genes which induce a linear separation of the data. We study the Containment Angle (CA), defined on the unit circle for a linearly separating gene pair (LS-pair), as an alternative to the paired t-test ranking function for gene selection. Using the CA we also show empirically that a given classifier's error is related to the degree of linear separability of a given data set. Finally, we propose gene subset selection methods based on the CA ranking function for LS-pairs and a ranking function for linearly separating genes (LS-genes), which select only among LS-genes and LS-pairs. Overall, our proposed methods give better results in terms of subset sizes and classification accuracy when compared to well-performing methods on many gene expression data sets.
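
    As a hedged illustration of the separability test underlying LS-pairs, the snippet below checks whether a gene pair admits a perfect linear separation of the two classes using a large-C linear SVM; this stands in for the paper's Containment Angle machinery, which is not reproduced, and the expression data are synthetic.

```python
# Checking whether a gene pair linearly separates two classes (toy data).
import numpy as np
from sklearn.svm import LinearSVC

def is_ls_pair(expr, labels, gene_i, gene_j):
    X = expr[:, [gene_i, gene_j]]
    clf = LinearSVC(C=1e6, max_iter=100_000).fit(X, labels)
    return clf.score(X, labels) == 1.0  # perfect separation on the training data

rng = np.random.default_rng(1)
expr = rng.normal(size=(40, 5))        # 40 samples x 5 genes
labels = np.array([0] * 20 + [1] * 20)
expr[labels == 1, 2] += 5.0            # make gene 2 strongly class-related

print(is_ls_pair(expr, labels, 2, 3))  # expected: True  (gene 2 separates well)
print(is_ls_pair(expr, labels, 0, 1))  # expected: False (unrelated genes)
```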