
    Neural Techniques for Improving the Classification Accuracy of Microarray Data Set using Rough Set Feature Selection Method

    Classification, a data mining task, is an effective method for categorising data in the knowledge discovery process. Classification algorithms are widely used in the medical field to classify medical data for diagnosis. Feature selection increases the accuracy of a classifier because it eliminates irrelevant attributes. This paper analyzes the performance of neural network classifiers with and without feature selection, in terms of accuracy and efficiency, when building models on four different datasets. It provides a rough set feature selection scheme and evaluates the relative performance of four neural network classification procedures incorporating that scheme: the Learning Vector Quantisation (LVQ) variants LVQ1, LVQ3 and optimized-learning-rate LVQ1 (OLVQ1), and the Self-Organizing Map (SOM). Experimental results show that LVQ3 is an appropriate classification method that makes it possible to construct high-performance classification models for microarray data.
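
    The LVQ1 update rule at the heart of the methods compared above is simple enough to sketch. The following is a minimal, illustrative NumPy implementation of plain LVQ1, assuming random prototype initialisation and a linearly decaying learning rate; it is not the authors' code, and LVQ3/OLVQ1 differ only in how the learning rate and a second prototype are handled.

    import numpy as np

    def lvq1_fit(X, y, n_prototypes_per_class=1, lr=0.1, epochs=20, seed=0):
        """Train LVQ1 prototypes: pull the nearest prototype towards a sample of the
        same class, push it away otherwise."""
        rng = np.random.default_rng(seed)
        protos, proto_labels = [], []
        for c in np.unique(y):
            # initialise prototypes from random samples of each class (an assumption)
            idx = rng.choice(np.flatnonzero(y == c), n_prototypes_per_class, replace=False)
            protos.append(X[idx])
            proto_labels.append(np.full(n_prototypes_per_class, c))
        protos = np.vstack(protos).astype(float)
        proto_labels = np.concatenate(proto_labels)
        for epoch in range(epochs):
            alpha = lr * (1 - epoch / epochs)           # linearly decaying learning rate
            for i in rng.permutation(len(X)):
                d = np.linalg.norm(protos - X[i], axis=1)
                j = d.argmin()                          # winning prototype
                sign = 1.0 if proto_labels[j] == y[i] else -1.0
                protos[j] += sign * alpha * (X[i] - protos[j])
        return protos, proto_labels

    def lvq1_predict(X, protos, proto_labels):
        """Assign each sample the label of its nearest prototype."""
        d = np.linalg.norm(X[:, None, :] - protos[None, :, :], axis=2)
        return proto_labels[d.argmin(axis=1)]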

    Computational models and approaches for lung cancer diagnosis

    The success of treatment of patients with cancer depends on establishing an accurate diagnosis. To this end, the aim of this study is to develop novel lung cancer diagnostic models. New algorithms are proposed to analyse the biological data and extract knowledge that assists in achieving accurate diagnosis results.

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site's revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in the current literature addressing and evaluating the different techniques of click fraud detection and prevention; (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data; (3) current deep learning models have significant computational overhead; (4) training data is often imbalanced, and balancing it still results in noisy data that can train the classifier incorrectly; and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness, and while existing feature selection techniques address this issue, they have their own performance limitations. By extending state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) to address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network; (ii) as a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning; (iii) to overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and in the synthetically generated samples; (iv) for (5), we propose different column-reduction methods, such as multi-time-scale time series analysis for fraud forecasting using binary-labeled imbalanced datasets, and hybrid filter-wrapper feature selection approaches.
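
    To illustrate the kind of row-wise data reduction described in point (iii), the sketch below combines generic SMOTE-style minority oversampling with a simple neighbour-based noise filter. It is an illustration only, not the dissertation's KSMOTE algorithm: the function names, the filtering criterion and the thresholds are assumptions.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_oversample(X_min, n_new, k=5, seed=0):
        """Create synthetic minority samples by interpolating towards random
        k-nearest minority neighbours (plain SMOTE)."""
        rng = np.random.default_rng(seed)
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)               # idx[:, 0] is the point itself
        synth = []
        for _ in range(n_new):
            i = rng.integers(len(X_min))
            j = idx[i, rng.integers(1, k + 1)]      # pick a true neighbour
            lam = rng.random()
            synth.append(X_min[i] + lam * (X_min[j] - X_min[i]))
        return np.array(synth)

    def filter_noisy(X, y, k=5):
        """Drop samples whose k nearest neighbours mostly carry the opposite label
        (a simple stand-in for a noise-filtering step, not KSMOTE itself)."""
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
        _, idx = nn.kneighbors(X)
        agree = (y[idx[:, 1:]] == y[:, None]).mean(axis=1)
        keep = agree >= 0.5
        return X[keep], y[keep]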

    Coevolutionary fuzzy attribute order reduction with complete attribute-value space tree

    Since big data sets are structurally complex and high-dimensional, and their attributes exhibit redundant and irrelevant information, the selection, evaluation, and combination of those large-scale attributes pose huge challenges to traditional methods. Fuzzy rough sets have emerged as a powerful vehicle for dealing with uncertain and fuzzy attributes in big data problems that involve a very large number of variables to be analyzed in a very short time. To further overcome the inefficiency of traditional algorithms on uncertain and fuzzy big data, this paper presents a new coevolutionary fuzzy attribute order reduction algorithm (CFAOR) based on a complete attribute-value space tree. A complete attribute-value space tree model of the decision table is designed in the attribute space to adaptively prune and optimize the attribute order tree. The fuzzy similarity of multimodality attributes can be extracted to satisfy the needs of users with better convergence speed and classification performance. The decision rule sets then generate a series of rule chains to form an efficient cascade of attribute order reduction and classification with a rough entropy threshold. Finally, the performance of CFAOR is assessed on a set of benchmark problems that contain complex, high-dimensional datasets with noise. The experimental results demonstrate that CFAOR achieves higher average computational efficiency and classification accuracy than state-of-the-art methods. Furthermore, CFAOR is applied to extract different tissue surfaces of the dynamically changing infant cerebral cortex, and its results show a satisfying consistency with those of medical experts, which indicates its potential significance for disorder prediction in the infant cerebrum.
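
    As background for the fuzzy similarity and attribute evaluation discussed above, the sketch below computes a standard fuzzy-rough dependency measure for a set of numeric attributes. It is not the CFAOR algorithm itself; the similarity definition (a min t-norm over per-attribute similarities) and the crisp decision classes are assumptions chosen for brevity.

    import numpy as np

    def fuzzy_similarity(X):
        """Per-attribute similarity 1 - |a(x) - a(y)| / range(a), combined with a min t-norm."""
        n, m = X.shape
        R = np.ones((n, n))
        for a in range(m):
            col = X[:, a].astype(float)
            span = col.max() - col.min() or 1.0     # guard against constant attributes
            Ra = 1.0 - np.abs(col[:, None] - col[None, :]) / span
            R = np.minimum(R, Ra)
        return R

    def fuzzy_rough_dependency(X, y):
        """Degree to which the attributes of X discern the decision classes in y."""
        R = fuzzy_similarity(X)
        pos = np.zeros(len(y))
        for c in np.unique(y):
            member = (y == c).astype(float)         # crisp class membership
            # lower approximation: inf over objects of max(1 - R(x, .), membership)
            lower = np.minimum.reduce(np.maximum(1.0 - R, member[None, :]), axis=1)
            pos = np.maximum(pos, lower)
        return pos.mean()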

    Recent Trends in Computational Intelligence

    Traditional models struggle to cope with complexity, noise, and changing environments, while Computational Intelligence (CI) offers solutions to complicated problems as well as inverse problems. The main feature of CI is adaptability, spanning the fields of machine learning and computational neuroscience. CI also comprises biologically inspired technologies, such as swarm intelligence as part of evolutionary computation, and encompasses wider areas such as image processing, data collection, and natural language processing. This book aims to discuss the use of CI for the optimal solving of various applications, proving its wide reach and relevance. Combining optimization methods and data mining strategies makes for a strong and reliable prediction tool for handling real-life applications.

    I2ECR: Integrated and Intelligent Environment for Clinical Research

    Clinical trials are designed to produce new knowledge about a certain disease, drug or treatment. During these studies, a huge amount of data is collected about participants, therapies, clinical procedures, outcomes, adverse events and so on. A multicenter, randomized, phase III clinical trial in hematology enrolls up to hundreds of subjects and evaluates post-treatment outcomes on stratified subgroups of subjects over a period of many years. Data collection in clinical trials is therefore becoming complex, with a huge number of clinical and biological variables. Outside the medical field, data warehouses (DWs) are widely employed; a data warehouse is a "collection of integrated, subject-oriented databases designed to support the decision-making process". To verify whether DWs might be useful for data quality and association analysis, a team of biomedical engineers, clinicians, biologists and statisticians developed the "I2ECR" project. I2ECR is an Integrated and Intelligent Environment for Clinical Research where clinical and omics data stand together for clinical use (reporting) and for the generation of new clinical knowledge. I2ECR has been built from the "MCL0208" phase III, prospective clinical trial sponsored by the Fondazione Italiana Linfomi (FIL); this is a translational study accounting for many clinical data, along with several clinical prognostic indexes (e.g. MIPI - Mantle International Prognostic Index), pathological information, treatment and outcome data, biological assessments of disease (MRD - Minimal Residual Disease), as well as many biological ancillary studies, such as Mutational Analysis, Gene Expression Profiling (GEP) and Pharmacogenomics. Forty-eight Italian medical centers were actively involved in this trial, for a total of 300 enrolled subjects. The main objectives of I2ECR are:
    • to propose an integration project built on clinical and molecular data quality concepts, applying clear raw-data analysis and clinical trial monitoring strategies to implement a digital platform where clinical, biological and "omics" data are imported from different sources and well integrated in a data warehouse;
    • to be a dynamic repository of data congruency quality rules. I2ECR allows the quality of data to be monitored in a semi-automatic manner, in relation to the clinical data imported from eCRFs (electronic Case Report Forms) and from biologic and mutational datasets internally edited by local laboratories. I2ECR will therefore be able to detect missing data and mistakes derived from non-conventional data-entry activities by centers;
    • to provide clinical stakeholders with a platform from which they can easily design statistical and data mining analyses. The term Data Mining (DM) identifies a set of tools for searching for hidden patterns of interest in large and multivariate datasets. The applications of DM techniques in the medical field range from outcome prediction and patient classification to genomic medicine and molecular biology. I2ECR allows clinical stakeholders to propose innovative methods of supervised and unsupervised feature extraction, data classification and statistical analysis on heterogeneous datasets associated with the MCL0208 clinical trial.
    Although the MCL0208 study is the first example of data population in I2ECR, the environment will be able to import data from clinical studies designed for other onco-hematologic diseases, too.
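
    The data congruency quality rules mentioned in the second objective can be pictured as small, declarative checks over the imported eCRF records. The pandas sketch below is purely illustrative: the column names and the two rules are hypothetical and are not the rules implemented in I2ECR.

    import pandas as pd

    def check_congruency(ecrf: pd.DataFrame) -> pd.DataFrame:
        """Return one row per violated rule, with the index of the offending record."""
        issues = []
        # Rule 1: mandatory fields must not be missing (column names are hypothetical).
        for col in ["enrollment_date", "treatment_arm", "mipi_score"]:
            for idx in ecrf.index[ecrf[col].isna()]:
                issues.append({"record": idx, "rule": f"missing {col}"})
        # Rule 2: a follow-up visit cannot precede enrollment.
        bad = ecrf["followup_date"] < ecrf["enrollment_date"]
        for idx in ecrf.index[bad.fillna(False)]:
            issues.append({"record": idx, "rule": "followup before enrollment"})
        return pd.DataFrame(issues, columns=["record", "rule"])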

    Variable precision rough set theory decision support system: With an application to bank rating prediction

    This dissertation considers the Variable Precision Rough Sets (VPRS) model and its development within a comprehensive software package (decision support system) incorporating methods of resampling and classifier aggregation. The concept of β-reduct aggregation is introduced as a novel approach to classifier aggregation within the VPRS framework. The software is applied to the credit rating prediction problem; in particular, a full exposition of the prediction and classification of Fitch's Individual Bank Strength Ratings (FIBRs) for a number of banks from around the world is presented. The ethos of the developed software was to rely heavily on a simple 'point and click' interface, designed to make a VPRS analysis accessible to an analyst who is not necessarily an expert in the field of VPRS or decision-rule-based systems. The development of the software also benefited from consultations with managers from one of Europe's leading hedge funds, who gave valuable insight, advice and recommendations on what they considered pertinent issues with regard to data mining and what they would like to see from a modern data mining system. The elements within the developed software reflect each stage of the knowledge discovery process, namely pre-processing, feature selection, data mining, interpretation and evaluation. The developed software encompasses three packages: a pre-processing package incorporating some of the latest pre-processing and feature selection methods; a VPRS data mining package based on a novel "vein graph" interface, which presents the analyst with selectable β-reducts over the domain of β; and a third, more advanced VPRS data mining package, which essentially automates the vein graph interface for incorporation into a resampling environment and also implements the introduced aggregated β-reduct, developed to optimise and stabilise the predictive accuracy of a set of decision rules induced from the aggregated β-reduct.
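
    For readers unfamiliar with VPRS, the β-positive region underlying β-reducts can be sketched in a few lines. The following is a minimal illustration for discrete-valued data, assuming a simple decision-table layout; it is not the dissertation's software, and the helper names are invented.

    import numpy as np

    def beta_positive_region(X, y, attrs, beta=0.8):
        """Objects whose equivalence class on `attrs` predicts some decision class
        with conditional probability of at least beta."""
        keys = [tuple(row) for row in X[:, attrs]]
        pos = []
        for key in set(keys):
            members = np.array([i for i, k in enumerate(keys) if k == key])
            _, counts = np.unique(y[members], return_counts=True)
            if counts.max() / len(members) >= beta:   # majority class is confident enough
                pos.extend(members.tolist())
        return sorted(pos)

    def vprs_quality(X, y, attrs, beta=0.8):
        """Quality of classification: fraction of objects inside the beta-positive region."""
        return len(beta_positive_region(X, y, attrs, beta)) / len(y)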

    Evolutionary approaches for feature selection in biological data

    Data mining techniques have been used widely in many areas such as business, science, engineering and medicine. These techniques allow a vast amount of data to be explored in order to extract useful information from it. One of the foci in the health area is finding interesting biomarkers in biomedical data. High-throughput data generated from microarrays and mass spectrometry of biological samples are high-dimensional and small in sample size; examples include DNA microarray datasets with up to 500,000 genes and mass spectrometry data with 300,000 m/z values. While the availability of such datasets can aid in the development of techniques/drugs to improve diagnosis and treatment of diseases, a major challenge involves their analysis to extract useful and meaningful information. The aims of this project are: 1) to investigate and develop feature selection algorithms that incorporate various evolutionary strategies; 2) to use the developed algorithms to find the "most relevant" biomarkers contained in biological datasets; and 3) to evaluate the goodness of extracted feature subsets for relevance (examined in terms of existing biomedical domain knowledge and of the classification accuracy obtained using different classifiers). The project aims to generate good predictive models for classifying diseased samples from controls.
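
    A generic wrapper-style evolutionary feature selector gives a flavour of the algorithms the project sets out to develop. The sketch below uses a plain genetic algorithm with a cross-validated k-NN classifier as the fitness function; the operators, parameters and classifier choice are assumptions for illustration and do not reflect the specific evolutionary strategies developed in the project.

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.neighbors import KNeighborsClassifier

    def ga_feature_selection(X, y, pop_size=30, generations=20, p_mut=0.02, seed=0):
        """Evolve boolean feature masks; fitness is 3-fold CV accuracy of a 3-NN classifier."""
        rng = np.random.default_rng(seed)
        n_feat = X.shape[1]
        pop = rng.random((pop_size, n_feat)) < 0.1       # start from sparse random masks

        def fitness(mask):
            if not mask.any():
                return 0.0
            clf = KNeighborsClassifier(n_neighbors=3)
            return cross_val_score(clf, X[:, mask], y, cv=3).mean()

        for _ in range(generations):
            scores = np.array([fitness(m) for m in pop])
            parents = pop[np.argsort(scores)[::-1][: pop_size // 2]]   # truncation selection
            children = []
            while len(children) < pop_size - len(parents):
                a, b = parents[rng.integers(len(parents), size=2)]
                cut = rng.integers(1, n_feat)            # one-point crossover
                child = np.concatenate([a[:cut], b[cut:]])
                child ^= rng.random(n_feat) < p_mut      # bit-flip mutation
                children.append(child)
            pop = np.vstack([parents] + children)
        scores = np.array([fitness(m) for m in pop])
        return pop[scores.argmax()]                      # best feature mask found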