530 research outputs found

    A systematic review of data quality issues in knowledge discovery tasks

    The volume of data is growing rapidly because organizations continuously capture data in pursuit of better decision-making. The most fundamental challenge is to explore these large volumes of data and extract useful knowledge for future actions through knowledge discovery tasks; however, much of this data is of poor quality. We present a systematic review of data quality issues in knowledge discovery tasks and a case study applied to the agricultural disease known as coffee rust.

    Statistical Challenges and Methods for Missing and Imbalanced Data

    Missing data remains a prevalent issue in every area of research. The impact of missing data, if not carefully handled, can be detrimental to any statistical analysis. Statistical challenges associated with missing data include loss of information, reduced statistical power, and non-generalizability of a study's findings. It is therefore crucial that researchers pay close and particular attention when dealing with missing data. This multi-paper dissertation provides insight into missing data across different fields of study and addresses some of the above-mentioned challenges through simulation studies and applications to real datasets. The first paper of this dissertation addresses the dropout phenomenon in single-cell RNA (scRNA) sequencing through a comparative analysis of some existing scRNA sequencing techniques. The second paper uses simulation studies to assess whether it is appropriate to address the issue of non-detects in data using a traditional substitution approach, imputation, or a non-imputation-based approach. The final paper presents an efficient strategy to address imbalance in data at any degree (whether moderate or high) by combining random undersampling with different weighting strategies. We conclude generally, based on the findings of this dissertation, that missingness is not always a lack of information but an interestingness that needs to be investigated.
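    The final paper's strategy of combining random undersampling with a weighting scheme can be sketched as follows. This is a minimal illustration in plain Python, assuming an inverse-class-frequency ("balanced"-style) weighting; the dissertation's exact weighting strategies are not specified here.

```python
import random
from collections import Counter

def undersample_with_weights(X, y, majority_keep=0.5, seed=0):
    """Randomly undersample the majority class, then assign each retained
    sample a weight inversely proportional to its class frequency."""
    rng = random.Random(seed)
    counts = Counter(y)
    majority = max(counts, key=counts.get)
    # keep every minority sample; keep majority samples with prob. majority_keep
    kept = [(xi, yi) for xi, yi in zip(X, y)
            if yi != majority or rng.random() < majority_keep]
    new_counts = Counter(yi for _, yi in kept)
    total = len(kept)
    # weight = total / (n_classes * class_count), as in "balanced" weighting
    weights = [total / (len(new_counts) * new_counts[yi]) for _, yi in kept]
    Xr = [xi for xi, _ in kept]
    yr = [yi for _, yi in kept]
    return Xr, yr, weights
```

    The weights can then be passed as per-sample weights to any classifier that supports them, so the retained minority examples still dominate the loss despite their scarcity.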

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose different column-reduction methods, such as multi-time-scale time series analysis for fraud forecasting using binary-labeled imbalanced datasets, and hybrid filter-wrapper feature selection approaches.
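    The row-wise reduction idea behind KSMOTE, SMOTE-style synthesis of minority samples followed by filtering of noisy ones, can be sketched roughly as below. The filtering rule shown (discard synthetic points that lie nearer to the majority class than to the minority class) is an illustrative stand-in, not the published KSMOTE criterion.

```python
import math
import random

def smote_with_filter(minority, majority, n_new, k=3, seed=0):
    """Generate synthetic minority points by interpolating between a minority
    point and one of its k nearest minority neighbours, then drop synthetic
    points that end up closer to the majority class (illustrative filter)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p != base),
                            key=lambda p: math.dist(base, p))[:k]
        nb = rng.choice(neighbours)
        t = rng.random()
        synthetic.append(tuple(b + t * (n - b) for b, n in zip(base, nb)))
    # noise filter: keep points nearer to the minority class than the majority
    kept = [s for s in synthetic
            if min(math.dist(s, p) for p in minority)
            <= min(math.dist(s, q) for q in majority)]
    return kept
```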

    An insight into imbalanced Big Data classification: outcomes and challenges

    Big Data applications have emerged in recent years, and researchers from many disciplines are aware of the significant advantages of knowledge extraction from this type of problem. However, traditional learning approaches cannot be applied directly due to scalability issues. To overcome this, the MapReduce framework has arisen as a “de facto” solution. Basically, it carries out a “divide-and-conquer” distributed procedure in a fault-tolerant way suited to commodity hardware. As this is still a recent discipline, little research has been conducted on imbalanced classification for Big Data, mainly because of the difficulties in adapting standard techniques to the MapReduce programming style. Additionally, the inner problems of imbalanced data, namely lack of data and small disjuncts, are accentuated when the data is partitioned to fit the MapReduce programming style. This paper is designed under three main pillars: first, to present the first outcomes for imbalanced classification in Big Data problems, introducing the current state of research in this area; second, to analyze the behavior of standard pre-processing techniques in this particular framework; and finally, taking into account the experimental results obtained throughout this work, to discuss the challenges and future directions for the topic. This work has been partially supported by the Spanish Ministry of Science and Technology under Projects TIN2014-57251-P and TIN2015-68454-R, the Andalusian Research Plan P11-TIC-7765, the Foundation BBVA Project 75/2016 BigDaPTOOLS, and the National Science Foundation (NSF) Grant IIS-1447795.
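    The “divide-and-conquer” MapReduce procedure described above, and the way partitioning accentuates the lack of minority data, can be illustrated with a toy sketch (plain Python standing in for an actual MapReduce cluster; function names are illustrative):

```python
from collections import Counter
from functools import reduce

def map_phase(partition):
    """Mapper: count class labels within one data partition."""
    return Counter(label for _, label in partition)

def reduce_phase(c1, c2):
    """Reducer: merge partial class counts from two mappers."""
    return c1 + c2

def class_distribution(data, n_partitions):
    """Divide the data into chunks (one per node), map, then reduce."""
    parts = [data[i::n_partitions] for i in range(n_partitions)]
    partials = [map_phase(p) for p in parts]
    return parts, reduce(reduce_phase, partials)
```

    With, say, 95 majority and 5 minority samples split across 10 partitions, several partitions receive no minority sample at all, which is exactly the lack-of-data problem the paper highlights.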

    A Comprehensive Survey on Rare Event Prediction

    Rare event prediction involves identifying and forecasting events with a low probability using machine learning and data analysis. Due to imbalanced data distributions, where the frequency of common events vastly outweighs that of rare events, it requires specialized methods within each step of the machine learning pipeline, i.e., from data processing to algorithms to evaluation protocols. Predicting the occurrence of rare events is important for real-world applications, such as Industry 4.0, and is an active research area in statistics and machine learning. This paper comprehensively reviews the current approaches for rare event prediction along four dimensions: rare event data, data processing, algorithmic approaches, and evaluation approaches. Specifically, we consider 73 datasets from different modalities (i.e., numerical, image, text, and audio), four major categories of data processing, five major algorithmic groupings, and two broader evaluation approaches. This paper aims to identify gaps in the current literature, highlight the challenges of predicting rare events, and suggest potential research directions that can help guide practitioners and researchers.
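    One reason rare event prediction needs specialized evaluation protocols is that plain accuracy is misleading under imbalance: a classifier that never predicts the rare event can still score near-perfect accuracy. A minimal sketch of the standard alternatives (precision, recall, F1):

```python
def confusion(y_true, y_pred, positive=1):
    """Binary confusion counts for the designated positive (rare) class."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    tn = len(y_true) - tp - fp - fn
    return tp, fp, fn, tn

def rare_event_metrics(y_true, y_pred):
    """Accuracy alongside precision/recall/F1 on the rare class."""
    tp, fp, fn, tn = confusion(y_true, y_pred)
    accuracy = (tp + tn) / len(y_true)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

    On a dataset with one rare event in 100 samples, an all-majority predictor reaches 99% accuracy while its F1 on the rare class is zero, which is why imbalance-aware metrics dominate this literature.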

    A Computational Framework for Host-Pathogen Protein-Protein Interactions

    Infectious diseases cause millions of illnesses and deaths every year and raise great health concerns worldwide. How to monitor and cure infectious diseases has become a prevalent and intractable problem. Since host-pathogen interactions are considered the key infection processes at the molecular level, a large amount of research has focused on them, towards the understanding of infection mechanisms and the development of novel therapeutic solutions. For years, the continuous development of technologies in biology has benefitted wet lab-based experiments, such as small-scale biochemical, biophysical, and genetic experiments and large-scale methods (for example, yeast two-hybrid analysis and cryogenic electron microscopy). As a result of the past decades of effort, there has been an explosive accumulation of biological data, including multi-omics data such as genomics and proteomics data. With high-throughput technologies, the growing amount of ‘omics’ data has been boosted even further, and an upsurge of interest in data analytics for bioinformatics comes as no surprise to researchers from a variety of disciplines. Specifically, the astonishing rate at which genomics and proteomics data are generated leads researchers into the realm of ‘Big Data’ research. Chapter 2 is thus developed to provide an update on the omics background and the state-of-the-art developments in the omics area, with a particular focus on genomics and proteomics data, from the perspective of big data analytics.

    Cost-Sensitive Learning-based Methods for Imbalanced Classification Problems with Applications

    Analysis and predictive modeling of massive datasets is an extremely significant problem that arises in many practical applications. The task of predictive modeling becomes even more challenging when data are imperfect or uncertain. Real data are frequently affected by outliers, uncertain labels, and an uneven distribution of classes (imbalanced data). Such uncertainties create bias and make predictive modeling an even more difficult task. In the present work, we introduce a cost-sensitive learning (CSL) method to deal with the classification of imperfect data. Most traditional approaches to classification demonstrate poor performance in an environment with imperfect data. We propose the use of CSL with the Support Vector Machine, a well-known data mining algorithm. The results reveal that the proposed algorithm produces more accurate classifiers and is more robust with respect to imperfect data. Furthermore, we explore the best performance measures for tackling imperfect data, along with addressing real problems in quality control and business analytics.
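    The core of cost-sensitive learning with an SVM is to penalise misclassification of the rare class more heavily in the training objective. A minimal sketch of a cost-weighted hinge loss for a linear model; the cost values and the linear form are illustrative, not the paper's exact formulation:

```python
def cost_sensitive_hinge(w, b, X, y, cost_pos=5.0, cost_neg=1.0):
    """Total hinge loss where errors on the positive (rare) class are
    penalised cost_pos times as much as errors on the negative class.
    Labels are +1 / -1, as in the standard SVM formulation."""
    total = 0.0
    for xi, yi in zip(X, y):
        # signed margin of the linear decision function
        margin = yi * (sum(wj * xj for wj, xj in zip(w, xi)) + b)
        cost = cost_pos if yi == 1 else cost_neg
        total += cost * max(0.0, 1.0 - margin)
    return total
```

    An optimiser minimising this loss (plus the usual regulariser) is pushed toward boundaries that avoid expensive positive-class mistakes, which is the essence of CSL.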

    TrollBus, An Empirical Study Of Features For Troll Detection

    In today's social media context, online political discussion has become commonplace. Users from all sides of the political spectrum can express their opinions freely and debate their views on various social networks, including Twitter. From 2016 onward, a group of users whose objective is to polarize discussions and sow discord began to gain notoriety on this network. These accounts are known as trolls, and they have been linked to several events in recent history, such as interference in elections and the organization of violent protests. Since their discovery, several approaches have been developed to detect these accounts using machine learning, drawing on different types of features. The goal of this work is to compare those feature sets, both individually and in groups, through an empirical study that adapts them to the Portuguese Twitter community. The necessary data was collected through SocialBus, a tool for the collection, processing, and storage of social network data, namely from Twitter. The set of accounts used to collect the data was obtained from Portuguese political journalists, and the labelling of trolls was performed with a strict set of behavioural rules, aided by a scoring function. A new SocialBus module, called TrollBus, was developed to perform troll detection in real time, and a public dataset was also released. The features of the best model combine an account's profile metadata with the surface-level aspects of its text. The most important feature group proved to be the numerical aspects of the text, with the single most important feature being the presence of political insults.
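    The kind of feature groups compared in the study, profile metadata versus surface text features such as insult counts, can be sketched as below. The feature names and the insult lexicon here are hypothetical placeholders, not the thesis's actual feature set or Portuguese lexicon.

```python
import re

# hypothetical insult lexicon; the study's Portuguese lexicon is not reproduced here
POLITICAL_INSULTS = {"corrupt", "traitor", "puppet"}

def extract_features(profile, tweets):
    """Combine profile metadata with surface text features, mirroring the
    feature groups compared in the study (all names are illustrative)."""
    text = " ".join(tweets).lower()
    words = re.findall(r"[a-z']+", text)
    return {
        "followers": profile.get("followers", 0),
        "account_age_days": profile.get("account_age_days", 0),
        "n_tweets": len(tweets),
        "avg_tweet_len": sum(len(t) for t in tweets) / len(tweets) if tweets else 0.0,
        "insult_count": sum(w in POLITICAL_INSULTS for w in words),
    }
```

    Feature vectors of this shape can then be fed to any standard classifier, and each group (metadata-only, text-only, combined) evaluated separately, which is the comparison the study performs.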

    A robust machine learning approach for the prediction of allosteric binding sites

    Previously held under moratorium from 28 March 2017 until 28 March 2022.
    Allosteric regulatory sites are highly prized targets in drug discovery. They remain difficult to detect by conventional methods, with the vast majority of known examples having been found serendipitously. Herein, a rigorous, wholly computational protocol is presented for the prediction of allosteric sites. Previous attempts to predict the location of allosteric sites by computational means drew on only a small amount of data. Moreover, no attempt was made to modify the initial crystal structure beyond the in silico deletion of the allosteric ligand. This can leave behind a conformation with a significant structural deformation, often betraying the location of the allosteric binding site. Despite this artificial advantage, modest success rates are observed at best. This work addresses both of these issues. A set of 60 protein crystal structures with known allosteric modulators was collected. To remove the imprint on protein structure caused by the presence of bound modulators, molecular dynamics was performed on each protein prior to analysis. A wide variety of analytical techniques were then employed to extract meaningful data from the trajectories. Upon fusing them into a single, coherent dataset, random forest - a machine learning algorithm - was applied to train a high-performance classification model. After successive rounds of optimisation, the final model presented in this work correctly identified the allosteric site for 72% of the proteins tested. This is not only an improvement over alternative strategies in the literature; crucially, this method is unique among site prediction tools in that it does not exploit crystal structures containing imprints of bound ligands - of key importance when making live predictions, where no allosteric regulatory sites are known.