
    A survey of outlier detection methodologies

    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences, and it can identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of Computer Science and Statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
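
    As an illustration of the kind of statistical technique such surveys cover, here is a minimal sketch (not taken from the paper) of the classical z-score rule; the threshold and sample values are arbitrary.

```python
import numpy as np

def zscore_outliers(values, threshold=3.0):
    """Flag observations lying more than `threshold` standard deviations from
    the sample mean; one of the simplest statistical detection rules."""
    values = np.asarray(values, dtype=float)
    z = (values - values.mean()) / values.std()
    return np.abs(z) > threshold

# A looser threshold is used because the sample is tiny; only 120 is flagged.
print(zscore_outliers([9, 10, 11, 10, 9, 120], threshold=2.0))
```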

    A taxonomy framework for unsupervised outlier detection techniques for multi-type data sets

    The term "outlier" can generally be defined as an observation that is significantly different from the other values in a data set. The outliers may be instances of error or indicate events. The task of outlier detection aims at identifying such outliers in order to improve the analysis of data and further discover interesting and useful knowledge about unusual events within numerous applications domains. In this paper, we report on contemporary unsupervised outlier detection techniques for multiple types of data sets and provide a comprehensive taxonomy framework and two decision trees to select the most suitable technique based on data set. Furthermore, we highlight the advantages, disadvantages and performance issues of each class of outlier detection techniques under this taxonomy framework

    Improving the resolution of interaction maps: A middleground between high-resolution complexes and genome-wide interactomes

    Protein-protein interactions are ubiquitous in biology and therefore central to understanding living organisms. In recent years, large-scale studies have been undertaken to describe, at least partially, protein-protein interaction maps, or interactomes, for a number of relevant organisms including human. Although the analysis of interaction networks is proving useful, current interactomes provide a blurry and granular picture of the molecular machinery: unless the structure of the protein complex is known, the molecular details of the interaction are missing, and sometimes it is not even possible to know whether the interaction between the proteins is direct (a physical interaction) or part of a functional, not necessarily direct, association. Unfortunately, the determination of the structure of protein complexes cannot keep pace with the discovery of new protein-protein interactions, resulting in a large and increasing gap between the number of complexes that are thought to exist and the number for which 3D structures are available. The aim of this thesis was to tackle this problem by implementing computational approaches to derive structural models of protein complexes and thus reduce the existing gap. Over the course of the thesis, a novel modelling algorithm to predict the structure of protein complexes, V-D2OCK, was implemented. This algorithm combines structure-based prediction of protein binding sites, by means of the novel algorithms VORFFIP and M-VORFFIP also developed during the thesis, with data-driven docking and energy minimization. It was used to improve the coverage and structural content of the human interactome, compiled from different sources of interactomic data to ensure the most comprehensive interactome. Finally, the human interactome and structural models were compiled in a database, V-D2OCK DB, which offers easy and user-friendly access to the human interactome, including a bespoke graphical molecular viewer to facilitate the analysis of the structural models of protein complexes. Furthermore, organisms beyond human were included, providing a useful resource for the study of all known interactomes.

    Frequency Domain Decomposition of Digital Video Containing Multiple Moving Objects

    Motion estimation has been dominated by time domain methods such as block matching and optical flow. However, these methods have problems with multiple moving objects in the video scene, moving backgrounds, noise, and fractional pixel/frame motion. This dissertation proposes a frequency domain method (FDM) that addresses these problems. The methodology introduced here handles multiple moving objects, with or without a moving background, via a 3-D frequency domain decomposition of digital video as the sum of locally translational motions (or, in the case of the background, a globally translational motion), with high noise rejection. Additionally, via a version of the chirp-Z transform, fractional pixel/frame motion is detected and quantified. Furthermore, images of particular moving objects can be extracted and reconstructed from the frequency domain. Finally, this method can be integrated into a larger system to support motion analysis. The method presented here has been tested with synthetic data, realistic high-fidelity simulations, and actual data from established video archives to verify the claims made for the method, all presented here. In addition, a convincing comparison with an up-and-coming spatial domain method, incremental principal component pursuit (iPCP), is presented, in which the FDM performs markedly better than its competition.
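
    The dissertation's FDM is not reproduced here; the sketch below only shows the related, standard phase-correlation idea for recovering a single global translation in the frequency domain. The frame size and the epsilon guard are arbitrary choices.

```python
import numpy as np

def phase_correlation_shift(frame_a, frame_b):
    """Estimate the integer (dy, dx) translation between two equally sized
    grayscale frames from the phase of their cross-power spectrum."""
    cross_power = np.fft.fft2(frame_a) * np.conj(np.fft.fft2(frame_b))
    cross_power /= np.abs(cross_power) + 1e-12        # keep phase only
    correlation = np.fft.ifft2(cross_power).real
    peak = np.unravel_index(np.argmax(correlation), correlation.shape)
    # Peaks past the midpoint correspond to negative shifts (FFT wrap-around).
    return tuple(p if p <= s // 2 else p - s for p, s in zip(peak, correlation.shape))

frame_a = np.random.rand(64, 64)
frame_b = np.roll(frame_a, shift=(5, -3), axis=(0, 1))  # frame_a translated by (5, -3)
print(phase_correlation_shift(frame_b, frame_a))        # recovers (5, -3)
```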

    Finding Favourite Tuples on Data Streams with Provably Few Comparisons

    One of the most fundamental tasks in data science is to assist a user with unknown preferences in finding high-utility tuples within a large database. To accurately elicit the unknown user preferences, a widely adopted approach is to ask the user to compare pairs of tuples. In this paper, we study the problem of identifying one or more high-utility tuples by adaptively receiving user input on a minimum number of pairwise comparisons. We devise a single-pass streaming algorithm, which processes each tuple in the stream at most once, while ensuring that the memory size and the number of requested comparisons are in the worst case logarithmic in n, where n is the number of all tuples. An important variant of the problem, which can help to reduce human error in comparisons, is to allow users to declare ties when confronted with pairs of tuples of nearly equal utility. We show that the theoretical guarantees of our method can be maintained for this important problem variant. In addition, we show how to enhance existing pruning techniques in the literature by leveraging powerful tools from mathematical programming. Finally, we systematically evaluate all proposed algorithms over both synthetic and real-life datasets, examine their scalability, and demonstrate their superior performance over existing methods.
    Comment: To appear in KDD 202
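
    The paper's logarithmic-comparison algorithm is not reproduced here; the sketch below is only a naive single-pass baseline with a user-comparison oracle, useful to fix the problem setting (the `prefers` callback stands in for an interactive pairwise query).

```python
from typing import Callable, Iterable, Optional, TypeVar

T = TypeVar("T")

def streaming_favourite(stream: Iterable[T],
                        prefers: Callable[[T, T], bool]) -> Optional[T]:
    """Single-pass baseline: keep one candidate and ask the user to compare it
    with each incoming tuple. Memory is O(1), but the number of comparisons is
    linear in the stream length; the paper's algorithm brings the requested
    comparisons down to worst-case logarithmic in n."""
    favourite: Optional[T] = None
    for item in stream:
        if favourite is None or prefers(item, favourite):
            favourite = item
    return favourite

# Hypothetical oracle: stands in for an interactive pairwise query to the user.
prefers = lambda a, b: sum(a) > sum(b)
print(streaming_favourite([(3, 40), (7, 10), (5, 90)], prefers))   # -> (5, 90)
```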

    Risk estimation by maximizing area under receiver operating characteristics curve with application to cardiovascular surgery

    Ankara: The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2010. Thesis (Master's) -- Bilkent University, 2010. Includes bibliographical references (leaves 56-64).
    Risks exist in many different domains; medical diagnoses, financial markets, fraud detection, and insurance policies are some examples. Various risk measures and risk estimation systems have hitherto been proposed, and this thesis suggests a new risk estimation method. Risk estimation by maximizing the area under a Receiver Operating Characteristics (ROC) curve (REMARC) defines risk estimation as a ranking problem. Since the area under the ROC curve (AUC) is related to measuring the quality of a ranking, REMARC aims to maximize the AUC value on a single-feature basis to obtain the best ranking possible on each feature. For a given categorical feature, we prove a sufficient condition that any function must satisfy to achieve the maximum AUC. Continuous features are also discretized by a method that uses AUC as a metric. Then, a heuristic is used to extend this maximization to all features of a dataset. REMARC can handle missing data, binary classes, and continuous and nominal feature values. The REMARC method does not only estimate a single risk value but also analyzes each feature and provides valuable information to domain experts for decision making. The performance of REMARC is evaluated on many datasets from the UCI repository against different state-of-the-art algorithms such as Support Vector Machines, naïve Bayes, decision trees, and boosting methods. Evaluations of the AUC metric show that REMARC achieves predictive performance significantly better than other machine learning classification methods and is also faster than most of them. To develop a new risk estimation framework using the REMARC method, the cardiovascular surgery domain was selected. The TurkoSCORE project was used to collect data for the training phase of the REMARC algorithm. The predictive performance of REMARC was compared with one of the most popular cardiovascular surgical risk evaluation methods, EuroSCORE. EuroSCORE was evaluated on Turkish patients, and it is shown that the EuroSCORE model is insufficient for the Turkish population. Then, the predictive performances of EuroSCORE and TurkoSCORE, which uses REMARC for prediction, were compared. Empirical evaluations show that REMARC achieves better prediction than EuroSCORE on the Turkish patient population.
    Kurtcephe, Murat. M.S.
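
    A minimal sketch of the single-feature AUC that REMARC maximizes, computed via the Mann-Whitney rank-sum identity; the implementation below is illustrative and not taken from the thesis.

```python
import numpy as np
from scipy.stats import rankdata

def single_feature_auc(feature_values, labels):
    """AUC of the ranking induced by a single feature, via the rank-sum identity
    AUC = (R_pos - n_pos*(n_pos+1)/2) / (n_pos*n_neg), where R_pos is the rank
    sum of the positive instances (average ranks handle ties)."""
    ranks = rankdata(feature_values)          # ascending ranks, ties averaged
    labels = np.asarray(labels, dtype=int)
    n_pos = labels.sum()
    n_neg = labels.size - n_pos
    rank_sum_pos = ranks[labels == 1].sum()
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

# Higher feature values coincide with the positive class, so the AUC is 1.0.
print(single_feature_auc([0.2, 0.8, 0.4, 0.9], [0, 1, 0, 1]))
```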

    Mining Predictive Patterns and Extension to Multivariate Temporal Data

    An important goal of knowledge discovery is the search for patterns in the data that can help explain its underlying structure. To be practically useful, the discovered patterns should be novel (unexpected) and easy for humans to understand. In this thesis, we study the problem of mining patterns (defining subpopulations of data instances) that are important for predicting and explaining a specific outcome variable. An example is the task of identifying groups of patients that respond better to a certain treatment than the rest of the patients. We propose and present efficient methods for mining predictive patterns for both atemporal and temporal (time series) data. Our first method relies on frequent pattern mining to explore the search space. It applies a novel evaluation technique for extracting a small set of frequent patterns that are highly predictive and have low redundancy. We show the benefits of this method on several synthetic and public datasets. Our temporal pattern mining method works on complex multivariate temporal data, such as electronic health records, for the event detection task. It first converts time series into time-interval sequences of temporal abstractions and then mines temporal patterns backwards in time, starting from patterns related to the most recent observations. We show the benefits of our temporal pattern mining method on two real-world clinical tasks.
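
    A toy illustration (not the thesis method) of mining patterns predictive of a binary outcome: enumerate short attribute-value patterns and keep those whose support and lift clear a threshold. The thresholds, record format, and example data are assumptions; the thesis adds a redundancy-aware evaluation step on top of frequent pattern mining.

```python
from collections import Counter
from itertools import combinations

def mine_predictive_patterns(records, outcomes, min_support=0.25, min_lift=1.2, max_len=2):
    """Enumerate attribute-value patterns of size <= max_len and keep those whose
    support and lift towards the positive outcome clear the thresholds."""
    n = len(records)
    baseline = sum(outcomes) / n
    if baseline == 0:
        return []
    kept = []
    for size in range(1, max_len + 1):
        counts, positives = Counter(), Counter()
        for record, outcome in zip(records, outcomes):
            for pattern in combinations(sorted(record.items()), size):
                counts[pattern] += 1
                positives[pattern] += outcome
        for pattern, count in counts.items():
            support = count / n
            confidence = positives[pattern] / count
            if support >= min_support and confidence / baseline >= min_lift:
                kept.append((pattern, round(support, 2), round(confidence, 2)))
    return sorted(kept, key=lambda item: -item[2])

records = [{"smoker": "yes", "age": "60+"}, {"smoker": "no", "age": "40-60"},
           {"smoker": "yes", "age": "40-60"}, {"smoker": "yes", "age": "60+"}]
print(mine_predictive_patterns(records, outcomes=[1, 0, 1, 1]))
```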

    A Metric for Measuring Customer Turnover Prediction Models

    Interest in data mining techniques has increased tremendously during the past decades, and numerous classification techniques have been applied in a wide range of business applications. Hence, the need for adequate performance measures has become more important than ever. In this application, a cost-benefit analysis framework is formalized in order to define performance measures that are aligned with the main objective of the end users, i.e., profit maximization. A new performance measure is defined: the expected maximum profit criterion. This general framework is then applied to the customer churn problem with its particular cost-benefit structure. The advantage of this approach is that it assists companies in selecting the classifier that maximizes the profit. Moreover, it aids the practical implementation in the sense that it provides guidance about the fraction of the customer base to be included in the retention campaign.
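
    A hedged sketch of the underlying idea: rank customers by predicted churn score, sweep the targeted fraction, and keep the fraction with the highest expected profit. The benefit, contact-cost, and acceptance-rate parameters are illustrative stand-ins, not the expected maximum profit criterion's actual cost-benefit model.

```python
import numpy as np

def maximum_profit(scores, is_churner, clv_benefit=200.0, contact_cost=10.0,
                   accept_rate=0.3):
    """Rank customers by predicted churn score, sweep the targeted fraction, and
    return the best profit together with that fraction. The parameters are
    illustrative stand-ins for a real cost-benefit model."""
    order = np.argsort(-np.asarray(scores, dtype=float))
    churn_sorted = np.asarray(is_churner)[order]
    n = len(churn_sorted)
    best_profit, best_fraction = 0.0, 0.0
    for k in range(1, n + 1):
        churners_reached = churn_sorted[:k].sum()
        profit = accept_rate * clv_benefit * churners_reached - contact_cost * k
        if profit > best_profit:
            best_profit, best_fraction = profit, k / n
    return best_profit, best_fraction

print(maximum_profit(scores=[0.9, 0.2, 0.75, 0.4, 0.1], is_churner=[1, 0, 1, 0, 0]))
# -> (100.0, 0.4): contact the top 40% of customers by predicted churn risk
```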

    Novel Approaches to Pervasive and Remote Sensing in Cardiovascular Disease Assessment

    Cardiovascular diseases (CVDs) are the leading cause of death worldwide, responsible for 45% of all deaths. Nevertheless, their mortality has been decreasing over the last decade due to better prevention, diagnosis, and treatment resources. An important medical instrument for the latter processes is the Electrocardiogram (ECG). The ECG is a versatile technique used worldwide for its ease of use, low cost, and accessibility, having evolved from devices that filled up a room to small patches or wrist-worn devices. Such evolution has allowed for more pervasive and near-continuous recordings. The analysis of an ECG allows for studying the functioning of other physiological systems of the body. One such system is the Autonomic Nervous System (ANS), responsible for controlling key bodily functions. The ANS can be studied by analyzing the characteristic inter-beat variations, known as Heart Rate Variability (HRV). Leveraging this relation, a pilot study was developed in which HRV was used to quantify the contribution of the ANS in modulating the cardioprotection offered by an experimental medical procedure called Remote Ischemic Conditioning (RIC), offering a more objective perspective. To record an ECG, electrodes are responsible for converting the ion-propagated action potential into the electrons needed for the recording. They are produced from different materials, including metals, carbon-based materials, or polymers. They can also be divided into wet (if an electrolyte gel is used) or dry (if no added electrolyte is used). Electrodes can be positioned inside the body (in-the-person), attached to the skin (on-the-body), or embedded in daily life objects (off-the-person), with the latter allowing for more pervasive recordings. To this effect, a novel mobile acquisition device for recording ECG rhythm strips was developed, where polymer-based embedded electrodes are used to record ECG signals similar to those of a medical-grade device. One drawback of off-the-person solutions is the increased noise, mainly caused by intermittent contact with the recording surfaces. A new signal quality metric was developed based on delayed phase mapping, a technique that maps time series to a two-dimensional space, which is then used to classify a segment as good or noisy. Two different approaches were developed, one using a popular image descriptor, the Hu image moments, and the other using a Convolutional Neural Network, both with promising results for their usage as signal quality index classifiers.
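
    A minimal sketch of the delayed phase mapping step described above: a time-delay embedding of a segment rasterised into a small image, which would then feed an image descriptor or a CNN classifier. The delay of 20 samples and the 64x64 resolution are illustrative choices, not the thesis parameters.

```python
import numpy as np

def delayed_phase_map(segment, delay=20, bins=64):
    """Rasterise the delayed phase map of a 1-D segment: pairs (x[t], x[t + delay])
    accumulated into a bins x bins image, normalised to [0, 1]."""
    segment = np.asarray(segment, dtype=float)
    image, _, _ = np.histogram2d(segment[:-delay], segment[delay:], bins=bins)
    return image / image.max() if image.max() > 0 else image

segment = np.sin(np.linspace(0, 4 * np.pi, 2000))   # stand-in for an ECG strip
print(delayed_phase_map(segment).shape)             # -> (64, 64)
```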