6,167 research outputs found

    A survey of outlier detection methodologies

    Get PDF
    Outlier detection has been used for centuries to detect and, where appropriate, remove anomalous observations from data. Outliers arise from mechanical faults, changes in system behaviour, fraudulent behaviour, human error, instrument error, or simply natural deviations in populations. Their detection can identify system faults and fraud before they escalate with potentially catastrophic consequences, and it can identify errors and remove their contaminating effect on the data set, thereby purifying the data for processing. The original outlier detection methods were arbitrary, but principled and systematic techniques are now used, drawn from the full gamut of computer science and statistics. In this paper, we present a survey of contemporary techniques for outlier detection. We identify their respective motivations and distinguish their advantages and disadvantages in a comparative review.
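
    To make the idea concrete, here is a minimal sketch of one classical statistical technique that surveys of this kind cover: flagging observations that lie several standard deviations from the mean. The 3.0 threshold and the synthetic data are illustrative assumptions, not drawn from the survey.

    # Minimal sketch of z-score based outlier flagging (illustrative only).
    # The threshold and the synthetic data below are assumptions.
    import numpy as np

    def zscore_outliers(values, threshold=3.0):
        """Return indices of observations more than `threshold` standard
        deviations away from the sample mean."""
        values = np.asarray(values, dtype=float)
        mean, std = values.mean(), values.std()
        if std == 0:
            return np.array([], dtype=int)
        z = np.abs((values - mean) / std)
        return np.flatnonzero(z > threshold)

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        # 1000 normal observations plus two injected anomalies.
        data = np.concatenate([rng.normal(0, 1, 1000), [8.0, -9.5]])
        print(zscore_outliers(data))  # indices of the injected points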

    Providing Diversity in K-Nearest Neighbor Query Results

    Full text link
    Given a point query Q in multi-dimensional space, K-Nearest Neighbor (KNN) queries return the K closest answers in the database with respect to Q, according to a given distance metric. In this scenario, it is possible that a majority of the answers are very similar to one another, especially when the data has clusters. For a variety of applications, such homogeneous result sets may not add value to the user. In this paper, we consider the problem of providing diversity in the results of KNN queries, that is, producing the closest result set such that each answer is sufficiently different from the rest. We first propose a user-tunable definition of diversity, and then present an algorithm, called MOTLEY, for producing a diverse result set as per this definition. Through a detailed experimental evaluation on real and synthetic data, we show that MOTLEY can produce diverse result sets by reading only a small fraction of the tuples in the database. Further, it imposes no additional overhead on the evaluation of traditional KNN queries, thereby providing a seamless interface between diversity and distance. Comment: 20 pages, 11 figures
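
    As an illustration of the general idea only (not the paper's MOTLEY algorithm), the sketch below greedily re-ranks a pool of nearest candidates, trading closeness to the query against dissimilarity to answers already selected. The pool size and the tunable weight div_weight are assumptions standing in for the paper's user-tunable diversity definition.

    # Greedy diverse-KNN sketch (illustrative; not MOTLEY itself).
    import numpy as np

    def diverse_knn(points, query, k, div_weight=0.5, pool_factor=5):
        points = np.asarray(points, dtype=float)
        query = np.asarray(query, dtype=float)
        d_query = np.linalg.norm(points - query, axis=1)
        # Restrict attention to a pool of the closest candidates, mimicking a
        # method that reads only a small fraction of the tuples.
        pool = np.argsort(d_query)[: k * pool_factor]
        selected = [pool[0]]              # start with the true nearest neighbor
        remaining = list(pool[1:])
        while len(selected) < k and remaining:
            best, best_score = None, -np.inf
            for idx in remaining:
                # Diversity term: distance to the closest already-selected answer.
                div = min(np.linalg.norm(points[idx] - points[s]) for s in selected)
                score = div_weight * div - (1 - div_weight) * d_query[idx]
                if score > best_score:
                    best, best_score = idx, score
            selected.append(best)
            remaining.remove(best)
        return selected

    # Example: two tight 2-D clusters; the diverse result set avoids returning
    # near-duplicate answers from a single cluster.
    rng = np.random.default_rng(1)
    pts = np.vstack([rng.normal([0, 0], 0.1, (50, 2)), rng.normal([3, 3], 0.1, (50, 2))])
    print(diverse_knn(pts, query=[1.5, 1.5], k=4))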

    Role based behavior analysis

    Get PDF
    Master's thesis, Information Security, Universidade de Lisboa, Faculdade de Ciências, 2009. Nowadays, the success of a corporation hinges on its agility and its ability to adapt to fast-changing conditions. Proactive workers and an agile IT/IS infrastructure that can support them are requirements for this success. Unfortunately, this is not always the case. Users' network requirements may not be fully understood, which slows down relocation and reorganization. Also, without a grasp of the real requirements, the IT/IS infrastructure may not be used efficiently, with waste in some areas and deficiencies in others. Finally, enabling proactivity does not mean full, unrestricted access, since this may leave systems vulnerable to outsider and insider threats. The purpose of the work described in this thesis is to develop a system that can characterize user network behavior. We propose a modular system architecture to extract information from tagged network flows. The process begins by creating user profiles from the information in their network flows. Then, similar profiles are automatically grouped into clusters, creating role profiles. Finally, the individual profiles are compared against the roles, and those that differ significantly are flagged as anomalies for further inspection. Considering this architecture, we propose a model to describe user and role network behavior, together with visualization methods to quickly inspect all the information contained in the model. The system and model were evaluated using a real dataset from a large telecommunications operator. The results confirm that the roles accurately capture similar behavior, and the detected anomalies were as expected given the underlying population. With the knowledge the system can extract from raw data, users' network needs can be better fulfilled and anomalous users flagged for inspection, giving an edge in agility to any company that uses it.
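
    A minimal sketch of such a pipeline, under assumed details not taken from the thesis: user profiles are plain numeric feature vectors, roles are found with k-means, and users far from their role centroid are flagged. The feature layout, the number of roles and the 95th-percentile cutoff are all assumptions.

    # Role clustering and centroid-distance anomaly flagging (illustrative).
    import numpy as np
    from sklearn.cluster import KMeans

    def find_role_anomalies(profiles, n_roles=3, percentile=95):
        profiles = np.asarray(profiles, dtype=float)
        km = KMeans(n_clusters=n_roles, n_init=10, random_state=0).fit(profiles)
        # Distance of each user profile to the centroid of its assigned role.
        dists = np.linalg.norm(profiles - km.cluster_centers_[km.labels_], axis=1)
        cutoff = np.percentile(dists, percentile)
        return km.labels_, np.flatnonzero(dists > cutoff)

    # Toy example: 60 users in three behavioral roles plus two injected outliers.
    rng = np.random.default_rng(0)
    users = np.vstack([
        rng.normal([10, 1, 0], 1.0, (20, 3)),   # e.g. heavy web traffic
        rng.normal([1, 10, 0], 1.0, (20, 3)),   # e.g. heavy mail traffic
        rng.normal([1, 1, 10], 1.0, (20, 3)),   # e.g. heavy VoIP traffic
        [[25, 25, 25], [0, 0, 30]],             # anomalous profiles
    ])
    roles, anomalies = find_role_anomalies(users)
    print(anomalies)  # indices of users flagged for further inspection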

    Customer Churn Prediction

    Get PDF
    Identifying churned customers plays an essential role in the functioning and growth of any business. It helps a business understand the reasons for churn and plan its market strategies accordingly. This research aims to develop a machine learning model that can precisely predict churned customers among the total customers of a Credit Union (CU) financial institution. Quantitative and deductive research strategies are employed to build a supervised machine learning model that addresses the class imbalance problem, handles feature selection, and efficiently predicts customer churn. The overall accuracy of the model, the Receiver Operating Characteristic (ROC) curve and the Area Under the ROC Curve are used as the evaluation metrics to identify the best classifier. A comparative study of the most popular supervised machine learning methods – Logistic Regression, Random Forest, Support Vector Machine (SVM) and Neural Network – was carried out for customer churn prediction in the CU context. In the first phase of our experiments, various feature selection techniques were studied. In the second phase, all models were applied to the imbalanced dataset and the results were evaluated. The SMOTE technique was then used to balance the data, the same models were applied to the balanced dataset, and the results were evaluated and compared. The best overall classifier was Random Forest, with almost 97% accuracy, 91% precision and 98% recall.
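
    A hedged sketch of the modelling setup the abstract describes, on synthetic data standing in for the Credit Union dataset: oversample the minority (churn) class with SMOTE on the training split only, fit a Random Forest, and report accuracy, precision, recall and ROC AUC. The dataset shape and hyper-parameters are assumptions.

    # SMOTE + Random Forest churn-style pipeline (illustrative data and settings).
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.metrics import accuracy_score, precision_score, recall_score, roc_auc_score
    from sklearn.model_selection import train_test_split
    from imblearn.over_sampling import SMOTE

    # Imbalanced toy data: roughly 10% "churned" customers.
    X, y = make_classification(n_samples=5000, n_features=20, weights=[0.9, 0.1], random_state=42)
    X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=42)

    # Balance only the training data so the test set keeps the real class ratio.
    X_bal, y_bal = SMOTE(random_state=42).fit_resample(X_train, y_train)

    clf = RandomForestClassifier(n_estimators=200, random_state=42).fit(X_bal, y_bal)
    proba = clf.predict_proba(X_test)[:, 1]
    pred = clf.predict(X_test)

    print("accuracy :", accuracy_score(y_test, pred))
    print("precision:", precision_score(y_test, pred))
    print("recall   :", recall_score(y_test, pred))
    print("ROC AUC  :", roc_auc_score(y_test, proba))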

    On Graph Stream Clustering with Side Information

    Full text link
    Graph clustering has become an important problem due to emerging applications involving the web, social networks and bio-informatics. Recently, many such applications generate data in the form of streams. Clustering massive, dynamic graph streams is significantly challenging because of the complex structure of graphs and the computational difficulty of processing continuous data. Meanwhile, a large volume of side information is associated with graphs, and it can be of various types. Examples include the properties of users in social network activities, the meta attributes associated with web click graph streams and the location information in mobile communication networks. Such attributes contain extremely useful information and have the potential to improve the clustering process, but they are neglected by most recent graph stream mining techniques. In this paper, we define a unified distance measure on both link structures and side attributes for clustering. In addition, we propose a novel optimization framework, DMO, which can dynamically optimize the distance metric and make it adapt to newly received stream data. We further introduce a carefully designed statistic, SGS(C), which consumes constant storage space as the stream progresses. We demonstrate that the statistics maintained are sufficient for the clustering process as well as the distance optimization, and that the approach scales to massive graphs with side attributes. We present experimental results to show the advantages of the approach in graph stream clustering with both links and side information over the baselines. Comment: Full version of SIAM SDM 2013 paper
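
    As a simplified illustration (not the paper's DMO framework, which optimizes the metric dynamically), the sketch below combines a structural distance between two nodes (Jaccard distance of their neighbor sets) with an attribute distance over their side information, using a fixed weight alpha that is an assumption here.

    # Unified node distance mixing link structure and side attributes (illustrative).
    import numpy as np

    def unified_distance(neighbors_u, neighbors_v, attrs_u, attrs_v, alpha=0.5):
        # Structural component: 1 - Jaccard similarity of the neighbor sets.
        union = neighbors_u | neighbors_v
        jaccard = len(neighbors_u & neighbors_v) / len(union) if union else 1.0
        structural = 1.0 - jaccard
        # Attribute component: Euclidean distance of side attributes, squashed to [0, 1).
        diff = np.linalg.norm(np.asarray(attrs_u, float) - np.asarray(attrs_v, float))
        attribute = diff / (1.0 + diff)
        return alpha * structural + (1.0 - alpha) * attribute

    # Two users in a social graph: similar neighborhoods, different side attributes.
    print(unified_distance({1, 2, 3, 4}, {2, 3, 4, 5}, [25, 0.1], [60, 0.9]))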