818 research outputs found

    ETL for data science?: A case study

    Get PDF
    Big data has driven data science development and research over the last years. However, there is a problem - most of the data science projects don't make it to production. This can happen because many data scientists don’t use a reference data science methodology. Another aggravating element is data itself, its quality and processing. The problem can be mitigated through research, progress and case studies documentation about the topic, fostering knowledge dissemination and reuse. Namely, data mining can benefit from other mature fields’ knowledge that explores similar matters, like data warehousing. To address the problem, this dissertation performs a case study about the project “IA-SI - Artificial Intelligence in Incentives Management”, which aims to improve the management of European grant funds through data mining. The key contributions of this study, to the academia and to the project’s development and success are: (1) A combined process model of the most used data mining process models and their tasks, extended with the ETL’s subsystems and other selected data warehousing best practices. (2) Application of this combined process model to the project and all its documentation. (3) Contribution to the project’s prototype implementation, regarding the data understanding and data preparation tasks. This study concludes that CRISP-DM is still a reference, as it includes all the other data mining process models’ tasks and detailed descriptions, and that its combination with the data warehousing best practices is useful to the project IA-SI and potentially to other data mining projects.A big data tem impulsionado o desenvolvimento e a pesquisa da ciência de dados nos últimos anos. No entanto, há um problema - a maioria dos projetos de ciência de dados não chega à produção. Isto pode acontecer porque muitos deles não usam uma metodologia de ciência de dados de referência. Outro elemento agravador são os próprios dados, a sua qualidade e o seu processamento. O problema pode ser mitigado através da documentação de estudos de caso, pesquisas e desenvolvimento da área, nomeadamente o reaproveitamento de conhecimento de outros campos maduros que exploram questões semelhantes, como data warehousing. Para resolver o problema, esta dissertação realiza um estudo de caso sobre o projeto “IA-SI - Inteligência Artificial na Gestão de Incentivos”, que visa melhorar a gestão dos fundos europeus de investimento através de data mining. As principais contribuições deste estudo, para a academia e para o desenvolvimento e sucesso do projeto são: (1) Um modelo de processo combinado dos modelos de processo de data mining mais usados e as suas tarefas, ampliado com os subsistemas de ETL e outras recomendadas práticas de data warehousing selecionadas. (2) Aplicação deste modelo de processo combinado ao projeto e toda a sua documentação. (3) Contribuição para a implementação do protótipo do projeto, relativamente a tarefas de compreensão e preparação de dados. Este estudo conclui que CRISP-DM ainda é uma referência, pois inclui todas as tarefas dos outros modelos de processos de data mining e descrições detalhadas e que a sua combinação com as melhores práticas de data warehousing é útil para o projeto IA-SI e potencialmente para outros projetos de data mining

    On NIS-Apriori Based Data Mining in SQL

    Get PDF
    We have proposed a framework of Rough Non-deterministic Information Analysis (RNIA) for tables with non-deterministic information, and applied RNIA to analyzing tables with uncertainty. We have also developed the RNIA software tool in Prolog and getRNIA in Python, in addition to these two tools we newly consider the RNIA software tool in SQL for handling large size data sets. This paper reports the current state of the prototype named NIS-Apriori in SQL, which will afford us more convenient environment for data analysis.International Joint Conference on Rough Sets (IJCRS 2016), October 7-11, 2016, Santiago, Chil

    Improving package recommendations through query relaxation

    Full text link
    Recommendation systems aim to identify items that are likely to be of interest to users. In many cases, users are interested in package recommendations as collections of items. For example, a dietitian may wish to derive a dietary plan as a collection of recipes that is nutritionally balanced, and a travel agent may want to produce a vacation package as a coordinated collection of travel and hotel reservations. Recent work has explored extending recommendation systems to support packages of items. These systems need to solve complex combinatorial problems, enforcing various properties and constraints defined on sets of items. Introducing constraints on packages makes recommendation queries harder to evaluate, but also harder to express: Queries that are under-specified produce too many answers, whereas queries that are over-specified frequently miss interesting solutions. In this paper, we study query relaxation techniques that target package recommendation systems. Our work offers three key insights: First, even when the original query result is not empty, relaxing constraints can produce preferable solutions. Second, a solution due to relaxation can only be preferred if it improves some property specified by the query. Third, relaxation should not treat all constraints as equals: some constraints are more important to the users than others. Our contributions are threefold: (a) we define the problem of deriving package recommendations through query relaxation, (b) we design and experimentally evaluate heuristics that relax query constraints to derive interesting packages, and (c) we present a crowd study that evaluates the sensitivity of real users to different kinds of constraints and demonstrates that query relaxation is a powerful tool in diversifying package recommendations

    CleanML: A Study for Evaluating the Impact of Data Cleaning on ML Classification Tasks

    Full text link
    Data quality affects machine learning (ML) model performances, and data scientists spend considerable amount of time on data cleaning before model training. However, to date, there does not exist a rigorous study on how exactly cleaning affects ML -- ML community usually focuses on developing ML algorithms that are robust to some particular noise types of certain distributions, while database (DB) community has been mostly studying the problem of data cleaning alone without considering how data is consumed by downstream ML analytics. We propose a CleanML study that systematically investigates the impact of data cleaning on ML classification tasks. The open-source and extensible CleanML study currently includes 14 real-world datasets with real errors, five common error types, seven different ML models, and multiple cleaning algorithms for each error type (including both commonly used algorithms in practice as well as state-of-the-art solutions in academic literature). We control the randomness in ML experiments using statistical hypothesis testing, and we also control false discovery rate in our experiments using the Benjamini-Yekutieli (BY) procedure. We analyze the results in a systematic way to derive many interesting and nontrivial observations. We also put forward multiple research directions for researchers.Comment: published in ICDE 202

    Deep Learning Data and Indexes in a Database

    Get PDF
    A database is used to store and retrieve data, which is a critical component for any software application. Databases requires configuration for efficiency, however, there are tens of configuration parameters. It is a challenging task to manually configure a database. Furthermore, a database must be reconfigured on a regular basis to keep up with newer data and workload. The goal of this thesis is to use the query workload history to autonomously configure the database and improve its performance. We achieve proposed work in four stages: (i) we develop an index recommender using deep reinforcement learning for a standalone database. We evaluated the effectiveness of our algorithm by comparing with several state-of-the-art approaches, (ii) we build a real-time index recommender that can, in real-time, dynamically create and remove indexes for better performance in response to sudden changes in the query workload, (iii) we develop a database advisor. Our advisor framework will be able to learn latent patterns from a workload. It is able to enhance a query, recommend interesting queries, and summarize a workload, (iv) we developed LinkSocial, a fast, scalable, and accurate framework to gain deeper insights from heterogeneous data

    A monitoring and threat detection system using stream processing as a virtual function for big data

    Get PDF
    The late detection of security threats causes a significant increase in the risk of irreparable damages, disabling any defense attempt. As a consequence, fast realtime threat detection is mandatory for security guarantees. In addition, Network Function Virtualization (NFV) provides new opportunities for efficient and low-cost security solutions. We propose a fast and efficient threat detection system based on stream processing and machine learning algorithms. The main contributions of this work are i) a novel monitoring threat detection system based on stream processing; ii) two datasets, first a dataset of synthetic security data containing both legitimate and malicious traffic, and the second, a week of real traffic of a telecommunications operator in Rio de Janeiro, Brazil; iii) a data pre-processing algorithm, a normalizing algorithm and an algorithm for fast feature selection based on the correlation between variables; iv) a virtualized network function in an open-source platform for providing a real-time threat detection service; v) near-optimal placement of sensors through a proposed heuristic for strategically positioning sensors in the network infrastructure, with a minimum number of sensors; and, finally, vi) a greedy algorithm that allocates on demand a sequence of virtual network functions.A detecção tardia de ameaças de segurança causa um significante aumento no risco de danos irreparáveis, impossibilitando qualquer tentativa de defesa. Como consequência, a detecção rápida de ameaças em tempo real é essencial para a administração de segurança. Além disso, A tecnologia de virtualização de funções de rede (Network Function Virtualization - NFV) oferece novas oportunidades para soluções de segurança eficazes e de baixo custo. Propomos um sistema de detecção de ameaças rápido e eficiente, baseado em algoritmos de processamento de fluxo e de aprendizado de máquina. As principais contribuições deste trabalho são: i) um novo sistema de monitoramento e detecção de ameaças baseado no processamento de fluxo; ii) dois conjuntos de dados, o primeiro ´e um conjunto de dados sintético de segurança contendo tráfego suspeito e malicioso, e o segundo corresponde a uma semana de tráfego real de um operador de telecomunicações no Rio de Janeiro, Brasil; iii) um algoritmo de pré-processamento de dados composto por um algoritmo de normalização e um algoritmo para seleção rápida de características com base na correlação entre variáveis; iv) uma função de rede virtualizada em uma plataforma de código aberto para fornecer um serviço de detecção de ameaças em tempo real; v) posicionamento quase perfeito de sensores através de uma heurística proposta para posicionamento estratégico de sensores na infraestrutura de rede, com um número mínimo de sensores; e, finalmente, vi) um algoritmo guloso que aloca sob demanda uma sequencia de funções de rede virtual

    Collaborative OLAP with Tag Clouds: Web 2.0 OLAP Formalism and Experimental Evaluation

    Full text link
    Increasingly, business projects are ephemeral. New Business Intelligence tools must support ad-lib data sources and quick perusal. Meanwhile, tag clouds are a popular community-driven visualization technique. Hence, we investigate tag-cloud views with support for OLAP operations such as roll-ups, slices, dices, clustering, and drill-downs. As a case study, we implemented an application where users can upload data and immediately navigate through its ad hoc dimensions. To support social networking, views can be easily shared and embedded in other Web sites. Algorithmically, our tag-cloud views are approximate range top-k queries over spontaneous data cubes. We present experimental evidence that iceberg cuboids provide adequate online approximations. We benchmark several browser-oblivious tag-cloud layout optimizations.Comment: Software at https://github.com/lemire/OLAPTagClou

    The Family of MapReduce and Large Scale Data Processing Systems

    Full text link
    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data which has called for a paradigm shift in the computing architecture and large scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program such as issues on data distribution, scheduling and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in several followup works after its introduction. This article provides a comprehensive survey for a family of approaches and mechanisms of large scale data processing mechanisms that have been implemented based on the original idea of the MapReduce framework and are currently gaining a lot of momentum in both research and industrial communities. We also cover a set of introduced systems that have been implemented to provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some of the future research directions for implementing the next generation of MapReduce-like solutions.Comment: arXiv admin note: text overlap with arXiv:1105.4252 by other author
    corecore