38 research outputs found

    Neuropsychological constraints to human data production on a global scale

    Get PDF
    What factors underlie human information production on a global scale? To gain insight into this question we study a corpus of 252-633 million publicly available data files on the Internet, corresponding to an overall storage volume of 284-675 terabytes. Analyzing the file size distribution for several distinct data types, we find indications that the neuropsychological capacity of the human brain to process and record information may constitute the dominant limiting factor for the overall growth of globally stored information, with real-world economic constraints having only a negligible influence. This supposition draws support from the observation that the file size distributions follow a power law for data without a time component, like images, and a log-normal distribution for multimedia files, for which time is a defining quale. (Comment: to be published in: European Physical Journal.)
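    For reference, the two distributional forms contrasted in this abstract have the standard densities below; the exponent and the log-normal parameters are fitted per data type and are not reported in this summary, so the symbols here are placeholders rather than values from the paper.

        % Power law (data without a time component, e.g. images), valid above a lower cutoff s_min:
        p(s) \propto s^{-\alpha}, \qquad s \ge s_{\min}

        % Log-normal (multimedia files, for which time sets a natural scale):
        p(s) = \frac{1}{s\,\sigma\sqrt{2\pi}} \exp\!\left( -\frac{(\ln s - \mu)^{2}}{2\sigma^{2}} \right)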

    High Performance CDR Processing with MapReduce

    Get PDF
    A call detail record (CDR) is a data record produced by telecommunication equipment, consisting of call detail transaction logs. It contains valuable information for many purposes in several domains, such as billing, fraud detection, and analytics. In the real world, however, these needs face a big data challenge: billions of CDRs are generated every day, and the processing systems are expected to deliver results in a timely manner. Because the capacity of our current production system is not enough to meet these needs, a better-performing system based on MapReduce and running on a Hadoop cluster was designed and implemented. This paper presents an analysis of the previous system and the design and implementation of the new system, called MS2, and provides empirical evidence of the efficiency and linearity of MS2. Tests have shown that MS2 reduces overhead by 44% and nearly doubles performance compared to the previous system. Benchmarks against several related large-scale data processing technologies also showed that MS2 performs better for CDR batch processing. Running on a cluster with eight CPU cores and two conventional disks, MS2 is able to process 67,000 CDRs per second.
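    The abstract does not spell out MS2's internal jobs, so the following is only a minimal sketch of the MapReduce pattern it builds on: a Hadoop Streaming-style job that aggregates call duration per caller. The pipe-delimited field layout caller|callee|duration is an assumption for the example, not the record format used by MS2.

        # Minimal Hadoop Streaming-style MapReduce sketch for CDR aggregation.
        # The layout caller|callee|duration is an assumed example, not MS2's format.
        import sys

        def mapper():
            # Emit one (caller, duration) pair per CDR line read from stdin.
            for line in sys.stdin:
                fields = line.rstrip("\n").split("|")
                if len(fields) < 3:
                    continue  # skip malformed records
                caller, duration = fields[0], fields[2]
                print(f"{caller}\t{duration}")

        def reducer():
            # Sum call durations per caller; Hadoop delivers reducer input sorted by key.
            current, total = None, 0
            for line in sys.stdin:
                key, value = line.rstrip("\n").split("\t")
                if key != current:
                    if current is not None:
                        print(f"{current}\t{total}")
                    current, total = key, 0
                total += int(value)
            if current is not None:
                print(f"{current}\t{total}")

        if __name__ == "__main__":
            # Usage: cdr_job.py map  |  cdr_job.py reduce
            (mapper if sys.argv[1] == "map" else reducer)()

    At the reported rate of 67,000 CDRs per second, such a pipeline would work through roughly 5.8 billion records per day, which is consistent with the "billions of CDRs generated every day" workload described above.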

    Wiretapping the Internet

    Full text link
    With network security threats and vulnerabilities increasing, solutions based on online detection remain attractive. A complete, durable record of all activity on a network can be used to evaluate and train intrusion detection algorithms, assist in responding to an intrusion in progress, and, if properly constructed, serve as evidence in legal proceedings. This paper describes the Advanced Packet Vault, a technology for creating such a record by collecting and securely storing all packets observed on a network, with a scalable architecture intended to support network speeds in excess of 100 Mbps. Encryption is used to preserve users' security and privacy, permitting selected traffic to be made available without revealing other traffic. The Vault implementation, based on Linux and OpenBSD, is open-source. A Vault attached to a heavily loaded 100 Mbps network must capture, process, and store about a terabyte each day, so we have to be very sensitive to the recurring cost of operation and the reliability issues of 24x7 operation. We must also be sensitive to the admissibility of information collected by the Vault in support of legal proceedings; the legal ramifications of operating a vault, particularly at a public institution; and the public perception of its use.
    http://deepblue.lib.umich.edu/bitstream/2027.42/107911/1/citi-tr-00-9.pd
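    The terabyte-per-day figure follows from the stated link speed as a back-of-the-envelope check, assuming a fully saturated 100 Mbps link and ignoring capture overhead and encryption expansion:

        100\ \text{Mbit/s} \times 86{,}400\ \text{s/day} = 8.64\times 10^{12}\ \text{bit/day}
        = 1.08\times 10^{12}\ \text{bytes/day} \approx 1\ \text{TB/day}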

    Data engineering and best practices

    Get PDF
    Bologna Master's in Data Analytics for Business. This report presents the results of a study on the current state of data engineering at LGG Advisors. Analyzing existing data, we identified several key trends and challenges facing data engineers in this field. Our key findings include a lack of standardization and best practices for data engineering processes, a growing need for more sophisticated data management, analysis, and data security tools, and a shortage of trained and experienced data engineers to meet the increasing demand for data-driven solutions. Based on these findings, we recommend several steps that LGG Advisors can take to improve its data engineering capabilities, including investing in training and education programs, adopting best practices for data management and analysis, and collaborating with other organizations to share knowledge and resources. Data security is also an essential concern for data engineers, as data breaches can have significant consequences for organizations, including financial losses, reputational damage, and regulatory penalties. In this thesis, we review and evaluate some of the best software tools for securing data in data engineering environments, discussing their key features and capabilities as well as their strengths and limitations, to help data engineers choose the best software for protecting their data. The tools considered include encryption software, access control systems, network security tools, and data backup and recovery solutions, together with best practices for implementing and managing them to ensure data security in data engineering environments. We engineer data using intuition and rules of thumb, many of which are folklore; given the rapid pace of technological change, these rules must be constantly reevaluated.
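    As a minimal illustration of the encryption-software category mentioned in this abstract, the sketch below encrypts and decrypts a data file with the third-party Python cryptography package. The file names and the choice of Fernet are assumptions made for the example, not tools the thesis prescribes.

        # Minimal sketch: symmetric encryption of a data file at rest using the
        # 'cryptography' package (pip install cryptography). File names and the
        # choice of Fernet are illustrative assumptions, not thesis recommendations.
        from cryptography.fernet import Fernet

        def encrypt_file(src: str, dst: str, key: bytes) -> None:
            # Read plaintext bytes from src and write the encrypted token to dst.
            with open(src, "rb") as f:
                token = Fernet(key).encrypt(f.read())
            with open(dst, "wb") as f:
                f.write(token)

        def decrypt_file(src: str, key: bytes) -> bytes:
            # Return the decrypted plaintext bytes.
            with open(src, "rb") as f:
                return Fernet(key).decrypt(f.read())

        if __name__ == "__main__":
            key = Fernet.generate_key()  # in practice, keep keys in a key-management system
            encrypt_file("records.csv", "records.csv.enc", key)
            print(decrypt_file("records.csv.enc", key)[:80])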

    Ingeniería de Computadores en la era del Big Data: Computación de altas prestaciones en clasificación y optimización

    Get PDF
    This paper describes the requirements that data science and big data applications place on computer architectures, and the consequences these have for the teaching of subjects in this area. As an example of this relationship between applications and computer architecture, it describes the course High Performance Computing on Classification and Optimization (Computación de Altas Prestaciones en Clasificación y Optimización), taught in the Master in Data Science and Computer Engineering of the University of Granada (Spain).
    Universidad de Granada: Departamento de Arquitectura y Tecnología de Computadores; Vicerrectorado para la Garantía de la Calidad

    Introducing distributed dynamic data-intensive (D3) science: Understanding applications and infrastructure

    Get PDF
    A common feature across many science and engineering applications is the amount and diversity of data and computation that must be integrated to yield insights. Data sets are growing larger and becoming distributed, and their location, availability and properties are often time-dependent. Collectively, these characteristics give rise to dynamic distributed data-intensive applications. While "static" data applications have received significant attention, the characteristics, requirements, and software systems for the analysis of large volumes of dynamic, distributed data, and data-intensive applications have received relatively less attention. This paper surveys several representative dynamic distributed data-intensive application scenarios, provides a common conceptual framework to understand them, and examines the infrastructure used in support of applications. (Comment: 38 pages, 2 figures.)