33 research outputs found

    Improving pattern tracking with a language-aware tree differencing algorithm

    Tracking code fragments of interest is important in monitoring a software project over multiple versions. Various approaches, including our previous work on Herodotos, exploit the notion of Longest Common Subsequence, as computed by readily available tools such as GNU Diff, to map corresponding code fragments. However, efficient code differencing algorithms are typically line-based or word-based, and thus do not report changes at the level of language constructs. Furthermore, they identify only additions and removals, not the moving of a block of code from one part of a file to another. Code fragments of interest that fall within the added and removed regions of code have to be manually correlated across versions, which is tedious and error-prone. When studying a very large code base over a long time, the number of manual correlations can become an obstacle to the success of a study. In this paper, we investigate the effect of replacing the line-based algorithm currently used by Herodotos with tree matching, as provided by the algorithm of the differencing tool GumTree. In contrast to the line-based approach, the tree-based approach requires no manual correlations, but it incurs a high execution time. To address this problem, we propose a hybrid strategy that combines the best of both approaches.
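    As a rough illustration of the LCS-based mapping that line-based tools such as GNU Diff provide, the following Python sketch uses the standard difflib module to map the line number of a tracked fragment from one file version to the next. The file names and fragment position are hypothetical; the point is only to show where the "manual correlation" cases arise, namely when a tracked line falls inside a replaced or removed region.

```python
import difflib

def map_line(old_lines, new_lines, old_lineno):
    """Map a 0-based line number from the old version to the new version
    using difflib's longest-common-subsequence-style matching.
    Returns None when the line falls inside a changed region, i.e. the
    case that a line-based approach leaves for manual correlation."""
    matcher = difflib.SequenceMatcher(None, old_lines, new_lines)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal" and i1 <= old_lineno < i2:
            return j1 + (old_lineno - i1)
    return None

# Hypothetical usage on two versions of a file:
old = open("driver_v1.c").read().splitlines()
new = open("driver_v2.c").read().splitlines()
print(map_line(old, new, 42))  # position of the tracked fragment in v2, or None
```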

    Faults in Linux 2.6

    In August 2011, Linux entered its third decade. Ten years earlier, Chou et al. published a study of faults found by applying a static analyzer to Linux versions 1.0 through 2.4.1. A major result of their work was that the drivers directory contained up to 7 times more of certain kinds of faults than other directories. This result inspired numerous efforts to improve the reliability of driver code. Today, Linux is used in a wider range of environments, provides a wider range of services, and has adopted a new development and release model. What has been the impact of these changes on code quality? To answer this question, we have transported Chou et al.'s experiments to all versions of Linux 2.6, released between 2003 and 2011. We find that Linux has more than doubled in size during this period, but the number of faults per line of code has been decreasing. Moreover, the fault rate of drivers is now below that of other directories, such as arch. These results can guide further development and research efforts for the decade to come. To allow these results to be updated as Linux evolves, we define our experimental protocol and make our checkers available.
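    The fault-rate comparison described above amounts to normalising fault counts by code size per directory. The short sketch below shows that computation; the counts and line totals are invented for illustration and are not the study's data.

```python
# Illustrative only: fault counts and line counts are made up, not the study's results.
faults = {"drivers": 510, "arch": 290, "fs": 120}
loc    = {"drivers": 4_800_000, "arch": 1_500_000, "fs": 700_000}

for directory in faults:
    rate = faults[directory] / loc[directory] * 1000  # faults per thousand lines of code
    print(f"{directory:8s} {rate:.3f} faults/KLOC")
```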

    Herodotos: A Tool to Expose Bugs' Lives

    Software is continually evolving, to improve performance, correct errors, and add new features. Code modifications, however, inevitably lead to the introduction of defects. To prevent the introduction of defects, one has to understand why they occur. Thus, it is important to develop tools and practices that aid in finding, tracking, and preventing defects. In this paper, we propose a methodology and an associated tool, Herodotos, to study defects over time. Herodotos semi-automatically tracks defects over multiple versions of a software project, independently of other changes in the source files. It builds a graphical history of each defect and produces statistics based on the results. We have evaluated this approach on the history of a representative range of open source projects over the last three years. For each project, we explore several kinds of defects found by static code analysis. We analyze the generated results to compare the selected software projects and defect kinds.

    The Family of MapReduce and Large Scale Data Processing Systems

    In the last two decades, the continuous increase of computational power has produced an overwhelming flow of data, which has called for a paradigm shift in computing architectures and large-scale data processing mechanisms. MapReduce is a simple and powerful programming model that enables the easy development of scalable parallel applications to process vast amounts of data on large clusters of commodity machines. It isolates the application from the details of running a distributed program, such as data distribution, scheduling, and fault tolerance. However, the original implementation of the MapReduce framework had some limitations that have been tackled by many research efforts in follow-up work since its introduction. This article provides a comprehensive survey of a family of approaches and mechanisms for large-scale data processing that have been implemented based on the original idea of the MapReduce framework and are currently gaining momentum in both the research and industrial communities. We also cover a set of systems that provide declarative programming interfaces on top of the MapReduce framework. In addition, we review several large-scale data processing systems that resemble some of the ideas of the MapReduce framework for different purposes and application scenarios. Finally, we discuss some future research directions for implementing the next generation of MapReduce-like solutions.
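    To make the programming model concrete, here is a minimal word-count sketch in Python: the user supplies only a map function and a reduce function, and the framework, simulated here by plain in-process loops, handles data distribution, scheduling, and fault tolerance. This illustrates the model itself, not any particular MapReduce implementation.

```python
from collections import defaultdict

# User-supplied logic: the essence of the MapReduce programming model.
def map_fn(_key, line):
    for word in line.split():
        yield word, 1

def reduce_fn(word, counts):
    yield word, sum(counts)

# A single-process stand-in for the framework, which on a real cluster would
# shuffle the intermediate pairs across many commodity machines.
def run_mapreduce(records, map_fn, reduce_fn):
    groups = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            groups[k].append(v)            # shuffle/group by intermediate key
    return [out for k, vs in sorted(groups.items()) for out in reduce_fn(k, vs)]

lines = enumerate(["to be or not to be", "to map and to reduce"])
print(run_mapreduce(lines, map_fn, reduce_fn))
```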

    A Survey on Automatic Parameter Tuning for Big Data Processing Systems

    Big data processing systems (e.g., Hadoop, Spark, Storm) contain a vast number of configuration parameters controlling parallelism, I/O behavior, memory settings, and compression. Improper parameter settings can cause significant performance degradation and stability issues. However, regular users and even expert administrators struggle to understand and tune them to achieve good performance. We investigate existing approaches to parameter tuning for both batch and stream data processing systems and classify them into six categories: rule-based, cost modeling, simulation-based, experiment-driven, machine learning, and adaptive tuning. We summarize the pros and cons of each approach and raise some open research problems for automatic parameter tuning.
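    As a flavour of the simplest category above, rule-based tuning, the sketch below derives a few Hadoop MapReduce settings from the input size using fixed rules of thumb. The parameter names are standard Hadoop configuration keys, but the thresholds and the rules themselves are invented for illustration rather than taken from any particular tuning advisor.

```python
def rule_based_tuning(input_bytes, cluster_reduce_slots):
    """Toy rule-based tuning: derive settings from simple heuristics.
    The rules and thresholds are illustrative only."""
    GB = 1 << 30
    conf = {}
    # Rule 1: roughly one reducer per 1 GB of input, capped by cluster capacity.
    conf["mapreduce.job.reduces"] = min(max(1, input_bytes // GB),
                                        cluster_reduce_slots)
    # Rule 2: enable map-output compression for large inputs to cut shuffle I/O.
    conf["mapreduce.map.output.compress"] = input_bytes > 10 * GB
    # Rule 3: give map tasks a larger sort buffer when inputs are large.
    conf["mapreduce.task.io.sort.mb"] = 512 if input_bytes > 10 * GB else 100
    return conf

print(rule_based_tuning(input_bytes=64 * (1 << 30), cluster_reduce_slots=40))
```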

    Clang and Coccinelle: synergising program analysis tools for CERT C Secure Coding Standard certification

    Writing correct C programs is well known to be hard, not least due to the many language features intrinsic to C. Writing secure C programs is even harder and, at times, seemingly impossible. To improve on this situation, US CERT has developed and published a set of coding standards, the “CERT C Secure Coding Standard”, which (in the current version) enumerates 118 rules and 182 recommendations with the aim of making C programs (more) secure. The large number of rules and recommendations makes automated tool support essential for certifying that a given system is in compliance with the standard. In this paper we report on ongoing work on integrating two state-of-the-art analysis tools, Clang and Coccinelle, into a combined tool well suited for analysing and certifying C programs according to, e.g., the CERT C Secure Coding Standard or the MISRA (Motor Industry Software Reliability Association) C standard. We further argue that such a tool must be highly adaptable and customisable to each software project as well as to the certification rules required by a given standard. Clang is the C frontend of the LLVM compiler/virtual machine project and includes a comprehensive set of static analyses and code checkers. Coccinelle is a program transformation tool and bug finder originally developed for the Linux kernel, but it has been successfully used to find bugs in other open source projects such as WINE and OpenSSL.
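    A combined tool along these lines essentially drives both analysers over the same code base and merges their reports. The Python sketch below shows the idea with a hypothetical wrapper: `clang --analyze` and Coccinelle's `spatch --sp-file ... --dir ...` are the real command-line entry points, but the rule files, paths, and report handling are placeholders, not part of either tool.

```python
import subprocess
from pathlib import Path

def run_clang_analyzer(c_file):
    """Run Clang's static analyser on one file; diagnostics appear on stderr."""
    res = subprocess.run(["clang", "--analyze", str(c_file)],
                         capture_output=True, text=True)
    return res.stderr

def run_coccinelle(rule_file, source_dir):
    """Run one Coccinelle semantic patch (e.g. a patch encoding a CERT C rule)."""
    res = subprocess.run(["spatch", "--sp-file", str(rule_file), "--dir", str(source_dir)],
                         capture_output=True, text=True)
    return res.stdout

def certify(source_dir, cocci_rules):
    """Collect findings from both tools into one combined report."""
    report = []
    for c_file in Path(source_dir).rglob("*.c"):
        report.append(("clang", str(c_file), run_clang_analyzer(c_file)))
    for rule in cocci_rules:
        report.append(("coccinelle", rule, run_coccinelle(rule, source_dir)))
    return report

# Hypothetical usage: semantic-patch files encoding individual CERT C rules.
findings = certify("src/", ["rules/mem35-c.cocci", "rules/fio45-c.cocci"])
```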

    Variability Anomalies in Software Product Lines

    Software Product Lines (SPLs) allow variants of a software system to be generated based on the configuration selected by the user. In this thesis, we focus on C-based software systems with build-time variability using a build system and the C preprocessor. Such systems usually consist of a configuration space, a code space, and a build space. The configuration space describes the features that the user can select and any configuration constraints between them. The features and the constraints between them are commonly documented in a variability model. The code and build spaces contain the actual implementation of the system, where the former contains the C code files with conditional compilation directives (e.g., #ifdefs), and the latter contains the build scripts with conditionally compiled files. We study the relationship between the three spaces as follows: (1) we detect variability anomalies which arise due to inconsistencies among the three spaces, and (2) we use anomaly detection techniques to automatically extract configuration constraints from the implementation. For (1), we complement previous research, which mainly focused on the relationship between the configuration space and the code space. We additionally analyze the build space to ensure that the constraints in all three spaces are consistent. We detect inconsistencies, which we call variability anomalies, in particular dead and undead artifacts. Dead artifacts are conditional artifacts which are not included in any valid configuration, while undead artifacts are those which are always included. We look for such anomalies at both the code block and source file levels, using the Linux kernel as a case study. Our work shows that almost half of the configurable features are only used to control source file compilation in Linux's build system, KBUILD. We analyze KBUILD to extract file presence conditions, which determine under which feature combinations each file is compiled. We show that, by considering the build system, we can detect an additional 20% of variability anomalies at the code block level compared to using only the configuration and code spaces. Our work also shows that file-level anomalies occur less frequently than block-level ones. We analyze the evolution of the detected anomalies and identify some of their causes and fixes. For (2), we develop novel analyses to automatically extract configuration constraints from the implementation and compare them to those in existing variability models. We rely on two means of detecting variability anomalies: (a) conditional build-time errors and (b) detecting under which conditions a feature has an effect on the compiled code (to avoid duplicate variants). We apply this to four real-world systems: uClibc, BusyBox, eCos, and the Linux kernel. We show that our extraction is 93% and 77% accurate, respectively, for the two means we use, and that we can recover 19% of the existing variability-model constraints using our approach. We qualitatively investigate the non-recovered constraints and find that many of them stem from domain knowledge. For systems with existing variability models, understanding where each constraint comes from can aid in traceability between the code and the model, which can help in debugging conflicts. More importantly, in systems which do not have a formal variability model, automatically extracting constraints from code provides the basis for reverse engineering a variability model.
Overall, we provide tools and techniques to help maintain and create software product lines. Our work helps ensure the consistency of variability constraints scattered across SPLs and provides tools to help reverse engineer variability models.
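    The dead/undead check described above can be phrased as a small satisfiability question over the features: a block is dead if its presence condition is false in every configuration allowed by the variability model, and undead if it is true in all of them. The sketch below illustrates this with invented features, an invented model constraint, and exhaustive enumeration standing in for a SAT solver.

```python
from itertools import product

FEATURES = ["USB", "USB_STORAGE", "PM"]

# Variability-model constraint (invented): USB_STORAGE depends on USB.
def model_ok(cfg):
    return not cfg["USB_STORAGE"] or cfg["USB"]

# Presence conditions of two conditional code blocks (invented examples).
blocks = {
    "block_a": lambda cfg: cfg["USB_STORAGE"] and not cfg["USB"],  # contradicts the model -> dead
    "block_b": lambda cfg: cfg["USB"] or not cfg["USB"],           # always compiled -> undead
}

def classify(presence):
    configs = [dict(zip(FEATURES, bits))
               for bits in product([False, True], repeat=len(FEATURES))]
    valid = [cfg for cfg in configs if model_ok(cfg)]       # configurations the model allows
    values = [presence(cfg) for cfg in valid]
    if not any(values):
        return "dead"
    if all(values):
        return "undead"
    return "ok"

for name, condition in blocks.items():
    print(name, classify(condition))
```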

    Automatic physical layer tuning of MapReduce-based query processing engines

    Advisor: Eduardo Cunha de Almeida. Doctoral thesis, Universidade Federal do Paraná, Setor de Ciências Exatas, Programa de Pós-Graduação em Informática. Defense: Curitiba, 29/06/2020. Includes references: p. 98-109. Area of concentration: Computer Science.
    Abstract: The increasing need to process large amounts of semi-structured and unstructured data has led to the development of specialized processing engines like MapReduce. MapReduce is a programming model designed to process large-scale semi-structured data in a distributed and parallel fashion. SQL-on-Hadoop systems are SQL-like interfaces built on top of MapReduce processing engines to query semi-structured data at large scale. However, the number of computing nodes, the number of systems in the software stack, and the controlling mechanisms provided by MapReduce engines increase the complexity and the operational costs of maintaining a large SQL-on-Hadoop cluster. Increasing the performance of such engines is a key factor that can be achieved by delegating the right amount of physical resources to their jobs. Yet, regular users and even expert administrators struggle to understand and tune MapReduce jobs to achieve good performance. This skill gap has given rise to a successful line of research on automatically tuning MapReduce parameters, originating several tuning advisors.
    Yet, the problem of automatically tuning SQL-on-Hadoop queries remains largely unexplored, as the current approach of applying MapReduce tuning advisors directly to SQL-on-Hadoop queries entails a number of problems. For instance, the Hive SQL-on-Hadoop engine compiles HiveQL queries into a workflow of MapReduce jobs, and it would be straightforward to assume that, by tuning the underlying Hadoop processing engine, HiveQL queries would benefit as well. However, this assumption does not hold when existing tuning advisors are naively applied to HiveQL queries, due to the design choices of Hive, Hadoop, and the tuning advisors. This thesis addresses the question of how to properly tune SQL-on-Hadoop queries. By "properly" we mean that, when tuning SQL-on-Hadoop queries, the generation of the tuning setups has to consider several characteristics that are only present in jobs generated by SQL-on-Hadoop systems. These characteristics include: (i) at the level of individual queries, all MapReduce jobs that constitute a query plan are executed with identical configuration settings; (ii) although profiling and search heuristics are performed on a per-job basis to generate tuning setups, only one tuning setup is applied to the query and the remaining tuning setups are simply discarded; (iii) Hadoop tuning advisors treat the MapReduce functions as black boxes and make simplifying modeling assumptions that may hold for classical MapReduce jobs (Sort, Grep) but do not hold for SQL-like queries such as HiveQL, where jobs contain multiple relational algebra operators like joins and aggregations. We extended the Hive query processor to tune SQL-on-Hadoop queries. This extension comprises an approach called non-uniform tuning that gives SQL-on-Hadoop systems fine-grained control over query tuning, where each job receives a specialized tuning setup. We present a conceptual model, called code signature, that uses static information available before execution to match jobs with similar resource consumption patterns. We also present a tuning cache that stores tuning setups, generated by third-party tuning advisors, and recycles them between jobs that have similar resource consumption. The extension works as a single solution for the automatic tuning of SQL-on-Hadoop queries.
    To validate our solution, we conducted an experimental study focused on Hive over Hadoop because (i) Hive is a good representative of native SQL-on-Hadoop systems (as System R was for relational database systems); (ii) both Hive and Hadoop are highly popular for analytical processing; and (iii) Hadoop parameter tuning has been studied extensively in recent years. To populate the tuning cache, we employ Starfish, the first cost-based tuning advisor for finding (near-)optimal configuration parameter settings and the only tuning advisor publicly available for academic research purposes. In our experiments, queries optimized with our tuning approach consistently achieved speedups of up to 25%, in contrast with the current approach, which degraded performance on several occasions. Specifically, the current tuning approach can cause variations in execution time between -171% and 27% relative to the default configuration. Most importantly, our tuning method leads to considerably better resource utilization, decreasing CPU usage and memory paging by over 40%; it also reduces the total amount of data written to disk by 5×. Our tuning approach includes a tuning cache used to avoid re-profiling similar jobs; the cache reduced profiling by 50% for TPC-H queries, enabling upfront tuning of ad-hoc queries. Keywords: Physical-layer tuning. MapReduce query processing. SQL-on-Hadoop.
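    A minimal sketch of the tuning-cache idea described in the abstract: jobs are keyed by a code signature built from static, pre-execution information, and a tuning setup produced once by an external advisor is recycled for later jobs with the same signature. Everything here, including the signature fields, the advisor stub, and the setup contents, is a simplified, hypothetical rendering of the approach rather than the thesis's implementation.

```python
import hashlib

def code_signature(job):
    """Hypothetical signature from static information available before execution,
    e.g. the relational operators in the job and its input format."""
    material = "|".join(sorted(job["operators"])) + "|" + job["input_format"]
    return hashlib.sha1(material.encode()).hexdigest()

class TuningCache:
    def __init__(self, advisor):
        self.advisor = advisor      # e.g. a wrapper around an external cost-based tuner
        self.setups = {}            # signature -> tuning setup (dict of parameters)

    def setup_for(self, job):
        sig = code_signature(job)
        if sig not in self.setups:  # profile and search only once per signature
            self.setups[sig] = self.advisor(job)
        return self.setups[sig]     # recycled for jobs with a matching signature

# Stub standing in for an external tuning advisor such as Starfish.
def stub_advisor(job):
    return {"mapreduce.task.io.sort.mb": 256, "mapreduce.job.reduces": 8}

cache = TuningCache(stub_advisor)
job1 = {"operators": ["scan", "join", "aggregate"], "input_format": "orc"}
job2 = {"operators": ["join", "scan", "aggregate"], "input_format": "orc"}
assert cache.setup_for(job1) is cache.setup_for(job2)  # same signature, setup reused
```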

    Data Infrastructure for Medical Research

    While we are witnessing rapid growth in data across the sciences and in many applications, this growth is particularly remarkable in the medical domain, be it because of higher-resolution instruments and diagnostic tools (e.g., MRI), new sources of structured data like activity trackers, the widespread use of electronic health records, or many other sources. The sheer volume of the data is not, however, the only challenge to be faced when using medical data for research. Other crucial challenges include data heterogeneity, data quality, and data privacy. In this article, we review solutions addressing these challenges by discussing the current state of the art in the areas of data integration, data cleaning, data privacy, and scalable data access and processing in the context of medical data. The techniques and tools we present will give practitioners — computer scientists and medical researchers alike — a starting point to understand the challenges and solutions and, ultimately, to analyse medical data and gain better and quicker insights.

    Cross Domain IW Threats to SOF Maritime Missions: Implications for U.S. SOF

    As cyber vulnerabilities proliferate with the expansion of connected devices, wherein security is often forsaken for ease of use, Special Operations Forces (SOF) cannot escape the obvious, massive risk that they are assuming by incorporating emerging technologies into their toolkits. This is especially true in the maritime sector, where SOF operate nearshore in littoral zones. As SOF, in support of the U.S. Navy, increasingly operate in these contested maritime environments, they will gradually encounter more hostile actors looking to exploit digital vulnerabilities. As such, this monograph comes at an opportune time, as the world becomes more interconnected but also more vulnerable.