
    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from visited pages. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics about the web and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made crawling very challenging. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. Performing an exhaustive crawl remains a challenging problem, and capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers, and based on these criteria we plot the evolution of web crawlers and compare their performance.
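    As a rough illustration of the basic crawl loop described above, the following minimal Python sketch (an illustrative example, not any particular production crawler) maintains a frontier of URLs to visit, fetches each page, and enqueues newly discovered links; real crawlers add politeness policies (robots.txt, rate limiting), large-scale deduplication, and persistent storage:

        # Minimal breadth-first crawler sketch: take a URL from the frontier,
        # fetch it, extract links, enqueue unseen ones. Standard library only.
        from collections import deque
        from html.parser import HTMLParser
        from urllib.parse import urljoin
        from urllib.request import urlopen

        class LinkExtractor(HTMLParser):
            def __init__(self):
                super().__init__()
                self.links = []
            def handle_starttag(self, tag, attrs):
                if tag == "a":
                    for name, value in attrs:
                        if name == "href" and value:
                            self.links.append(value)

        def crawl(seed, max_pages=50):
            frontier, seen, fetched = deque([seed]), {seed}, 0
            while frontier and fetched < max_pages:
                url = frontier.popleft()
                try:
                    html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
                except Exception:
                    continue  # unreachable, non-HTML, or malformed page: skip it
                fetched += 1
                parser = LinkExtractor()
                parser.feed(html)
                for link in parser.links:
                    absolute = urljoin(url, link)  # resolve relative links
                    if absolute.startswith("http") and absolute not in seen:
                        seen.add(absolute)
                        frontier.append(absolute)
            return seen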

    A Framework for File Format Fuzzing with Genetic Algorithms

    Secure software, meaning software free from vulnerabilities, is desirable in today's marketplace. Consumers are beginning to value a product's security posture as well as its functionality. Software development companies are recognizing this trend, and they are factoring security into their entire software development lifecycle. Secure development practices like threat modeling, static analysis, safe programming libraries, run-time protections, and software verification are being mandated during product development. Mandating these practices improves a product's security posture before customer delivery, and these practices increase the difficulty of discovering and exploiting vulnerabilities. Since the 1980s, security researchers have uncovered software defects by fuzz testing an application. In fuzz testing's infancy, randomly generated data could discover multiple defects quickly. However, as software matures and software development companies integrate secure development practices into their development life cycles, fuzzers must apply more sophisticated techniques in order to retain their ability to uncover defects. Fuzz testing must evolve, and fuzz testing practitioners must devise new algorithms to exercise an application in unexpected ways. This dissertation's objective is to create a proof-of-concept genetic algorithm fuzz testing framework to exercise an application's file format parsing routines. The framework includes multiple genetic algorithm variations, provides a configuration scheme, and correlates data gathered from static and dynamic analysis to guide negative test case evolution. Experiments conducted for this dissertation illustrate the effectiveness of a genetic algorithm fuzzer in comparison to standard fuzz testing tools. The experiments showcase a genetic algorithm fuzzer's ability to discover multiple unique defects within a limited number of negative test cases. These experiments also highlight an application's increased execution time when fuzzing with a genetic algorithm. To combat increased execution time, a distributed architecture is implemented and additional experiments demonstrate a decrease in execution time comparable to standard fuzz testing tools. A final set of experiments provides guidance on fitness function selection with a CHC genetic algorithm fuzzer under different population size configurations.
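    A hedged sketch of how such a genetic-algorithm fuzzing loop can work (an illustrative Python example, not the dissertation's framework; the fitness callback is an assumption standing in for the coverage signal that the dissertation gathers from static and dynamic analysis):

        # Evolve a population of byte-string test cases: score each by a
        # fitness signal, keep the fittest as parents, breed by crossover
        # and mutation. seeds: two or more initial well-formed inputs (bytes).
        import random

        def crossover(a: bytes, b: bytes) -> bytes:
            cut = random.randint(0, min(len(a), len(b)))
            return a[:cut] + b[cut:]

        def mutate(data: bytes, rate: float = 0.01) -> bytes:
            return bytes(random.randrange(256) if random.random() < rate else x
                         for x in data)

        def evolve(seeds, fitness, generations=100, pop_size=50):
            """fitness(test_case) -> float is a hypothetical callback assumed
            to run the target on the input and return, e.g., branch coverage."""
            population = list(seeds)
            for _ in range(generations):
                scored = sorted(population, key=fitness, reverse=True)
                parents = scored[:max(2, pop_size // 5)]  # truncation selection
                population = parents + [
                    mutate(crossover(*random.sample(parents, 2)))
                    for _ in range(pop_size - len(parents))
                ]
            return max(population, key=fitness)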

    Concepts Extraction from Execution Traces

    Concept location is the task of identifying and locating the implementation of concepts in code regions. It is fundamental to program comprehension, software maintenance, and evolution. Different static, dynamic, and hybrid approaches to concept location exist in the literature. Static and dynamic approaches each have advantages and limitations, and they complement each other; therefore, recent work has focused on hybrid approaches to improve both the running time and the accuracy of the concept location process. In addition, execution traces are often overly large (in terms of method calls) and cannot be used directly by developers for program comprehension activities in general and concept location in particular. In this dissertation, we extract the set of concepts exercised in an execution trace using hybrid approaches. Indeed, during maintenance tasks, developers generally seek to identify and understand the segments of the trace that implement the concepts of interest rather than analyse the entire execution trace in depth. Concept location facilitates maintenance tasks by guiding developers towards the segments that implement the concepts to maintain, reducing the number of methods to investigate. We propose an approach built upon a dynamic programming algorithm that splits an execution trace into segments representing concepts. A segment implements exactly one concept and is defined as an ordered list of invoked methods, i.e., a part of the execution trace; a concept may be implemented by one or more segments. Then, we propose SCAN, an approach that assigns labels to the identified segments. We use information retrieval methods to extract labels consisting of a set of words that define the concept implemented by a segment. The labels give developers a global idea of the concept implemented by a segment and help them identify the segments implementing the concept to maintain. Although the segments implementing a concept are smaller than the full execution trace, some of them are still very large (in terms of method calls) and thus difficult to understand. To help developers understand a very large segment, we propose to characterise it using only the most relevant method calls. We then perform an experiment to evaluate the performance of SCAN, investigating whether participants produce different labels when provided with different amounts of information about a segment. We show that 50% or more of the terms of the labels provided by participants are preserved while drastically reducing, by up to 92%, the amount of information that participants must process to understand a segment. Finally, we study the precision and recall of the labels automatically generated by SCAN. We show that SCAN assigns labels with an average precision of 69% and an average recall of 63% when compared to manual labels produced by merging the labels of at least two participants. SCAN also identifies the relations among the segments of an execution trace. These relations provide a high-level view of the concepts implemented in a trace, allowing developers to understand its content by discovering the methods and invocations common to segments. Results show that SCAN identifies relations among segments with an overall precision greater than 75% in the majority of the programs studied. Finally, we evaluate the usefulness of automatic trace segmentation and labelling in the context of concept location, and show that SCAN supports concept location tasks when used as a standalone technique, guiding developers to the segments that implement the concepts to maintain and thus reducing the number of methods to analyse.
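    In the spirit of SCAN's IR-based labelling (an illustrative sketch, not the authors' algorithm), one can treat each segment as a document of terms split from its method names and rank the terms by tf-idf to obtain a keyword label:

        # Label each trace segment with its top-k tf-idf terms, where terms
        # come from splitting invoked method names on camel-case boundaries.
        import math
        import re
        from collections import Counter

        def terms(method_name: str):
            # split identifiers like "parseXmlFile" into ["parse", "xml", "file"]
            return [t.lower() for t in
                    re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])", method_name)]

        def label_segments(segments, k=5):
            """segments: list of segments, each a list of invoked method names."""
            docs = [Counter(t for m in seg for t in terms(m)) for seg in segments]
            n = len(docs)
            df = Counter(t for d in docs for t in d)  # document frequency
            labels = []
            for d in docs:
                score = {t: tf * math.log(n / df[t]) for t, tf in d.items()}
                labels.append([t for t, _ in
                               sorted(score.items(), key=lambda x: -x[1])[:k]])
            return labels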

    The Oracle Problem in Software Testing: A Survey

    Testing involves examining the behaviour of a system in order to discover potential faults. Given an input for a system, the challenge of distinguishing the corresponding desired, correct behaviour from potentially incorrect behaviour is called the “test oracle problem”. Test oracle automation is important to remove a current bottleneck that inhibits greater overall test automation; without it, the human has to determine whether observed behaviour is correct. The literature on test oracles has introduced techniques for oracle automation, including modelling, specifications, contract-driven development and metamorphic testing. When none of these is completely adequate, the final source of test oracle information remains the human, who may be aware of informal specifications, expectations, norms and domain-specific information that provide informal oracle guidance. All forms of test oracle, even the humble human, involve challenges of reducing cost and increasing benefit. This paper provides a comprehensive survey of current approaches to the test oracle problem and an analysis of trends in this important area of software testing research and practice.
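    Metamorphic testing, one of the oracle-automation techniques the survey covers, can be illustrated with a minimal example: rather than knowing the exact expected output of sin(x), we check a relation that must hold between two related executions:

        # Metamorphic oracle sketch: we never assert what sin(x) "should" be;
        # we assert the relation sin(pi - x) == sin(x), up to float error.
        import math
        import random

        def test_sine_metamorphic(trials=1000):
            for _ in range(trials):
                x = random.uniform(-100, 100)
                assert math.isclose(math.sin(math.pi - x), math.sin(x),
                                    abs_tol=1e-9), f"relation violated at x={x}"

        test_sine_metamorphic()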

    Managing multi-tiered suppliers in the high-tech industry

    Thesis (M. Eng. in Logistics)--Massachusetts Institute of Technology, Engineering Systems Division, 2009. Includes bibliographical references (leaves 131-135). This thesis presents a roadmap for companies to follow as they manage multi-tiered suppliers in the high-tech industry. Our research covered a host of sources including interviews and publications from various companies, consulting companies, software companies, the computer industry, trade associations, and analyst firms among others. While our review found that many companies begin supplier relationship management after sourcing events, we show that managing suppliers should start as companies form their competitive strategy. Our five-step roadmap provides a deliberate approach for companies as they build the foundation for effective and successful multi-tiered supplier relationship management. by Charles E. Frantz and Jimin Lee. M.Eng. in Logistics.

    Effective Removal of Operational Log Messages: an Application to Model Inference

    Model inference aims to extract accurate models from the execution logs of software systems. However, in reality, logs may contain some "noise" that could deteriorate the performance of model inference. One form of noise can commonly be found in system logs that contain not only transactional messages---logging the functional behavior of the system---but also operational messages---recording the operational state of the system (e.g., a periodic heartbeat to keep track of the memory usage). In low-quality logs, transactional and operational messages are randomly interleaved, leading to the erroneous inclusion of operational behaviors in a system model that ideally should reflect only the functional behavior of the system. It is therefore important to remove operational messages from the logs before inferring models. In this paper, we propose LogCleaner, a novel technique for removing operational log messages. LogCleaner first performs a periodicity analysis to filter out periodic messages, and then performs a dependency analysis to compute the degree of dependency for all log messages and remove operational messages based on their dependencies. The experimental results on two proprietary and 11 publicly available log datasets show that LogCleaner, on average, can accurately remove 98% of the operational messages and preserve 81% of the transactional messages. Furthermore, using logs pre-processed with LogCleaner decreases the execution time of model inference (with a speed-up ranging from 1.5 to 946.7 depending on the characteristics of the system) and significantly improves the accuracy of the inferred models, by increasing their ability to accept correct system behaviors (+43.8 pp on average, with pp = percentage points) and to reject incorrect system behaviors (+15.0 pp on average).
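    The periodicity-analysis step can be illustrated with a hedged sketch (not LogCleaner's actual implementation): a message template whose inter-arrival times have a low coefficient of variation is likely periodic, hence operational (e.g., a heartbeat), and can be filtered out:

        # Flag log templates with near-constant inter-arrival times as
        # operational, then drop their entries from the log.
        from collections import defaultdict
        from statistics import mean, pstdev

        def periodic_templates(entries, max_cv=0.1, min_gaps=5):
            """entries: (timestamp_seconds, template_id) pairs, sorted by time."""
            times = defaultdict(list)
            for ts, template in entries:
                times[template].append(ts)
            periodic = set()
            for template, ts in times.items():
                gaps = [b - a for a, b in zip(ts, ts[1:])]
                if len(gaps) >= min_gaps and mean(gaps) > 0:
                    cv = pstdev(gaps) / mean(gaps)  # coefficient of variation
                    if cv <= max_cv:
                        periodic.add(template)
            return periodic

        def remove_operational(entries):
            ordered = sorted(entries)
            drop = periodic_templates(ordered)
            return [(ts, t) for ts, t in ordered if t not in drop]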

    Command & Control: Understanding, Denying and Detecting - A review of malware C2 techniques, detection and defences

    In this survey, we first briefly review the current state of cyber attacks, highlighting significant recent changes in how and why such attacks are performed. We then investigate the mechanics of malware command and control (C2) establishment: we provide a comprehensive review of the techniques used by attackers to set up such a channel and to hide its presence from the attacked parties and the security tools they use. We then switch to the defensive side of the problem and review approaches that have been proposed for the detection and disruption of C2 channels. We also map such techniques to widely adopted security controls, emphasizing gaps or limitations (and success stories) in current best practices. Comment: Work commissioned by CPNI, available at c2report.org. 38 pages. Listing abstract compressed from the version appearing in the report.

    Understanding How Reverse Engineers Make Sense of Programs from Assembly Language Representations

    This dissertation develops a theory of the conceptual and procedural aspects involved in how reverse engineers make sense of executable programs. Software reverse engineering is a complex set of tasks that require a person to understand the structure and functionality of a program from its assembly language representation, typically without having access to the program's source code. This dissertation describes the reverse engineering process as a type of sensemaking, in which a person combines reasoning and information-foraging behaviors to develop a mental model of the program. The structure of the knowledge elements used in making sense of executable programs is elicited from a case study, interviews with subject matter experts, and observational studies with software reverse engineers. The results from this research can be used to improve reverse engineering tools, to develop training requirements for reverse engineers, and to develop robust computational models of human comprehension in complex tasks where sensemaking is required.