
    Protecting Systems From Exploits Using Language-Theoretic Security

    Any computer program processing input from the user or network must validate that input. Input-handling vulnerabilities occur when the software component responsible for filtering malicious input---the parser---does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers faithfully represent the data format, programmers often rely on parser generator or parser combinator tools to build them. This thesis advances several sub-fields of LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, the thesis comprises five parts that tackle various tenets of LangSec. First, I categorize input-handling vulnerabilities and exploits using two frameworks. The first is the mismorphisms framework, which helps us reason about the root causes leading to various vulnerabilities. The second is a categorization framework built from LangSec anti-patterns, such as parser differentials and insufficient input validation. We also built a catalog of more than 30 popular vulnerabilities to demonstrate these categorization frameworks. Second, I built parsers for several Internet of Things and power grid network protocols and for the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specification and reference implementation. Third, I present SPARTA, a novel tool I built that generates Rust code to type check Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and to various open-source PDF tools. In addition to the checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files. Fourth, I present ParseSmith, a tool for building verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for correctness. I built a verified parsing tool in Dafny that draws on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars. Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. I conducted an expert elicitation qualitative study to derive metrics for comparing the DDLs, and I systematically compare the DDLs based on the sample data descriptions available with them, checking for correctness and resilience.
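
    The recognize-before-process discipline at the heart of LangSec can be illustrated with a small parser combinator sketch. This is a hypothetical Python example written for this summary; the combinators and the toy key-value format are not the thesis's tools (such as SPARTA or ParseSmith), but they show how a grammar expressed as composable recognizers gates what the rest of the program ever sees.

```python
# Minimal parser-combinator sketch of the LangSec "recognize, then process" idea.
# The combinators and the toy KEY=VALUE line format are illustrative only.
from typing import Callable, Optional, Tuple

# A parser takes the input and a position and returns (parsed value, next position) or None.
Parser = Callable[[str, int], Optional[Tuple[object, int]]]

def char_class(allowed: str) -> Parser:
    """Recognize one or more characters drawn from `allowed`."""
    def parse(s: str, pos: int):
        end = pos
        while end < len(s) and s[end] in allowed:
            end += 1
        return (s[pos:end], end) if end > pos else None
    return parse

def literal(token: str) -> Parser:
    """Recognize an exact token."""
    def parse(s: str, pos: int):
        return (token, pos + len(token)) if s.startswith(token, pos) else None
    return parse

def sequence(*parsers: Parser) -> Parser:
    """Recognize the given parsers in order; fail if any piece fails."""
    def parse(s: str, pos: int):
        parts = []
        for p in parsers:
            result = p(s, pos)
            if result is None:
                return None
            part, pos = result
            parts.append(part)
        return (parts, pos)
    return parse

# Grammar: line ::= key "=" value ; key ::= [a-z]+ ; value ::= [0-9]+
line = sequence(char_class("abcdefghijklmnopqrstuvwxyz"),
                literal("="),
                char_class("0123456789"))

def handle(raw: str) -> int:
    """Only act on input the recognizer accepted in full (the LangSec discipline)."""
    result = line(raw, 0)
    if result is None or result[1] != len(raw):
        raise ValueError(f"rejected malformed input: {raw!r}")
    (key, _, value), _ = result
    return int(value)  # downstream logic sees only validated data

print(handle("port=8080"))        # 8080
# handle("port=80; drop table")   # would raise ValueError before any processing
```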

    An Exploratory Study of Ad Hoc Parsers in Python

    Background: Ad hoc parsers are pieces of code that use common string functions like split, trim, or slice to effectively perform parsing. Whether it is handling command-line arguments, reading configuration files, parsing custom file formats, or any number of other minor string processing tasks, ad hoc parsing is ubiquitous -- yet poorly understood. Objective: This study aims to reveal the common syntactic and semantic characteristics of ad hoc parsing code in real-world Python projects. Our goal is to understand the nature of ad hoc parsers in order to inform future program analysis efforts in this area. Method: We plan to conduct an exploratory study based on large-scale mining of open-source Python repositories from GitHub. We will use program slicing to identify program fragments related to ad hoc parsing and analyze these parsers and their surrounding contexts across 9 research questions using 25 initial syntactic and semantic metrics. Beyond descriptive statistics, we will attempt to identify common parsing patterns by cluster analysis. Comment: 5 pages, accepted as a registered report for MSR 2023 with Continuity Acceptance (CA).
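
    A concrete instance of the pattern the study targets helps fix the terminology: a handful of ordinary string operations quietly standing in for a grammar. The snippet below is a hypothetical example of such an ad hoc parser, not code drawn from the study's corpus.

```python
# A typical ad hoc parser: split/strip/partition standing in for a grammar.
# Hypothetical example of the pattern being mined, not code from the study's corpus.
def parse_config(text: str) -> dict:
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):   # skip blank lines and comments
            continue
        key, _, value = line.partition("=")    # assumes an implicit "key=value" format
        settings[key.strip()] = value.strip()
    return settings

print(parse_config("# demo\nhost = localhost\nport = 8080"))
# {'host': 'localhost', 'port': '8080'}
```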

    Automatic Generation of Input Grammars Using Symbolic Execution

    Invalid input often leads to unexpected behavior in a program and is behind a plethora of known and unknown vulnerabilities. To prevent improper input from being processed, the input needs to be validated before the rest of the program executes. Formal language theory facilitates the definition and recognition of proper inputs. We focus on the problem of defining valid input after the program has already been written. We construct a parser that infers the structure of inputs that avoid vulnerabilities, whereas existing work focuses on inferring the structure of inputs the program anticipates. We present a tool that constructs an input language, given the program as input, using symbolic execution on symbolic arguments. This differs from existing work, which tracks the execution of concrete inputs to infer a grammar. We test our tool on programs with known vulnerabilities, including programs in the GNU Coreutils library, and we demonstrate how the parser catches known invalid inputs. We conclude that the synthesis of the complete parser cannot be entirely automated, due to limitations of symbolic execution tools and issues of computability. A more comprehensive parser must additionally be informed by examples and counterexamples of the input language.
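
    The underlying idea, that the branch conditions a program imposes on a symbolic input can be read off as the productions of its input grammar, can be sketched by hand. The example below is purely illustrative and uses no real symbolic execution engine: the accepting-path constraints of a toy argument validator are written out manually and unioned into a recognizer.

```python
# Toy illustration of deriving an input grammar from a program's branch conditions.
# A real tool would collect these constraints with a symbolic execution engine;
# here the accepting paths of a trivial validator are enumerated by hand.
import re

def accepts(arg: str) -> bool:
    """Program under analysis: accepts flags shaped like '-n' or '-n<digits>'."""
    if not arg.startswith("-"):
        return False                      # path 1: rejected
    if arg == "-n":
        return True                       # path 2: accepted
    if arg.startswith("-n") and arg[2:].isdigit():
        return True                       # path 3: accepted
    return False                          # path 4: rejected

# Constraints along the accepting paths, written as regular expressions.
accepting_path_constraints = ["-n", "-n[0-9]+"]

# The inferred input language is the union of the accepting-path constraints.
inferred_grammar = "|".join(accepting_path_constraints)

for candidate in ["-n", "-n12", "-nx", "-x", "n12"]:
    assert bool(re.fullmatch(inferred_grammar, candidate)) == accepts(candidate)
print("inferred grammar agrees with the program on the sampled inputs")
```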

    Mining Version Histories for Detecting Code Smells

    Code smells are symptoms of poor design and implementation choices that may hinder code comprehension and possibly increase change- and fault-proneness. While most detection techniques rely only on structural information, many code smells are intrinsically characterized by how code elements change over time. In this paper, we propose HIST (Historical Information for Smell deTection), an approach exploiting change history information to detect instances of five different code smells, namely Divergent Change, Shotgun Surgery, Parallel Inheritance, Blob, and Feature Envy. We evaluate HIST in two empirical studies. The first, conducted on twenty open source projects, aimed at assessing the accuracy of HIST in detecting instances of the code smells mentioned above. The results indicate that the precision of HIST ranges between 72% and 86%, and its recall ranges between 58% and 100%. Results of the first study also indicate that HIST is able to identify code smells that cannot be identified by competitive approaches based solely on code analysis of a single system’s snapshot. We then conducted a second study aimed at investigating to what extent the code smells detected by HIST (and by competitive code analysis techniques) reflect developers’ perception of poor design and implementation choices. We involved twelve developers of four open source projects, who recognized more than 75% of the code smell instances identified by HIST as actual design/implementation problems.
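
    The historical intuition behind a smell such as Shotgun Surgery, a class whose changes repeatedly drag many other classes along, can be sketched as a co-change query over commit logs. The data, rule, and thresholds below are hypothetical illustrations of association mining over version history, not HIST's actual detectors or calibrated thresholds.

```python
# Illustrative co-change mining over a commit history, in the spirit of
# history-based smell detection. Data, rule, and thresholds are hypothetical;
# HIST's actual detectors and calibrated thresholds differ.
from collections import defaultdict
from itertools import combinations

commits = [                      # each commit is the set of classes it touched
    {"Order", "Invoice", "Shipping"},
    {"Order", "Invoice", "Billing"},
    {"Order", "Shipping", "Billing"},
    {"Report"},
]

changes = defaultdict(int)       # how often each class changed
co_changes = defaultdict(int)    # how often each pair changed together
for commit in commits:
    for cls in commit:
        changes[cls] += 1
    for a, b in combinations(sorted(commit), 2):
        co_changes[(a, b)] += 1

def shotgun_surgery_suspects(min_partners=3, min_support=0.6):
    """Flag classes that frequently co-change with many distinct other classes."""
    suspects = []
    for cls, total in changes.items():
        partners = []
        for (a, b), n in co_changes.items():
            if cls in (a, b) and n / total >= min_support:
                partners.append(b if a == cls else a)
        if len(partners) >= min_partners:
            suspects.append((cls, sorted(partners)))
    return suspects

print(shotgun_surgery_suspects())   # [('Order', ['Billing', 'Invoice', 'Shipping'])]
```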

    Towards A Verified Complex Protocol Stack in a Production Kernel: Methodology and Demonstration

    Any useful computer system performs communication, and any communication must be parsed before it is computed upon. Given their importance, one might expect parsers to receive a significant share of attention from the security community. This is, however, not the case: bugs in parsers continue to account for a surprising portion of reported and exploited vulnerabilities. In this thesis, I propose a methodology for supporting the development of software that depends on parsers---such as anything connected to the Internet---so that it can safely support any reasonably designed protocol: data structures to describe protocol messages; validation routines that check that data received from the wire conforms to the rules of the protocol; systems that allow a defender to inject arbitrary, crafted input so as to explore the effectiveness of the parser; and systems that allow for the observation of the parser code while it is being explored. Then, I describe a principled method of producing parsers that automatically generates the myriad parser-related software from a description of the protocol. This has many significant benefits: it makes implementing parsers simpler, easier, and faster; it reduces the trusted computing base to the description of the protocol and the program that compiles the description to runnable code; and it allows for easier formal verification of the generated code. I demonstrate the merits of the proposed methodology by creating a description of the USB protocol using a domain-specific language (DSL) embedded in Haskell and integrating it with the FreeBSD operating system. Using the industry-standard umap test suite, I measure the performance and efficacy of the generated parser. I show that it is stable, that it is effective at protecting a system from both accidentally and maliciously malformed input, and that it does not incur unreasonable overhead.
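
    The thesis's central move, deriving the message structures, validation routines, and test hooks from a single declarative description of the protocol, can be hinted at with a short sketch. The descriptor format and helper below are hypothetical and written in Python for brevity; the actual work uses a DSL embedded in Haskell integrated with FreeBSD, and only the field layout shown (the standard USB device request) comes from the protocol itself.

```python
# Sketch of "compiling" a declarative message description into a validator.
# The descriptor format and the sanity bound are illustrative only.
import struct

# Declarative description: (field name, struct format, optional range check).
DEVICE_REQUEST = [
    ("bmRequestType", "B", None),
    ("bRequest",      "B", None),
    ("wValue",        "H", None),
    ("wIndex",        "H", None),
    ("wLength",       "H", lambda v: v <= 1024),  # hypothetical sanity bound
]

def make_validator(description):
    """Generate a parse-and-validate routine from the description."""
    fmt = "<" + "".join(f for _, f, _ in description)
    size = struct.calcsize(fmt)
    def validate(wire: bytes) -> dict:
        if len(wire) != size:
            raise ValueError(f"expected {size} bytes, got {len(wire)}")
        message = {}
        for (name, _, check), value in zip(description, struct.unpack(fmt, wire)):
            if check is not None and not check(value):
                raise ValueError(f"field {name} out of range: {value}")
            message[name] = value
        return message
    return validate

parse_device_request = make_validator(DEVICE_REQUEST)
# A well-formed GET_DESCRIPTOR request parses; malformed input is rejected up front.
print(parse_device_request(bytes([0x80, 0x06, 0x00, 0x01, 0x00, 0x00, 0x12, 0x00])))
```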

    Bridging the Gap Between Intent and Outcome: Knowledge, Tools & Principles for Security-Minded Decision-Making

    Well-intentioned decisions---even ones intended to improve aggregate security---may inadvertently jeopardize security objectives. Adopting a stringent password composition policy ostensibly yields high-entropy passwords; however, such policies often drive users to reuse or write down passwords. Replacing URLs in emails with safe URLs that navigate through a gatekeeper service that vets them before granting user access may reduce user exposure to malware; however, it may backfire by reducing the user's ability to parse the URL or by giving the user a false sense of security if user expectations misalign with the security checks delivered by the vetting process. A short timeout threshold may ensure the user is promptly logged out when the system detects they are away; however, if an infuriated user copes by inserting a USB stick in their computer to emulate mouse movements, then not only will the detection mechanism fail, but the insertion of the USB stick may present a new attack surface. These examples highlight the disconnect between decision-maker intentions and decision outcomes. Our focus is on bridging this gap. This thesis explores six projects bound together by the core objective of empowering people to make decisions that achieve their security and privacy objectives. First, we use grounded theory to examine Amazon reviews of password logbooks and to obtain valuable insights into users' password management beliefs, motivations, and behaviors. Second, we present a discrete-event simulation we built to assess the efficacy of password policies. Third, we explore the idea of supplementing language-theoretic security with human-computability boundaries. Fourth, we conduct an eye-tracking study to understand users' visual processes while parsing and classifying URLs. Fifth, we discuss preliminary findings from a study conducted on Amazon Mechanical Turk to examine why users fall for unsafe URLs. And sixth, we develop a logic-based representation of mismorphisms, which allows us to express the root causes of security problems. Each project demonstrates a key technique that can help in bridging the gap between intent and outcome.
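
    The second project, the discrete-event simulation of password policies, can be gestured at with a bare event-loop skeleton. Everything below, the events, rates, and the "write it down" coping rule, is a hypothetical illustration of the simulation style, not the thesis's model or its parameters.

```python
# Bare discrete-event simulation skeleton in the style of a password-policy study.
# All events, probabilities, and the coping rule are hypothetical.
import heapq
import random

random.seed(1)
RESET_INTERVAL = 90.0    # assumed policy: forced reset every 90 days
COPE_PROB = 0.3          # assumed chance a reset drives a user to write the password down
USERS = 100

# Event queue of (time in days, user id); start each user at a random offset.
events = [(random.uniform(0, RESET_INTERVAL), user) for user in range(USERS)]
heapq.heapify(events)
written_down = set()

while events:
    time, user = heapq.heappop(events)
    if time > 365.0:                         # simulate one year
        break
    if random.random() < COPE_PROB:
        written_down.add(user)               # the policy backfires for this user
    heapq.heappush(events, (time + RESET_INTERVAL, user))

print(f"users with a written-down password after one year: {len(written_down)}")
```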

    KnetMiner - An integrated data platform for gene mining and biological knowledge discovery

    Hassani-Pak K. KnetMiner - An integrated data platform for gene mining and biological knowledge discovery. Bielefeld: Universität Bielefeld; 2017. Discovery of novel genes that control important phenotypes and diseases is one of the key challenges in the biological sciences. Now, in the post-genomics era, scientists have access to a vast range of genomes, genotypes, phenotypes and ‘omics data which - when used systematically - can help to gain new insights and make faster discoveries. However, the volume and diversity of such un-integrated data is often seen as a burden that only those with specialist bioinformatics skills, but often only minimal specialist biological knowledge, can penetrate. Therefore, new tools are required to allow researchers to connect, explore and compare large-scale datasets to identify the genes and pathways that control important phenotypes and diseases in plants, animals and humans. KnetMiner, with a silent "K" and standing for Knowledge Network Miner, is a suite of open-source software tools for integrating and visualising large biological datasets. The software mines the myriad databases that describe an organism’s biology to present links between relevant pieces of information, such as genes, biological pathways, phenotypes and publications, with the aim of providing leads for scientists who are investigating the molecular basis of a particular trait. The KnetMiner approach is based on 1) integration of heterogeneous, complex and interconnected biological information into a knowledge graph; 2) text-mining to enrich the knowledge graph with novel relations extracted from literature; 3) graph queries of varying depths to find paths between genes and evidence nodes; 4) an evidence-based gene rank algorithm that combines graph and information theory; and 5) fast search and interactive knowledge visualisation techniques. Overall, [KnetMiner](http://knetminer.rothamsted.ac.uk) is a publicly available resource that helps scientists trawl diverse biological databases for clues to design better crop varieties and understand diseases. The key strength of KnetMiner is to include the end user in the “interactive” knowledge discovery process, with the goal of supporting human intelligence with machine intelligence.
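
    Two of the listed ingredients, graph queries that find paths between genes and evidence nodes and an evidence-based gene ranking, can be sketched on a toy knowledge graph. The graph contents, the use of networkx, and the path-counting score below are illustrative assumptions, not KnetMiner's implementation or ranking algorithm.

```python
# Toy knowledge-graph query in the spirit of gene-to-evidence path search.
# Graph contents and the path-counting score are illustrative, not KnetMiner's.
import networkx as nx

kg = nx.Graph()
kg.add_edge("GeneA", "Pathway1", relation="participates_in")
kg.add_edge("Pathway1", "DroughtTolerance", relation="associated_with")
kg.add_edge("GeneA", "Paper42", relation="mentioned_in")
kg.add_edge("Paper42", "DroughtTolerance", relation="discusses")
kg.add_edge("GeneB", "Pathway2", relation="participates_in")

def rank_genes(graph, genes, evidence, max_depth=3):
    """Rank candidate genes by the number of short paths linking them to the evidence node."""
    scores = {gene: len(list(nx.all_simple_paths(graph, gene, evidence, cutoff=max_depth)))
              for gene in genes}
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

print(rank_genes(kg, ["GeneA", "GeneB"], "DroughtTolerance"))
# [('GeneA', 2), ('GeneB', 0)]
```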

    Acta Cybernetica : Volume 23. Number 2.


    Enacting the Semantic Web: Ontological Orderings, Negotiated Standards, and Human-machine Translations

    Artificial intelligence (AI) based on semantic search has become one of the dominant means of accessing information in recent years. This is particularly the case in mobile contexts, as search-based AI is embedded in each of the major mobile operating systems. The implication is that information is becoming less a matter of choosing between different sets of results and more a presentation of a single answer, limiting both the availability of, and exposure to, alternate sources of information. Thus, it is essential to understand how that information comes to be structured and how deterministic systems like search-based AI come to understand the indeterminate worlds they are tasked with interrogating. The semantic web, one of the technologies underpinning these systems, creates machine-readable data from the existing web of text and formalizes those machine-readable understandings in ontologies. This study investigates the ways that those semantic assemblages structure, and thus define, the world. In accordance with assemblage theory, it is necessary to study the interactions between the components that make up such data assemblages. As yet, the social sciences have been slow to systematically investigate data assemblages, the semantic web, and the components of these important socio-technical systems. This study investigates one major ontology, Schema.org. It uses netnographic methods to study the construction and use of Schema.org to determine how ontological states are declared and how human-machine translations occur in those development and use processes. The study has two main findings that bear on the relevant literature. First, I find that development and use of the ontology is a product of negotiations with technical standards, such that ontologists and users must work around, with, and through the affordances and constraints of standards. Second, these groups adopt a pragmatic and generalizable approach to data modeling and semantic markup that determines ontological context in local and global ways. The first finding is significant in that past work has largely focused on how people work around standards’ limitations, whereas this shows that practitioners also strategically engage with standards to achieve their aims. The second finding matters because the particular approach these groups use in translating human knowledge to machines differs from the formalized and positivistic approaches described in past work. At a larger level, this study fills a lacuna in the collective understanding of how data assemblages are constructed and operate.
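
    What the ontology's declarations look like in practice can be shown by building a small piece of Schema.org markup in Python and serializing it as JSON-LD, the kind of machine-readable statement the study's informants negotiate over. The structure follows Schema.org's published vocabulary; the specific article and author are invented for the example.

```python
# Building a small Schema.org JSON-LD declaration: the machine-readable layer the
# ontology adds on top of the web of text. The article and author are invented.
import json

markup = {
    "@context": "https://schema.org",
    "@type": "ScholarlyArticle",
    "headline": "Enacting the Semantic Web",
    "author": {"@type": "Person", "name": "A. Researcher"},  # hypothetical author
    "about": ["ontologies", "technical standards", "human-machine translation"],
    "datePublished": "2021-01-01",
}

# Serialized this way, a crawler or search-based AI ingests the page as structured
# data (typed entities and properties) rather than as free text.
print(json.dumps(markup, indent=2))
```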

    Twining

    Hypertext is now commonplace: links and linking structure nearly all of our experiences online. Yet the literary, as opposed to commercial, potential of hypertext has receded. One of the few tools still focused on hypertext as a means of digital storytelling is Twine, a platform for building choice-driven stories without relying heavily on code. In Twining, Anastasia Salter and Stuart Moulthrop lead readers on a journey at once technical, critical, contextual, and personal. The book’s chapters alternate careful, stepwise discussion of adaptable Twine projects with commentary on exemplary Twine works and reflections on Twine’s technological and cultural background. Beyond telling the story of Twine and how to make Twine stories, Twining reflects on the ongoing process of making.