On the naturalness of software
Natural languages like English are rich, complex, and powerful. The highly creative and graceful use of languages like English and Tamil, by masters like Shakespeare and Avvaiyar, can certainly delight and inspire. But in practice, given cognitive constraints and the exigencies of daily life, most human utterances are far simpler and much more repetitive and predictable. In fact, these utterances can be very usefully modeled using modern statistical methods. This fact has led to the phenomenal success of statistical approaches to speech recognition, natural language translation, question-answering, and text mining and comprehension.
We begin with the conjecture that most software is also natural, in the sense that it is created by humans at work, with all the attendant constraints and limitations---and thus, like natural language, it is also likely to be repetitive and predictable. We then proceed to ask whether (a) code can be usefully modeled by statistical language models and (b) such models can be leveraged to support software engineers. Using the widely adopted n-gram model, we provide empirical evidence supportive of a positive answer to both these questions. We show that code is also very regular, and, in fact, even more so than natural languages. As an example use of the model, we have developed a simple code completion engine for Java that, despite its simplicity, already improves Eclipse's completion capability. We conclude the paper by laying out a vision for future research in this area.
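As a rough illustration of the kind of model the paper builds on, the sketch below trains a trigram model over token sequences and suggests the most likely next tokens for a two-token context. The toy corpus, whitespace tokenizer, and absence of smoothing are simplifications for the sketch; the paper's engine is trained on large Java corpora with proper lexing.

```python
# Minimal sketch of an n-gram next-token suggester in the spirit of the paper.
# The toy corpus and whitespace tokenizer are illustrative placeholders.
from collections import Counter, defaultdict

def train_trigrams(token_lines):
    """Count, for every 2-token context, the tokens that follow it."""
    model = defaultdict(Counter)
    for tokens in token_lines:
        padded = ["<s>", "<s>"] + tokens + ["</s>"]
        for i in range(len(padded) - 2):
            context = (padded[i], padded[i + 1])
            model[context][padded[i + 2]] += 1
    return model

def suggest(model, prev2, prev1, k=3):
    """Return the k most likely next tokens for the given 2-token context."""
    return [tok for tok, _ in model[(prev2, prev1)].most_common(k)]

corpus = [
    "for ( int i = 0 ; i < n ; i ++ )".split(),
    "for ( int j = 0 ; j < m ; j ++ )".split(),
]
model = train_trigrams(corpus)
print(suggest(model, "i", "<"))   # ['n'] -- completion after "i <"
```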
A Theory of Dual Channel Constraints
The surprising predictability of source code has triggered a boom in tools using language models for code. Code is much more predictable than natural language, but the reasons are not well understood. We propose a dual channel view of code; code combines a formal channel for specifying execution and a natural language channel in the form of identifiers and comments that assists human comprehension. Computers ignore the natural language channel, but developers read both and, when writing code for long-term use and maintenance, consider each channel's audience: computer and human. As developers hold both channels in mind when coding, we posit that the two channels interact and constrain each other; we call these dual channel constraints. Their impact has been neglected. We describe how they can lead to humans writing code in a way more predictable than natural language, highlight pioneering research that has implicitly or explicitly used parts of this theory, and drive new research, such as systematically searching for cross-channel inconsistencies. Dual channel constraints provide an exciting opportunity as truly multi-disciplinary research; for computer scientists they promise improvements to program analysis via a more holistic approach to code, and for psycholinguists they promise a novel environment for studying linguistic processes.
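A contrived example of the kind of cross-channel inconsistency the theory suggests searching for (not taken from the paper): the natural-language channel promises one behaviour while the formal channel implements another.

```python
# Contrived illustration of a cross-channel inconsistency: the natural-language
# channel (name and docstring) promises one behaviour, the formal channel
# implements another. Searching for such mismatches is one research direction
# the paper highlights.

def is_even(n):
    """Return True if n is even."""
    return n % 2 == 1   # formal channel actually tests for odd numbers

# A reader trusting the natural channel expects is_even(4) to be True;
# the formal channel returns False, signalling that the two channels disagree.
print(is_even(4))  # False
```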
Mining Semantic Loop Idioms
To write code, developers stitch together patterns, like API protocols or data structure traversals. Discovering these patterns can identify inconsistencies in code or opportunities to replace these patterns with an API or a language construct. We present coiling, a technique for automatically mining code for semantic idioms: surprisingly probable, semantic patterns. We specialize coiling for loop idioms, semantic idioms of loops. First, we show that automatically identifiable patterns exist, in great numbers, with a large-scale empirical study of loops over 25 MLOC. We find that most loops in this corpus are simple and predictable: 90 percent have fewer than 15 LOC and 90 percent have no nesting and very simple control. Encouraged by this result, we then mine loop idioms over a second, buildable corpus. Over this corpus, we show that only 50 loop idioms cover 50 percent of the concrete loops. Our framework opens the door to data-driven tool and language design, discovering opportunities to introduce new API calls and language constructs. Loop idioms show that LINQ would benefit from an Enumerate operator; this is confirmed by the existence of a StackOverflow question with 542k views that requests precisely this feature.
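To make the Enumerate idiom concrete, the snippet below renders the pattern in Python rather than the paper's C#/LINQ setting: a loop that needs both an element and its position, first written with a manual counter (the concrete shape a miner would see repeatedly), then as a single enumerate-style operator.

```python
# Illustration of the loop idiom behind the proposed Enumerate operator:
# iterating over a collection while also tracking each element's position.
# This is a Python rendering of the pattern, not the paper's C#/LINQ code.

items = ["alpha", "beta", "gamma"]

# The concrete loop shape the miner sees many times: a manual counter
# incremented alongside the iteration.
i = 0
for item in items:
    print(i, item)
    i += 1

# The idiom as a single operator (Python's built-in enumerate); the paper
# argues LINQ users ask for exactly this.
for i, item in enumerate(items):
    print(i, item)
```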
Authenticating the Query Results of Text Search Engines
The number of successful attacks on the Internet shows that it is very difficult to guarantee the security of online search engines. A breached server that is not detected in time may return incorrect results to the users. To prevent that, we introduce a methodology for generating an integrity proof for each search result. Our solution is targeted at search engines that perform similarity-based document retrieval, and utilize an inverted list implementation (as most search engines do). We formulate the properties that define a correct result, map the task of processing a text search query to adaptations of existing threshold-based algorithms, and devise an authentication scheme for checking the validity of a result. Finally, we confirm the efficiency and practicality of our solution through an empirical evaluation with real documents and benchmark queries.
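For flavour only, the sketch below checks a generic Merkle-style membership proof for one returned entry against a known root; the paper's actual scheme is tailored to inverted lists and threshold-based query processing rather than this simplified structure.

```python
# Generic sketch of verifying a Merkle-style integrity proof for one returned
# entry. This only illustrates the flavour of result authentication and is not
# the paper's scheme.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def verify(entry: bytes, proof, signed_root: bytes) -> bool:
    """Recompute the root from the entry and its authentication path.

    `proof` is a list of (sibling_hash, sibling_is_left) pairs, leaf to root.
    """
    node = h(entry)
    for sibling, sibling_is_left in proof:
        node = h(sibling + node) if sibling_is_left else h(node + sibling)
    return node == signed_root

# Example with a two-leaf tree: the server returns one entry plus the other leaf's hash.
leaf0, leaf1 = h(b"doc42:score=0.87"), h(b"doc7:score=0.61")
root = h(leaf0 + leaf1)
print(verify(b"doc42:score=0.87", [(leaf1, False)], root))  # True
```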
Query Racing: Fast Completeness Certification of Query Results
We present a general and effective method to certify completeness of query results on relational tables stored in an untrusted DBMS. Our main contribution is the concept of the "Query Race": we split a general query into several single-attribute queries, and exploit concurrency and speed to bind the complexity to the fastest of them. Our method supports selection queries with general compositions of conjunctive and disjunctive order-based conditions on different attributes at the same time. To achieve our results, we require neither previous knowledge of the queries nor specific support from the DBMS. We validate our approach with experimental results obtained on a prototype implementation.
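The toy sketch below captures only the racing idea under simplified assumptions: each predicate becomes a single-attribute query, the queries run concurrently, the first to finish wins, and the client applies the remaining predicates to that result. The completeness-certification machinery that makes the winning result trustworthy is omitted.

```python
# Sketch of the "race" idea only: run one single-attribute query per predicate
# concurrently, take whichever finishes first, and finish the work client-side
# by applying the remaining predicates. Authentication is omitted here.
from concurrent.futures import ThreadPoolExecutor, FIRST_COMPLETED, wait

def race(single_attribute_queries, remaining_predicates):
    with ThreadPoolExecutor() as pool:
        futures = {pool.submit(q): preds
                   for q, preds in zip(single_attribute_queries, remaining_predicates)}
        done, pending = wait(futures, return_when=FIRST_COMPLETED)
        for f in pending:
            f.cancel()                      # the slower queries no longer matter
        winner = next(iter(done))
        rows, preds = winner.result(), futures[winner]
        return [r for r in rows if all(p(r) for p in preds)]

# Toy usage: SELECT * WHERE age > 30 AND city = 'Pisa', raced one attribute at a time.
rows = [{"age": 35, "city": "Pisa"}, {"age": 28, "city": "Pisa"}, {"age": 40, "city": "Rome"}]
queries = [lambda: [r for r in rows if r["age"] > 30],
           lambda: [r for r in rows if r["city"] == "Pisa"]]
preds = [[lambda r: r["city"] == "Pisa"],    # what the age query still has to check
         [lambda r: r["age"] > 30]]          # what the city query still has to check
print(race(queries, preds))                  # [{'age': 35, 'city': 'Pisa'}]
```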
Authenticated LSM Trees with Minimal Trust
In the age of user-generated content, the workloads imposed on information-security infrastructures are becoming increasingly write-intensive. However, existing security protocols, specifically authenticated data structures (ADSs), have historically been designed around update-in-place data structures and incur overhead when serving write-intensive workloads.
In this work, we present LPAD (Log-structured Persistent Authenticated Directory), a new ADS protocol designed around log-structured merge trees (LSM trees), which have recently gained popularity in the design of modern storage systems. On the write path, LPAD supports streaming, non-interactive updates with constant-size proofs from trusted data owners. On the read path, LPAD supports point queries over the dynamic dataset with polynomial-size proofs. The key to this efficiency is a verifiable reorganization operation in LPAD, called verifiable merge. Verifiable merge is secured by executing inside an enclave of a trusted execution environment (TEE). To minimize the trusted computing base (TCB), LPAD places the code related to verifiable merge in the enclave, and nothing else. Our implementation of LPAD on the Google LevelDB codebase and Intel SGX shows that the TCB is reduced by a factor of 20: the enclave portion of LPAD is one thousand lines of code out of more than twenty thousand in vanilla LevelDB. Under the YCSB workloads, LPAD improves performance by an order of magnitude over existing update-in-place ADSs.
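A toy rendering of the verifiable-merge idea, with plain hashes standing in for the paper's proof structures and an ordinary function standing in for the SGX enclave: the only "trusted" code checks the input runs against known digests, merges them, and commits to the output.

```python
# Toy illustration of verifiable merge: the trusted routine (1) checks the two
# input runs against their known digests, (2) merges them, and (3) commits to
# the output with a new digest. Real LPAD uses authenticated structures and an
# SGX enclave; whole-run hashes and a plain function stand in here.
import hashlib, heapq, json

def digest(run):
    return hashlib.sha256(json.dumps(run, sort_keys=True).encode()).hexdigest()

def verifiable_merge(run_a, digest_a, run_b, digest_b):
    """The enclave-resident part: verify inputs, merge, commit to the output."""
    assert digest(run_a) == digest_a and digest(run_b) == digest_b, "tampered input run"
    merged = list(heapq.merge(run_a, run_b))      # LSM compaction of two sorted runs
    return merged, digest(merged)

# Untrusted storage holds the runs; only the digests are trusted state.
run_a, run_b = [("k1", 1), ("k4", 4)], [("k2", 2), ("k3", 3)]
merged, new_digest = verifiable_merge(run_a, digest(run_a), run_b, digest(run_b))
print(merged, new_digest[:12])
```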
Modeling and verifying a broad array of network properties
Motivated by widely observed examples in nature, society, and software, where groups of already related nodes arrive together and attach to an existing network, we consider network growth via sequential attachment of linked node groups, or graphlets. We analyze the simplest case, attachment of the three-node V-graphlet, where, with probability alpha, we attach a peripheral node of the graphlet, and with probability (1 - alpha), we attach the central node. Our analytical results and simulations show that tuning alpha produces a wide range of degree distributions and degree assortativities, achieving assortativity values that capture a diverse set of real-world systems. We introduce a fifteen-dimensional attribute vector derived from seven well-known network properties, which enables comprehensive comparison between any two networks. Principal Component Analysis (PCA) of this attribute vector space shows that a simple extension of the above model covers a significantly larger range of real-world network properties than a classic model of network growth. (To appear in Europhysics Letters.)
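The sketch below simulates this growth process. The rule for picking the anchor node in the existing network is not restated in the abstract, so a uniformly random anchor is used here purely as a placeholder assumption; networkx is used for the graph bookkeeping.

```python
# Sketch of growth by attaching V-graphlets (a 3-node path a-b-c, b central).
# With probability alpha the incoming graphlet is glued to the network at a
# peripheral node, otherwise at its central node. The uniformly random anchor
# is a placeholder assumption, not the paper's attachment rule.
import random
import networkx as nx

def grow(n_graphlets, alpha, seed=0):
    rng = random.Random(seed)
    g = nx.path_graph(3)                      # start from a single V-graphlet: 0-1-2
    for _ in range(n_graphlets):
        anchor = rng.choice(list(g.nodes))    # placeholder anchor rule
        base = g.number_of_nodes()
        a, b, c = base, base + 1, base + 2    # new V-graphlet, b is the centre
        g.add_edges_from([(a, b), (b, c)])
        joint = rng.choice([a, c]) if rng.random() < alpha else b
        g.add_edge(anchor, joint)             # glue the graphlet to the network
    return g

g = grow(1000, alpha=0.3)
print(g.number_of_nodes(), nx.degree_assortativity_coefficient(g))
```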
Indexing Information for Data Forensics
We introduce novel techniques for organizing index structures for stored data so that alterations from an original version can be detected and the changed values specifically identified. We give forensic constructions for several fundamental data structures, including arrays, linked lists, binary search trees, skip lists, and hash tables. Some of our constructions are based on a new reduced-randomness construction for nonadaptive combinatorial group testing.
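As a one-alteration illustration of the group-testing flavour of these constructions (not the paper's scheme), the sketch below hashes overlapping groups of array slots chosen by the bits of each index; the set of groups whose hashes change after tampering spells out the binary index of the altered slot.

```python
# Toy nonadaptive group-testing sketch: slot i belongs to group j iff bit j of
# i is set, and each group keeps a hash of its members. If a single slot is
# altered, the failing groups' indices encode the altered slot in binary. The
# paper's constructions cover more data structures and multiple alterations.
import hashlib

def group_hashes(values, bits=4):
    hashes = []
    for j in range(bits):
        members = [str(v) for i, v in enumerate(values) if (i >> j) & 1]
        hashes.append(hashlib.sha256("|".join(members).encode()).hexdigest())
    return hashes

original = [10, 20, 30, 40, 50, 60, 70, 80]
fingerprint = group_hashes(original)

tampered = list(original)
tampered[5] = 99                              # silently alter slot 5 (binary 101)
failing = {j for j, (a, b) in enumerate(zip(fingerprint, group_hashes(tampered))) if a != b}
print(sum(1 << j for j in failing))           # 5: the altered slot is identified
```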