Many bioinformatics programming tasks can be automated with ChatGPT
Computer programming is a fundamental tool for life scientists, allowing them
to carry out many essential research tasks. However, despite a variety of
educational efforts, learning to write code can be a challenging endeavor for
both researchers and students in life science disciplines. Recent advances in
artificial intelligence have made it possible to translate human-language
prompts to functional code, raising questions about whether these technologies
can aid (or replace) life scientists' efforts to write code. Using 184
programming exercises from an introductory-bioinformatics course, we evaluated
the extent to which one such model -- OpenAI's ChatGPT -- can successfully
complete basic- to moderate-level programming tasks. On its first attempt,
ChatGPT solved 139 (75.5%) of the exercises. For the remaining exercises, we
provided natural-language feedback to the model, prompting it to try different
approaches. Within 7 or fewer attempts, ChatGPT solved 179 (97.3%) of the
exercises. These findings have important implications for life-sciences
research and education. For many programming tasks, researchers no longer need
to write code from scratch. Instead, machine-learning models may produce usable
solutions. Instructors may need to adapt their pedagogical approaches and
assessment techniques to account for these new capabilities that are available
to the general public.
Comment: 13 pages, 4 figures, to be submitted for publication
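The evaluation protocol the abstract describes (submit an exercise, test the model's code, feed back the error, and retry up to seven times) can be sketched as a short loop. This is an illustrative outline only, not the authors' actual harness; query_model and run_tests are hypothetical placeholders for a model API call and a unit-test runner.

    # Hypothetical sketch of the evaluation loop described in the abstract:
    # submit an exercise prompt, test the generated code, and give the model
    # natural-language feedback until the tests pass or attempts run out.
    from typing import Callable, Tuple

    def evaluate_exercise(
        prompt: str,
        query_model: Callable[[str], str],             # placeholder: returns candidate code
        run_tests: Callable[[str], Tuple[bool, str]],  # placeholder: (passed, error message)
        max_attempts: int = 7,                         # the abstract reports solutions within 7 attempts
    ) -> Tuple[bool, int]:
        """Return (solved, attempts_used) for one programming exercise."""
        feedback = ""
        for attempt in range(1, max_attempts + 1):
            candidate = query_model(prompt + feedback)
            passed, error = run_tests(candidate)
            if passed:
                return True, attempt
            # Natural-language feedback for the next attempt, as in the study's protocol.
            feedback = f"\nYour previous solution failed with: {error}. Please try a different approach."
        return False, max_attempts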
Compiler and Runtime for Memory Management on Software Managed Manycore Processors
We are expecting hundreds of cores per chip in the near future. However, scaling the memory architecture in manycore architectures becomes a major challenge. Cache coherence provides a single image of memory at any time in execution to all the cores, yet it is believed that coherent cache architectures will not scale to hundreds and thousands of cores. In addition, caches and coherence logic already take 20-50% of the total power consumption of the processor and 30-60% of die area. Therefore, a more scalable architecture is needed for manycore architectures. Software Managed Manycore (SMM) architectures emerge as a solution. They have a scalable memory design in which each core has direct access only to its local scratchpad memory, and any data transfers to/from other memories must be done explicitly in the application using Direct Memory Access (DMA) commands. The lack of automatic memory management in hardware makes such architectures extremely power-efficient, but it also makes them difficult to program. If the code/data of the task mapped onto a core cannot fit in the local scratchpad memory, then DMA calls must be added to bring in the code/data before it is required, and it may need to be evicted after use. Doing this, however, adds considerable complexity to the programmer's job: programmers must now worry about data management on top of the functional correctness of the program, which is already quite complex. This dissertation presents a comprehensive compiler and runtime integration to automatically manage the code and data of each task in the limited local memory of the core. We first developed Complete Circular Stack Management, which manages stack frames between the local memory and the main memory and also addresses the stack-pointer problem. Although it works, we found that the management could be further optimized in most cases, so we developed Smart Stack Data Management (SSDM). In this work, we formulate the stack data management problem and propose a greedy algorithm to solve it. We then propose a general cost-estimation algorithm, on which the CMSM heuristic for the code-mapping problem is built. Finally, heap data is dynamic in nature and therefore hard to manage. We provide two schemes to manage an unlimited amount of heap data within a constant-sized region of the local memory. In addition to these separate schemes for different kinds of data, we also provide a memory-partitioning methodology.
Dissertation/Thesis. Ph.D. Computer Science 201
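To make the scratchpad idea concrete, the following toy Python model shows the flavor of circular stack management: frames stay in the small local memory while they fit, and the oldest frames are written back to main memory (standing in for DMA transfers) when space runs out. It is a simplification for illustration, not the dissertation's compiler and runtime, and the sizes and names are invented.

    # A toy model (not the dissertation's implementation) of circular stack
    # management on a scratchpad: frames are kept locally while they fit, and the
    # oldest frames are "DMA'd" out to main memory when space runs low.
    from collections import deque

    class ScratchpadStack:
        def __init__(self, capacity_bytes: int):
            self.capacity = capacity_bytes
            self.used = 0
            self.local = deque()    # frames resident in the scratchpad (oldest on the left)
            self.main_memory = []   # frames evicted to main memory

        def push(self, name: str, size: int) -> None:
            while self.used + size > self.capacity and self.local:
                old_name, old_size = self.local.popleft()
                self.main_memory.append((old_name, old_size))  # models a DMA write-back
                self.used -= old_size
            self.local.append((name, size))
            self.used += size

        def pop(self) -> str:
            name, size = self.local.pop()
            self.used -= size
            if not self.local and self.main_memory:
                back_name, back_size = self.main_memory.pop()  # models a DMA fetch of the caller's frame
                self.local.append((back_name, back_size))
                self.used += back_size
            return name

    # Example: a 1 KB scratchpad forces the frames of a deep call chain out to main memory.
    sp = ScratchpadStack(capacity_bytes=1024)
    for i in range(8):
        sp.push(f"frame_{i}", size=256)
    print(len(sp.local), "frames local,", len(sp.main_memory), "frames in main memory")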
Purposes, concepts, misfits, and a redesign of git
Git is a widely used version control system that is powerful but complicated. Its complexity may not be an inevitable consequence of its power but rather evidence of flaws in its design. To explore this hypothesis, we analyzed the design of Git using a theory that identifies concepts, purposes, and misfits. Some well-known difficulties with Git are described, and explained as misfits in which underlying concepts fail to meet their intended purpose. Based on this analysis, we designed a reworking of Git (called Gitless) that attempts to
remedy these flaws. To correlate misfits with issues reported by users, we
conducted a study of Stack Overflow questions. To determine whether users experienced fewer complications using Gitless in place of Git, we also conducted a small user study. Results suggest our approach can be effective in identifying, analyzing, and fixing design problems.
SUTD-MIT International Design Centre (IDC)
ClickINC: In-network Computing as a Service in Heterogeneous Programmable Data-center Networks
In-Network Computing (INC) has found many applications for performance boosts
or cost reduction. However, given heterogeneous devices, diverse applications,
and multi-path network topologies, it is cumbersome and error-prone for
application developers to effectively utilize the available network resources
and gain predictable benefits without impeding normal network functions.
Previous work is oriented to network operators more than application
developers. We develop ClickINC to streamline the INC programming and
deployment using a unified and automated workflow. ClickINC provides INC
developers with a modular programming abstraction, without requiring them to
reason about device states or the network topology. We describe the ClickINC framework,
model, language, workflow, and corresponding algorithms. Experiments on both an
emulator and a prototype system demonstrate its feasibility and benefits
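As a rough illustration of the kind of modular abstraction described (hypothetical names, not ClickINC's actual API), the sketch below lets a developer declare INC modules with resource demands while a separate placement step maps them onto heterogeneous devices, so the developer never manipulates device state directly.

    # A hypothetical illustration (not ClickINC's actual API) of a modular INC
    # abstraction: developers compose modules, and a placement step maps them
    # onto heterogeneous programmable devices.
    from dataclasses import dataclass

    @dataclass
    class Module:
        name: str
        memory_kb: int          # resource demand declared by the developer

    @dataclass
    class Device:
        name: str
        kind: str               # e.g. "programmable switch" or "SmartNIC"
        free_memory_kb: int

    def place(pipeline, devices):
        """Greedy first-fit placement; the developer never touches device state."""
        placement = {}
        for module in pipeline:
            for dev in devices:
                if dev.free_memory_kb >= module.memory_kb:
                    dev.free_memory_kb -= module.memory_kb
                    placement[module.name] = dev.name
                    break
            else:
                raise RuntimeError(f"no device can host {module.name}")
        return placement

    pipeline = [Module("aggregate", 64), Module("cache", 512), Module("filter", 32)]
    devices = [Device("sw1", "programmable switch", 256), Device("nic1", "SmartNIC", 1024)]
    print(place(pipeline, devices))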
Protecting Systems From Exploits Using Language-Theoretic Security
Any computer program processing input from the user or network must validate the input. Input-handling vulnerabilities occur in programs when the software component responsible for filtering malicious input, the parser, does not perform validation adequately. Consequently, parsers are among the most targeted components, since they defend the rest of the program from malicious input. This thesis adopts the Language-Theoretic Security (LangSec) principle to understand what tools and research are needed to prevent exploits that target parsers. LangSec proposes specifying the syntactic structure of the input format as a formal grammar. We then build a recognizer for this formal grammar to validate any input before the rest of the program acts on it. To ensure that these recognizers represent the data format, programmers often rely on parser generators or parser combinator tools to build the parsers. This thesis propels several sub-fields of LangSec by proposing new techniques to find bugs in implementations, novel categorizations of vulnerabilities, and new parsing algorithms and tools to handle practical data formats. To this end, this thesis comprises five parts that tackle various tenets of LangSec.

First, I categorize various input-handling vulnerabilities and exploits using two frameworks. The first is the mismorphisms framework, which helps us reason about the root causes leading to various vulnerabilities. The second is a categorization framework built from various LangSec anti-patterns, such as parser differentials and insufficient input validation. Finally, we built a catalog of more than 30 popular vulnerabilities to demonstrate the categorization frameworks.

Second, I built parsers for various Internet of Things and power grid network protocols and the iccMAX file format using parser combinator libraries. The parsers I built for power grid protocols were deployed and tested on power grid substation networks as an intrusion detection tool. The parser I built for the iccMAX file format led to several corrections and modifications to the iccMAX specifications and reference implementations.

Third, I present SPARTA, a novel tool I built that generates Rust code to type check Portable Document Format (PDF) files. The type checker I helped build strictly enforces the constraints in the PDF specification to find deviations. Our checker has contributed to at least four significant clarifications and corrections to the PDF 2.0 specification and various open-source PDF tools. In addition to our checker, we also built a practical tool, PDFFixer, to dynamically patch type errors in PDF files.

Fourth, I present ParseSmith, a tool to build verified parsers for real-world data formats. Most parsing tools available for data formats are insufficient to handle practical formats or have not been verified for their correctness. I built a verified parsing tool in Dafny that builds on ideas from attribute grammars, data-dependent grammars, and parsing expression grammars to tackle various constructs commonly seen in network formats. I prove that our parsers run in linear time and always terminate for well-formed grammars.

Finally, I provide the earliest systematic comparison of various data description languages (DDLs) and their parser generation tools. DDLs are used to describe and parse commonly used data formats, such as image formats. Next, I conducted an expert-elicitation qualitative study to derive the metrics I use to compare the DDLs. I also systematically compare these DDLs based on the sample data descriptions available with them, checking for correctness and resilience
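The core LangSec discipline, recognizing the whole input against a formal grammar before the rest of the program acts on it, can be illustrated with a few lines of parser-combinator code. This is a minimal sketch in Python for a toy dotted-quad input language, not the thesis's parsers or tools.

    # A minimal parser-combinator sketch (illustrative, not the thesis's tooling)
    # of the LangSec discipline: recognize the full input against a formal
    # grammar before any other code acts on it.
    def char(pred):
        def parse(s, i):
            return (i + 1, s[i]) if i < len(s) and pred(s[i]) else None
        return parse

    def many1(p):
        def parse(s, i):
            out = []
            r = p(s, i)
            while r is not None:
                i, v = r
                out.append(v)
                r = p(s, i)
            return (i, out) if out else None
        return parse

    def seq(*parsers):
        def parse(s, i):
            vals = []
            for p in parsers:
                r = p(s, i)
                if r is None:
                    return None
                i, v = r
                vals.append(v)
            return (i, vals)
        return parse

    digit = char(str.isdigit)
    dot = char(lambda c: c == ".")
    # Grammar for a dotted quad such as "192.168.0.1" (a toy input language).
    quad = seq(many1(digit), dot, many1(digit), dot, many1(digit), dot, many1(digit))

    def recognize(s: str) -> bool:
        r = quad(s, 0)
        return r is not None and r[0] == len(s)   # the whole input must be consumed

    assert recognize("192.168.0.1")
    assert not recognize("192.168.0.1; rm -rf /")  # rejected before the program acts on it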
Understanding and Supporting Debugging Workflows in Multiverse Analysis
Multiverse analysis, a paradigm for statistical analysis that considers all
combinations of reasonable analysis choices in parallel, promises to improve
transparency and reproducibility. Although recent tools help analysts specify
multiverse analyses, they remain difficult to use in practice. In this work, we
conduct a formative study with four multiverse researchers, which identifies
debugging as a key barrier. We find debugging is challenging because of the
latency between running analyses and detecting bugs, and the scale of metadata
that must be processed to diagnose a bug. To address these challenges, we
prototype a command-line interface tool, Multiverse Debugger, which helps
diagnose bugs in the multiverse and propagate fixes. In a second, focused study
(n=13), we use Multiverse Debugger as a probe to develop a model of debugging
workflows and identify challenges, including the difficulty in understanding
the composition of a multiverse. We conclude with design implications for
future multiverse analysis authoring systems
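The scale problem the study identifies is easy to see in a small sketch: a handful of reasonable analysis choices already multiplies into many universes, each of which can fail in its own way. The choice names and the injected failure below are hypothetical and stand in for a real multiverse specification.

    # A small sketch (hypothetical choice names) of why multiverse debugging is hard:
    # a few reasonable analysis choices multiply into many universes, each producing
    # metadata that must be sifted to localize a bug.
    from itertools import product

    choices = {
        "outlier_rule": ["none", "iqr", "z>3"],
        "transform":    ["raw", "log"],
        "model":        ["ols", "robust", "mixed"],
    }

    def run_analysis(spec):
        # Placeholder for a real statistical analysis; one combination "fails" here
        # to stand in for a bug that only surfaces in part of the multiverse.
        if spec["transform"] == "log" and spec["model"] == "mixed":
            raise ValueError("singular fit")
        return {"estimate": 0.0}

    universes = [dict(zip(choices, combo)) for combo in product(*choices.values())]
    print(len(universes), "universes")   # 3 * 2 * 3 = 18

    failures = []
    for spec in universes:
        try:
            run_analysis(spec)
        except Exception as err:
            failures.append((spec, str(err)))

    # A tool like Multiverse Debugger would help summarize which choices co-occur in failures.
    print(len(failures), "failing universes, e.g.", failures[0])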
Software reverse engineering education
Software Reverse Engineering (SRE) is the practice of analyzing a software system, either in whole or in part, to extract design and implementation information. A typical SRE scenario would involve a software module that has worked for years and carries several rules of a business in its lines of code. Unfortunately, the source code of the application has been lost; what remains is “native” or “binary” code. Reverse engineering skills are also used to detect and neutralize viruses and malware as well as to protect intellectual property. It became frighteningly apparent during the Y2K crisis that reverse engineering skills were not commonly held amongst programmers. Since that time, much research has been undertaken to formalize the types of activities that fall into the category of reverse engineering so that these skills can be taught to computer programmers and testers. To help address the lack of software reverse engineering education, several peer-reviewed articles on software reverse engineering, re-engineering, reuse, maintenance, evolution, and security were gathered with the objective of developing relevant, practical exercises for instructional purposes. The research revealed that SRE is fairly well described and most of the related activities fall into one of two
Quantifying and Predicting the Influence of Execution Platform on Software Component Performance
The performance of software components depends on several factors, including the execution platform on which the software components run. To simplify cross-platform performance prediction in relocation and sizing scenarios, this thesis introduces a novel approach that separates the application performance profile from the platform performance profile. The approach is evaluated using transparent instrumentation of Java applications and automated benchmarks for Java Virtual Machines
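The separation of profiles can be pictured as a simple linear model: an application profile counts how often each primitive operation occurs, a platform profile gives the cost of each operation on a particular machine or JVM, and a prediction combines the two. The sketch below is a deliberately simplified illustration with invented operation names and numbers, not the thesis's actual model.

    # Simplified sketch of separating an application profile (operation counts)
    # from a platform profile (per-operation costs) for cross-platform prediction.
    # All names and numbers are illustrative only.
    app_profile = {            # measured once, platform-independent
        "int_arith": 5_000_000,
        "method_call": 800_000,
        "array_read": 2_000_000,
    }

    platform_profiles = {      # measured per platform with microbenchmarks
        "laptop_jvm": {"int_arith": 1e-9,   "method_call": 5e-9, "array_read": 2e-9},
        "server_jvm": {"int_arith": 0.6e-9, "method_call": 3e-9, "array_read": 1.2e-9},
    }

    def predict_runtime(app, platform):
        """Predicted time = sum over operations of count * per-operation cost."""
        return sum(count * platform[op] for op, count in app.items())

    for name, plat in platform_profiles.items():
        print(f"{name}: {predict_runtime(app_profile, plat) * 1e3:.2f} ms")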