Automated Testing and Debugging for Big Data Analytics
The prevalence of big data analytics in almost every large-scale software system has generated a substantial push to build data-intensive scalable computing (DISC) frameworks such as Google MapReduce and Apache Spark that can fully harness the power of existing data centers. However, frameworks once used by domain experts are now being leveraged by data scientists, business analysts, and researchers. This shift in user demographics calls for immediate advancements in the development, debugging, and testing practices of big data applications, which are falling behind compared to DISC framework design and implementation. In practice, big data applications often fail because users are unable to test all behaviors emerging from the interleaving of dataflow operators, user-defined functions, and the framework's code. Testing based on a random sample rarely guarantees reliability, and "trial and error" and "print" debugging methods are expensive and time-consuming. Thus, the current practice of developing a big data application must be improved, and the tools built to enhance developer productivity must adapt to the distinct characteristics of data-intensive scalable computing. By synthesizing ideas from software engineering and database systems, our hypothesis is that we can design effective and scalable testing and debugging algorithms for big data analytics without compromising the performance and efficiency of the underlying DISC framework. To design such techniques, we investigate how we can build interactive and responsive debugging primitives that significantly reduce debugging time yet do not impose much performance overhead on big data applications. Furthermore, we investigate how we can leverage data provenance techniques from databases and fault-isolation algorithms from software engineering to efficiently pinpoint the minimal subset of failure-inducing inputs.
To improve the reliability of big data analytics, we investigate how we can abstract the semantics of dataflow operators and use them in tandem with the semantics of user-defined functions to generate a minimum set of synthetic test inputs capable of revealing more defects than the entire input dataset. To examine the first hypothesis, we introduce interactive, real-time debugging primitives for big data analytics through innovative and scalable debugging features such as simulated breakpoints, dynamic watchpoints, and crash culprit identification. Second, we design a new automated fault localization approach that combines insights from both the software engineering and database literature to bring delta debugging closer to reality for big data applications, by leveraging data provenance and by constructing systems optimizations for debugging provenance queries. Lastly, we devise a new symbolic-execution-based white-box testing algorithm for big data applications that abstracts dataflow operators using logical specifications instead of modeling their implementations, and combines them with the semantics of any arbitrary user-defined function. We instantiate the idea of an interactive debugging algorithm as BigDebug, the idea of an automated debugging algorithm as BigSift, and the idea of symbolic-execution-based testing as BigTest. Our investigation shows that the interactive debugging primitives can scale to terabytes---our record-level tracing incurs less than 25% overhead on average and provides up to 100% time saving compared to the baseline replay debugger. Second, we observe that by combining data provenance with delta debugging, we can identify the minimum faulty input in just under 30% of the original job execution time.
Lastly, we verify that by abstracting dataflow operators using logical specifications, we can efficiently generate the most concise test data suitable for local testing while revealing twice as many faults as prior approaches. Our investigations collectively demonstrate that developer productivity can be significantly improved through effective and scalable testing and debugging techniques for big data analytics, without impacting the DISC framework's performance. This dissertation affirms the feasibility of automated debugging and testing techniques for big data analytics---techniques that were previously considered infeasible for large-scale data processing.
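The abstract does not spell out BigSift's algorithm, but the core idea it builds on, delta debugging, is well known. As a rough illustration only (not the dissertation's implementation), Zeller's ddmin shrinks a failure-inducing input to a small subset by repeatedly testing partitions and their complements; the function names below are hypothetical:

```python
def ddmin(records, fails, granularity=2):
    """Delta debugging (ddmin): shrink `records` to a smaller subset
    on which the predicate `fails` still reproduces the failure."""
    while len(records) >= 2:
        chunk = max(1, len(records) // granularity)
        subsets = [records[i:i + chunk] for i in range(0, len(records), chunk)]
        reduced = False
        for i, subset in enumerate(subsets):
            # Try each subset first, then its complement.
            complement = [r for j, s in enumerate(subsets) if j != i for r in s]
            if fails(subset):
                records, granularity, reduced = subset, 2, True
                break
            if len(subsets) > 2 and fails(complement):
                records, granularity, reduced = complement, max(granularity - 1, 2), True
                break
        if not reduced:
            if granularity >= len(records):
                break  # cannot split any finer
            granularity = min(len(records), granularity * 2)
    return records
```

In a DISC setting, each call to `fails` would rerun the job on a candidate subset, which is why BigSift's use of data provenance to pre-narrow the candidate inputs matters for performance.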
RepeatFS: A File System Providing Reproducibility Through Provenance and Automation
Reproducibility is of central importance to the scientific process. The difficulty of consistently replicating and verifying experimental results is magnified in the era of big data, in which computational analysis often involves complex multi-application pipelines operating on terabytes of data. These processes result in thousands of possible permutations of data preparation steps, software versions, and command-line arguments. Existing reproducibility frameworks are cumbersome and involve redesigning computational methods. To address these issues, we developed two conceptual models and implemented them through RepeatFS, a file system that records, replicates, and verifies computational workflows with no alteration to the original methods. RepeatFS also provides provenance visualization and task automation.
We used RepeatFS to successfully visualize and replicate a variety of bioinformatics tasks consisting of over a million operations, with no alteration to the original methods. RepeatFS correctly identified all software inconsistencies that resulted in replication differences.
Trusted Artificial Intelligence in Manufacturing
The successful deployment of AI solutions in manufacturing environments hinges on their security, safety, and reliability, which become more challenging in settings where multiple AI systems (e.g., industrial robots, robotic cells, Deep Neural Networks (DNNs)) interact with one another and with humans. To guarantee the safe and reliable operation of AI systems on the shopfloor, many challenges must be addressed in complex, heterogeneous, dynamic, and unpredictable environments. Specifically, data reliability, human-machine interaction, security, transparency, and explainability challenges need to be addressed at the same time. Recent advances in AI research (e.g., in deep neural network security and explainable AI (XAI) systems), coupled with novel research outcomes in the formal specification and verification of AI systems, provide a sound basis for safe and reliable AI deployments in production lines. Moreover, the legal and regulatory dimension of safe and reliable AI solutions in production lines must be considered as well. To address some of the above-listed challenges, fifteen European organizations collaborate in the scope of the STAR project, a research initiative funded by the European Commission under its H2020 program (Grant Agreement Number: 956573). STAR researches, develops, and validates novel technologies that enable AI systems to acquire knowledge in order to take timely and safe decisions in dynamic and unpredictable environments. Moreover, the project researches and delivers approaches that enable AI systems to confront sophisticated adversaries and to remain robust against security attacks. This book is co-authored by the STAR consortium members and provides a review of technologies, techniques, and systems for trusted, ethical, and secure AI in manufacturing.
The different chapters of the book cover systems and technologies for industrial data reliability, responsible and transparent artificial intelligence systems, human-centred manufacturing systems such as human-centred digital twins, cyber-defence in AI systems, simulated reality systems, human-robot collaboration systems, as well as automated mobile robots for manufacturing environments. A variety of cutting-edge AI technologies are employed by these systems, including deep neural networks, reinforcement learning systems, and explainable artificial intelligence systems. Furthermore, relevant standards and applicable regulations are discussed. Beyond reviewing state-of-the-art standards and technologies, the book illustrates how the STAR research goes beyond the state of the art, towards enabling and showcasing human-centred technologies in production lines. Emphasis is put on dynamic human-in-the-loop scenarios, where ethical, transparent, and trusted AI systems co-exist with human workers. The book is made available as an open access publication, making it broadly and freely available to the AI and smart manufacturing communities.
Mark My Words: Analyzing and Evaluating Language Model Watermarks
The capabilities of large language models have grown significantly in recent years, and so too have concerns about their misuse. In this context, the ability to distinguish machine-generated text from human-authored content becomes important. Prior works have proposed numerous schemes to watermark text, which would benefit from a systematic evaluation framework. This work focuses on text watermarking techniques - as opposed to image watermarks - and proposes MARKMYWORDS, a comprehensive benchmark for them under different tasks as well as practical attacks. We focus on three main metrics: quality, size (e.g., the number of tokens needed to detect a watermark), and tamper-resistance. Current watermarking techniques are good enough to be deployed: Kirchenbauer et al. [1] can watermark Llama2-7B-chat with no perceivable loss in quality, the watermark can be detected with fewer than 100 tokens, and the scheme offers good tamper-resistance to simple attacks. We argue that watermark indistinguishability, a criterion emphasized in some prior works, is too strong a requirement: schemes that slightly modify logit distributions outperform their indistinguishable counterparts with no noticeable loss in generation quality. We publicly release our benchmark (https://github.com/wagner-group/MarkMyWords).
Conceptual Framework and Methodology for Analysing Previous Molecular Docking Results
Modern drug discovery relies on in-silico computational simulations such as molecular docking. Molecular docking models biochemical interactions to predict where and how two molecules would bind. The results of large-scale molecular docking simulations can provide valuable insight into the relationship between two molecules. This is useful to a biomedical scientist before conducting in-vitro or in-vivo wet-lab experiments. Although this field has seen great advancements, feedback from biomedical scientists shows that there is a need for storage and further analysis of molecular docking results. To meet this need, biomedical scientists need to have access to computing, data, and network resources, and require specific knowledge or skills they might lack.
Therefore, a conceptual framework specifically tailored to enable biomedical scientists to reuse molecular docking results, and a methodology which uses regular input from scientists, have been proposed. The framework is composed of 5 types of elements and 13 interfaces. The methodology is lightweight and relies on frequent communication between biomedical science and computer science experts, specified by particular roles. It shows how developers can benefit from using the framework, which allows them to determine whether a scenario fits the framework, whether an already implemented element can be reused, or whether a newly proposed tool can be used as an element.
Three scenarios that show the versatility of this new framework, and the methodology based on it, have been identified and implemented. A methodical planning and design approach was used, and it was shown that the implementations are at least as usable as existing solutions. To eliminate the need for access to expensive computing infrastructure, state-of-the-art cloud computing techniques are used.
The implementations enable faster identification of new molecules for use in docking, direct querying of existing databases, and simpler learning of good molecular docking practice without the need to manually run multiple tools. Thus, the framework and methodology enable more user-friendly implementations, and less error-prone use of computational methods in drug discovery. Their use could lead to more effective discovery of new drugs.
Understanding Gene Regulation In Development And Differentiation Using Single Cell Multi-Omics
Transcriptional regulation is a major determinant of tissue-specific gene expression during development. My thesis research leverages powerful single-cell approaches to address this fundamental question in two developmental systems, C. elegans embryogenesis and mouse embryonic hematopoiesis. I have also developed much-needed computational algorithms for single-cell data analysis and exploration. C. elegans is an animal with few cells, but a striking diversity of cell types. In this thesis, I characterize the molecular basis for their specification by analyzing the transcriptomes of 86,024 single embryonic cells. I identified 502 terminal and pre-terminal cell types, mapping most single-cell transcriptomes to their exact position in C. elegans’ invariant lineage. Using these annotations, I find that: 1) the correlation between a cell’s lineage and its transcriptome increases from mid to late gastrulation, then falls dramatically as cells in the nervous system and pharynx adopt their terminal fates; 2) multilineage priming contributes to the differentiation of sister cells at dozens of lineage branches; and 3) most distinct lineages that produce the same anatomical cell type converge to a homogeneous transcriptomic state. Next, I studied the development of hematopoietic stem cells (HSCs). All HSCs come from a specialized type of endothelial cells in the major arteries of the embryo called hemogenic endothelium (HE). To examine the cellular and molecular transitions underlying the formation of HSCs, we profiled nearly 40,000 rare single cells from the caudal arteries of embryonic day 9.5 (E9.5) to E11.5 mouse embryos using single-cell RNA-Seq and single-cell ATAC-Seq. I identified a continuous developmental trajectory from endothelial cells to early precursors of HSCs, and several critical transitional cell types during this process.
The intermediate stage most proximal to HE, which we termed pre-HE, is characterized by increased accessibility of chromatin enriched for SOX, FOX, GATA, and SMAD binding motifs. I also identified a developmental bottleneck that separates pre-HE from HE, and found that RUNX1 dosage regulates the efficiency of the pre-HE to HE transition. A distal enhancer of Runx1 shows high accessibility in pre-HE cells at the bottleneck, but loses accessibility thereafter. Once cells pass the bottleneck, they follow distinct developmental trajectories leading to an initial wave of lympho-myeloid-biased progenitors, followed by precursors of HSCs. During the course of both projects, I have developed novel computational methods for analyzing single-cell multi-omics data, including VERSE, PIVOT, and VisCello. Together, these tools constitute a comprehensive single-cell data analysis suite that facilitates the discovery of novel biological mechanisms.