User-guided Page Merging for Memory Deduplication in Serverless Systems
Serverless computing is an emerging cloud paradigm that offers an elastic and
scalable allocation of computing resources with pay-as-you-go billing. In the
Function-as-a-Service (FaaS) programming model, applications comprise
short-lived and stateless serverless functions executed in isolated containers
or microVMs, which can quickly scale to thousands of instances and process
terabytes of data. This flexibility comes at the cost of duplicated runtimes,
libraries, and user data spread across many function instances, and cloud
providers do not exploit this redundancy. The memory footprint of serverless
functions forces providers to remove idle containers to make space for new ones,
which degrades performance through more cold starts and fewer data-caching
opportunities. We
address this issue by proposing deduplicating memory pages of serverless
workers with identical content, based on the content-based page-sharing concept
of Linux Kernel Same-page Merging (KSM). We replace the background memory
scanning process of KSM, as it is too slow to locate sharing candidates in
short-lived functions. Instead, we design User-Guided Page Merging (UPM), a
built-in Linux kernel module that leverages the madvise system call: we enable
users to advise the kernel of memory areas that can be shared with others. We
show that UPM reduces memory consumption by up to 55% on 16 concurrent
containers executing a typical image recognition function, more than doubling
the density for containers of the same function that can run on a system.
Comment: Accepted at IEEE BigData 202
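The kernel interface UPM builds on is the madvise advice path that KSM already exposes on Linux; a minimal sketch of that existing path (not of UPM itself, whose kernel module is the paper's contribution) might look like this, with the Linux-only flag guarded for portability:

```python
# Sketch of the KSM advice path that UPM extends: a process maps an
# anonymous region and tells the kernel its pages are merge candidates.
# MADV_MERGEABLE is Linux-specific, hence the hasattr guard.
import mmap

region = mmap.mmap(-1, mmap.PAGESIZE * 4)   # 4 anonymous pages
region.write(b"\x00" * len(region))         # identical content across instances

if hasattr(mmap, "MADV_MERGEABLE"):
    # Advise the kernel that this range may be deduplicated by KSM.
    region.madvise(mmap.MADV_MERGEABLE)
```

With plain KSM, a background scanner must then find identical pages; the point of UPM is to let such advice guide merging directly, without waiting for a scan.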
A Survey of Hashing Techniques for High Performance Computing
Hashing is a well-known and widely used technique for providing O(1) access to large files on secondary storage and to tables in memory. Hashing techniques were introduced in the early 1960s. The term hash function has historically been used to denote a function that compresses a string of arbitrary length to a string of fixed length. Hashing finds applications in other fields such as fuzzy matching, error checking, authentication, cryptography, and networking. As routing tables have grown in size, hashing techniques have also been applied to provide faster access to them. More recently, hashing has found applications in hardware transactional memory. Motivated by these newly emerged applications of hashing, in this paper we present a survey of hashing techniques, starting from traditional hashing methods and placing greater emphasis on recent developments. We also provide a brief explanation of hardware hashing and a brief introduction to transactional memory.
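As a concrete illustration of the basic idea, here is a classic fixed-length hash function used to bucket keys for O(1) expected access; FNV-1a is one published example chosen for brevity, not a method the survey singles out:

```python
def fnv1a_64(data: bytes) -> int:
    # 64-bit FNV-1a: XOR each byte into the state, then multiply by the
    # FNV prime; the constants are the published FNV parameters.
    h = 0xcbf29ce484222325
    for b in data:
        h ^= b
        h = (h * 0x100000001b3) & 0xFFFFFFFFFFFFFFFF
    return h

# O(1) expected lookup: map each key to one of a fixed number of buckets.
NUM_BUCKETS = 1024
buckets = [[] for _ in range(NUM_BUCKETS)]

def insert(key: bytes, value):
    buckets[fnv1a_64(key) % NUM_BUCKETS].append((key, value))

def lookup(key: bytes):
    for k, v in buckets[fnv1a_64(key) % NUM_BUCKETS]:
        if k == key:
            return v
    return None
```

The same fixed-length-digest idea underlies the routing-table and transactional-memory applications mentioned above.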
Efficient Collection and Processing of Cyber Threat Intelligence from Partner Feeds
Sharing of threat intelligence between organizations and companies in the cyber security industry is a crucial part of proactive defense against security threats. Even though some standardization efforts exist, most publishers of cyber security feeds use their own approach and provide data in varying formats, schemata, and compression algorithms, and through differing APIs. This makes every feed unique and complicates automated collection and processing. Furthermore, the published data may contain many irrelevant records, such as duplicates or data about very exotic files or websites, which are not useful.
In this work, we present Feed Automation, a cloud-based system for fully automatic collection and processing of cyber threat intelligence from a variety of online feeds. The system provides two means of reducing noise in the data: a smart deduplication service based on a sliding-window technique, which removes only those duplicates that carry no important changes in their metadata; and efficient rules, easily configurable by malware analysts, to remove records that are not useful. Additionally, we propose a filtering solution based on machine learning, which predicts how useful a record is for our backend systems based on historic data. We demonstrate how this system unifies feed collection, processing, and data noise reduction in one automated system, speeding up development, simplifying maintenance, and reducing the load on the backend systems.
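The abstract does not spell out the deduplication service's internals, but the sliding-window idea it names can be sketched as follows; the field names are illustrative assumptions, not the system's schema:

```python
from collections import OrderedDict

# Minimal sliding-window deduplication sketch: a record is a duplicate if
# a key built from its *important* metadata fields was seen within the
# last WINDOW records. Records differing only in irrelevant fields
# (e.g. a timestamp) are therefore dropped.
WINDOW = 10_000
IMPORTANT_FIELDS = ("sha256", "verdict")   # assumed fields, for illustration

def dedup_stream(records):
    seen = OrderedDict()                   # insertion-ordered window
    for rec in records:
        key = tuple(rec.get(f) for f in IMPORTANT_FIELDS)
        if key in seen:
            continue                       # duplicate, no important change
        seen[key] = True
        if len(seen) > WINDOW:
            seen.popitem(last=False)       # slide the window forward
        yield rec
```

A record whose verdict changes still passes through, since its key differs, which matches the stated goal of dropping only duplicates with no important metadata changes.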
Efficient Cloud Backup and Private Search
As organizations and companies increasingly offload data and computation to the cloud to reduce infrastructure administration, data volume keeps growing, and new services and algorithms are needed to meet increasing demands for both storage capacity and privacy. The first part of my thesis addresses cloud data backup. Organizations and companies often back up and archive high volumes of binary and text datasets for fault tolerance, internal investigation, and electronic discovery. Source-side deduplication has the advantage of avoiding or minimizing duplicate data transmitted over the network; however, it demands more computing resources to perform extensive fingerprint comparison, resources that would otherwise be available for primary services at the source. For data stored in the cloud, users need efficient, scalable services for searching these files. In the first part of this thesis, I cover the key components of existing solutions for large-scale backup storage in the cloud. I go into detail on why deduplication is important to large-scale backup systems and review some ongoing work. I also detail my contributions in this area towards low-profile source-side deduplication. The second part of my thesis addresses an open problem: efficient private search over documents hosted in the cloud. As sensitive information is increasingly centralized in the cloud, such data is often encrypted to protect its privacy, which makes effective data indexing and search a very challenging task.
To overcome the challenges of querying encrypted datasets, searchable encryption schemes allow users to securely search over encrypted data through keywords. However, no existing solutions support efficient ranking, which involves complex arithmetic computation in feature composition and scoring, and without relevance ranking of search results, queries over very large datasets that may return many results can be impractical. In the second part of my thesis, I review existing work on private search and introduce our ongoing and published work on this open problem, focusing on how to make private search practical and scalable for large datasets.
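The source-side deduplication idea from the first part of the thesis can be sketched in a few lines; this is an illustrative fixed-size-chunk version under assumed parameters, not the thesis's implementation:

```python
import hashlib

# Source-side deduplication sketch: split data into fixed-size chunks,
# fingerprint each chunk with SHA-256, and transmit only chunks the
# server's index has not seen before.
CHUNK_SIZE = 4096

def chunk_fingerprints(data: bytes):
    for off in range(0, len(data), CHUNK_SIZE):
        chunk = data[off:off + CHUNK_SIZE]
        yield hashlib.sha256(chunk).hexdigest(), chunk

def backup(data: bytes, server_index: set) -> int:
    """Simulate a backup run; return bytes actually sent over the network."""
    sent = 0
    for fp, chunk in chunk_fingerprints(data):
        if fp not in server_index:      # chunk unknown to the server
            server_index.add(fp)
            sent += len(chunk)
    return sent
```

The fingerprinting loop is exactly the source-side computation the abstract flags as competing with primary services, which motivates the "low-profile" variants the thesis pursues.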
Making Data Storage Efficient in the Era of Cloud Computing
We entered the era of cloud computing in the last decade, as many paradigm shifts occurred in how people write and deploy applications. Despite the advancement of cloud computing, data storage abstractions have not evolved much, causing inefficiencies in performance, cost, and security.
This dissertation proposes a novel approach to make data storage efficient in the era of cloud computing by building new storage abstractions and systems that bridge the gap between cloud computing and data storage and simplify development. We build four systems to address four data inefficiencies in cloud computing.
The first system, Grandet, solves the data storage inefficiency caused by the paradigm shift from upfront provisioning to a variety of pay-as-you-go cloud services. Grandet is an extensible storage system that significantly reduces storage costs for web applications deployed in the cloud. Under the hood, it supports multiple heterogeneous stores and unifies them by placing each data object at the store deemed most economical. Our results show that Grandet reduces storage costs by an average of 42.4%, and that it is fast, scalable, and easy to use.
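The placement decision at Grandet's core can be sketched as a cost comparison across stores; the store names and prices below are hypothetical, and the real system's cost model is surely richer:

```python
# Cost-based object placement sketch: pick the store with the lowest
# estimated monthly cost for an object's size and access pattern.
# Store names and prices are invented for illustration.
STORES = {
    # name: (storage $/GB-month, $ per 10k GET requests)
    "blob": (0.023, 0.004),   # cheap storage, priced requests
    "kv":   (0.25,  0.0),     # expensive storage, free requests
}

def estimate_cost(store: str, size_gb: float, gets_per_month: int) -> float:
    storage, per_10k = STORES[store]
    return size_gb * storage + (gets_per_month / 10_000) * per_10k

def place(size_gb: float, gets_per_month: int) -> str:
    return min(STORES, key=lambda s: estimate_cost(s, size_gb, gets_per_month))
```

Under this toy model, a large rarely-read object lands in the blob store, while a tiny hot object lands in the key-value store, which is the kind of per-object economics the abstract describes.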
The second system, Unic, solves the data inefficiency caused by the paradigm shift from single-tenancy to multi-tenancy. Unic securely deduplicates general computations. It exports a cache service that allows cloud applications running on behalf of mutually distrusting users to memoize and reuse computation results, thereby improving performance. Unic achieves both integrity and secrecy through a novel use of code attestation, and it provides a simple yet expressive API that enables applications to deduplicate their own rich computations. Our results show that Unic is easy to use and speeds up applications by an average of 7.58x with little storage overhead.
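The memoization idea behind Unic can be sketched as a cache keyed by the computation and its input; here a hash of the source code stands in for the code attestation the real system uses, so this shows only the caching shape, not the security mechanism:

```python
import hashlib

# Computation-deduplication sketch: cache results under
# (hash of code, hash of input) so the same computation requested by any
# tenant is executed once and reused. The code hash is a stand-in for
# attestation; it provides none of Unic's integrity or secrecy guarantees.
_cache = {}

def memoized_run(func_source: str, arg: bytes, run):
    key = (hashlib.sha256(func_source.encode()).hexdigest(),
           hashlib.sha256(arg).hexdigest())
    if key not in _cache:
        _cache[key] = run(arg)     # compute once, reuse across tenants
    return _cache[key]
```

Keying on the code as well as the input is what lets mutually distrusting users share results safely: a cached entry is only reused for the identical computation.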
The third system, Lambdata, solves the data inefficiency caused by the paradigm shift to serverless computing, where developers only write core business logic, and cloud service providers maintain all the infrastructure. Lambdata is a novel serverless computing system that enables developers to declare a cloud function's data intents, including both data read and data written. Once data intents are made explicit, Lambdata performs a variety of optimizations to improve speed, including caching data locally and scheduling functions based on code and data locality. Our results show that Lambdata achieves an average speedup of 1.51x on the turnaround time of practical workloads and reduces monetary cost by 16.5%.
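The locality-aware scheduling that Lambdata's data intents enable can be sketched in miniature; the API below is invented for illustration and is not Lambdata's actual interface:

```python
# Intent-aware scheduling sketch: each invocation declares the objects it
# will read, and the scheduler prefers the worker that already caches the
# most of those inputs, minimizing data movement.
def schedule(declared_reads, workers):
    """workers: {worker_name: set of locally cached object keys}."""
    wanted = set(declared_reads)
    return max(workers, key=lambda w: len(workers[w] & wanted))
```

Declared writes would feed the same cache map: after a function runs, its outputs are recorded as cached on that worker, so downstream functions reading them are scheduled there.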
The fourth system, CleanOS, solves the data inefficiency caused by the paradigm shift from desktop computers to smartphones always connected to the cloud. CleanOS is a new Android-based operating system that manages sensitive data rigorously and maintains a clean environment at all times. It identifies and tracks sensitive data, encrypts it with a key, and evicts that key to the cloud when the data is not in active use on the device. Our results show that CleanOS limits sensitive-data exposure drastically while incurring acceptable overheads on mobile networks.
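The encrypt-then-evict lifecycle can be sketched as follows; the XOR stream here is a deliberate stand-in for real authenticated encryption, so this shows only the key-eviction flow, not CleanOS's cryptography:

```python
import hashlib, itertools

def _xor_stream(key: bytes, data: bytes) -> bytes:
    # Toy symmetric cipher (its own inverse); a placeholder, NOT secure.
    pad = hashlib.sha256(key).digest()
    return bytes(b ^ p for b, p in zip(data, itertools.cycle(pad)))

class SensitiveObject:
    """Key-eviction sketch: ciphertext stays on the device, the key does not."""
    def __init__(self, key: bytes, plaintext: bytes):
        self._blob = _xor_stream(key, plaintext)   # only ciphertext on device
        self._key = key

    def evict(self, cloud: dict, obj_id: str):
        cloud[obj_id] = self._key                  # escrow key in the cloud
        self._key = None                           # device forgets the key

    def access(self, cloud: dict, obj_id: str) -> bytes:
        if self._key is None:
            self._key = cloud[obj_id]              # refetch key on next use
        return _xor_stream(self._key, self._blob)
```

While the key is evicted, a lost or seized device holds only ciphertext; the cost is a network round trip on the next access, which is the overhead the evaluation measures.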
Secure and efficient processing of outsourced data structures using trusted execution environments
In recent years, more and more companies make use of cloud computing; in other words, they outsource data storage and data processing to a third party, the cloud provider. From cloud computing, the companies expect, for example, cost reductions, fast deployment time, and improved security. However, security also presents a significant challenge as demonstrated by many cloud computing–related data breaches. Whether it is due to failing security measures, government interventions, or internal attackers, data leakages can have severe consequences, e.g., revenue loss, damage to brand reputation, and loss of intellectual property. A valid strategy to mitigate these consequences is data encryption during storage, transport, and processing. Nevertheless, the outsourced data processing should combine the following three properties: strong security, high efficiency, and arbitrary processing capabilities.
Many approaches for outsourced data processing based purely on cryptography are available, for instance encrypted storage of outsourced data, property-preserving encryption, fully homomorphic encryption, searchable encryption, and functional encryption. However, all of these approaches fail in at least one of the three mentioned properties.
Besides approaches purely based on cryptography, some approaches use a trusted execution environment (TEE) to process data at a cloud provider. TEEs provide an isolated processing environment for user-defined code and data, i.e., the confidentiality and integrity of code and data processed in this environment are protected against other software and physical accesses.
Additionally, TEEs promise efficient data processing.
Various research papers use TEEs to protect objects at different levels of granularity. On the one end of the range, TEEs can protect entire (legacy) applications. This approach facilitates the development effort for protected applications as it requires only minor changes. However, the downsides of this approach are that the attack surface is large, it is difficult to capture the exact leakage, and it might not even be possible as the isolated environment of commercially available TEEs is limited. On the other end of the range, TEEs can protect individual, stateless operations, which are called from otherwise unchanged applications. This approach does not suffer from the problems stated before, but it leaks the (encrypted) result of each operation and the detailed control flow through the application. It is difficult to capture the leakage of this approach, because it depends on the processed operation and the operation’s location in the code.
In this dissertation, we propose a trade-off between both approaches: the TEE-based processing of data structures. In this approach, otherwise unchanged applications call a TEE for self-contained data structure operations and receive encrypted results. We examine three data structures: TEE-protected B+-trees, TEE-protected database dictionaries, and TEE-protected file systems. Using these data structures, we design three secure and efficient systems: an outsourced system for index searches; an outsourced, dictionary-encoding–based, column-oriented, in-memory database supporting analytic queries on large datasets; and an outsourced system for group file sharing supporting large and dynamic groups.
Due to our approach, the systems have a small attack surface, a low likelihood of security-relevant bugs, and a data owner can easily perform a (formal) code verification of the sensitive code. At the same time, we prevent low-level leakage of individual operation results. For all systems, we present a thorough security evaluation showing lower bounds of security. Additionally, we use prototype implementations to present upper bounds on performance. For our implementations, we use a widely available TEE that has a limited isolated environment—Intel Software Guard Extensions. By comparing our systems to related work, we show that they provide a favorable trade-off regarding security and efficiency.
Large Language Models for Software Engineering: A Systematic Literature Review
Large Language Models (LLMs) have significantly impacted numerous domains,
notably including Software Engineering (SE). Nevertheless, a well-rounded
understanding of the application, effects, and possible limitations of LLMs
within SE is still in its early stages. To bridge this gap, our systematic
literature review takes a deep dive into the intersection of LLMs and SE, with
a particular focus on understanding how LLMs can be exploited in SE to optimize
processes and outcomes. Through a comprehensive review approach, we collect and
analyze a total of 229 research papers from 2017 to 2023 to answer four key
research questions (RQs). In RQ1, we categorize and provide a comparative
analysis of different LLMs that have been employed in SE tasks, laying out
their distinctive features and uses. For RQ2, we detail the methods involved in
data collection, preprocessing, and application in this realm, shedding light
on the critical role of robust, well-curated datasets for successful LLM
implementation. RQ3 allows us to examine the specific SE tasks where LLMs have
shown remarkable success, illuminating their practical contributions to the
field. Finally, RQ4 investigates the strategies employed to optimize and
evaluate the performance of LLMs in SE, as well as the common techniques
related to prompt optimization. Armed with insights drawn from addressing the
aforementioned RQs, we sketch a picture of the current state-of-the-art,
pinpointing trends, identifying gaps in existing research, and flagging
promising areas for future study.