Search CORE

200 research outputs found

CLAP: Learning Transferable Binary Code Representations with Natural Language Supervision

Author: Gao Zeyu
Qiu Han
Sha Zihan
Sun Mingyang
Sun Wenju
Wang Hao
Xiao Xi
Zhang Chao
Zhou Yuchen
Zhu Wenyu
Publication venue
Publication date: 26/02/2024
Field of study

Binary code representation learning has shown significant performance in binary analysis tasks. But existing solutions often have poor transferability, particularly in few-shot and zero-shot scenarios where few or no training samples are available for the tasks. To address this problem, we present CLAP (Contrastive Language-Assembly Pre-training), which employs natural language supervision to learn better representations of binary code (i.e., assembly code) and get better transferability. At the core, our approach boosts superior transfer learning capabilities by effectively aligning binary code with their semantics explanations (in natural language), resulting a model able to generate better embeddings for binary code. To enable this alignment training, we then propose an efficient dataset engine that could automatically generate a large and diverse dataset comprising of binary code and corresponding natural language explanations. We have generated 195 million pairs of binary code and explanations and trained a prototype of CLAP. The evaluations of CLAP across various downstream tasks in binary analysis all demonstrate exceptional performance. Notably, without any task-specific training, CLAP is often competitive with a fully supervised baseline, showing excellent transferability. We release our pre-trained model and code at https://github.com/Hustcw/CLAP

arXiv.org e-Print Archive

Convicted by memory: Automatically recovering spatial-temporal evidence from memory images

Author: Saltaformaggio Brendan D
Publication venue: 'Purdue University (bepress)'
Publication date: 01/01/2016
Field of study

Memory forensics can reveal “up to the minute” evidence of a device’s usage, often without requiring a suspect’s password to unlock the device, and it is oblivious to any persistent storage encryption schemes, e.g., whole disk encryption. Prior to my work, researchers and investigators alike considered data-structure recovery the ultimate goal of memory image forensics. This, however, was far from sufficient, as investigators were still largely unable to understand the content of the recovered evidence, and hence efficiently locating and accurately analyzing such evidence locked in memory images remained an open research challenge. In this dissertation, I propose breaking from traditional data-recovery-oriented forensics, and instead I present a memory forensics framework which leverages program analysis to automatically recover spatial-temporal evidence from memory images by understanding the programs that generated it. This framework consists of four techniques, each of which builds upon the discoveries of the previous, that represent this new paradigm of program-analysis-driven memory forensics. First, I present DSCRETE, a technique which reuses a program’s own interpretation and rendering logic to recover and present in-memory data structure contents. Following that, VCR developed vendor-generic data structure identification for the recovery of in-memory photographic evidence produced by an Android device’s cameras. GUITAR then realized an app-independent technique which automatically reassembles and redraws an app’s GUI from the multitude of GUI data elements found in a smartphone’s memory image. Finally, different from any traditional memory forensics technique, RetroScope introduced the vision of spatial-temporal memory forensics by retargeting an Android app’s execution to recover sequences of previous GUI screens, in their original temporal order, from a memory image. This framework, and the new program analysis techniques which enable it, have introduced encryption-oblivious forensics capabilities far exceeding traditional data-structure recovery

Purdue E-Pubs

Privacy preserving linkage and sharing of sensitive data

Author: Lazrig Ibrahim Meftah
Publication venue: Colorado State University. Libraries
Publication date: 01/01/2018
Field of study

2018 Summer.Includes bibliographical references.Sensitive data, such as personal and business information, is collected by many service providers nowadays. This data is considered as a rich source of information for research purposes that could benet individuals, researchers and service providers. However, because of the sensitivity of such data, privacy concerns, legislations, and con ict of interests, data holders are reluctant to share their data with others. Data holders typically lter out or obliterate privacy related sensitive information from their data before sharing it, which limits the utility of this data and aects the accuracy of research. Such practice will protect individuals' privacy; however it prevents researchers from linking records belonging to the same individual across dierent sources. This is commonly referred to as record linkage problem by the healthcare industry. In this dissertation, our main focus is on designing and implementing ecient privacy preserving methods that will encourage sensitive information sources to share their data with researchers without compromising the privacy of the clients or aecting the quality of the research data. The proposed solution should be scalable and ecient for real-world deploy- ments and provide good privacy assurance. While this problem has been investigated before, most of the proposed solutions were either considered as partial solutions, not accurate, or impractical, and therefore subject to further improvements. We have identied several issues and limitations in the state of the art solutions and provided a number of contributions that improve upon existing solutions. Our rst contribution is the design of privacy preserving record linkage protocol using semi-trusted third party. The protocol allows a set of data publishers (data holders) who compete with each other, to share sensitive information with subscribers (researchers) while preserving the privacy of their clients and without sharing encryption keys. Our second contribution is the design and implementation of a probabilistic privacy preserving record linkage protocol, that accommodates discrepancies and errors in the data such as typos. This work builds upon the previous work by linking the records that are similar, where the similarity range is formally dened. Our third contribution is a protocol that performs information integration and sharing without third party services. We use garbled circuits secure computation to design and build a system to perform the record linkages between two parties without sharing their data. Our design uses Bloom lters as inputs to the garbled circuits and performs a probabilistic record linkage using the Dice coecient similarity measure. As garbled circuits are known for their expensive computations, we propose new approaches that reduce the computation overhead needed, to achieve a given level of privacy. We built a scalable record linkage system using garbled circuits, that could be deployed in a distributed computation environment like the cloud, and evaluated its security and performance. One of the performance issues for linking large datasets is the amount of secure computation to compare every pair of records across the linked datasets to nd all possible record matches. To reduce the amount of computations a method, known as blocking, is used to lter out as much as possible of the record pairs that will not match, and limit the comparison to a subset of the record pairs (called can- didate pairs) that possibly match. Most of the current blocking methods either require the parties to share blocking keys (called blocks identiers), extracted from the domain of some record attributes (termed blocking variables), or share reference data points to group their records around these points using some similarity measures. Though these methods reduce the computation substantially, they leak too much information about the records within each block. Toward this end, we proposed a novel privacy preserving approximate blocking scheme that allows parties to generate the list of candidate pairs with high accuracy, while protecting the privacy of the records in each block. Our scheme is congurable such that the level of performance and accuracy could be achieved according to the required level of privacy. We analyzed the accuracy and privacy of our scheme, implemented a prototype of the scheme, and experimentally evaluated its accuracy and performance against dierent levels of privacy

Mountain Scholar (Digital Collections of Colorado and Wyoming)

Privacy-preserving techniques for computer and network forensics

Author: Shebaro Bilal
Publication venue: UNM Digital Repository
Publication date: 01/05/2012
Field of study

Clients, administrators, and law enforcement personnel have many privacy concerns when it comes to network forensics. Clients would like to use network services in a freedom-friendly environment that protects their privacy and personal data. Administrators would like to monitor their network, and audit its behavior and functionality for debugging and statistical purposes (which could involve invading the privacy of its network users). Finally, members of law enforcement would like to track and identify any type of digital crimes that occur on the network, and charge the suspects with the appropriate crimes. Members of law enforcement could use some security back doors made available by network administrators, or other forensic tools, that could potentially invade the privacy of network users. In my dissertation, I will be identifying and implementing techniques that each of these entities could use to achieve their goals while preserving the privacy of users on the network. I will show a privacy-preserving implementation of network flow recording that can allow administrators to monitor and audit their network behavior and functionality for debugging and statistical purposes without having this data contain any private information about its users. This implementation is based on identity-based encryption and differential privacy. I will also be showing how law enforcement could use timing channel techniques to fingerprint anonymous servers that are running websites with illegal content and services. Finally I will show the results from a thought experiment about how network administrators can identify pattern-like software that is running on clients\u27 machines remotely without any administrative privileges. The goal of my work is to understand what privileges administrators or law enforcement need to achieve their goals, and the privacy issues inherent in this, and to develop technologies that help administrators and law enforcement achieve their goals while preserving the privacy of network users

Software Obfuscation with Symmetric Cryptography

Author: Lin Alan C.
Publication venue: AFIT Scholar
Publication date: 01/03/2008
Field of study

Software protection is of great interest to commercial industry. Millions of dollars and years of research are invested in the development of proprietary algorithms used in software programs. A reverse engineer that successfully reverses another company‘s proprietary algorithms can develop a competing product to market in less time and with less money. The threat is even greater in military applications where adversarial reversers can use reverse engineering on unprotected military software to compromise capabilities on the field or develop their own capabilities with significantly less resources. Thus, it is vital to protect software, especially the software’s sensitive internal algorithms, from adversarial analysis. Software protection through obfuscation is a relatively new research initiative. The mathematical and security community have yet to agree upon a model to describe the problem let alone the metrics used to evaluate the practical solutions proposed by computer scientists. We propose evaluating solutions to obfuscation under the intent protection model, a combination of white-box and black-box protection to reflect how reverse engineers analyze programs using a combination white-box and black-box attacks. In addition, we explore use of experimental methods and metrics in analogous and more mature fields of study such as hardware circuits and cryptography. Finally, we implement a solution under the intent protection model that demonstrates application of the methods and evaluation using the metrics adapted from the aforementioned fields of study to reflect the unique challenges in a software-only software protection technique

AFTI Scholar (Air Force Institute of Technology)

Security strategies in genomic files

Author: Naro Daniel
Publication venue: Universitat Politècnica de Catalunya
Publication date: 15/05/2020
Field of study

There are new mechanisms to sequence and process the genomic code, discovering thus diagnostic tools and treatments. The file for a sequenced genome can reach hundreds of gigabytes. Thus, for further studies, we need new means to compress the information and a standardized representation to simplify the development of new tools. The ISO standardization group MPEG has used its expertise in compressing multimedia content to compress genomic information and develop its ´MPEG-G standard’. Given the sensitivity of the data, security is a major identified requirement. This thesis proposes novel technologies that assure the security of both the sequenced data and its metadata. We define a container-based file format to group data, metadata, and security information at the syntactical level. It includes new features like grouping multiple results in a same file to simplify the transport of whole studies. We use the granularity of the encoder’s output to enhance security. The information is represented in units, each dedicated to a specific region of the genome, which allows to provide encryption and signature features on a region base. We analyze the trade-off between security and an even more fine-grained approach and prove that apparently secure settings can be insecure: if the file creator may encrypt only specific elements of a unit, cross-checking unencrypted information permits to infer encrypted content. Most of the proposals for MPEG-G coming from other research groups and companies focused on data compression and representation. However, the need was recognized to find a solution for metadata encoding. Our proposal was included in the standard: an XML-based solution, separated in a core specification and extensions. It permits to adapt the metadata schema to the different genomic repositories' frameworks, without importing requirements from one framework to another. To simplify the handling of the resulting metadata, we define profiles, i.e. lists of extensions that must be present in a given framework. We use XML signature and XML encryption for metadata security. The MPEG requirements also concern access rules. Our privacy solutions limit the range of persons with access and we propose access rules represented with XACML to convey under which circumstances a user is granted access to a specific action among the ones specified in MPEG-G's API, e.g. filtering data by attributes. We also specify algorithms to combine multiple rules by defining default behaviors and exceptions. The standard’s security mechanisms protect the information only during transport and access. Once the data is obtained, the user could publish it. In order to identify leakers, we propose an algorithm that generates unique, virtually undetectable variations. Our solution is novel as the marking can be undone (and the utility of the data preserved) if the corresponding secret key is revealed. We also show how to combine multiple secret keys to avoid collusion. The API retained for MPEG-G considers search criteria not present in the indexing tables, which highlights shortcomings. Based on the proposed MPEG-G API we have developed a solution. It is based on a collaboration framework where the different users' needs and the patient's privacy settings result in a purpose-built file format that optimizes query times and provides privacy and authenticity on the patient-defined genomic regions. The encrypted output units are created and indexed to optimize query times and avoid rarely used indexing fields. Our approach resolves the shortcomings of MPEG-G's indexing strategy. We have submitted our technologies to the MPEG standardization committee. Many have been included in the final standard, via merging with other proposals (e.g. file format), discussion (e.g. security mechanisms), or direct acceptance (e.g. privacy rules).Hi han nous mètodes per la seqüenciació i el processament del codi genòmic, permetent descobrir eines de diagnòstic i tractaments en l’àmbit mèdic. El resultat de la seqüenciació d’un genoma es representa en un fitxer, que pot ocupar centenars de gigabytes. Degut a això, hi ha una necessitat d’una representació estandarditzada on la informació és comprimida. Dins de la ISO, el grup MPEG ha fet servir la seva experiència en compressió de dades multimèdia per comprimir dades genòmiques i desenvolupar l'estàndard MPEG-G, sent la seguretat un dels requeriments principals. L'objectiu de la tesi és garantir aquesta seguretat (encriptant, firmant i definint regles d¿ accés) tan per les dades seqüenciades com per les seves metadades. El primer pas és definir com transportar les dades, metadades i paràmetres de seguretat. Especifiquem un format de fitxer basat en contenidors per tal d'agrupar aquets elements a nivell sintàctic. La nostra solució proposa noves funcionalitats com agrupar múltiples resultats en un mateix fitxer. Pel que fa la seguretat de dades, la nostra proposta utilitza les propietats de la sortida del codificador. Aquesta sortida és estructurada en unitats, cadascuna dedicada a una regió concreta del genoma, permetent una encriptació i firma de dades específica a la unitat. Analitzem el compromís entre seguretat i un enfocament de gra més fi demostrant que configuracions aparentment vàlides poden no ser-ho: si es permet encriptar sols certes sub-unitats d'informació, creuant els continguts no encriptats, podem inferir el contingut encriptat. Quant a metadades, proposem una solució basada en XML separada en una especificació bàsica i en extensions. Podem adaptar l'esquema de metadades als diferents marcs de repositoris genòmics, sense imposar requeriments d’un marc a un altre. Per simplificar l'ús, plantegem la definició de perfils, és a dir, una llista de les extensions que han de ser present per un marc concret. Fem servir firmes XML i encriptació XML per implementar la seguretat de les metadades. Les nostres solucions per la privacitat limiten qui té accés a les dades, però no en limita l’ús. Proposem regles d’accés representades amb XACML per indicar en quines circumstàncies un usuari té dret d'executar una de les accions especificades a l'API de MPEG-G (per exemple, filtrar les dades per atributs). Presentem algoritmes per combinar regles, per tal de poder definir casos per defecte i excepcions. Els mecanismes de seguretat de MPEG-G protegeixen la informació durant el transport i l'accés. Una vegada l’usuari ha accedit a les dades, les podria publicar. Per tal d'identificar qui és l'origen del filtratge de dades, proposem un algoritme que genera modificacions úniques i virtualment no detectables. La nostra solució és pionera, ja que els canvis es poden desfer si el secret corresponent és publicat. Per tant, la utilitat de les dades és mantinguda. Demostrem que combinant varis secrets, podem evitar col·lusions. L'API seleccionada per MPEG-G, considera criteris de cerca que no són presents en les taules d’indexació. Basant-nos en aquesta API, hem desenvolupat una solució. És basada en un marc de col·laboració, on la combinació de les necessitats dels diferents usuaris i els requeriments de privacitat del pacient, es combinen en una representació ad-hoc que optimitza temps d’accessos tot i garantint la privacitat i autenticitat de les dades. La majoria de les nostres propostes s’han inclòs a la versió final de l'estàndard, fusionant-les amb altres proposes (com amb el format del fitxer), demostrant la seva superioritat (com amb els mecanismes de seguretat), i fins i tot sent acceptades directament (com amb les regles de privacitat)

A review and open issues of diverse text watermarking techniques in spatial domain

Author: Alwan Ali A.
Hashim Mohammed Mahdi
M. A. Shahidan
Mohd. Rahim Mohd. Shafry
Sjarif Nilam Nur Amir
Publication venue: JATIT & LLS
Publication date: 15/09/2018
Field of study

Nowadays, information hiding is becoming a helpful technique and fetches more attention due to the fast growth of using the internet; it is applied for sending secret information by using different techniques. Watermarking is one of major important technique in information hiding. Watermarking is of hiding secret data into a carrier media to provide the privacy and integrity of information so that no one can recognize and detect it's accepted the sender and receiver. In watermarking, many various carrier formats can be used such as an image, video, audio, and text. The text is most popular used as a carrier files due to its frequency on the internet. There are many techniques variables for the text watermarking; each one has its own robust and susceptible points. In this study, we conducted a review of text watermarking in the spatial domain to explore the term text watermarking by reviewing, collecting, synthesizing and analyze the challenges of different studies which related to this area published from 2013 to 2018. The aims of this paper are to provide an overview of text watermarking and comparison between approved studies as discussed according to the Arabic text characters, payload capacity, Imperceptibility, authentication, and embedding technique to open important research issues in the future work to obtain a robust method

Universiti Teknologi Malaysia Institutional Repository