
    SAFE: Self-Attentive Function Embeddings for Binary Similarity

    The binary similarity problem consists in determining whether two functions are similar by considering only their compiled form. Advanced techniques for binary similarity have recently gained momentum, as they can be applied in several fields, such as copyright disputes, malware analysis, and vulnerability detection, and thus have an immediate practical impact. Current solutions compare functions by first transforming their binary code into multi-dimensional vector representations (embeddings), and then comparing the vectors through simple and efficient geometric operations. However, embeddings are usually derived from binary code using manual feature extraction, which may fail to consider important function characteristics, or may consider features that are not relevant to the binary similarity problem. In this paper we propose SAFE, a novel architecture for the embedding of functions based on a self-attentive neural network. SAFE works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions (i.e., it does not incur the computational overhead of building or manipulating control flow graphs), and is more general as it works on stripped binaries and on multiple architectures. We report the results of a quantitative and qualitative analysis showing that SAFE provides a noticeable performance improvement with respect to previous solutions. Furthermore, we show that clusters of our embedding vectors are closely related to the semantics of the implemented algorithms, paving the way for further interesting applications (e.g., semantic-based binary function search).
    Comment: Published in the International Conference on Detection of Intrusions and Malware, and Vulnerability Assessment (DIMVA) 201
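    As a concrete illustration of the architecture, here is a minimal PyTorch sketch of a SAFE-style self-attentive embedder: instruction tokens are embedded, contextualized by a bidirectional GRU, pooled with multi-head self-attention, and the resulting fixed-size vectors are compared by cosine similarity. The vocabulary size and all layer dimensions are illustrative assumptions, not the authors' released configuration.

    ```python
    # Sketch of a self-attentive function embedder (illustrative, not SAFE's code).
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentiveEmbedder(nn.Module):
        def __init__(self, vocab=5000, dim=100, hidden=50, heads=10, out=100):
            super().__init__()
            self.tok = nn.Embedding(vocab, dim)             # instruction -> vector
            self.rnn = nn.GRU(dim, hidden, batch_first=True,
                              bidirectional=True)           # contextualize the sequence
            self.attn = nn.Linear(2 * hidden, heads)        # per-head attention scores
            self.proj = nn.Linear(heads * 2 * hidden, out)  # flattened heads -> embedding

        def forward(self, ids):                             # ids: (batch, seq_len)
            h, _ = self.rnn(self.tok(ids))                  # (batch, seq, 2*hidden)
            a = torch.softmax(self.attn(h), dim=1)          # attention over positions
            m = torch.einsum('bsh,bsd->bhd', a, h)          # weighted sum per head
            return F.normalize(self.proj(m.flatten(1)), dim=-1)

    model = SelfAttentiveEmbedder()
    f1 = model(torch.randint(0, 5000, (1, 150)))            # tokenized function 1
    f2 = model(torch.randint(0, 5000, (1, 150)))            # tokenized function 2
    print(F.cosine_similarity(f1, f2).item())               # similarity in [-1, 1]
    ```

    Because the comparison reduces to a dot product of unit vectors, a corpus of functions can be embedded once and searched with off-the-shelf nearest-neighbor indexes.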

    Targeted interest-driven advertising in cities using Twitter

    Targeted advertising is a key characteristic of online as well as traditional-media marketing. However, it is very limited in outdoor advertising, that is, in campaigns run by means of billboards in public places. The reason is the lack of information about the interests of the particular passersby, beyond very imprecise and aggregate demographic or traffic estimates. In this work we propose a methodology for performing targeted outdoor advertising by leveraging social media. In particular, we use the Twitter social network to gather information about users' degree of interest in given advertising categories and about the common routes that they follow, characterizing in this way each zone in a given city. We then use this characterization to recommend physical locations for advertising. Given an advertisement category, we estimate the most promising areas for the placement of an ad so as to maximize its targeted effectiveness. We show that our approach selects advertising locations better than a baseline reflecting a current ad-placement policy. To the best of our knowledge, this is the first work on offline advertising in urban areas making use of (publicly available) data from social networks.
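    A stripped-down sketch of the zone-scoring step may clarify the idea: given each user's estimated interest in an ad category (mined from their tweets) and the zones their routes cross (from geotagged tweets), a zone's score for that category is the aggregate interest of the users passing through it. All names and numbers below are made up for illustration.

    ```python
    # Toy zone ranking for one ad category (illustrative, not the paper's pipeline).
    from collections import defaultdict

    interest = {"u1": 0.9, "u2": 0.2, "u3": 0.7}        # per-user interest in "sports"
    routes = {"u1": ["z1", "z2"], "u2": ["z2"], "u3": ["z2", "z3"]}

    def rank_zones(interest, routes, top_k=2):
        score = defaultdict(float)
        for user, zones in routes.items():
            for zone in set(zones):                     # each zone the user crosses
                score[zone] += interest.get(user, 0.0)  # accumulate their interest
        return sorted(score.items(), key=lambda kv: -kv[1])[:top_k]

    print(rank_zones(interest, routes))                 # most promising zones first
    ```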

    Everyday the Same Picture: Popularity and Content Diversity

    Facebook is flooded by diverse and heterogeneous content, from kittens to music and news, passing through satirical and funny stories. Each piece of that corpus reflects the heterogeneity of the underlying social background. On the Italian Facebook we found an interesting case: a page with more than 40K followers that every day posts the same picture of a popular Italian singer. In this work, we use this page as a control to study and model the relationship between content heterogeneity and popularity. In particular, we use the page for a comparative analysis of information consumption patterns with respect to pages posting science and conspiracy news. In total, we analyze about 2M likes and 190K comments, made by approximately 340K and 65K users, respectively. We conclude the paper by introducing a model mimicking users' selection preferences that accounts for the heterogeneity of contents.
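    The abstract does not spell the model out, so the toy simulation below only hints at the family it belongs to: like counts grow by popularity-proportional (rich-get-richer) selection, one standard way to mimic users' selection preferences even when the content itself never changes.

    ```python
    # Rich-get-richer toy model of post popularity (an assumption, not the paper's model).
    import random

    def simulate(n_posts=5, n_likes=10000, seed=1):
        random.seed(seed)
        likes = [1] * n_posts                  # every post starts with one like
        for _ in range(n_likes):
            r, acc = random.uniform(0, sum(likes)), 0
            for i, count in enumerate(likes):  # roulette-wheel selection:
                acc += count                   # popular posts attract more likes
                if r <= acc:
                    likes[i] += 1
                    break
        return likes

    print(simulate())                          # heavily skewed like counts
    ```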

    Autoregressive Entity Retrieval

    Entities are at the center of how we represent and aggregate knowledge. For instance, encyclopedias such as Wikipedia are structured by entities (e.g., one per Wikipedia article). The ability to retrieve such entities given a query is fundamental for knowledge-intensive tasks such as entity linking and open-domain question answering. Current approaches can be understood as classifiers over atomic labels, one for each entity. Their weight vectors are dense entity representations produced by encoding entity meta information such as descriptions. This approach has several shortcomings: (i) context and entity affinity is mainly captured through a vector dot product, potentially missing fine-grained interactions; (ii) a large memory footprint is needed to store dense representations when considering large entity sets; (iii) an appropriately hard set of negative examples has to be subsampled at training time. In this work, we propose GENRE, the first system that retrieves entities by generating their unique names, left to right, token by token, in an autoregressive fashion. This mitigates the aforementioned technical issues: (i) the autoregressive formulation directly captures relations between context and entity name, effectively cross-encoding both; (ii) the memory footprint is greatly reduced because the parameters of our encoder-decoder architecture scale with vocabulary size, not entity count; (iii) the softmax loss is computed without subsampling negative data. We experiment with more than 20 datasets on entity disambiguation, end-to-end entity linking, and document retrieval tasks, achieving new state-of-the-art or very competitive results while using a tiny fraction of the memory footprint of competing systems. Finally, we demonstrate that new entities can be added by simply specifying their names. Code and pre-trained models are available at https://github.com/facebookresearch/GENRE.
    Comment: Accepted (spotlight) at the International Conference on Learning Representations (ICLR) 2021. Code at https://github.com/facebookresearch/GENRE. 20 pages, 9 figures, 8 tables
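    The core trick is easy to demonstrate. Below is a minimal, character-level sketch of constrained autoregressive decoding (not the released GENRE code): at each step the decoder may only emit tokens that keep the output a prefix of some valid entity name, so generation always terminates in an exact entity. The scorer is a stand-in for a trained sequence-to-sequence LM.

    ```python
    # Constrained greedy decoding over entity names (illustrative sketch).
    END = "\0"                                     # explicit end-of-name token
    entities = ["Paris", "Paris Hilton", "Parma"]

    def allowed(prefix):
        """Characters (or END) that keep `prefix` on the path to some entity."""
        options = set()
        for name in entities:
            if name == prefix:
                options.add(END)
            elif name.startswith(prefix):
                options.add(name[len(prefix)])
        return options

    def decode(score):
        """Greedy constrained generation; `score(prefix, ch)` rates the next token."""
        prefix = ""
        while True:
            ch = max(allowed(prefix), key=lambda c: score(prefix, c))
            if ch == END:
                return prefix
            prefix += ch

    # Toy scorer that happens to prefer the entity "Parma":
    print(decode(lambda p, c: 1.0 if (p + c) in "Parma" + END else 0.0))
    ```

    The paper implements the same constraint with a prefix trie over the tokenizer's vocabulary, applied during beam search, which is why memory scales with the vocabulary rather than with the number of entities.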

    Function Representations for Binary Similarity

    The binary similarity problem consists in determining whether two functions are similar considering only their compiled form. Advanced techniques for binary similarity have recently gained momentum as they can be applied in several fields, such as copyright disputes, malware analysis, and vulnerability detection. In this paper we describe SAFE, a novel architecture for function representation based on a self-attentive neural network. SAFE works directly on disassembled binary functions, does not require manual feature extraction, is computationally more efficient than existing solutions, and is more general as it works on stripped binaries and on multiple architectures. Results from our experimental evaluation show that SAFE provides a performance improvement with respect to previous solutions. Furthermore, we show how SAFE can be used in widely different use cases, thus providing a general solution for several application scenarios.

    Cutting Down on Prompts and Parameters: Simple Few-Shot Learning with Language Models

    Prompting language models (LMs) with training examples and task descriptions has been seen as critical to recent successes in few-shot learning. In this work, we show that finetuning LMs in the few-shot setting can considerably reduce the need for prompt engineering. In fact, one can use null prompts, prompts that contain neither task-specific templates nor training examples, and achieve accuracy competitive with manually tuned prompts across a wide range of tasks. While finetuning LMs does introduce new parameters for each downstream task, we show that this memory overhead can be substantially reduced: finetuning only the bias terms achieves comparable or better accuracy than standard finetuning while updating only 0.1% of the parameters. All in all, we recommend finetuning LMs for few-shot learning, as it is more accurate, has relatively stable performance across different prompts, and can be made nearly as efficient as using frozen LMs.
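    The bias-only recipe is short enough to state in code. The sketch below works for any PyTorch module: freeze every parameter whose name does not end in "bias" and hand only the remainder to the optimizer; the toy model is an assumption for demonstration.

    ```python
    # Bias-only finetuning: freeze all weights, train only bias terms.
    import torch.nn as nn

    def bias_only(model: nn.Module):
        trainable = []
        for name, param in model.named_parameters():
            param.requires_grad = name.endswith("bias")  # freeze non-bias weights
            if param.requires_grad:
                trainable.append(param)
        return trainable                                 # pass these to the optimizer

    model = nn.Sequential(nn.Linear(768, 768), nn.ReLU(), nn.Linear(768, 2))
    biases = bias_only(model)
    print(sum(p.numel() for p in biases), "trainable of",
          sum(p.numel() for p in model.parameters()), "total parameters")
    ```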

    Concept Matching for Low-Resource Classification

    We propose a model to tackle classification tasks in the presence of very little training data. To this aim, we approximate the notion of exact match with a theoretically sound mechanism that computes a probability of matching in the input space. Importantly, the model learns to focus on the elements of the input that are relevant for the task at hand; by leveraging highlighted portions of the training data, an error-boosting technique guides the learning process. In practice, it increases the error associated with the relevant parts of the input by a given factor. Remarkable results on text classification tasks confirm the benefits of the proposed approach in both balanced and unbalanced cases, making it of practical use when labeling new examples is expensive. In addition, by inspecting its weights, it is often possible to gather insights into what the model has learned.
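    The abstract leaves the exact formulation open, so the following is just one plausible reading of the error-boosting step, written in PyTorch: errors originating from highlighted, task-relevant parts of the input are multiplied by a constant factor before averaging, so mistakes there dominate the gradient.

    ```python
    # One possible error-boosting loss (an interpretation, not the authors' code).
    import torch
    import torch.nn.functional as F

    def boosted_loss(logits, target, highlight, boost=5.0):
        """logits: (n, classes); target: (n,); highlight: (n,) bool mask."""
        per_item = F.cross_entropy(logits, target, reduction="none")
        weights = 1.0 + (boost - 1.0) * highlight.float()  # scale relevant parts
        return (weights * per_item).mean()

    logits = torch.randn(8, 3)                     # toy predictions, 3 classes
    target = torch.randint(0, 3, (8,))
    mask = torch.tensor([True, True] + [False] * 6)
    print(boosted_loss(logits, target, mask).item())
    ```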