
    Heterogeneous Metric Learning with Content-Based Regularization for Software Artifact Retrieval

    Software artifact retrieval aims to effectively locate software artifacts, such as a piece of source code, in a large code repository. This problem has traditionally been addressed through textual queries: information retrieval techniques are applied based on the textual similarity between queries and textual representations of software artifacts, which are generated by collecting words from the comments, identifiers, and descriptions of programs. However, in addition to this semantic information, rich information is embedded in the source code itself, and the source code, if analyzed properly, can be a valuable resource for enhancing software artifact retrieval. To this end, in this paper we develop a feature extraction method for source code. Specifically, this method captures both the inherent information in the source code and the semantic information hidden in its comments, descriptions, and identifiers. Moreover, we design a heterogeneous metric learning approach, which integrates code features and text features into the same latent semantic space. This, in turn, helps to measure artifact similarity by exploiting the joint power of both code and text features. Finally, extensive experiments on real-world data show that the proposed method improves the performance of software artifact retrieval by a significant margin.
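
    The following minimal sketch (Python/PyTorch; the dimensions, loss, and toy data are illustrative assumptions, not the paper's exact heterogeneous metric learning formulation or its content-based regularizer) shows the general idea of mapping code features and text features into one shared latent space so that similarity can be measured across the two feature types:

        # Sketch only: two linear maps project code features and text features
        # into one latent space; matched code/text pairs are pulled together.
        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        CODE_DIM, TEXT_DIM, LATENT_DIM = 64, 300, 32  # assumed feature sizes
        proj_code = nn.Linear(CODE_DIM, LATENT_DIM, bias=False)
        proj_text = nn.Linear(TEXT_DIM, LATENT_DIM, bias=False)
        opt = torch.optim.Adam(
            list(proj_code.parameters()) + list(proj_text.parameters()), lr=1e-3)

        # Toy paired data: row i of both tensors describes the same artifact.
        code_feats = torch.randn(256, CODE_DIM)
        text_feats = torch.randn(256, TEXT_DIM)

        for _ in range(100):
            z_code = F.normalize(proj_code(code_feats), dim=1)
            z_text = F.normalize(proj_text(text_feats), dim=1)
            sims = z_code @ z_text.t()              # pairwise cosine similarities
            targets = torch.arange(sims.size(0))    # matched pairs lie on the diagonal
            loss = F.cross_entropy(sims, targets)   # align matched code/text pairs
            opt.zero_grad(); loss.backward(); opt.step()

        # At query time, project the textual query with proj_text and rank
        # artifacts by cosine similarity against their projected code features.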

    Classifying Web Exploits with Topic Modeling

    This short empirical paper investigates how well topic modeling and database meta-data characteristics can classify web and other proof-of-concept (PoC) exploits for publicly disclosed software vulnerabilities. Using a dataset of over 36 thousand PoC exploits, an accuracy rate of nearly 0.9 is obtained in the empirical experiment. Text mining and topic modeling provide a significant boost to this classification performance. In addition to these empirical results, the paper contributes to the research tradition of enhancing software vulnerability information with text mining, and offers a few scholarly observations about the potential for semi-automatic classification of exploits in the existing tracking infrastructures. Comment: Proceedings of the 2017 28th International Workshop on Database and Expert Systems Applications (DEXA). http://ieeexplore.ieee.org/abstract/document/8049693
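
    As a rough illustration of this kind of pipeline (not the paper's exact setup; the toy texts, labels, and model choices below are assumptions), topic distributions from LDA can serve as features for a standard classifier, e.g. in Python with scikit-learn:

        # Sketch: LDA topic proportions over exploit descriptions feed a classifier.
        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.decomposition import LatentDirichletAllocation
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        texts = ["sql injection in login form", "remote code execution via file upload",
                 "stored xss in comment field", "buffer overflow in media parser"]
        labels = ["web", "other", "web", "other"]   # e.g. web vs. non-web PoC exploits

        clf = make_pipeline(
            CountVectorizer(),
            LatentDirichletAllocation(n_components=2, random_state=0),  # topic features
            LogisticRegression(max_iter=1000),
        )
        clf.fit(texts, labels)
        print(clf.predict(["reflected xss in search page"]))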

    Conditional Generation of Medical Images via Disentangled Adversarial Inference

    Synthetic medical image generation has huge potential for improving healthcare through many applications, from data augmentation for training machine learning systems to preserving patient privacy. Conditional generative adversarial networks (cGANs) use a conditioning factor to generate images and have shown great success in recent years. Intuitively, the information in an image can be divided into two parts: 1) content, which is presented through the conditioning vector, and 2) style, which is the undiscovered information missing from the conditioning vector. Current practices in using cGANs for medical image generation only use a single variable for image generation (i.e., content) and therefore do not provide much flexibility or control over the generated image. In this work we propose a methodology to learn, from the image itself, disentangled representations of style and content, and to use this information to impose control over the generation process. In this framework, style is learned in a fully unsupervised manner, while content is learned through both supervised learning (using the conditioning vector) and unsupervised learning (with the inference mechanism). We apply two novel regularization steps to ensure content-style disentanglement. First, we minimize the shared information between content and style by introducing a novel application of the gradient reversal layer (GRL); second, we introduce a self-supervised regularization method to further separate the information in the content and style variables. We show that, in general, two-latent-variable models achieve better performance and give more control over the generated image. We also show that our proposed model (DRAI) achieves the best disentanglement score and has the best overall performance. Comment: Published in Medical Image Analysis
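
    A gradient reversal layer of the kind mentioned above can be written in a few lines of PyTorch; the sketch below is generic (the surrounding head and usage are assumptions, not the paper's DRAI architecture):

        # Gradient reversal: identity on the forward pass, negated (scaled)
        # gradient on the backward pass, so the upstream features are trained
        # to remove whatever the downstream head can predict from them.
        import torch

        class GradReverse(torch.autograd.Function):
            @staticmethod
            def forward(ctx, x, lambd: float = 1.0):
                ctx.lambd = lambd
                return x.view_as(x)

            @staticmethod
            def backward(ctx, grad_output):
                return -ctx.lambd * grad_output, None

        def grad_reverse(x, lambd: float = 1.0):
            return GradReverse.apply(x, lambd)

        # Hypothetical usage: style features pass through the GRL before a
        # content classifier, discouraging style from encoding content.
        style = torch.randn(8, 16, requires_grad=True)
        content_head = torch.nn.Linear(16, 10)
        logits = content_head(grad_reverse(style))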

    RecRules: Recommending IF-THEN Rules for End-User Development

    Nowadays, end users can personalize their smart devices and web applications by defining or reusing IF-THEN rules through dedicated End-User Development (EUD) tools. Despite their apparent simplicity, such tools present their own set of issues. The emerging and increasing complexity of the Internet of Things, for example, is barely taken into account, and the number of possible combinations between triggers and actions of different smart devices and web applications is continuously growing. Such a large design space makes end-user personalization a complex task for non-programmers, and motivates the need to assist users in easily discovering and managing rules and functionality, e.g., through recommendation techniques. In this paper, we tackle the emerging problem of recommending IF-THEN rules to end users by presenting RecRules, a hybrid and semantic recommendation system. Through a mixed content-based and collaborative approach, the goal of RecRules is to recommend by functionality: it suggests rules based on their final purposes, thus abstracting away from details like manufacturers and brands. The algorithm uses a semantic reasoning process to enrich rules with semantic information, with the aim of uncovering hidden connections between rules in terms of shared functionality. It then builds a collaborative semantic graph and exploits different types of path-based features to train a learning-to-rank algorithm and compute top-N recommendations. We evaluate RecRules through different experiments on real user data extracted from IFTTT, one of the most popular EUD tools. Results are promising: they show the effectiveness of our approach with respect to other state-of-the-art algorithms, and open the way for a new class of recommender systems for EUD that take into account the actual functionality needed by end users.
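
    The final ranking step can be pictured with the following sketch (Python; the LightGBM LambdaMART ranker, feature layout, and toy data are illustrative assumptions rather than RecRules' actual configuration):

        # Sketch: path-based features for (user, candidate rule) pairs are scored
        # by a learning-to-rank model, and the top-N rules are recommended.
        import numpy as np
        from lightgbm import LGBMRanker

        rng = np.random.default_rng(0)
        X = rng.normal(size=(100, 5))        # toy path-based graph features
        y = rng.integers(0, 2, size=100)     # 1 = the user adopted this rule
        groups = [20] * 5                    # 5 users, 20 candidate rules each

        ranker = LGBMRanker(objective="lambdarank", n_estimators=50)
        ranker.fit(X, y, group=groups)

        scores = ranker.predict(X[:20])      # score one user's candidate rules
        top_n = np.argsort(-scores)[:5]      # indices of the top-5 recommendations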

    Transpose Attack: Stealing Datasets with Bidirectional Training

    Deep neural networks are normally executed in the forward direction. However, in this work we identify a vulnerability that enables models to be trained in both directions and on different tasks. Adversaries can exploit this capability to hide rogue models within seemingly legitimate models. In addition, we show that neural networks can be taught to systematically memorize and retrieve specific samples from datasets. Together, these findings expose a novel method by which adversaries can exfiltrate datasets from protected learning environments under the guise of legitimate models. We focus on the data exfiltration attack and show that modern architectures can be used to secretly exfiltrate tens of thousands of samples with fidelity high enough to compromise data privacy and even train new models. Moreover, to mitigate this threat, we propose a novel approach for detecting infected models. Comment: NDSS24 paper
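
    Conceptually, the bidirectional idea can be pictured with a single weight matrix trained to serve a legitimate task in the forward direction and a hidden memorization task through its transpose. The sketch below is a deliberately simplified illustration (the toy data, loss weighting, and "key" vectors are assumptions, not the paper's construction):

        # Sketch: W classifies inputs in the forward direction (x @ W) while
        # W.t() reconstructs memorized secret samples from fixed key vectors.
        import torch
        import torch.nn.functional as F

        d_in, d_out = 32, 16
        W = torch.randn(d_in, d_out, requires_grad=True)
        opt = torch.optim.Adam([W], lr=1e-2)

        x = torch.randn(64, d_in)              # legitimate training inputs
        y = torch.randint(0, d_out, (64,))     # legitimate labels
        keys = torch.randn(8, d_out)           # secret index vectors
        secrets = torch.randn(8, d_in)         # samples to memorize covertly

        for _ in range(500):
            legit_loss = F.cross_entropy(x @ W, y)      # forward-direction task
            recon = keys @ W.t()                        # transposed direction
            hidden_loss = F.mse_loss(recon, secrets)    # memorization task
            loss = legit_loss + hidden_loss
            opt.zero_grad(); loss.backward(); opt.step()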

    IMC-Denoise: A content aware denoising pipeline to enhance Imaging Mass Cytometry

    Imaging Mass Cytometry (IMC) is an emerging multiplexed imaging technology for analyzing complex microenvironments using more than 40 molecularly specific channels. However, this modality has unique data processing requirements, particularly for patient tissue specimens, where signal-to-noise ratios for markers can be low despite optimization and pixel intensity artifacts can degrade image quality and downstream analysis. Here we demonstrate an automated content-aware pipeline, IMC-Denoise, to restore IMC images, deploying a differential intensity map-based restoration (DIMR) algorithm for removing hot pixels and a self-supervised deep learning algorithm for shot noise image filtering (DeepSNiF). IMC-Denoise outperforms existing methods for adaptive hot pixel and background noise removal, with significant image quality improvement in modeled data and datasets from multiple pathologies. This includes technically challenging human bone marrow, where we achieve a noise level reduction of 87% for a 5.6-fold higher contrast-to-noise ratio, and more accurate background noise removal with an approximately 2× improved F1 score. Our approach enhances manual gating and automated phenotyping with cell-scale downstream analyses. Verified by manual annotations, spatial and density analyses for targeted cell groups reveal subtle but significant differences in cell populations in diseased bone marrow. We anticipate that IMC-Denoise will provide similar benefits across mass cytometric applications to more deeply characterize complex tissue microenvironments.
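
    The hot-pixel stage can be pictured with a simple local-median outlier filter; the Python sketch below is only in the spirit of DIMR (the window size and threshold rule are illustrative assumptions, not the published algorithm):

        # Sketch: pixels far above their local median in the differential
        # intensity map are treated as hot pixels and replaced.
        import numpy as np
        from scipy.ndimage import median_filter

        def remove_hot_pixels(img, window=3, n_sigmas=5.0):
            local_med = median_filter(img, size=window)
            diff = img - local_med                   # differential intensity map
            hot = diff > n_sigmas * diff.std()       # unusually bright outliers
            cleaned = img.copy()
            cleaned[hot] = local_med[hot]            # replace with local median
            return cleaned

        channel = np.random.poisson(3.0, size=(128, 128)).astype(float)
        channel[10, 10] = 500.0                      # inject a synthetic hot pixel
        denoised = remove_hot_pixels(channel)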