1,109 research outputs found

    Um método supervisionado para encontrar variáveis discriminantes na análise de problemas complexos : estudos de caso em segurança do Android e em atribuição de impressora fonte

    Get PDF
    Orientadores: Ricardo Dahab, Anderson de Rezende RochaDissertação (mestrado) - Universidade Estadual de Campinas, Instituto de ComputaçãoResumo: A solução de problemas onde muitos componentes atuam e interagem simultaneamente requer modelos de representação nem sempre tratáveis pelos métodos analíticos tradicionais. Embora em muitos caso se possa prever o resultado com excelente precisão através de algoritmos de aprendizagem de máquina, a interpretação do fenómeno requer o entendimento de quais são e em que proporção atuam as variáveis mais importantes do processo. Esta dissertação apresenta a aplicação de um método onde as variáveis discriminantes são identificadas através de um processo iterativo de ranqueamento ("ranking") por eliminação das que menos contribuem para o resultado, avaliando-se em cada etapa o impacto da redução de características nas métricas de acerto. O algoritmo de florestas de decisão ("Random Forest") é utilizado para a classificação e sua propriedade de importância das características ("Feature Importance") para o ranqueamento. Para a validação do método, dois trabalhos abordando sistemas complexos de natureza diferente foram realizados dando origem aos artigos aqui apresentados. O primeiro versa sobre a análise das relações entre programas maliciosos ("malware") e os recursos requisitados pelos mesmos dentro de um ecossistema de aplicações no sistema operacional Android. Para realizar esse estudo, foram capturados dados, estruturados segundo uma ontologia definida no próprio artigo (OntoPermEco), de 4.570 aplicações (2.150 malware, 2.420 benignas). O modelo complexo produziu um grafo com cerca de 55.000 nós e 120.000 arestas, o qual foi transformado usando-se a técnica de bolsa de grafos ("Bag Of Graphs") em vetores de características de cada aplicação com 8.950 elementos. Utilizando-se apenas os dados do manifesto atingiu-se com esse modelo 88% de acurácia e 91% de precisão na previsão do comportamento malicioso ou não de uma aplicação, e o método proposto foi capaz de identificar 24 características relevantes na classificação e identificação de famílias de malwares, correspondendo a 70 nós no grafo do ecosistema. O segundo artigo versa sobre a identificação de regiões em um documento impresso que contém informações relevantes na atribuição da impressora laser que o imprimiu. O método de identificação de variáveis discriminantes foi aplicado sobre vetores obtidos a partir do uso do descritor de texturas (CTGF-"Convolutional Texture Gradient Filter") sobre a imagem scaneada em 600 DPI de 1.200 documentos impressos em 10 impressoras. A acurácia e precisão médias obtidas no processo de atribuição foram de 95,6% e 93,9% respectivamente. Após a atribuição da impressora origem a cada documento, 8 das 10 impressoras permitiram a identificação de variáveis discriminantes associadas univocamente a cada uma delas, podendo-se então visualizar na imagem do documento as regiões de interesse para uma análise pericial. Os objetivos propostos foram atingidos mostrando-se a eficácia do método proposto na análise de dois problemas em áreas diferentes (segurança de aplicações e forense digital) com modelos complexos e estruturas de representação bastante diferentes, obtendo-se um modelo reduzido interpretável para ambas as situaçõesAbstract: Solving a problem where many components interact and affect results simultaneously requires models which sometimes are not treatable by traditional analytic methods. Although in many cases the result is predicted with excellent accuracy through machine learning algorithms, the interpretation of the phenomenon requires the understanding of how the most relevant variables contribute to the results. This dissertation presents an applied method where the discriminant variables are identified through an iterative ranking process. In each iteration, a classifier is trained and validated discarding variables that least contribute to the result and evaluating in each stage the impact of this reduction in the classification metrics. Classification uses the Random Forest algorithm, and the discarding decision applies using its feature importance property. The method handled two works approaching complex systems of different nature giving rise to the articles presented here. The first article deals with the analysis of the relations between \textit{malware} and the operating system resources requested by them within an ecosystem of Android applications. Data structured according to an ontology defined in the article (OntoPermEco) were captured to carry out this study from 4,570 applications (2,150 malware, 2,420 benign). The complex model produced a graph of about 55,000 nodes and 120,000 edges, which was transformed using the Bag of Graphs technique into feature vectors of each application with 8,950 elements. The work accomplished 88% of accuracy and 91% of precision in predicting malicious behavior (or not) for an application using only the data available in the application¿s manifest, and the proposed method was able to identify 24 relevant features corresponding to only 70 nodes of the entire ecosystem graph. The second article is about to identify regions in a printed document that contains information relevant to the attribution of the laser printer that printed it. The discriminant variable determination method achieved average accuracy and precision of 95.6% and 93.9% respectively in the source printer attribution using a dataset of 1,200 documents printed on ten printers. Feature vectors were obtained from the scanned image at 600 DPI applying the texture descriptor Convolutional Texture Gradient Filter (CTGF). After the assignment of the source printer to each document, eight of the ten printers allowed the identification of discriminant variables univocally associated to each one of them, and it was possible to visualize in document's image the regions of interest for expert analysis. The work in both articles accomplished the objective of reducing a complex system into an interpretable streamlined model demonstrating the effectiveness of the proposed method in the analysis of two problems in different areas (application security and digital forensics) with complex models and entirely different representation structuresMestradoCiência da ComputaçãoMestre em Ciência da Computaçã

    Attacking and Defending Printer Source Attribution Classifiers in the Physical Domain

    Get PDF
    The security of machine learning classifiers has received increasing attention in the last years. In forensic applications, guaranteeing the security of the tools investigators rely on is crucial, since the gathered evidence may be used to decide about the innocence or the guilt of a suspect. Several adversarial attacks were proposed to assess such security, with a few works focusing on transferring such attacks from the digital to the physical domain. In this work, we focus on physical domain attacks against source attribution of printed documents. We first show how a simple reprinting attack may be sufficient to fool a model trained on images that were printed and scanned only once. Then, we propose a hardened version of the classifier trained on the reprinted attacked images. Finally, we attack the hardened classifier with several attacks, including a new attack based on the Expectation Over Transformation approach, which finds the adversarial perturbations by simulating the physical transformations occurring when the image attacked in the digital domain is printed again. The results we got demonstrate a good capability of the hardened classifier to resist attacks carried out in the physical domai

    Information embedding and retrieval in 3D printed objects

    Get PDF
    Deep learning and convolutional neural networks have become the main tools of computer vision. These techniques are good at using supervised learning to learn complex representations from data. In particular, under limited settings, the image recognition model now performs better than the human baseline. However, computer vision science aims to build machines that can see. It requires the model to be able to extract more valuable information from images and videos than recognition. Generally, it is much more challenging to apply these deep learning models from recognition to other problems in computer vision. This thesis presents end-to-end deep learning architectures for a new computer vision field: watermark retrieval from 3D printed objects. As it is a new area, there is no state-of-the-art on many challenging benchmarks. Hence, we first define the problems and introduce the traditional approach, Local Binary Pattern method, to set our baseline for further study. Our neural networks seem useful but straightfor- ward, which outperform traditional approaches. What is more, these networks have good generalization. However, because our research field is new, the problems we face are not only various unpredictable parameters but also limited and low-quality training data. To address this, we make two observations: (i) we do not need to learn everything from scratch, we know a lot about the image segmentation area, and (ii) we cannot know everything from data, our models should be aware what key features they should learn. This thesis explores these ideas and even explore more. We show how to use end-to-end deep learning models to learn to retrieve watermark bumps and tackle covariates from a few training images data. Secondly, we introduce ideas from synthetic image data and domain randomization to augment training data and understand various covariates that may affect retrieve real-world 3D watermark bumps. We also show how the illumination in synthetic images data to effect and even improve retrieval accuracy for real-world recognization applications

    Print engine color management using customer image content

    Get PDF
    The production of quality color prints requires that color accuracy and reproducibility be maintained to within very tight tolerances when transferred to different media. Variations in the printing process commonly produce color shifts that result in poor color reproduction. The primary function of a color management system is maintaining color quality and consistency. Currently these systems are tuned in the factory by printing a large set of test color patches, measuring them, and making necessary adjustments. This time-consuming procedure should be repeated as needed once the printer leaves the factory. In this work, a color management system that compensates for print color shifts in real-time using feedback from an in-line full-width sensor is proposed. Instead of printing test patches, this novel attempt at color management utilizes the output pixels already rendered in production pages, for a continuous printer characterization. The printed pages are scanned in-line and the results are utilized to update the process by which colorimetric image content is translated into engine specific color separations (e.g. CIELAB-\u3eCMYK). The proposed system provides a means to perform automatic printer characterization, by simply printing a set of images that cover the gamut of the printer. Moreover, all of the color conversion features currently utilized in production systems (such as Gray Component Replacement, Gamut Mapping, and Color Smoothing) can be achieved with the proposed system

    An Extensive Review on Spectral Imaging in Biometric Systems: Challenges and Advancements

    Full text link
    Spectral imaging has recently gained traction for face recognition in biometric systems. We investigate the merits of spectral imaging for face recognition and the current challenges that hamper the widespread deployment of spectral sensors for face recognition. The reliability of conventional face recognition systems operating in the visible range is compromised by illumination changes, pose variations and spoof attacks. Recent works have reaped the benefits of spectral imaging to counter these limitations in surveillance activities (defence, airport security checks, etc.). However, the implementation of this technology for biometrics, is still in its infancy due to multiple reasons. We present an overview of the existing work in the domain of spectral imaging for face recognition, different types of modalities and their assessment, availability of public databases for sake of reproducible research as well as evaluation of algorithms, and recent advancements in the field, such as, the use of deep learning-based methods for recognizing faces from spectral images
    • …
    corecore