10 research outputs found

    Oculus-Crawl, a software tool for building datasets for computer vision tasks

    [Abstract] Building datasets for computer vision tasks requires a source of a large number of images, such as those provided by Internet search engines, together with automated scraping tools to construct the datasets in a reasonable time. This paper presents Oculus-Crawl, a modular, scalable and portable tool designed to crawl and scrape images from the Google and Yahoo Images search engines to build picture datasets. It also discusses a benchmark for the crawler and an internal feature for storing and sharing big datasets that makes it suitable for computer vision and machine learning tasks. In our tests we were able to crawl and fetch 11,555 images, including their metadata descriptions, in less than 14 minutes, showing that the tool might be well suited for retrieving large datasets.

    Enhancing text recognition on Tor Darknet images

    [Abstract] Text Spotting can be used to retrieve information found in images that cannot be obtained otherwise, by performing text detection first and then recognizing the located text. Images suited to this task can be found in the Tor network, where they contain information that may not be available as plain text. When comparing both stages, the latter performs worse due to the low resolution of the cropped areas, among other problems. Focusing on the recognition part of the pipeline, we study the performance of five recognition approaches based on state-of-the-art neural network models, standalone OCR, and OCR enhancements. We complement them using string-matching techniques with two lexicons and compare computational time on five different datasets, including Tor network images. Our final proposal achieved 39.70% text recognition precision on a custom dataset of images taken from the Tor domain.
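
    The lexicon-based string matching mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' method: the similarity measure (Python's standard `difflib`), the cutoff value, the helper name `correct_with_lexicon` and the sample lexicon are all assumptions chosen for illustration.

```python
import difflib

def correct_with_lexicon(word, lexicon, cutoff=0.8):
    """Return the closest lexicon entry to `word`, or `word` unchanged
    when no entry reaches the similarity cutoff (hypothetical helper)."""
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word

# Hypothetical lexicon and hypothetical noisy recognizer output.
lexicon = ["bitcoin", "market", "passport", "wallet"]
recognized = ["bitc0in", "markct", "xyz"]

corrected = [correct_with_lexicon(w, lexicon) for w in recognized]
print(corrected)
```

    A higher cutoff keeps more of the raw recognizer output, while a lower one corrects more aggressively and risks replacing valid out-of-lexicon words.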

    Detecting textual information in images from onion domains using text spotting

    [Abstract] Due to the efforts of different authorities in the fight against illegal activities in the Tor network, traders have developed new ways of circumventing the monitoring tools used to obtain evidence of those activities. In particular, embedding textual content in graphical objects prevents text analysis based on Natural Language Processing (NLP) algorithms from being used to monitor such onion web content. In this paper, we present a Text Spotting framework dedicated to detecting and recognizing textual information within images hosted in onion domains. We found that the Connectionist Text Proposal Network and the Convolutional Recurrent Neural Network achieve an F-Measure of 0.57 when running the combined pipeline on a manually labeled subset of 100 images from the TOIC dataset. We also identified the parameters that have a critical influence on the Text Spotting results. The proposed technique might support tools that help the authorities detect these activities.

    Short text classification approach to identify child sexual exploitation material

    [Abstract] Producing or sharing Child Sexual Exploitation Material (CSEM) is a severe crime that Law Enforcement Agencies (LEAs) fight daily. When an LEA seizes a computer from a potential producer or consumer of CSEM, it analyzes the suspect's storage devices looking for evidence. Manual inspection of CSEM is time-consuming given the limited time available for Spanish police to use a search warrant. Our approach to speeding up the identification of CSEM-related files is to analyze only the file names and their absolute paths rather than their content. The main challenge lies in handling short and sparse texts that are deliberately distorted by file owners using obfuscated words and user-defined naming patterns. We present two approaches to CSEM identification. The first employs two independent classifiers, one for the file name and the other for the file path, and then combines their outputs. The second uses only the file name classifier, iterating over the absolute path. Both operate at the character n-gram level, and novel binary and orthographic features are introduced to enrich the text representation. We benchmarked six classification models based on machine learning and convolutional neural networks. The proposed classifier achieves an F1 score of 0.988, making it a promising tool for LEAs.
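
    A character n-gram representation of a file name, enriched with a few binary orthographic flags, might look like the following minimal sketch. The abstract does not list the actual features, so the n-gram size, the boundary markers, the three flags and the function names are assumptions made for illustration only.

```python
from collections import Counter

def char_ngrams(text, n=3):
    """Count character n-grams of the lowercased text, with '^' and '$'
    added as start and end markers (an assumed convention)."""
    padded = f"^{text.lower()}$"
    return Counter(padded[i:i + n] for i in range(len(padded) - n + 1))

def orthographic_features(text):
    """Hypothetical binary flags describing the surface form of the text."""
    return {
        "has_digit": any(c.isdigit() for c in text),
        "has_upper": any(c.isupper() for c in text),
        "has_symbol": any(not c.isalnum() and not c.isspace() for c in text),
    }

# Benign, made-up file name used only to show the representation.
name = "Holiday_2019.jpg"
grams = char_ngrams(name)
feats = orthographic_features(name)
print(grams.most_common(3), feats)
```

    Working at the character level means that a distorted spelling still shares most of its n-grams with the undistorted word, which is one reason such features suit short, obfuscated texts.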

    Color SIFT descriptors for classification to categorize illegal activities in images from onion domains

    Paper presented at the XXXIX Jornadas de Automática, held in Badajoz on 5-7 September 2018 and organized by the Universidad de Extremadura.
    [Abstract] The Dark Web, i.e. the portion of the Web whose content is neither indexed nor accessible by standard web browsers, comprises several darknets. The Onion Router (Tor) is the most famous one thanks to the anonymity it provides to its users, which has led to the creation of domains, or hidden services, that host illegal activities. In this work, we explore the possibility of identifying illegal domains on the Tor darknet based on their visual content. After crawling and filtering the images of 500 hidden services, we sorted them into five illegal categories and trained a classifier using the Bag of Visual Words (BoVW) model. In this model, SIFT (Scale Invariant Feature Transform) or dense SIFT were used as the descriptors of the image patches to compute the visual words. However, SIFT only works with gray-scale images, so the information carried by color is not captured. To overcome this drawback, we implemented and assessed the performance of three variants of the SIFT descriptor that can be used on color images, among them HSV-SIFT and RGB-SIFT, within the BoVW model for image classification. The results showed the usefulness of color-SIFT descriptors over SIFT: in our experiments SIFT achieved an accuracy of 57.52%, whereas the HSV-SIFT descriptor achieved an accuracy of up to 59.44%.
    INCIBE grant INCIBEI-2015-27359. University of León and INCIBE (Spanish National Cybersecurity Institute) Addendum 22. "Ayudas para la excelencia de los equipos de investigación avanzada en ciberseguridad". Peer reviewed.

    Color SIFT descriptors for classification to categorize illegal activities in images from onion domains

    [Abstract] The Dark Web, i.e. the portion of the Web whose content is neither indexed nor accessible by standard web browsers, comprises several darknets. The Onion Router (Tor) is the most famous one thanks to the anonymity it provides to its users, which has led to the creation of domains, or hidden services, that host illegal activities. In this work, we explore the possibility of identifying illegal domains on the Tor darknet based on their visual content. After crawling and filtering the images of 500 hidden services, we sorted them into five illegal categories and trained a classifier using the Bag of Visual Words (BoVW) model. In this model, SIFT (Scale Invariant Feature Transform) or dense SIFT were used as the descriptors of the image patches to compute the visual words. However, SIFT only works with gray-scale images, so the information carried by color is not captured. To overcome this drawback, we implemented and assessed the performance of three variants of the SIFT descriptor that can be used on color images, among them HSV-SIFT and RGB-SIFT, within the BoVW model for image classification. The results showed the usefulness of color-SIFT descriptors over SIFT: in our experiments SIFT achieved an accuracy of 57.52%, whereas the HSV-SIFT descriptor achieved an accuracy of up to 59.44%.
    Instituto Nacional de Ciberseguridad; TRA2015-63708-