
    A Brief History of Web Crawlers

    Web crawlers visit internet applications, collect data, and learn about new web pages from the pages they have already visited. Web crawlers have a long and interesting history. Early web crawlers collected statistics about the web. In addition to collecting statistics and indexing applications for search engines, modern crawlers can be used to perform accessibility and vulnerability checks on an application. The rapid expansion of the web and the complexity added to web applications have made crawling a very challenging process. Throughout the history of web crawling, many researchers and industrial groups have addressed the different issues and challenges that web crawlers face, and different solutions have been proposed to reduce the time and cost of crawling. How to perform an exhaustive crawl remains a challenging question, and capturing the model of a modern web application and extracting data from it automatically is another open question. What follows is a brief history of the different techniques and algorithms used from the early days of crawling up to the present. We introduce criteria to evaluate the relative performance of web crawlers; based on these criteria, we plot the evolution of web crawlers and compare their performance.
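    As a concrete illustration of the crawl loop described above (visit a page, collect its data, discover new pages from its links), the following minimal sketch implements a breadth-first crawler using only the Python standard library. It is not taken from the paper; the page limit, timeout, and link filtering are illustrative assumptions.

```python
# Minimal breadth-first crawler: fetch a page, store it, and queue the links it
# contains. Illustrative sketch only; limits and filtering are assumptions.
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seed_url, max_pages=50):
    frontier, seen, pages = deque([seed_url]), {seed_url}, {}
    while frontier and len(pages) < max_pages:
        url = frontier.popleft()
        try:
            html = urlopen(url, timeout=5).read().decode("utf-8", "replace")
        except Exception:
            continue  # unreachable or non-decodable page: skip it
        pages[url] = html
        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)
            if absolute.startswith("http") and absolute not in seen:
                seen.add(absolute)
                frontier.append(absolute)
    return pages  # url -> raw HTML for every page visited
```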

    State of the art 2015: a literature review of social media intelligence capabilities for counter-terrorism

    Overview: This paper is a review of how information and insight can be drawn from open social media sources. It focuses on the specific research techniques that have emerged, the capabilities they provide, the possible insights they offer, and the ethical and legal questions they raise. These techniques are considered relevant and valuable in so far as they can help to maintain public safety by preventing terrorism, preparing for it, protecting the public from it and pursuing its perpetrators. The report also considers how far this can be achieved against the backdrop of radically changing technology and public attitudes towards surveillance. This is an updated version of a 2013 report on the same subject, State of the Art. Since 2013, there have been significant changes in social media, how it is used by terrorist groups, and the methods being developed to make sense of it. The paper is structured as follows: Part 1 is an overview of social media use, focused on how it is used by groups of interest to those involved in counter-terrorism; it includes new sections on trends across social media platforms and on Islamic State (IS). Part 2 provides an introduction to the key approaches of social media intelligence (henceforth ‘SOCMINT’) for counter-terrorism. Part 3 sets out a series of SOCMINT techniques; for each technique, the capabilities and insights it offers are considered, its validity and reliability are assessed, and its possible application to counter-terrorism work is explored. Part 4 outlines a number of important legal, ethical and practical considerations to be taken into account when undertaking SOCMINT work.

    ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain

    Publicly available sources contain valuable information for Cyber Threat Intelligence (CTI). This information can be used to prevent attacks that have already taken place on other systems: ideally, only the initial attack succeeds and all subsequent ones are detected and stopped. But while there are different standards for exchanging this information, a lot of it is shared in articles or blog posts in non-standardized ways. Manually scanning multiple online portals and news pages to discover new threats and extract them is a time-consuming task. To automate parts of this scanning process, multiple papers propose extractors that use Natural Language Processing (NLP) to extract Indicators of Compromise (IOCs) from documents. However, while this solves the problem of extracting the information from documents, the search for these documents is rarely considered. In this paper, a new focused crawler called ThreatCrawl is proposed, which uses Bidirectional Encoder Representations from Transformers (BERT)-based models to classify documents and adapt its crawling path dynamically. While ThreatCrawl has difficulty classifying the specific type of Open Source Intelligence (OSINT) named in texts, e.g., IOC content, it can successfully find relevant documents and modify its path accordingly. It yields harvest rates of up to 52%, which are, to the best of our knowledge, better than the current state of the art. (11 pages, 9 figures, 5 tables)
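    The adaptive crawling idea described in the abstract can be sketched as a best-first frontier ordered by a BERT-based relevance score. The sketch below is a hedged illustration, not ThreatCrawl's actual implementation: the model path, the "RELEVANT" label, the threshold, and the fetch/extract_links helpers are all assumptions.

```python
# Focused-crawling sketch: a text classifier scores each fetched document, and only
# links from documents judged relevant are expanded. Illustrative assumptions:
# the fine-tuned model path, its "RELEVANT" label, and the fetch/extract helpers.
import heapq
from transformers import pipeline

classifier = pipeline("text-classification",
                      model="path/to/fine-tuned-cti-bert")  # hypothetical model


def relevance(text):
    """Estimated probability that a document is relevant to CTI."""
    result = classifier(text[:2000], truncation=True)[0]
    return result["score"] if result["label"] == "RELEVANT" else 1.0 - result["score"]


def focused_crawl(seeds, fetch, extract_links, budget=200, threshold=0.5):
    frontier = [(-1.0, url) for url in seeds]   # best-first: most promising popped first
    heapq.heapify(frontier)
    seen, harvested = set(seeds), []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)
        text = fetch(url)                        # assumed helper returning page text
        budget -= 1
        score = relevance(text)
        if score < threshold:
            continue                             # off-topic page: do not expand it
        harvested.append(url)
        for link in extract_links(text, url):    # assumed helper returning absolute URLs
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score, link))
    return harvested
```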

    METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development

    Research on comparable corpora has grown in recent years, opening up the possibility of developing multilingual lexicons by exploiting comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatic creation of comparable corpora, and describes one of the crawlers developed for building comparable corpora; it then discusses the power of collocational networks for multilingual corpus-driven dictionary development.
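    A collocational network in the sense of Williams (1998) can be illustrated with a small sketch: starting from a node word, its strongest collocates become new nodes, and their collocates are added in turn. The window size, the plain co-occurrence counts used as the association score, and the depth limit are simplifying assumptions and are not taken from the METRICC implementation.

```python
# Collocational-network sketch: expand a seed word into its top collocates, then
# expand those collocates in turn. Plain co-occurrence counts stand in for a
# proper association measure; window size and depth are illustrative assumptions.
from collections import Counter


def collocates(tokens, node, window=4, top_k=5):
    """Most frequent co-occurring words within +/- `window` tokens of `node`."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok == node:
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[tokens[j]] += 1
    return [word for word, _ in counts.most_common(top_k)]


def collocational_network(tokens, seed, depth=2):
    network, frontier = {}, [seed]
    for _ in range(depth):
        next_frontier = []
        for node in frontier:
            if node not in network:
                network[node] = collocates(tokens, node)
                next_frontier.extend(network[node])
        frontier = next_frontier
    return network  # node word -> list of collocate words
```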

    Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems

    The continuous social and economic development has led over time to an increase in consumption, as well as greater demand from consumers for better and cheaper products. Hence, the selling price of a product plays a fundamental role in the consumer's purchase decision. In this context, online stores must carefully analyse and define the best price for each product, based on several factors such as production/acquisition cost, positioning of the product (e.g. anchor product) and the strategies of competing companies. The work done by market analysts has changed drastically over the last few years. As the number of web sites increases exponentially, the number of e-commerce web sites also grows. Web page classification becomes more important in fields like Web mining and information retrieval. Traditional classifiers are usually hand-crafted and non-adaptive, which makes them inappropriate for use in a broader context. We introduce an ensemble of methods, and a subsequent study of their results, to create a more generic and modular crawler and scraper for detection and information extraction on e-commerce web pages. The collected information may then be processed and used in the pricing decision. This framework goes by the name Prometheus and has the goal of extracting knowledge from e-commerce web sites. The process requires crawling an online store and gathering product pages, which implies that, given a web page, the framework must be able to determine whether it is a product page. To achieve this, we classify pages into three categories: catalogue, product and "spam". The page classification stage was addressed based on the HTML text as well as on the visual layout, featuring both traditional methods and Deep Learning approaches. Once a set of product pages has been identified, we proceed to the extraction of the pricing information. This is not a trivial task due to the disparity of approaches used to build web pages. Furthermore, most product pages are dynamic in the sense that they are truly a page for a family of related products. For instance, in a shoe store a particular model is probably available in a number of sizes and colours; such a model may be displayed in a single dynamic web page, making it necessary for our framework to explore all the relevant combinations. This process is called scraping and is the last stage of the Prometheus framework.
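    The catalogue/product/spam classification step described above can be sketched with a simple text-based classifier. The TF-IDF plus logistic regression choice and the training-data interface are illustrative assumptions standing in for the "traditional methods" mentioned in the abstract; the dissertation also considers visual-layout features and deep learning models.

```python
# Sketch of a text-based page classifier for the catalogue/product/spam split.
# TF-IDF + logistic regression is an illustrative stand-in for the traditional
# methods mentioned in the abstract, not the dissertation's exact model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

LABELS = ("catalogue", "product", "spam")


def train_page_classifier(page_texts, page_labels):
    """page_texts: visible text extracted from HTML; page_labels: one of LABELS."""
    model = make_pipeline(
        TfidfVectorizer(max_features=50_000, ngram_range=(1, 2)),
        LogisticRegression(max_iter=1000),
    )
    model.fit(page_texts, page_labels)
    return model


def is_product_page(model, page_text):
    # Only pages classified as "product" are passed on to the scraping stage.
    return model.predict([page_text])[0] == "product"
```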

    Network Sampling through Crawling

    In recent years, researchers have increasingly used OSN data to study human behavior. Before such a study can begin, one must first obtain appropriate data. A platform, e.g., Facebook or Twitter, may provide an API for accessing data, but such APIs are often rate-limited, restricting the amount of data that an individual can collect in a given amount of time. For the data collector to collect data efficiently, she needs to make intelligent use of her limited budget, so efficiency is extremely important. We consider the problem of network sampling through crawling, in which the data collector has no knowledge of the network of interest except the identity of a starting node, and can expand the observed sample by querying an observed node. While the network science literature has proposed numerous network crawling methods, it is not always easy for the data collector to select an appropriate method: methods that are successful on one network may fail on others. First, we show that the performance of network crawling methods is highly dependent on the network's structural properties. We identify three important properties: community separation, average community size, and node degree, and we provide guidelines to data collectors on how to select an appropriate crawling method for a particular network. Second, we propose a novel crawling algorithm, called DE-Crawler, and demonstrate that it performs best across different network domains. Lastly, we consider the scenario in which there are errors in the data collection process, which then lead to errors in a subsequent analysis task; it is therefore important for a data analyst to know whether a collected sample is trustworthy. We introduce a robustness measure called sampling robustness, which measures how robust a network is under random edge deletion with respect to sampling. We demonstrate that sampling robustness depends highly on the network properties and that users can estimate sampling robustness from the obtained sample.
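    The crawling setting described above can be made concrete with a minimal sketch: the collector knows only a starting node and grows the sample by querying observed nodes under a limited budget. This is a plain breadth-first (snowball) crawler, not the DE-Crawler proposed in the work; the query(node) interface returning a node's neighbors is an assumption standing in for a rate-limited platform API.

```python
# Breadth-first (snowball) network crawler under a query budget. Illustrative
# sketch only; `query` is an assumed API call returning a node's neighbors.
from collections import deque


def snowball_sample(start_node, query, budget=1000):
    queried, observed = set(), {start_node}
    edges, frontier = set(), deque([start_node])
    while frontier and len(queried) < budget:
        node = frontier.popleft()
        if node in queried:
            continue
        queried.add(node)                  # each query consumes one unit of budget
        for neighbor in query(node):       # assumed (rate-limited) API call
            edges.add((node, neighbor))
            if neighbor not in observed:
                observed.add(neighbor)
                frontier.append(neighbor)
    return observed, edges                 # the crawled sample of the network
```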

    Reasoning about Cyber Threat Actors

    Reasoning about the activities of cyber threat actors is critical to defend against cyber attacks. However, this task is difficult for a variety of reasons. In simple terms, it is difficult to determine who the attacker is, what the attacker's goals are, and how they will carry out their attacks. These three questions essentially entail understanding the attacker's use of deception, the capabilities available, and the intent of launching the attack. These three issues are highly inter-related. If an adversary can hide their intent, they can better deceive a defender. If an adversary's capabilities are not well understood, then determining their goals becomes difficult, as the defender is uncertain whether the adversary has the necessary tools to accomplish them. However, understanding these aspects is also mutually supportive: if we have a clear picture of capabilities, intent can be better deciphered, and if we understand intent and capabilities, a defender may be able to see through deception schemes. In this dissertation, I present three pieces of work to tackle these questions and obtain a better understanding of cyber threats. First, we introduce a new reasoning framework to address deception. We evaluate the framework by building a dataset from a DEFCON capture-the-flag exercise to identify the person or group responsible for a cyber attack, and demonstrate that the framework not only handles cases of deception but also provides transparent decision making in identifying the threat actor. The second task uses a cognitive learning model to determine the intent, i.e., the goals of the threat actor on the target system. The third task looks at understanding the capabilities of threat actors to target systems by identifying at-risk systems from hacker discussions on darkweb websites. To achieve this task, we gather discussions relating to malicious hacking from more than 300 darkweb websites.

    Adaptive Big Data Pipeline

    Over the past three decades, data has evolved from being a simple software by-product to one of companies' most important assets, used to understand their customers and foresee trends. Deep learning has demonstrated that big volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably while leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate the creation of data pipelines. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. This system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
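    The kind of declarative interface the abstract describes (capture sources, transformations, destinations and a schedule, and let the platform handle provisioning and execution) can be sketched as follows. The field names, the cron-like schedule string, and the load/save helpers are assumptions for illustration and are not the system's actual API.

```python
# Sketch of a declarative pipeline specification plus a single execution pass.
# Field names, schedule format, and the load/save helpers are assumptions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineSpec:
    name: str
    sources: List[str]               # e.g. object-store paths or database tables
    transformations: List[Callable]  # ordered transforms applied to each batch
    destination: str                 # where the results are written
    schedule: str = "@daily"         # cron-like execution schedule


def run_once(spec, load, save):
    """One scheduled execution: load each source, apply transforms in order, save."""
    for source in spec.sources:
        data = load(source)                  # assumed I/O helper
        for transform in spec.transformations:
            data = transform(data)
        save(spec.destination, data)         # assumed I/O helper
```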