A Brief History of Web Crawlers
Web crawlers visit internet applications, collect data, and learn about new
web pages from visited pages. Web crawlers have a long and interesting history.
Early web crawlers collected statistics about the web. In addition to
collecting statistics about the web and indexing the applications for search
engines, modern crawlers can be used to perform accessibility and vulnerability
checks on the application. The rapid expansion of the web and the growing
complexity of web applications have made crawling a very challenging process.
Throughout the history of web crawling many researchers and industrial groups
addressed different issues and challenges that web crawlers face. Different
solutions have been proposed to reduce the time and cost of crawling.
Performing an exhaustive crawl remains a challenge. Additionally, automatically
capturing the model of a modern web application and extracting data from it
is another open question. What follows is a brief history of the different
techniques and algorithms used from the early days of crawling to the present.
We introduce criteria to evaluate the relative performance of web crawlers
and, based on these criteria, plot the evolution of web crawlers and compare
their performance.
The Corpus Expansion Toolkit: finding what we want on the web
This thesis presents the Corpus Expansion Toolkit (CET), a generally applicable toolkit that allows researchers to build domain-specific corpora from the web. The main purpose of the work presented in this thesis, and of the development of the CET, is to provide a solution for discovering desired content on the web from possibly unknown locations or a poorly defined domain. Using an iterative process, the CET is able to solve the problem of discovering domain-specific online content and to expand a corpus using only a very small number of example documents or characteristic phrases taken from the target domain. Using a human-in-the-loop strategy and a chain of discrete software components, the CET also allows the concept of a domain to be iteratively defined using the very online resources used to expand the original corpus. The CET combines feature extraction, search, web crawling and machine learning methods to collect, store, filter and perform information extraction on the collected documents. Starting from a small number of example "seed" documents, the CET is able to expand the original corpus by finding more relevant documents on the web, and it provides a number of tools to support their analysis. This thesis presents a case-study-based methodology that introduces the various contributions and components of the CET through the discussion of five case studies, covering a wide variety of domains and requirements to which the CET has been applied. These case studies illustrate three main use cases, listed below, where the CET is applicable:
1. Domain known – source known
2. Domain known – source unknown
3. Domain unknown – source unknown
First, use cases where the sites for document collection are known and the topic of research is clearly defined. Second, instances where the topic of research is clearly defined but it is unknown where to find relevant documents on the web. Third, the most extreme use case, where the domain is poorly defined or unknown to the researcher and the location of the information is also unknown. This thesis presents a solution that allows researchers to begin with very little information on a specific topic, iteratively build a clear conception of a domain, and translate that conception into a computational system.
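To make the iterative expansion process concrete, here is a minimal Python sketch of the kind of seed-driven expand-and-filter cycle described above. The function names, the term-overlap relevance filter and the placeholder search step are illustrative assumptions, not the CET's actual components or API.

```python
# Illustrative sketch of an iterative corpus-expansion loop of the kind the CET
# describes; names and heuristics here are hypothetical, not the toolkit's API.
from dataclasses import dataclass, field

@dataclass
class Corpus:
    documents: list = field(default_factory=list)

def extract_features(docs):
    """Toy feature extraction: characteristic terms taken from the current corpus."""
    terms = set()
    for doc in docs:
        terms.update(w.lower() for w in doc.split() if len(w) > 6)
    return terms

def search_web(terms, limit=20):
    """Placeholder for the search/crawl step that returns candidate documents."""
    return []  # a real system would query a search engine or focused crawler here

def is_relevant(doc, terms, threshold=0.05):
    """Toy relevance filter: fraction of in-domain terms appearing in the document."""
    words = doc.lower().split()
    if not words:
        return False
    return sum(1 for w in words if w in terms) / len(words) >= threshold

def expand(corpus: Corpus, iterations=3):
    for _ in range(iterations):
        terms = extract_features(corpus.documents)
        candidates = search_web(terms)
        accepted = [d for d in candidates if is_relevant(d, terms)]
        if not accepted:
            break                      # nothing new found; stop early
        corpus.documents.extend(accepted)
        # a human-in-the-loop step could review `accepted` here and refine `terms`
    return corpus
```

In this sketch the domain definition is refined each round by re-extracting terms from the growing corpus, which mirrors the iterative, human-guided refinement the thesis describes.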
State of the art 2015: a literature review of social media intelligence capabilities for counter-terrorism
Overview
This paper is a review of how information and insight can be drawn from open social media sources. It focuses on the specific research techniques that have emerged, the capabilities they provide, the possible insights they offer, and the ethical and legal questions they raise. These techniques are considered relevant and valuable in so far as they can help to maintain public safety by preventing terrorism, preparing for it, protecting the public from it and pursuing its perpetrators. The report also considers how far this can be achieved against the backdrop of radically changing technology and public attitudes towards surveillance. This is an updated version of a 2013 report on the same subject, State of the Art. Since 2013, there have been significant changes in social media, in how it is used by terrorist groups, and in the methods being developed to make sense of it.
The paper is structured as follows:
Part 1 is an overview of social media use, focused on how it is used by groups of interest to those involved in counter-terrorism. This includes a new section on trends in social media platforms and a new section on Islamic State (IS).
Part 2 provides an introduction to the key approaches of social media intelligence (henceforth "SOCMINT") for counter-terrorism.
Part 3 sets out a series of SOCMINT techniques. For each technique, the capabilities and insights it provides are described, its validity and reliability are assessed, and its possible application to counter-terrorism work is explored.
Part 4 outlines a number of important legal, ethical and practical considerations when undertaking SOCMINT work.
ThreatCrawl: A BERT-based Focused Crawler for the Cybersecurity Domain
Publicly available sources contain valuable information for Cyber Threat
Intelligence (CTI). This can be used to prevent attacks that have already taken
place on other systems. Ideally, only the initial attack succeeds and all
subsequent ones are detected and stopped. But while there are different
standards to exchange this information, a lot of it is shared in articles or
blog posts in non-standardized ways. Manually scanning through multiple online
portals and news pages to discover new threats and extracting them is a
time-consuming task. To automate parts of this scanning process, multiple
papers propose extractors that use Natural Language Processing (NLP) to extract
Indicators of Compromise (IOCs) from documents. However, while this already
solves the problem of extracting the information out of documents, the search
for these documents is rarely considered. In this paper, a new focused crawler
is proposed called ThreatCrawl, which uses Bidirectional Encoder
Representations from Transformers (BERT)-based models to classify documents and
adapt its crawling path dynamically. While ThreatCrawl has difficulty
classifying the specific type of Open Source Intelligence (OSINT) named in texts,
e.g., IOC content, it can successfully find relevant documents and modify its
path accordingly. It yields harvest rates of up to 52%, which are, to the best
of our knowledge, better than the current state of the art.
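As a rough illustration of the focused-crawling idea, the sketch below keeps a priority-queue frontier and uses a BERT-based text classifier (via the Hugging Face transformers pipeline) to score fetched pages and decide whether to expand their links. The model name, threshold and frontier policy are placeholders, not ThreatCrawl's actual configuration.

```python
# Minimal sketch of a classifier-guided focused crawler in the spirit of
# ThreatCrawl.  The model below is a placeholder sentiment model; a real
# deployment would use a BERT model fine-tuned for CTI relevance.
import heapq
from urllib.parse import urljoin

import requests
from bs4 import BeautifulSoup
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="distilbert-base-uncased-finetuned-sst-2-english",  # placeholder model
)

def relevance(text: str) -> float:
    """Score a page's relevance with the (placeholder) BERT-based classifier."""
    result = classifier(text[:2000], truncation=True)[0]
    return result["score"] if result["label"] == "POSITIVE" else 1 - result["score"]

def focused_crawl(seeds, max_pages=50, threshold=0.5):
    frontier = [(-1.0, url) for url in seeds]    # max-heap via negated scores
    heapq.heapify(frontier)
    seen, harvest = set(seeds), []
    while frontier and len(harvest) < max_pages:
        _, url = heapq.heappop(frontier)
        try:
            html = requests.get(url, timeout=10).text
        except requests.RequestException:
            continue
        soup = BeautifulSoup(html, "html.parser")
        score = relevance(soup.get_text(" ", strip=True))
        if score < threshold:
            continue                             # irrelevant page: do not expand it
        harvest.append((url, score))
        for link in soup.find_all("a", href=True):
            target = urljoin(url, link["href"])
            if target.startswith("http") and target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-score, target))
    return harvest
```

The harvest rate reported above would correspond, in this sketch, to the fraction of fetched pages that pass the relevance threshold.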
METRICC: Harnessing Comparable Corpora for Multilingual Lexicon Development
Research on comparable corpora has grown in recent years, bringing about the possibility of developing multilingual lexicons through the exploitation of comparable corpora to create corpus-driven multilingual dictionaries. To date, this issue has not been widely addressed. This paper focuses on the use of the mechanism of collocational networks proposed by Williams (1998) for exploiting comparable corpora. The paper first provides a description of the METRICC project, which is aimed at the automatic creation of comparable corpora, describes one of the crawlers developed for comparable corpora building, and then discusses the power of collocational networks for multilingual corpus-driven dictionary development.
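As a generic illustration of the collocational-network idea, the sketch below builds a weighted co-occurrence graph from windowed collocation counts. It is a simplified, hypothetical example, not the METRICC crawlers or the toolchain described in the paper.

```python
# Generic illustration of a collocational network: edges are word pairs that
# co-occur within a fixed window, weighted by how often they do so.
from collections import Counter

def collocation_network(sentences, window=5, min_count=2):
    """Return {(word_a, word_b): count} edges for pairs co-occurring within `window`."""
    edges = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, w in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                if w != v:
                    edges[tuple(sorted((w, v)))] += 1
    return {pair: count for pair, count in edges.items() if count >= min_count}

# Example: build one network per language from the comparable corpora, then
# compare the networks around candidate translation pairs.
edges = collocation_network(
    ["the exchange rate fell sharply", "the central bank cut the exchange rate"],
    min_count=1,
)
```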
Prometheus: a generic e-commerce crawler for the study of business markets and other e-commerce problems
The continuous social and economic development has led over time to an increase in consumption,
as well as a greater demand from consumers for better and cheaper products.
Hence, the selling price of a product plays a fundamental role in the consumer's
purchase decision. In this context, online stores must carefully analyse and define the best
price for each product, based on several factors such as production/acquisition cost, positioning
of the product (e.g. anchor product) and the strategies of competing companies. The
work done by market analysts has changed drastically over recent years.
As the number of web sites increases exponentially, the number of E-commerce web
sites also grows. Web page classification becomes more important in fields like Web
mining and information retrieval. Traditional classifiers are usually hand-crafted and
non-adaptive, which makes them inappropriate for use in a broader context. We introduce an
ensemble of methods, and a subsequent study of its results, to create a more generic and
modular crawler and scraper for detection and information extraction on E-commerce web
pages. The collected information may then be processed and used in the pricing decision.
This framework goes by the name Prometheus and has the goal of extracting knowledge
from E-commerce Web sites.
The process requires crawling an online store and gathering product pages. This implies
that given a web page the framework must be able to determine if it is a product page.
In order to achieve this, we classify the pages into three categories: catalogue, product and
"spam". The page classification stage was addressed based on the HTML text as well as on
the visual layout, featuring both traditional methods and Deep Learning approaches.
Once a set of product pages has been identified, we proceed to the extraction of the pricing
information. This is not a trivial task due to the disparity of approaches used to create a web
page. Furthermore, most product pages are dynamic in the sense that they are truly a page
for a family of related products. For instance, when visiting a shoe store, for a particular
model there are probably a number of sizes and colours available. Such a model may be
displayed in a single dynamic web page, making it necessary for our framework to explore
all the relevant combinations. This process is called scraping and is the last stage of the
Prometheus framework.
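To illustrate the catalogue/product/spam classification stage, here is a small heuristic stand-in in Python: it inspects the HTML text for prices and a buy action, assigns one of the three categories, and then performs a naive price extraction for product pages. Prometheus itself uses trained traditional and deep-learning classifiers over both text and visual layout; the rules and regular expression below are illustrative assumptions only.

```python
# Heuristic stand-in for the page-classification and scraping stages described
# above; not the classifiers actually used by the Prometheus framework.
import re
from bs4 import BeautifulSoup

PRICE_RE = re.compile(r"(?:\$|€|£)\s?\d+[.,]?\d*")

def classify_page(html: str) -> str:
    soup = BeautifulSoup(html, "html.parser")
    text = soup.get_text(" ", strip=True)
    prices = PRICE_RE.findall(text)
    has_cart = bool(soup.find(string=re.compile("add to cart", re.I)))
    if has_cart and len(prices) >= 1:
        return "product"      # one item with a price and a buy action
    if len(prices) > 5:
        return "catalogue"    # many prices usually indicates a listing page
    return "spam"             # everything else is out of scope for scraping

def extract_price(html: str):
    """Naive price extraction for pages classified as 'product'."""
    match = PRICE_RE.search(BeautifulSoup(html, "html.parser").get_text(" "))
    return match.group(0) if match else None
```

A real scraper would also enumerate the dynamic variants (sizes, colours) of a product page before extracting prices, which is the combination-exploration step the abstract mentions.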
Network Sampling through Crawling
In recent years, researchers have increasingly used OSN data to study human behavior. Before such a study can begin, one must first obtain appropriate data. A platform, e.g., Facebook or Twitter, may provide an API for accessing data, but such APIs are often rate-limited, restricting the amount of data that an individual can collect in a given amount of time. To collect data efficiently, the data collector therefore needs to make intelligent use of her limited query budget. We consider the problem of network sampling through crawling, in which the data collector has no knowledge of the network of interest except the identity of a starting node. The data collector can expand the observed sample by querying an observed node. While the network science literature has proposed numerous network crawling methods, it is not always easy for the data collector to select an appropriate method: methods that are successful on one network may fail on other networks.
Here, we show that the performance of network crawling methods is highly dependent on the network's structural properties. We identify three important network properties: community separation, average community size, and node degree, and we provide guidelines to data collectors on how to select an appropriate crawling method for a particular network. Secondly, we propose a novel crawling algorithm, called DE-Crawler, and demonstrate that it performs best across different network domains. Lastly, we consider the scenario in which there are errors in the data collection process. These errors then lead to errors in a subsequent analysis task, so it is important for a data analyst to know whether a collected sample is trustworthy. We introduce a robustness measure called sampling robustness, which measures how robust a network is under random edge deletion with respect to sampling. We demonstrate that sampling robustness depends strongly on the network properties and that users can estimate sampling robustness from the obtained sample.
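For context, the sketch below shows network sampling through crawling in its simplest form: a breadth-first snowball expansion from a single start node under a fixed query budget, where each neighbour query stands in for one rate-limited API call. It is a generic baseline for illustration, not the DE-Crawler algorithm proposed in this work.

```python
# Minimal sketch of crawl-based network sampling under a query budget:
# a breadth-first snowball expansion starting from one known node.
from collections import deque

def snowball_sample(query_neighbors, start_node, budget=100):
    """query_neighbors(node) -> iterable of neighbors (one rate-limited API call)."""
    sampled_nodes, sampled_edges = {start_node}, set()
    frontier = deque([start_node])
    queries = 0
    while frontier and queries < budget:
        node = frontier.popleft()
        queries += 1                        # each expansion costs one query
        for neighbor in query_neighbors(node):
            sampled_edges.add((node, neighbor))
            if neighbor not in sampled_nodes:
                sampled_nodes.add(neighbor)
                frontier.append(neighbor)
    return sampled_nodes, sampled_edges
```

How well such a sample represents the full network depends on the structural properties discussed above, which is why the choice of crawling strategy matters.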
Reasoning about Cyber Threat Actors
Reasoning about the activities of cyber threat actors is critical to defend against cyber
attacks. However, this task is difficult for a variety of reasons. In simple terms, it is difficult
to determine who the attacker is, what the desired goals are of the attacker, and how they will
carry out their attacks. These three questions essentially entail understanding the attackerâs
use of deception, the capabilities available, and the intent of launching the attack. These
three issues are highly inter-related. If an adversary can hide their intent, they can better
deceive a defender. If an adversaryâs capabilities are not well understood, then determining
what their goals are becomes difficult as the defender is uncertain if they have the necessary
tools to accomplish them. However, the understanding of these aspects are also mutually
supportive. If we have a clear picture of capabilities, intent can better be deciphered. If we
understand intent and capabilities, a defender may be able to see through deception schemes.
In this dissertation, I present three pieces of work to tackle these questions to obtain
a better understanding of cyber threats. First, we introduce a new reasoning framework
to address deception. We evaluate the framework by building a dataset from a DEFCON
capture-the-flag exercise to identify the person or group responsible for a cyber attack.
We demonstrate that the framework not only handles cases of deception but also provides
transparent decision making in identifying the threat actor. The second task uses a cognitive
learning model to determine the intent – the goals of the threat actor on the target system.
The third task looks at understanding the capabilities of threat actors to target systems by
identifying at-risk systems from hacker discussions on darkweb websites. To achieve this
task we gather discussions from more than 300 darkweb websites relating to malicious
hacking.
Adaptive Big Data Pipeline
Over the past three decades, data has evolved from being a simple software by-product to one of companies' most important assets, used to understand their customers and foresee trends. Deep learning has demonstrated that big volumes of clean data generally provide more flexibility and accuracy when modeling a phenomenon. However, handling ever-increasing data volumes entails new challenges: the lack of expertise to select the appropriate big data tools for the processing pipelines, as well as the speed at which engineers can take such pipelines into production reliably, leveraging the cloud. We introduce a system called Adaptive Big Data Pipelines: a platform to automate data pipeline creation. It provides an interface to capture the data sources, transformations, destinations and execution schedule. The system builds up the cloud infrastructure, schedules and fine-tunes the transformations, and creates the data lineage graph. This system has been tested on data sets of 50 gigabytes, processing them in just a few minutes without user intervention.
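As an illustration of the kind of declarative interface such a platform captures (sources, transformations, destination and execution schedule), here is a hypothetical Python specification; the field names, example values and cron default are assumptions, not the system's actual interface.

```python
# Hypothetical sketch of a declarative pipeline specification of the kind the
# platform above captures before provisioning infrastructure and scheduling runs.
from dataclasses import dataclass

@dataclass
class PipelineSpec:
    name: str
    sources: list           # e.g. ["s3://bucket/raw/events/"]
    transformations: list   # ordered step names, e.g. ["dedupe", "aggregate_daily"]
    destination: str        # e.g. "warehouse.daily_events"
    schedule: str = "0 2 * * *"   # cron expression for the execution schedule

spec = PipelineSpec(
    name="daily_events",
    sources=["s3://bucket/raw/events/"],
    transformations=["dedupe", "join_users", "aggregate_daily"],
    destination="warehouse.daily_events",
)
# From a spec like this, the platform described above would build the cloud
# infrastructure, schedule and tune the transformations, and record data lineage.
```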