Scalable Graph Building from Text Data
In this paper we propose NNCTPH, a new MapReduce algorithm able to build an approximate k-NN graph from large text datasets. The algorithm uses a modified version of Context Triggered Piecewise Hashing to bin the input data into buckets, and an exhaustive search inside each bucket to build the graph. It also runs multiple stages to join disconnected subgraphs. We experimentally test the algorithm on several datasets consisting of the subjects of spam emails. Although the algorithm is still at an early development stage, it already proves to be four times faster than a MapReduce implementation of NN-Descent, for the same quality of produced graph.
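The bucket-then-brute-force idea behind NNCTPH can be sketched as follows. This is a toy illustration, not the authors' implementation: the CTPH binning is replaced by a trivial stand-in key, and string similarity is computed over character 3-grams.

```python
# Sketch of the NNCTPH idea: bin items with a locality-sensitive key, then run
# an exhaustive k-NN search inside each bucket. The bucket key used here
# (first character) is a toy stand-in for CTPH, purely for illustration.
from itertools import combinations

def jaccard_distance(a, b):
    """Distance between two strings based on their character 3-gram sets."""
    ga = {a[i:i + 3] for i in range(len(a) - 2)}
    gb = {b[i:i + 3] for i in range(len(b) - 2)}
    if not ga or not gb:
        return 1.0
    return 1.0 - len(ga & gb) / len(ga | gb)

def approx_knn_graph(items, k=2, bucket_key=lambda s: s[0]):
    # Stage 1: bin items into buckets (stand-in for CTPH-based binning).
    buckets = {}
    for it in items:
        buckets.setdefault(bucket_key(it), []).append(it)
    # Stage 2: exhaustive pairwise search inside each bucket.
    neighbours = {it: [] for it in items}
    for bucket in buckets.values():
        for a, b in combinations(bucket, 2):
            d = jaccard_distance(a, b)
            neighbours[a].append((d, b))
            neighbours[b].append((d, a))
    # Keep only each item's k nearest candidates.
    return {it: [n for _, n in sorted(cands)[:k]]
            for it, cands in neighbours.items()}

emails = ["win a free prize now", "win a free prize today",
          "meeting agenda attached", "meeting agenda for monday"]
graph = approx_knn_graph(emails, k=1)
```

The graph is only approximate because items hashed into different buckets are never compared; the multi-stage joining described in the abstract exists precisely to reconnect such subgraphs.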
Determining the k in k-means with MapReduce
In this paper we propose a MapReduce implementation of G-means, a variant of k-means that automatically determines k, the number of clusters. We show that our implementation scales to very large datasets and very large values of k, as the computation cost is proportional to nk. Other techniques that run a clustering algorithm with different values of k and pick the value that yields the "best" results have a computation cost proportional to nk². We run experiments that confirm the processing time is proportional to k. These experiments also show that, because G-means adds new centers progressively, if and where they are needed, it reduces the probability of falling into a local minimum and ultimately finds better centers than classical k-means.
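The nk versus nk² cost gap can be made concrete with a crude counting model (an illustration of the scaling argument, not a figure from the paper, and assuming a fixed number of Lloyd passes per stage):

```python
# Back-of-the-envelope distance-evaluation counts for choosing k.
# - Trying every candidate k = 1..k with a fresh k-means run costs
#   n*(1 + 2 + ... + k) = n*k*(k+1)/2, i.e. proportional to n*k^2.
# - A G-means-style scheme that grows the number of centres progressively
#   (here modelled as doubling: 1, 2, 4, ..., k) costs
#   n*(1 + 2 + 4 + ... + k) < 2*n*k, i.e. proportional to n*k.
def cost_try_all_k(n, k):
    return n * k * (k + 1) // 2          # sum of n*j for j = 1..k

def cost_progressive(n, k):
    total, j = 0, 1
    while j <= k:
        total += n * j                   # one pass with j centres
        j *= 2
    return total

n, k = 1_000_000, 64
ratio = cost_try_all_k(n, k) / cost_progressive(n, k)   # grows with k
```

With n = 10⁶ points and k = 64, the exhaustive search over k performs roughly 16 times more distance evaluations, and the gap widens linearly as k grows.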
Scraping Airlines Bots: Insights Obtained Studying Honeypot Data
Airline websites are the victims of unauthorised online travel agencies and aggregators that use armies of bots to scrape prices and flight information. These so-called Advanced Persistent Bots (APBs) are highly sophisticated. On top of the valuable information taken away, the sheer quantity of requests consumes a very substantial amount of resources on the airlines' websites. In this work, we propose a deceptive approach to counter scraping bots. We present a platform capable of mimicking airlines' sites and changing prices at will, and we report results from the case studies we performed with it. We lured bots for almost two months, feeding them inaccurate yet indistinguishable information. Studying the collected requests, we found behavioural patterns that could serve as complementary bot-detection signals. Moreover, based on the gathered empirical evidence, we propose a method to investigate the common claim that proxy services used by web-scraping bots have millions of residential IPs at their disposal. Our mathematical models indicate that the number of IPs is likely two to three orders of magnitude smaller than claimed. This finding suggests that an IP-reputation-based blocking strategy could be effective, contrary to what operators of these websites believe today.
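The abstract does not spell out the authors' model for sizing the proxy pool. One standard way to estimate a population from repeated observations of the same IPs is the classic Lincoln-Petersen capture-recapture estimator, sketched below with made-up data (both the estimator choice and the sample IPs are illustrative assumptions, not taken from the paper):

```python
# Lincoln-Petersen capture-recapture estimate of a proxy IP pool's size:
# if two independent observation windows see n1 and n2 distinct IPs with
# m IPs in common, the pool size is estimated as n1*n2/m.
def lincoln_petersen(sample1, sample2):
    """Estimate population size from two samples of observed IPs."""
    s1, s2 = set(sample1), set(sample2)
    overlap = len(s1 & s2)
    if overlap == 0:
        raise ValueError("no recaptured IPs; cannot estimate")
    return len(s1) * len(s2) // overlap

# Hypothetical honeypot logs: IPs seen in two separate observation windows.
week1 = [f"10.0.0.{i}" for i in range(1, 101)]    # 100 distinct IPs
week2 = [f"10.0.0.{i}" for i in range(81, 181)]   # 100 distinct, 20 shared
estimate = lincoln_petersen(week1, week2)         # 100 * 100 / 20 = 500
```

The intuition matches the abstract's finding: if the pool really held millions of IPs, two months of traffic would show almost no repeated addresses, whereas a heavy overlap implies a pool orders of magnitude smaller.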
Gone Rogue: An Analysis of Rogue Security Software Campaigns
In the past few years, Internet miscreants have developed a number of techniques to defraud their unsuspecting victims and make a hefty profit. A troubling recent example of this trend is cyber-criminals distributing rogue security software, that is, malicious programs that, by pretending to be legitimate security tools (e.g., anti-virus or anti-spyware), deceive users into paying a substantial amount of money in exchange for little or no protection. While the technical and economic aspects of rogue security software (e.g., its distribution and monetization mechanisms) are relatively well understood, much less is known about the campaigns through which this type of malware is distributed, that is, the underlying techniques and coordinated efforts cyber-criminals employ to spread their malware. In this paper, we present the techniques we used to analyze rogue security software campaigns, with an emphasis on the infrastructure employed in the campaigns and the life-cycle of the clients they infect.
Extracting inter-arrival time based behaviour from honeypot traffic using cliques
The Leurre.com project is a worldwide network of honeypot environments that collect traces of malicious Internet traffic every day. Clustering techniques have been used to categorize and classify honeypot activities based on several traffic features. While such clusters of traffic provide useful information about different activities happening on the Internet, a new correlation approach is needed to automate the discovery of refined types of activities that share common features. This paper proposes the use of packet inter-arrival time (IAT) as the main feature for grouping clusters that exhibit commonalities in their IAT distributions. Our approach uses a cliquing algorithm for the automatic discovery of cliques of clusters. We demonstrate the usefulness of our methodology with several examples of IAT cliques and a discussion of the types of activity they represent, and we give some insight into the causes of these activities. We also address the limitations of our approach through the manual extraction of what we term supercliques, and discuss ideas for further improvement.
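The clique-based grouping can be sketched as follows. This is an illustration of the general technique, not the Leurre.com code: the histograms, the L1 distance, and the similarity threshold are all assumed choices.

```python
# Group clusters whose IAT distributions are pairwise similar: build a
# similarity graph over clusters, then enumerate its maximal cliques with a
# plain Bron-Kerbosch search. Every pair inside a reported clique is similar.
from itertools import combinations

def l1_distance(h1, h2):
    """L1 distance between two normalised IAT histograms."""
    return sum(abs(a - b) for a, b in zip(h1, h2))

def bron_kerbosch(r, p, x, adj, out):
    """Enumerate maximal cliques of the similarity graph."""
    if not p and not x:
        out.append(sorted(r))
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], adj, out)
        p.remove(v)
        x.add(v)

def iat_cliques(histograms, threshold=0.3):
    names = list(histograms)
    adj = {n: set() for n in names}
    # Edge between two clusters when their IAT histograms are close enough.
    for a, b in combinations(names, 2):
        if l1_distance(histograms[a], histograms[b]) <= threshold:
            adj[a].add(b)
            adj[b].add(a)
    cliques = []
    bron_kerbosch(set(), set(names), set(), adj, cliques)
    return [c for c in cliques if len(c) > 1]   # drop singleton "cliques"

# Hypothetical per-cluster IAT histograms (normalised over 4 time bins).
hists = {"c1": [0.7, 0.2, 0.1, 0.0],
         "c2": [0.6, 0.3, 0.1, 0.0],
         "c3": [0.0, 0.1, 0.2, 0.7]}
cliques = iat_cliques(hists)   # c1 and c2 group together; c3 stands alone
```

Because cliques require every pair to be similar, they are stricter than connected components; the "supercliques" mentioned above relax exactly this all-pairs requirement.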