No NAT'd User left Behind: Fingerprinting Users behind NAT from NetFlow Records alone
It is generally recognized that the traffic generated by an individual
connected to a network acts as a biometric signature, and several tools
exploit this fact to fingerprint and monitor users. Often, though, these
tools assume access to the entire traffic, including IP addresses and
payloads. This is rarely feasible, since both performance and privacy would
be negatively affected. In reality, most ISPs convert user traffic into NetFlow records for a
concise representation that does not include, for instance, any payloads. More
importantly, large and distributed networks are usually NAT'd, so a few IP
addresses may be associated with thousands of users. We devised a new
fingerprinting framework that overcomes these hurdles. Our system is able to
analyze huge amounts of network traffic, represented as NetFlow records, with
the intent to track people. It does so by accurately inferring when users are
connected to the network and which IP addresses they are using, even though
thousands of users are hidden behind NAT. Our prototype implementation was
deployed and tested within an existing large metropolitan WiFi network serving
about 200,000 users, with an average load of more than 1,000 users
simultaneously connected behind only two NAT'd IP addresses. Our solution turned
out to be very effective, with an accuracy greater than 90%. We also devised
new tools and refined existing ones that may be applied to other contexts
related to NetFlow analysis.
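As a rough illustration of the kind of payload-free analysis described above, the sketch below builds per-source behavioral signatures from NetFlow-style records. The record fields, the service-histogram feature, and the Jaccard similarity are illustrative assumptions, not the paper's actual fingerprinting model:

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class NetFlow:
    """Minimal NetFlow-v5-style record: no payload, just the 5-tuple
    plus volume and timing, as in the abstract above."""
    src_ip: str
    dst_ip: str
    dst_port: int
    packets: int
    octets: int
    start: float  # epoch seconds

def service_histogram(flows):
    """Aggregate flows into a per-source histogram of contacted
    (dst_ip, dst_port) services -- a crude behavioral signature."""
    sig = defaultdict(lambda: defaultdict(int))
    for f in flows:
        sig[f.src_ip][(f.dst_ip, f.dst_port)] += 1
    return sig

def similarity(a, b):
    """Jaccard similarity between two service sets; a real system
    would also weight by frequency and timing."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0
```

Matching a live signature against stored per-user signatures with such a similarity score is one plausible way to re-identify a user after their NAT'd IP changes.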
GT: Picking up the Truth from the Ground for Internet Traffic
Much of Internet traffic modeling, firewall, and intrusion detection research requires traces in which some ground truth regarding application and protocol is associated with each packet or flow. This paper presents the design, development, and experimental evaluation of gt, an open-source software toolset for associating ground truth information with Internet traffic traces. By probing the monitored host's kernel to obtain information on active Internet sessions, gt gathers ground truth at the application level. Preliminary experimental results show that gt's effectiveness comes at little cost in terms of overhead on the hosting machines. Furthermore, when coupled with other packet inspection mechanisms, gt can derive ground truth not only in terms of applications (e.g., e-mail), but also in terms of protocols (e.g., SMTP vs. POP3).
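On Linux, one low-cost way to approximate what gt gathers is to read the kernel's socket table. The parser below decodes `/proc/net/tcp`-style lines into local endpoints plus socket inodes, which could then be mapped to owning processes via `/proc/<pid>/fd`; this is a sketch of the general idea, not gt's actual kernel probe:

```python
def parse_proc_net_tcp(text):
    """Parse /proc/net/tcp-style text into (ip, port, inode) tuples.
    Linux-only format: local_address is little-endian hex IPv4 plus a
    hex port; the inode (field 10) links the socket to a process."""
    sessions = []
    for line in text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        ip_hex, port_hex = fields[1].split(':')
        # Reverse the byte order: "0100007F" -> "127.0.0.1"
        ip = '.'.join(str(int(ip_hex[i:i + 2], 16))
                      for i in range(6, -2, -2))
        sessions.append((ip, int(port_hex, 16), fields[9]))
    return sessions
```

Joining these local endpoints against captured flow 5-tuples, keyed by the process name that owns each inode, yields application-level labels in the spirit of gt.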
Link homophily in the application layer and its usage in traffic classification
This paper addresses the following questions. Is there link homophily in application layer traffic? If so, can it be used to accurately classify traffic in network trace data without relying on payloads or properties at the flow level? Our research shows that the answers to both questions are affirmative in real network trace data. Specifically, we define link homophily as the tendency for flows with common IP hosts to have the same application (P2P, Web, etc.) compared to randomly selected flows. The presence of link homophily in trace data provides us with statistical dependencies between flows that share common IP hosts. We utilize these dependencies to classify application layer traffic without relying on payloads or properties at the flow level. In particular, we introduce a new statistical relational learning algorithm, called Neighboring Link Classifier with Relaxation Labeling (NLC+RL). Our algorithm has no training phase and does not require features to be constructed. All it needs to start the classification process is traffic information on a small portion of the initial flows, which we refer to as seeds. In all our traces, NLC+RL achieves above 90% accuracy with less than 5% seed size; it is robust to errors in the seeds and to various seed-selection biases; and it is able to accurately classify challenging traffic such as P2P with over 90% precision and recall.
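The flavor of NLC+RL can be illustrated with a toy relaxation-labeling pass over flows that share IP hosts. The majority-vote update and the fixed seeds below are simplifications of the paper's actual probabilistic update rule:

```python
from collections import Counter, defaultdict

def relax_labels(flows, seeds, rounds=10):
    """Toy relaxation labeling: each flow is a (src_ip, dst_ip) pair;
    flows sharing a host vote on each other's application label,
    starting from a small seed set that stays fixed."""
    # Index flows by the hosts they touch, so neighbors are cheap to find.
    by_host = defaultdict(list)
    for i, (src, dst) in enumerate(flows):
        by_host[src].append(i)
        by_host[dst].append(i)
    labels = dict(seeds)  # flow index -> application label
    for _ in range(rounds):
        updated = dict(labels)
        for i, (src, dst) in enumerate(flows):
            if i in seeds:  # seeds are ground truth; never relabel
                continue
            votes = Counter(labels[j]
                            for h in (src, dst)
                            for j in by_host[h]
                            if j != i and j in labels)
            if votes:
                updated[i] = votes.most_common(1)[0][0]
        labels = updated
    return labels
```

Note how labels propagate outward from the seeds through shared hosts, which is exactly the link-homophily assumption at work: no payloads and no per-flow features are consulted.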
Unsupervised host behavior classification from connection patterns
A novel host behavior classification approach is proposed as a preliminary step toward traffic classification and anomaly detection in network communication. Though many attempts described in the literature were devoted to flow or application classification, these approaches are not always adaptable to the operational constraints of traffic monitoring (expected to work even without packet payload, without bidirectionality, on high-speed networks, or from flow reports only). Instead, the classification proposed here relies on the leading idea that traffic is relevantly analyzed in terms of typical host behaviors: typical connection patterns of both legitimate applications (data sharing, downloading, etc.) and anomalous (possibly aggressive) behaviors are obtained by profiling traffic at the host level using unsupervised statistical classification. Classification at the host level is not reducible to flow or application classification, and neither is the contrary: they are different operations which might have complementary roles in network management. The proposed host classification is based on a nine-dimensional feature space evaluating host Internet connectivity, dispersion, and exchanged traffic content. A Minimum Spanning Tree (MST) clustering technique is developed that does not require any supervised learning step to produce a set of statistically established typical host behaviors. Not relying on a priori defined classes of known behaviors enables the procedure to discover host behaviors that potentially were never observed before. This procedure is applied to traffic collected over the entire year 2008 on a transpacific (Japan/USA) link. A cross-validation of this unsupervised classification against a classical port-based inspection and a state-of-the-art method assesses the meaningfulness and relevance of the obtained classes of host behaviors.
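The MST clustering idea can be sketched in a few lines: build a minimum spanning tree over the host feature vectors and cut the longest edges to obtain clusters. Here Prim's algorithm runs over Euclidean distances; the paper's nine-dimensional features and its edge-cutting criterion are more elaborate:

```python
import numpy as np

def mst_clusters(points, n_clusters):
    """Cluster points by building a minimum spanning tree (Prim's
    algorithm) and removing the n_clusters-1 longest edges -- a toy
    form of the unsupervised scheme described above."""
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    in_tree, edges = {0}, []
    while len(in_tree) < n:  # grow the MST one cheapest edge at a time
        best = min((dist[i, j], i, j) for i in in_tree
                   for j in range(n) if j not in in_tree)
        edges.append(best)
        in_tree.add(best[2])
    # Keep only the shortest edges; dropping the longest ones splits
    # the tree into n_clusters connected components.
    edges.sort()
    keep = edges[:n - n_clusters]
    parent = list(range(n))
    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    for _, i, j in keep:
        parent[find(i)] = find(j)
    roots = sorted({find(i) for i in range(n)})
    label = {r: k for k, r in enumerate(roots)}
    return [label[find(i)] for i in range(n)]
```

Because no class definitions are supplied, each resulting component is simply a group of hosts with similar connection behavior, which is what lets the method surface previously unseen behaviors.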
Detecting Networks Employing Algorithmically Generated Domain Names
Recent botnets such as Conficker, Kraken, and Torpig have used DNS-based "domain fluxing" for command-and-control, where each bot queries for the existence of a series of domain names and the owner has to register only one such domain name. In this report, we develop a methodology to detect such "domain fluxes" in DNS traffic by looking for patterns inherent to domain names that are generated algorithmically, in contrast to those generated by humans. In particular, we look at the distribution of alphanumeric characters as well as bigrams in all domains that are mapped to the same set of IP addresses. We present and compare the performance of several distance metrics, including KL distance and edit distance. We train using a good data set of domains obtained via a crawl of domains mapped to the entire IPv4 address space, and model bad data sets based on behaviors seen so far and expected. We also apply our methodology to packet traces collected at two Tier-1 ISPs and show that we can automatically detect domain fluxing as used by the Conficker botnet with minimal false positives. We are also able to detect new botnets and other malicious networks using our method.
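A minimal version of the character-distribution test might compare smoothed unigram distributions with KL divergence. The add-one smoothing and the alphabet below are illustrative choices, not the report's exact setup:

```python
import math
from collections import Counter

ALPHABET = "abcdefghijklmnopqrstuvwxyz0123456789"

def char_dist(domains):
    """Smoothed unigram distribution of alphanumeric characters over a
    set of domain labels (add-one smoothing keeps every probability
    nonzero, so the KL divergence is always defined)."""
    counts = Counter(c for d in domains for c in d if c in ALPHABET)
    total = sum(counts.values()) + len(ALPHABET)
    return {c: (counts[c] + 1) / total for c in ALPHABET}

def kl_divergence(p, q):
    """KL(P || Q); in practice one would symmetrize it, e.g. as
    (KL(P||Q) + KL(Q||P)) / 2, before thresholding."""
    return sum(p[c] * math.log(p[c] / q[c]) for c in p)
```

Domains that look machine-generated tend to have a near-uniform character distribution, so their divergence from a distribution trained on benign domains is large; thresholding that divergence is the detection step.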
Caracterização multi-escalar de tráfego em redes protegidas (Multi-scale traffic characterization in protected networks)
Master's in Computer and Telematics Engineering.
Nowadays, the Internet can be seen as a mix of services and applications
that run over common protocols. The emergence of several web-based
applications changed the user interaction paradigm by placing them in a
more active role, allowing users to share photos, videos and much more.
The analysis of each user profile, both in wired and wireless networks, can
become very interesting for tasks such as network resources optimization,
service customization and security. This thesis aims to collect a systematic
set of traffic captures corresponding to the use of several web-based applications
in protected networks and perform a statistical traffic characterization
for each application. The captured traffic (and the corresponding statistics)
will be subsequently used to validate the methodologies developed to
identify applications and characterize the traffic associated with each user.
Several statistical methodologies allow the identification of
user profiles (on both wireless and wired networks) based on statistical
information collected from the traffic generated while using the different
network services. In this sense, it is very important to have real traffic
captures that are representative of a common use of several web-based
applications. On-line services, such as news, e-mail, social networking,
photo sharing and videos can be studied and characterized through the
statistical analysis of the traffic captured while using applications such as
on-line newspapers, Youtube, Flickr, GMail, Facebook, among others. By
extracting layer 2 traffic metrics, performing a wavelet decomposition and
analyzing the obtained scalograms, it is possible to evaluate the time and
frequency components of the analyzed traffic. A communication profile
can then be defined in order to describe the frequency spectrum that is
characteristic of each web-based application. By doing that, it will be
possible to identify the different applications used by the connected clients
and build accurate user profiles.
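The wavelet step can be sketched with a plain Haar decomposition of a per-interval traffic series: the detail-coefficient energy at each scale gives a crude frequency profile of the kind used to separate applications. This is a toy stand-in for the scalogram analysis and assumes a power-of-two series length:

```python
import numpy as np

def haar_energies(signal, levels=4):
    """Multi-level Haar wavelet decomposition of a traffic time series
    (e.g., per-interval layer-2 byte counts). The energy of the detail
    coefficients at each scale summarizes how much of the traffic's
    variability lives at that time scale."""
    s = np.asarray(signal, dtype=float)
    energies = []
    for _ in range(levels):
        approx = (s[0::2] + s[1::2]) / np.sqrt(2)   # low-pass half
        detail = (s[0::2] - s[1::2]) / np.sqrt(2)   # high-pass half
        energies.append(float(np.sum(detail ** 2)))
        s = approx
        if len(s) < 2:  # nothing left to split
            break
    return energies
```

Comparing these per-scale energy vectors across captures is one simple way to tell, say, bursty video traffic from the slower periodic patterns of mail polling.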
Analysis and Defense of Emerging Malware Attacks
The persistent evolution of malware brings great challenges to the current anti-malware industry. First, traditional signature-based detection and prevention schemes produce ever-growing signature databases; each end-host user has to install an AV tool and tolerate the huge amount of resources consumed by pairwise matching. On the analysis side, emerging malware can detect its running environment and decide whether or not to infect the host. Hence, traditional dynamic malware analysis can no longer find the desired malicious logic if the targeted environment cannot be extracted in advance. Together, these two problems show that current malware defense schemes are too passive and reactive to fulfill the task.
The goal of this research is to develop new analysis and protection schemes for emerging malware threats. First, this dissertation performs a detailed study of recent targeted malware attacks. Based on this study, we develop a new technique to perform targeted malware analysis effectively and efficiently. Second, this dissertation studies a new trend of massive malware intrusion and proposes a new protection scheme to proactively defend against malware attacks. Lastly, we focus on new P2P malware. We propose a new scheme, named informed active probing, for large-scale P2P malware analysis and detection. Furthermore, our Internet-wide evaluation shows that our active probing scheme can successfully detect malicious P2P malware and its corresponding malicious servers.
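Informed active probing can be caricatured as sending a protocol-specific handshake to a candidate host and matching the reply against known signatures. The probe bytes and the family name below are entirely hypothetical placeholders, not signatures from the dissertation:

```python
import socket

# Hypothetical probe/response pairs; a real deployment would use
# handshake bytes reverse-engineered from the malware's P2P protocol.
PROBES = {
    "example-p2p-bot": (b"\x16\x03\x01ping", b"\x16\x03\x01pong"),
}

def classify_response(data):
    """Match a probe response against the known malicious signatures;
    returns the family name or None for benign/unknown replies."""
    for family, (_, expected) in PROBES.items():
        if data.startswith(expected):
            return family
    return None

def probe_host(host, port, timeout=2.0):
    """Send each known probe and classify the reply. Has network side
    effects; run only against hosts you are authorized to scan."""
    for probe, _ in PROBES.values():
        try:
            with socket.create_connection((host, port), timeout=timeout) as s:
                s.sendall(probe)
                reply = s.recv(64)
        except OSError:
            continue
        family = classify_response(reply)
        if family:
            return family
    return None
```

The "informed" part is that probes are derived from prior protocol analysis, so a single round trip can confirm membership in a known malicious P2P network without passive monitoring.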