120 research outputs found
Big Data for Traffic Monitoring and Management
The last two decades witnessed tremendous advances in Information and Communications Technologies. Besides improvements in computational power and storage capacity, communication networks nowadays carry an amount of data that was not envisaged only a few years ago. Together with their pervasiveness, network complexity increased at the same pace, leaving operators and researchers with few instruments to understand what happens in the networks and, on a global scale, on the Internet.
Fortunately, recent advances in data science and machine learning come to the rescue of network analysts and allow analyses with a level of complexity and spatial/temporal scope not possible only 10 years ago. In my thesis, I take the perspective of an Internet Service Provider (ISP) and illustrate the challenges and possibilities of analyzing the traffic coming from modern operational networks. I make use of big data and machine learning algorithms and apply them to datasets coming from passive measurements of ISP and university campus networks. The marriage between data science and network measurements is complicated by the complexity of machine learning algorithms and by the intrinsic multi-dimensionality and variability of this kind of data. As such, my work proposes and evaluates novel techniques, inspired by popular machine learning approaches but carefully tailored to operate on network traffic.
In this thesis, I first provide a thorough characterization of Internet traffic from 2013 to 2018. I show the most important trends in the composition of traffic and users' habits over those five years, and describe how the network infrastructure of big Internet players changed to support faster and larger traffic volumes. Then, I show the challenges in classifying network traffic, with particular attention to encryption and to the convergence of the Internet around a few big players. To overcome the limitations of classical approaches, I propose novel algorithms for traffic classification and management leveraging machine learning techniques and, in particular, big data approaches. Exploiting temporal correlation among network events, and benefiting from large datasets of operational traffic, my algorithms learn common traffic patterns of web services and use them for (i) traffic classification and (ii) fine-grained traffic management. My proposals are always validated in experimental environments and then deployed in real operational networks, from which I report the most interesting findings I obtain. I also focus on the Quality of Experience (QoE) of web users, as their satisfaction represents the final objective of computer networks. Again, I show that, using big data approaches, the network can achieve visibility on the quality of users' web browsing. In general, the algorithms I propose help ISPs gain a detailed view of the traffic that flows in their networks, allowing fine-grained traffic classification and management, and real-time monitoring of users' QoE.
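To make the idea of exploiting temporal correlation among network events concrete, here is a minimal sketch (illustrative only, not the thesis' actual algorithms): domains a client contacts within a short window of a known service domain are counted as candidate members of that service's traffic pattern. The window size and record format are assumptions.

```python
# Minimal sketch: mine domains temporally correlated with a seed service.
from collections import Counter, defaultdict

WINDOW = 2.0  # seconds; assumed co-occurrence window

def co_occurring_domains(flows, seed_domain):
    """flows: iterable of (timestamp, client_ip, domain) tuples."""
    by_client = defaultdict(list)
    for ts, client, domain in flows:
        by_client[client].append((ts, domain))
    counts = Counter()
    for events in by_client.values():
        seed_times = [ts for ts, d in events if d == seed_domain]
        for ts, domain in events:
            if domain != seed_domain and any(abs(ts - t0) <= WINDOW for t0 in seed_times):
                counts[domain] += 1
    return counts.most_common()  # domains most often seen near the seed service
```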
Robust URL Classification With Generative Adversarial Networks
Classifying URLs is essential for different applications, such as parental control, URL filtering, and ads/tracking protection. Such systems historically identify URLs by means of regular expressions, even if machine learning alternatives have been proposed to overcome the time-consuming maintenance of classification rules. Classical machine learning algorithms, however, require large samples of URLs to train the models, covering the diverse classes of URLs (i.e., a ground truth), which limits the applicability of the approach. Here we take a first step towards the use of Generative Adversarial Networks (GANs) to classify URLs. GANs are attractive for this problem for two reasons. First, GANs can produce samples of URLs belonging to specific classes even if exposed to a limited training set, outputting both synthetic traces and a robust discriminator. Second, a GAN can be trained to discriminate a class of URLs without being exposed to all other URL classes, i.e., GANs are robust even if not exposed to uninteresting URL classes during training. Experiments on real data show that not only are the generated synthetic traces reasonably realistic, but GAN-based URL classification is also accurate. © Copyright is held by the author/owner(s).
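As an illustration of the approach, the following is a minimal PyTorch sketch, not the paper's architecture: the generator emits per-character distributions over a fixed-length URL (a continuous relaxation so gradients flow), while the discriminator learns to tell them from one-hot encodings of real URLs of the target class. All dimensions and hyperparameters below are assumptions.

```python
# Hedged sketch of a GAN over character-encoded URLs (illustrative sizes).
import torch
import torch.nn as nn

VOCAB, MAX_LEN, NOISE = 64, 40, 32  # assumed charset size, URL length, noise dim

class Generator(nn.Module):
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(NOISE, 256), nn.ReLU(),
            nn.Linear(256, MAX_LEN * VOCAB),
        )
    def forward(self, z):
        logits = self.net(z).view(-1, MAX_LEN, VOCAB)
        return logits.softmax(dim=-1)      # soft one-hot per character

class Discriminator(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(VOCAB, 64, kernel_size=5, padding=2), nn.ReLU(),
            nn.AdaptiveMaxPool1d(1),
        )
        self.out = nn.Linear(64, 1)
    def forward(self, x):                  # x: (batch, MAX_LEN, VOCAB)
        h = self.conv(x.transpose(1, 2)).squeeze(-1)
        return self.out(h)                 # real/fake logit

G, D = Generator(), Discriminator()
opt_g = torch.optim.Adam(G.parameters(), lr=2e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=2e-4)
bce = nn.BCEWithLogitsLoss()

def train_step(real_onehot):               # one-hot URLs of the target class
    fake = G(torch.randn(real_onehot.size(0), NOISE))
    # Discriminator: separate real URLs from generated ones
    d_loss = bce(D(real_onehot), torch.ones(len(real_onehot), 1)) + \
             bce(D(fake.detach()), torch.zeros(len(fake), 1))
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()
    # Generator: try to fool the discriminator
    g_loss = bce(D(fake), torch.ones(len(fake), 1))
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
```

After training, the discriminator doubles as a one-class URL classifier, which matches the paper's observation that a GAN needs no samples from uninteresting classes.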
Impact of Access Line Capacity on Adaptive Video Streaming Quality - A Passive Perspective
Adaptive streaming over HTTP is largely used to deliver live and on-demand video. It works by adjusting video quality according to network conditions. While QoE for different streaming services has been studied, it is still unclear how access line capacity impacts the QoE of broadband users in video sessions. We take a first step toward answering this question by characterizing parameters influencing QoE, such as the frequency of video quality adaptations. We take a passive point of view and analyze a dataset summarizing the video sessions of a large population over one year. We first split customers based on their estimated access line capacity. Then, we quantify how the latter affects QoE metrics by parsing the HTTP requests of Microsoft Smooth Streaming (MSS) services. For selected services, we observe that at least 3 Mbps of downstream capacity is needed to let the player select the best bitrate, while at least 6 Mbps are required to minimize delays in retrieving initial fragments. Surprisingly, customers with faster access lines obtain limited benefits, hinting at restrictions in the design of the services.
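A minimal sketch of such passive parsing follows: MSS fragment URLs embed the requested bitrate in a QualityLevels(&lt;bps&gt;) token, so the share of fragments fetched at the service's top bitrate can be compared across capacity classes. The session record format and the capacity bins below are illustrative assumptions, not the paper's exact methodology.

```python
# Sketch: per-capacity-class share of fragments requested at the top bitrate.
import re
from collections import defaultdict

QL = re.compile(r"QualityLevels\((\d+)\)")  # bitrate token in MSS fragment URLs

def capacity_class(mbps):
    return "<3 Mbps" if mbps < 3 else "3-6 Mbps" if mbps < 6 else ">=6 Mbps"

def top_bitrate_share(sessions, top_bps):
    """sessions: iterable of (access_mbps, [fragment_url, ...]) per video session.
    top_bps: the service's best available bitrate."""
    hits, totals = defaultdict(int), defaultdict(int)
    for mbps, urls in sessions:
        rates = [int(m.group(1)) for m in map(QL.search, urls) if m]
        cls = capacity_class(mbps)
        hits[cls] += sum(r == top_bps for r in rates)
        totals[cls] += len(rates)
    return {c: hits[c] / totals[c] for c in totals if totals[c]}
```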
The stock exchange of influencers: a financial approach for studying fanbase variation trends
In many online social networks (OSNs), a limited portion of profiles emerges and reaches a large base of followers, i.e., the so-called social influencers. One of their main goals is to grow their fanbase to increase their visibility, engaging users through their content. In this work, we propose a novel parallel between the ecosystem of OSNs and the stock exchange market. Followers act as private investors: they follow influencers, i.e., buy stocks, based on their individual preferences and on the information they gather through external sources. In this preliminary study, we show how approaches proposed in the context of the stock exchange market can be successfully applied to social networks. Our case study focuses on 60 Italian Instagram influencers and shows how their followers' short-term trends, obtained through Bollinger bands, closely track those found in external sources, Google Trends in our case, similarly to phenomena already observed in the financial market. Besides providing a strong correlation between these different trends, our results lay the basis for studying social networks with a new lens, linking them with a different domain.
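For concreteness, here is a short sketch of the Bollinger-band computation applied to a follower-count series, using the common 20-sample window and bands at plus/minus two rolling standard deviations (the stock-market defaults, assumed here; the paper may tune them differently).

```python
# Sketch: Bollinger bands over a daily follower-count time series.
import pandas as pd

def bollinger(followers: pd.Series, window: int = 20, k: float = 2.0) -> pd.DataFrame:
    mid = followers.rolling(window).mean()
    std = followers.rolling(window).std()
    return pd.DataFrame({
        "followers": followers,
        "middle": mid,
        "upper": mid + k * std,   # fanbase growing unusually fast
        "lower": mid - k * std,   # fanbase shrinking unusually fast
    })
```

A series breaking the upper band flags a short-term positive trend, which can then be compared against external signals such as Google Trends.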
Disentangling the Information Flood on OSNs: Finding Notable Posts and Topics
Online Social Networks (OSNs) are an integral part of modern life for sharing thoughts, stories, and news. An ecosystem of influencers generates a flood of content in the form of posts, some of which have an unusually high level of engagement with the influencer's fan base. These posts relate to blossoming topics of discussion that generate particular interest among users: the COVID-19 pandemic is a prominent example. Studying these phenomena provides an understanding of the OSN landscape and requires appropriate methods. This paper presents a methodology to discover notable posts and group them according to their related topic. By combining anomaly detection, graph modelling, and community detection techniques, we pinpoint salient events automatically, with the ability to tune their number. We showcase our approach on a large Instagram dataset, extracting from 1.4 million posts notable weekly topics that gained momentum. We then illustrate some use cases ranging from the COVID-19 outbreak to sporting events.
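A minimal sketch of such a pipeline (illustrative, not the paper's exact method): flag posts whose engagement is anomalously high for their author, connect flagged posts that share hashtags, and extract topics as graph communities. The z-score threshold and the hashtag-based linking are assumptions.

```python
# Sketch: anomalous posts -> hashtag co-occurrence graph -> topic communities.
import networkx as nx
from networkx.algorithms.community import greedy_modularity_communities
from statistics import mean, stdev

def notable_posts(posts, z_thresh=3.0):
    """posts: list of dicts with 'id', 'author', 'engagement', 'hashtags'."""
    by_author = {}
    for p in posts:
        by_author.setdefault(p["author"], []).append(p)
    flagged = []
    for items in by_author.values():
        vals = [p["engagement"] for p in items]
        if len(vals) < 3:
            continue
        mu, sd = mean(vals), stdev(vals)
        flagged += [p for p in items
                    if sd > 0 and (p["engagement"] - mu) / sd > z_thresh]
    return flagged

def topics(flagged):
    g = nx.Graph()
    for p in flagged:
        g.add_node(p["id"])
    for i, a in enumerate(flagged):
        for b in flagged[i + 1:]:
            shared = set(a["hashtags"]) & set(b["hashtags"])
            if shared:
                g.add_edge(a["id"], b["id"], weight=len(shared))
    return list(greedy_modularity_communities(g))  # each community = one topic
```

Raising z_thresh shrinks the set of flagged posts, which is one simple way to tune the number of salient events the pipeline reports.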
Measuring Web Speed From Passive Traces
Understanding the Quality of Experience (QoE) of web browsing is key to optimizing services and keeping users' loyalty. This is crucial for both Content Providers and Internet Service Providers (ISPs). Quality is subjective, and the complexity of today's pages challenges its measurement. OnLoad time and SpeedIndex are notable attempts to quantify web performance with objective metrics. However, these metrics can only be computed by instrumenting the browser and, thus, are not available to ISPs. We designed PAIN: PAssive INdicator for ISPs. It is an automatic system to monitor the performance of web pages from passive measurements, and it is open source and available for download. It leverages only flow-level and DNS measurements, which are still possible in the network despite the deployment of HTTPS. With unsupervised learning, PAIN automatically creates a machine learning model from the timeline of requests issued by browsers to render web pages, and uses it to measure web performance in real time. We compared PAIN to indicators based on in-browser instrumentation and found strong correlations between the approaches. PAIN correctly highlights worsening network conditions and provides visibility into web performance. We let PAIN run on a real ISP network and found that it is able to pinpoint performance variations across time and groups of users.
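To illustrate the kind of flow-level timing PAIN builds on, here is a deliberately simplified sketch (not PAIN's actual model): a DNS query for a monitored site marks the start of a visit, and the end of the subsequent burst of flows serves as a proxy for page rendering time. The silence threshold is an assumption.

```python
# Sketch: passive web-speed proxy from the burst of flows after a DNS query.
GAP = 1.5  # seconds of silence taken as the end of the rendering burst (assumed)

def page_load_proxy(start_ts, flow_times, gap=GAP):
    """start_ts: timestamp of the DNS query for the monitored site.
    flow_times: sorted timestamps of flows started by the same client."""
    end = start_ts
    for ts in flow_times:
        if ts < start_ts:
            continue
        if ts - end > gap:      # burst over: no new flows for `gap` seconds
            break
        end = ts
    return end - start_ts       # seconds, a passive web-speed indicator
```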
Realistic testing of RTC applications under mobile networks
The increasing usage of Real-Time Communication (RTC) applications for leisure and remote working calls for realistic and reproducible techniques to test them. These applications are used under very different network conditions, from high-speed broadband networks to noisy wireless links. As such, it is of paramount importance to assess the impact of the network on users' Quality of Experience (QoE), especially when it comes to application mechanisms such as video quality adjustment or transmission of redundant data. In this work, we lay the basis for a system in which a target RTC application is tested in an emulated mobile environment. To this end, we leverage ERRANT, a data-driven emulator which includes 32 distinct profiles modeling mobile network performance under different conditions. As a use case, we opt for Cisco Webex, a popular RTC application. We show how variable network conditions impact packet loss and, in turn, trigger video quality adjustments, impairing users' QoE.
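For a flavor of how such network emulation can be realized, below is a hedged sketch that shapes an interface with Linux tc/netem. The profile values are invented for illustration; they are not ERRANT's actual profiles, and ERRANT's own interface may differ.

```python
# Sketch: apply one emulated mobile-network profile via tc/netem (needs root).
import subprocess

profile = {"rate": "8mbit", "delay": "60ms", "jitter": "15ms", "loss": "0.5%"}  # invented values

def apply_profile(dev="eth0", p=profile):
    # Replace the root qdisc with a netem instance shaping rate, delay, and loss.
    subprocess.run(["tc", "qdisc", "replace", "dev", dev, "root", "netem",
                    "rate", p["rate"], "delay", p["delay"], p["jitter"],
                    "loss", p["loss"]], check=True)

def clear(dev="eth0"):
    subprocess.run(["tc", "qdisc", "del", "dev", dev, "root"], check=True)
```

Switching profiles during a call then lets one observe how the RTC application reacts, e.g., whether packet loss triggers a video quality downgrade.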
Second Data Economy Workshop (DEC)
Welcome to the second ACM DATA ECONOMY WORKSHOP (DEC), co-located with ACM SIGMOD 2023. Data-driven decision making through machine learning (ML) algorithms is transforming the way society and the economy work and is having a profound positive impact on our daily lives. With the exception of very large companies that have both the data and the capabilities to develop powerful ML-driven services, the vast majority of demonstrably possible ML services, from e-health to transportation to predictive maintenance, to name a few, still remain at the level of ideas or prototypes for the simple reason that data, the capabilities to manipulate it, and the business models to bring it to market rarely exist under one roof. Data must somehow meet the ML and business skills that can unleash its full power for society and the economy. This has given rise to an extremely dynamic sector around the Data Economy, involving Data Providers/Controllers and Data Intermediaries, oftentimes in the form of Data Marketplaces or Personal Information Management Systems that let end users control and even monetize their personal data.
Despite its enormous potential and observed initial growth, the Data Economy is still in its early stages and therefore faces an uncertain future and a number of existential challenges. These challenges include a wide range of technical issues that affect multiple disciplines of computer science, including networks and distributed systems, security and privacy, machine learning, and human-computer interaction. The mission of the ACM DEC workshop is to bring together all CS capabilities needed to support the Data Economy.
We would like to thank the entire technical program committee for reviewing and selecting papers for the workshop. We hope you will find the papers interesting and stimulating.