3 research outputs found

    Techniques for the Analysis of Modern Web Page Traffic using Anonymized TCP/IP Headers

    Analysis of traces of network traffic is a methodology that has been widely adopted for studying the Web for several decades. However, due to recent privacy legislation and increasing adoption of traffic encryption, often only anonymized TCP/IP headers are accessible in traffic traces. For traffic traces to remain useful for analysis, techniques must be developed to glean insight using this limited header information. This dissertation evaluates approaches for classifying individual web page downloads — referred to as web page classification — when only anonymized TCP/IP headers are available. The context in which web page classification is defined and evaluated in this dissertation differs from prior traffic classification methods in three ways. First, the impact of diversity in client platforms (browsers, operating systems, device type, and vantage point) on network traffic is explicitly considered. Second, the challenge of overlapping traffic from multiple web pages is explicitly considered, and demultiplexing approaches are evaluated (web page segmentation). Lastly, unlike prior work on traffic classification, four orthogonal labeling schemes are considered (genre-based, device-based, navigation-based, and video streaming-based) — these are of value in several web-related applications, including privacy analysis, user behavior modeling, traffic forecasting, and potentially behavioral ad-targeting. We conduct evaluations using large collections of both synthetically generated data and browsing data from real users. Our analysis shows that the client platform choice has a statistically significant impact on web traffic. It also shows that change point detection methods, a new class of segmentation approach, outperform existing idle time-based methods.
Overall, this work establishes that web page classification performance can be improved by: (i) incorporating client platform differences into the feature selection and training methodology, and (ii) utilizing better-performing web page segmentation approaches. This research increases overall awareness of the challenges associated with the analysis of modern web traffic. It demonstrates the importance of, and advocates for, considering real-world factors such as client platform diversity and overlapping traffic from multiple streams when developing and evaluating traffic analysis techniques.
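    The segmentation comparison in the abstract can be illustrated with a toy sketch. The snippet below contrasts a simple idle-time splitter with a minimal mean-shift change point detector applied to packet inter-arrival times; the gap threshold, window size, and both helper functions are hypothetical illustrations, not the dissertation's actual methods.

    ```python
    # Toy comparison of two ways to segment a packet-timestamp stream into
    # per-page downloads (illustrative only; thresholds are made up).

    def idle_time_segments(timestamps, idle_gap=1.0):
        """Start a new segment whenever the gap between packets exceeds idle_gap."""
        segments, current = [], [timestamps[0]]
        for prev, ts in zip(timestamps, timestamps[1:]):
            if ts - prev > idle_gap:
                segments.append(current)
                current = []
            current.append(ts)
        segments.append(current)
        return segments

    def mean_shift_change_points(values, window=3, threshold=1.0):
        """Flag index i as a change point when the mean of the window after i
        differs from the mean of the window before i by more than threshold."""
        points = []
        for i in range(window, len(values) - window + 1):
            before = sum(values[i - window:i]) / window
            after = sum(values[i:i + window]) / window
            if abs(after - before) > threshold:
                points.append(i)
        return points

    if __name__ == "__main__":
        # Two bursts of packets separated by a long idle gap.
        ts = [0.0, 0.1, 0.2, 0.3, 5.0, 5.1, 5.2, 5.3]
        print(len(idle_time_segments(ts)))          # 2 segments
        gaps = [b - a for a, b in zip(ts, ts[1:])]  # inter-arrival times
        print(mean_shift_change_points(gaps))       # change point at the big gap
    ```

    A real change point detector would use a statistically principled test (e.g. CUSUM or likelihood-ratio methods) rather than a fixed mean-difference threshold, but the shape of the computation is the same: look for a point where the distribution of inter-arrival times shifts.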

    Robust Fingerprinting Method for Webtoon Identification in Large-Scale Databases

    Webtoon, a portmanteau of web and cartoon, denotes a cartoon that has been published on a website. Recently, webtoons have become popular in the global Internet market. Unfortunately, copyright infringement has emerged as a new challenge, resulting in illegal profit gains. Moreover, it is difficult to apply watermarking to published webtoons, because they need to be watermarked prior to publication. In order to deal with a large number of published webtoons, it is necessary to identify each webtoon using fingerprints extracted from its images. In this paper, we propose an identification framework to detect copyright infringement due to the illegal copying and sharing of webtoons. The proposed identification framework consists of the following main stages: fingerprint generation, indexing, and fingerprint matching. In the fingerprint generation stage, translation-invariant and temporally localized fingerprints are created for distortion-robust identification. An inverted index of the database is implemented, using the visual word clustering method and the MapReduce framework, to store the fingerprints efficiently and to minimize the search time. In addition, we propose a two-step matching process for faster implementation. Moreover, we measured the identification accuracy and the matching time on a large-scale database in the presence of various distortions. Through rigorous simulations, we achieved an identification accuracy of 97.5% within 10 s for each webtoon.
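    The inverted-index and two-step matching stages can be sketched in miniature. In the toy version below, each webtoon's fingerprint is reduced to a set of visual-word IDs, step one shortlists candidates by counting shared words via the inverted index, and step two ranks the shortlist with Jaccard similarity; the word sets, the `min_shared` cutoff, and the similarity function are hypothetical stand-ins for the paper's clustering-based vocabulary and matching score.

    ```python
    # Toy sketch of an inverted index plus two-step fingerprint matching
    # (coarse candidate lookup, then fine re-ranking). Illustrative only.
    from collections import defaultdict

    def build_inverted_index(fingerprints):
        """Map each visual word to the set of webtoon IDs that contain it."""
        index = defaultdict(set)
        for webtoon_id, words in fingerprints.items():
            for w in words:
                index[w].add(webtoon_id)
        return index

    def two_step_match(query_words, index, fingerprints, min_shared=2):
        # Step 1 (coarse): shortlist webtoons sharing enough visual words.
        hits = defaultdict(int)
        for w in query_words:
            for webtoon_id in index.get(w, ()):
                hits[webtoon_id] += 1
        candidates = [wid for wid, n in hits.items() if n >= min_shared]
        # Step 2 (fine): rank the shortlist by Jaccard similarity of word sets.
        def jaccard(a, b):
            return len(a & b) / len(a | b)
        return max(candidates,
                   key=lambda wid: jaccard(set(query_words), set(fingerprints[wid])),
                   default=None)

    if __name__ == "__main__":
        db = {"toon_a": {1, 2, 3, 4}, "toon_b": {3, 4, 5, 6}, "toon_c": {7, 8}}
        idx = build_inverted_index(db)
        print(two_step_match({2, 3, 4}, idx, db))  # toon_a: highest overlap
    ```

    The point of the two-step design is that the cheap inverted-index lookup prunes most of a large database before the more expensive per-candidate comparison runs, which is what keeps matching time low at scale.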

    Machine Learning Algorithm for the Scansion of Old Saxon Poetry

    Several scholars have designed tools to perform the automatic scansion of poetry in many languages, but none of these tools deal with Old Saxon or Old English. This project aims to be a first attempt to create a tool for these languages. We implemented a Bidirectional Long Short-Term Memory (BiLSTM) model to perform the automatic scansion of Old Saxon and Old English poems. Since this model uses supervised learning, we manually annotated the Heliand manuscript and used the resulting corpus as a labeled dataset to train the model. The evaluation of the algorithm's performance yielded 97% accuracy and a 99% weighted average for precision, recall, and F1 score. In addition, we tested the model with some verses from the Old Saxon Genesis and some from The Battle of Brunanburh, and we observed that the model predicted almost all Old Saxon metrical patterns correctly but misclassified the majority of the Old English input verses.
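    The two metrics quoted in the abstract can be reproduced from first principles. The sketch below computes overall accuracy and a support-weighted average F1 over per-label scores; the metrical-pattern labels ("A" through "C", echoing Sievers' verse types) and the example data are invented for illustration.

    ```python
    # Toy computation of accuracy and support-weighted F1 for a sequence of
    # predicted metrical-pattern labels. Labels and data are illustrative.
    from collections import Counter

    def accuracy(y_true, y_pred):
        return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

    def weighted_f1(y_true, y_pred):
        """Average per-label F1 scores, weighted by each label's support."""
        support = Counter(y_true)
        total = 0.0
        for label, n in support.items():
            tp = sum(t == p == label for t, p in zip(y_true, y_pred))
            fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
            fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
            precision = tp / (tp + fp) if tp + fp else 0.0
            recall = tp / (tp + fn) if tp + fn else 0.0
            f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
            total += n * f1
        return total / len(y_true)

    if __name__ == "__main__":
        y_true = ["A", "A", "B", "C", "A", "B"]  # gold metrical patterns
        y_pred = ["A", "A", "B", "A", "A", "B"]  # model predictions
        print(accuracy(y_true, y_pred))   # 5 of 6 verses correct
        print(weighted_f1(y_true, y_pred))
    ```

    Weighting by support means frequent patterns dominate the average, which is why a model can post a high weighted F1 while still failing on rare labels, as seems to happen with the Old English verses here.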