156 research outputs found

    Darknet Traffic Analysis A Systematic Literature Review

    Full text link
    The primary objective of an anonymity tool is to protect the anonymity of its users through strong encryption and obfuscation techniques. As a result, it becomes very difficult to monitor and identify users' activities on these networks. Such systems also have strong defensive mechanisms that protect users against potential risks, including the extraction of traffic characteristics and website fingerprinting. However, strong anonymity also functions as a refuge for those involved in illicit activities who aim to avoid being traced on the network. Consequently, a substantial body of research has been undertaken to examine and classify encrypted traffic using machine learning techniques. This paper presents a comprehensive examination of the existing approaches to categorizing anonymous traffic as well as encrypted network traffic inside the darknet. It also analyses methods that apply machine learning to darknet traffic in order to monitor and identify attacks inside the darknet. Comment: 35 pages, 13 figures

    Traffic Analysis Resistant Infrastructure

    Get PDF
    Network traffic analysis uses metadata to infer information from traffic flows. A network traffic flow is identified by the tuple of source IP, source port, destination IP, and destination port. Additional information is derived from packet length, flow size, interpacket delay, JA3 signature, and IP header options. Even connections using TLS leak the site name and cipher suite to observers. This metadata can profile groups of users or individual behaviors, and statistical properties yield even more information. A hidden Markov model can track the state of a protocol where each state transition results in an observation. Format Transforming Encryption (FTE) encodes data as the payload of another protocol; the emulated protocol is called the host protocol. Observation-based FTE is a particular case of FTE that uses real observations from the host protocol for the transformation. By communicating using a shared dictionary according to the predefined protocol, it becomes difficult to detect the traffic as anomalous. Combining observation-based FTE with hidden Markov models (HMMs) emulates every aspect of a host protocol. Ideal host protocols would cause significant collateral damage if blocked (protected) and do not contain dynamic handshakes or states (static). We use protected static protocols with the Protocol Proxy--a proxy that defines the syntax of a protocol using an observation-based FTE and transforms data to payloads with actual field values. The Protocol Proxy massages the outgoing packet's interpacket delay to match the host protocol using an HMM. The HMM ensures the outgoing traffic is statistically equivalent to the host protocol. The Protocol Proxy is a covert channel, a method of communication with a low probability of detection (LPD). These covert channels trade off throughput for LPD. The multipath TCP (mpTCP) Linux kernel module splits a TCP stream across multiple interfaces. Two potential architectures involve splitting a covert channel across several interfaces (multipath) or splitting a single TCP stream across multiple covert channels (multisession). Splitting a covert channel across multiple interfaces leads to higher throughput but is classified as mpTCP traffic. Splitting a TCP flow across multiple covert channels is not as performant as the previous case, but it provides added obfuscation and resiliency: each covert channel is independent of the others, and a channel failure is recoverable. The multipath and multisession frameworks independently address the issues associated with covert channels, and each tool addresses a distinct challenge. The Protocol Proxy provides anonymity in a setting where detection could have critical consequences. The mpTCP kernel module offers an architecture that increases throughput despite the channel's low-bandwidth restrictions. Fusing these architectures improves the goodput of the Protocol Proxy without sacrificing the low probability of detection.
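    The interpacket-delay shaping idea can be illustrated with a small sketch. The following Python fragment is an illustration, not the authors' Protocol Proxy: it samples delays from a two-state hidden Markov model so outgoing packets are paced with timing statistics resembling a host protocol. The transition and emission parameters are made-up placeholders.

    import time
    import numpy as np

    # Two hypothetical timing states with illustrative parameters.
    TRANSITIONS = np.array([[0.9, 0.1],    # P(next state | current state)
                            [0.2, 0.8]])
    DELAY_MEANS = np.array([0.05, 0.40])   # mean interpacket delay per state (s)
    DELAY_STDS  = np.array([0.01, 0.05])

    def sample_delays(n_packets, rng=np.random.default_rng()):
        """Generate n_packets interpacket delays from the HMM."""
        state, delays = 0, []
        for _ in range(n_packets):
            delays.append(max(0.0, rng.normal(DELAY_MEANS[state], DELAY_STDS[state])))
            state = rng.choice(2, p=TRANSITIONS[state])
        return delays

    def send_paced(packets, send_fn):
        """Send packets, sleeping between them so the timing follows the model."""
        for pkt, delay in zip(packets, sample_delays(len(packets))):
            send_fn(pkt)
            time.sleep(delay)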

    Visual Causal Feature Learning

    Get PDF
    We provide a rigorous definition of the visual cause of a behavior that is broadly applicable to visually driven behavior in humans, animals, neurons, robots, and other perceiving systems. Our framework generalizes standard accounts of causal learning to settings in which the causal variables need to be constructed from micro-variables. We prove the Causal Coarsening Theorem, which allows us to gain causal knowledge from observational data with minimal experimental effort. The theorem provides a connection to standard inference techniques in machine learning that identify features of an image that correlate with, but may not cause, the target behavior. Finally, we propose an active learning scheme to learn a manipulator function that performs optimal manipulations on the image to automatically identify the visual cause of a target behavior. We illustrate our inference and learning algorithms in experiments based on both synthetic and real data. Comment: Accepted at UAI 201

    Activity understanding and unusual event detection in surveillance videos

    Get PDF
    PhD thesis. Computer scientists have made ceaseless efforts to replicate the cognitive video understanding abilities of human brains in autonomous vision systems. As video surveillance cameras become ubiquitous, there is a surge in studies on automated activity understanding and unusual event detection in surveillance videos. Nevertheless, video content analysis in public scenes remains a formidable challenge due to intrinsic difficulties such as severe inter-object occlusion in crowded scenes and the poor quality of recorded surveillance footage. Moreover, it is nontrivial to achieve robust detection of unusual events, which are rare, ambiguous, and easily confused with noise. This thesis proposes solutions for resolving ambiguous visual observations and overcoming the unreliability of conventional activity analysis methods by exploiting multi-camera visual context and human feedback. The thesis first demonstrates the importance of learning visual context for establishing reliable reasoning on observed activity in a camera network. In the proposed approach, a new Cross Canonical Correlation Analysis (xCCA) is formulated to discover and quantify time-delayed pairwise correlations of regional activities observed within and across multiple camera views. This thesis shows that learning time-delayed pairwise activity correlations offers valuable contextual information for (1) spatial and temporal topology inference of a camera network, (2) robust person re-identification, and (3) accurate activity-based video temporal segmentation. Crucially, in contrast to conventional methods, the proposed approach does not rely on either intra-camera or inter-camera object tracking; it can thus be applied to low-quality surveillance videos featuring severe inter-object occlusions. Second, to detect global unusual events across multiple disjoint cameras, this thesis extends visual context learning from pairwise relationships to global time-delayed dependencies between regional activities. Specifically, a Time Delayed Probabilistic Graphical Model (TD-PGM) is proposed to model the multi-camera activities and their dependencies. Subtle global unusual events are detected and localised using the model as context-incoherent patterns across multiple camera views. In the model, different nodes represent activities in different decomposed regions from different camera views, and the directed links between nodes encode time-delayed dependencies between activities observed within and across camera views. In order to learn optimised time-delayed dependencies in a TD-PGM, a novel two-stage structure learning approach is formulated by combining constraint-based and score-based structure learning methods. Third, to cope with visual context changes over time, this two-stage structure learning approach is extended to permit tractable incremental update of both TD-PGM parameters and its structure. As opposed to most existing studies that assume a static model once learned, the proposed incremental learning allows a model to adapt itself to reflect changes in the current visual context, such as subtle behaviour drift over time or the removal/addition of cameras. Importantly, the incremental structure learning is achieved without either exhaustive search in a large graph structure space or storing all past observations in memory, making the proposed solution memory and time efficient. Fourth, an active learning approach is presented to incorporate human feedback for on-line unusual event detection. Contrary to most existing unsupervised methods that perform passive mining for unusual events, the proposed approach automatically requests supervision for critical points to resolve ambiguities of interest, leading to more robust detection of subtle unusual events. The active learning strategy is formulated as a stream-based solution, i.e. it makes a decision on-the-fly on whether to request a label for each unlabelled sample observed in sequence. It adaptively selects between two active learning criteria, namely a likelihood criterion and an uncertainty criterion, to achieve (1) discovery of unknown event classes and (2) refinement of the classification boundary. The effectiveness of the proposed approaches is validated using videos captured from busy public scenes such as underground stations and traffic intersections.
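    The stream-based query rule can be sketched in a few lines of Python. This is a hypothetical illustration of how a likelihood criterion and an uncertainty criterion might be combined, not the thesis implementation; the thresholds and the form of the model scores are assumptions.

    import numpy as np

    def should_query(class_scores, lik_threshold=0.05, margin_threshold=0.15):
        """Decide on-the-fly whether to request a label for one streamed sample.

        class_scores: per-class likelihoods or posterior scores for the known
        event classes, produced by whatever model is in use.
        """
        scores = np.asarray(class_scores, dtype=float)
        # Likelihood criterion: the sample fits no known class well,
        # so it may belong to an undiscovered event class.
        if scores.max() < lik_threshold:
            return True
        # Uncertainty criterion: the two best classes are nearly tied,
        # so a label would help refine the classification boundary.
        top_two = np.sort(scores)[-2:]
        return (top_two[1] - top_two[0]) < margin_threshold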

    Deep neural networks for automated detection of marine mammal species

    Get PDF
    Authors thank the Bureau of Ocean Energy Management for the funding of MARU deployments, Excelerate Energy Inc. for the funding of Autobuoy deployment, and Michael J. Weise of the US Office of Naval Research for support (N000141712867). Deep neural networks have advanced the field of detection and classification and allowed for effective identification of signals in challenging data sets. Numerous time-critical conservation needs may benefit from these methods. We developed and empirically studied a variety of deep neural networks to detect the vocalizations of endangered North Atlantic right whales (Eubalaena glacialis). We compared the performance of these deep architectures to that of traditional detection algorithms for the primary vocalization produced by this species, the upcall. We show that deep-learning architectures are capable of producing false-positive rates that are orders of magnitude lower than those of alternative algorithms while substantially increasing the ability to detect calls. We demonstrate that a deep neural network trained with recordings from a single geographic region recorded over a span of days is capable of generalizing well to data from multiple years and across the species’ range, and that the low false-positive rate makes the output of the algorithm amenable to quality control for verification. The deep neural networks we developed are relatively easy to implement with existing software, and may provide new insights applicable to the conservation of endangered species.
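    As a rough illustration of the kind of detector described, a minimal PyTorch network mapping a spectrogram patch to an upcall probability could look like the sketch below. The layer sizes and the single-channel 128x128 spectrogram input are assumptions, not the architectures studied in the paper.

    import torch
    import torch.nn as nn

    class UpcallNet(nn.Module):
        """Toy CNN: spectrogram patch -> probability an upcall is present."""
        def __init__(self):
            super().__init__()
            self.features = nn.Sequential(
                nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
                nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1),
            )
            self.classifier = nn.Linear(64, 1)

        def forward(self, spectrogram):            # (batch, 1, freq, time)
            x = self.features(spectrogram).flatten(1)
            return torch.sigmoid(self.classifier(x))

    # Example: score a batch of 8 random spectrogram patches.
    scores = UpcallNet()(torch.rand(8, 1, 128, 128))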

    Web usage mining for click fraud detection

    Get PDF
    Internship carried out at AuditMark and supervised by Eng. Pedro Fortuna. Integrated master's thesis. Informatics and Computing Engineering. Faculdade de Engenharia, Universidade do Porto. 201

    Distributed Load Testing by Modeling and Simulating User Behavior

    Get PDF
    Modern human-machine systems such as microservices rely upon agile engineering practices, which require changes to be tested and released more frequently than classically engineered systems. A critical step in the testing of such systems is the generation of realistic workloads, or load testing. Generated workload emulates the expected behaviors of users and machines within a system under test in order to find potentially unknown failure states. Typical testing tools rely on static testing artifacts to generate realistic workload conditions. Such artifacts can be cumbersome and costly to maintain; however, even model-based alternatives may fail to adapt to changes in a system or its usage. Lack of adaptation can prevent the integration of load testing into system quality assurance, leading to an incomplete evaluation of system quality. The goal of this research is to improve the state of software engineering by addressing open challenges in load testing of human-machine systems with a novel process that a) models and classifies user behavior from streaming and aggregated log data, b) adapts to changes in system and user behavior, and c) generates distributed workload by realistically simulating user behavior. This research contributes a Learning, Online, Distributed Engine for Simulation and Testing based on the Operational Norms of Entities within a system (LODESTONE): a novel process for distributed load testing by modeling and simulating user behavior. We specify LODESTONE within the context of a human-machine system to illustrate distributed adaptation and execution in load testing processes. LODESTONE uses log data to generate and update user behavior models, cluster them into similar behavior profiles, and instantiate distributed workload on software systems. We analyze user behavioral data having differing characteristics to replicate human-machine interactions in a modern microservice environment. We discuss tools, algorithms, software design, and implementation in two different computational environments: client-server and cloud-based microservices. We illustrate the advantages of LODESTONE through a qualitative comparison of key feature parameters and experimentation based on shared data and models. LODESTONE continuously adapts to changes in the system under test, which allows for the integration of load testing into the quality assurance process for cloud-based microservices.
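    One step of such a process, grouping users into behavior profiles from aggregated log counts, can be sketched as follows. This is a hypothetical illustration, not the LODESTONE implementation; the choice of per-endpoint request counts as features and of k-means with three profiles is an assumption.

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.preprocessing import StandardScaler

    def cluster_user_profiles(per_user_endpoint_counts, n_profiles=3):
        """Map each user to a behavior profile based on their request counts."""
        users = list(per_user_endpoint_counts)
        X = StandardScaler().fit_transform(
            np.array([per_user_endpoint_counts[u] for u in users], dtype=float))
        labels = KMeans(n_clusters=n_profiles, n_init=10).fit_predict(X)
        return dict(zip(users, labels))

    # Example: request counts over [login, search, checkout] for three users.
    profiles = cluster_user_profiles({"u1": [5, 40, 1],
                                      "u2": [6, 38, 0],
                                      "u3": [1, 2, 20]})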

    Classifying tor traffic using character analysis

    Get PDF
    Tor is a privacy-preserving network that enables users to browse the Internet anonymously. Although the prospect of such anonymity is welcomed in many quarters, Tor can also be used for malicious purposes, prompting the need to monitor Tor network connections. Most traffic classification methods depend on flow-based features, due to traffic encryption. However, these features can be less reliable due to issues like asymmetric routing, and processing multiple packets can be time-intensive. In light of Tor’s sophisticated multilayered payload encryption compared with nonTor encryption, our research explored patterns in the encrypted data of both networks, challenging conventional encryption theory, which assumes that ciphertexts should not be distinguishable from random strings of equal length. Our novel approach leverages machine learning to differentiate Tor from nonTor traffic using only the encrypted payload. We focused on extracting statistical hex character-based features from their encrypted data. For consistent findings, we drew from two datasets: a public one, which was divided into eight application types for more granular insight, and a private one. Both datasets covered Tor and nonTor traffic. We developed a custom Python script called Charcount to extract relevant data and features accurately. To verify our results’ robustness, we utilized both Weka and scikit-learn for classification. In our first line of research, we conducted hex character analysis on the encrypted payloads of both Tor and nonTor traffic using statistical testing. Our investigation revealed a significant differentiation rate between Tor and nonTor traffic of 95.42% for the public dataset and 100% for the private dataset. The second phase of our study aimed to distinguish between Tor and nonTor traffic using machine learning, focusing on encrypted payload features that are independent of length. In our evaluations, the public dataset yielded an average accuracy of 93.56% when classified with the Decision Tree (DT) algorithm in scikit-learn, and 95.65% with the J48 algorithm in Weka. For the private dataset, the accuracies were 95.23% and 97.12%, respectively. Additionally, we found that the combination of WrapperSubsetEval+BestFirst with the J48 classifier both enhanced accuracy and optimized processing efficiency. In conclusion, our study contributes to both demonstrating the distinction between Tor and nonTor traffic and achieving efficient classification of both types of traffic using features derived exclusively from a single encrypted payload packet. This work holds significant implications for cybersecurity and points towards further advancements in the field.
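    The length-independent, hex-character-based features described above can be sketched as follows. This is an illustrative fragment, not the Charcount script from the study: it computes the normalized frequency of each of the 16 hex characters in a single encrypted payload, producing a fixed-length vector that could be fed to a decision-tree classifier.

    from collections import Counter

    HEX_CHARS = "0123456789abcdef"

    def hex_char_features(payload: bytes):
        """Normalized frequency of each hex character in one payload."""
        counts = Counter(payload.hex())
        total = max(1, sum(counts.values()))
        return [counts.get(c, 0) / total for c in HEX_CHARS]

    # Example: the resulting 16-dimensional vector is independent of payload
    # length and could be used with, e.g., sklearn.tree.DecisionTreeClassifier.
    features = hex_char_features(b"\x17\x03\x03\x00\x2a" + b"\x8f" * 64)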