5 research outputs found

    Security and Privacy Attacks with and against Machine Learning

    Both researchers and industry have increased their use of machine learning in new applications with the steady march of the Digital Revolution. However, without full consideration of these rapid changes, undiscovered attack surfaces may remain open, allowing bad actors to breach the security of a system or leak sensitive information. In this work we investigate attacks with and against machine learning (ML), starting in the application space of authentication, which has seen the adoption of ML, before generalizing to any ML model application. We explore a range of attacks: ML-assisted behavioral side-channel attacks against novel authentication systems, random input attacks against the ML models of biometrics, and membership and attribute inference attacks against ML models, which are used in authentication among a host of other sensitive applications. With any proposed attack, there is an obligation to define mitigation strategies. This advancement of knowledge in both attacks and defenses will make our ever-evolving digital world more resilient to external threats. However, in the constant arms race of security and privacy threats, the problem is far from solved, with iterative improvements to be sought on both attacks and defenses. Current defenses are imperfect and come at a tangible cost in either the usability or the utility of the application. The necessity of these defenses cannot be overstated given the looming threat of attack, but we also need to better understand the trade-offs they require if they are to be implemented. Specifically, we describe our successful efforts to rapidly recover a user's secret from observation-resilient authentication schemes (ORAS) through behavioral side-channels. We explore the surprising effectiveness of uniform random inputs in breaching the security of behavioral biometric models. We examine membership and attribute inference attacks in depth to highlight the infeasibility of attribute inference due to the inability to perform strong membership inference, paired with a realigned definition of approximate attribute inference that better reflects the privacy risks posed by an attribute inference attacker. Finally, we evaluate the privacy-utility trade-offs offered by differential privacy as a means of mitigating the membership and attribute inference attacks.
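The privacy-utility trade-off mentioned at the end of this abstract can be illustrated with a minimal sketch. The abstract does not say which differentially private mechanism the thesis evaluates; the Laplace mechanism on a counting query is used here purely as an assumption, and the names `private_count` and `mean_abs_error` are hypothetical.

```python
import random
import statistics


def laplace_noise(scale, rng):
    # The difference of two i.i.d. exponentials with mean `scale`
    # is Laplace-distributed with location 0 and that scale.
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(true_count, epsilon, rng, sensitivity=1.0):
    """Release a count under epsilon-differential privacy (Laplace mechanism).

    Assumption: a counting query, whose sensitivity is 1.
    """
    return true_count + laplace_noise(sensitivity / epsilon, rng)


def mean_abs_error(epsilon, trials=2000, seed=0):
    """Utility loss: average absolute error of the released count."""
    rng = random.Random(seed)
    true_count = 100
    errors = (
        abs(private_count(true_count, epsilon, rng) - true_count)
        for _ in range(trials)
    )
    return statistics.fmean(errors)


if __name__ == "__main__":
    for eps in (0.1, 1.0, 10.0):
        print(f"epsilon={eps}: mean absolute error {mean_abs_error(eps):.3f}")
```

A smaller epsilon gives stronger privacy but a noisier answer; the expected absolute error of this mechanism is sensitivity / epsilon.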

    Study of stochastic and machine learning techniques for anomaly-based Web attack detection

    International Mention in the doctoral degree. Web applications are exposed to different threats and need to be protected. Intrusion Detection Systems (IDSs) are a solution external to the web application that does not require modifying the application’s code in order to protect it. These systems are located in the network, monitoring events and searching for signs of anomalies or threats that can compromise the security of the information systems. IDSs have been applied to traffic analysis of different protocols, such as TCP, FTP or HTTP. Web Application Firewalls (WAFs) are a special case of IDSs that specialize in analyzing HTTP traffic with the aim of safeguarding web applications. The increase in the amount of data traveling through the Internet and the growing sophistication of attacks call for protection mechanisms that are both effective and efficient. This thesis proposes three anomaly-based WAFs that are high-speed, reach high detection results and have a simple design. The anomaly-based approach defines the normal behavior of the web application; actions that deviate from it are considered anomalous. The proposed WAFs work at the application layer, analyzing the payload of HTTP requests. These systems are designed with different detection algorithms in order to compare their results and performance. Two of the proposed systems are based on stochastic techniques: one on statistical techniques and the other on Markov chains. The third WAF presented in this thesis is ML-based. Machine Learning (ML) deals with constructing computer programs that automatically learn with experience, and it can be very helpful in dealing with large amounts of data. Concretely, this third WAF is based on decision trees, given their proven effectiveness in intrusion detection. In particular, four algorithms are employed: C4.5, CART, Random Tree and Random Forest. Typically, two phases are distinguished in IDSs: preprocessing and processing. In the case of the stochastic systems, preprocessing includes feature extraction. The processing phase consists of training the system to learn the normal behavior and later testing how well it classifies incoming requests as either normal or anomalous. The detection models of the systems are implemented either with statistical techniques or with Markov chains, depending on the system considered. For the system based on decision trees, the preprocessing phase comprises feature extraction as well as feature selection. These two phases are optimized. On the one hand, new feature extraction methods are proposed. They combine features extracted by means of expert knowledge and n-grams, and can improve on the detection results of either technique used separately. For feature selection, the Generic Feature Selection (GeFS) measure has been used, which has been proven to be very effective in reducing the number of redundant and irrelevant features. Additionally, for the three systems, a study has been performed to establish the minimum number of requests required to train them in order to achieve a certain detection result. Reducing the number of training requests can greatly help in optimizing the resource consumption of WAFs as well as the data gathering process. Besides designing and implementing the systems, evaluating them is an essential step. For that purpose, a dataset is necessary. Unfortunately, finding labeled and adequate datasets is not an easy task. In fact, the study of the most popular datasets in the intrusion detection field reveals that most of them do not satisfy the requirements for evaluating WAFs. To tackle this situation, this thesis proposes the new CSIC dataset, which satisfies the conditions necessary to evaluate WAFs satisfactorily. The proposed systems have been experimentally evaluated.
For that purpose, the proposed CSIC dataset and the existing ECML/PKDD dataset have been used. The three presented systems have been compared in terms of their detection results, processing time and number of training requests used. For this comparison, the CSIC dataset has been used. In summary, this thesis proposes three WAFs based on stochastic and ML techniques. Additionally, the systems are compared, which makes it possible to determine which system is the most appropriate for each scenario. This work was carried out within the framework of the predoctoral grants of the Junta de Ampliación de Estudios (JAE) of the Agencia Estatal Consejo Superior de Investigaciones Científicas (CSIC). Official Doctoral Programme in Computer Science and Technology. President: Luis Hernández Encinas. Secretary: Juan Manuel Estévez Tapiador. Member: Georg Carl
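The Markov-chain approach described in this abstract (learn normal behavior from request payloads, then flag deviations) can be sketched roughly as follows. This is not the thesis's implementation: the coarse symbol alphabet in `char_class`, the add-one smoothing, the `MarkovDetector` name and the toy payloads are all assumptions made for illustration.

```python
import math
from collections import defaultdict


def char_class(c):
    # Assumed coarse alphabet: letters -> "A", digits -> "D",
    # every other character is kept verbatim.
    if c.isalpha():
        return "A"
    if c.isdigit():
        return "D"
    return c


class MarkovDetector:
    """First-order Markov chain over payload symbols, trained on normal traffic only."""

    def __init__(self, smoothing=1.0):
        self.smoothing = smoothing
        self.counts = defaultdict(lambda: defaultdict(int))
        self.symbols = set()

    def _states(self, payload):
        # "^" and "$" mark the start and end of the payload.
        return ["^"] + [char_class(c) for c in payload] + ["$"]

    def train(self, normal_payloads):
        for payload in normal_payloads:
            states = self._states(payload)
            self.symbols.update(states)
            for a, b in zip(states, states[1:]):
                self.counts[a][b] += 1

    def score(self, payload):
        """Average negative log-probability per transition (higher = more anomalous)."""
        states = self._states(payload)
        vocab = len(self.symbols) + 1  # one extra slot for unseen symbols
        total = 0.0
        for a, b in zip(states, states[1:]):
            row = self.counts.get(a, {})
            seen = sum(row.values())
            prob = (row.get(b, 0) + self.smoothing) / (seen + self.smoothing * vocab)
            total -= math.log(prob)
        return total / (len(states) - 1)


detector = MarkovDetector()
detector.train(["id=123&name=alice", "id=7&name=bob", "id=42&name=carol"])
normal_score = detector.score("id=9&name=dave")
attack_score = detector.score("id=1' OR '1'='1")

if __name__ == "__main__":
    print(f"normal: {normal_score:.2f}  attack: {attack_score:.2f}")
```

A request would be reported as anomalous when its score exceeds a threshold chosen on held-out normal traffic; how that threshold is set is outside what the abstract describes.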

    Click Fraud Detection in Online and In-app Advertisements: A Learning Based Approach

    Click Fraud is the fraudulent act of clicking on pay-per-click advertisements to increase a site’s revenue, to drain revenue from the advertiser, or to inflate the popularity of content on social media platforms. In-app advertisements on mobile platforms are among the most common targets for click fraud, which makes companies hesitant to advertise their products. Fraudulent clicks are supposed to be caught by ad providers as part of their service to advertisers, which is commonly done using machine learning methods. However: (1) there is a lack of research in current literature addressing and evaluating the different techniques of click fraud detection and prevention, (2) threat models composed of active learning systems (smart attackers) can mislead the training process of the fraud detection model by polluting the training data, (3) current deep learning models have significant computational overhead, (4) training data is often in an imbalanced state, and balancing it still results in noisy data that can train the classifier incorrectly, and (5) datasets with high dimensionality cause increased computational overhead and decreased classifier correctness -- while existing feature selection techniques address this issue, they have their own performance limitations. By extending the state-of-the-art techniques in the field of machine learning, this dissertation provides the following solutions: (i) To address (1) and (2), we propose a hybrid deep-learning-based model which consists of an artificial neural network, auto-encoder and semi-supervised generative adversarial network. (ii) As a solution for (3), we present Cascaded Forest and Extreme Gradient Boosting with less hyperparameter tuning. (iii) To overcome (4), we propose a row-wise data reduction method, KSMOTE, which filters out noisy data samples both in the raw data and the synthetically generated samples. 
(iv) For (5), we propose different column-reduction methods such as multi-time-scale Time Series analysis for fraud forecasting, using binary labeled imbalanced datasets and hybrid filter-wrapper feature selection approaches.
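For context on point (4), the sketch below shows plain SMOTE-style oversampling, the general technique that rebalances a dataset by interpolating between minority-class samples. It is not KSMOTE: the abstract does not describe how KSMOTE filters noisy samples, so no filtering step is attempted here, and the function name `smote` and its parameters are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic minority samples by linear interpolation.

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    Requires at least two minority samples.
    """
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest neighbours of `base` among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        other = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(base, other)))
    return synthetic


if __name__ == "__main__":
    fraud_clicks = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
    for point in smote(fraud_clicks, 3):
        print(point)
```

Because synthetic points are interpolated from existing ones, any noise in the minority class propagates into the generated samples, which is the problem the abstract says its row-wise reduction method targets.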

    Anomaly-based network intrusion detection enhancement by prediction threshold adaptation of binary classification models

    Network traffic exhibits a high level of variability over short periods of time. This variability negatively affects the performance (accuracy) of anomaly-based network Intrusion Detection Systems (IDS) that are built using predictive models in a batch-learning setup. This thesis investigates how adapting the discriminating threshold of model predictions specifically to the evaluated traffic improves the detection rates of these intrusion detection models. Specifically, this thesis studied the adaptability features of three well-known Machine Learning algorithms: C5.0, Random Forest, and Support Vector Machine. The ability of these algorithms to adapt their prediction thresholds was assessed and analysed under different scenarios that simulated real-world settings using the prospective sampling approach. A new dataset (STA2018) was generated for this thesis and used for the analysis. This thesis has demonstrated empirically the importance of threshold adaptation in improving the accuracy of detection models when training and evaluation (test) traffic have different statistical properties. Further investigation was undertaken to analyse the effects of feature selection and data balancing processes on a model’s accuracy when evaluation traffic with different significant features was used. The effects of threshold adaptation on reducing the accuracy degradation of these models were statistically analysed. The results showed that, of the three compared algorithms, Random Forest was the most adaptable and had the highest detection rates. This thesis then extended the analysis to apply threshold adaptation on sampled traffic subsets, using different sample sizes, sampling strategies and label error rates. This investigation showed the robustness of the Random Forest algorithm in identifying the best threshold. The Random Forest algorithm only needed a sample that was 0.05% of the original evaluation traffic to identify a discriminating threshold with an overall accuracy rate of nearly 90% of the optimal threshold. "This research was supported and funded by the Government of the Sultanate of Oman represented by the Ministry of Higher Education and the Sultan Qaboos University." -- p. i
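The idea of threshold adaptation described in this abstract can be sketched as follows: keep the trained model fixed, and re-select the decision threshold on a small labelled sample of the evaluation traffic. This is a minimal sketch under assumptions; the scores, labels, sample choice and the helper names `accuracy_at` and `best_threshold` are invented for illustration and do not come from the thesis or the STA2018 dataset.

```python
def accuracy_at(threshold, scores, labels):
    """Accuracy when every score >= threshold is predicted as an attack (label 1)."""
    correct = sum((s >= threshold) == bool(y) for s, y in zip(scores, labels))
    return correct / len(scores)


def best_threshold(scores, labels):
    """Pick the candidate threshold that maximises accuracy on a labelled sample."""
    # Candidates: each observed score, plus +inf (predict "normal" for everything).
    candidates = sorted(set(scores)) + [float("inf")]
    return max(candidates, key=lambda t: accuracy_at(t, scores, labels))


# Hypothetical model scores on evaluation traffic whose distribution has
# shifted: attacks now score lower than the default cut-off of 0.5.
eval_scores = [0.05, 0.10, 0.15, 0.20, 0.22, 0.30, 0.35, 0.40, 0.45, 0.60]
eval_labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

# A small labelled sample drawn from the evaluation traffic.
sample_idx = [1, 4, 5, 8]
sample_scores = [eval_scores[i] for i in sample_idx]
sample_labels = [eval_labels[i] for i in sample_idx]

adapted = best_threshold(sample_scores, sample_labels)

if __name__ == "__main__":
    print("default threshold 0.5:", accuracy_at(0.5, eval_scores, eval_labels))
    print("adapted threshold", adapted, ":", accuracy_at(adapted, eval_scores, eval_labels))
```

In this toy data the adapted threshold recovers the accuracy lost to the shift; on real traffic the gain depends on the sample size, sampling strategy and label error rate, which are the factors the abstract says were studied.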