16 research outputs found

    Detection of Software Vulnerability Communication in Expert Social Media Channels: A Data-driven Approach

    Get PDF
    Conceptually, a vulnerability is: a flaw or weakness in a system's design, implementation, or operation and management that could be exploited to violate the system's security policy. Some of these flaws can go undetected and exploited for long periods of time after software release. Although some software providers are making efforts to avoid this situation, inevitably, users are still exposed to vulnerabilities that allow criminal hackers to take advantage. These vulnerabilities are constantly discussed in specialised forums on social media. Therefore, from a cyber security standpoint, the information found in these places can be used for countermeasure actions against malicious exploitation of software. However, manual inspection of the vast quantity of shared content in social media is impractical. For this reason, in this thesis, we analyse the real applicability of supervised classification models to automatically detect software vulnerability communication in expert social media channels. We cover the following three principal aspects:

    Firstly, we investigate the applicability of classification models on five different datasets collected from three Internet domains: Dark Web, Deep Web and Surface Web. Since supervised models require labelled data, we have provided a systematic labelling process using multiple annotators to guarantee accurate labels for carrying out experiments. Using these datasets, we have investigated the classification models with different combinations of learning-based algorithms and traditional feature representations. Also, by oversampling the positive instances, we have achieved an increase of 5% in Positive Recall (on average) in these models. On top of that, we have applied Feature Reduction, Feature Extraction and Feature Selection techniques, which reduced the dimensionality of these models without damaging the accuracy, thus providing computationally efficient models.

    Furthermore, in addition to traditional feature representations, we have investigated the effect of robust language models, such as Word Embedding (WEMB) and Sentence Embedding (SEMB), on the accuracy of classification models. Regarding WEMB, our experiments have shown that this model trained on a small security-vocabulary dataset provides results comparable with WEMB trained on a very large general-vocabulary dataset. Regarding the SEMB model, our experiments have shown that it outperforms the WEMB model in detecting vulnerability communication, recording 8% Avg. Class Accuracy and 74% Positive Recall. In addition, we investigate two Deep Learning algorithms as classifiers, text CNN (Convolutional Neural Network) and RNN (Recurrent Neural Network)-based algorithms, which have improved our model, resulting in the best overall performance for our task.
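
    To make the SEMB setup concrete, here is a minimal sketch of sentence-embedding classification, assuming the sentence-transformers and scikit-learn libraries; the model name ("all-MiniLM-L6-v2"), the logistic-regression classifier and the toy posts are illustrative stand-ins, not the thesis's actual datasets or configuration:

        from sentence_transformers import SentenceTransformer
        from sklearn.linear_model import LogisticRegression

        # Toy stand-ins for labelled forum posts (1 = vulnerability communication).
        posts = ["New RCE exploit for the login service, PoC attached",
                 "Looking for cheap GPUs, any recommendations?"]
        labels = [1, 0]

        # Encode each post as a single sentence embedding (SEMB).
        encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model choice
        X = encoder.encode(posts)

        # Any downstream classifier works; logistic regression keeps the sketch small.
        clf = LogisticRegression().fit(X, labels)
        print(clf.predict(encoder.encode(["Buffer overflow found in parser"])))

    The same embeddings could instead feed the CNN- or RNN-based classifiers the abstract mentions; the shallow classifier here only keeps the sketch self-contained.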

    Eavesdropping Hackers: Detecting Software Vulnerability Communication on Social Media Using Text Mining

    Get PDF
    Abstract—Cyber security is striving to find new forms of protection against hacker attacks. An emerging approach nowadays is the investigation of security-related messages exchanged on Deep/Dark Web and even Surface Web channels. This approach can be supported by the use of supervised machine learning models and text mining techniques. In our work, we compare a variety of machine learning algorithms, text representations and dimension reduction approaches in terms of their detection accuracy for software-vulnerability-related communications. Given the imbalanced nature of the three public datasets used, we investigate appropriate sampling approaches to boost the detection accuracies of our models. In addition, we examine how feature reduction techniques, such as Document Frequency Reduction, Chi-square and Singular Value Decomposition (SVD), can be used to reduce the number of features of the model without impacting the detection performance. We conclude that: (1) a Support Vector Machine (SVM) algorithm used with a traditional Bag of Words representation achieved the highest accuracies; (2) increasing the minority class with the Random Oversampling technique improves the detection performance of the model by 5% on average; and (3) the number of features of the model can be reduced by up to 10% without affecting the detection performance. Also, we have provided the labelled dataset used in this work for further research. These findings can be used to support Cyber Threat Intelligence (CTI) with respect to the use of text mining techniques for detecting security-related communication.
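
    A hedged sketch of the pipeline that findings (1)-(3) describe: Bag of Words features, Random Oversampling of the minority class, chi-square feature reduction and a linear SVM. The library choices (scikit-learn, imbalanced-learn) and the toy documents are assumptions for illustration, not the paper's exact setup:

        from sklearn.feature_extraction.text import CountVectorizer
        from sklearn.feature_selection import SelectKBest, chi2
        from sklearn.svm import LinearSVC
        from imblearn.over_sampling import RandomOverSampler

        # Toy corpus: one vulnerability-related post among unrelated chatter.
        docs = ["SQL injection in the ticketing portal, patch pending",
                "Selling concert tickets tonight",
                "Heap overflow PoC for the media codec",
                "What is the best pizza around here?"]
        y = [1, 0, 1, 0]

        # Bag of Words representation.
        X = CountVectorizer().fit_transform(docs)

        # Balance the classes by duplicating minority-class rows.
        X_bal, y_bal = RandomOverSampler(random_state=0).fit_resample(X, y)

        # Keep only the most class-discriminative terms (chi-square test).
        X_red = SelectKBest(chi2, k=5).fit_transform(X_bal, y_bal)

        # Linear SVM, as in finding (1) of the abstract.
        clf = LinearSVC().fit(X_red, y_bal)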

    Social media mental health analysis framework through applied computational approaches

    Get PDF
    Studies have shown that mental illness burdens not only public health and productivity but also established market economies throughout the world. However, mental disorders are difficult to diagnose and monitor through traditional methods, which heavily rely on interviews, questionnaires and surveys, resulting in high under-diagnosis and under-treatment rates. The increasing use of online social media, such as Facebook and Twitter, is now a common part of people’s everyday life. The continuous and real-time user-generated content often reflects feelings, opinions, social status and behaviours of individuals, creating an unprecedented wealth of person-specific information. With advances in data science, social media has already been increasingly employed in population health monitoring and more recently mental health applications to understand mental disorders as well as to develop online screening and intervention tools. However, existing research efforts are still in their infancy, primarily aimed at highlighting the potential of employing social media in mental health research. The majority of work is developed on ad hoc datasets and lacks a systematic research pipeline. [Continues.]

    Politische Maschinen: Maschinelles Lernen für das Verständnis von sozialen Maschinen (Political Machines: Machine Learning for Understanding Social Machines)

    Get PDF
    This thesis investigates human-algorithm interactions in sociotechnological ecosystems. Specifically, it applies machine learning and statistical methods to uncover political dimensions of algorithmic influence in social media platforms and automated decision-making systems. Based on the results, the study discusses the legal, political and ethical consequences of algorithmic implementations.

    Tracking the Temporal-Evolution of Supernova Bubbles in Numerical Simulations

    Get PDF
    The study of low-dimensional, noisy manifolds embedded in a higher-dimensional space has been extremely useful in many applications, from the chemical analysis of multi-phase flows to simulations of galactic mergers. Building a probabilistic model of a manifold helps describe its essential properties and how they vary in space. However, when the manifold evolves through time, a joint spatio-temporal model is needed in order to fully comprehend its nature. We propose a first-order Markovian process that propagates the spatial probabilistic model of a manifold at a fixed time to its adjacent temporal stages. The proposed methodology is demonstrated using a particle simulation of an interacting dwarf galaxy to describe the evolution of a cavity generated by a Supernova.
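
    The first-order Markovian propagation can be illustrated with a Gaussian mixture: the spatial model fitted at one snapshot seeds the fit at the next, so the model at time t+1 depends only on the model at time t. The synthetic particle positions and the use of scikit-learn's GaussianMixture are assumptions of this sketch, not the authors' simulation data or exact probabilistic model:

        import numpy as np
        from sklearn.mixture import GaussianMixture

        rng = np.random.default_rng(0)
        # Synthetic particle positions at two adjacent snapshots: a "bubble"
        # of particles drifting between time t and time t+1.
        X_t  = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(500, 2))
        X_t1 = rng.normal(loc=[0.3, 0.1], scale=0.6, size=(500, 2))

        # Spatial probabilistic model at time t.
        gmm_t = GaussianMixture(n_components=3, random_state=0).fit(X_t)

        # First-order Markov step: the fit at t+1 is initialised solely
        # from the model at t, then updated on the new snapshot.
        gmm_t1 = GaussianMixture(
            n_components=3,
            weights_init=gmm_t.weights_,
            means_init=gmm_t.means_,
            precisions_init=gmm_t.precisions_,
        ).fit(X_t1)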

    AVATAR - Machine Learning Pipeline Evaluation Using Surrogate Model

    Get PDF
    © 2020, The Author(s). The evaluation of machine learning (ML) pipelines is essential during automatic ML pipeline composition and optimisation. Previous methods, such as the Bayesian-based and genetic-based optimisation implemented in Auto-Weka, Auto-sklearn and TPOT, evaluate pipelines by executing them. Therefore, pipeline composition and optimisation with these methods requires a tremendous amount of time, which prevents them from exploring complex pipelines to find better predictive models. To further explore this research challenge, we have conducted experiments showing that many of the generated pipelines are invalid, and that it is unnecessary to execute them to find out whether they are good pipelines. To address this issue, we propose a novel method to evaluate the validity of ML pipelines using a surrogate model (AVATAR). AVATAR accelerates automatic ML pipeline composition and optimisation by quickly discarding invalid pipelines. Our experiments show that AVATAR is more efficient at evaluating complex pipelines than traditional evaluation approaches that require their execution.
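
    The core idea, rejecting invalid pipelines without executing them, can be sketched as a static compatibility check between what each component requires and what the previous one produces. The component names and the property table below are a made-up stand-in for the paper's actual surrogate model:

        # A toy surrogate check: each component declares the data properties it
        # requires and produces; a pipeline is "valid" only if every adjacent
        # pair is compatible. Names and properties are illustrative.
        REQUIRES = {
            "Imputer":        {"numeric"},
            "OneHotEncoder":  {"categorical"},
            "StandardScaler": {"numeric"},
            "LinearSVC":      {"numeric"},
        }
        PRODUCES = {
            "Imputer":        {"numeric"},
            "OneHotEncoder":  {"numeric"},
            "StandardScaler": {"numeric"},
        }

        def is_valid(pipeline, input_props):
            props = set(input_props)
            for step in pipeline:
                if not REQUIRES[step] <= props:
                    return False            # invalid: reject without executing
                props = PRODUCES.get(step, props)
            return True

        # Valid: categorical data is encoded before the scaler sees it.
        print(is_valid(["OneHotEncoder", "StandardScaler", "LinearSVC"], {"categorical"}))
        # Invalid: the scaler cannot handle raw categorical input.
        print(is_valid(["StandardScaler", "LinearSVC"], {"categorical"}))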

    Deep Learning In Software Engineering

    Get PDF
    Software evolves and therefore requires an evolving field of Software Engineering. The evolution of software can be seen at the individual project level, through the software life cycle, as well as at the collective level, as we study the trends and uses of software in the real world. As the needs and requirements of users change, software must evolve to reflect those changes. This never-ending cycle has led to continuous and rapid development of software projects. More importantly, it has put a great responsibility on software engineers, causing them to adopt practices and tools that increase their efficiency. However, these tools suffer the same fate as software designed for the general population: they need to change in order to reflect the users' needs. Fortunately, the demand for this evolving software has given software engineers a plethora of data and artifacts to analyze. The challenge arises when attempting to identify and apply patterns learned from this vast amount of data.

    In this dissertation, we explore and develop techniques that take advantage of the vast amount of software data and aid developers in software development tasks. Specifically, we use deep learning to automatically learn patterns from previous software data and apply those patterns to present-day software development. We first investigate the current impact of deep learning in software engineering by performing a systematic literature review of top-tier conferences and journals. This review provides guidelines and common pitfalls for researchers to consider when implementing DL (Deep Learning) approaches in SE (Software Engineering), as well as a research road map for areas within SE where DL could be applicable. We next develop an approach that simultaneously learns different representations of source code for the task of clone detection. We find that the use of multiple representations, such as identifiers, ASTs, CFGs and bytecode, can lead to the identification of similar code fragments, and that deep learning strategies allow these representations to be learned automatically, without hand-crafted features. Lastly, we design a novel approach for automating the generation of assert statements through seq2seq learning, with the goal of increasing the efficiency of software testing: given a test method and the context of the associated focal method, we automatically generate semantically and syntactically correct assert statements for a given, unseen test method.

    We demonstrate that the techniques presented in this dissertation provide a meaningful advancement to the field of software engineering and the automation of software development tasks, and we provide analytical evaluations and empirical evidence that substantiate the impact of our findings and the usefulness of our approaches to the software engineering community.
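
    As a rough illustration of why multiple code representations help clone detection, the following non-learned simplification compares two Python fragments on a lexical view (identifier sets) and a structural view (AST node types); the dissertation's approach learns such representations with deep models rather than hand-crafting them as done here:

        import ast

        def representations(src):
            tree = ast.parse(src)
            idents = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
            node_types = {type(n).__name__ for n in ast.walk(tree)}
            return idents, node_types

        def jaccard(a, b):
            return len(a & b) / len(a | b) if a | b else 0.0

        def clone_score(src1, src2):
            ids1, ast1 = representations(src1)
            ids2, ast2 = representations(src2)
            # Combine two views: lexical (identifiers) and structural (AST shape).
            return 0.5 * jaccard(ids1, ids2) + 0.5 * jaccard(ast1, ast2)

        a = "def total(xs):\n    s = 0\n    for x in xs:\n        s += x\n    return s"
        b = "def add_all(ys):\n    acc = 0\n    for y in ys:\n        acc += y\n    return acc"
        # High structural similarity despite fully renamed identifiers.
        print(clone_score(a, b))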

    Data Hiding and Its Applications

    Get PDF
    Data hiding techniques have been widely used to provide copyright protection, data integrity, covert communication, non-repudiation, and authentication, among other applications. In the context of the increased dissemination and distribution of multimedia content over the internet, data hiding methods, such as digital watermarking and steganography, are becoming increasingly relevant in providing multimedia security. The goal of this book is to focus on the improvement of data hiding algorithms and their different applications (both traditional and emerging), bringing together researchers and practitioners from different research fields, including data hiding, signal processing, cryptography, and information theory, among others.
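
    As a concrete taste of the simplest technique in this space, here is a least-significant-bit (LSB) embedding sketch in Python; real watermarking and steganography schemes covered by such collections are considerably more robust than this illustration:

        def embed_lsb(cover: bytearray, message: bytes) -> bytearray:
            """Hide message bits in the least significant bit of each cover byte."""
            bits = [(byte >> i) & 1 for byte in message for i in range(8)]
            assert len(bits) <= len(cover), "cover too small for message"
            stego = bytearray(cover)
            for pos, bit in enumerate(bits):
                stego[pos] = (stego[pos] & 0xFE) | bit
            return stego

        def extract_lsb(stego: bytearray, n_bytes: int) -> bytes:
            """Recover n_bytes of hidden data from the stego object's LSBs."""
            out = bytearray()
            for i in range(n_bytes):
                byte = 0
                for j in range(8):
                    byte |= (stego[i * 8 + j] & 1) << j
                out.append(byte)
            return bytes(out)

        cover = bytearray(range(256))            # stand-in for pixel data
        stego = embed_lsb(cover, b"hi")
        print(extract_lsb(stego, 2))             # b'hi'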

    AIRM: a new AI Recruiting Model for the Saudi Arabian labour market

    Get PDF
    One of the goals of Saudi Vision 2030 is to keep the unemployment rate at the lowest possible level to empower the economy. Prior research has shown that an increase in unemployment has a negative effect on a country’s Gross Domestic Product. This research aims to utilise cutting-edge technology such as Data Lakes (DL), Machine Learning (ML) and Artificial Intelligence (AI) to assist the Saudi labour market by matching job seekers with vacant positions. Currently, human experts carry out this process; however, it is time-consuming and labour-intensive. Moreover, the Saudi labour market does not use a cohesive data centre to monitor, integrate, or analyse labour market data, resulting in inefficiencies such as bias and latency. These inefficiencies arise from a lack of technologies and, more importantly, from having an open labour market without a national labour market data centre. This research proposes a new AI Recruiting Model (AIRM) architecture that exploits Data Lakes, ML and AI to rapidly and efficiently match job seekers to vacant positions in the Saudi labour market. A Minimum Viable Product (MVP) is employed to test the proposed AIRM architecture using a simulated labour market corpus for training; the architecture is further evaluated with three collaborating Human Resources (HR) professionals, as this data-driven research requires input from domain experts. The first layer of the AIRM architecture, the initial screening layer, uses balanced iterative reducing and clustering using hierarchies (BIRCH) as its clustering algorithm. The mapping layer uses sentence transformers with a robustly optimised BERT pre-training approach (RoBERTa) as the base model, and ranking is carried out using Facebook AI Similarity Search (FAISS). Finally, the preferences layer takes the user’s preferences as a list and sorts the results using a pre-trained cross-encoder model, weighting the more important words. The new AIRM has yielded favourable outcomes. To account for the subjective character of a selection process handled exclusively by human HR experts, this research considered an AIRM selection acceptable if it was ratified by at least one HR expert. The AIRM was evaluated using two metrics: accuracy and time. It achieved an overall matching accuracy of 84%, with at least one expert agreeing with the system’s output, and it completed the task in 2.4 minutes, whereas human experts took more than six days on average. Overall, the AIRM outperforms humans in task execution, making it useful for pre-selecting a group of applicants and positions. The AIRM is not limited to government services; it can also help any commercial business that uses Big Data.
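
    A minimal sketch of the three layers as described, assuming the scikit-learn, faiss and sentence-transformers libraries; the specific model names ("all-distilroberta-v1" as a RoBERTa-based sentence transformer, and an ms-marco cross-encoder) and the toy job adverts are illustrative guesses, not the thesis's actual components:

        import numpy as np
        import faiss
        from sklearn.cluster import Birch
        from sentence_transformers import SentenceTransformer, CrossEncoder

        jobs = ["Data engineer, builds ETL pipelines in Python",
                "Registered nurse for a paediatric ward",
                "Backend developer, REST APIs and SQL"]

        # Mapping layer: RoBERTa-based sentence embeddings (model name assumed).
        encoder = SentenceTransformer("all-distilroberta-v1")
        emb = encoder.encode(jobs, normalize_embeddings=True)

        # Screening layer: BIRCH clusters narrow the candidate pool; a full
        # system would restrict the later search to the seeker's cluster.
        clusters = Birch(n_clusters=2).fit_predict(emb)

        # Ranking: FAISS inner-product search over the normalised embeddings.
        index = faiss.IndexFlatIP(emb.shape[1])
        index.add(emb.astype(np.float32))
        query = "Python developer with database experience"
        seeker = encoder.encode([query], normalize_embeddings=True).astype(np.float32)
        scores, ids = index.search(seeker, 2)

        # Preferences layer: a cross-encoder re-scores the shortlist (model assumed).
        reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
        print(reranker.predict([(query, jobs[i]) for i in ids[0]]))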

    ICTERI 2020: ICT in Education, Research, and Industrial Applications. Integration, Harmonization and Knowledge Transfer 2020: Proceedings of the 16th International Conference. Volume II: Workshops. Kharkiv, Ukraine, October 06-10, 2020

    Get PDF
    This volume represents the proceedings of the Workshops co-located with the 16th International Conference on ICT in Education, Research, and Industrial Applications, held in Kharkiv, Ukraine, in October 2020. It comprises 101 contributed papers that were carefully peer-reviewed and selected from 233 submissions for the six workshops: RMSEBT, TheRMIT, ITER, 3L-Person, CoSinE, MROL. The volume is structured in six parts, each presenting the contributions for a particular workshop. The topical scope of the volume is aligned with the thematic tracks of ICTERI 2020: (I) Advances in ICT Research; (II) Information Systems: Technology and Applications; (III) Academia/Industry ICT Cooperation; and (IV) ICT in Education.