    Specification mining: Methodologies, theories and applications

    A structured approach to malware detection and analysis in digital forensics investigation

    A thesis submitted to the University of Bedfordshire in partial fulfilment of the requirement for the degree of PhD. Within the World Wide Web (WWW), malware is considered one of the most serious threats to system security, with complex system issues caused by malware and spam. Networks and systems can be accessed and compromised by various types of malware, such as viruses, worms, Trojans, botnets and rootkits, which compromise systems through coordinated attacks. Malware often uses anti-forensic techniques to avoid detection and investigation. Moreover, investigations of such attacks are often ineffective and can create barriers to obtaining clear evidence, owing to the lack of sufficient tools and the immaturity of forensics methodology. This research addressed various complexities faced by investigators in the detection and analysis of malware. In this thesis, the author identified the need for a new approach to malware detection that focuses on a robust framework, and proposed a solution based on an extensive literature review and market research analysis. The literature review focussed on the different trials and techniques in malware detection to identify the parameters for developing a solution design, while market research was carried out to understand the precise nature of the current problem. The author termed the new approach and framework the triple-tier centralised online real-time environment (tri-CORE) malware analysis (TCMA). The tiers come from three distinctive phases of detection and analysis, where the entire research pattern is divided into three different domains. The tiers are the malware acquisition function, detection and analysis, and the database operational function. This framework design will contribute to the field of computer forensics by making the investigative process more effective and efficient. 
Integrating a hybrid method for malware detection mitigates the limitations associated with purely static and purely dynamic methods. This helps forensics experts carry out quick investigative processes to detect the behaviour of the malware and its related elements. The proposed framework will help to ensure system confidentiality, integrity, availability and accountability. The research also focussed on a prototype (artefact) developed in favour of a different approach to digital forensics and malware detection methods. As such, a new Toolkit was designed and implemented, which is based on a simple architectural structure and built from open source software, and which can help investigators develop the skills to respond critically to current cyber incidents and analyses.

    Security and trust in cloud computing and IoT through applying obfuscation, diversification, and trusted computing technologies

    Cloud computing and the Internet of Things (IoT) are widespread and commonly used technologies nowadays. The advanced services offered by cloud computing have made it a highly demanded technology. Enterprises and businesses rely more and more on the cloud to deliver services to their customers. The prevalent use of the cloud means that more data is stored outside the organization’s premises, which raises concerns about the security and privacy of the stored and processed data. This highlights the significance of effective security practices to secure the cloud infrastructure. The number of IoT devices is growing rapidly and the technology is being employed in a wide range of sectors including smart healthcare, industry automation, and smart environments. These devices collect and exchange a great deal of information, some of which may contain critical and personal data of the users of the device. Hence, it is highly significant to protect the data collected and shared over the network; nevertheless, studies indicate that attacks on these devices are increasing, while a high percentage of IoT devices lack proper security measures to protect the devices, the data, and the privacy of the users. In this dissertation, we study the security of cloud computing and IoT and propose software-based security approaches supported by hardware-based technologies to provide robust measures for enhancing the security of these environments. To achieve this goal, we use obfuscation and diversification as the potential software security techniques. Code obfuscation protects the software from malicious reverse engineering and diversification mitigates the risk of large-scale exploits. We study trusted computing and Trusted Execution Environments (TEE) as the hardware-based security solutions. The Trusted Platform Module (TPM) provides security and trust through a hardware root of trust, and assures the integrity of a platform. 
We also study Intel SGX, a TEE solution that guarantees the integrity and confidentiality of the code and data loaded into its protected container, the enclave. More precisely, through obfuscation and diversification of the operating systems and APIs of the IoT devices, we secure them at the application level, and by obfuscation and diversification of the communication protocols, we protect the communication of data between them at the network level. To secure cloud computing, we employ obfuscation and diversification techniques on the cloud computing software at the client side. For an enhanced level of security, we employ the hardware-based security solutions TPM and SGX. In addition to security, these solutions ensure layered trust across the layers from hardware to the application. As the result of this PhD research, this dissertation addresses a number of security risks targeting IoT and cloud computing through the delivered publications and presents a brief outlook on future research directions.

    A semantic methodology for (un)structured digital evidences analysis

    Nowadays, more than ever, digital forensics activities are involved in any criminal, civil or military investigation and represent a fundamental tool to support cyber-security. Investigators use a variety of techniques and proprietary forensic software applications to examine copies of digital devices, searching for hidden, deleted, encrypted, or damaged files or folders. Any evidence found is carefully analysed and documented in a "finding report" in preparation for legal proceedings that involve discovery, depositions, or actual litigation. The aim is to discover and analyse patterns of fraudulent activities. In this work, a new methodology is proposed to support investigators during the analysis process by correlating evidence found through different forensic tools. The methodology was implemented through a system able to add semantic assertions to data generated by forensics tools during extraction processes. These assertions enable more effective access to relevant information and enhanced retrieval and reasoning capabilities.

    Browser energy efficiency in android

    Integrated master's dissertation in Informatics Engineering. Nowadays, there is a massive growth in energy consumption in the IT sector, which, despite its benefits, is leaving a huge energy footprint. As a result, energy consumption and how to improve it has become one of the most talked-about topics today. Several developments have been made to find the most efficient solutions to the various problems that users and developers encounter. But this is far from an easy task for either group, as there is still very little information available, and sometimes the solutions do not meet the needs of each one. With this in mind, this dissertation aims to verify which browser is most efficient in the Android environment, since there is not much information in this area. For this, we selected seven browsers and ran four test scenarios in order to stress the browsers. To test, we recorded a script for each browser in each scenario, trying to mimic the behaviour of a regular user. The RERAN tool was used to record and repeat each script five times, and the Trepn tool was used to monitor it. The results obtained allowed us to rank the seven selected browsers by efficiency.

    Regulatory Compliance-oriented Impediments and Associated Effort Estimation Metrics in Requirements Engineering for Contractual Systems Engineering Projects

    Large-scale contractual systems engineering projects often need to comply with a myriad of government regulations and standards as part of contractual fulfillment. A key activity in the requirements engineering (RE) process for such a project is to elicit appropriate requirements from the regulations and standards that apply to the target system. However, there are impediments to achieving compliance due to such factors as the voluminous contract and its high-level specifications, the large number of regulatory documents, and the multiple domains of the system. Little empirical research has been conducted on developing a shared understanding of the compliance-oriented complexities involved in such projects, and on identifying and developing RE support (such as processes, tools, metrics, and methods) to improve overall performance for compliance projects. Through three studies on an industrial RE project, we investigated a number of issues in RE concerning compliance, leading to the following novel results: (i) a meta-model that captures artefact types and their compliance-oriented inter-relationships that exist in RE for contractual systems engineering projects; (ii) discovery of key impediments to requirements compliance due to: (a) contractual complexities (e.g., regulatory requirements specified non-contiguously with non-regulatory requirements in the contract at the ratio of 1:19), (b) complexities in regulatory documents (e.g., over 300 regulatory documents being relevant to the subject system), and (c) a large and complex system (e.g., 40% of the contractual regulatory requirements are cross-cutting); (iii) a method for deriving base metrics for estimating the effort needed to do compliance work during RE, with a demonstration of how a set of derived metrics can be used to create an effort estimation model for such work; (iv) a framework for structuring diverse regulatory documents and requirements for global product developments. 
These results lay a foundation for RE research on compliance issues, with anticipated impact on real-world projects and on RE research.

    On the enhancement of Big Data Pipelines through Data Preparation, Data Quality, and the distribution of Optimisation Problems

    Nowadays, data are fundamental for companies, providing operational support by facilitating daily transactions. Data have also become the cornerstone of strategic decision-making processes in businesses. For this purpose, there are numerous techniques that allow knowledge and value to be extracted from data. For example, optimisation algorithms excel at supporting decision-making processes to improve the use of resources, time and costs in the organisation. In the current industrial context, organisations usually rely on business processes to orchestrate their daily activities while collecting large amounts of information from heterogeneous sources. Therefore, the support of Big Data technologies (which are based on distributed environments) is required given the volume, variety and speed of data. Then, in order to extract value from the data, a set of techniques or activities is applied in an orderly way and at different stages. This set of techniques or activities, which facilitate the acquisition, preparation, and analysis of data, is known in the literature as Big Data pipelines. In this thesis, the improvement of three stages of the Big Data pipelines is tackled: Data Preparation, Data Quality assessment, and Data Analysis. These improvements can be addressed from an individual perspective, by focussing on each stage, or from a more complex and global perspective, implying the coordination of these stages to create data workflows. The first stage to improve is Data Preparation, by supporting the preparation of data with complex structures (i.e., data with various levels of nested structures, such as arrays). Shortcomings have been found in the literature and current technologies for transforming complex data in a simple way. Therefore, this thesis aims to improve the Data Preparation stage through Domain-Specific Languages (DSLs). Specifically, two DSLs are proposed for different use cases. 
While one of them is a general-purpose Data Transformation language, the other is a DSL aimed at extracting event logs in a standard format for process mining algorithms. The second area for improvement is related to the assessment of Data Quality. Depending on the type of Data Analysis algorithm, poor-quality data can seriously skew the results. Optimisation algorithms are a clear example: if the data are not sufficiently accurate and complete, the search space can be severely affected. Therefore, this thesis formulates a methodology for modelling Data Quality rules adjusted to the context of use, as well as a tool that facilitates the automation of their assessment. This makes it possible to discard the data that do not meet the quality criteria defined by the organisation. In addition, the proposal includes a framework that helps to select actions to improve the usability of the data. The third and last proposal involves the Data Analysis stage. In this case, this thesis faces the challenge of supporting the use of optimisation problems in Big Data pipelines. There is a lack of methodological solutions that allow exhaustive optimisation problems to be computed in distributed environments (i.e., those optimisation problems that guarantee the finding of an optimal solution by exploring the whole search space). The resolution of this type of problem in the Big Data context is computationally complex, and can be NP-complete. This is caused by two different factors. On the one hand, the search space can increase significantly as the amount of data to be processed by the optimisation algorithms increases. This challenge is addressed through a technique to generate and group problems with distributed data. On the other hand, processing optimisation problems with complex models and large search spaces in distributed environments is not trivial. Therefore, a proposal is presented for a particular case in this type of scenario. 
As a result, this thesis develops methodologies that have been published in scientific journals and conferences. The methodologies have been implemented in software tools that are integrated with the Apache Spark data processing engine. The solutions have been validated through tests and use cases with real datasets.
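
The idea of context-specific Data Quality rules that decide which data are discarded can be illustrated with a minimal sketch. It is written in plain Python rather than Apache Spark, and it is not the DSL or tool developed in the thesis: the rule names, the record fields (`id`, `amount`) and the discard policy are all assumptions made for illustration.

```python
from typing import Callable, Dict, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical rules; a real deployment would derive them from the
# quality criteria defined by the organisation.
RULES: Dict[str, Rule] = {
    # Completeness: the required fields must be present and non-null.
    "completeness": lambda r: all(r.get(k) is not None for k in ("id", "amount")),
    # Accuracy: the amount must be a non-negative number.
    "accuracy": lambda r: isinstance(r.get("amount"), (int, float)) and r["amount"] >= 0,
}


def assess(record: Record, rules: Dict[str, Rule]) -> List[str]:
    """Return the names of the rules that the record violates."""
    return [name for name, rule in rules.items() if not rule(record)]


def split_by_quality(
    records: List[Record], rules: Dict[str, Rule]
) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Separate usable records from those failing at least one rule."""
    usable: List[Record] = []
    discarded: List[Tuple[Record, List[str]]] = []
    for rec in records:
        failed = assess(rec, rules)
        if failed:
            discarded.append((rec, failed))
        else:
            usable.append(rec)
    return usable, discarded


if __name__ == "__main__":
    sample = [
        {"id": 1, "amount": 10.0},
        {"id": 2, "amount": None},
        {"id": 3, "amount": -5},
    ]
    ok, bad = split_by_quality(sample, RULES)
    print(len(ok), "usable;", [(r["id"], f) for r, f in bad])
```

Keeping the violated rule names alongside each discarded record leaves room for a later step that selects repair actions instead of discarding outright.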

    Process Models for Learning Patterns in FLOSS Repositories

    Evidence suggests that Free/Libre Open Source Software (FLOSS) environments provide unlimited learning opportunities. Community members engage in a number of activities both during their interaction with their peers and while making use of these environments’ repositories. To date, numerous studies document the existence of learning processes in FLOSS through surveys or by means of questionnaires filled in by FLOSS project participants. At the same time, there is a surge in developing tools and techniques for extracting and analyzing data from different FLOSS data sources, which has given rise to a new field called Mining Software Repositories (MSR). In spite of these growing tools and techniques for mining FLOSS repositories, there are few or no existing approaches to providing empirical evidence of learning processes directly from these repositories. Therefore, in this work we sought to trigger such an initiative by proposing an approach based on Process Mining. With this technique, we aim to trace learning behaviors from FLOSS participants’ trails of activities as recorded in FLOSS repositories. We identify the participants as Novices and Experts. A Novice is defined as any FLOSS member that benefits from a learning experience through acquiring new skills, while the Expert is the provider of these skills. The significance of our work is mainly twofold. First, we extend the MSR field by showing the potential of mining FLOSS repositories by applying Process Mining techniques. Second, our work provides critical evidence that boosts the understanding of learning behavior in FLOSS communities by analyzing the relevant repositories. 
In order to accomplish this, we have proposed and implemented a methodology that follows a seven-step approach: developing an appropriate terminology or ontology for learning processes in FLOSS, contextualizing learning processes through a-priori models, generating Event Logs, generating corresponding process models, interpreting and evaluating the value of process discovery, performing conformance analysis, and verifying a number of formulated hypotheses with regard to tracing learning patterns in FLOSS communities. The implementation of this approach has resulted in the development of the Ontology of Learning in FLOSS (OntoLiFLOSS) environments, which defines the terms needed to describe learning processes in FLOSS and provides a visual representation of these processes through Petri net-like Workflow nets. Moreover, another novelty pertains to the mining of FLOSS repositories by defining and describing the preliminaries required for preprocessing FLOSS data before applying Process Mining techniques for analysis. Through a step-by-step process, we detail how the Event Logs are constructed by generating key phrases and making use of Semantic Search. Taking a FLOSS environment called Openstack as our data source, we apply our proposed techniques to identify learning activities based on key phrase catalogs and classification rules expressed through pseudo code, as well as the appropriate Process Mining tool. We thus produced Event Logs that are based on the semantic content of messages in Openstack’s Mailing archives, Internet Relay Chat (IRC) messages, Reviews, Bug reports and Source code to retrieve the corresponding activities. Considering these repositories in light of the three learning process phases (Initiation, Progression and Maturation), we produced an Event Log for each participant (Novice or Expert) in every phase on the corresponding dataset. 
Hence, we produced 14 Event Logs that helped build 14 corresponding process maps, which are visual representations of the flow of occurrence of learning activities in FLOSS for each participant. These process maps provide critical indications of the presence of learning processes in the analyzed repositories. The results show that learning activities do occur at a significant rate during message exchange on both Mailing archives and IRC messages. The slight differences between the two datasets can be highlighted in two ways. First, the involvement of Experts is greater on IRC than on Mailing archives, with 7.22% and 0.36% of Expert involvement respectively on IRC forums and Mailing lists. This can be explained by the difference in the length of messages sent on these two datasets. The average length of sent messages is 3261 characters for an email compared to 60 characters for a chat message. The evidence produced from this mining experiment solidifies the finding in terms of the existence of learning processes in FLOSS as well as the scale at which they occur. While the Initiation phase shows the Novice as the most involved at the start of the learning process, during the Progression phase the involvement of the Expert can be seen to increase significantly. In order to trace the advanced skills in the Maturation phase, we look at repositories that store data about developing and creating code, examining and reviewing the code, and identifying and fixing possible bugs. Therefore, we consider three repositories: Source code, Bug reports and Reviews. The results obtained in this phase largely justify the choice of these three datasets to track learning behavior at this stage. Both the Bug reports and the Source code demonstrate the commitment of the Novice to seek answers and interact as much as possible in strengthening the acquired skills. 
With a participation of 49.22% for the Novice against 46.72% for the Expert on Bug reports, and 46.19% against 42.04% on Source code, the Novice still engages significantly in learning. On the last dataset, Reviews, we notice an increase in the Expert’s role. The Expert performs 40.36% of the total number of activities against 22.17% for the Novice. The last steps of our methodology steer the comparison of the defined a-priori models with final models that describe how learning processes occur according to the actual behavior from Event Logs. Our attempts at producing process models start with depicting process maps to track the actual behaviour as it occurs in Openstack repositories, before concluding with final Petri net models representative of learning processes in FLOSS as a result of conformance analysis. For every dataset in the corresponding learning phase, we produce 3 process maps, depicting respectively the overall learning behaviour of all FLOSS community members (Novice and Expert together), that of the Novice, and that of the Expert. In total, we produced 21 process maps, empirically describing process models on real data, and 14 process models in the form of Petri nets for every participant on each dataset. We make use of Artificial Immune System (AIS) algorithms to merge the 14 Event Logs that uniquely capture the behaviour of every participant on different datasets in the three phases. We then reanalyze the resulting logs in order to produce 6 global models that provide a comprehensive depiction of participants’ learning behavior in FLOSS communities. This description hints that the Workflow nets introduced as our a-priori models give a rather simplistic representation of learning processes in FLOSS. 
Nevertheless, our experiments with Event Logs from Openstack repositories, from process discovery to conformance checking, demonstrate that the real learning behaviors are more complete and, most importantly, largely subsume these simplistic a-priori models. Finally, our methodology has proved to be effective both in providing a novel alternative for mining FLOSS repositories and in providing empirical evidence that describes how knowledge is exchanged in FLOSS environments. Moreover, our results enrich the MSR field by providing a reproducible step-by-step problem-solving approach that can be customized to answer subsequent research questions in FLOSS repositories using Process Mining.
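
The step from an Event Log to a process map can be sketched by counting directly-follows relations, a generic Process Mining technique that underlies many process maps. This is only an illustration under stated assumptions: it is not necessarily the algorithm or tool used in the thesis, and the case identifiers, timestamps and activity names below are hypothetical.

```python
from collections import Counter, defaultdict
from typing import DefaultDict, Iterable, List, Tuple

# One event: (case identifier, timestamp, activity name).
Event = Tuple[str, int, str]


def directly_follows(events: Iterable[Event]) -> Counter:
    """Count how often activity a is immediately followed by activity b
    within the same case, after ordering each case by timestamp."""
    traces: DefaultDict[str, List[Tuple[int, str]]] = defaultdict(list)
    for case_id, timestamp, activity in events:
        traces[case_id].append((timestamp, activity))

    dfg: Counter = Counter()
    for trace in traces.values():
        ordered = [activity for _, activity in sorted(trace)]
        for a, b in zip(ordered, ordered[1:]):
            dfg[(a, b)] += 1
    return dfg


if __name__ == "__main__":
    # Hypothetical log: each mailing-list thread is treated as one case.
    log = [
        ("thread-1", 1, "PostQuestion"),
        ("thread-1", 2, "ProvideAnswer"),
        ("thread-1", 3, "Acknowledge"),
        ("thread-2", 1, "PostQuestion"),
        ("thread-2", 2, "ProvideAnswer"),
    ]
    for (a, b), count in directly_follows(log).items():
        print(f"{a} -> {b}: {count}")
```

The resulting edge counts correspond to the arcs of a process map; discovering a Petri net and checking conformance against an a-priori model would require a dedicated Process Mining tool.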