37 research outputs found

    Intelligent Software Tooling For Improving Software Development

    Software has eaten the world: many of the necessities and quality-of-life services people use depend on software. Tools that improve the software development experience, for example by generating code and test cases, detecting bugs, or answering questions, can therefore have a significant impact on the world. The success of Deep Learning (DL) over the past decade has brought huge advances in automation across many domains, including software development processes. One of the main reasons behind this success is the availability of large training datasets, such as open-source code available through GitHub or image datasets of mobile Graphical User Interfaces (GUIs) such as RICO and ReDRAW. The central research question my dissertation explores is therefore: in what ways can the software development process be improved by leveraging DL techniques on the vast amounts of unstructured software engineering artifacts? We coin the term Intelligent Software Tools for approaches that leverage DL to automate or augment various software development tasks. To guide our research on these intelligent software tools, we performed a systematic literature review to understand the current landscape of research on applying DL techniques to software tasks and any gaps that exist. From this literature review, we found code generation to be one of the most studied tasks, while other tasks and artifacts, such as impact analysis or tasks involving images and videos, are understudied. We therefore set out to explore the application of DL to these understudied tasks and artifacts, as well as the limitations of DL models on the well-studied task of code completion, a subfield of code generation. Specifically, we developed a tool for automatically detecting duplicate mobile bug reports from user-submitted videos. We used the popular Convolutional Neural Network (CNN) to learn important features from a large collection of mobile screenshots. 
Using this model, we could then compute the similarity between a newly submitted bug report and existing ones to produce a ranked list of duplicate candidates that can be reviewed by a developer. Next, we explored impact analysis, a critical software maintenance task that identifies potential adverse effects of a given code change on the larger software system. To this end, we created Athena, a novel approach to impact analysis that integrates knowledge of a software system through its call graph along with high-level representations of the code inside the system to improve impact analysis performance. Lastly, we explored the task of code completion, which has seen heavy interest from industry and academia. Specifically, we explored various methods that modify the positional encoding scheme of the Transformer architecture so that these models can incorporate longer sequences of tokens when predicting completions than were seen during training, as this can significantly improve training times.
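
The ranking step described in this abstract can be illustrated with a minimal sketch. It assumes each bug report has already been reduced to a fixed-length feature vector (in the thesis these come from a CNN; here they are made-up numbers) and that cosine similarity is the comparison measure, which the abstract does not specify. The function names and report IDs are hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat an all-zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def rank_duplicate_candidates(new_report, existing_reports):
    """Return (report_id, score) pairs, most similar first."""
    scored = [
        (report_id, cosine_similarity(new_report, vector))
        for report_id, vector in existing_reports.items()
    ]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical feature vectors standing in for learned video/screenshot features.
existing = {
    "BUG-1": [1.0, 0.0, 0.0],
    "BUG-2": [0.9, 0.1, 0.0],
    "BUG-3": [0.0, 0.0, 1.0],
}
ranking = rank_duplicate_candidates([1.0, 0.0, 0.0], existing)
```

A developer would then review `ranking` from the top; any similarity threshold for flagging a duplicate would have to be tuned on real data.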

    Three Studies on Model Transformations - Parsing, Generation and Ease of Use

    Transformations play an important part in both software development and the automatic processing of natural languages. We present three publications rooted in the multi-disciplinary research of Language Technology and Software Engineering and relate their contribution to the literature on syntactical transformations. Parsing Linear Context-Free Rewriting Systems: The first publication describes four different parsing algorithms for the mildly context-sensitive grammar formalism Linear Context-Free Rewriting Systems. The algorithms automatically transform a text into a chart. As a result, the parse chart contains the (possibly partial) analysis of the text according to a grammar with a lower level of abstraction than the original text. The uni-directional and endogenous transformations are described within the framework of parsing as deduction. Natural Language Generation from Class Diagrams: Using the framework of Model-Driven Architecture, we generate natural language from class diagrams. The transformation is done in two steps. In the first step we transform the class diagram, defined by Executable and Translatable UML, to grammars specified by the Grammatical Framework. The grammars are then used to generate the desired text. Overall, the transformation is uni-directional, automatic and an example of a reverse engineering translation. Executable and Translatable UML - How Difficult Can it Be?: Within Model-Driven Architecture there has been substantial research on the transformation from Platform-Independent Models (PIM) into Platform-Specific Models, less so on the transformation from Computationally Independent Models (CIM) into PIMs. 
This publication reflects on the outcomes of letting novice software developers transform CIMs specified by UML into PIMs defined in Executable and Translatable UML. Conclusion: The three publications show how model transformations can be used within both Language Technology and Software Engineering to tackle the challenges of natural language processing and software development.

    Human Factors in Agile Software Development

    Through four years of experiments on students' Scrum-based agile software development (ASD) process, we have gained a deep understanding of the human factors of agile methodology. We designed an agile project management tool, the HASE collaboration development platform, to support more than 400 students self-organized into 80 teams in practicing ASD. In this thesis, based on our experiments, simulations and analysis, we contribute a series of solutions and insights, including 1) a Goal Net based method to enhance goal and requirement management for the ASD process, 2) a novel Simple Multi-Agent Real-Time (SMART) approach to enhance intelligent task allocation for the ASD process, 3) a Fuzzy Cognitive Maps (FCMs) based method to enhance emotion and morale management for the ASD process, 4) the first large-scale, in-depth empirical insights on human factors in the ASD process, which have not yet been well studied by existing research, and 5) the first identification of the ASD process as a human-computation system that exploits human effort to perform tasks that computers are not good at solving. On the other hand, computers can assist human decision making in the ASD process. Comment: Book Draf

    Deep learning applied to the assessment of online student programming exercises

    Massive open online courses (MOOCs) teaching coding are increasing in number and popularity. They commonly include homework assignments in which the students must write code that is evaluated by functional tests. Functional testing may to some extent be automated; however, provision of more qualitative evaluation and feedback may be prohibitively labor-intensive. Provision of qualitative evaluation at scale, automatically, is the subject of much research effort. In this thesis, deep learning is applied to the task of performing automatic assessment of source code, with a focus on provision of qualitative feedback. Four tasks are considered in detail: language modeling, detecting idiomatic code, semantic code search, and predicting variable names. First, deep learning models are applied to the task of language modeling source code. A comparison is made between the performance of different deep learning language models, and it is shown how language models can be used for source code auto-completion. It is also demonstrated how language models trained on source code can be used for transfer learning, providing improved performance on other tasks. Next, an analysis is made of how the language models from the previous task can be used to detect idiomatic code. It is shown that these language models are able to locate where a student has deviated from correct code idioms. These locations can be highlighted to the student in order to provide qualitative feedback. Then, results are shown on semantic code search, again comparing the performance across a variety of deep learning models. It is demonstrated how semantic code search can be used to reduce the time taken for qualitative evaluation, by automatically pairing a student submission with an instructor’s hand-written feedback. Finally, it is examined how deep learning can be used to predict variable names within source code. 
These models can be used in a qualitative evaluation setting to suggest more appropriate variable names. It is also shown that these models can even be used to predict the presence of functional errors. Novel experimental results show that: fine-tuning a pre-trained language model is an effective way to improve performance across a variety of tasks on source code, improving performance by 5% on average; pre-trained language models can be used as zero-shot learners across a variety of tasks, with the zero-shot performance of some architectures outperforming the fine-tuned performance of others; and language models can be used to detect both semantic and syntactic errors. Other novel findings include: removing the non-variable tokens within source code has negligible impact on the performance of models, and the remaining tokens can be shuffled with only a minimal decrease in performance. Engineering and Physical Sciences Research Council (EPSRC) fundin

    Continuous Rationale Management

    Continuous Software Engineering (CSE) is a software life cycle model open to frequent changes in requirements or technology. During CSE, software developers continuously make decisions on the requirements and design of the software or the development process. They establish essential decision knowledge, which they need to document and share so that it supports the evolution and changes of the software. The management of decision knowledge is called rationale management. Rationale management provides an opportunity to support the change process during CSE. However, rationale management is not well integrated into CSE. The overall goal of this dissertation is to provide workflows and tool support for continuous rationale management. The dissertation contributes an interview study with practitioners from the industry, which investigates rationale management problems, current practices, and features to support continuous rationale management beneficial for practitioners. Problems of rationale management in practice are threefold: First, documenting decision knowledge is intrusive in the development process and an additional effort. Second, the high amount of distributed decision knowledge documentation is difficult to access and use. Third, the documented knowledge can be of low quality, e.g., outdated, which impedes its use. The dissertation contributes a systematic mapping study on recommendation and classification approaches to treat the rationale management problems. The major contribution of this dissertation is a validated approach for continuous rationale management consisting of the ConRat life cycle model extension and the comprehensive ConDec tool support. To reduce intrusiveness and additional effort, ConRat integrates rationale management activities into existing workflows, such as requirements elicitation, development, and meetings. ConDec integrates into standard development tools instead of providing a separate tool. 
ConDec enables lightweight capturing and use of decision knowledge from various artifacts and reduces the developers' effort through automatic text classification, recommendation, and nudging mechanisms for rationale management. To enable access and use of distributed decision knowledge documentation, ConRat defines a knowledge model of decision knowledge and other artifacts. ConDec instantiates the model as a knowledge graph and offers interactive knowledge views with useful tailoring, e.g., transitive linking. To operationalize high quality, ConRat introduces the rationale backlog, the definition of done for knowledge documentation, and metrics for intra-rationale completeness and decision coverage of requirements and code. ConDec implements these agile concepts for rationale management and a knowledge dashboard. ConDec also supports consistent changes through change impact analysis. The dissertation shows the feasibility, effectiveness, and user acceptance of ConRat and ConDec in six case study projects in an industrial setting. Besides, it comprehensively analyses the rationale documentation created in the projects. The validation indicates that ConRat and ConDec benefit CSE projects. Based on the dissertation, continuous rationale management should become a standard part of CSE, like automated testing or continuous integration

    Genetic Improvement of Software: From Program Landscapes to the Automatic Improvement of a Live System

    In today’s technology-driven society, software is becoming increasingly important in more areas of our lives. The domain of software extends beyond the obvious domain of computers, tablets, and mobile phones. Smart devices and the internet-of-things have inspired the integration of digital and computational technology into objects that some of us would never have guessed could be possible or even necessary: fridges and freezers connected to social media sites, a toaster activated with a mobile phone, physical buttons for shopping, and smart speakers that take a verbal request to order a meal for delivery. This is the world we live in, and it is an exciting time for software engineers and computer scientists. The sheer volume of code currently in use has long since grown beyond any hope of proper manual maintenance. The rate at which mobile application stores such as Google’s and Apple’s have expanded is astounding. The research presented here aims to shed light on an emerging field of research called Genetic Improvement (GI) of software. It is a methodology for changing program code to improve existing software. This thesis details a framework for GI that is then applied to explore the fitness landscape of bug fixing in Python software and to reduce execution time in a C++ program, and that is integrated into a live system. We show that software is generally not fragile and that, although fitness landscapes for GI are flat, they are not impossible to search. This conclusion applies equally to bug fixing in small programs and to execution time improvements. The framework’s application is shown to be transportable between programming languages with minimal effort. Additionally, it can be easily integrated into a system that runs a live web service. The work within this thesis was funded by EPSRC grant EP/J017515/1 through the DAASE project.

    Assessing the Reliability of Deep Learning Applications

    Deep Learning (DL) applications are widely deployed in diverse areas, such as image classification, natural language processing, and auto-driving systems. Although these applications achieve outstanding performance in terms of accuracy, developers have raised strong concerns about their reliability, since the logic of DL applications is a black box for humans. Specifically, DL applications learn the logic during stochastic training and encode it in the high-dimensional weights of DL models. Unlike source code in conventional software, such weights are infeasible for humans to directly interpret, examine, and validate. As a result, defects in DL applications are not easy to detect during software development and may cause catastrophic accidents in safety-critical missions. Therefore, it is critical to adequately test the reliability of DL applications before they are deployed. This thesis aims to propose automatic approaches to testing DL applications from the perspective of reliability. It consists of the following three studies. The first study proposes object-relevancy, a property that reliable DL-based image classifiers should comply with, i.e., the classification results should be based on the features relevant to the target object in a given image, instead of irrelevant features such as the background. This study further proposes a metamorphic testing approach and two corresponding metamorphic relations to assess whether this property is violated in image classifications. The evaluation shows that the proposed approach can effectively detect unreliable inferences that violate the object-relevancy property, with average precision of 64.1% and 96.4% for the two relations, respectively. The subsequent empirical study reveals that such unreliable inferences are prevalent in the real world and that existing training strategies cannot tame this issue effectively. 
The second study concentrates on the reliability issues induced by model compression of DL applications. Model compression can significantly reduce the sizes of Deep Neural Network (DNN) models, and thus facilitate the dissemination of sophisticated, sizable DNN models. However, the prediction results of compressed models may deviate from those of their original models, resulting in unreliable DL applications in deployment. To help developers thoroughly understand the impact of model compression, it is essential to test these models to find those deviated behaviors before dissemination. This study proposes DFLARE, a novel, search-based, black-box testing technique. The evaluation shows that DFLARE consistently outperforms the baseline in both efficacy and efficiency. More importantly, the triggering inputs found by DFLARE can be used to repair up to 48.48% of the deviated behaviors. The third study focuses on the reliability of DL-based vulnerability detection (DLVD) techniques. DLVD techniques are designed to detect vulnerabilities in source code. However, these techniques may only capture the syntactic patterns of vulnerable code while ignoring the semantic information in the source code. As a result, malicious users can easily fool such techniques by manipulating the syntactic patterns of vulnerable code, e.g., by renaming variables. This study proposes a new methodology to evaluate the learning ability of DLVD techniques, i.e., whether a DLVD technique can capture the semantic information from vulnerable source code and leverage such information in detection. Specifically, this approach creates a special dataset in which the vulnerable functions and non-vulnerable ones have almost identical syntactic code patterns but different semantic meanings. If a detection approach cannot capture the semantic difference between the vulnerable functions and the non-vulnerable ones, it will have low performance on the constructed dataset. 
Our preliminary results show that two common detection approaches are ineffective in capturing the semantic information from source code
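
To make the variable-renaming manipulation mentioned in this abstract concrete, the following is a small sketch of a semantics-preserving rewrite using Python's standard `ast` module (Python 3.9+ for `ast.unparse`). It is an illustration only, not the thesis's tooling or dataset construction; the class name, the sample function, and the rename mapping are hypothetical.

```python
import ast


class RenameVariables(ast.NodeTransformer):
    """Rename identifiers according to a mapping without changing behavior."""

    def __init__(self, mapping):
        self.mapping = mapping

    def visit_Name(self, node):
        # Variable reads and writes inside the function body.
        node.id = self.mapping.get(node.id, node.id)
        return self.generic_visit(node)

    def visit_arg(self, node):
        # Function parameters.
        node.arg = self.mapping.get(node.arg, node.arg)
        return self.generic_visit(node)


def rename_variables(source, mapping):
    """Return source code with the given identifiers renamed."""
    tree = RenameVariables(mapping).visit(ast.parse(source))
    return ast.unparse(tree)


# Hypothetical sample function; only the identifier names change.
original = "def copy(buf, n):\n    out = buf[:n]\n    return out\n"
renamed = rename_variables(original, {"buf": "a", "n": "b", "out": "c"})
```

The two versions differ token by token yet compute the same result, so a detector that keys on surface patterns may treat them differently even though their meaning is identical.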

    Security and trust in cloud computing and IoT through applying obfuscation, diversification, and trusted computing technologies

    Cloud computing and the Internet of Things (IoT) are widespread and commonly used technologies. The advanced services offered by cloud computing have put it in high demand. Enterprises and businesses increasingly rely on the cloud to deliver services to their customers. The prevalent use of the cloud means that more data is stored outside the organization’s premises, which raises concerns about the security and privacy of the stored and processed data. This highlights the importance of effective security practices for securing the cloud infrastructure. The number of IoT devices is growing rapidly, and the technology is being employed in a wide range of sectors including smart healthcare, industry automation, and smart environments. These devices collect and exchange a great deal of information, some of which may contain critical and personal data of the users of the device. It is therefore important to protect the data collected and shared over the network; nevertheless, studies indicate that attacks on these devices are increasing, while a high percentage of IoT devices lack proper security measures to protect the devices, the data, and the privacy of the users. In this dissertation, we study the security of cloud computing and IoT and propose software-based security approaches, supported by hardware-based technologies, to provide robust measures for enhancing the security of these environments. To achieve this goal, we use obfuscation and diversification as software security techniques. Code obfuscation protects the software from malicious reverse engineering, and diversification mitigates the risk of large-scale exploits. We study trusted computing and Trusted Execution Environments (TEE) as hardware-based security solutions. The Trusted Platform Module (TPM) provides security and trust through a hardware root of trust and assures the integrity of a platform. 
We also study Intel SGX, a TEE solution that guarantees the integrity and confidentiality of the code and data loaded into its protected container, the enclave. More precisely, through obfuscation and diversification of the operating systems and APIs of IoT devices, we secure them at the application level, and through obfuscation and diversification of the communication protocols, we protect the communication of data between them at the network level. To secure cloud computing, we employ obfuscation and diversification techniques on the client-side cloud computing software. For an enhanced level of security, we employ the hardware-based security solutions TPM and SGX. In addition to security, these solutions ensure layered trust from the hardware up to the application. As a result of this PhD research, this dissertation addresses a number of security risks targeting IoT and cloud computing through the delivered publications and presents a brief outlook on future research directions.

    Políticas de Copyright de Publicações Científicas em Repositórios Institucionais: O Caso do INESC TEC

    The progressive transformation of scientific practices, driven by the development of new Information and Communication Technologies (ICT), has made it possible to increase access to information, gradually moving towards an opening of the research cycle. In the long term, this opening will make it possible to resolve a difficulty that researchers have faced: the existence of barriers, whether geographical or financial, that limit access conditions. Although scientific production is dominated mostly by large commercial publishers and is subject to the rules they impose, the Open Access movement, whose first public declaration, the Budapest Declaration (BOAI), dates from 2002, proposes significant changes that benefit authors and readers. This movement has been gaining importance in Portugal since 2003, with the creation of the first institutional repository at the national level. Institutional repositories emerged as a tool for disseminating the scientific production of an institution, with the aim of opening up research results both before publication and peer review (preprint) and after (postprint) and, consequently, increasing the visibility of the work done by a researcher and his or her institution. The present study, which analysed the copyright policies of the most relevant scientific publications of INESC TEC, showed not only that publishers are increasingly adopting policies that allow self-archiving of publications in institutional repositories, but also that there is still awareness-raising work to be done, for researchers as well as for the institution and society as a whole. The production of a set of recommendations, which include the implementation of an institutional policy that encourages self-archiving in the repository of publications developed within the institution, serves as a starting point for a greater appreciation of the scientific production of INESC TEC.