5 research outputs found

    Leveraging Identifier Naming Structures in Source Code and Bug Reports to Localize Relevant Bugs

    When bugs are found in source code, bug reports are created that contain information relevant to developers for locating and fixing the bug. In large source code repositories, it can be difficult and time-consuming for developers to manually analyze bug reports to locate a bug. The discovery of patterns between bug reports and source files has led to the creation of automated tools using various techniques. Automated bug localization techniques can reduce the manual effort required of developers by ranking the most probable locations of the bug using textual information from bug reports and source code. Although these approaches offer some assistance, the lexical mismatch between bug reports and source code makes it difficult to accurately locate the buggy source code file(s) using Information Retrieval (IR) techniques. Our research proposes a technique that takes advantage of the lexical and structural patterns observed in source code identifier names to help offset the mismatch between bug reports and their related source code files. Our observations reveal that there are lexical and structural identifier naming trends for different identifier types in source code. Using two open-source projects, we collected frequencies for observed identifier patterns across each project and applied those frequencies to matched word occurrences in bug reports across our evaluation data set to modify the significance of each matched word. Based on observations from our empirical analysis of the open-source repositories ElasticSearch and RxJava, we developed a method that modifies the significance of a word by altering its weight in the Term Frequency-Inverse Document Frequency (TF-IDF) vectorization of the bug report. The idea behind this approach is that if we come across a word perceived to be significant based on our observed identifier pattern frequency data, we can apply a weight to that word in the bug report vectorization to increase the cosine similarity score between the bug report and source file vectors. This work expands and improves upon previous work by Gharibi et al. [1], who propose a multicomponent approach that uses token matching, stack traces, semantic similarity, and a revised vector space model (rVSM). Specifically, our approach modifies the rVSM component, and our work is evaluated on the same three open-source software projects: AspectJ, SWT, and ZXing. Our results are comparable to those of Gharibi et al., with an improvement in some cases, and our approach outperforms many existing bug localization approaches. Top@N, Mean Reciprocal Rank (MRR), and Mean Average Precision (MAP) are the metrics used to evaluate and rank our work against other approaches, revealing some improvement in bug localization across the three open-source projects.
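
    A minimal sketch of the re-weighting idea described above, assuming invented pattern weights and toy documents (not the authors' implementation or data): terms that match observed identifier patterns get boosted in the bug report's TF-IDF vector before cosine similarity is computed against the source file vectors.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

bug_report = "NullPointerException when parsing empty index settings"
# Source files reduced to bags of terms from their split identifier names.
source_docs = {
    "IndexSettingsParser.java":  "index settings parser parse settings empty value",
    "SearchRequestBuilder.java": "search request builder build query index",
}

# Hypothetical identifier-pattern weights mined from the project: words that
# frequently occur in identifier names of a given type get boosted.
pattern_weight = {"index": 1.5, "settings": 1.5, "parsing": 1.3}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform([bug_report] + list(source_docs.values())).toarray()

# Boost the bug-report vector at columns whose term matched a pattern.
vocab = vectorizer.vocabulary_
for term, boost in pattern_weight.items():
    if term in vocab:
        matrix[0, vocab[term]] *= boost

# Rank source files by cosine similarity to the re-weighted bug report vector.
scores = cosine_similarity(matrix[0:1], matrix[1:]).ravel()
for path, score in sorted(zip(source_docs, scores), key=lambda p: -p[1]):
    print(f"{score:.3f}  {path}")
```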

    Testing of Neural Networks

    Research in Neural Networks is becoming more popular each year. Research has introduced different ways to utilize Neural Networks, but an important aspect is missing: Testing. There are only 16 papers that strictly address Testing Neural Networks, with the majority focusing on Deep Neural Networks and a small portion on Recurrent Neural Networks. Testing Recurrent Neural Networks is just as important as testing Deep Neural Networks, as they are used in products like Autonomous Vehicles, so there is a need to ensure that Recurrent Neural Networks are of high quality, reliable, and behave correctly. The few existing research papers on testing Recurrent Neural Networks focus only on the LSTM and GRU architectures, but more Recurrent Neural Network architectures exist, such as MGU, UGRNN, and Delta-RNN. This means we need to determine whether existing test metrics work for these architectures or whether new testing metrics must be introduced. This paper has two objectives. First, we conduct a comparative analysis of the 16 papers on Testing Neural Networks: we define the testing metrics and analyze features such as code availability, programming languages, and related software testing concepts. We then perform a case study with the Neuron Coverage test metric, conducting an experiment with unoptimized RNN models trained by a tool within EXAMM, an RNN framework, and optimized RNN models trained and optimized using ANTS. We compared the Neuron Coverage outputs under the assumption that the optimized models would perform better.
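
    As a rough illustration of the Neuron Coverage metric used in the case study (the fraction of neurons whose activation magnitude exceeds a threshold on at least one test input), here is a minimal sketch over a toy tanh RNN with random weights; the EXAMM and ANTS models themselves are not reproduced here.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 8, 3
W_xh = rng.normal(size=(hidden_size, input_size))   # input-to-hidden weights
W_hh = rng.normal(size=(hidden_size, hidden_size))  # recurrent weights

def rnn_activations(sequence):
    """Run a tanh RNN over one sequence, returning per-step hidden states."""
    h = np.zeros(hidden_size)
    states = []
    for x in sequence:
        h = np.tanh(W_xh @ x + W_hh @ h)
        states.append(h)
    return np.array(states)

def neuron_coverage(test_sequences, threshold=0.5):
    """Fraction of hidden neurons activated above threshold by any test input."""
    covered = np.zeros(hidden_size, dtype=bool)
    for seq in test_sequences:
        acts = np.abs(rnn_activations(seq))          # activation magnitudes
        covered |= (acts > threshold).any(axis=0)    # any time step counts
    return covered.sum() / hidden_size

tests = [rng.normal(size=(5, input_size)) for _ in range(10)]
print(f"Neuron coverage: {neuron_coverage(tests):.2f}")
```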

    Why did you clone these identifiers? Using Grounded Theory to understand Identifier Clones

    Developers spend most of their time comprehending source code, with some studies estimating this activity takes between 58% and 70% of a developer's time. To improve the readability of source code, and therefore the productivity of developers, it is important to understand what aspects of static code analysis and syntactic code structure hinder the understandability of code. Identifiers are a primary source of code comprehension due to their large volume and their role as implicit documentation of a developer's intent when writing code. Despite the critical role that identifiers play during program comprehension, there are no regulated naming standards for developers to follow when picking identifier names. Our research supports previous work aimed at understanding what makes a good identifier name, and practices to follow when picking names, by exploring a phenomenon that occurs during identifier naming: identifier clones. Identifier clones are two or more identifiers that are declared using the same name. This is an important yet unexplored phenomenon in identifier naming, in which developers intentionally give the same name to two or more identifiers in separate parts of a system. We must study identifier clones to understand their impact on program comprehension and to better understand the nature of identifier naming. To accomplish this, we conducted an empirical study on identifier clones detected in open-source engineered software systems and propose a taxonomy of identifier clones containing categories that can explain why they are introduced into systems and whether they represent naming antipatterns.
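
    To make the phenomenon concrete, the sketch below detects identifier clones (the same declared name appearing in separate parts of a system) over a toy Python corpus using the standard ast module; the study itself analyzes real open-source systems, and this detector is only illustrative.

```python
import ast
from collections import defaultdict

# Two hypothetical files that independently declare 'tokenize' and 'buffer'.
sources = {
    "parser.py": "def tokenize(text):\n    buffer = []\n    return buffer",
    "lexer.py":  "def tokenize(stream):\n    buffer = {}\n    return buffer",
}

declarations = defaultdict(set)   # identifier name -> files declaring it
for path, code in sources.items():
    for node in ast.walk(ast.parse(code)):
        if isinstance(node, (ast.FunctionDef, ast.ClassDef)):
            declarations[node.name].add(path)
        elif isinstance(node, ast.Name) and isinstance(node.ctx, ast.Store):
            declarations[node.id].add(path)

for name, files in declarations.items():
    if len(files) > 1:            # same name declared in 2+ separate places
        print(f"identifier clone: {name!r} in {sorted(files)}")
```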

    An intelligent model for specifying the granularity of microservices-based applications

    Microservices are an architectural and organizational approach to software development in which applications are composed of small, independent services that communicate through a well-defined application programming interface (API). Many companies use microservices to structure their systems, and the microservices architecture has also been applied in other areas such as the Internet of Things (IoT), edge computing, cloud computing, autonomous vehicle development, telecommunications, e-health, and e-learning. A major challenge when designing this type of application is finding a suitable partition or granularity for the microservices, a process that to date is carried out intuitively, based on the experience of the architect or the development team. Defining the size or granularity of microservices is an open research topic of wide interest; no patterns, methods, or models have been standardized for determining how small a microservice should be. The most widely used strategies for estimating microservice granularity are machine learning, semantic similarity, genetic programming, and domain engineering. This doctoral research proposes an intelligent model for specifying and evaluating the granularity of the microservices that make up an application, taking into account characteristics such as cognitive complexity, development time, coupling, cohesion, and communication. Chapter one presents the theoretical framework, states the research problem together with the research questions that address it, and presents the objectives and the research methodology through which a new practice is proposed: an intelligent model for specifying microservice granularity called the "Microservices Backlog". It also presents the research phases and methods used to answer the research questions. Chapter two presents the state of the art and the work related to this doctoral research, and identifies the metrics that have been used to define and evaluate microservice granularity. Chapter three characterizes the development process of microservices-based applications, illustrating its use in a case study called "Sinplafut". Chapter four describes the Microservices Backlog and defines each of its components: the parameterizer, the grouping component (a genetic algorithm and a semantic clustering algorithm based on unsupervised machine learning), the metrics evaluator, and the component that compares decompositions and candidate microservices. It also presents the mathematical formulation of the granularity of microservices-based applications. Chapter five presents the evaluation of the proposed practice, carried out iteratively using four case studies: two examples from the state of the art (Cargo Tracking and JPet-Store) and two real projects (Foristom Conferences and Sinplafut). The Microservices Backlog was used to obtain and evaluate the candidate microservices of the four applications. A comparative analysis was performed against methods proposed in the state of the art and against Domain-Driven Design (DDD), the most widely used method for defining the microservices that will form part of an application. The Microservices Backlog achieved low coupling, high cohesion, low complexity, and reduced communication between microservices compared with the state-of-the-art proposals and with DDD. Finally, chapter six presents the conclusions, contributions, limitations, and products obtained as a result of this thesis.
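
    As a rough illustration of how a candidate decomposition can be scored on coupling and cohesion, the sketch below evaluates an invented partition of operations over an invented call graph; it is not the Microservices Backlog's actual metric formulation.

```python
calls = {                     # directed calls between operations
    ("createOrder", "reserveStock"), ("createOrder", "chargeCard"),
    ("reserveStock", "updateInventory"), ("chargeCard", "sendReceipt"),
}
candidate = {                 # a candidate partition into microservices
    "orders":    {"createOrder", "sendReceipt"},
    "inventory": {"reserveStock", "updateInventory"},
    "payments":  {"chargeCard"},
}

def owner(op):
    """Return the candidate microservice that owns an operation."""
    return next(s for s, ops in candidate.items() if op in ops)

# Coupling: calls that cross a service boundary (lower is better).
coupling = sum(1 for a, b in calls if owner(a) != owner(b))

# Cohesion: internal calls relative to possible internal pairs (higher is better).
internal = sum(1 for a, b in calls if owner(a) == owner(b))
pairs = sum(len(ops) * (len(ops) - 1) // 2 for ops in candidate.values())
cohesion = internal / pairs if pairs else 0.0

print(f"coupling (cross-service calls): {coupling}")
print(f"cohesion (internal calls / internal pairs): {cohesion:.2f}")
```

    A grouping component such as a genetic algorithm can then search over partitions like `candidate`, using scores of this kind in its fitness function.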

    Text Similarity Between Concepts Extracted from Source Code and Documentation

    Context: Constant evolution in software systems often results in their documentation losing sync with the content of the source code. The traceability research field has often helped in the past with the aim of recovering links between code and documentation when the two fall out of sync. Objective: The aim of this paper is to compare the concepts contained within the source code of a system with those extracted from its documentation, in order to detect how similar these two sets are. If vastly different, the difference between the two sets might indicate considerable ageing of the documentation and a need to update it. Methods: In this paper we reduce the source code of 50 software systems to a set of key terms, each containing the concepts of one of the sampled systems. At the same time, we reduce the documentation of each system to another set of key terms. We then use four different approaches for set comparison to detect how similar the sets are. Results: Using the well-known Jaccard index as the benchmark for the comparisons, we discovered that the cosine distance has excellent comparative power, depending on the pre-training of the machine learning model. In particular, the SpaCy and FastText embeddings offer up to 80% and 90% similarity scores, respectively. Conclusion: For most of the sampled systems, the source code and the documentation tend to contain very similar concepts. Given the accuracy of one pre-trained model (e.g., FastText), it also becomes evident that a few systems show a measurable drift between the concepts contained in the documentation and in the source code.
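
    A minimal sketch of the two flavors of set comparison discussed here: the Jaccard index over the raw key-term sets versus cosine similarity over averaged term embeddings. The tiny embedding table is invented for illustration; the paper uses pre-trained SpaCy and FastText vectors.

```python
import numpy as np

code_terms = {"parser", "token", "buffer"}   # key terms from source code
doc_terms  = {"parser", "lexeme", "buffer"}  # key terms from documentation

# Jaccard index: overlap of the raw term sets.
jaccard = len(code_terms & doc_terms) / len(code_terms | doc_terms)

embedding = {            # hypothetical 3-d word vectors
    "parser": [0.9, 0.1, 0.0], "token":  [0.7, 0.3, 0.1],
    "lexeme": [0.6, 0.4, 0.1], "buffer": [0.1, 0.8, 0.2],
}

def centroid(terms):
    """Average the embeddings of a term set into one vector."""
    return np.mean([embedding[t] for t in terms], axis=0)

# Cosine similarity between the two centroid vectors: unlike Jaccard, it
# rewards semantically close terms ('token' vs 'lexeme') that do not match.
a, b = centroid(code_terms), centroid(doc_terms)
cosine = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(f"Jaccard: {jaccard:.2f}   cosine: {cosine:.2f}")
```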