6 research outputs found

    Developing an aspect-based sentiment lexicon for software engineering

    Natural-language processing (NLP) is an interdisciplinary research field whose core concern is understanding and analyzing written language. Under this broad umbrella fit several sub-fields with different goals, such as sentiment analysis and opinion mining, which are used to discover either the underlying emotions in a text or the opinions of its writer. Sentiment analysis has received a lot of research attention in recent years due to its applicability to many domains and topics, such as products, companies, and movies. It can be used to discover writers’ attitudes towards these topics. Sentiment mining can be broadly divided into two approaches: lexicon-based and machine-learning-based. The lexicon-based approach uses a prebuilt dictionary, or lexicon, to classify texts in an unsupervised manner; it is the more widely used of the two, as the machine-learning approach needs a high-quality corpus for training the classifier. In this thesis an aspect-based lexicon for software engineering was built. As the objective was the creation of an artefact, design science methodology was applied to ensure the research was done rigorously. The requirements for the lexicon were derived from a literature review that yielded both necessary and supplementary requirements. The build process was similarly derived from the literature. This process has several steps, starting with the creation of the corpus, continuing with identifying the aspects and creating the seed set, and finally expanding this set into the final lexicon. The result was a lexicon containing over 10,000 unique software engineering aspects. All of these were scored in terms of valence, arousal and dominance, which together form the VAD score. Four different methods were used to obtain four different scores for each aspect. This guarantees that the most suitable calculation method, or even a combination of methods, can be used in future research.
    The lexicon was evaluated against an existing generic lexicon to see how a domain-specific lexicon differs from it. The differences relate to how much deviation there is within the VAD scores and how the dimensions correlate with each other; in particular, the correlation between valence and arousal differed notably from that of the generic lexicon. The automated parts of aspect annotation and WordNet-based seed expansion were evaluated to see whether they can be performed without supervision, as manual annotation and expansion require a lot of resources. The research showed one possible way of creating an aspect-based lexicon, and steps that can be taken to create one regardless of the domain. It also showed avenues for future research, the broadest of which are the development of a gold-standard dataset for software engineering that uses VAD scores and the definition of a systematic, unified process for creating lexicons for natural language processing purposes.
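The lexicon-based, unsupervised scoring described above can be sketched as a simple lookup-and-average over aspect terms. The tiny lexicon and its VAD values below are illustrative placeholders, not data from the actual thesis artefact.

```python
# Minimal sketch of lexicon-based sentiment scoring with VAD values
# (valence, arousal, dominance). The entries are made-up placeholders.
VAD_LEXICON = {
    "crash":    (0.10, 0.85, 0.30),
    "refactor": (0.60, 0.45, 0.65),
    "bug":      (0.15, 0.70, 0.35),
    "merge":    (0.55, 0.40, 0.60),
}

def score_text(text, lexicon=VAD_LEXICON):
    """Average the VAD scores of every lexicon aspect found in the text."""
    tokens = text.lower().split()
    hits = [lexicon[t] for t in tokens if t in lexicon]
    if not hits:
        return None  # no known aspects -> no score
    n = len(hits)
    # zip(*hits) groups all valence values, all arousal values, etc.
    return tuple(sum(dim) / n for dim in zip(*hits))

v, a, d = score_text("another crash after the merge")
```

Because the classification is a pure dictionary lookup, no labeled training corpus is needed, which is the practical advantage of lexicon-based approaches noted above.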

    Predicting technical debt from commit contents: reproduction and extension with automated feature selection

    Abstract Self-admitted technical debt refers to sub-optimal development solutions that are expressed in written code comments or commits. We reproduce and improve on prior work by Yan et al. (2018) on detecting commits that introduce self-admitted technical debt. We use multiple natural language processing methods: Bag-of-Words, topic modeling, and word embedding vectors. We study 5 open-source projects. Our NLP approach uses logistic Lasso regression from Glmnet to automatically select the best predictor words. A manually labeled dataset from prior work that identified self-admitted technical debt from code-level commits serves as ground truth. Our approach achieves a +0.15 better area under the ROC curve than the prior work when comparing only commit message features, and a +0.03 better result overall when replacing manually selected features with automatically selected words. In both cases, the improvement was statistically significant (p < 0.0001). Our work has four main contributions: comparing different NLP techniques for SATD detection, improving results over previous work, showing how to generate generalizable predictor words when using multiple repositories, and producing a list of words correlating with SATD. As a concrete result, we release the list of predictor words that correlate positively with SATD, as well as our datasets and scripts, to enable replication studies and to aid in the creation of future classifiers.
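The Bag-of-Words representation fed to the Lasso classifier can be sketched as below. The study itself used Glmnet in R; this is only a minimal pure-Python illustration of the count-matrix construction, with made-up commit messages.

```python
from collections import Counter

def bag_of_words(messages):
    """Turn commit messages into a shared-vocabulary count matrix,
    the representation a Lasso classifier would consume."""
    vocab = sorted({tok for m in messages for tok in m.lower().split()})
    rows = []
    for m in messages:
        counts = Counter(m.lower().split())
        # one count per vocabulary word, in fixed vocabulary order
        rows.append([counts.get(tok, 0) for tok in vocab])
    return vocab, rows

msgs = ["fix this hack later", "add feature", "hack around broken api"]
vocab, X = bag_of_words(msgs)
```

On top of such a matrix, an L1-penalized (Lasso) logistic regression drives most word weights to zero, which is what makes the surviving non-zero words an automatically selected predictor list.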

    PENTACET data: 23 million contextual code comments and 250,000 SATD comments

    Abstract Most Self-Admitted Technical Debt (SATD) research utilizes explicit SATD features such as ‘TODO’ and ‘FIXME’ for SATD detection. A closer look reveals that several SATD studies use simple (‘Easy to Find’) SATD code comments without contextual data (the preceding and succeeding source code context). This work addresses that gap with the PENTACET (or 5C) dataset. PENTACET is a large set of Curated Contextual Code Comments per Contributor and the most extensive SATD dataset to date. We mine 9,096 Open Source Software Java projects totaling over 400 million LOC. The outcome is a dataset with 23 million code comments, the preceding and succeeding source code context for each comment, and more than 250,000 SATD comments, including both ‘Easy to Find’ and ‘Hard to Find’ SATD. We believe the PENTACET data will further SATD research using Artificial Intelligence techniques.
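Attaching preceding and succeeding source context to each comment, as PENTACET does, can be illustrated with a simple line-based scan. This is a hypothetical sketch, not the actual PENTACET mining pipeline, and it only handles single-line `//` comments.

```python
def comments_with_context(source_lines, window=2):
    """Collect // comments together with `window` lines of preceding
    and succeeding code context."""
    results = []
    for i, line in enumerate(source_lines):
        if line.strip().startswith("//"):
            before = source_lines[max(0, i - window):i]
            after = source_lines[i + 1:i + 1 + window]
            results.append({"comment": line.strip(),
                            "before": before, "after": after})
    return results

java = [
    "int total = 0;",
    "// TODO handle overflow",
    "total += x;",
    "return total;",
]
found = comments_with_context(java)
```

The surrounding code is exactly what a classifier needs to recognize ‘Hard to Find’ SATD that the comment text alone does not reveal.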

    Keyword-labeled self-admitted technical debt and static code analysis have significant relationship but limited overlap

    Abstract Technical debt represents sub-optimal choices made in development, which are beneficial in the short term but not in the long run. Consciously admitted debt that is marked with a keyword, e.g., TODO, is called keyword-labeled self-admitted technical debt (KL-SATD). KL-SATD can lead to adverse effects in software development, e.g., a rise in complexity within the developed software. We investigated the relationship between KL-SATD from source code comments and reports from the highly popular industrial program analysis tool SonarQube. The goal was to find which SonarQube metrics and issues are related to KL-SATD introduction and removal, and how many KL-SATD comments in the context of an issue address that issue. We performed a study with 33 software repositories. We analyzed the changes in SonarQube reports (sqale index, reliability and security remediation metrics, and SonarQube issues) and their relationship to KL-SATD addition and removal with mixed model analysis. We manually annotated a sample to investigate how many KL-SATD comments are in the context of SonarQube issues and how many address them directly. KL-SATD is associated with a reduction in code maintainability measured with SonarQube’s sqale index. KL-SATD removal is associated with an increase in code maintainability (sqale index) and in reliability measured with SonarQube’s reliability remediation effort. The introduction and removal of KL-SATD are predominantly related to code smells, and not to vulnerabilities and bugs. Manual annotation revealed that 36% of KL-SATD comments are in the context of a SonarQube issue, but only 15% of the comments address an issue directly. This means that despite the statistical relationship between KL-SATD comments and SonarQube reports, there is a large set of KL-SATD comments in areas that SonarQube reports as clean, or free of maintainability issues.
KL-SATD introduction and removal are connected mainly to code smells, tying them to maintainability rather than reliability or security. This is reinforced by the relationship with the sqale index, as well as by the dominance of code smells among the SonarQube issues. Many KL-SATD issues have characteristics that go beyond static analysis tools and require future studies extending the capabilities of the current tools. As KL-SATD comments and SonarQube reports appear to have limited overlap, this suggests that they are complementary and both are needed for a comprehensive view of code maintainability. The study also presents rule violations developers should be aware of regarding KL-SATD introduction and removal.

    Prevalence, contents and automatic detection of KL-SATD

    Abstract When developers use keywords such as TODO and FIXME in source code comments to describe self-admitted technical debt (SATD), we refer to it as Keyword-Labeled SATD (KL-SATD). We study KL-SATD in 33 software repositories containing 13,588 KL-SATD comments. We find that the median percentage of KL-SATD comments among all comments is only 1.52%. We find that KL-SATD comment contents include words expressing code changes and uncertainty, such as remove, fix, maybe and probably, which makes them different from other comments. KL-SATD comment contents are similar to the manually labeled SATD comments of prior work. Our machine learning classifier using logistic Lasso regression performs well in detecting KL-SATD comments (AUC-ROC 0.88). Finally, we demonstrate that, using machine learning, we can identify comments that currently lack a SATD keyword but should have one. Automating the identification of comments that lack SATD keywords can save time and effort by replacing manual identification. Using KL-SATD offers the potential to bootstrap a complete SATD detector.
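The keyword-labeling step that defines KL-SATD amounts to a keyword match over comment text, roughly as below. The keyword set here is illustrative; the study's exact list may differ.

```python
import re

# Illustrative debt-marker keywords; \b keeps "hack" from matching
# inside unrelated words, and IGNORECASE catches lower-case markers.
SATD_KEYWORDS = re.compile(r"\b(TODO|FIXME|HACK|XXX)\b", re.IGNORECASE)

def is_kl_satd(comment):
    """Label a comment as KL-SATD if it carries an explicit debt keyword."""
    return bool(SATD_KEYWORDS.search(comment))

comments = ["// TODO: remove this workaround",
            "// computes the checksum",
            "# fixme maybe refactor later"]
flags = [is_kl_satd(c) for c in comments]
```

Such keyword flags are cheap to compute at scale, which is what makes KL-SATD a practical bootstrap label for training a fuller SATD classifier.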

    Data balancing improves self-admitted technical debt detection

    Abstract A high imbalance exists between technical debt and non-technical-debt source code comments. Such imbalance affects Self-Admitted Technical Debt (SATD) detection performance, and the existing literature lacks empirical evidence on the choice of balancing technique. In this work, we evaluate the impact of multiple balancing techniques, including data-level, classifier-level, and hybrid, for SATD detection in Within-Project and Cross-Project setups. Our results show that the data-level balancing technique SMOTE or the classifier-level ensemble approaches Random Forest or XGBoost are reasonable choices, depending on whether the goal is to maximize Precision, Recall, F1, or AUC-ROC. We compared our best-performing model with the previous SATD detection benchmark (a cost-sensitive Convolutional Neural Network). Interestingly, the top-performing XGBoost with SMOTE sampling improved the Within-Project F1 score by 10% but fell short in the Cross-Project setup by 9%. This supports the higher generalization capability of deep learning in Cross-Project SATD detection; yet, when working within individual projects, classical machine learning algorithms can deliver better performance. We also evaluate and quantify the impact of duplicate source code comments on SATD detection performance. Finally, we employ SHAP and discuss the interpreted SATD features. We have included a replication package and shared a web-based SATD prediction tool with the balancing techniques in this study.
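The core idea of SMOTE, the data-level balancing technique named above, is to synthesize minority-class samples by interpolating between a minority point and one of its nearest minority neighbours. The study would have used a library implementation (e.g., imbalanced-learn); this pure-Python sketch only shows the interpolation idea on toy 2-D feature vectors.

```python
import random

def smote_like_oversample(minority, majority_count, k=2, seed=0):
    """SMOTE-style sketch: grow the minority class to `majority_count`
    samples by interpolating between a random minority point and one of
    its k nearest minority neighbours. Features are plain lists."""
    rng = random.Random(seed)
    synthetic = list(minority)

    def dist(a, b):  # squared Euclidean distance
        return sum((x - y) ** 2 for x, y in zip(a, b))

    while len(synthetic) < majority_count:
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: dist(base, p))[:k]
        nb = rng.choice(neighbours)
        gap = rng.random()  # random point on the segment base -> nb
        synthetic.append([b + gap * (n - b) for b, n in zip(base, nb)])
    return synthetic

minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
balanced = smote_like_oversample(minority, majority_count=8)
```

Because the synthetic points lie between real minority samples rather than duplicating them, the classifier sees a denser but still plausible minority region, which is why SMOTE often beats plain random oversampling.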