Predicting Good Configurations for GitHub and Stack Overflow Topic Models
Software repositories contain large amounts of textual data, ranging from
source code comments and issue descriptions to questions, answers, and comments
on Stack Overflow. To make sense of this textual data, topic modelling is
frequently used as a text-mining tool for the discovery of hidden semantic
structures in text bodies. Latent Dirichlet allocation (LDA) is a commonly used
topic model that aims to explain the structure of a corpus by grouping texts.
LDA requires multiple parameters to work well, and there are only rough and
sometimes conflicting guidelines available on how these parameters should be
set. In this paper, we contribute (i) a broad study of parameters to arrive at
good local optima for GitHub and Stack Overflow text corpora, (ii) an
a-posteriori characterisation of text corpora related to eight programming
languages, and (iii) an analysis of corpus feature importance via per-corpus
LDA configuration. We find that (1) popular rules of thumb for topic modelling
parameter configuration are not applicable to the corpora used in our
experiments, (2) corpora sampled from GitHub and Stack Overflow have different
characteristics and require different configurations to achieve good model fit,
and (3) we can predict good configurations for unseen corpora reliably. These
findings support researchers and practitioners in efficiently determining
suitable configurations for topic modelling when analysing textual data
contained in software repositories.
Comment: to appear as a full paper at MSR 2019, the 16th International Conference on Mining Software Repositories.
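The per-corpus parameter search the abstract describes can be sketched as a plain grid search over LDA hyperparameters. The grid values and the scoring function below are illustrative assumptions, not the study's actual setup; a real scorer would train an LDA model (e.g. with gensim) and return coherence or negative perplexity.

```python
from itertools import product

def best_lda_config(corpus_score, grid):
    """Score every (num_topics, alpha, beta) combination and return the
    best-scoring configuration for this corpus. `corpus_score` is a
    callable that fits a topic model with the given configuration and
    returns a fitness value (higher is better)."""
    best_cfg, best_score = None, float("-inf")
    for num_topics, alpha, beta in product(*grid.values()):
        score = corpus_score(num_topics=num_topics, alpha=alpha, beta=beta)
        if score > best_score:
            best_cfg, best_score = (num_topics, alpha, beta), score
    return best_cfg, best_score

# Hypothetical parameter ranges; real studies use corpus-specific grids.
grid = {
    "num_topics": [10, 20, 50, 100],
    "alpha": [0.01, 0.1, 1.0],
    "beta": [0.01, 0.1],
}

# Stand-in scorer for demonstration only: it simply prefers 50 topics
# and small smoothing values.
def dummy_score(num_topics, alpha, beta):
    return -abs(num_topics - 50) - alpha - beta

cfg, score = best_lda_config(dummy_score, grid)
```

Because the fit of a configuration depends on the corpus, running the same search on a GitHub corpus and a Stack Overflow corpus can (and, per the paper, does) yield different winners.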
BenchPress: Analyzing Android App Vulnerability Benchmark Suites
In recent years, various benchmark suites have been developed to evaluate the
efficacy of Android security analysis tools. The choice of such benchmark
suites used in tool evaluations is often based on the availability and
popularity of suites and not on their characteristics and relevance. One of the
reasons for such choices is the lack of information about the characteristics
and relevance of benchmark suites.
In this context, we empirically evaluated four Android specific benchmark
suites: DroidBench, Ghera, IccBench, and UBCBench. For each benchmark suite, we
identified the APIs used by the suite that were discussed on Stack Overflow in
the context of Android app development and measured the usage of these APIs in
a sample of 227K real world apps (coverage). We also compared each pair of
benchmark suites to identify the differences between them in terms of API
usage. Finally, we identified security-related APIs used in real-world apps but
not in any of the above benchmark suites to assess the opportunities to extend
benchmark suites (gaps).
The findings in this paper can help 1) Android security analysis tool
developers choose benchmark suites that are best suited to evaluate their tools
(informed by coverage and pairwise comparison) and 2) Android app vulnerability
benchmark creators develop and extend benchmark suites (informed by gaps).
Comment: Updates based on AMobile 2019 review
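The three measurements the abstract names (coverage, pairwise comparison, gaps) all reduce to set operations over API usage. The sketch below is a minimal illustration under assumed definitions: coverage as the fraction of sampled apps using at least one suite API, and gaps as security APIs used by apps but absent from every suite; the API names and suite contents are hypothetical.

```python
def coverage(suite_apis, app_api_usage):
    """Fraction of sampled apps exercising at least one API from the suite.
    `app_api_usage` maps app id -> set of APIs the app calls."""
    hit = sum(1 for apis in app_api_usage.values() if apis & suite_apis)
    return hit / len(app_api_usage)

def pairwise_difference(suite_a, suite_b):
    """APIs exercised by one suite but not the other."""
    return suite_a - suite_b, suite_b - suite_a

def gaps(security_apis, app_api_usage, *suites):
    """Security-related APIs used by real apps but absent from every suite."""
    used = set().union(*app_api_usage.values())
    covered = set().union(*suites)
    return (security_apis & used) - covered

# Toy data; the study mined 227K real-world apps.
apps = {
    "app1": {"WebView.loadUrl", "Cipher.getInstance"},
    "app2": {"Intent.setData"},
}
droidbench = {"Intent.setData"}   # hypothetical suite contents
ghera = {"WebView.loadUrl"}

ghera_cov = coverage(ghera, apps)
only_a, only_b = pairwise_difference(droidbench, ghera)
missing = gaps({"Cipher.getInstance"}, apps, droidbench, ghera)
```

Here `missing` surfaces a security API that real apps use but neither hypothetical suite covers, which is exactly the kind of extension opportunity the paper calls a gap.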
Exploratory Analysis of Topics on Stack Overflow Using LDA (Latent Dirichlet Allocation)
Topic modeling is a machine learning problem, which aims to extract, given a collection
of documents, the main topics that represent the subjects covered by the collection.
Documents can be generated from different distributions over topics, with each topic formed by
a probabilistic distribution over words. To infer the set of topics that generated a collection of
documents, probabilistic techniques are applied that reverse this generative process. In this work, an exploratory
analysis is performed on the Stack Overflow database; topic modeling is used
to extract the desired information, applying Latent Dirichlet Allocation
(LDA) to extract the topics from the database. As a result, the topics that represent the collection
are obtained, with the most recurring themes related to web programming, mobile development,
and version control. In addition, different numbers of topics are compared, evaluated using metrics
that verify the coherence of their words; among the values analyzed, 50 topics
yielded the best results for representing the collection.
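Selecting the number of topics by word coherence, as this abstract does, needs a coherence metric. A common choice (an assumption here, since the abstract does not name its metric) is UMass coherence, which scores a topic's top words by how often they co-occur in documents. A minimal stdlib sketch:

```python
import math
from itertools import combinations

def umass_coherence(topic_words, documents):
    """UMass coherence for one topic: sum over ordered word pairs of
    log((D(wi, wj) + 1) / D(wj)), where D(...) counts the documents
    containing the given word(s). Values closer to zero indicate a
    more coherent topic. `topic_words` should be ordered by frequency."""
    docs = [set(d) for d in documents]
    def d(*words):
        return sum(1 for doc in docs if all(w in doc for w in words))
    score = 0.0
    for wi, wj in combinations(topic_words, 2):
        score += math.log((d(wi, wj) + 1) / d(wj))
    return score

# Toy corpus; a real run would use tokenized Stack Overflow posts.
docs = [["git", "commit", "branch"], ["git", "merge"], ["css", "html"]]
score = umass_coherence(["git", "commit"], docs)
```

Fitting LDA for several candidate topic counts and keeping the count with the best average per-topic coherence is how a result like "50 topics is best" is typically obtained.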
Challenges and Barriers of Using Low Code Software for Machine Learning
As big data grows ubiquitous across many domains, more and more stakeholders
seek to develop Machine Learning (ML) applications on their data. The success
of an ML application usually depends on the close collaboration of ML experts
and domain experts. However, the shortage of ML engineers remains a fundamental
problem. Low-code Machine learning tools/platforms (aka, AutoML) aim to
democratize ML development to domain experts by automating many repetitive
tasks in the ML pipeline. This research presents an empirical study of around
14k posts (questions + accepted answers) from Stack Overflow (SO) that
contained AutoML-related discussions. We examine how these topics are spread
across the various Machine Learning Life Cycle (MLLC) phases and their
popularity and difficulty. This study offers several interesting findings.
First, we find 13 AutoML topics that we group into four categories. The MLOps
topic category (43% questions) is the largest, followed by Model (28%
questions), Data (27% questions), and Documentation (2% questions). Second, most
questions are asked during the Model training (29%) (i.e., implementation) and
Data preparation (25%) MLLC phases. Third, AutoML practitioners find the
MLOps topic category most challenging, especially topics related to model
deployment & monitoring and Automated ML pipeline. These findings have
implications for all three AutoML stakeholders: AutoML researchers, AutoML
service vendors, and AutoML developers. Collaboration between academia and industry can
improve different aspects of AutoML, such as better DevOps/deployment support
and tutorial-based documentation.
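The popularity and difficulty measurements in studies like this one are usually simple per-topic aggregates over mined posts. The exact definitions below (popularity as average views, difficulty as the share of questions without an accepted answer) are common proxies in the SO-mining literature, assumed rather than taken from this paper:

```python
def topic_popularity(posts):
    """Average view count of the questions assigned to a topic."""
    return sum(p["views"] for p in posts) / len(posts)

def topic_difficulty(posts):
    """Share of questions that never received an accepted answer --
    a common proxy for how hard a topic is for practitioners."""
    unanswered = sum(1 for p in posts if not p["accepted"])
    return unanswered / len(posts)

# Toy data; real studies mine the Stack Overflow data dump.
mlops_posts = [
    {"views": 900, "accepted": False},
    {"views": 300, "accepted": True},
]

pop = topic_popularity(mlops_posts)
hardness = topic_difficulty(mlops_posts)
```

Ranking topic categories by these two numbers is what lets the study call MLOps both the largest and the most challenging category.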
Configuring and Assembling Information Retrieval based Solutions for Software Engineering Tasks.
Information Retrieval (IR) approaches are used to leverage textual or unstructured data generated during the software development process to support various software engineering (SE) tasks (e.g., concept location, traceability link recovery, change impact analysis). Two of the most important steps for applying IR techniques to support SE tasks are preprocessing the corpus and configuring the IR technique, and these steps can significantly influence the outcome and the amount of effort developers have to spend on these maintenance tasks. We present the use of Genetic Algorithms (GAs) to automatically configure and assemble an IR process to support SE tasks. The approach, named IR-GA, determines the (near) optimal solution to be used for each step of the IR process without requiring any training. We applied IR-GA to three different SE tasks, and the results of the study indicate that IR-GA outperforms approaches previously used in the literature, and that it does not significantly differ from an ideal upper bound that could be achieved by a supervised approach and a combinatorial approach.
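The core idea of evolving an IR configuration with a GA can be sketched in a few dozen lines. The gene choices below (preprocessing flags, IR model, topic count) and the stand-in fitness function are illustrative assumptions, not IR-GA's actual search space; a real fitness would score a configuration with an unsupervised measure such as internal cluster quality over the retrieved documents, which is what lets the search run without training data.

```python
import random

# Each gene position is one IR-process decision (illustrative options).
GENES = [
    ("stopwords",  [True, False]),
    ("stemming",   [True, False]),
    ("ir_model",   ["VSM", "LSI", "LDA"]),
    ("num_topics", [50, 100, 200]),
]

def random_config(rng):
    return {name: rng.choice(opts) for name, opts in GENES}

def crossover(a, b, rng):
    # Uniform crossover: each gene comes from either parent.
    return {name: (a if rng.random() < 0.5 else b)[name] for name, _ in GENES}

def mutate(cfg, rng, rate=0.1):
    # Occasionally resample a gene from its option list.
    return {name: (rng.choice(opts) if rng.random() < rate else cfg[name])
            for name, opts in GENES}

def ir_ga(fitness, generations=30, pop_size=20, seed=0):
    """Evolve an IR configuration: keep the fitter half each generation
    (elitism), refill with mutated offspring of surviving parents."""
    rng = random.Random(seed)
    pop = [random_config(rng) for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        survivors = pop[: pop_size // 2]
        children = [mutate(crossover(rng.choice(survivors),
                                     rng.choice(survivors), rng), rng)
                    for _ in range(pop_size - len(survivors))]
        pop = survivors + children
    return max(pop, key=fitness)

# Stand-in fitness for demonstration: prefers one particular configuration.
def dummy_fitness(cfg):
    return (cfg["stopwords"], cfg["ir_model"] == "LSI",
            cfg["num_topics"] == 100)

best = ir_ga(dummy_fitness)
```

Because survivors are carried over unchanged, the best configuration found so far is never lost between generations, which keeps the search monotone even with random mutation.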
Explainable, Security-Aware and Dependency-Aware Framework for Intelligent Software Refactoring
As software systems continue to grow in size and complexity, their maintenance becomes more challenging and costly. Even for the most technologically sophisticated and competent organizations, building and maintaining high-performing software applications with high-quality code is an extremely challenging and expensive endeavor. Software refactoring is widely recognized as the key component for maintaining high-quality software by restructuring existing code and reducing technical debt. However, refactoring is difficult to achieve and often neglected due to several limitations in the existing refactoring techniques that reduce their effectiveness. These limitations include, but are not limited to, detecting refactoring opportunities, recommending specific refactoring activities, and explaining the recommended changes. Existing techniques are mainly focused on the use of quality metrics such as coupling, cohesion, and the Quality Metrics for Object Oriented Design (QMOOD). However, there are many other factors identified in this work to assist and facilitate different maintenance activities for developers:
1. To structure the refactoring field and existing research results, this dissertation provides the most scalable and comprehensive systematic literature review analyzing the results of 3183 research papers on refactoring covering the last three decades. Based on this survey, we created a taxonomy to classify the existing research, identified research trends and highlighted gaps in the literature for further research.
2. To draw attention to what should be the current refactoring research focus from the developers’ perspective, we carried out the first large scale refactoring study on the most popular online Q&A forum for developers, Stack Overflow. We collected and analyzed posts to identify what developers ask about refactoring, the challenges that practitioners face when refactoring software systems, and what should be the current refactoring research focus from the developers’ perspective.
3. To improve the detection of refactoring opportunities in terms of quality and security in the context of mobile apps, we designed a framework that recommends the files to be refactored based on user reviews. We also considered the detection of refactoring opportunities in the context of web services. We proposed a machine learning-based approach that helps service providers and subscribers predict the quality of service with the least costs. Furthermore, to help developers make an accurate assessment of the quality of their software systems and decide if the code should be refactored, we propose a clustering-based approach to automatically identify the preferred benchmark to use for the quality assessment of a project.
4. Regarding the refactoring generation process, we proposed different techniques to enhance the change operators and seeding mechanism by using the history of applied refactorings and incorporating refactoring dependencies in order to improve the quality of the refactoring solutions. We also introduced the security aspect when generating refactoring recommendations, by investigating the possible impact of improving different quality attributes on a set of security metrics and finding the best trade-off between them. In another approach, we recommend refactorings to prioritize fixing quality issues in security-critical files, improve quality attributes and remove code smells.
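Finding "the best trade-off" between quality attributes and security metrics, as point 4 describes, is usually framed as multi-objective optimization, where the result is the set of non-dominated (Pareto-optimal) refactoring solutions. A minimal sketch of that dominance filter, with hypothetical refactoring names and metric deltas:

```python
def pareto_front(solutions):
    """Keep solutions not dominated on (quality, security): a solution is
    dominated if another is at least as good on both objectives and
    strictly better on at least one."""
    front = []
    for s in solutions:
        dominated = any(
            o["quality"] >= s["quality"] and o["security"] >= s["security"]
            and (o["quality"] > s["quality"] or o["security"] > s["security"])
            for o in solutions)
        if not dominated:
            front.append(s)
    return front

# Toy candidate refactoring solutions with assumed metric improvements.
candidates = [
    {"name": "extract_class", "quality": 0.8, "security": 0.2},
    {"name": "move_method",   "quality": 0.5, "security": 0.6},
    {"name": "inline_temp",   "quality": 0.4, "security": 0.1},
]
front = pareto_front(candidates)
```

Here `inline_temp` is dropped because `extract_class` beats it on both objectives, while the two surviving solutions represent different quality/security trade-offs for the developer to choose between.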
All the above contributions were validated at large scale on thousands of open source and industry projects in collaboration with industry partners and the open source community. The contributions of this dissertation are integrated into a cloud-based refactoring framework that is currently used by practitioners.
Ph.D. dissertation, College of Engineering & Computer Science, University of Michigan-Dearborn. http://deepblue.lib.umich.edu/bitstream/2027.42/171082/1/Chaima Abid Final Dissertation.pdf