11 research outputs found

    Popularity Prediction of Reddit Texts

    Get PDF
    Popularity prediction is a useful technique for marketers to anticipate the success of marketing campaigns, to build recommendation systems that suggest new products to consumers, and to develop targeted advertising. Researchers likewise use popularity prediction to measure how popularity changes within a community or within a given timespan. In this paper, I explore ways to predict the popularity of posts on reddit.com, which is a blend of news aggregator and community forum. I frame popularity prediction as a text classification problem and attempt to solve it by first identifying topics in the text and then classifying whether the topics identified are more characteristic of popular or unpopular texts. This classifier is then used to label unseen texts as popular or not, depending on the topics found in these new posts. I explore the use of Latent Dirichlet Allocation and term frequency-inverse document frequency for topic identification, and naïve Bayes classifiers and support vector machines for classification. The relation between topics and popularity is dynamic: topics in Reddit communities can wax and wane in popularity. Despite this inherent variability, the methods explored in the paper are effective, showing prediction accuracy between 60% and 75%. The study contributes to the field in several ways. For example, it provides novel data for research and development, not only for text classification but also for the study of the relation between topics and popularity in general. The study also helps us better understand different topic identification and classification methods by illustrating their effectiveness on real-life data from a fast-changing and multi-purpose website.
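    As a rough illustration of the classification step in such a pipeline, here is a minimal multinomial naive Bayes over bag-of-words counts in plain Python. The posts, labels, and tokenizer are invented for illustration; the paper additionally uses LDA topics, tf-idf weighting, and SVMs.

```python
import math
from collections import Counter, defaultdict

# Toy training corpus; posts and popularity labels are invented.
train = [
    ("cute cat picture from my backyard", "popular"),
    ("breaking news about the election", "popular"),
    ("my homework question about calculus", "unpopular"),
    ("looking for advice on tax forms", "unpopular"),
]

def tokenize(text):
    return text.lower().split()

# Multinomial naive Bayes with add-one (Laplace) smoothing.
class NaiveBayes:
    def fit(self, docs):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        for text, label in docs:
            self.class_counts[label] += 1
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        scores = {}
        total_docs = sum(self.class_counts.values())
        for label in self.class_counts:
            # log prior for the class
            score = math.log(self.class_counts[label] / total_docs)
            total_words = sum(self.word_counts[label].values())
            for w in tokenize(text):
                count = self.word_counts[label][w] + 1  # Laplace smoothing
                score += math.log(count / (total_words + len(self.vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

clf = NaiveBayes().fit(train)
print(clf.predict("another cat picture"))  # label chosen by learned word statistics
```

    With a real dataset the tokenizer would at minimum strip punctuation and stopwords, but the scoring logic stays the same.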

    Towards more accurate content categorization of API discussions

    Get PDF
    Nowadays, software developers often discuss the usage of various APIs in online forums. Automatically assigning pre-defined semantic categories to API discussions in these forums could help manage the data in online forums and assist developers in searching for useful information. We refer to this process as content categorization of API discussions. To solve this problem, Hou and Mo proposed the usage of naive Bayes multinomial, which is an effective classification algorithm. In this paper, we propose a Cache-bAsed compoSitE algorithm, CASE for short, to automatically categorize API discussions. Considering that the content of an API discussion contains both textual description and source code, CASE has 3 components that analyze an API discussion in 3 different ways: text, code, and original. In the text component, CASE only considers the textual description; in the code component, CASE only considers the source code; in the original component, CASE considers the original content of an API discussion, which might include both textual description and source code. Next, for each component, since different terms (i.e., words) have different affinities to different categories, CASE caches a subset of terms which have the highest affinity scores to each category, and builds a classifier based on the cached terms. Finally, CASE combines all 3 classifiers to achieve a better accuracy score. We evaluate the performance of CASE on 3 datasets which contain a total of 1,035 API discussions. The experiment results show that CASE achieves accuracy scores of 0.69, 0.77, and 0.96 for the 3 datasets respectively, outperforming the state-of-the-art method proposed by Hou and Mo by 11%, 10%, and 2%, respectively.
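    The term-caching idea behind CASE can be sketched as follows. The snippets, categories, affinity measure (relative frequency), and cache size are illustrative assumptions rather than the paper's exact formulation, and the full algorithm builds three such classifiers (text, code, original) and combines them.

```python
from collections import Counter, defaultdict

# Toy API-discussion snippets; categories and text are invented.
train = [
    ("how do I open a file stream", "io"),
    ("read bytes from the input stream", "io"),
    ("create a thread and join it", "concurrency"),
    ("lock the mutex before the thread runs", "concurrency"),
]

CACHE_SIZE = 3  # keep only the highest-affinity terms per category

def term_affinities(docs):
    """Affinity of a term to a category: its relative frequency there."""
    counts = defaultdict(Counter)
    for text, cat in docs:
        counts[cat].update(text.split())
    affinity = {}
    for cat, ctr in counts.items():
        total = sum(ctr.values())
        affinity[cat] = {w: n / total for w, n in ctr.items()}
    return affinity

def build_caches(docs):
    affinity = term_affinities(docs)
    return {cat: set(sorted(scores, key=scores.get, reverse=True)[:CACHE_SIZE])
            for cat, scores in affinity.items()}

def classify(text, caches):
    words = set(text.split())
    # score a category by how many of its cached terms the text hits
    scores = {cat: len(words & cache) for cat, cache in caches.items()}
    return max(scores, key=scores.get)

caches = build_caches(train)
print(classify("how do I read a stream", caches))
```

    A realistic version would use a proper affinity score (e.g., one that penalizes terms frequent in every category) so that stopwords do not end up in the caches.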

    An Empirical Study of Bug Report Field Reassignment

    Get PDF
    NSF

    Visual representation of bug report assignment recommendations

    Get PDF
    Software development projects typically use an issue tracking system where the project members and users can either report faults or request additional features. Each of these reports needs to be triaged to determine such things as the priority of the report or which developers should be assigned to resolve it. To assist a triager with report assignment, an assignment recommender has been suggested as a means of improving the process. However, proposed assignment recommenders typically present a list of developer names without an explanation of the rationale. This work focuses on providing visual explanations for bug report assignment recommendations. We examine the use of supervised and unsupervised machine learning algorithms for the assignment recommendation, from which we can provide recommendation rationale. We explore the use of three types of graphs for the presentation of the rationale and validate their use cases and usability through a small user study.
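    The core idea of pairing a recommendation with human-readable rationale can be sketched with a toy term-overlap recommender. The developer names, report terms, and scoring rule below are all invented; the thesis itself uses machine-learned models and graph visualizations rather than this simple overlap count.

```python
from collections import Counter

# Toy history: terms from reports each developer previously resolved.
history = {
    "alice": Counter("crash render gpu shader crash".split()),
    "bob": Counter("login timeout network retry".split()),
}

def recommend(report, history):
    """Rank developers and keep the shared terms as the rationale."""
    words = report.split()
    ranked = []
    for dev, profile in history.items():
        shared = [w for w in words if profile[w] > 0]
        ranked.append((len(shared), dev, shared))
    ranked.sort(reverse=True)
    return ranked

for score, dev, shared in recommend("render crash on gpu", history):
    print(dev, score, shared)
```

    The `shared` list is the seed of a visual explanation: instead of only a name, the triager sees which terms of the new report connect it to a developer's past work.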

    Empirical Analysis and Automated Classification of Security Bug Reports

    Get PDF
    With the ever expanding amount of sensitive data being placed into computer systems, the need for effective cybersecurity is of utmost importance. However, there is a shortage of detailed empirical studies of security vulnerabilities from which cybersecurity metrics and best practices could be determined. This thesis has two main research goals: (1) to explore the distribution and characteristics of security vulnerabilities based on the information provided in bug tracking systems and (2) to develop data analytics approaches for automatic classification of bug reports as security or non-security related. This work is based on using three NASA datasets as case studies. The empirical analysis showed that the majority of software vulnerabilities belong to only a small number of types. Addressing these types of vulnerabilities will consequently lead to cost-efficient improvement of software security. Since this analysis requires labeling of each bug report in the bug tracking system, we explored using machine learning to automate the classification of each bug report as security or non-security related (two-class classification), as well as of each security-related bug report as a specific security type (multiclass classification). In addition to using supervised machine learning algorithms, a novel unsupervised machine learning approach is proposed. Of the machine learning algorithms tested, Naive Bayes was the most consistent, well-performing classifier across all datasets. The novel unsupervised approach did not perform as well as the supervised methods, but still performed well, resulting in a G-Score of 0.715 in the best case, whereas the supervised approach achieved a G-Score of 0.903 in the best case.

    Open design: current practices and implications for architecture and urban design

    Get PDF
    Advisor: Evandro Ziggiatti Monteiro. Doctoral thesis, Universidade Estadual de Campinas, Faculdade de Engenharia Civil, Arquitetura e Urbanismo.
    Abstract: The concept of Open Design (OD) has increasingly gathered attention amongst scholars, grassroots communities, and companies during the last ten years. OD benefits are often associated with the democratization of design, faster improvement of design artifacts, mass customization, and alternative innovation processes. In the construction field, a number of examples that take knowledge and digital commons into account already exist. The possibilities range from sharing low-cost, rapid-assembly components for building houses (Wikihouse) to furniture fabrication (Opendesk) and gardening tools (AKER). In the context of a developing country, the OD approach arouses particular interest. This research aims to investigate the concept of OD as an emergent phenomenon and its implications for the field of Architecture and Urban Design. Despite this emergence, little research on OD currently exists, especially within the scope of architectural practice. I propose a multi-method research analysis, using qualitative and quantitative strategies in the study of the same phenomenon. The research structure addresses four main questions: (1) How do the different aspects of openness affect artifact manufacturing? (2) How does Open Design relate to sustainable development, and what are the current limitations and possible pathways to overcome them? (3) What are the current challenges for replicability in OD and how can they be overcome? (4) What is the structure of an OD collaborative community, and how and why do users collaborate?
    Based on the findings, it is possible to argue for the viability of OD to change the way architects and urban designers work. Current hurdles, however, need to be tackled before it can be adopted by a larger audience, especially in poorer communities. From the cross-cutting results of the four research questions, four suggestions were made: (1) the adoption of a metadesign approach, (2) the adoption of modular designs, (3) education for openness, and (4) mobile microfactories as urban infrastructure. The research contributes to discussions on Open Design and aims to build a conceptual framework for professional practice within the emergence of OD.
    Doutorado: Arquitetura, Tecnologia e Cidade. 01-P-04375-2015. CAPE

    Empirical Analysis and Automated Classification of Security Bug Reports

    Get PDF
    With the ever expanding amount of sensitive data being placed into computer systems, the need for effective cybersecurity is of utmost importance. However, there is a shortage of detailed empirical studies of security vulnerabilities from which cybersecurity metrics and best practices could be determined. This thesis has two main research goals: (1) to explore the distribution and characteristics of security vulnerabilities based on the information provided in bug tracking systems and (2) to develop data analytics approaches for automatic classification of bug reports as security or non-security related. This work is based on using three NASA datasets as case studies. The empirical analysis showed that the majority of software vulnerabilities belong to only a small number of types. Addressing these types of vulnerabilities will consequently lead to cost-efficient improvement of software security. Since this analysis requires labeling of each bug report in the bug tracking system, we explored using machine learning to automate the classification of each bug report as security or non-security related (two-class classification), as well as of each security-related bug report as a specific security type (multiclass classification). In addition to using supervised machine learning algorithms, a novel unsupervised machine learning approach is proposed. An accuracy of 92%, recall of 96%, precision of 92%, probability of false alarm of 4%, F-Score of 81%, and G-Score of 90% were the best results achieved for two-class classification. Furthermore, an accuracy of 80%, recall of 80%, precision of 94%, and F-score of 85% were the best results achieved for multiclass classification.
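    All of the reported measures can be derived from a confusion matrix. The counts below are hypothetical, and the G-Score formula shown is one plausible choice (harmonic mean of recall and 1 - probability of false alarm), since definitions vary across papers.

```python
# Hypothetical confusion-matrix counts for security (positive) vs non-security.
tp, fp, tn, fn = 48, 4, 96, 2

accuracy  = (tp + tn) / (tp + fp + tn + fn)
recall    = tp / (tp + fn)        # a.k.a. probability of detection
precision = tp / (tp + fp)
pfa       = fp / (fp + tn)        # probability of false alarm
f_score   = 2 * precision * recall / (precision + recall)
# One common G-Score definition: harmonic mean of recall and (1 - pfa).
g_score   = 2 * recall * (1 - pfa) / (recall + (1 - pfa))

print(round(accuracy, 3), round(recall, 3), round(pfa, 3))
```

    The probability of false alarm matters here because security bug reports are a small minority class: a classifier can reach high accuracy while still flagging far too many non-security reports.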

    Automatic bug triaging techniques using machine learning and stack traces

    Get PDF
    When a software system crashes, users have the option to report the crash using automated bug tracking systems. These tools capture software crash and failure data (e.g., stack traces, memory dumps, etc.) from end users. These data are sent in the form of bug (crash) reports to the software development teams to uncover the causes of the crash and provide adequate fixes. The reports are first assessed (usually in a semi-automatic way) by a group of software analysts, known as triagers. Triagers assign priority to the bugs and redirect them to the software development teams in order to provide fixes. The triaging process, however, is usually very challenging. The problem is that many of these reports are caused by similar faults. Studies have shown that one way to improve the bug triaging process is to automatically detect duplicate (or similar) reports. This way, triagers would not need to spend time on reports caused by faults that have already been handled. Another issue is related to the prioritization of bug reports. Triagers often rely on the information provided by the customers (the report submitters) to prioritize bug reports. However, this task can be quite tedious and requires tool support. Next, triagers route the bug report to the responsible development team based on the subsystem that caused the crash. Since having knowledge of all the subsystems of an ever-evolving industrial system is impractical, a tool that automatically identifies defective subsystems can significantly reduce the manual bug triaging effort. The main goal of this research is to investigate techniques and tools to help triagers process bug reports. We start by studying the effect of the presence of stack traces in analyzing bug reports. Next, we present a framework to help triagers in each step of the bug triaging process. We propose a new and scalable method to automatically detect duplicate bug reports using stack traces and bug report categorical features.
We then propose a novel approach for predicting bug severity using stack traces and categorical features, and finally, we discuss a new method for predicting faulty product and component fields of bug reports. We evaluate the effectiveness of our techniques using bug reports from two large open-source systems. Our results show that stack traces and machine learning methods can be used to automate the bug triaging process, and hence increase the productivity of bug triagers, while reducing costs and efforts associated with manual triaging of bug reports
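    Duplicate detection from stack traces can be sketched as a similarity search over frame sets. The frame names, the threshold, and the Jaccard measure below are illustrative stand-ins for the thesis's scalable method, which also incorporates categorical report features.

```python
# Sketch of duplicate detection by comparing stack-trace frames.
def trace_similarity(trace_a, trace_b):
    """Jaccard similarity over the sets of stack frames."""
    a, b = set(trace_a), set(trace_b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical database of already-triaged crash reports.
report_db = {
    101: ["libfoo.alloc", "libfoo.copy", "app.render"],
    102: ["net.connect", "net.retry", "app.sync"],
}

def find_duplicates(new_trace, db, threshold=0.5):
    """Return the ids of stored reports similar enough to be duplicates."""
    return [rid for rid, trace in db.items()
            if trace_similarity(new_trace, trace) >= threshold]

print(find_duplicates(["libfoo.alloc", "libfoo.copy", "app.paint"], report_db))
```

    Treating frames as a set discards call order; methods that weight frames by their position near the top of the trace, where the faulty code usually sits, tend to discriminate better.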

    Mining and untangling change genealogies

    Get PDF
    Developers change source code to add new functionality, fix bugs, or refactor their code. Many of these changes have an immediate impact on quality or stability. However, some impacts of changes may become evident only in the long term. This thesis makes use of change genealogies: dependency graphs modeling dependencies between code changes, capturing how earlier changes enable and cause later ones. Using change genealogies, it is possible to: (a) apply formal methods like model checking on version archives to reveal temporal process patterns. Such patterns encode key features of the software process and can be validated automatically: in an evaluation of four open-source histories, our prototype would recommend pending activities with a precision of 60–72%. (b) classify the purpose of code changes. Analyzing the change dependencies on change genealogies shows that change genealogy network metrics can be used to automatically separate bug-fixing from feature-implementing code changes. (c) build competitive defect prediction models. Defect prediction models based on change genealogy network metrics show competitive prediction accuracy when compared to state-of-the-art defect prediction models. Like many other approaches that mine version archives, change genealogies and their applications rely on two basic assumptions: code changes are considered to be atomic, and bug reports are considered to refer to corrective maintenance tasks. In a manual examination of more than 7,000 issue reports and code changes from bug databases and version control systems of open-source projects, we found 34% of all issue reports to be misclassified and that up to 15% of all applied issue fixes consist of multiple combined code changes serving multiple developer maintenance tasks. This introduces bias into bug prediction models, confusing bugs and features.
    To partially solve these issues, we present an approach to untangle such combined changes after the fact, with a mean success rate of 58–90%.
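    A change genealogy is a dependency graph over code changes, and simple network metrics on it take only a few lines. The edges below are invented for illustration; the thesis uses richer metrics than plain degree counts.

```python
from collections import defaultdict

# Each edge (earlier, later) means the earlier change enables the later one.
edges = [("c1", "c2"), ("c1", "c3"), ("c2", "c4"), ("c3", "c4")]

out_degree = defaultdict(int)  # how many later changes a change enables
in_degree = defaultdict(int)   # how many earlier changes a change depends on
for src, dst in edges:
    out_degree[src] += 1
    in_degree[dst] += 1

# Intuition behind genealogy metrics: a change that many later changes build
# on may signal feature work rather than a self-contained bug fix.
print(out_degree["c1"], in_degree["c4"])
```

    Metrics like these become features for the classification and defect prediction models described above, computed per change over the whole genealogy.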