    Opinion Mining for Software Development: A Systematic Literature Review

    Opinion mining, sometimes referred to as sentiment analysis, has gained increasing attention in software engineering (SE) studies. SE researchers have applied opinion mining techniques in various contexts, such as identifying developers’ emotions expressed in code comments and extracting users’ criticisms of mobile apps. Given the large number of relevant studies available, it can take considerable time for researchers and developers to figure out which approaches they can adopt in their own studies and what perils these approaches entail. We conducted a systematic literature review involving 185 papers. More specifically, we present 1) well-defined categories of opinion mining-related software development activities, 2) available opinion mining approaches, whether they are evaluated when adopted in other studies, and how their performance is compared, 3) available datasets for performance evaluation and tool customization, and 4) concerns or limitations SE researchers might need to take into account when applying/customizing these opinion mining techniques. The results of our study serve as references for choosing suitable opinion mining tools for software development activities and provide critical insights for the further development of opinion mining techniques in the SE domain.

    Code Clone Detection with Dominator Information (Codeklonerkennung mit Dominatorinformationen)

    If an existing function in a software project is copied and reused (in a slightly modified form), the result is a code clone. If the original function contained an error or vulnerability, that error or vulnerability now exists in several places in the software project. This is one of the reasons why research is being done to develop powerful and scalable clone detection techniques. This thesis presents a new clone detection method that uses paths and path sets derived from the dominator trees of the functions to detect code clones. A dominator tree is derived from the control flow graph and contains no cycles. The dominator-tree-based method has been implemented in the StoneDetector tool and can detect code clones in Java source code as well as in Java bytecode. It achieves recall and precision that are as good as or better than those of previously published code clone detection methods. The evaluation was performed using BigCloneBench. Scalability measurements showed that even source code with several hundred million lines of code can be searched in a reasonable time. To evaluate the bytecode-based StoneDetector variant, the BigCloneBench files had to be compiled. For this purpose, the Stubber tool was developed, which can compile Java source code files without the required dependencies. Finally, it was shown that the register code generated from the Java bytecode yields recall and precision values similar to those of the source-code-based variant. Since some machine learning studies report very good recall and precision values for all clone types, a machine learning method was trained with dominator trees. It was shown that the results published by these studies are not reproducible on unseen data.
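
To make the notion of a dominator tree concrete, the following is a minimal sketch of how one could be computed from a control flow graph and how root-to-leaf paths could be read off it. It is an illustration only and not StoneDetector's actual algorithm: the function names `immediate_dominators` and `dominator_tree_paths`, the dictionary-based graph encoding, and the example graph are all assumptions, and every node is assumed to be reachable from the entry node.

```python
def immediate_dominators(cfg, entry):
    """Return {node: immediate dominator} for a CFG given as {node: [successors]}.

    Uses the classic iterative data-flow formulation; assumes every node
    is reachable from `entry`.
    """
    nodes = set(cfg) | {s for succs in cfg.values() for s in succs}
    preds = {n: set() for n in nodes}
    for n, succs in cfg.items():
        for s in succs:
            preds[s].add(n)

    # Start from "everything dominates everything" and shrink to a fixed point.
    dom = {n: set(nodes) for n in nodes}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for n in nodes - {entry}:
            new = set.intersection(*(dom[p] for p in preds[n])) | {n}
            if new != dom[n]:
                dom[n] = new
                changed = True

    # The strict dominators of a node form a chain; the immediate dominator
    # is the deepest one, i.e. the one with the largest dominator set.
    idom = {}
    for n in nodes - {entry}:
        strict = dom[n] - {n}
        idom[n] = max(strict, key=lambda d: len(dom[d]))
    return idom


def dominator_tree_paths(idom, entry):
    """Enumerate root-to-leaf paths of the dominator tree (hypothetical helper)."""
    children = {}
    for node, parent in idom.items():
        children.setdefault(parent, []).append(node)

    paths = []

    def walk(node, prefix):
        prefix = prefix + [node]
        kids = sorted(children.get(node, []))
        if not kids:
            paths.append(prefix)
        for kid in kids:
            walk(kid, prefix)

    walk(entry, [])
    return paths


# Example: a simple loop. The CFG has the cycle cond -> body -> cond,
# but the dominator tree derived from it is acyclic.
cfg = {"entry": ["cond"], "cond": ["body", "exit"], "body": ["cond"], "exit": []}
idom = immediate_dominators(cfg, "entry")
paths = dominator_tree_paths(idom, "entry")
```

In this example both `body` and `exit` are immediately dominated by `cond`, so the back edge of the loop disappears from the tree. How such paths are encoded and compared to detect clones is specific to the thesis and is not reproduced here.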

    Analyzing Clone Evolution for Identifying the Important Clones for Management

    Code clones (identical or similar code fragments in a code-base) have dual but contradictory impacts (i.e., both positive and negative impacts) on the evolution and maintenance of a software system. Because of the negative impacts (such as high change-proneness, bug-proneness, and unintentional inconsistencies), software researchers consider code clones to be the number one bad smell in a code-base. Existing studies on clone management suggest managing code clones through refactoring and tracking. However, a software system's code-base may contain a huge number of code clones, and it is impractical to consider all these clones for refactoring or tracking. In these circumstances, it is essential to identify code clones that can be considered particularly important for refactoring and tracking. However, no existing study has investigated this matter. We conduct our research emphasizing this matter, and perform five studies on identifying important clones by analyzing clone evolution history. In our first study we detect evolutionary coupling of code clones by automatically investigating clone evolution history from thousands of commits of software systems downloaded from online SVN repositories. By analyzing evolutionary coupling of code clones we identify a particular clone change pattern, the Similarity Preserving Change Pattern (SPCP), such that code clones that evolve following this pattern should be considered important for refactoring. We call these important clones the SPCP clones. We rank SPCP clones considering their strength of evolutionary coupling. In our second study we further analyze evolutionary coupling of code clones with the aim of assisting clone tracking. The purpose of clone tracking is to identify the co-change (i.e., changing together) candidates of code clones to ensure consistency of changes in the code-base. Our research in the second study identifies and ranks the important co-change candidates by analyzing their evolutionary coupling.
In our third study we perform a deeper analysis of the SPCP clones and identify their cross-boundary evolutionary couplings. On the basis of such couplings we separate the SPCP clones into two disjoint subsets. While one subset contains the non-cross-boundary SPCP clones, which can be considered important for refactoring, the other subset contains the cross-boundary SPCP clones, which should be considered important for tracking. In our fourth study we analyze the bug-proneness of different types of SPCP clones in order to identify which type(s) of code clones have a high tendency to experience bug-fixes. Such clone-types can be given high priority for management (refactoring or tracking). In our last study we analyze and compare the late propagation tendencies of different types of code clones. Late propagation is commonly regarded as a harmful clone evolution pattern. Findings from our last study can help us prioritize clone-types for management on the basis of their tendencies to experience late propagation. We also find that late propagation can be considerably minimized by managing the SPCP clones. On the basis of our studies we develop an automatic system called AMIC (Automatic Mining of Important Clones) that identifies the important clones for management (refactoring and tracking) and ranks these clones considering their evolutionary coupling, bug-proneness, and late propagation tendencies. We believe that our research findings have the potential to assist clone management by pinpointing the important clones to be managed, and thus considerably minimizing clone management effort.
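
As an illustration of ranking co-change candidates by evolutionary coupling, the sketch below uses one common formulation based on association-rule support and confidence over commit history. The abstract does not state the exact coupling measure used, so this choice is an assumption, as are the function names, the clone identifiers, and the toy commit data.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each clone fragment changes and how often pairs co-change.

    `commits` is an iterable of collections, each holding the IDs of the
    clone fragments modified in one commit.
    """
    single = Counter()
    pairs = Counter()
    for changed in commits:
        changed = set(changed)
        single.update(changed)
        pairs.update(combinations(sorted(changed), 2))
    return single, pairs


def rank_co_change_candidates(commits, target):
    """Rank fragments by how strongly they are coupled to `target`.

    support    = number of commits in which both fragments changed
    confidence = support / number of commits in which `target` changed
    """
    single, pairs = co_change_counts(commits)
    ranked = []
    for (a, b), support in pairs.items():
        if target in (a, b):
            other = b if a == target else a
            ranked.append((other, support, support / single[target]))
    # Strongest coupling first; ties are broken by fragment ID for determinism.
    ranked.sort(key=lambda r: (-r[1], -r[2], r[0]))
    return ranked


# Toy history with three hypothetical clone fragments.
history = [{"c1", "c2"}, {"c1", "c2", "c3"}, {"c1"}, {"c2", "c3"}]
candidates = rank_co_change_candidates(history, "c1")
```

Here `c2` ranks above `c3` as a co-change candidate for `c1`, because it changed together with `c1` in two of the three commits that touched `c1`.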

    Advancing Analytical Database Systems (Weiterentwicklung analytischer Datenbanksysteme)

    This thesis contributes to the state of the art in analytical database systems. First, we identify and explore extensions to better support analytics on event streams. Second, we propose a novel polygon index to enable efficient geospatial data processing in main memory. Third, we contribute a new deep learning approach to cardinality estimation, which is the core problem in cost-based query optimization.

    Data-Efficient Learned Database Components

    While databases are the backbone of many software systems, database components such as query optimizers often have to be redesigned to meet the increasing variety in workloads, data and hardware designs, which incurs significant engineering effort. Recently, it was thus proposed to replace DBMS components such as optimizers and cardinality estimators with ML models, which not only eliminates the engineering effort but also provides superior performance for many components. The predominant approach to derive such learned components is workload-driven learning, where tens of thousands of queries have to be executed first to derive the necessary training data. Unfortunately, the training data collection, which can take days even for medium-sized datasets, has to be repeated for every new database (i.e., the combination of dataset, schema and workload) a component should be deployed for. This is especially problematic for cloud databases such as Snowflake or Redshift since this effort has to be incurred for every customer. This dissertation thus proposes data-efficient learned database components, which either reduce or fully eliminate the high costs of training data collection. In particular, three directions are proposed in this dissertation: (i) we first aim to reduce the number of training queries needed for workload-driven components, before we (ii) propose data-driven learning, which uses the data stored in the database as training data instead of queries, and (iii) introduce zero-shot learned components, which can generalize to new databases out-of-the-box, such that no training data collection is required.
First, we strive to reduce the number of training queries required for workload-driven components by using simulation models to convey the basic tradeoffs of the underlying problem, e.g., that in database partitioning the network cost of shuffling tuples for joins is the dominating factor. This substantially reduces the number of training queries since the basic principles are already covered by the simulation model and thus only subtleties not covered in the simulation model have to be learned by observing query executions, which we demonstrate for the problem of database partitioning. An alternative direction is to incorporate domain knowledge (e.g., in a cost model we could encode that scan costs increase linearly with the number of tuples) into components by designing them using differentiable programming. This significantly reduces the number of learnable parameters and thus also the number of required training queries. We demonstrate the feasibility of the approach for the problem of cost estimation in databases. While both approaches reduce the number of training queries, a significant number of training queries is still required for unseen databases. This motivates our second approach of data-driven learning. In particular, we propose to train the database component by learning the data distribution present in a database instead of observing query executions. This not only completely eliminates the need to collect training queries but can even improve the state of the art in problems such as cardinality estimation or approximate query processing (AQP). While we demonstrate the applicability to a wide range of additional database tasks such as the completion of incomplete relational datasets, data-driven learning is only useful for problems where the data distribution provides sufficient information for the underlying database task.
However, for tasks where observations of query executions are indispensable, such as cost estimation, data-driven learning cannot be leveraged. In a third direction, we thus propose zero-shot learned database components, which are applicable to a broader set of tasks including those that require observations of queries. In particular, motivated by recent advances in transfer learning, we propose to pretrain a model once on a variety of databases and workloads and thus allow the component to generalize to unseen databases out-of-the-box. Hence, similar to data-driven learning, no training queries have to be collected. In this dissertation, we demonstrate that zero-shot learning can indeed yield learned cost models which can predict query latencies on entirely unseen databases more accurately than state-of-the-art workload-driven approaches, which require tens of thousands of query executions on every unseen database. Overall, the proposed techniques yield state-of-the-art performance for many database tasks while significantly reducing or completely eliminating the expensive training data collection for unseen databases. However, while the proposed directions address the prevalent data-inefficiency of learned database components, there are still many opportunities to improve learned components in the future. First, the robustness and debuggability of learned components should be improved, since as of today they do not offer the same transparency as standard code in databases, which can make the components less attractive to deploy in production systems. Moreover, to increase the applicability of data-driven models it is desirable to increase the coverage of supported queries, e.g., queries involving wildcard predicates on string columns, which are currently not supported by data-driven learning.
Finally, we envision that a broader set of tasks should be supported in the future by zero-shot models (e.g., query optimization), potentially converging towards complete zero-shot learned systems.
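
The remark that a cost model can encode "scan costs increase linearly with the number of tuples" can be illustrated with a very small sketch. Fixing the functional form in advance leaves only two learnable parameters, so a handful of observations suffices. This is not the dissertation's model: the function name `fit_linear_scan_cost`, the hand-written gradient descent (standing in for a differentiable programming framework), and the synthetic measurements are all assumptions made for illustration.

```python
def fit_linear_scan_cost(tuple_counts, latencies, lr=0.01, steps=5000):
    """Fit latency ~ per_tuple * tuples + fixed by gradient descent on the MSE.

    Because the linear shape is encoded up front, only the two coefficients
    `per_tuple` and `fixed` are learned.
    """
    per_tuple, fixed = 0.0, 0.0
    n = len(tuple_counts)
    for _ in range(steps):
        grad_per_tuple = 0.0
        grad_fixed = 0.0
        for x, y in zip(tuple_counts, latencies):
            err = per_tuple * x + fixed - y
            grad_per_tuple += 2 * err * x / n
            grad_fixed += 2 * err / n
        per_tuple -= lr * grad_per_tuple
        fixed -= lr * grad_fixed
    return per_tuple, fixed


# Synthetic observations (invented): table sizes in millions of tuples and
# scan latencies in seconds, generated from latency = 0.5 * size + 0.1.
sizes = [1.0, 2.0, 4.0, 8.0]
observed = [0.6, 1.1, 2.1, 4.1]
per_tuple, fixed = fit_linear_scan_cost(sizes, observed)
predicted_16m = per_tuple * 16.0 + fixed  # extrapolate to an unseen table size
```

Four observations are enough to recover both coefficients here, whereas a generic model without the built-in linear assumption would typically need far more training queries to learn the same relationship.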