222,833 research outputs found
Automatic detection of accommodation steps as an indicator of knowledge maturing
Jointly working on shared digital artifacts – such as wikis – is a well-tried method of developing knowledge collectively within a group or organization. Our assumption is that such knowledge maturing is an accommodation process that can be measured by taking the writing process itself into account. This paper describes the development of a tool that detects accommodation automatically with the help of machine learning algorithms. We applied a software framework for task detection to the automatic identification of accommodation processes within a wiki. To set up the learning algorithms and test its performance, we conducted an empirical study, in which participants had to contribute to a wiki and, at the same time, identify their own tasks. Two domain experts evaluated the participants’ micro-tasks with regard to accommodation. We then applied an ontology-based task detection approach that identified accommodation with a rate of 79.12%. The potential use of our tool for measuring knowledge maturing online is discussed
The Co-Evolution of Test Maintenance and Code Maintenance through the lens of Fine-Grained Semantic Changes
Automatic testing is a widely adopted technique for improving software
quality. Software developers add, remove and update test methods and test
classes as part of the software development process as well as during the
evolution phase, following the initial release. In this work we conduct a large
scale study of 61 popular open source projects and report the relationships we
have established between test maintenance, production code maintenance, and
semantic changes (e.g, statement added, method removed, etc.). performed in
developers' commits.
We build predictive models, and show that the number of tests in a software
project can be well predicted by employing code maintenance profiles (i.e., how
many commits were performed in each of the maintenance activities: corrective,
perfective, adaptive). Our findings also reveal that more often than not,
developers perform code fixes without performing complementary test maintenance
in the same commit (e.g., update an existing test or add a new one). When
developers do perform test maintenance, it is likely to be affected by the
semantic changes they perform as part of their commit.
Our work is based on studying 61 popular open source projects, comprised of
over 240,000 commits consisting of over 16,000,000 semantic change type
instances, performed by over 4,000 software engineers.Comment: postprint, ICSME 201
Evolution of statistical analysis in empirical software engineering research: Current state and steps forward
Software engineering research is evolving and papers are increasingly based
on empirical data from a multitude of sources, using statistical tests to
determine if and to what degree empirical evidence supports their hypotheses.
To investigate the practices and trends of statistical analysis in empirical
software engineering (ESE), this paper presents a review of a large pool of
papers from top-ranked software engineering journals. First, we manually
reviewed 161 papers and in the second phase of our method, we conducted a more
extensive semi-automatic classification of papers spanning the years 2001--2015
and 5,196 papers. Results from both review steps was used to: i) identify and
analyze the predominant practices in ESE (e.g., using t-test or ANOVA), as well
as relevant trends in usage of specific statistical methods (e.g.,
nonparametric tests and effect size measures) and, ii) develop a conceptual
model for a statistical analysis workflow with suggestions on how to apply
different statistical methods as well as guidelines to avoid pitfalls. Lastly,
we confirm existing claims that current ESE practices lack a standard to report
practical significance of results. We illustrate how practical significance can
be discussed in terms of both the statistical analysis and in the
practitioner's context.Comment: journal submission, 34 pages, 8 figure
Rationale in Development Chat Messages: An Exploratory Study
Chat messages of development teams play an increasingly significant role in
software development, having replaced emails in some cases. Chat messages
contain information about discussed issues, considered alternatives and
argumentation leading to the decisions made during software development. These
elements, defined as rationale, are invaluable during software evolution for
documenting and reusing development knowledge. Rationale is also essential for
coping with changes and for effective maintenance of the software system.
However, exploiting the rationale hidden in the chat messages is challenging
due to the high volume of unstructured messages covering a wide range of
topics. This work presents the results of an exploratory study examining the
frequency of rationale in chat messages, the completeness of the available
rationale and the potential of automatic techniques for rationale extraction.
For this purpose, we apply content analysis and machine learning techniques on
more than 8,700 chat messages from three software development projects. Our
results show that chat messages are a rich source of rationale and that machine
learning is a promising technique for detecting rationale and identifying
different rationale elements.Comment: 11 pages, 6 figures. The 14th International Conference on Mining
Software Repositories (MSR'17
High-dimensional approximate nearest neighbor: k-d Generalized Randomized Forests
We propose a new data-structure, the generalized randomized kd forest, or
kgeraf, for approximate nearest neighbor searching in high dimensions. In
particular, we introduce new randomization techniques to specify a set of
independently constructed trees where search is performed simultaneously, hence
increasing accuracy. We omit backtracking, and we optimize distance
computations, thus accelerating queries. We release public domain software
geraf and we compare it to existing implementations of state-of-the-art methods
including BBD-trees, Locality Sensitive Hashing, randomized kd forests, and
product quantization. Experimental results indicate that our method would be
the method of choice in dimensions around 1,000, and probably up to 10,000, and
pointsets of cardinality up to a few hundred thousands or even one million;
this range of inputs is encountered in many critical applications today. For
instance, we handle a real dataset of images represented in 960
dimensions with a query time of less than sec on average and 90\% responses
being true nearest neighbors
- …
