Finding Regressions in Projects under Version Control Systems
Version Control Systems (VCS) are frequently used to support development of
large-scale software projects. A typical VCS repository of a large project can
contain various intertwined branches consisting of a large number of commits.
If some kind of unwanted behaviour (e.g. a bug in the code) is found in the
project, it is desirable to find the commit that introduced it. Such a commit is
called a regression point. There are two main issues regarding regression
points. First, detecting whether the project after a certain commit is correct
can be very expensive as it may include large-scale testing and/or some other
forms of verification. It is thus desirable to minimise the number of such
queries. Second, there can be several regression points preceding the current
commit; perhaps a bug was introduced in a certain commit, inadvertently fixed
several commits later, and then reintroduced in a yet later commit. In order to
fix the bug, it is usually desirable to find the latest regression
point.
Currently used distributed VCSs provide methods for regression
identification; see, e.g., the git bisect tool. In this paper, we present a new
regression identification algorithm that outperforms the current tools by
decreasing the number of validity queries. At the same time, our algorithm
tends to find the latest regression point, which is a feature that is missing
in the state-of-the-art algorithms. The paper provides an experimental
evaluation of the proposed algorithm and compares it to the state-of-the-art
tool git bisect on a real data set.
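To make the baseline concrete, below is a minimal sketch of the classic bisection strategy that git bisect approximates, written over a linearised history; the names (`find_regression_point`, `is_good`) are illustrative, and this shows the approach the paper improves upon, not the paper's own algorithm. `is_good` stands in for the expensive validity query (a test suite or verification run) whose invocations both approaches try to minimise.

```python
# Minimal sketch of bisection over a *linear* commit history.
# Assumes commits[0] is good, commits[-1] is bad, and there is a single
# good-to-bad transition; `is_good` is the expensive validity query.

from typing import Callable, Sequence

def find_regression_point(commits: Sequence[str],
                          is_good: Callable[[str], bool]) -> str:
    """Return the first bad commit, using O(log n) validity queries."""
    lo, hi = 0, len(commits) - 1   # invariant: commits[lo] good, commits[hi] bad
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if is_good(commits[mid]):
            lo = mid               # the regression lies after mid
        else:
            hi = mid               # the regression is at mid or earlier
    return commits[hi]             # first bad commit under the assumptions
```

When the history contains several regression points (a bug introduced, fixed, and reintroduced), the single-transition assumption breaks and bisection may report an earlier regression point rather than the latest one; that gap, together with the query count, is what the proposed algorithm targets.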
Essential Tools: Version Control Systems
Did you ever wish you'd made a backup copy of a file before changing it? Or before applying a collaborator's modifications? Version control systems make this easier, and do a lot more.
DataHub: Collaborative Data Science & Dataset Version Management at Scale
Relational databases have limited support for data collaboration, where teams
collaboratively curate and analyze large datasets. Inspired by software version
control systems like git, we propose (a) a dataset version control system,
giving users the ability to create, branch, merge, difference and search large,
divergent collections of datasets, and (b) a platform, DataHub, that gives
users the ability to perform collaborative data analysis building on this
version control system. We outline the challenges in providing dataset version
control at scale.
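As a rough illustration of the git-inspired model sketched above, the following hypothetical snippet treats a dataset version as an immutable set of records with parent pointers and implements branch, diff, and three-way merge as set operations; none of these names reflect DataHub's actual API.

```python
# Hypothetical dataset version graph in the spirit of the abstract above.
import hashlib
from dataclasses import dataclass

@dataclass(frozen=True)
class Version:
    records: frozenset   # the dataset's contents at this version
    parents: tuple       # () for a root, (v,) for a commit, (a, b) for a merge

    @property
    def id(self) -> str:
        # Content-addressed version id, as in git.
        return hashlib.sha1(repr(sorted(self.records)).encode()).hexdigest()[:8]

def commit(records, parent=None):
    return Version(frozenset(records), (parent,) if parent else ())

def diff(a, b):
    """Records added and removed going from version a to version b."""
    return b.records - a.records, a.records - b.records

def merge(a, b, base):
    """Three-way merge: keep base records neither side deleted,
    plus records either side added."""
    added = (a.records - base.records) | (b.records - base.records)
    deleted = (base.records - a.records) | (base.records - b.records)
    return Version((base.records - deleted) | added, (a, b))

root = commit({"r1", "r2"})
left = commit({"r1", "r2", "r3"}, root)   # one branch adds r3
right = commit({"r2"}, root)              # another branch deletes r1
assert merge(left, right, root).records == {"r2", "r3"}
```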
Scalability of Distributed Version Control Systems
Source at https://ojs.bibsys.no/index.php/NIK/article/view/434.
Distributed version control systems are popular for storing source code, but they are notoriously ill suited for storing large binary files.
We report on the results from a set of experiments designed to characterize the behavior of some widely used distributed version control systems with respect to scaling. The experiments measured commit times and repository sizes when storing single files of increasing size, and when storing increasing numbers of single-kilobyte files.
The goal is to build a distributed storage system with characteristics similar to version control but for much larger data sets. An early prototype of such a system, Distributed Media Versioning (DMV), is briefly described and compared with Git, Mercurial, and the Git-based backup tool Bup.
We find that processing large files without splitting them into smaller parts will limit maximum file size to what can fit in RAM. Storing millions of small files will result in inefficient use of disk space. And storing files with hash-based file and directory names will result in high-latency write operations, due to having to switch between directories rather than performing a sequential write.
The next-phase strategy for DMV will be to break files into chunks by content for de-duplication, then re-aggregate the chunks into append-only log files for low-latency write operations and efficient use of disk space.
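The chunk-by-content step of that strategy is commonly realised with content-defined chunking, sketched below under illustrative assumptions (a toy rolling hash, roughly 8 KiB average chunks); the parameters and names are assumptions, not DMV's implementation. Because chunk boundaries depend on the bytes themselves rather than on fixed offsets, an insertion early in a file shifts only the chunks it touches, so identical regions still deduplicate.

```python
# Content-defined chunking sketch; parameters are illustrative, not DMV's.
import hashlib

MIN_CHUNK = 64        # never cut a chunk smaller than this many bytes
MASK = (1 << 13) - 1  # cut when the low 13 hash bits are zero (~8 KiB average)

def chunks(data: bytes):
    """Yield (sha1_hex, chunk) pairs; the hash doubles as the dedup key."""
    h, start = 0, 0
    for i, byte in enumerate(data):
        # Toy rolling hash: a byte's influence is shifted out of the 32-bit
        # register after ~32 steps, so the hash reflects only recent bytes.
        h = ((h << 1) + byte) & 0xFFFFFFFF
        if i - start + 1 >= MIN_CHUNK and (h & MASK) == 0:
            piece = data[start:i + 1]
            yield hashlib.sha1(piece).hexdigest(), piece
            start, h = i + 1, 0
    if start < len(data):
        piece = data[start:]
        yield hashlib.sha1(piece).hexdigest(), piece
```

Re-aggregation then amounts to appending each previously unseen chunk to a single pack file keyed by its hash, turning many small random writes into one sequential, append-only stream.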
Version Control of Speaker Recognition Systems
This paper discusses one of the most challenging practical engineering
problems in speaker recognition systems - the version control of models and
user profiles. A typical speaker recognition system consists of two stages: the
enrollment stage, where a profile is generated from user-provided enrollment
audio; and the runtime stage, where the voice identity of the runtime audio is
compared against the stored profiles. As technology advances, the speaker
recognition system needs to be updated for better performance. However, if the
stored user profiles are not updated accordingly, version mismatch will result
in meaningless recognition results. In this paper, we describe different
version control strategies for different types of speaker recognition systems,
according to how they are deployed in the production environment.
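One strategy of this kind, sketched below purely as an illustration (the names `Profile` and `ENGINE_VERSION`, and the strategy's details, are assumptions rather than the paper's): tag every stored profile with the engine version that produced it, and refuse to score embeddings across incompatible versions, forcing a profile migration or re-enrollment instead of returning a meaningless result.

```python
# Illustrative version check between a deployed engine and stored profiles.
from dataclasses import dataclass

ENGINE_VERSION = 3              # version of the deployed recognizer (assumed)

@dataclass
class Profile:
    user_id: str
    embedding: list             # speaker embedding from enrollment
    version: int                # engine version that produced the embedding

def score(profile: Profile, runtime_embedding: list) -> float:
    """Cosine similarity, guarded against cross-version comparisons."""
    if profile.version != ENGINE_VERSION:
        # Embeddings from different model versions live in different vector
        # spaces; comparing them would yield a meaningless score.
        raise ValueError(f"profile v{profile.version} incompatible with "
                         f"engine v{ENGINE_VERSION}: migrate or re-enroll")
    num = sum(a * b for a, b in zip(profile.embedding, runtime_embedding))
    den = (sum(a * a for a in profile.embedding) ** 0.5
           * sum(b * b for b in runtime_embedding) ** 0.5)
    return num / den
```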