"Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow
Large pre-trained neural language models have brought immense progress to
both NLP and software engineering. Models in OpenAI's GPT series now dwarf
Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide
range of NLP applications. These models are trained on massive corpora of
heterogeneous data from web crawls, which enables them to learn general
language patterns and semantic relationships. However, the largest models are
expensive to train and deploy, and often closed-source, so we lack
access to their data and design decisions. We argue that this trend towards
large, general-purpose models should be complemented with single-purpose, more
modestly sized pre-trained models. In this work, we take StackOverflow (SO) as
a domain example in which large volumes of rich, aligned code and text data
are available. We adopt standard practices for pre-training large language models,
including using a very large context size (2,048 tokens), batch size (0.5M
tokens) and training set (27B tokens), coupled with a powerful toolkit
(Megatron-LM), to train two models: SOBertBase, with 109M parameters, and
SOBertLarge, with 762M parameters, each trained at a modest budget.
We compare the performance of our models against the previous SOTA model
trained exclusively on SO data, general-purpose BERT models, and OpenAI's
ChatGPT on four SO-specific downstream tasks: question quality prediction,
closed question prediction, named entity recognition and obsoletion prediction
(a new task we introduce). Not only do our models consistently outperform
all baselines, but the smaller model is often sufficient for strong results.
Both models are released to the public. These results demonstrate that
pre-training both extensively and properly on in-domain data can yield a
powerful and affordable alternative to leveraging closed-source,
general-purpose models.
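The stated batch and training-set sizes directly imply the number of optimization steps; a quick back-of-the-envelope check, assuming a single pass over the corpus (the abstract does not state the actual schedule):

```python
# Implied pre-training scale from the abstract's numbers. A single epoch
# over the corpus is an assumption; the real schedule may differ.
tokens_total = 27e9       # training set: 27B tokens
tokens_per_batch = 0.5e6  # batch size: 0.5M tokens
steps_per_epoch = tokens_total / tokens_per_batch
print(int(steps_per_epoch))  # 54000 steps for one full pass
```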
Learning Defect Prediction from Unrealistic Data
Pretrained models of code, such as CodeBERT and CodeT5, have become popular
choices for code understanding and generation tasks. Such models tend to be
large and require commensurate volumes of training data, which are rarely
available for downstream tasks. Instead, it has become popular to train models
with far larger but less realistic datasets, such as functions with
artificially injected bugs. Models trained on such data, however, tend to only
perform well on similar data, while underperforming on real-world programs. In
this paper, we conjecture that this discrepancy stems from the presence of
distracting samples that steer the model away from the real-world task
distribution. To investigate this conjecture, we propose an approach for
identifying the subsets of these large yet unrealistic datasets that are most
similar to examples in real-world datasets based on their learned
representations. Our approach extracts high-dimensional embeddings of both
real-world and artificial programs using a neural model and scores artificial
samples based on their distance to the nearest real-world sample. We show that
training on only the nearest, representationally most similar samples, while
discarding those far from any real-world sample in representation space, yields
consistent improvements across two popular pretrained models of code on two
code understanding tasks. Our results are promising, in that they show that
training models on a representative subset of an unrealistic dataset can help
us harness the power of large-scale synthetic data generation while preserving
downstream task performance. Finally, we highlight the limitations of applying
AI models to predicting vulnerabilities and bugs in real-world applications.
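The filtering idea described above (score each artificial sample by its distance to the nearest real-world sample in embedding space, then train only on the closest ones) can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation; the embedding source, Euclidean metric, and keep fraction are all assumptions:

```python
import numpy as np

def filter_by_nearest_real(real_emb, artificial_emb, keep_fraction=0.5):
    """Keep the artificial samples representationally closest to real data.

    real_emb:       (n_real, d) embeddings of real-world programs
    artificial_emb: (n_art, d)  embeddings of artificial programs
    Returns the indices of kept artificial samples and all distance scores.
    """
    # Pairwise Euclidean distances, shape (n_art, n_real)
    diffs = artificial_emb[:, None, :] - real_emb[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Each artificial sample's distance to its nearest real-world sample
    nearest = dists.min(axis=1)
    # Keep only the closest fraction; discard the "distracting" remainder
    n_keep = int(len(artificial_emb) * keep_fraction)
    keep_idx = np.argsort(nearest)[:n_keep]
    return keep_idx, nearest

# Toy usage: half the artificial data matches the real distribution,
# half is far away and should be filtered out.
rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(100, 8))
art_near = rng.normal(0.0, 1.0, size=(50, 8))   # in-distribution
art_far = rng.normal(10.0, 1.0, size=(50, 8))   # distracting samples
art = np.vstack([art_near, art_far])
keep_idx, scores = filter_by_nearest_real(real, art, keep_fraction=0.5)
print(sorted(keep_idx) == list(range(50)))  # only in-distribution kept
```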
Route choice control of automated baggage handling systems
State-of-the-art baggage handling systems transport luggage in an automated way using destination coded vehicles (DCVs). These vehicles transport the bags at high speeds on a "mini" railway network. Currently, the networks are simple, with only a few junctions, since otherwise bottlenecks would be created at the junctions. This makes the system inefficient. In the research we conduct, more complex networks are considered. In order to optimize the performance of the system, we develop and compare centralized and decentralized control methods that can be used to route the DCVs through the track network. The proposed centralized control method is model predictive control (MPC). Due to the large computation effort centralized MPC requires, decentralized MPC and a fast decentralized heuristic approach are also proposed. When implementing the decentralized approaches, each junction has its own local controller for positioning the switch going into the junction and the switch going out of the junction. In order to assess the advantages and disadvantages of centralized MPC, decentralized MPC, and the decentralized heuristic approach, we also discuss a simple benchmark case study. The considered control methods are compared for several scenarios. Results indicate that centralized MPC becomes intractable when a large stream of bags has to be handled, while decentralized MPC can still be used to suboptimally solve the problem. Moreover, the decentralized heuristic approach usually gives worse results than those obtained when using decentralized MPC, but with very low computation time.
Tarău, De Schutter, Hellendoorn
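A decentralized heuristic of the kind the abstract describes, where each junction's local controller positions its switch using only local information, might look like the following sketch. The greedy cost model (free-flow travel time plus a per-vehicle queueing penalty) and all names are illustrative assumptions, not the paper's exact heuristic:

```python
# Minimal sketch of a decentralized junction controller: route each
# arriving DCV onto the outgoing link with the smallest estimated cost.
# The cost model below is an assumption for illustration only.

HEADWAY = 2.0  # assumed seconds of delay contributed per queued DCV

def choose_outgoing_link(links):
    """links: list of dicts with 'travel_time' (seconds of free-flow
    travel) and 'queue' (DCVs currently waiting on that link).
    Returns the index of the link the local switch would select."""
    costs = [l["travel_time"] + HEADWAY * l["queue"] for l in links]
    return min(range(len(links)), key=costs.__getitem__)

# Usage: a congested short link loses to a longer but empty one.
links = [
    {"travel_time": 30.0, "queue": 12},  # cost 30 + 24 = 54
    {"travel_time": 45.0, "queue": 0},   # cost 45
]
print(choose_outgoing_link(links))  # 1
```

A controller like this needs no network-wide optimization, which is why such heuristics run far faster than centralized MPC, at the cost of suboptimal routing.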
Multiple-Peptidase Mutants of Lactococcus lactis Are Severely Impaired in Their Ability To Grow in Milk
To examine the contribution of peptidases to the growth of Lactococcus lactis in milk, 16 single- and multiple-deletion mutants were constructed. In successive rounds of chromosomal gene replacement mutagenesis, up to all five of the following peptidase genes were inactivated (fivefold mutant): pepX, pepO, pepT, pepC, and pepN. Multiple mutations led to slower growth rates in milk, the general trend being that growth rates decreased when more peptidases were inactivated. The fivefold mutant grew more than 10 times more slowly in milk than the wild-type strain. In one of the fourfold mutants and in the fivefold mutant, the intracellular pools of amino acids were lower than those of the wild type, whereas peptides had accumulated inside the cell. No significant differences in the activities of the cell envelope-associated proteinase and of the oligopeptide transport system were observed. Also, the expression of the peptidases still present in the various mutants was not detectably affected. Thus, the lower growth rates can directly be attributed to the inability of the mutants to degrade casein-derived peptides. These results supply the first direct evidence for the functioning of lactococcal peptidases in the degradation of milk proteins. Furthermore, the study provides critical information about the relative importance of the peptidases for growth in milk, the order of events in the proteolytic pathway, and the regulation of its individual components.