215 research outputs found

    "Medium" LMs of Code in the Era of LLMs: Lessons From StackOverflow

    Large pre-trained neural language models have brought immense progress to both NLP and software engineering. Models in OpenAI's GPT series now dwarf Google's BERT and Meta's RoBERTa, which previously set new benchmarks on a wide range of NLP applications. These models are trained on massive corpora of heterogeneous data from web crawls, which enables them to learn general language patterns and semantic relationships. However, the largest models are expensive to train and deploy, and are often closed-source, so we lack access to their data and design decisions. We argue that this trend towards large, general-purpose models should be complemented with single-purpose, more modestly sized pre-trained models. In this work, we take StackOverflow (SO) as a domain example in which large volumes of rich, aligned code and text data are available. We adopt standard practices for pre-training large language models, including a very large context size (2,048 tokens), batch size (0.5M tokens), and training set (27B tokens), coupled with a powerful toolkit (Megatron-LM), to train two models: SOBertBase, with 109M parameters, and SOBertLarge, with 762M parameters, at a budget of just $187 and $800, respectively. We compare the performance of our models with the previous SOTA model trained exclusively on SO data, with general-purpose BERT models, and with OpenAI's ChatGPT on four SO-specific downstream tasks: question quality prediction, closed question prediction, named entity recognition, and obsoletion prediction (a new task we introduce). Not only do our models consistently outperform all baselines, but the smaller model is often sufficient for strong results. Both models are released to the public. These results demonstrate that pre-training both extensively and properly on in-domain data can yield a powerful and affordable alternative to leveraging closed-source general-purpose models.
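
    As a rough sanity check on the recipe above, the stated hyperparameters can be written down as a small configuration sketch. This is not the authors' actual Megatron-LM setup: the context size, batch size, and corpus size come from the abstract, while the hidden size, layer count, and vocabulary size are illustrative assumptions chosen so that the rough parameter estimate lands near SOBertBase scale.

    # Illustrative sketch, not the released training code.
    from dataclasses import dataclass

    @dataclass
    class PretrainConfig:
        context_size: int = 2048                # tokens per sequence (from the abstract)
        batch_size_tokens: int = 500_000        # ~0.5M tokens per batch (from the abstract)
        training_tokens: int = 27_000_000_000   # 27B-token StackOverflow corpus (from the abstract)
        hidden_size: int = 768                  # assumption: BERT-Base-like width for SOBertBase
        num_layers: int = 12                    # assumption
        vocab_size: int = 50_000                # assumption

        def approx_params(self) -> int:
            # Very rough transformer estimate: token embeddings + ~12*h^2 per layer.
            return self.vocab_size * self.hidden_size + self.num_layers * 12 * self.hidden_size ** 2

        def optimizer_steps(self) -> int:
            # Steps implied by the corpus size and the per-batch token budget.
            return self.training_tokens // self.batch_size_tokens

    cfg = PretrainConfig()
    print(f"~{cfg.approx_params() / 1e6:.0f}M params, {cfg.optimizer_steps():,} steps")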

    Learning Defect Prediction from Unrealistic Data

    Pretrained models of code, such as CodeBERT and CodeT5, have become popular choices for code understanding and generation tasks. Such models tend to be large and require commensurate volumes of training data, which are rarely available for downstream tasks. Instead, it has become popular to train models on far larger but less realistic datasets, such as functions with artificially injected bugs. Models trained on such data, however, tend to perform well only on similar data, while underperforming on real-world programs. In this paper, we conjecture that this discrepancy stems from the presence of distracting samples that steer the model away from the real-world task distribution. To investigate this conjecture, we propose an approach for identifying the subsets of these large yet unrealistic datasets that are most similar to examples in real-world datasets, based on their learned representations. Our approach extracts high-dimensional embeddings of both real-world and artificial programs using a neural model and scores artificial samples by their distance to the nearest real-world sample. We show that training only on the nearest, representationally most similar samples, while discarding those that are dissimilar in representation space, yields consistent improvements for two popular pretrained models of code on two code-understanding tasks. Our results are promising, in that they show that training models on a representative subset of an unrealistic dataset can help us harness the power of large-scale synthetic data generation while preserving downstream task performance. Finally, we highlight the limitations of applying AI models for predicting vulnerabilities and bugs in real-world applications.
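
    A minimal sketch of the representation-based filtering idea described above, assuming embeddings have already been extracted with some pretrained code model; the keep_fraction parameter and the random toy vectors are hypothetical stand-ins, not the paper's settings or data.

    import numpy as np

    def filter_synthetic(real_embs: np.ndarray, synth_embs: np.ndarray, keep_fraction: float = 0.5) -> np.ndarray:
        """Return indices of the synthetic samples closest (L2) to any real-world sample."""
        # Squared distances via ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b (memory-friendly).
        sq = ((synth_embs ** 2).sum(1)[:, None]
              + (real_embs ** 2).sum(1)[None, :]
              - 2.0 * synth_embs @ real_embs.T)
        nearest = np.sqrt(np.maximum(sq, 0.0)).min(axis=1)   # distance to nearest real sample
        n_keep = int(len(synth_embs) * keep_fraction)
        return np.argsort(nearest)[:n_keep]                  # keep the most "realistic" samples

    # Toy usage: random vectors standing in for CodeBERT/CodeT5-style embeddings.
    rng = np.random.default_rng(0)
    real, synth = rng.normal(size=(100, 768)), rng.normal(size=(1000, 768))
    keep_idx = filter_synthetic(real, synth, keep_fraction=0.3)
    print(f"kept {len(keep_idx)} of {len(synth)} synthetic samples")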

    Route choice control of automated baggage handling systems.

    State-of-the-art baggage handling systems transport luggage in an automated way using destination coded vehicles (DCVs). These vehicles transport the bags at high speeds on a "mini" railway network. Currently, such networks are simple, with only a few junctions, since otherwise bottlenecks would be created at the junctions; this makes the system inefficient. In our research, we consider more complex networks. In order to optimize the performance of the system, we develop and compare centralized and decentralized control methods for routing the DCVs through the track network. The proposed centralized control method is model predictive control (MPC). Due to the large computational effort that centralized MPC requires, decentralized MPC and a fast decentralized heuristic approach are also proposed. In the decentralized approaches, each junction has its own local controller for positioning the switch going into the junction and the switch going out of it. In order to assess the advantages and disadvantages of centralized MPC, decentralized MPC, and the decentralized heuristic approach, we also discuss a simple benchmark case study in which the control methods are compared for several scenarios. Results indicate that centralized MPC becomes intractable when a large stream of bags has to be handled, whereas decentralized MPC can still be used to solve the problem suboptimally. Moreover, the decentralized heuristic approach usually gives worse results than decentralized MPC, but with a very low computation time.
    Tarău, De Schutter, Hellendoorn
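
    To make the decentralized receding-horizon idea concrete, the toy sketch below shows a single junction controller that enumerates switch positions over a short horizon and applies only the first decision. The queue model, quadratic cost, and horizon are illustrative assumptions, not the paper's DCV network model or MPC formulation.

    from itertools import product

    def plan_switch(queues, horizon=3):
        """Pick the incoming track (switch position) to serve next at one junction.

        queues: current number of waiting DCVs on each incoming track."""
        best_cost, best_first = float("inf"), 0
        for seq in product(range(len(queues)), repeat=horizon):
            q, cost = list(queues), 0
            for pos in seq:
                if q[pos] > 0:
                    q[pos] -= 1                      # route one DCV through the junction
                cost += sum(x * x for x in q)        # quadratic penalty on queue lengths
            if cost < best_cost:
                best_cost, best_first = cost, seq[0]
        return best_first

    # Receding horizon: apply the first move, observe the new queues, re-plan next step.
    print(plan_switch([4, 1, 0]))  # -> 0 (serve the longest queue first)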

    Multiple-Peptidase Mutants of Lactococcus lactis Are Severely Impaired in Their Ability To Grow in Milk

    To examine the contribution of peptidases to the growth of Lactococcus lactis in milk, 16 single- and multiple-deletion mutants were constructed. In successive rounds of chromosomal gene replacement mutagenesis, up to all five of the following peptidase genes were inactivated (fivefold mutant): pepX, pepO, pepT, pepC, and pepN. Multiple mutations led to slower growth rates in milk, the general trend being that growth rates decreased when more peptidases were inactivated. The fivefold mutant grew more than 10 times more slowly in milk than the wild-type strain. In one of the fourfold mutants and in the fivefold mutant, the intracellular pools of amino acids were lower than those of the wild type, whereas peptides had accumulated inside the cell. No significant differences in the activities of the cell envelope-associated proteinase and of the oligopeptide transport system were observed. Also, the expression of the peptidases still present in the various mutants was not detectably affected. Thus, the lower growth rates can directly be attributed to the inability of the mutants to degrade casein-derived peptides. These results supply the first direct evidence for the functioning of lactococcal peptidases in the degradation of milk proteins. Furthermore, the study provides critical information about the relative importance of the peptidases for growth in milk, the order of events in the proteolytic pathway, and the regulation of its individual components.