Search CORE

4 research outputs found

DETECTION OF SOURCE CODE IN INTERNET TEXTS USING AUTOMATICALLY GENERATED MACHINE LEARNING MODELS

Author: BADUROWICZ Marcin
Publication venue: 'Politechnika Lubelska'
Publication date: 01/01/2022
Field of study

In the paper, the authors are presenting the outcome of web scraping software allowing for the automated classification of source code. The software system was prepared for a discussion forum for software developers to find fragments of source code that were published without marking them as code snippets. The analyzer software is using a Machine Learning binary classification model for differentiating between a programming language source code and highly technical text about software. The analyzer model was prepared using the AutoML subsystem without human intervention and fine-tuning and its accuracy in a described problem exceeds 95%. The analyzer based on the automatically generated model has been deployed and after the first year of continuous operation, its False Positive Rate is less than 3%. The similar process may be introduced in document management in software development process, where automatic tagging and search for code or pseudo-code may be useful for archiving purposes

Biblioteka Nauki - repozytorium artykuÅÃ³w

Lublin University of Technology Journals

Content-Based Textual File Type Detection at Scale

Author: Del Bonifro Francesca
Gabbrielli Maurizio
Zacchiroli Stefano
Publication venue
Publication date: 01/01/2021
Field of study

Programming language detection is a common need in the analysis of large source code bases. It is supported by a number of existing tools that rely on several features, and most notably file extensions, to determine file types. We consider the problem of accurately detecting the type of files commonly found in software code bases, based solely on textual file content. Doing so is helpful to classify source code that lack file extensions (e.g., code snippets posted on the Web or executable scripts), to avoid misclassifying source code that has been recorded with wrong or uncommon file extensions, and also shed some light on the intrinsic recognizability of source code files. We propose a simple model that (a) use a language-agnostic word tokenizer for textual files, (b) group tokens in 1-/2-grams, (c) build feature vectors based on N-gram frequencies, and (d) use a simple fully connected neural network as classifier. As training set we use textual files extracted from GitHub repositories with at least 1000 stars, using existing file extensions as ground truth. Despite its simplicity the proposed model reaches 85% in our experiments for a relatively high number of recognized classes (more than 130 file types)

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna