Search CORE

67,301 research outputs found

Learning natural coding conventions

Author: Allamanis Miltiadis
Publication venue: The University of Edinburgh
Publication date: 07/07/2017
Field of study

Coding conventions are ubiquitous in software engineering practice. Maintaining a uniform coding style allows software development teams to communicate through code by making the code clear and, thus, readable and maintainable—two important properties of good code since developers spend the majority of their time maintaining software systems. This dissertation introduces a set of probabilistic machine learning models of source code that learn coding conventions directly from source code written in a mostly conventional style. This alleviates the coding convention enforcement problem, where conventions need to first be formulated clearly into unambiguous rules and then be coded in order to be enforced; a tedious and costly process. First, we introduce the problem of inferring a variable’s name given its usage context and address this problem by creating Naturalize — a machine learning framework that learns to suggest conventional variable names. Two machine learning models, a simple n-gram language model and a specialized neural log-bilinear context model are trained to understand the role and function of each variable and suggest new stylistically consistent variable names. The neural log-bilinear model can even suggest previously unseen names by composing them from subtokens (i.e. sub-components of code identifiers). The suggestions of the models achieve 90% accuracy when suggesting variable names at the top 20% most confident locations, rendering the suggestion system usable in practice. We then turn our attention to the significantly harder method naming problem. Learning to name methods, by looking only at the code tokens within their body, requires a good understating of the semantics of the code contained in a single method. To achieve this, we introduce a novel neural convolutional attention network that learns to generate the name of a method by sequentially predicting its subtokens. This is achieved by focusing on different parts of the code and potentially directly using body (sub)tokens even when they have never been seen before. This model achieves an F1 score of 51% on the top five suggestions when naming methods of real-world open-source projects. Learning about naming code conventions uses the syntactic structure of the code to infer names that implicitly relate to code semantics. However, syntactic similarities and differences obscure code semantics. Therefore, to capture features of semantic operations with machine learning, we need methods that learn semantic continuous logical representations. To achieve this ambitious goal, we focus our investigation on logic and algebraic symbolic expressions and design a neural equivalence network architecture that learns semantic vector representations of expressions in a syntax-driven way, while solely retaining semantics. We show that equivalence networks learn significantly better semantic vector representations compared to other, existing, neural network architectures. Finally, we present an unsupervised machine learning model for mining syntactic and semantic code idioms. Code idioms are conventional “mental chunks” of code that serve a single semantic purpose and are commonly used by practitioners. To achieve this, we employ Bayesian nonparametric inference on tree substitution grammars. We present a wide range of evidence that the resulting syntactic idioms are meaningful, demonstrating that they do indeed recur across software projects and that they occur more frequently in illustrative code examples collected from a Q&A site. These syntactic idioms can be used as a form of automatic documentation of coding practices of a programming language or an API. We also mine semantic loop idioms, i.e. highly abstracted but semantic-preserving idioms of loop operations. We show that semantic idioms provide data-driven guidance during the creation of software engineering tools by mining common semantic patterns, such as candidate refactoring locations. This gives data-based evidence to tool, API and language designers about general, domain and project-specific coding patterns, who instead of relying solely on their intuition, can use semantic idioms to achieve greater coverage of their tool or new API or language feature. We demonstrate this by creating a tool that suggests loop refactorings into functional constructs in LINQ. Semantic loop idioms also provide data-driven evidence for introducing new APIs or programming language features

Sparse Codes for Speech Predict Spectrotemporal Receptive Fields in the Inferior Colliculus

Author: Carlson Nicole L.
DeWeese Michael R.
Ming Vivienne L.
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 12/07/2012
Field of study

We have developed a sparse mathematical representation of speech that minimizes the number of active model neurons needed to represent typical speech sounds. The model learns several well-known acoustic features of speech such as harmonic stacks, formants, onsets and terminations, but we also find more exotic structures in the spectrogram representation of sound such as localized checkerboard patterns and frequency-modulated excitatory subregions flanked by suppressive sidebands. Moreover, several of these novel features resemble neuronal receptive fields reported in the Inferior Colliculus (IC), as well as auditory thalamus and cortex, and our model neurons exhibit the same tradeoff in spectrotemporal resolution as has been observed in IC. To our knowledge, this is the first demonstration that receptive fields of neurons in the ascending mammalian auditory pathway beyond the auditory nerve can be predicted based on coding principles and the statistical properties of recorded sounds.Comment: For Supporting Information, see PLoS website: http://www.ploscompbiol.org/article/info%3Adoi%2F10.1371%2Fjournal.pcbi.100259

arXiv.org e-Print Archive

Directory of Open Access Journals

FigShare

The Self-Organization of Speech Sounds

Author: Oudeyer Pierre-Yves
Publication venue: Elsevier
Publication date: 01/01/2005
Field of study

The speech code is a vehicle of language: it defines a set of forms used by a community to carry information. Such a code is necessary to support the linguistic interactions that allow humans to communicate. How then may a speech code be formed prior to the existence of linguistic interactions? Moreover, the human speech code is discrete and compositional, shared by all the individuals of a community but different across communities, and phoneme inventories are characterized by statistical regularities. How can a speech code with these properties form? We try to approach these questions in the paper, using the ``methodology of the artificial''. We build a society of artificial agents, and detail a mechanism that shows the formation of a discrete speech code without pre-supposing the existence of linguistic capacities or of coordinated interactions. The mechanism is based on a low-level model of sensory-motor interactions. We show that the integration of certain very simple and non language-specific neural devices leads to the formation of a speech code that has properties similar to the human speech code. This result relies on the self-organizing properties of a generic coupling between perception and production within agents, and on the interactions between agents. The artificial system helps us to develop better intuitions on how speech might have appeared, by showing how self-organization might have helped natural selection to find speech

arXiv.org e-Print Archive

CiteSeerX

CogPrints Cognitive Sciences Eprint Archive

Structure emerges faster during cultural transmission in children than in adults

Author: Forsyth Douglas
Gauvrit Nicolas
Kempe Vera
Publication venue
Publication date: 21/11/2014
Field of study

How does children’s limited processing capacity affect cultural transmission of complex information? We show that over the course of iterated reproduction of two-dimensional random dot patterns transmission accuracy increased to a similar extent in 5- to 8-year-old children and adults whereas algorithmic complexity decreased faster in children. Thus, children require more structure to render complex inputs learnable. In line with the Less-Is-More hypothesis, we interpret this as evidence that children’s processing limitations affecting working memory capacity and executive control constrain the ability to represent and generate complexity, which, in turn, facilitates emergence of structure. This underscores the importance of investigating the role of children in the transmission of complex cultural traits