Improving the tokenisation of identifier names
Identifier names are the main vehicle for semantic information during program comprehension. For tool-supported program comprehension tasks, including concept location and requirements traceability, identifier names need to be tokenised into their semantic constituents. In this paper we present an approach to the automated tokenisation of identifier names that improves on existing techniques in two ways. First, it improves the tokenisation accuracy for single-case identifier names and for identifier names containing digits, which existing techniques largely ignore. Second, performance gains over existing techniques are achieved using smaller oracles, making the approach easier to deploy.
Accuracy was evaluated by comparing our algorithm to manual tokenisations of 28,000 identifier names drawn from 60 well-known open source Java projects totalling 16.5 MSLOC. Moreover, the projects were used to perform a study of identifier tokenisation features (single case, camel case, use of digits, etc.) per object-oriented construct (class names, method names, local variable names, etc.), thus providing insight into naming conventions in industrial-scale object-oriented code. Our tokenisation tool and datasets are publicly available.
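As an illustration of what such tokenisation involves, the following is a minimal sketch, not the paper's algorithm: it handles separator and camel-case boundaries and digit runs, but single-case names (e.g. "startsize") would need the dictionary-backed oracles the paper describes, which this sketch omits. The splitting rules are illustrative assumptions.

```python
import re

def tokenise(identifier: str) -> list[str]:
    """Split an identifier into candidate semantic tokens.

    Handles separator-delimited names (snake_case), camel-case
    boundaries, acronym/word boundaries (e.g. "XMLParser" ->
    ["XML", "Parser"]), and letter/digit boundaries.
    """
    parts = re.split(r"[_$]+", identifier)  # hard separators first
    tokens = []
    for part in parts:
        # ACRONYM (not followed by lowercase) | Capitalised word |
        # lowercase run | digit run
        tokens.extend(
            re.findall(r"[A-Z]+(?![a-z])|[A-Z][a-z]+|[a-z]+|\d+", part)
        )
    return tokens
```

For example, `tokenise("XMLHttpRequest2")` yields `["XML", "Http", "Request", "2"]`, keeping the acronym intact while still splitting the camel-case and digit boundaries.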
Analysing Java Identifier Names
Identifier names are the principal means of recording and communicating ideas in source code and are a significant source of information for software developers and maintainers, and the tools that support their work. This research aims to increase understanding of identifier name content types - words, abbreviations, etc. - and phrasal structures - noun phrases, verb phrases, etc. - by improving techniques for the analysis of identifier names. The techniques and knowledge acquired can be applied to improve program comprehension tools that support internal code quality, concept location, traceability and model extraction. Previous detailed investigations of identifier names have focused on method names, and the content and structure of Java class and reference (field, parameter, and variable) names are less well understood.
I developed improved algorithms to tokenise names, and trained part-of-speech tagger models on identifier names to support the analysis of class and reference names in a corpus of 60 open source Java projects. I confirm that developers structure the majority of names according to identifier naming conventions, and use phrasal structures reported in the literature. I also show that developers use a wider variety of content types and phrasal structures than previously understood. Unusually structured class names are largely project-specific naming conventions, but could indicate design issues. Analysis of phrasal reference names showed that developers most often use the phrasal structures described in the literature and used to support the extraction of information from names, but also choose unexpected phrasal structures, and complex, multi-phrasal, names.
Using Nominal - software I created to evaluate adherence to naming conventions - I found that developers tend to follow naming conventions, but that adherence to published conventions varies between projects because developers also establish new conventions for the use of typography, content types and phrasal structure to support their work, particularly to distinguish the roles of Java field names.
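The typography side of such convention checking can be pictured with a small sketch; the regular expressions below encode only the best-known published Java conventions and are illustrative assumptions, not Nominal's rule set:

```python
import re

# Published Java typography conventions per name role (illustrative;
# real projects add their own, as the thesis reports).
CONVENTIONS = {
    "class":    re.compile(r"[A-Z][a-zA-Z0-9]*"),             # UpperCamelCase
    "method":   re.compile(r"[a-z][a-zA-Z0-9]*"),             # lowerCamelCase
    "constant": re.compile(r"[A-Z][A-Z0-9]*(_[A-Z0-9]+)*"),   # UPPER_SNAKE
}

def adheres(name: str, role: str) -> bool:
    """Check a Java name against the published convention for its role."""
    return bool(CONVENTIONS[role].fullmatch(name))
```

Running a checker like this over a corpus, per project and per role, is what makes between-project variation in adherence measurable.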
Relating Developers’ Concepts and Artefact Vocabulary in a Financial Software Module
Developers working on unfamiliar systems are challenged to accurately identify where and how high-level concepts are implemented in the source code. Without additional help, concept location can become a tedious, time-consuming and error-prone task. In this paper we study an industrial financial application for which we had access to the user guide, the source code, and some change requests. We compared the relative importance of the domain concepts, as understood by developers, in the user manual and in the source code. We also searched the code for the concepts occurring in change requests, to see if they could point developers to the code to be modified. We varied the searches (using exact and stem matching, discarding stop-words, etc.) and present the precision and recall. We discuss the implications of our results for maintenance.
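The precision and recall reported for each search variant follow the standard definitions; a generic helper (not the study's code) makes the computation concrete:

```python
def precision_recall(retrieved: set, relevant: set) -> tuple[float, float]:
    """Precision and recall of one code-search result set.

    `retrieved` is the set of code locations a search variant returned;
    `relevant` is the ground-truth set of locations for the concept.
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall
```

Comparing variants (exact vs. stem matching, with and without stop-words) then amounts to comparing these two numbers per concept query.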
A cascaded approach to normalising gene mentions in biomedical literature
Linking gene and protein names mentioned in the literature to unique identifiers in referent genomic databases is an essential step in accessing and integrating knowledge in the biomedical domain. However, it remains a challenging task due to lexical and terminological variation, and ambiguity of gene name mentions in documents. We present a generic and effective rule-based approach to link gene mentions in the literature to referent genomic databases, where pre-processing of both gene synonyms in the databases and gene mentions in text is first applied. The mapping method employs a cascaded approach, which combines exact, exact-like and token-based approximate matching by using flexible representations of a gene synonym dictionary and gene mentions generated during the pre-processing phase. We also consider multi-gene name mentions and permutation of components in gene names. A systematic evaluation of the suggested methods has identified steps that are beneficial for improving either precision or recall in gene name identification. The results of the experiments on the BioCreAtIvE2 data sets (identification of human gene names) demonstrated that our methods achieved highly encouraging results with F-measure of up to 81.20%.
Extraction of chemical structures and reactions from the literature
The ever increasing quantity of chemical literature necessitates the creation of automated techniques for extracting relevant information. This work focuses on two aspects: the conversion of chemical names to computer-readable structure representations and the extraction of chemical reactions from text.
Chemical names are a common way of communicating chemical structure information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an open source, freely available algorithm for converting chemical names to structures, was developed. OPSIN employs a regular grammar to direct tokenisation and parsing, leading to the generation of an XML parse tree. Nomenclature operations are applied successively to the tree, many requiring the manipulation of an in-memory connection table representation of the structure under construction. The areas of nomenclature supported are described, with attention drawn to difficulties that may be encountered in name-to-structure conversion. Results on sets of generated names and names extracted from patents are presented. On generated names, recall of between 96.2% and 99.0% was achieved, with a lower bound of 97.9% on precision, all results being comparable or superior to the tested commercial solutions. On the patent names, OPSIN's recall was 2-10% higher than that of the tested solutions when the patent names were processed as found in the patents. The uses of OPSIN as a web service and as a tool for identifying chemical names in text are shown to demonstrate the direct utility of this algorithm.
A software system for extracting chemical reactions from the text of chemical patents was developed. The system relies on the output of ChemicalTagger, a tool for tagging words and identifying phrases of importance in experimental chemistry text. Improvements to this tool required to facilitate this task are documented. The structures of chemical entities are, where possible, determined using OPSIN in conjunction with a dictionary of name-to-structure relationships. Extracted reactions are atom mapped to confirm that they are chemically consistent. 424,621 atom-mapped reactions were extracted from 65,034 organic chemistry USPTO patents. On a sample of 100 of these extracted reactions, chemical entities were identified with 96.4% recall and 88.9% precision. Quantities could be associated with reagents in 98.8% of cases and with products in 64.9% of cases, whilst the correct role was assigned to chemical entities in 91.8% of cases. Qualitatively, the system captured the essence of the reaction in 95% of cases. This system is expected to be useful in the creation of searchable databases of reactions from chemical patents and in facilitating analysis of the properties of large populations of reactions.
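The combination of a name-to-structure dictionary with a systematic-name parser can be pictured as a simple fallback chain; the interface below, including the lookup order, is an illustrative assumption rather than the thesis's implementation:

```python
from typing import Callable, Optional

def resolve_structure(name: str,
                      dictionary: dict[str, str],
                      parser: Callable[[str], Optional[str]]) -> Optional[str]:
    """Resolve a chemical name to a structure (e.g. a SMILES string).

    Trivial names ("water") are looked up in a dictionary; other
    names are handed to a systematic-name parser such as OPSIN.
    """
    if name.lower() in dictionary:
        return dictionary[name.lower()]
    return parser(name)
```

Whichever path succeeds, downstream steps such as atom mapping then operate on the resolved structures rather than on raw text.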
Locating bugs without looking back
Bug localisation is a core program comprehension task in software maintenance: given the observation of a bug, e.g. via a bug report, where is it located in the source code? Information retrieval (IR) approaches see the bug report as the query, and the source code files as the documents to be retrieved, ranked by relevance. Such approaches have the advantage of not requiring expensive static or dynamic analysis of the code. However, current state-of-the-art IR approaches rely on project history, in particular previously fixed bugs or previous versions of the source code. We present a novel approach that directly scores each current file against the given report, thus not requiring past code and reports. The scoring method is based on heuristics identified through manual inspection of a small sample of bug reports. We compare our approach to eight others, using their own five metrics on their own six open source projects. Out of 30 performance indicators, we improve 27 and equal 2. Over the projects analysed, on average we find one or more affected files in the top 10 ranked files for 76% of the bug reports. These results show the applicability of our approach to software projects without history.
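A history-free scoring scheme of this kind can be sketched as follows; the two heuristics shown (token overlap with the report, and a bonus when a file's name occurs in the report) are illustrative assumptions, not the paper's actual heuristics:

```python
import re

def score_file(report: str, file_text: str, file_name: str) -> float:
    """Heuristically score one current source file against a bug report.

    Uses only the report and the file itself: no fixed-bug history,
    no previous versions of the code.
    """
    tokens = lambda s: set(re.findall(r"[a-zA-Z]{3,}", s.lower()))
    report_toks, file_toks = tokens(report), tokens(file_text)
    overlap = len(report_toks & file_toks) / (len(report_toks) or 1)
    stem = file_name.rsplit(".", 1)[0].lower()
    name_bonus = 1.0 if stem in report.lower() else 0.0
    return overlap + name_bonus

def rank_files(report: str, files: dict[str, str]) -> list[str]:
    """Rank files by descending score; `files` maps name -> contents."""
    return sorted(files, key=lambda f: score_file(report, files[f], f),
                  reverse=True)
```

Because every signal comes from the current snapshot, a scheme like this applies equally to a project's first release and to one with years of history.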
Mathematical Language Processing: Deep Learning Representations and Inference over Mathematical Text
Extending functional databases for use in text-intensive applications
This thesis continues research exploring the benefits of using functional databases based around the functional data model for advanced database applications, particularly those supporting investigative systems. This is a growing generic application domain covering areas such as criminal and military intelligence, which are characterised by significant data complexity, large data sets and the need for high-performance, interactive use. An experimental functional database language was developed to provide the requisite semantic richness. However, heavy use in a practical context has shown that language extensions and implementation improvements are required, especially in the crucial areas of string matching and graph traversal. In addition, an implementation on multiprocessor, parallel architectures is essential to meet the performance needs arising from existing and projected database sizes in the chosen application area. [Continues.]