Tokenisation of class files for an embedded Java processor
Los Alamitos, US
Analysing Java Identifier Names
Identifier names are the principal means of recording and communicating ideas in source code and are a significant source of information for software developers and maintainers, and the tools that support their work. This research aims to increase understanding of identifier name content types - words, abbreviations, etc. - and phrasal structures - noun phrases, verb phrases, etc. - by improving techniques for the analysis of identifier names. The techniques and knowledge acquired can be applied to improve program comprehension tools that support internal code quality, concept location, traceability and model extraction. Previous detailed investigations of identifier names have focused on method names, and the content and structure of Java class and reference (field, parameter, and variable) names are less well understood.
I developed improved algorithms to tokenise names, and trained part-of-speech tagger models on identifier names to support the analysis of class and reference names in a corpus of 60 open source Java projects. I confirm that developers structure the majority of names according to identifier naming conventions, and use phrasal structures reported in the literature. I also show that developers use a wider variety of content types and phrasal structures than previously understood. Unusually structured class names are largely project-specific naming conventions, but could indicate design issues. Analysis of phrasal reference names showed that developers most often use the phrasal structures described in the literature and used to support the extraction of information from names, but also choose unexpected phrasal structures, and complex, multi-phrasal, names.
Using Nominal, software I created to evaluate adherence to naming conventions, I found that developers tend to follow naming conventions, but that adherence to published conventions varies between projects because developers also establish new conventions for the use of typography, content types and phrasal structure to support their work, particularly to distinguish the roles of Java field names.
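The tokenisation algorithms developed in the thesis are more sophisticated, but the basic idea of splitting an identifier into its constituent words can be sketched as follows (a minimal illustration, not the thesis's algorithm):

```python
import re

def tokenise_identifier(name: str) -> list[str]:
    """Split a Java identifier into its constituent words.

    Handles underscores, camelCase boundaries, and runs of capitals
    followed by a lowercase letter (e.g. "XMLParser" -> "XML", "Parser").
    """
    tokens = []
    for part in name.split("_"):
        # Break at a lowercase/digit followed by an uppercase letter
        part = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part)
        # Break inside an acronym run before a capitalised word
        part = re.sub(r"(?<=[A-Z])(?=[A-Z][a-z])", " ", part)
        tokens.extend(part.split())
    return tokens

print(tokenise_identifier("parseXMLDocument"))  # ['parse', 'XML', 'Document']
print(tokenise_identifier("MAX_BUFFER_size"))   # ['MAX', 'BUFFER', 'size']
```

A rule-based splitter like this handles conventional names; the harder cases studied in the thesis (abbreviations, digits, unconventional typography) need more than regular expressions.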
Requirement Mining for Model-Based Product Design
PLM software applications should enable engineers to develop and manage requirements throughout the product's lifecycle. However, PLM activities at the beginning-of-life and end-of-life of a product mainly rely on a fastidious document-based approach. Indeed, requirements are scattered across many different prescriptive documents (reports, specifications, standards, regulations, etc.), which makes feeding a requirements management tool laborious. Our contribution is two-fold. First, we propose a natural language processing (NLP) pipeline to extract requirements from prescriptive documents. Second, we show how machine learning techniques can be used to develop a text classifier that automatically classifies requirements into disciplines. Both contributions support companies willing to feed a requirements management tool from prescriptive documents. The NLP experiment shows an average precision of 0.86 and an average recall of 0.95, whereas the SVM requirements classifier outperforms a naive Bayes classifier with a 76% accuracy rate.
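The naive Bayes baseline used in the comparison can be sketched in a few lines. The discipline labels and training sentences below are invented for illustration, not taken from the study's corpus:

```python
import math
from collections import Counter, defaultdict

# Toy training data: requirement sentences labelled by discipline.
# Labels and sentences are illustrative only.
TRAIN = [
    ("The pump shall deliver 5 L/min at nominal pressure.", "mechanical"),
    ("The casing shall withstand a 10 bar pressure test.", "mechanical"),
    ("The controller shall log all fault codes.", "software"),
    ("The firmware shall reboot within 2 seconds.", "software"),
]

def tokens(text):
    return [w.strip(".,").lower() for w in text.split()]

# Multinomial naive Bayes with add-one (Laplace) smoothing
class_counts = Counter(label for _, label in TRAIN)
word_counts = defaultdict(Counter)
vocab = set()
for text, label in TRAIN:
    for w in tokens(text):
        word_counts[label][w] += 1
        vocab.add(w)

def classify(text):
    best, best_lp = None, -math.inf
    for label in class_counts:
        # log prior + sum of smoothed log likelihoods
        lp = math.log(class_counts[label] / len(TRAIN))
        total = sum(word_counts[label].values())
        for w in tokens(text):
            lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best, best_lp = label, lp
    return best

print(classify("The firmware shall log fault codes"))  # 'software'
```

An SVM over TF-IDF features, as used in the paper, typically separates disciplines better than this word-count model, which matches the reported 76% accuracy result.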
Text Augmentation: Inserting markup into natural language text with PPM Models
This thesis describes a new optimisation and new heuristics for automatically marking up XML documents. These are implemented in CEM, using PPM models. CEM is significantly more general than previous systems, marking up large numbers of hierarchical tags, using n-gram models for large n, and a variety of escape methods.
Four corpora are discussed, including the bibliography corpus of 14,682 bibliographies laid out in seven standard styles using the BibTeX system and marked up in XML with every field from the original BibTeX. Other corpora include the ROCLING Chinese text segmentation corpus, the Computists' Communique corpus and the Reuters corpus. A detailed examination is presented of the methods of evaluating markup algorithms, including computational complexity measures and correctness measures from the fields of information retrieval, string processing, machine learning and information theory.
A new taxonomy of markup complexities is established and the properties of each taxon are examined in relation to the complexity of marked-up documents. The performance of the new heuristics and optimisation is examined using the four corpora.
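The core PPM mechanism behind CEM, predicting each symbol from the longest matching context and escaping to shorter contexts when the symbol is unseen, can be sketched minimally (this is a simplified PPM-C-style model, not the thesis's CEM system):

```python
from collections import defaultdict, Counter

class PPM:
    """Minimal PPM-style character model with PPM-C escapes.

    Predicts from the longest matching context, escaping to shorter
    contexts when a symbol is unseen there. Deliberately simplified.
    """
    def __init__(self, order=2):
        self.order = order
        # contexts[k][ctx] counts symbols seen after each length-k context
        self.contexts = [defaultdict(Counter) for _ in range(order + 1)]

    def train(self, text):
        for i, ch in enumerate(text):
            for k in range(self.order + 1):
                if i >= k:
                    self.contexts[k][text[i - k:i]][ch] += 1

    def prob(self, history, ch):
        """P(ch | history), escaping from the highest order down to order 0."""
        p_escape = 1.0
        for k in range(min(self.order, len(history)), -1, -1):
            counts = self.contexts[k][history[len(history) - k:]]
            total, distinct = sum(counts.values()), len(counts)
            if total == 0:
                continue  # context never seen: escape for free
            if counts[ch]:
                # PPM-C: escape mass is distinct/(total+distinct)
                return p_escape * counts[ch] / (total + distinct)
            p_escape *= distinct / (total + distinct)
        return p_escape / 256  # uniform fallback over a byte alphabet

model = PPM(order=2)
model.train("abracadabra")
print(model.prob("ab", "r"))  # 'r' always follows 'ab' in the training text
```

Markup insertion then amounts to searching for the tag sequence that maximises the probability of the tagged text under models like this one.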
Extraction of chemical structures and reactions from the literature
The ever increasing quantity of chemical literature necessitates the creation of automated techniques for extracting relevant information. This work focuses on two aspects: the conversion of chemical names to computer-readable structure representations and the extraction of chemical reactions from text.
Chemical names are a common way of communicating chemical structure information. OPSIN (Open Parser for Systematic IUPAC Nomenclature), an open source, freely available algorithm for converting chemical names to structures, was developed. OPSIN employs a regular grammar to direct tokenisation and parsing, leading to the generation of an XML parse tree. Nomenclature operations are applied successively to the tree, with many requiring the manipulation of an in-memory connection table representation of the structure under construction. The areas of nomenclature supported are described, with attention drawn to difficulties that may be encountered in name-to-structure conversion. Results on sets of generated names and names extracted from patents are presented. On generated names, recall of between 96.2% and 99.0% was achieved with a lower bound of 97.9% on precision, with all results either comparable or superior to the tested commercial solutions. On the patent names, OPSIN's recall was 2-10% higher than that of the tested solutions when the patent names were processed as found in the patents. The uses of OPSIN as a web service and as a tool for identifying chemical names in text are shown to demonstrate the direct utility of this algorithm.
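Grammar-directed tokenisation of a systematic name can be illustrated with a tiny token set. The fragment below is purely illustrative; OPSIN's real grammar covers a vastly larger nomenclature:

```python
import re

# A tiny, illustrative token inventory; not OPSIN's grammar.
TOKEN_RE = re.compile(
    r"(?P<locant>\d+(?:,\d+)*-)"            # e.g. "2-" or "2,2-"
    r"|(?P<multiplier>di|tri|tetra)"
    r"|(?P<substituent>methyl|ethyl|propyl)"
    r"|(?P<root>methane|ethane|propane|butane)"
)

def tokenise_name(name: str):
    """Greedily tokenise a name left to right, labelling each token."""
    tokens, pos = [], 0
    while pos < len(name):
        m = TOKEN_RE.match(name, pos)
        if not m:
            raise ValueError(f"untokenisable at {name[pos:]!r}")
        tokens.append((m.lastgroup, m.group()))
        pos = m.end()
    return tokens

print(tokenise_name("2,2-dimethylpropane"))
# [('locant', '2,2-'), ('multiplier', 'di'),
#  ('substituent', 'methyl'), ('root', 'propane')]
```

In OPSIN the labelled tokens become elements of an XML parse tree, to which nomenclature operations are then applied; this sketch stops at the token sequence.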
A software system for extracting chemical reactions from the text of chemical patents was developed. The system relies on the output of ChemicalTagger, a tool for tagging words and identifying phrases of importance in experimental chemistry text. Improvements to this tool required to facilitate this task are documented. The structures of chemical entities are, where possible, determined using OPSIN in conjunction with a dictionary of name-to-structure relationships. Extracted reactions are atom mapped to confirm that they are chemically consistent. 424,621 atom-mapped reactions were extracted from 65,034 organic chemistry USPTO patents. On a sample of 100 of these extracted reactions, chemical entities were identified with 96.4% recall and 88.9% precision. Quantities could be associated with reagents in 98.8% of cases and with products in 64.9% of cases, whilst the correct role was assigned to chemical entities in 91.8% of cases. Qualitatively, the system captured the essence of the reaction in 95% of cases. This system is expected to be useful in the creation of searchable databases of reactions from chemical patents and in facilitating the analysis of the properties of large populations of reactions.
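A much weaker form of the chemical-consistency check performed by atom mapping is simply verifying that element totals balance across a reaction. The sketch below does only that (real atom mapping assigns each product atom to a specific reactant atom) and handles only bracket-free formulas:

```python
import re
from collections import Counter

def element_counts(formula: str) -> Counter:
    """Count atoms in a simple molecular formula, e.g. 'C2H6O' -> C:2 H:6 O:1.

    Deliberately simplified: no brackets, hydrates, or charges.
    """
    counts = Counter()
    for sym, num in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[sym] += int(num) if num else 1
    return counts

def is_balanced(reactants, products):
    """Crude consistency check: total atoms of each element must match."""
    lhs = sum((element_counts(f) for f in reactants), Counter())
    rhs = sum((element_counts(f) for f in products), Counter())
    return lhs == rhs

# Esterification: acetic acid + ethanol -> ethyl acetate + water
print(is_balanced(["C2H4O2", "C2H6O"], ["C4H8O2", "H2O"]))  # True
```

A reaction that fails even this element-balance test has certainly been mis-extracted, which is why a full atom-mapping step is a useful filter on automatically extracted reactions.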
- …