
    Document frequency and term specificity

    Document frequency is used in various applications in Information Retrieval and related fields. A common assumption is that a term's document frequency reflects its level of specificity, but empirical results supporting this assumption are limited. A large-scale experiment was therefore carried out, using multiple corpora, to gain further insight into the relationship between document frequency and term specificity. The results show that the assumption holds only at the very specific levels that cover the majority of the vocabulary. They also show that a larger corpus estimates specificity more accurately; however, co-occurrence information proves effective for improving accuracy when only a small corpus is available.
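
    A minimal sketch of the core quantity involved: document frequency and its inverse (IDF), which is commonly read as a specificity signal, with rarer terms scoring higher. The toy corpus and whitespace tokenizer are illustrative assumptions, not the paper's experimental setup.

        import math
        from collections import Counter

        def document_frequencies(corpus):
            """Count, for each term, the number of documents containing it."""
            df = Counter()
            for doc in corpus:
                df.update(set(doc.lower().split()))  # one count per document
            return df

        def idf(term, df, n_docs):
            """Inverse document frequency: higher values suggest more specific terms."""
            return math.log(n_docs / (1 + df.get(term, 0)))

        # Illustrative three-document corpus.
        corpus = [
            "the cat sat on the mat",
            "the dog chased the cat",
            "the optimiser uses stochastic gradient descent",
        ]
        df = document_frequencies(corpus)
        for term in ("the", "cat", "stochastic"):
            print(term, df[term], round(idf(term, df, len(corpus)), 3))

    On this toy corpus the ubiquitous "the" scores lowest and the rare "stochastic" highest, which is the ordering the specificity assumption predicts.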

    Evaluation of taxonomic and neural embedding methods for calculating semantic similarity

    Modelling semantic similarity plays a fundamental role in lexical semantic applications. A natural way of calculating semantic similarity is to consult handcrafted semantic networks, but similarity can also be predicted in a distributional vector space. Similarity calculation remains a challenging task, even with the latest breakthroughs in deep neural language models. We first examined popular methodologies for measuring taxonomic similarity, from edge-counting, which relies solely on the semantic relations in a taxonomy, to more complex methods that estimate concept specificity. We further identified three weighting factors used in modelling taxonomic similarity. To study the distinct mechanisms behind taxonomic and distributional similarity measures, we ran head-to-head comparisons of each measure against human similarity judgements from the perspectives of word frequency, polysemy degree, and similarity intensity. Our findings suggest that, without fine-tuning the uniform distance, taxonomic similarity measures can depend on the shortest path length as the prime factor in predicting semantic similarity; that, in contrast to distributional semantics, edge-counting is free from sense-distribution bias in use and can measure word similarity both literally and metaphorically; and that the synergy of retrofitting neural embeddings with concept relations in similarity prediction may indicate a new trend in leveraging knowledge bases for transfer learning. A large gap still remains in computing semantic similarity across different ranges of word frequency, polysemy degree, and similarity intensity.
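
    As a concrete illustration of edge-counting, the sketch below scores similarity as 1/(1 + shortest path length) over a hand-made toy taxonomy. Both the taxonomy and that particular transform are assumptions chosen for illustration; the paper compares several path-based formulations.

        from collections import deque

        # Undirected is-a edges of a tiny, hypothetical taxonomy.
        edges = [("entity", "animal"), ("animal", "dog"), ("animal", "cat"),
                 ("entity", "artifact"), ("artifact", "car")]
        graph = {}
        for a, b in edges:
            graph.setdefault(a, set()).add(b)
            graph.setdefault(b, set()).add(a)

        def shortest_path_length(src, dst):
            """Breadth-first search over taxonomy edges."""
            seen, queue = {src}, deque([(src, 0)])
            while queue:
                node, dist = queue.popleft()
                if node == dst:
                    return dist
                for nxt in graph[node] - seen:
                    seen.add(nxt)
                    queue.append((nxt, dist + 1))
            return None

        def path_similarity(a, b):
            """Edge-counting similarity: closer concepts score higher."""
            d = shortest_path_length(a, b)
            return None if d is None else 1.0 / (1.0 + d)

        print(path_similarity("dog", "cat"))  # 0.333...: siblings, two edges apart
        print(path_similarity("dog", "car"))  # 0.2: related only via the root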

    Closing the gap between guidance and practice: an investigation of the relevance of design guidance to practitioners using object-oriented technologies

    This thesis investigates whether object-oriented guidance is relevant in practice, and how this affects the software that is produced. This is achieved by surveying practitioners and studying how constructs such as interfaces and inheritance are used in open-source systems. Surveyed practitioners framed 'good design' in terms of its impact on development and maintenance. Recognition of quality requires practitioner judgement (individually and as a group), and principles are valued over rules. Time constraints heighten sensitivity to the rework cost of poor design decisions. Examination of open-source systems highlights the use of interfaces and inheritance. There is some evidence of 'textbook' use of these structures, and much use is simple. Outliers are widespread, indicating a pragmatic approach. Design is found to reflect the pressures of practice: high-level decisions justify 'designed' structures and architecture, while uncertainty leads to deferred design decisions, producing simpler structures, repetition, and unconsolidated design. Sub-populations of structures can be identified which may represent common trade-offs. Useful insights are gained into practitioner attitudes to design guidance. Patterns of use and structure are identified which may aid in the assessment and comprehension of object-oriented systems.
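
    The thesis studies existing open-source systems; as a loosely analogous, hypothetical illustration in Python, the sketch below counts how many classes in a source file declare a base class, the kind of raw inheritance-usage figure such a study would aggregate across a codebase.

        import ast

        def inheritance_stats(source):
            """Return (total classes, classes that declare at least one base)."""
            tree = ast.parse(source)
            classes = [n for n in ast.walk(tree) if isinstance(n, ast.ClassDef)]
            with_bases = [c for c in classes if c.bases]
            return len(classes), len(with_bases)

        # Hypothetical source under analysis.
        example = """
        class Shape: pass
        class Circle(Shape): pass
        class Registry: pass
        """
        total, derived = inheritance_stats(example.replace("        ", ""))
        print(f"{derived}/{total} classes use inheritance")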

    Software quality attribute measurement and analysis based on class diagram metrics

    Software quality measurement lies at the heart of the quality engineering process. Quality measurement for object-oriented artifacts has become key to ensuring high-quality software. Both researchers and practitioners are interested in measuring software product quality for improvement. It has recently become more important to consider product quality at the early phases, especially at the design level, to ensure that coding and testing can be conducted more quickly and accurately. Research on measuring quality at the design level has progressed in a number of steps. The first step was to discover the correct set of metrics for measuring design elements. Chidamber and Kemerer (C&K) formulated the first suite of OO metrics, and other researchers extended this suite with additional metrics. The next step was to collect these metrics using software tools. A number of tools were developed to measure the different suites of metrics; some represent their measurements as ordinary numbers, others in 3D visual form. In recent years, researchers have developed software quality models that go a step further by computing quality attributes from the collected design metrics. In this research we extend the software quality modellers' work by adding a quality-attribute prioritization scheme and a design-metric analysis layer. Our work focuses on the class diagram, the most fundamental constituent of any object-oriented design. Using earlier researchers' work, we extract a class diagram's metrics and compute its quality attributes. We then analyze the results and present our figures and observations to the user in the form of an analysis report. Our target user could be a project manager, a software quality engineer, or a developer who needs to improve the class diagram's quality. We closely examine the design metrics that affect quality attributes, pinpoint the weaknesses in the class diagram based on these metrics, inform the user about the problems that emerge from these classes, and advise them on how to improve the overall design quality. We consider six basic quality attributes of the whole class diagram: "Reusability", "Functionality", "Understandability", "Flexibility", "Extendibility", and "Effectiveness". The user sets priorities on these quality attributes in a sequential manner based on their requirements. Using a geometric series, we calculate a weighted average over the ordered list of quality attributes; this weighted average indicates the overall quality of the product, the class diagram. Our experimental work gave us much insight into the meanings of, and dependencies between, design metrics and quality attributes, which helped us refine our analysis technique and give more concrete observations to the user.
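
    A minimal sketch of geometric-series weighting in the spirit described above: the i-th attribute in the user's priority order gets weight r^i, so earlier priorities dominate the overall score. The attribute scores and the ratio 1/2 are illustrative assumptions; the thesis does not fix them here.

        def weighted_quality(scores_in_priority_order, ratio=0.5):
            """Weight the i-th prioritized attribute by ratio**i, then
            normalize, so higher-priority attributes dominate the result."""
            weights = [ratio ** i for i in range(len(scores_in_priority_order))]
            total = sum(w * s for w, s in zip(weights, scores_in_priority_order))
            return total / sum(weights)

        # Hypothetical scores for (Reusability, Functionality, Understandability,
        # Flexibility, Extendibility, Effectiveness), sorted by user priority.
        print(round(weighted_quality([0.8, 0.6, 0.9, 0.5, 0.7, 0.4]), 3))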

    Restructuring Object-Oriented Designs Using a Metric-Driven Approach

    The benefits of object-oriented software are now widely recognized. However, the methodologies used to develop object-oriented software are still in their infancy, and there is a lack of methods for assessing the quality of the various components derived during the development process. The design of a system is a crucial artifact of that process, yet little attention has been given to assessing object-oriented designs to determine their goodness. There are metrics that can provide guidance for assessing design quality. The objective of this research is to develop a system that evaluates object-oriented designs and provides guidance for restructuring the design based on the results of the evaluation. We identify a basic set of metrics that reflects the benefits of the object-oriented paradigm, such as inheritance, encapsulation, and method interactions. Specifically, we include metrics that measure depth of inheritance, method usage, cardinality of subclasses, coupling, class responses, and cohesion. We define techniques to evaluate these metric values on existing object-oriented designs, and then techniques that use the values to help restructure designs so that they conform to predetermined design criteria. These methods and techniques are implemented as part of a Design Evaluation Assistant that automates much of the evaluation and restructuring process.
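
    A minimal sketch computing two of the metrics named above, depth of inheritance (DIT) and number of children (NOC), from a parent map. The example hierarchy is an illustrative assumption, not one of the designs the dissertation evaluated.

        # Each class maps to its direct parent, or None for a root class.
        parents = {"Circle": "Shape", "Square": "Shape",
                   "Shape": None, "Logger": None}

        def dit(cls):
            """Depth of inheritance: edges from the class up to its root."""
            depth = 0
            while parents[cls] is not None:
                cls = parents[cls]
                depth += 1
            return depth

        def noc(cls):
            """Number of children: classes naming this class as direct parent."""
            return sum(1 for p in parents.values() if p == cls)

        for c in parents:
            print(c, "DIT =", dit(c), "NOC =", noc(c))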

    Automated Software Metrics, Repository Evaluation and Software Asset Management: New Tools and Perspectives for Managing Integrated Computer Aided Software Engineering (I-CASE)

    Automated collection of software metrics in computer-aided software engineering (CASE) environments opens up new avenues for improving the management of software development operations, as well as shifting the focus of management's control efforts from "software projects" to "software assets" stored in a centralized repository. Repository evaluation, a new direction for software metrics research in the 1990s, promises a fresh view of software development performance for a range of responsibility levels. We discuss the automation of function point and code reuse analysis in the context of an integrated CASE (I-CASE) environment deployed at a large investment bank in New York City. The development of an automated code reuse analysis tool prompted us to rethink how to measure and interpret code reuse in the I-CASE environment. The metrics we propose describe three dimensions of code reuse -- leverage, value and classification -- and we examine the value of applying them on a project and a repository-wide basis.
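
    A minimal sketch of a reuse "leverage" style ratio in the spirit of the metrics above. The definition used here (reused size over total delivered size) and the data are assumptions for illustration; the paper defines its own leverage, value, and classification measures.

        def reuse_leverage(modules):
            """Fraction of delivered size that came from reused components."""
            total = sum(size for size, _ in modules)
            reused = sum(size for size, is_reused in modules if is_reused)
            return reused / total if total else 0.0

        # Hypothetical (function_points, reused?) pairs for one project.
        project = [(120, True), (300, False), (80, True), (50, False)]
        print(f"reuse leverage: {reuse_leverage(project):.0%}")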

    Benchmarking Ontologies: Bigger or Better?

    A scientific ontology is a formal representation of knowledge within a domain, typically including central concepts, their properties, and relations. With the rise of computers and high-throughput data collection, ontologies have become essential to data mining and sharing across communities in the biomedical sciences. Powerful approaches exist for testing the internal consistency of an ontology, but not for assessing the fidelity of its domain representation. We introduce a family of metrics that describe the breadth and depth with which an ontology represents its knowledge domain. We then test these metrics using (1) four of the most common medical ontologies with respect to a corpus of medical documents and (2) seven of the most popular English thesauri with respect to three corpora that sample language from medicine, news, and novels. Here we show that our approach captures the quality of ontological representation and guides efforts to narrow the breach between ontology and collective discourse within a domain. Our results also demonstrate key features of medical ontologies, English thesauri, and discourse from different domains. Medical ontologies have a small intersection, as do English thesauri. Moreover, dialects characteristic of distinct domains vary strikingly, as many of the same words are used quite differently in medicine, news, and novels. As ontologies are intended to mirror the state of knowledge, our methods to tighten the fit between ontology and domain will increase their relevance for new areas of biomedical science and improve the accuracy and power of inferences computed across them.
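
    A minimal sketch of one breadth-style quantity: the token-weighted share of a corpus covered by an ontology's concept labels. The term sets below are illustrative assumptions; the paper's metrics are richer and also account for depth of representation.

        from collections import Counter

        def coverage(ontology_terms, corpus_tokens):
            """Token-weighted share of corpus occurrences the ontology matches."""
            counts = Counter(corpus_tokens)
            covered = sum(n for t, n in counts.items() if t in ontology_terms)
            return covered / sum(counts.values())

        # Hypothetical ontology labels and a one-sentence "corpus".
        ontology = {"fever", "cough", "aspirin"}
        corpus = "patient reports fever and persistent cough treated with aspirin".split()
        print(f"coverage: {coverage(ontology, corpus):.0%}")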

    Indirect Relatedness, Evaluation, and Visualization for Literature Based Discovery

    The exponential growth of scientific literature is creating an increased need for systems that process and assimilate the knowledge contained within text. Literature Based Discovery (LBD) is a well-established field that seeks to synthesize new knowledge from existing literature, but it has remained primarily in the theoretical realm rather than in real-world application. This lack of adoption is due in part to the difficulty of LBD, but also to several solvable problems present in LBD today. Of these, the ones in most critical need of improvement are: (1) the over-generation of knowledge by LBD systems, (2) a lack of meaningful evaluation standards, and (3) the difficulty of interpreting LBD output. We address each of these problems by: (1) developing indirect relatedness measures for ranking and filtering LBD hypotheses; (2) developing a representative evaluation dataset and applying meaningful evaluation methods to individual components of LBD; and (3) developing an interactive visualization system that allows a user to explore LBD output in its entirety. In addressing these problems, we make several contributions, most importantly: (1) state-of-the-art results for estimating direct semantic relatedness, (2) development of set association measures, (3) development of indirect association measures, (4) development of a standard LBD evaluation dataset, (5) division of LBD into discrete components with well-defined evaluation methods, (6) development of automatic functional group discovery, and (7) integration of indirect relatedness measures and automatic functional group discovery into a comprehensive LBD visualization system. Our results inform the future development of LBD systems and contribute to creating more effective ones.
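
    A minimal sketch of an indirect (A-B-C) relatedness score: a target pair is scored through the linking terms B it shares with both endpoints, here by averaging the products of direct scores. The direct scores below are made-up numbers over Swanson's classic fish oil / Raynaud's example, and the averaging scheme is an assumption, not the dissertation's exact measure.

        direct = {  # hypothetical direct relatedness between term pairs
            ("fish_oil", "blood_viscosity"): 0.7,
            ("blood_viscosity", "raynauds"): 0.6,
            ("fish_oil", "platelet_aggregation"): 0.5,
            ("platelet_aggregation", "raynauds"): 0.8,
        }

        def indirect_relatedness(a, c):
            """Score a-c through every b with known a-b and b-c scores."""
            bs = ({y for (x, y) in direct if x == a}
                  & {x for (x, y) in direct if y == c})
            if not bs:
                return 0.0
            return sum(direct[(a, b)] * direct[(b, c)] for b in bs) / len(bs)

        print(indirect_relatedness("fish_oil", "raynauds"))  # links via two B-terms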