
    On the Effect of Semantically Enriched Context Models on Software Modularization

    Many of the existing approaches for program comprehension rely on the linguistic information found in source code, such as identifier names and comments. Semantic clustering is one such technique for modularization of a system; it relies on the informal semantics of the program, encoded in the vocabulary used in the source code. Treating the source code as a plain collection of tokens, however, loses the semantic information embedded within the identifiers. We try to overcome this problem by introducing context models for source code identifiers to obtain a semantic kernel, which can be used both for deriving the topics that run through the system and for clustering. In the first model, we abstract an identifier to its type representation and build on this notion of context to construct a contextual vector representation of the source code. The second notion of context is defined based on the flow of data between identifiers: a module is represented as a dependency graph whose nodes correspond to identifiers and whose edges represent the data dependencies between pairs of identifiers. We have applied our approach to 10 medium-sized open source Java projects and show that introducing contexts for identifiers improves the quality of the modularization of the software systems. Both context models give results that are superior to the plain vector representation of documents. In some cases, the authoritativeness of decompositions is improved by 67%. Furthermore, a more detailed evaluation of our approach on JEdit, an open source editor, demonstrates that the topics inferred by performing topic analysis on the contextual representations are more meaningful than those from the plain representation of the documents. The proposed context model for source code identifiers paves the way for building tools that support developers in program comprehension tasks such as application and domain concept location, software modularization and topic analysis.
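    As an illustrative sketch only (the paper does not publish an implementation here), the two context notions can be pictured as follows: identifiers abstracted to their declared types, and a data-dependency graph over identifiers. All names and the hand-supplied types and edges below are hypothetical; a real tool would derive them from a Java parser.

```python
# An illustrative sketch (hand-supplied data, hypothetical identifiers) of the
# two context models: (1) identifiers abstracted to their declared types, and
# (2) a data-dependency graph over identifiers.
from collections import defaultdict

# Context model 1: identifier -> type abstraction. A module's contextual
# "document" is then the bag of types its identifiers carry.
identifier_types = {
    "accountList": "List<Account>",
    "owner": "Customer",
    "balance": "BigDecimal",
}
type_context = sorted(identifier_types.values())

# Context model 2: data-dependency graph, nodes = identifiers,
# edges = data flow between pairs of identifiers.
data_flows = defaultdict(set)

def add_data_flow(source, target):
    """Record that data flows from identifier `source` into `target`."""
    data_flows[source].add(target)

add_data_flow("owner", "accountList")    # accounts looked up by owner
add_data_flow("accountList", "balance")  # balance computed from the accounts

print(type_context)      # ['BigDecimal', 'Customer', 'List<Account>']
print(dict(data_flows))  # {'owner': {'accountList'}, 'accountList': {'balance'}}
```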

    Advanced Knowledge Technologies at the Midterm: Tools and Methods for the Semantic Web

    In a celebrated essay on the new electronic media, Marshall McLuhan wrote in 1962: "Our private senses are not closed systems but are endlessly translated into each other in that experience which we call consciousness. Our extended senses, tools, technologies, through the ages, have been closed systems incapable of interplay or collective awareness. Now, in the electric age, the very instantaneous nature of co-existence among our technological instruments has created a crisis quite new in human history. Our extended faculties and senses now constitute a single field of experience which demands that they become collectively conscious. Our technologies, like our private senses, now demand an interplay and ratio that makes rational co-existence possible. As long as our technologies were as slow as the wheel or the alphabet or money, the fact that they were separate, closed systems was socially and psychically supportable. This is not true now when sight and sound and movement are simultaneous and global in extent." (McLuhan 1962, p.5, emphasis in original)

    Over forty years later, the seamless interplay that McLuhan demanded between our technologies is still barely visible. McLuhan's predictions of the spread, and increased importance, of electronic media have of course been borne out, and the worlds of business, science and knowledge storage and transfer have been revolutionised. Yet the integration of electronic systems as open systems remains in its infancy.

    Advanced Knowledge Technologies (AKT) aims to address this problem: to create a view of knowledge and its management across its lifecycle, and to research and create the services and technologies that such unification will require. Half way through its six-year span, the results are beginning to come through, and this paper will explore some of the services, technologies and methodologies that have been developed. We hope to give a sense of the potential for the next three years, to discuss the insights and lessons learnt in the first phase of the project, and to articulate the challenges and issues that remain.

    The WWW provided the original context that made the AKT approach to knowledge management (KM) possible. When AKT was initially proposed in 1999, it brought together an interdisciplinary consortium with the technological breadth and complementarity to create the conditions for a unified approach to knowledge across its lifecycle. The combination of this expertise, and the time and space afforded the consortium by the IRC structure, suggested the opportunity for a concerted effort to develop an approach to advanced knowledge technologies, based on the WWW as a basic infrastructure.

    The technological context of AKT altered for the better in the short period between the development of the proposal and the beginning of the project itself, with the development of the semantic web (SW), which foresaw much more intelligent manipulation and querying of knowledge. The opportunities that the SW provided for, e.g., more intelligent retrieval put AKT at the centre of information technology innovation and knowledge management services; the AKT skill set would clearly be central to exploiting those opportunities.

    The SW, as an extension of the WWW, provides an interesting set of constraints on the knowledge management services AKT tries to provide. As a medium for the semantically informed coordination of information, it has suggested a number of ways in which the objectives of AKT can be achieved, most obviously through the provision of knowledge management services delivered over the web, as opposed to the creation and provision of technologies to manage knowledge.

    AKT is working on the assumption that many web services will be developed and provided for users. The KM problem in the near future will be one of deciding which services are needed and of coordinating them. Many of these services will be largely or entirely legacies of the WWW, and so the capabilities of the services will vary. As well as providing useful KM services in their own right, AKT will aim to exploit this opportunity by reasoning over services, brokering between them, and providing essential meta-services for SW knowledge service management.

    Ontologies will be a crucial tool for the SW. The AKT consortium brings together a great deal of expertise on ontologies, and ontologies were always going to be a key part of the strategy. All kinds of knowledge sharing and transfer activities will be mediated by ontologies, and ontology management will be an important enabling task. Different applications will need to cope with inconsistent ontologies, or with the problems that will follow the automatic creation of ontologies (e.g. the merging of pre-existing ontologies to create a third). Ontology mapping, and the elimination of conflicts of reference, will be important tasks. All of these issues are discussed along with our proposed technologies.

    Similarly, specifications of tasks will be used for the deployment of knowledge services over the SW, but in general it cannot be expected that in the medium term there will be standards for task (or service) specifications. The brokering meta-services that are envisaged will have to deal with this heterogeneity.

    The emerging picture of the SW is one of great opportunity, but it will not be a well-ordered, certain or consistent environment. It will comprise many repositories of legacy data, outdated and inconsistent stores, and requirements for common understandings across divergent formalisms. There is clearly a role for standards to play in bringing much of this context together, and AKT is playing a significant role in these efforts. But standards take time to emerge, they take political power to enforce, and they have been known to stifle innovation (in the short term). AKT is keen to understand the balance between principled inference and statistical processing of web content. Logical inference on the Web is tough: complex queries using traditional AI inference methods bring most distributed computer systems to their knees. Do we set up semantically well-behaved areas of the Web? Is any part of the Web in which semantic hygiene prevails interesting enough to reason in? These and many other questions need to be addressed if we are to provide effective knowledge technologies for our content on the web.

    Multi modal multi-semantic image retrieval

    The rapid growth in the volume of visual information, e.g. images and video, can overwhelm users' ability to find and access the specific visual information of interest to them. In recent years, ontology knowledge-based (KB) image information retrieval techniques have been adopted in order to extract knowledge from these images and enhance retrieval performance. A KB framework is presented to promote semi-automatic annotation and semantic image retrieval using multimodal cues (visual features and text captions). In addition, a hierarchical structure for the KB allows metadata to be shared and supports multiple semantics (polysemy) for concepts. The framework builds up an effective knowledge base pertaining to a domain-specific image collection, e.g. sports, and is able to disambiguate and assign high-level semantics to 'unannotated' images. Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, has been deployed in the 'Bag of Visual Words' (BVW) model as an effective method to represent visual content information and to enhance its classification and retrieval. Local features are more useful than global features, e.g. colour, shape or texture, as they are invariant to image scale, orientation and camera angle. An innovative approach is proposed for the representation, annotation and retrieval of visual content using a hybrid technique based upon the use of unstructured visual words and upon a (structured) hierarchical ontology KB model. The structural model facilitates the disambiguation of unstructured visual words and a more effective classification of visual content, compared to a vector space model, through exploiting local conceptual structures and their relationships. The key contributions of this framework in using local features for image representation include: first, a method to generate visual words using the semantic local adaptive clustering (SLAC) algorithm, which takes term weight and the spatial locations of keypoints into account; consequently, the semantic information is preserved. Second, a technique is used to detect the domain-specific 'non-informative visual words' which are ineffective at representing the content of visual data and degrade its categorisation ability. Third, a method to combine an ontology model with a visual word model to resolve synonym (visual heterogeneity) and polysemy problems is proposed. The experimental results show that this approach can discover semantically meaningful visual content descriptions and recognise specific events, e.g. sports events, depicted in images efficiently.
    Since discovering the semantics of an image is an extremely challenging problem, one promising approach to enhance visual content interpretation is to use any associated textual information that accompanies an image as a cue to predict its meaning, by transforming this textual information into a structured annotation for the image, e.g. using XML, RDF, OWL or MPEG-7. Although text and image are distinct types of information representation and modality, there are some strong, invariant, implicit connections between images and any accompanying text information. Semantic analysis of image captions can be used by image retrieval systems to retrieve selected images more precisely. To do this, Natural Language Processing (NLP) is first exploited in order to extract concepts from image captions. Next, an ontology-based knowledge model is deployed in order to resolve natural language ambiguities. To deal with the accompanying text information, two methods to extract knowledge from textual information have been proposed. First, metadata can be extracted automatically from text captions and restructured with respect to a semantic model. Second, the use of LSI in relation to a domain-specific ontology-based knowledge model enables the combined framework to tolerate ambiguities and variations (incompleteness) in metadata. The use of the ontology-based knowledge model allows the system to find indirectly relevant concepts in image captions and thus leverage these to represent the semantics of images at a higher level. Experimental results show that the proposed framework significantly enhances image retrieval and leads to a narrowing of the semantic gap between lower-level machine-derived and higher-level human-understandable conceptualisations.
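    A minimal sketch of the generic Bag of Visual Words pipeline described above: SIFT descriptors clustered into a visual vocabulary, with each image then represented as a histogram of visual words. This is the standard pipeline only, not the thesis's SLAC algorithm (which additionally weights terms and keypoint locations); it assumes OpenCV and scikit-learn are available.

```python
# Generic Bag of Visual Words (BVW) pipeline: a sketch, not the thesis's
# SLAC algorithm. Requires opencv-python and scikit-learn.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(image_paths, k=500):
    """Cluster SIFT descriptors from a training set into k visual words."""
    sift = cv2.SIFT_create()
    all_descriptors = []
    for path in image_paths:
        image = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descriptors = sift.detectAndCompute(image, None)
        if descriptors is not None:
            all_descriptors.append(descriptors)
    return KMeans(n_clusters=k, n_init=10).fit(np.vstack(all_descriptors))

def bvw_histogram(image_path, vocabulary):
    """Represent one image as a normalized histogram of visual words."""
    sift = cv2.SIFT_create()
    image = cv2.imread(image_path, cv2.IMREAD_GRAYSCALE)
    _, descriptors = sift.detectAndCompute(image, None)
    words = vocabulary.predict(descriptors)
    histogram = np.bincount(words, minlength=vocabulary.n_clusters)
    return histogram / histogram.sum()
```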

    Knowledge-based methods for automatic extraction of domain-specific ontologies

    Semantic web technology aims at developing methodologies for representing large amounts of knowledge in web accessible form. The semantics of knowledge should be easy for computer programs to interpret and understand, so that sharing and utilizing knowledge across the Web becomes possible. Domain-specific ontologies form the basis for knowledge representation in the semantic web. Research on automated development of ontologies from texts has become increasingly important because manual construction of ontologies is labor intensive and costly and, at the same time, large amounts of text for individual domains are already available in electronic form. However, automatic extraction of domain-specific ontologies is challenging due to the unstructured nature of texts and inherent semantic ambiguities in natural language. Moreover, the large size of the texts to be processed renders full-fledged natural language processing methods infeasible. In this dissertation, we develop a set of knowledge-based techniques for automatic extraction of ontological components (concepts, and taxonomic and non-taxonomic relations) from domain texts. The proposed methods combine information retrieval metrics, a lexical knowledge base (WordNet), machine learning techniques, heuristics, and statistical approaches to meet the challenge of the task. These methods are domain-independent and automatic. For the extraction of concepts, the proposed WNSCA+{PE, POP} method utilizes the lexical knowledge base WordNet to improve precision and recall over the traditional information retrieval metrics. A WordNet-based approach, a compound-term heuristic, and a supervised learning approach are developed for taxonomy extraction. We also develop a weighted word-sense disambiguation method for use with the WordNet-based approach. An unsupervised approach using log-likelihood ratios is proposed for extracting non-taxonomic relations. Furthermore, a supervised approach is investigated to learn the semantic constraints for identifying relations from prepositional phrases. The proposed methods are validated by experiments with the Electronic Voting and the Tender Offers, Mergers, and Acquisitions domain corpora. Experimental results and comparisons with some existing approaches clearly indicate the superiority of our methods. In summary, a good combination of information retrieval, lexical knowledge-base, statistical, and machine learning methods in this study has led to techniques that are efficient and effective for extracting ontological components automatically.
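    A minimal sketch of the compound-term heuristic mentioned above, using NLTK's WordNet interface, under the assumed interpretation that a multi-word term is a hyponym of its head noun, with WordNet hypernyms extending the taxonomy upward. The function name and the first-sense choice are illustrative; this is not the dissertation's WNSCA+{PE, POP} method.

```python
# Compound-term heuristic sketch: "voting machine" IS-A "machine", then
# extend upward via WordNet hypernyms of the head noun. Requires NLTK with
# the WordNet corpus downloaded.
from nltk.corpus import wordnet as wn

def compound_term_links(term):
    """Return (hyponym, hypernym) pairs implied by a compound term."""
    words = term.lower().split()
    links = []
    if len(words) > 1:
        head = words[-1]            # head noun of the compound
        links.append((term, head))  # e.g. "voting machine" IS-A "machine"
        # First-sense heuristic; the dissertation instead uses a weighted
        # word-sense disambiguation method.
        synsets = wn.synsets(head, pos=wn.NOUN)
        if synsets:
            for hypernym in synsets[0].hypernyms():
                links.append((head, hypernym.lemmas()[0].name()))
    return links

print(compound_term_links("voting machine"))
# e.g. [('voting machine', 'machine'), ('machine', 'device')]
```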

    Ontology Localization

    Our main goal in this thesis is to propose a solution for building a multilingual ontology through the automatic localization of an ontology. The notion of localization comes from the area of software development, where it refers to the adaptation of a software product to a non-native environment. In ontological engineering, ontology localization can be considered a subtype of software localization in which the product is a shared model of a particular domain, for example an ontology, to be used by a certain application. Concretely, our work introduces a new proposal for the multilingualism problem, describing the methods, techniques and tools for the localization of ontological resources and how multilingualism can be represented in ontologies. It is not the goal of this work to advocate a single approach to ontology localization, but rather to show the variety of methods and techniques that can be adapted from other areas of knowledge to reduce the cost and effort involved in enriching an ontology with multilingual information. We are convinced that there is no single method for ontology localization; nevertheless, we concentrate on automatic solutions for localizing these resources. The proposal presented in this thesis provides comprehensive coverage of the localization activity for ontology practitioners. In particular, this work offers a formal explanation of our general localization process, defining the inputs, the outputs, and the main steps identified. In addition, the proposal considers several dimensions for localizing an ontology. These dimensions allow us to establish a classification of translation techniques based on methods drawn from the machine translation discipline. To facilitate the analysis of these translation techniques, we introduce an evaluation framework that covers their main aspects. Finally, we offer an intuitive view of the complete ontology localization life cycle and outline our approach to defining a system architecture that supports this activity. The proposed model comprises the system components, the visible properties of those components, and the relationships between them, and also provides a basis from which ontology localization systems can be developed. The main contributions of this work are summarized as follows:
    - A characterization and definition of the ontology localization problems, based on problems found in related areas. The proposed characterization takes into account three different localization problems: translation, information management, and the representation of multilingual information.
    - A prescriptive methodology to support the ontology localization activity, based on the localization methodologies used in software engineering and knowledge engineering, kept as general as possible so that it can cover a wide range of scenarios.
    - A classification of ontology localization techniques, which can serve to compare (analytically) different ontology localization systems, as well as to design new systems by taking advantage of state-of-the-art solutions.
    - An integrated method for building ontology localization systems in a distributed and collaborative environment, which takes into account the most appropriate methods and techniques depending on: i) the domain of the ontology to be localized, and ii) the amount of linguistic information required for the final ontology.
    - A modular component to support the storage of the multilingual information associated with each ontology term. Our proposal follows the current trend in the integration of multilingual information into ontologies, which suggests that the ontological knowledge and the (multilingual) linguistic information be kept separate and independent (see the sketch after this list).
    - A model based on collaborative workflows for representing the process normally followed in different organizations to coordinate the localization activity across different natural languages.
    - An integrated infrastructure implemented within the NeOn Toolkit as a set of plug-ins and extensions that support the collaborative ontology localization process.
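    A minimal sketch, under assumed naming, of the kind of modular component described in the contributions: multilingual linguistic information stored separately from the ontology itself and keyed by term IRI. The class and field names (LexicalEntry, LocalizationStore) are illustrative and are not taken from the NeOn Toolkit plug-ins.

```python
# Illustrative data model: linguistic (multilingual) information kept
# separate and independent from the ontological knowledge.
from dataclasses import dataclass, field

@dataclass
class LexicalEntry:
    label: str                  # e.g. "river"
    language: str               # language tag, e.g. "en", "es"
    provenance: str = "manual"  # or "machine-translation"

@dataclass
class LocalizationStore:
    # ontology term IRI -> list of language-specific entries
    entries: dict = field(default_factory=dict)

    def add(self, term_iri: str, entry: LexicalEntry) -> None:
        self.entries.setdefault(term_iri, []).append(entry)

    def labels(self, term_iri: str, language: str) -> list:
        return [e.label for e in self.entries.get(term_iri, [])
                if e.language == language]

store = LocalizationStore()
store.add("http://example.org/onto#River", LexicalEntry("river", "en"))
store.add("http://example.org/onto#River",
          LexicalEntry("río", "es", provenance="machine-translation"))
print(store.labels("http://example.org/onto#River", "es"))  # ['río']
```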

    Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches

    While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of an ontology from free-text documents. In this dissertation, I extended these methodologies to biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This review, the first of its kind, was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches covering the three OL methods. The significance of this work is as follows: 1) the results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method revealed average new-concept acceptance rates of 21% and 11%, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%. As a result of this study (published in Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT. 2) The evaluation framework is flexible and general enough to analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships would be missed as domain knowledge evolves.
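    A minimal, illustrative sketch of the symbolic (Hearst-pattern) approach compared in the study: lexico-syntactic patterns such as "X such as A, B and C" suggest candidate is-a relationships in free text. The single regular expression below is a simplification of the full Hearst pattern set and will over-capture in longer sentences.

```python
# Simplified Hearst pattern: HYPERNYM such as X, Y(,) and Z
import re

PATTERN = re.compile(r"([A-Za-z][\w ]*?)\s+such as\s+([\w ,]+)", re.IGNORECASE)

def extract_isa(sentence):
    """Return (hyponym, hypernym) candidate pairs from one sentence."""
    pairs = []
    match = PATTERN.search(sentence)
    if match:
        hypernym = match.group(1).strip()
        # Split the enumeration on commas and coordinating conjunctions.
        hyponyms = re.split(r",\s*|\s+(?:and|or)\s+", match.group(2))
        pairs = [(h.strip(), hypernym) for h in hyponyms if h.strip()]
    return pairs

print(extract_isa("imaging modalities such as CT, MRI and ultrasound"))
# [('CT', 'imaging modalities'), ('MRI', 'imaging modalities'),
#  ('ultrasound', 'imaging modalities')]
```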

    Conceptual Representations for Computational Concept Creation

    Computational creativity seeks to understand computational mechanisms that can be characterized as creative. The creation of new concepts is a central challenge for any creative system. In this article, we outline different approaches to computational concept creation and then review conceptual representations relevant to concept creation, and therefore to computational creativity. The conceptual representations are organized in accordance with two important perspectives on the distinctions between them. One distinction is between symbolic, spatial and connectionist representations. The other is between descriptive and procedural representations. Additionally, conceptual representations used in particular creative domains, such as language, music, image and emotion, are reviewed separately. For every representation reviewed, we cover the inference it affords, the computational means of building it, and its application in concept creation.

    Ontology-driven web-based semantic similarity

    The version of record is available online at: http://dx.doi.org/10.1007/s10844-009-0103-x

    Estimation of the degree of semantic similarity/distance between concepts is a very common problem in research areas such as natural language processing, knowledge acquisition, information retrieval or data mining. In the past, many similarity measures have been proposed, exploiting explicit knowledge (such as the structure of a taxonomy) or implicit knowledge (such as information distribution). In the former case, taxonomies and/or ontologies are used to introduce additional semantics; in the latter case, frequencies of term appearances in a corpus are considered. Classical measures based on those premises suffer from some problems: in the first case, an excessive dependency on the taxonomical/ontological structure; in the second case, the lack of semantics in a purely statistical analysis of occurrences and/or the ambiguity of estimating concept statistical distributions from term appearances. Measures based on the Information Content (IC) of taxonomical concepts combine both approaches. However, they depend heavily on a corpus that has been properly pre-tagged and disambiguated according to the ontological entities in order to compute accurate concept appearance probabilities. This limits the applicability of those measures to other ontologies (like specific domain ontologies) and massive corpora (like the Web). In this paper, several of the presented issues are analyzed. Modifications of classical similarity measures are also proposed. They are based on a contextualized and scalable version of IC computation on the Web, exploiting taxonomical knowledge. The goal is to avoid the measures' dependency on corpus pre-processing, to achieve reliable results and to minimize language ambiguity. Our proposals are able to outperform classical approaches when using the Web for estimating concept probabilities.
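    For reference, the classical IC-based formulation being modified here is standard: the information content of a concept is derived from its occurrence probability, and Resnik-style similarity evaluates the IC of the least common subsumer (LCS). A web-based probability estimate of the kind the paper builds on can be sketched as below; the contextualized queries that incorporate taxonomical knowledge are the paper's contribution and are only hinted at here.

```latex
\mathrm{IC}(c) = -\log p(c), \qquad
\mathrm{sim}_{\mathrm{res}}(c_1, c_2) = \mathrm{IC}\big(\mathrm{LCS}(c_1, c_2)\big),
\qquad
p_{\mathrm{web}}(c) \approx \frac{\mathrm{hits}(q_c)}{N}
```

    where hits(q_c) is the number of web pages matching a taxonomy-contextualized query q_c for concept c, and N is the total number of indexed pages.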

    Using Ontology-Based Approaches to Representing Speech Transcripts for Automated Speech Scoring

    Text representation is a process of transforming text into formats that computer systems can use for subsequent information-related tasks such as text classification. Representing text faces two main challenges: the meaningfulness of the representation and unknown terms. Research has shown evidence that these challenges can be resolved by using the rich semantics in ontologies. This study aims to address these challenges by using ontology-based representation and unknown-term reasoning approaches in the context of content scoring of speech, which is a less explored area compared to some common ones such as categorizing text corpora (e.g. 20 Newsgroups and Reuters). From the perspective of language assessment, the increasing number of language learners taking second language tests makes automatic scoring an attractive alternative to human scoring for delivering rapid and objective scores of written and spoken test responses. This study focuses on the speaking section of second language tests and investigates ontology-based approaches to speech scoring. Most previous automated speech scoring systems for spontaneous responses of test takers assess speech primarily using acoustic features such as fluency and pronunciation, while text features are less involved and exploited. As content is an integral part of speech, the study is motivated by the lack of rich text features in speech scoring and is designed to examine the effects of different text features on scoring performance. A central question of the study is how speech transcript content can be represented in a means appropriate for speech scoring. Previously used approaches from essay and speech scoring systems include bag-of-words and latent semantic analysis representations, which are adopted as baselines in this study; the experimental approaches are ontology-based, which can help improve the meaningfulness of representation units and estimate the importance of unknown terms. Two general-domain ontologies, WordNet and Wikipedia, are used respectively for the ontology-based representations. In addition to the comparison between representation approaches, the author analyzes which parameter option leads to the best performance within a particular representation. The experimental results show that, on average, ontology-based representations slightly enhance speech scoring performance on all measurements when combined with the bag-of-words representation; reasoning about unknown terms can increase performance on one measurement (cos.w4) but decreases others. Due to the small data size, the significance test (t-test) shows that the enhancement from ontology-based representations is inconclusive. The contributions of the study include: 1) it examines the effects of different representation approaches on speech scoring tasks; 2) it enhances the understanding of the mechanisms of representation approaches and their parameter options via in-depth analysis; 3) the representation methodology and framework can be applied to other tasks such as automatic essay scoring.
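    A minimal sketch, assuming NLTK's WordNet interface, of an ontology-based text representation of the kind the study compares against bag-of-words: tokens are mapped to WordNet synsets so that synonyms share a representation unit, and vectors are compared by cosine. The first-sense mapping is a simplifying assumption, not the study's method.

```python
# WordNet synset-based representation of a transcript, compared by cosine.
# Requires NLTK with the WordNet corpus downloaded.
from collections import Counter
from math import sqrt
from nltk.corpus import wordnet as wn

def synset_vector(tokens):
    """Map tokens to synset names; unknown terms fall back to the raw token."""
    units = []
    for token in tokens:
        synsets = wn.synsets(token)
        units.append(synsets[0].name() if synsets else token)
    return Counter(units)

def cosine(u, v):
    dot = sum(u[key] * v[key] for key in u.keys() & v.keys())
    norm = (sqrt(sum(x * x for x in u.values()))
            * sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

a = synset_vector("the car engine sounds loud".split())
b = synset_vector("the automobile motor is noisy".split())
print(cosine(a, b))  # > 0: "car" and "automobile" map to the same synset
```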

    Realising context-oriented information filtering.

    The notion of information overload is an increasing factor in modern information service environments where information is 'pushed' to the user. As increasing volumes of information are presented to computing users in the form of email, web sites, instant messaging and news feeds, there is a growing need to filter and prioritise the importance of this information. 'Information management' needs to be undertaken in a manner that not only prioritises the information we do need, but also disposes of information that is sent to us but is of no (or little) use.

    The development of a model to aid information filtering in a context-aware way is an objective of this thesis. A key concern in the conceptualisation of a single concept is understanding the context under which that concept exists (or can exist). An example of a concept is a concrete object, for instance a book. This contextual understanding should provide us with clear conceptual identification of a concept, including implicit situational information and detail of surrounding concepts.

    Existing solutions to filtering information suffer from their own unique flaws: text-based filtering suffers from problems of inaccuracy; ontology-based solutions suffer from scalability challenges; taxonomies suffer from problems with collaboration. A major objective of this thesis is to explore the use of an evolving, community-maintained knowledge-base (that of Wikipedia) in order to populate the context model with prioritised concepts that are semantically relevant to the user's interest space. Wikipedia can be classified as a weak knowledge-base due to its simple TBox schema and implicit predicates; therefore, part of this objective is to validate the claim that a weak knowledge-base is fit for this purpose. The proposed and developed solution therefore provides the benefits of high-recall filtering with low fallout, while depending only on a scalable and collaborative knowledge-base.

    A simple web feed aggregator, which we call DAVe's Rss Organisation System (DAVROS-2), has been built using the Java programming language as a testbed environment to demonstrate specific tests used within this investigation. The motivation behind the experiments is to demonstrate that the concept framework instantiated through Wikipedia can aid concept comparison, and can therefore be used in a news filtering scenario as an example of information overload. In order to evaluate the effectiveness of the method, well-understood measures of information retrieval are used. This thesis demonstrates that the developed contextual concept expansion framework (instantiated using Wikipedia) improved the quality of concept filtering over a baseline based on string matching. This has been demonstrated through the analysis of recall and fallout measures.
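    For reference, the recall and fallout measures used in this evaluation are the standard information retrieval definitions:

```latex
\mathrm{recall} = \frac{TP}{TP + FN},
\qquad
\mathrm{fallout} = \frac{FP}{FP + TN}
```

    where TP (true positives) are relevant items the filter retains, FN relevant items it wrongly discards, FP non-relevant items it retains, and TN non-relevant items it discards. High recall with low fallout means the filter keeps what matters without letting the noise through.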