588 research outputs found

    Automatic thesaurus construction

    Get PDF
    Sydney, NS

    Ontology Population via NLP Techniques in Risk Management

    Get PDF
    In this paper we propose an NLP-based method for Ontology Population from texts and apply it to semi-automatically instantiate a Generic Knowledge Base (Generic Domain Ontology) in the risk management domain. The approach is semi-automatic and relies on domain-expert intervention for validation. It uses a set of Instance Recognition Rules based on syntactic structures and on the predicative power of verbs in the instantiation process. It is not domain-dependent, since it relies heavily on linguistic knowledge. A description of an experiment performed on part of the ontology of the PRIMA project (supported by the European Community) is given. A first validation of the method is carried out by populating this ontology with Chemical Fact Sheets from the Environmental Protection Agency. The results of this experiment complete the paper and support the hypothesis that relying on the predicative power of verbs in the instantiation process improves performance.
    Keywords: Information Extraction, Instance Recognition Rules, Ontology Population, Risk Management, Semantic Analysis
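The verb-driven instantiation idea described in the abstract can be sketched minimally: a small set of hypothetical Instance Recognition Rules keyed on predicative verbs assigns candidate instances to ontology classes. The rule patterns, the class names ("Hazard", "Substance"), and the example sentence below are illustrative assumptions, not the paper's actual rules.

```python
import re

# Hypothetical instance-recognition rules: each pattern is keyed on a
# predicative verb, and a match assigns the sentence subject to an
# ontology class. Patterns and classes are assumptions for illustration.
RULES = [
    (re.compile(r"^(?P<inst>[A-Z][\w\s-]+?) (?:causes|irritates) ", re.I), "Hazard"),
    (re.compile(r"^(?P<inst>[A-Z][\w\s-]+?) is (?:stored|used) ", re.I), "Substance"),
]

def recognize_instances(sentence: str):
    """Return (instance, ontology_class) pairs matched by the verb-keyed rules."""
    hits = []
    for pattern, onto_class in RULES:
        m = pattern.match(sentence)
        if m:
            hits.append((m.group("inst").strip(), onto_class))
    return hits

print(recognize_instances("Ammonia irritates the respiratory tract."))
# → [('Ammonia', 'Hazard')]
```

A real system would, as the abstract indicates, derive the subject from a syntactic parse rather than a surface regex; the sketch only shows how a verb's predicative pattern can license an instantiation.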

    Synonym Detection Using Syntactic Dependency And Neural Embeddings

    Full text link
    Recent advances in the Vector Space Model have significantly improved some NLP applications such as neural machine translation and natural language generation. Although word co-occurrences in context have been widely used in counting- and prediction-based distributional models, the role of syntactic dependencies in deriving distributional semantics has not yet been thoroughly investigated. By comparing various Vector Space Models on synonym detection in TOEFL, we systematically study the salience of syntactic dependencies in accounting for distributional similarity. We separate syntactic dependencies into groups according to their grammatical roles and then use context counting to construct the corresponding raw and SVD-compressed matrices. Moreover, using the same training hyperparameters and corpora, we evaluate typical neural embeddings. We further study the effectiveness of injecting human-compiled semantic knowledge into neural embeddings for computing distributional similarity. Our results show that syntactically conditioned contexts can capture lexical semantics better than unconditioned ones, whereas retrofitting neural embeddings with semantic knowledge can significantly improve synonym detection.
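As a rough illustration of the dependency-conditioned, count-based pipeline the abstract describes, the sketch below builds a word-by-context matrix from (word, relation, context) triples, compresses it with SVD, and ranks neighbours by cosine similarity. The toy triples, relation labels, and dimension k are assumptions for illustration, not the paper's data.

```python
import numpy as np

# Toy (word, dependency relation, context word) triples standing in for
# parsed syntactic dependencies; real input would come from a parsed corpus.
triples = [
    ("car", "dobj", "drive"), ("automobile", "dobj", "drive"),
    ("car", "amod", "fast"), ("automobile", "amod", "fast"),
    ("banana", "dobj", "eat"), ("banana", "amod", "ripe"),
]

words = sorted({w for w, _, _ in triples})
contexts = sorted({(r, c) for _, r, c in triples})  # dependency-conditioned contexts

# Raw co-occurrence count matrix: rows are words, columns are (relation, word) contexts.
M = np.zeros((len(words), len(contexts)))
for w, r, c in triples:
    M[words.index(w), contexts.index((r, c))] += 1.0

# SVD compression of the raw count matrix, keeping k latent dimensions.
U, S, _ = np.linalg.svd(M, full_matrices=False)
k = 2
vecs = U[:, :k] * S[:k]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

sim = {w: cosine(vecs[words.index("car")], vecs[words.index(w)])
       for w in words if w != "car"}
print(max(sim, key=sim.get))  # nearest neighbour of "car"
```

Because "car" and "automobile" share all their dependency-conditioned contexts in the toy data, they end up with identical latent vectors, while "banana" does not.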

    A review of the state of the art in Machine Learning on the Semantic Web: Technical Report CSTR-05-003

    Get PDF

    A Dictionary of Modern Slovene: From the Slovene Lexical Database to the Digital Dictionary Database

    Get PDF
    The ability to process language data has become fundamental to the development of technologies in various areas of human life in the digital world. The development of digitally readable linguistic resources, methods, and tools is therefore also a key challenge for the contemporary Slovene language. This challenge has been recognized in the Slovene language community at both the professional and state level and has been the subject of many activities over the past ten years, which are presented in this paper. The idea of a comprehensive dictionary database covering all levels of linguistic description of modern Slovene, from the morphological and lexical levels to the syntactic level, had already been formulated within the framework of the European Social Fund's Communication in Slovene project (2008-2013); the Slovene Lexical Database was also created within this project. Two goals were pursued in designing the Slovene Lexical Database (SLD): creating linguistic descriptions of Slovene intended for human users, and making them useful for the machine processing of Slovene. Ever since the construction of the first Slovene corpus, it has been evident that a description of modern Slovene based on real language data is needed, and that it is necessary to understand the needs of language users in order to create useful language reference works. It also became apparent that only the digital medium enables a comprehensive language description, and that the design of the database must be adapted to it from the start. The description must also follow best practices as closely as possible in terms of formats and international standards, as this enables the inclusion of Slovene in a wider network of resources such as Linked Open Data, BabelNet, and ELEXIS.
    Due to time pressures and trends in lexicography, procedures to automate the extraction of linguistic data from corpora and the inclusion of crowdsourcing in the lexicographic process were also taken into consideration. Following the central idea of creating an all-inclusive digital dictionary database for Slovene, a few independent databases have been created over the past two years: the Collocations Dictionary of Modern Slovene and the automatically generated Thesaurus of Modern Slovene, both of which also exist as independent online dictionary portals. One of the novelties put forward with both dictionaries is the 'responsive dictionary' concept, which includes crowdsourcing methods. Ultimately, the Digital Dictionary Database provides all (other) levels of linguistic description: the morphological level with the Sloleks database upgrade, the phraseological level with the construction of a multi-word expressions lexicon, and the syntactic level with the formalization of Slovene verb valency patterns. Each of these databases contains specific language data that will ultimately be included in the comprehensive Slovene Digital Dictionary Database, which will represent a basic linguistic description of Slovene for both human and machine users.

    The Role of Indexing in Subject Retrieval

    Get PDF
    On first reading the list of speakers proposed for this institute, I became aware of being rather the "odd man out" for two reasons. Firstly, I was asked to present a paper on PRECIS, which is very much a verbal indexing system, at a conference dominated by contributions on classification schemes, with a natural bias, as the centenary year approaches, toward the Dewey Decimal Classification (DDC). Secondly, I feared (quite wrongly, as it happens) that I might be at variance with one or two of my fellow speakers, who would possibly like to assure us, in an age when we can no longer ignore the computer, that traditional library schemes such as DDC and the Library of Congress Classification (LCC) are capable of maintaining their original function of organizing collections of documents while also being well suited to the retrieval of relevant citations from machine-held files. In this context, I am reminded of a review of a general collection of essays on classification schemes which appeared in the Journal of Documentation in 1972. Norman Roberts, reviewing the papers which dealt specifically with the well-established schemes, deduced that "all the writers project their particular schemes into the future with an optimism that springs, perhaps, as much from a sense of emotional involvement as from concrete evidence." Since I do not believe that these general schemes can play any significant part in the retrieval of items from mechanized files, it appeared that I had been cast in the role of devil's advocate.

    An approach to automated thesaurus construction using clusterization-based dictionary analysis

    Get PDF
    In this paper, an automated approach to constructing a terminological thesaurus for a specific domain is proposed. It uses an explanatory dictionary as the initial text corpus and a controlled vocabulary related to the target lexicon to initiate extraction of terms for the thesaurus. Subdivision of the terms into semantic clusters is based on the CLOPE clustering algorithm. The approach reduces the cost of thesaurus creation by involving the expert only once during the whole construction process, and only for the analysis of a small subset of the initial dictionary. To validate the performance of the proposed approach, the authors successfully constructed a thesaurus in the cardiology domain.
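The CLOPE criterion that the clustering step builds on places each transaction (here, a set of terms) into the cluster that maximizes the gain in profit S(C)·|C| / W(C)^r, where S(C) is the total item count of the cluster, W(C) the number of distinct items, and r the repulsion parameter. A minimal one-pass sketch, with toy term sets and a repulsion value chosen purely for illustration:

```python
# Minimal CLOPE-style clustering sketch (single pass). The paper's actual
# procedure, repulsion value, and term sets are assumptions here.
def delta(cluster, txn, r):
    """Profit gain from adding transaction txn to cluster (CLOPE criterion)."""
    s_new = cluster["S"] + len(txn)
    w_new = len(cluster["items"] | set(txn))
    new_val = s_new * (cluster["n"] + 1) / (w_new ** r)
    old_val = (cluster["S"] * cluster["n"] / (len(cluster["items"]) ** r)
               if cluster["n"] else 0.0)
    return new_val - old_val

def clope(transactions, r=1.5):
    clusters, labels = [], []
    for txn in transactions:
        # Choose the existing cluster (or a fresh one) with the best gain.
        gains = [delta(c, txn, r) for c in clusters]
        empty = {"S": 0, "n": 0, "items": set()}
        gains.append(delta(empty, txn, r))
        best = max(range(len(gains)), key=gains.__getitem__)
        if best == len(clusters):
            clusters.append(empty)
        c = clusters[best]
        c["S"] += len(txn); c["n"] += 1; c["items"] |= set(txn)
        labels.append(best)
    return labels

# Toy term sets: two cardiology-flavoured entries and one unrelated entry.
terms = [{"heart", "valve"}, {"heart", "aorta"}, {"kernel", "matrix"}]
print(clope(terms))  # → [0, 0, 1]
```

The full CLOPE algorithm also iterates over the transactions again, moving them between clusters until no move improves the profit; the sketch stops after the initial assignment pass.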

    Computer-assisted text analysis methodology in the social sciences

    Full text link
    "This report presents an account of research methods in computer-assisted text analysis in the social sciences. Rather than providing a comprehensive enumeration of all computer-assisted text analysis investigations directly or indirectly related to the social sciences that use a quantitative, computer-assisted methodology as their text-analytical tool, the aim of this report is to describe the current methodological standpoint of computer-assisted text analysis in the social sciences. The report thus provides a description and discussion of the operations carried out in computer-assisted text analysis investigations. It examines well-established past approaches as well as some current approaches in the field and describes the techniques and procedures involved. By this means, a first attempt is made toward cataloguing the kinds of supplementary information, as well as computational support, that are further required to expand the suitability and applicability of the method for the variety of text analysis goals." (author's abstract)