588 research outputs found
Ontology Population via NLP Techniques in Risk Management
In this paper we propose an NLP-based method for ontology population from texts and apply it to semi-automatically instantiate a generic knowledge base (a generic domain ontology) in the risk-management domain. The approach is semi-automatic and relies on domain-expert intervention for validation. It is built on a set of instance recognition rules based on syntactic structures and on the predicative power of verbs in the instantiation process. Because it relies chiefly on linguistic knowledge, it is not domain dependent. A description of an experiment performed on part of the ontology of the PRIMA project (supported by the European Community) is given. A first validation of the method is done by populating this ontology with Chemical Fact Sheets from the Environmental Protection Agency. The results of this experiment conclude the paper and support the hypothesis that relying on the predicative power of verbs in the instantiation process improves performance.
Keywords: Information Extraction, Instance Recognition Rules, Ontology Population, Risk Management, Semantic Analysis
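The verb-centred instantiation the abstract describes can be pictured with a toy sketch. This is our own illustration, not the authors' implementation: the verb-to-concept table, the concept names, and the noun-before-verb rule are all invented for demonstration.

```python
# Hypothetical sketch of a verb-keyed instance recognition rule: given a
# POS-tagged sentence, a rule fires on a known predicate verb and maps the
# noun immediately preceding it to an instance of an ontology concept.

# Assumed mapping from predicate verbs to ontology concepts (illustrative only).
VERB_TO_CONCEPT = {
    "causes": "RiskFactor",
    "exposes": "Hazard",
}

def recognize_instances(tagged_sentence):
    """Return (instance, concept) pairs found in a POS-tagged sentence."""
    instances = []
    for i, (token, pos) in enumerate(tagged_sentence):
        if token in VERB_TO_CONCEPT and i > 0:
            prev_token, prev_pos = tagged_sentence[i - 1]
            if prev_pos.startswith("NN"):  # the verb's noun subject
                instances.append((prev_token, VERB_TO_CONCEPT[token]))
    return instances

sent = [("benzene", "NN"), ("causes", "VBZ"), ("leukemia", "NN")]
print(recognize_instances(sent))  # [('benzene', 'RiskFactor')]
```

A real system would of course draw the trigger verbs and target concepts from the ontology itself and match full syntactic structures rather than adjacency.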
Synonym Detection Using Syntactic Dependency And Neural Embeddings
Recent advances in the Vector Space Model have significantly improved some
NLP applications such as neural machine translation and natural language
generation. Although word co-occurrences in context have been widely used in
counting-/predicting-based distributional models, the role of syntactic
dependencies in deriving distributional semantics has not yet been thoroughly
investigated. By comparing various Vector Space Models in detecting synonyms in
TOEFL, we systematically study the salience of syntactic dependencies in
accounting for distributional similarity. We separate syntactic dependencies
into different groups according to their various grammatical roles and then use
context-counting to construct their corresponding raw and SVD-compressed
matrices. Moreover, using the same training hyperparameters and corpora, we
study typical neural embeddings in the evaluation. We further study the
effectiveness of injecting human-compiled semantic knowledge into neural
embeddings on computing distributional similarity. Our results show that the
syntactically conditioned contexts can interpret lexical semantics better than
the unconditioned ones, whereas retrofitting neural embeddings with semantic
knowledge can significantly improve synonym detection.
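The counting-based side of the comparison can be sketched in a few lines. This is a toy illustration, not the paper's experimental setup: the dependency triples and vocabulary are invented, and NumPy's truncated SVD stands in for the compression step.

```python
import numpy as np

# Count (word, dependency-typed context) co-occurrences, compress with a
# truncated SVD, and pick the TOEFL-style candidate closest in cosine terms.

# Toy (head, relation, dependent) triples; real work would parse a corpus.
triples = [
    ("car", "amod", "fast"), ("automobile", "amod", "fast"),
    ("car", "dobj_of", "drive"), ("automobile", "dobj_of", "drive"),
    ("banana", "amod", "ripe"), ("banana", "dobj_of", "eat"),
]

words = sorted({h for h, _, _ in triples})
contexts = sorted({(r, d) for _, r, d in triples})
M = np.zeros((len(words), len(contexts)))
for h, r, d in triples:
    M[words.index(h), contexts.index((r, d))] += 1.0

# Keep the top-k latent dimensions of the count matrix.
U, S, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
E = U[:, :k] * S[:k]  # compressed word vectors

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def best_synonym(target, candidates):
    t = E[words.index(target)]
    return max(candidates, key=lambda c: cosine(t, E[words.index(c)]))

print(best_synonym("car", ["automobile", "banana"]))  # automobile
```

Separating the triples by grammatical role (e.g. building one matrix per relation type) would reproduce the paper's grouping of syntactically conditioned contexts.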
Rječnik suvremenoga slovenskog jezika: od slovenske leksičke baze do digitalne rječničke baze (Dictionary of Modern Slovene: From the Slovene Lexical Database to the Digital Dictionary Database)
The ability to process language data has become fundamental to the development of technologies in various areas of human life in the digital world. The development of digitally readable linguistic resources, methods, and tools is, therefore, also a key challenge for the contemporary Slovene language. This challenge has been recognized in the Slovene language community both at the professional and state level and has been the subject of many activities over the past ten years, which will be presented in this paper.
The idea of a comprehensive dictionary database covering all levels of linguistic description in modern Slovene, from the morphological and lexical levels to the syntactic level, had already been formulated within the framework of the European Social Fund's Communication in Slovene (2008–2013) project; the Slovene Lexical Database was also created within the framework of this project. Two goals were pursued in designing the Slovene Lexical Database (SLD): creating linguistic descriptions of Slovene intended for human users that would also be useful for the machine processing of Slovene. Ever since the construction of the first Slovene corpus, it has become evident that there is a need for a description of modern Slovene based on real language data, and that it is necessary to understand the needs of language users to create useful language reference works. It also became apparent that only the digital medium enables the comprehensiveness of language description and that the design of the database must be adapted to it from the start. Also, the description must follow best practices as closely as possible in terms of formats and international standards, as this enables the inclusion of Slovene in a wider network of resources, such as Linked Open Data, BabelNet, and ELEXIS. Due to time pressures and trends in lexicography, procedures to automate the extraction of linguistic data from corpora and the inclusion of crowdsourcing in the lexicographic process were taken into consideration.
Following the essential idea of creating an all-inclusive digital dictionary database for Slovene, a few independent databases have been created over the past two years: the Collocations Dictionary of Modern Slovene, and the automatically generated Thesaurus of Modern Slovene, both of which also exist as independent online dictionary portals. One of the novelties that we put forward together with both dictionaries is the "responsive dictionary" concept, which includes crowdsourcing methods. Ultimately, the Digital Dictionary Database provides all (other) levels of linguistic description: the morphological level with the Sloleks database upgrade, the phraseological level with the construction of a multi-word expressions lexicon, and the syntactic level with the formalization of Slovene verb valency patterns. Each of these databases contains its specific language data that will ultimately be included in the comprehensive Slovene Digital Dictionary Database, which will represent basic linguistic descriptions of Slovene both for the human and machine user.
(Croatian summary, translated:) The idea of a comprehensive dictionary database covering all levels of linguistic description of modern Slovene, from the morphological and lexical to the syntactic level, was first formulated within the framework of the Communication in Slovene (2008–2013) project. To realize the idea of a comprehensive digital dictionary database, two independent databases were created: the Collocations Dictionary of Modern Slovene and the automatically generated Thesaurus of Modern Slovene. One of the novelties in both dictionaries is the concept of the responsive dictionary, which includes crowdsourcing. The Digital Dictionary Database contains all levels of linguistic description: the morphological level, upgraded with Sloleks; the phraseological level, with a description of multi-word constructions; and the syntactic level, with the formalization of verb valency models. Each of the existing databases contains specific language data that will be included in the comprehensive Slovene Digital Dictionary Database, which will contain a basic linguistic description of Slovene for both human and machine users.
The Role of Indexing in Subject Retrieval
On first reading the list of speakers proposed for this institute, I
became aware of being rather the "odd man out" for two reasons. Firstly, I
was asked to present a paper on PRECIS, which is very much a verbal
indexing system, at a conference dominated by contributions on classification
schemes with a natural bias, as the centenary year approaches, toward the
Dewey Decimal Classification (DDC). Secondly, I feared (quite wrongly, as it
happens) that I might be at variance with one or two of my fellow speakers,
who would possibly like to assure us, in an age when we can no longer ignore
the computer, that traditional library schemes such as DDC and Library of
Congress Classification (LCC) are capable of maintaining their original
function of organizing collections of documents, and at the same time are also
well suited to the retrieval of relevant citations from machine-held files. In
this context, I am reminded of a review of a general collection of essays on
classification schemes which appeared in the Journal of Documentation in
1972. Norman Roberts, reviewing the papers which dealt specifically with the
well established schemes, deduced that "all the writers project their particular
schemes into the future with an optimism that springs, perhaps, as much from
a sense of emotional involvement as from concrete evidence." Since I do not
believe that these general schemes can play any significant part in the retrieval
of items from mechanized files, it appeared that I had been cast in the role of
devil's advocate.
An approach to automated thesaurus construction using clusterization-based dictionary analysis
In this paper an automated approach to constructing a terminological thesaurus for a specific domain is proposed. It uses an explanatory dictionary as the initial text corpus and a controlled vocabulary related to the target lexicon to initiate extraction of terms for the thesaurus. The terms are subdivided into semantic clusters using the CLOPE clustering algorithm. The approach reduces the cost of thesaurus creation by involving the expert only once during the whole construction process, and only for the analysis of a small subset of the initial dictionary. To validate the performance of the proposed approach, the authors successfully constructed a thesaurus in the cardiology domain.
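For readers unfamiliar with CLOPE, the greedy, profit-driven assignment it performs can be sketched compactly. This is our own simplification of Yang et al.'s algorithm, not the authors' code; the cardiology term lists and the repulsion parameter `r` are illustrative.

```python
from collections import Counter

def _contrib(occ, n, r):
    # Cluster "profit": S(C) * |C| / W(C)^r, where S is the total item count,
    # W the number of distinct items, and |C| the number of transactions.
    s, w = sum(occ.values()), len(occ)
    return s * n / (w ** r) if w else 0.0

def clope(transactions, r=1.5):
    """One-pass greedy CLOPE: each transaction (here, a term's feature set)
    joins the cluster whose profit increases most, or starts a new cluster."""
    clusters, labels = [], []
    for txn in transactions:
        gains = [(_contrib(c["occ"] + Counter(txn), c["n"] + 1, r)
                  - _contrib(c["occ"], c["n"], r), i)
                 for i, c in enumerate(clusters)]
        new_gain = _contrib(Counter(txn), 1, r)
        if gains and max(gains)[0] > new_gain:
            _, i = max(gains)
            clusters[i]["occ"] += Counter(txn)
            clusters[i]["n"] += 1
            labels.append(i)
        else:
            clusters.append({"occ": Counter(txn), "n": 1})
            labels.append(len(clusters) - 1)
    return labels

terms = [["arrhythmia", "tachycardia"],
         ["arrhythmia", "bradycardia"],
         ["stenosis", "valve"]]
print(clope(terms))  # [0, 0, 1]
```

The full algorithm iterates additional passes, moving transactions between clusters until no move improves the total profit; the single pass above shows the core update.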
Computer-assisted text analysis methodology in the social sciences
"This report presents an account of methods of research in computer-assisted text analysis in
the social sciences. Rather than provide a comprehensive enumeration of all computer-assisted
text analysis investigations either directly or indirectly related to the social sciences using a
quantitative and computer-assisted methodology as their text analytical tool, the aim of this report is to describe the current methodological standpoint of computer-assisted text analysis in the social sciences. The report thus provides a description and a discussion of the operations carried out in computer-assisted text analysis investigations. It examines past and well-established approaches as well as some of the current approaches in the field and describes the techniques and procedures involved. By this means, a first attempt is made toward cataloguing the kinds of supplementary information and computational support that are further required to expand the suitability and applicability of the method for the variety of text analysis goals." (author's abstract)
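The operation most such investigations automate is dictionary-based category counting, which can be sketched as follows. The categories and vocabulary are invented for illustration and are not drawn from the report.

```python
import re
from collections import Counter

# Classic quantitative content analysis: count how many tokens of a text
# fall into each predefined content category (categories are illustrative).

CATEGORIES = {
    "economy": {"market", "trade", "price"},
    "conflict": {"war", "dispute", "strike"},
}

def categorize(text):
    """Return per-category token counts for a text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    counts = Counter()
    for tok in tokens:
        for cat, vocab in CATEGORIES.items():
            if tok in vocab:
                counts[cat] += 1
    return dict(counts)

print(categorize("The market dispute over trade led to a strike."))
# {'economy': 2, 'conflict': 2}
```

Real coding schemes add disambiguation rules and weighted categories, which is exactly the kind of supplementary information the report argues needs cataloguing.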
- …