
    MONDILEX – towards the research infrastructure for digital resources in Slavic lexicography


    Lexicographic Tools and Techniques.

    We briefly describe what grid technologies are and how they could contribute to language technologies, in particular to lexicographic activities. Based on our participation in the EC international project MULTEXT-East, we present some aspects of language resource compatibility: unification and standardisation. We underline the importance of the harmonised lexical (morphosyntactic) specifications and descriptions of language data developed in machine-readable form in a common standard encoding format (the Corpus Encoding Standard format) for six Central and East European (CEE) languages, as well as the language independence of the tools employed. The study and preparation of these results have received funding from the EC's Seventh Framework Programme [FP7/2007-2013] under grant agreement 211938 MONDILEX.

    CLaRK System - an XML-based system for Corpora Development

    The CLaRK System incorporates several technologies: XML technology, Unicode, cascaded regular grammars, and constraints over XML documents. On the basis of these technologies, the following tools are implemented: an XML editor, a Unicode tokeniser, a sorting tool, a removing and extracting tool, a concordancer, an XSLT tool, a cascaded regular grammar tool, and others.

    1 Unicode tokenisation. In order to impose constraints over textual nodes and to segment them in a meaningful way, the CLaRK System supports a user-defined hierarchy of tokenisers. At the most basic level, the user defines a tokeniser in terms of a set of token types, each token type given by a set of Unicode symbols. On top of these basic tokenisers, the user can define further tokenisers whose token types are defined as regular expressions over the tokens of another tokeniser, the so-called parent tokeniser.

    2 Regular grammars. Regular grammars are the basic mechanism for linguistic processing of the content of an XML document within the system. The regular grammar processor applies a set of rules over the content of selected elements and incorporates the categories of the matched rules back into the document as XML mark-up. Before the grammar rules are applied, the content is prepared in the following way: textual nodes are tokenised with an appropriate tokeniser, and element nodes are textualised on the basis of XPath expressions that select the relevant information about each element. A recognised word is replaced by new XML mark-up, which may or may not contain the word itself.

    3 Constraints. The constraints implemented in the CLaRK System are based on the XPath language: XPath expressions select data within one or several XML documents, and predicates are evaluated over the selected data. A constraint can be used in two modes. In the first mode it performs a validity check, similar to validation against a DTD or an XML schema. In the second mode it supports changing the document so that the document comes to satisfy the constraint. Three types of constraints are implemented in the system: regular expression constraints, number restriction constraints, and value restriction constraints.

    4 Macro language. In the CLaRK System, each tool supports a mechanism for describing its settings. On the basis of these descriptions (called queries), a tool can be applied simply by pointing to a stored description record. Each query contains the states of all settings and options of the corresponding tool. A special tool combines such queries and applies them in groups (macros); during application the queries are executed in sequence, with the result of each application serving as input to the next. For better control over applying several queries as one, the system provides conditional operators that determine the next query to apply depending on certain conditions. When the condition of such an operator is satisfied, execution continues from a location defined in the operator; addressing is based on user-defined labels. When the condition is not satisfied, the operator is ignored and execution continues from the position following it. In this way constructions like IF-THEN-ELSE and WHILE-DO can easily be expressed. The system supports five types of control operators:
    IF (XPath): the condition is an XPath expression evaluated on the current working document; it is satisfied if the result is a non-empty node set, a non-empty string, a positive number, or the boolean value true.
    IF NOT (XPath): the same kind of condition as the previous one, but with the approving result negated.
    IF CHANGED: satisfied if the preceding operation has changed the current working document or has produced a non-empty result document (depending on the operation).
    IF NOT CHANGED: satisfied if the previous operation either did not change the working document or did not produce a non-empty result.
    GOTO: unconditionally changes the execution position.
    Each macro defined in the system can have its own query and can be incorporated into another macro; in this way a limited form of subroutine can be implemented. The new version of CLaRK will support server applications and calls to/from external programs.
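    The validity-check mode of a value restriction constraint can be sketched as follows, assuming a Python stand-in; Python's xml.etree supports only a small subset of XPath, unlike CLaRK's full XPath engine, and the element names, attribute, and allowed values below are invented for illustration.

```python
# A minimal stand-in for a "value restriction" constraint in its
# validity-check mode. A path selects nodes; each selected node's
# attribute value must lie within an allowed value set.
import xml.etree.ElementTree as ET

def check_value_restriction(doc, path, attr, allowed):
    """Return the nodes selected by `path` whose `attr` value
    falls outside the allowed value set."""
    return [n for n in doc.findall(path) if n.get(attr) not in allowed]

doc = ET.fromstring(
    "<s><w pos='N'>dog</w><w pos='V'>barks</w><w pos='X'>!</w></s>")
violations = check_value_restriction(doc, ".//w", "pos", {"N", "V", "Adj"})
# one violation: the <w pos='X'> element
```

    In the second mode a system would go on to repair the reported nodes; the sketch stops at reporting, which corresponds to validation against a schema.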
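    The control operators above amount to a small labelled-jump language. A hypothetical sketch follows; the step encoding, function names, and the use of a plain integer in place of a working document are all illustrative, not CLaRK's actual API.

```python
# The "working document" here is just an integer, to keep the sketch
# minimal; in CLaRK it would be an XML document and the IF conditions
# would be XPath expressions.
def run_macro(steps, doc):
    """steps is a list of tuples:
    ("label", name), ("query", fn), ("if", cond_fn, name), ("goto", name).
    IF jumps to the label when its condition holds, else falls through."""
    labels = {s[1]: i for i, s in enumerate(steps) if s[0] == "label"}
    i = 0
    while i < len(steps):
        step = steps[i]
        if step[0] == "query":        # apply a tool with stored settings
            doc = step[1](doc)
        elif step[0] == "if":
            if step[1](doc):          # condition satisfied: jump
                i = labels[step[2]]
                continue
        elif step[0] == "goto":       # unconditional jump
            i = labels[step[1]]
            continue
        i += 1                        # labels and failed IFs fall through
    return doc

# WHILE-DO expressed with IF and GOTO: double until the value reaches 100
prog = [
    ("label", "loop"),
    ("if", lambda d: d >= 100, "done"),
    ("query", lambda d: d * 2),
    ("goto", "loop"),
    ("label", "done"),
]
result = run_macro(prog, 3)
```

    The IF-THEN-ELSE pattern falls out the same way: a conditional jump over the "then" queries to an "else" label, followed by a GOTO past it.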

    Dependency parsing of Turkish

    The suitability of different parsing methods for different languages is an important topic in syntactic parsing. In particular, lesser-studied languages that are typologically different from those for which the methods were originally developed pose interesting challenges in this respect. This article presents an investigation of data-driven dependency parsing of Turkish, an agglutinative free-constituent-order language that can be seen as representative of a wider class of languages of similar type. Our investigations show that morphological structure plays an essential role in finding syntactic relations in such a language. In particular, we show that employing sublexical representations called inflectional groups, rather than word forms, as the basic parsing units improves parsing accuracy. We compare two different parsing methods, one based on a probabilistic model with beam search and the other based on discriminative classifiers and a deterministic parsing strategy, and show that the usefulness of sublexical units holds regardless of parsing method. We examine the impact of morphological and lexical information in detail and show that, properly used, this kind of information can improve parsing accuracy substantially. Applying the techniques presented in this article, we achieve the highest reported accuracy for parsing the Turkish Treebank.
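    The idea of inflectional groups as parsing units can be illustrated with a small sketch. The "^DB" derivational-boundary marker follows the convention of Oflazer-style morphological analysers for Turkish; the analysis string below is illustrative, not taken from the treebank.

```python
# Sketch of inflectional groups (IGs): a morphological analysis is
# split at derivational boundaries, and each IG, rather than the whole
# word form, serves as a basic unit for dependency parsing.
def inflectional_groups(analysis):
    """Split a morphological analysis string at "^DB" markers."""
    return analysis.split("^DB")

# An adjective root with two derivations yields three IGs.
igs = inflectional_groups("saglam+Adj^DB+Verb+Become^DB+Verb+Caus+Pos+Inf1")
```

    Dependencies can then attach to a specific IG of a word rather than to the word as a whole, which is what makes morphological structure visible to the parser.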

    Annotated Bibliography


    Universal Dictionary of Concepts

    A universal dictionary of concepts, developed as part of the ongoing effort to create a semantic intermediary language for global information exchange, is presented. The article describes the basic principles and contents of the dictionary and outlines the current state of the project. The dictionary can evolve into an open and freely available language-independent resource with many potential applications. For example, the extensible dictionary of concepts can serve as a pivot to uniformly record and link the meanings of words in different languages and thus facilitate the creation of bi- and multilingual dictionaries. Another possible use is word-sense markup of corpora, which could bring rich extra benefits because the same set of concepts is to be linked with major world languages, including Russian, English, and Spanish, and supported by multiple text-analysis tools. There is also a possibility of cooperation and exchange between this dictionary project and other projects, which could enhance the output and eventually save a great deal of parallel effort.
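    The pivot role of such a concept dictionary can be sketched as follows; the concept identifiers, lexicon entries, and function names are invented for illustration.

```python
# Hypothetical pivot use of a concept dictionary: each concept ID links
# words across languages, so a bilingual dictionary can be derived from
# independent monolingual word-to-concept mappings.
lexicon = {
    ("rus", "собака"): {"C0012"},
    ("eng", "dog"):    {"C0012"},
    ("spa", "perro"):  {"C0012"},
    ("eng", "hound"):  {"C0012", "C0077"},
}

def translations(word, src, tgt):
    """Words of language `tgt` sharing a concept with `word` in `src`."""
    concepts = lexicon.get((src, word), set())
    return sorted(w for (lang, w), cs in lexicon.items()
                  if lang == tgt and cs & concepts)
```

    Because every language pair goes through the same concept inventory, adding one new language immediately yields candidate bilingual dictionaries against all languages already present.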

    Building to learn / Learning to build. BACUS-UAB: Terminology training for translator trainees

    This article describes a terminology database developed with three goals in mind: i) training translator trainees in terminology, ii) information retrieval for specialized translation, and iii) resource building for laypeople and for language experts and mediators (translators and terminologists). The work carried out over more than a decade enables us to draw conclusions regarding the human factor, knowledge engineering, the scope of terminography, and the knowledge power of specialist language.