Modernizing science & engineering software systems
As the demand for modernizing legacy systems rises, so does the need for
frameworks for information integration and tool interoperability. The Object Management
Group (OMG) has adopted the Model Driven Architecture (MDA), an evolving
conceptual architecture that aligns with this demand. MDA could help solve multidisciplinary
coupling problems in science and engineering that involve one or more
applications, supported by one or more platforms. The objective of this paper is to describe
rigorous techniques to control the evolution from science & engineering legacy software
systems to MDA technologies. We propose a rigorous framework for reverse engineering code
in the context of MDA. Considering that validation, verification and consistency are crucial
activities in the modernization of systems that are critical to safety, security and economic
profits, our approach emphasizes the integration of MDA with formal methods.
Automatic Creation of Corpora
This work presents a method for formatting and tagging the text data of a corpus. On top of suitably represented documents, it builds a layer for comparing them with one another in order to determine their degree of similarity. The tools that perform these similarity (near-duplicate) calculations form the basis of an automated system for creating and expanding an existing text corpus. One of two basic approaches can be chosen, depending on how informative the result needs to be. New data is acquired with a web crawling tool.
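The abstract does not say which similarity measure the tools use. As a minimal sketch of one common near-duplicate technique, assuming word-level shingles compared by Jaccard similarity (the function names `shingles` and `jaccard` and the shingle size `k` are illustrative, not taken from the thesis):

```python
def shingles(text, k=3):
    """Return the set of k-word shingles of a text.

    Tokenization is a plain lowercase whitespace split (an assumption;
    a real corpus pipeline would likely normalize the text more carefully).
    """
    words = text.lower().split()
    if not words:
        return set()
    if len(words) < k:
        # Texts shorter than k words are treated as a single shingle.
        return {tuple(words)}
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a, b, k=3):
    """Jaccard similarity of the shingle sets of two texts, in [0, 1]."""
    sa, sb = shingles(a, k), shingles(b, k)
    if not sa and not sb:
        # Two empty documents are considered identical.
        return 1.0
    return len(sa & sb) / len(sa | sb)


if __name__ == "__main__":
    print(jaccard("a b c d", "a b c e", k=2))
```

A document whose score against an existing corpus entry exceeds some chosen threshold could then be treated as a near-duplicate and skipped when the corpus is expanded.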
A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs
This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low-cost, time-efficient, scalable and extensible. Its extensibility is reflected in the fact that the method is incremental in two respects: processing resources and generating lexicons. The method works on a corpus in three steps. First, tokens are drawn from the corpus and lemmatized. Second, finite state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully, partially and non-vocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.
DFKI finite-state machine toolkit
Finite-state devices such as finite-state automata and finite-state transducers have been known since the emergence of computer science and have recently been used extensively in many areas of language technology. Their use is mainly motivated by their time and space efficiency. In this paper we present the Finite-State Machine Toolkit for building, combining and optimizing finite-state machines, developed at the Language Technology Lab of the German Research Center for Artificial Intelligence (DFKI).
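To illustrate what "combining" finite-state machines can mean, here is a minimal sketch of the textbook product construction for intersecting two deterministic finite automata. It is not the toolkit's API: the `DFA` class, the `intersect` function and the two example automata are hypothetical names introduced only for this example.

```python
class DFA:
    """A deterministic finite automaton with a partial transition function."""

    def __init__(self, alphabet, delta, start, accepting):
        self.alphabet = set(alphabet)
        self.delta = dict(delta)          # maps (state, symbol) -> next state
        self.start = start
        self.accepting = set(accepting)

    def accepts(self, word):
        state = self.start
        for symbol in word:
            state = self.delta.get((state, symbol))
            if state is None:             # missing transition: reject
                return False
        return state in self.accepting


def intersect(a, b):
    """Product construction: accepts exactly the words accepted by both a and b."""
    alphabet = a.alphabet & b.alphabet
    start = (a.start, b.start)
    delta, seen, frontier = {}, {start}, [start]
    while frontier:                       # explore only reachable state pairs
        p, q = frontier.pop()
        for symbol in alphabet:
            p2 = a.delta.get((p, symbol))
            q2 = b.delta.get((q, symbol))
            if p2 is None or q2 is None:
                continue
            nxt = (p2, q2)
            delta[((p, q), symbol)] = nxt
            if nxt not in seen:
                seen.add(nxt)
                frontier.append(nxt)
    accepting = {s for s in seen if s[0] in a.accepting and s[1] in b.accepting}
    return DFA(alphabet, delta, start, accepting)


# Example automata over {a, b}: an even number of 'a', and words ending in 'b'.
even_a = DFA("ab", {(0, "a"): 1, (1, "a"): 0, (0, "b"): 0, (1, "b"): 1}, 0, {0})
ends_b = DFA("ab", {(0, "a"): 0, (0, "b"): 1, (1, "a"): 0, (1, "b"): 1}, 0, {1})
both = intersect(even_a, ends_b)
```

Because only reachable state pairs are explored, the result is usually smaller than the full Cartesian product of the two state sets. A production toolkit would typically follow this step with minimization.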
CORLEONE - Core Linguistic Entity Online Extraction
This report presents CORLEONE (Core Linguistic Entity Online Extraction) - a
pool of loosely coupled general-purpose basic lightweight linguistic processing resources, which can be independently used
to identify core linguistic entities and their features in free texts. Currently, CORLEONE consists of five processing resources:
(a) a basic tokenizer, (b) a tokenizer which performs fine-grained token classification, (c) a component for performing morphological analysis, (d) a memory-efficient database-like dictionary look-up component, and (e) a sentence splitter. Linguistic
resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant for the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of
state-of-the-art finite-state techniques.
Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were
developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC.
This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it
was implemented.
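The abstract does not list the individual string distance metrics. As a hedged illustration of the kind of metric used for name variant matching, the sketch below implements the classic Levenshtein edit distance in Python. CORLEONE itself is written in Java, and the function name `levenshtein` is an assumption made for this example, not part of its library.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string s into string t."""
    if len(s) < len(t):
        s, t = t, s                       # keep the shorter string in the inner loop
    prev = list(range(len(t) + 1))        # distances from the empty prefix of s
    for i, cs in enumerate(s, start=1):
        curr = [i]                        # deleting the first i characters of s
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(levenshtein("kitten", "sitting"))
```

Two name variants could be considered a match when their distance, possibly normalized by the longer string's length, falls below a chosen threshold.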
Morphological Analyser Implemented as FSAs
This thesis deals with the analysis of the Czech language and attempts to extend the currently limited derivational layer of the morphological analyser MA. The author describes the current state of the program and develops procedures for finding word-formation relations, which are used to create derivation rules that make it possible to extend knowledge of Czech word formation automatically. The thesis then illustrates how data are grouped by similarity to form derivation patterns, which make future processing of new words easier. Finally, the results are evaluated and directions for possible further development are outlined.