17 research outputs found

    Modernizing science&engineering software systems

    As demand for modernizing legacy systems rises, so does the need for frameworks for information integration and tool interoperability. The Object Management Group (OMG) has adopted the Model Driven Architecture (MDA), an evolving conceptual architecture that aligns with this demand. MDA could help solve multidisciplinary coupling problems in science and engineering that involve one or more applications supported by one or more platforms. The objective of this paper is to describe rigorous techniques to control the evolution from science & engineering legacy software systems to MDA technologies. We propose a rigorous framework for reverse engineering code in the context of MDA. Because validation, verification and consistency are crucial activities when modernizing systems that are critical to safety, security and economic profits, our approach emphasizes the integration of MDA with formal methods.

    Automatic Creation of Corpora

    This work presents a way of formatting and tagging the text data of a corpus. On top of suitably represented documents, it builds a layer for comparing them with one another in order to determine their degree of similarity. The tools that perform these similarity (near-duplicate) calculations are the basis of an automated system for creating and extending an existing text-data corpus. One of two basic approaches can be chosen, depending on how informative the result needs to be. New text data is acquired with a web crawling tool.
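
The abstract above does not say which similarity measure the thesis uses. As a minimal sketch of one common near-duplicate technique, the following assumes word-level shingling with Jaccard similarity; the function names and the shingle size `k` are illustrative choices, not taken from the work.

```python
def shingles(text, k=3):
    """Return the set of k-word shingles (as tuples) of a text.

    Assumption: whitespace tokenization and lowercasing are enough
    for this illustration; a real corpus pipeline would do more.
    """
    tokens = text.lower().split()
    if not tokens:
        return set()
    if len(tokens) < k:
        # Text shorter than one shingle: treat the whole text as one shingle.
        return {tuple(tokens)}
    return {tuple(tokens[i:i + k]) for i in range(len(tokens) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets: |a & b| / |a | b|."""
    if not a and not b:
        return 1.0  # two empty documents are considered identical
    return len(a & b) / len(a | b)


def similarity(doc_a, doc_b, k=3):
    """Similarity in [0, 1] between two documents based on shared shingles."""
    return jaccard(shingles(doc_a, k), shingles(doc_b, k))
```

A document pair whose score exceeds a chosen threshold could then be flagged as near-duplicates before new crawled text is added to the corpus; the threshold is a tuning decision that this sketch leaves open.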

    A Semi-automatic and Low Cost Approach to Build Scalable Lemma-based Lexical Resources for Arabic Verbs

    This work presents a method that enables the Arabic NLP community to build scalable lexical resources. The proposed method is low cost, time efficient, scalable and extensible. Its extensibility is reflected in its ability to work incrementally in both processing resources and generating lexicons. The method starts from a corpus. First, tokens are drawn from the corpus and lemmatized. Second, finite-state transducers (FSTs) are generated semi-automatically. Finally, the FSTs are used to produce all possible inflected verb forms with their full morphological features. One strength of the algorithm is its ability to generate transducers with 184 transitions, which would be very cumbersome to design manually. A second strength is a new inflection scheme for Arabic verbs, which increases the efficiency of the FST generation algorithm. The experiments use a representative corpus of Modern Standard Arabic. The number of semi-automatically generated transducers is 171. The coverage of the resulting open lexical resources is high: they cover more than 70% of Arabic verbs. The built resources contain 16,855 verb lemmas and 11,080,355 fully vocalized, partially vocalized and unvocalized inflected verb forms. All these resources are being made public and are currently used as an open package in the Unitex framework, available under the LGPL license.

    DFKI finite-state machine toolkit

    Finite-state devices such as finite-state automata and finite-state transducers have been known since the emergence of computer science and have recently been used extensively in many areas of language technology. Their use is mainly motivated by their time and space efficiency. In this paper we present the Finite-State Machine Toolkit for building, combining and optimizing finite-state machines, developed at the Language Technology Lab of the German Research Center for Artificial Intelligence.

    CORLEONE - Core Linguistic Entity Online Extraction

    This report presents CORLEONE (Core Linguistic Entity Online Extraction), a pool of loosely coupled, general-purpose, lightweight linguistic processing resources that can be used independently to identify core linguistic entities and their features in free text. Currently, CORLEONE consists of five processing resources: (a) a basic tokenizer, (b) a tokenizer that performs fine-grained token classification, (c) a component for morphological analysis, (d) a memory-efficient, database-like dictionary look-up component, and (e) a sentence splitter. Linguistic resources for several languages are provided. Additionally, CORLEONE includes a comprehensive library of string distance metrics relevant to the task of name variant matching. CORLEONE has been developed in the Java programming language and makes heavy use of state-of-the-art finite-state techniques. Notably, CORLEONE components are used as basic linguistic processing resources in ExPRESS, a pattern matching engine based on regular expressions over feature structures, and in the real-time news event extraction system, both of which were developed by the Web Mining and Intelligence Group of the Support to External Security Unit of IPSC. This report constitutes an end-user guide for CORLEONE and provides scientifically interesting details of how it was implemented.
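
The abstract above mentions string distance metrics for name variant matching without listing them. As a minimal sketch of one classic metric of that kind, here is Levenshtein edit distance with a normalized similarity score. It is written in Python for brevity and does not reproduce CORLEONE's Java API; the function names are illustrative assumptions.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn s into t (two-row dynamic programming)."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(
                previous[j] + 1,         # delete cs
                current[j - 1] + 1,      # insert ct
                previous[j - 1] + cost,  # substitute, or match at no cost
            ))
        previous = current
    return previous[-1]


def name_similarity(a, b):
    """Similarity in [0, 1]: 1.0 for identical strings, lower for distant ones."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For example, `levenshtein("Smith", "Smyth")` is 1, so the two spellings score 0.8 under `name_similarity` and could be treated as candidate variants of the same name.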

    Morphological Analyser Implemented as FSAs

    This thesis deals with the analysis of the Czech language and attempts to extend the so far limited derivational extension of the morphological analyser MA. The author describes the current state of this program and defines procedures for finding word-formation relations, which serve to create derivation rules that make it possible to automatically extend knowledge of Czech word formation. The thesis then illustrates how the data are grouped by similarity to create derivation patterns, which make the future processing of new words easier. Finally, the outcomes are evaluated and directions for possible future development are outlined.