27 research outputs found

    COMBINA-PT: a Large Corpus-extracted and Hand-checked Lexical Database of Portuguese Multiword Expressions

    This paper presents the COMBINA-PT project, a study of corpus-extracted Portuguese multiword (MW) expressions. The objective of this ongoing project is to compile a large lexical database of MW units of the Portuguese language, automatically extracted from a balanced 50 million word corpus, interpreted with lexical association measures and manually validated. MW expressions considered in the database include named entities and lexical associations with different degrees of cohesion, ranging from frozen groups, which undergo little or no variation, to lexical collocations composed of words that tend to occur together and that constitute syntactic dependencies, although with a low degree of fixedness. This new resource has a two-fold objective: (i) to be an important research tool which supports the development of MW expression typologies and their lexicographic treatment; (ii) to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.

    Corpus-based extraction and identification of Portuguese Multiword Expressions

    This presentation reports the methodology followed and the results attained in an ongoing project aiming at building a large lexical database of corpus-extracted multiword (MW) expressions for the Portuguese language. MW expressions were automatically extracted from a balanced 50 million word corpus compiled for this project, statistically interpreted using lexical association measures, and are undergoing a manual validation process. The lexical database covers different types of MW expressions, from named entities to lexical associations with different degrees of cohesion, ranging from totally frozen idioms to favoured co-occurring forms, like collocations. We aim to achieve two main objectives with this resource: to build on the large set of data of different types of MW expressions to revise existing typologies of collocations and to integrate them in a larger theory of MW units; and to use the extensive hand-checked data as training data to evaluate existing statistical lexical association measures.

    A Lexical Database of Portuguese Multiword Expressions

    This presentation focuses on an ongoing project which aims at the creation of a large lexical database of Portuguese multiword (MW) units, automatically extracted through the analysis of a balanced 50 million word corpus, statistically interpreted with lexical association measures and validated by hand. This database covers different types of MW units, like named entities, and lexical associations ranging from sets of favoured co-occurring forms with high corpus frequency and low cohesion to strongly lexicalized expressions with no, or minimal, variation. This new resource has a two-fold objective: to be an important research tool which supports the development of collocation typologies and their integration in a larger theory of MW units; and to be of major help in developing and evaluating language processing tools capable of dealing with MW expressions.

    Cornetto: A Combinatorial Lexical Semantic Database for Dutch

    One of the goals of the STEVIN programme is the realisation of a digital infrastructure that will reinforce the position of the Dutch language in modern information and communication technology. A semantic database makes it possible to go from words to concepts and, consequently, to develop technologies that access and use knowledge rather than textual representations.

    The relationship between the typical errors in the translation of business idioms and their lexicographical treatment

    This essay addresses the complexity involved in translating English business idioms into Spanish, which arises because these linguistic constructions are created with metaphors and based on associations of meaning that have not yet been studied sufficiently. From a translation experiment performed with my students, some conclusions are drawn regarding the difficulties inexperienced translators face and how dictionaries should cope with them. It is suggested that most general and specialized dictionaries do not offer exact translation equivalents for idioms, but present versions that belong to a different language level, show a loss of semantic content, have a different frequency, or are archaic or erroneous. To overcome these limitations, lexicographical resources should not only include idioms as lemmas, but also offer more syntactic-semantic information with them and structure it more systematically.

    Consolidation of Heterogeneous Terminology Resources

    The electronic version does not include the appendices. Author: Andrejs Vasiļjevs. This thesis researches data management and accessibility problems in the terminology domain. The thesis demonstrates, theoretically and practically, that the accessibility and usability problems posed by the fragmentation and heterogeneity of terminology resources can be effectively solved with a federated multilingual database system that provides consolidated data representation and integration in authoring software. A unified consolidation methodology is proposed, covering all major aspects from scenario-based requirements analysis, data modeling, storage and representation to the federation of autonomous databases. The thesis introduces a new concept of terminology entry compounding for identifying and merging semantically matching multilingual terms from different resources. An automated corpus-based context analysis method is proposed for term sense disambiguation in entry compounding.
Keywords: computing, terminology, federated databases, term compounding, terminology consolidation

    A Computational Lexicon and Representational Model for Arabic Multiword Expressions

    The phenomenon of multiword expressions (MWEs) is increasingly recognised as a serious and challenging issue that has attracted the attention of researchers in various language-related disciplines. Research in these many areas has emphasised the primary role of MWEs in the process of analysing and understanding language, particularly in the computational treatment of natural languages. Ignoring MWE knowledge in any NLP system reduces the possibility of achieving high precision outputs. However, despite the enormous wealth of MWE research and language resources available for English and some other languages, research on Arabic MWEs (AMWEs) still faces multiple challenges, particularly in key computational tasks such as extraction, identification, evaluation, language resource building, and lexical representations. This research aims to remedy this deficiency by extending knowledge of AMWEs and making noteworthy contributions to the existing literature in three related research areas on the way towards building a computational lexicon of AMWEs. First, this study develops a general understanding of AMWEs by establishing a detailed conceptual framework that includes a description of an adopted AMWE concept and its distinctive properties at multiple linguistic levels. Second, for AMWE extraction and discovery tasks, the study employs a hybrid approach that combines knowledge-based and data-driven computational methods for discovering multiple types of AMWEs. Third, this thesis presents a representative system for AMWEs which consists of multilayer encoding of extensive linguistic descriptions. This project also paves the way for further in-depth AMWE-aware studies in NLP and linguistics to gain new insights into this complicated phenomenon in standard Arabic.
The implications of this research are related to the vital role of the AMWE lexicon, as a new lexical resource, in the improvement of various ANLP tasks and the potential opportunities this lexicon provides for linguists to analyse and explore AMWE phenomena.