71 research outputs found
Common Scientific Lexicon for Automatic Discourse Analysis of Scientific and Technical Texts
The paper reports preliminary results of ongoing research aimed at developing an
automatic procedure for recognizing the discourse-compositional structure of scientific and technical texts, which is
required in many NLP applications. As discourse markers, the procedure exploits various domain-independent
words and expressions that are specific to scientific and technical texts and organize scientific discourse. The
paper discusses features of scientific discourse and the common scientific lexicon comprising such words and
expressions. It addresses methodological issues in developing a computer dictionary for the common scientific lexicon
and describes the basic principles of its organization. It also outlines the main steps of the discourse-analyzing
procedure, which is based on the dictionary and surface syntactic analysis.
Semantic frames and semantic networks in the Health Science Corpus
The aim of this paper is to apply frame semantics principles to the analysis of a specialized corpus, the Health Science Corpus, implemented in the lexical database SciE-Lex. Taking FrameNet as the basis for this research, I will assign frame semantic features to SciE-Lex data in order to highlight the shared semantic and syntactic background of related words in the biomedical register, motivate their patterns of collocates, and establish frame-based semantic networks of related lexical units.
PLPrepare: A Grammar Checker for Challenging Cases
This study investigates one of the Polish language's most arbitrary cases: the genitive masculine inanimate singular. It collects and ranks several guidelines to help language learners discern its proper usage and also introduces a framework to provide detailed feedback regarding arbitrary cases. The study tests this framework by implementing and evaluating a hybrid grammar checker called PLPrepare. PLPrepare performs similarly to other grammar checkers and is able to detect genitive case usages and provide feedback based on a number of error classifications.
Multiword expressions
Multiword expressions (MWEs) are a challenge for both natural language applications and linguistic theory because they often defy the machinery developed for free combinations, where the default is that the meaning of an utterance can be predicted from its structure. There is a rich body of primarily descriptive work on MWEs for many European languages, but little comparative work. The volume brings together MWE experts to explore the benefits of a multilingual perspective on MWEs. The ten contributions in this volume look at MWEs in Bulgarian, English, French, German, Maori, Modern Greek, Romanian, Serbian, and Spanish. They discuss prominent issues in MWE research such as the classification of MWEs, their formal grammatical modeling, and the description of individual MWE types from the point of view of different theoretical frameworks, such as Dependency Grammar, Generative Grammar, Head-driven Phrase Structure Grammar, Lexical Functional Grammar, and Lexicon Grammar.
Vieraan kielen sanat ja idiomiperiaate [Words of a foreign language and the idiom principle]
This work sets out to examine how second language (L2) users of English acquire, use and process lexical items. For this purpose three types of data were collected from five non-native students of the University of Helsinki. First, each student's drafts of Master's thesis chapters written over a period of time were compiled into a language usage corpus. Second, academic publications a student referred to in her thesis were compiled into a corpus representing her language exposure. Third, several hundred words a student used in her thesis were presented to her as stimuli in word association tasks to obtain psycholinguistic data on the representation of the patterns in the mind. Lexical usage patterns, conceived of in accordance with John Sinclair's conceptualisation of lexis and meaning, were then compared to (1) language exposure and (2) word association responses.
The results of this triangulation show that, contrary to mainstream thinking in SLA, language production on the idiom principle, i.e. by retrieving holistic patterns glued by syntagmatic association rather than constructing them word by word, is available to L2 users to a much larger degree than is often claimed. More than half of the significant multi-word units used by the students also occur in the language they were exposed to. The idiosyncratic multi-word units are often a result of approximation or fixing. Approximation is a process through which a more or less fixed pattern loosens and becomes variable on the semantic or grammatical axis due to frequency effects and the properties of human memory. Fixing, on the other hand, is the reverse process, in which the wording of a pattern becomes overly fixed through repeated usage. Neither process damages the meaning communicated in any way. Word association responses also support the main conclusion that the idiom principle is available: they show that the multi-word units used are also represented holistically in the mind, confirming the continuity between exposure, usage and psycholinguistic representation. Furthermore, they suggest that the model of a unit of meaning developed by Sinclair has psycholinguistic reality, as representations of lexical items in the mind seem to mirror the components of a unit of meaning: collocation, colligation and semantic preference.
This work offers an in-depth discussion of Sinclair's conceptualisation of meaning and a novel methodology for studying units of meaning in L2 use both quantitatively and qualitatively by triangulating usage, exposure and word association data. It is hoped that the dissertation will be of interest to scholars specialising in second language acquisition and use, English as a lingua franca, the phraseological view of language and corpus linguistic methodology.
Why does it sometimes feel as though one can never speak a second language without errors? This study shows that the structures of a speaker's vocabulary and the processes of word use in a foreign language closely resemble those in the native language and in language change. It examines the English vocabulary used by University of Helsinki students from different language backgrounds in their own texts and in word association tests. The study applies to language analysis a model of the multi-word unit of meaning that makes it possible to observe variation and change within the unit. With the model developed in the study, one can observe how a meaning shift takes place in a free word combination so that it crystallises into a new multi-word unit of meaning, and how this unit goes on to become more fixed and to change along the meaning continuum, even as far as an idiom. A unit of meaning can also change in the reverse direction and, instead of becoming fixed, loosen without falling apart completely. This variation can be explained cognitively by the frequency effect: the more common a unit is, the better we command its exact use, and conversely, the rarer it is, the more likely we are not to produce it verbatim. Rarer units are more likely to be produced as approximations, that is, by replacing a few of their components with more abstract ones.
The fixing of expressions is familiar to anyone who has delivered the same text, such as the same lecture or speech, several times: one ends up repeating the same expressions in almost the same words. Approximation, in turn, is at work when, for example, one looks in a library for a book whose title one remembers only imprecisely: was it Looking at the Sun or Gazing at the Sun, when in fact it is Staring at the Sun. It is reasonable to assume that the same process operates when a second language user says so to say rather than so to speak, the hen or the egg rather than the chicken or the egg, or to my head rather than to my mind, because we remember meaning better than linguistic form. For this reason, second language use is for the most part not erroneous but merely a somewhat vaguer, approximate use of the forms of the language.
Promoting multiword expressions in A* TAG parsing
Multiword expressions (MWEs) are pervasive in natural languages and often have both idiomatic and compositional readings, which leads to high syntactic ambiguity. We show that for some MWE types idiomatic readings are usually the correct ones. We propose a heuristic for an A* parser for Tree Adjoining Grammars which benefits from this knowledge by promoting MWE-oriented analyses. This strategy leads to a substantial reduction in the parsing search space in case of true positive MWE occurrences, while avoiding parsing failures in case of false positives.
Criteria for the validation of specialized verb equivalents : application in bilingual terminography
Multilingual terminological resources do not always include valid equivalents of legal terms for two main reasons. Firstly, legal systems can differ from one language community to another and even from one country to another because each has its own history and traditions. As a result, the non-isomorphism between legal and linguistic systems may render the identification of equivalents a particularly challenging task. Secondly, by focusing primarily on the definition of equivalence, a notion widely discussed in translation but not in terminology, the literature does not offer solid and systematic methodologies for assigning terminological equivalents. As a result, there is a lack of criteria to guide both terminologists and translators in the search and validation of equivalent terms.
This problem is even more evident in the case of predicative units, such as verbs. Although some terminologists (L'Homme 1998; Lerat 2002; Lorente 2007) have worked on specialized verbs, terminological equivalence between units that belong to this part of speech would benefit from a thorough study. By proposing a novel methodology to assign the equivalents of specialized verbs, this research aims at defining validation criteria for this kind of predicative unit, so as to contribute to a better understanding of the phenomenon of terminological equivalence as well as to the development of multilingual terminography in general, and of legal terminography in particular.
The study uses a Portuguese-English comparable corpus that consists of a single genre of texts, i.e. Supreme Court judgments, from which 100 Portuguese and 100 English specialized verbs were selected. The description of the verbs is based on the theory of Frame Semantics (Fillmore 1976, 1977, 1982, 1985; Fillmore and Atkins 1992), on the FrameNet methodology (Ruppenhofer et al. 2010), as well as on the methodology for compiling specialized lexical resources, such as DiCoInfo (L'Homme 2008), developed in the Observatoire de linguistique Sens-Texte at the Université de Montréal. The research reviews contributions that have applied the same theoretical and methodological framework to the compilation of lexical resources and proposes adaptations to the specific objectives of the project.
In contrast to the top-down approach adopted by FrameNet lexicographers, the approach described here is bottom-up, i.e. verbs are first analyzed and then grouped into frames for each language separately. Specialized verbs are said to evoke a semantic frame, a sort of conceptual scenario in which a number of mandatory elements (core Frame Elements) play specific roles (e.g. ARGUER, JUDGE, LAW), but specialized verbs are often accompanied by other optional information (non-core Frame Elements), such as the criteria and reasons used by the judge to reach a decision (statutes, codes, previous decisions). The information concerning the semantic frame that each verb evokes was encoded in an XML editor, and about twenty contexts illustrating the specific way each specialized verb evokes a given frame were semantically and syntactically annotated. The labels attributed to each semantic frame (e.g. [Compliance], [Verdict]) were used to group together certain synonyms, antonyms as well as equivalent terms.
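The grouping step described above, where frame labels bring together verbs from both languages, can be pictured with a minimal Python sketch. It is only an illustration under stated assumptions: the verbs, the `ENTRIES` list and the helper names `group_by_frame` and `candidate_pairs` are hypothetical and are not taken from the study's data or tooling. Only the frame labels [Compliance] and [Verdict] come from the abstract.

```python
from collections import defaultdict
from itertools import product

# Hypothetical annotated entries: (verb, language, frame label).
# The verbs are illustrative only and are not taken from the study's data.
ENTRIES = [
    ("cumprir", "pt", "Compliance"),
    ("comply", "en", "Compliance"),
    ("violar", "pt", "Compliance"),
    ("condenar", "pt", "Verdict"),
    ("convict", "en", "Verdict"),
    ("acquit", "en", "Verdict"),
]


def group_by_frame(entries):
    """Map each frame label to the verbs that evoke it, per language."""
    frames = defaultdict(lambda: defaultdict(list))
    for verb, lang, frame in entries:
        frames[frame][lang].append(verb)
    return frames


def candidate_pairs(frames, source="pt", target="en"):
    """Pair every source-language verb with every target-language verb in the same frame."""
    pairs = []
    for frame, by_lang in frames.items():
        for src, tgt in product(by_lang.get(source, []), by_lang.get(target, [])):
            pairs.append((frame, src, tgt))
    return pairs


frames = group_by_frame(ENTRIES)
pairs = candidate_pairs(frames)
for frame, src, tgt in pairs:
    print(f"[{frame}] {src} <-> {tgt}")
```

Because a frame can hold synonyms and antonyms alike, sharing a frame yields only candidate pairs; each pair would still need to be checked against further criteria before being treated as an equivalent.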
The research identified 165 pairs of candidate equivalents among the 200 Portuguese and English terms that were grouped together into 76 frames. 71% of the pairs of equivalents were considered full equivalents because not only do the verbs evoke the same conceptual scenario but their actantial structures, the linguistic realizations of the actants and their syntactic patterns were similar. 29% of the pairs of equivalents did not entirely meet these criteria and were considered partial equivalents. Reasons for partial equivalence are provided along with illustrative examples. Finally, the study describes the semasiological and onomasiological entry points that JuriDiCo, the bilingual lexical resource compiled during the project, offers to future users.