19 research outputs found

    Ut med adamsslekt og inn med arveprinsesse? Leksikografiske metodar i revisjonen av Bokmålsordboka og Nynorskordboka

    Out with adamsslekt and in with arveprinsesse? Lexicographic methods in the Revision Project for Bokmålsordboka and Nynorskordboka. This talk presents the source material, tools and methods that are used and developed in the Norwegian Revision Project for Bokmålsordboka and Nynorskordboka. In this project, the lexicographers revise two existing dictionaries in parallel, one in Bokmål and one in Nynorsk. How can the lexicographers work efficiently, wisely and accountably to identify a modernised and relevant vocabulary in both written standards? On the one hand, we use the scientific Language Collections (Språksamlingene) at the University of Bergen. On the other hand, we use search tools to analyse text at the word and sentence level. We will show how text corpora are a good empirical source of knowledge about lemma selection, phrases and multiword expressions, word senses, the syntactic behaviour of words, and usage examples. We will also show examples where corpora can be misleading. The lexicographers use language analysis tools that are available through the language infrastructure CLARINO (https://clarin.w.uib.no/). Corpuscle is a tool for searching for words and phrases. Through the revision project, Corpuscle has been extended to search several corpora simultaneously (about 2.4 billion words in total). For example, we can run complex searches in Nynorskkorpuset, Leksikografisk Bokmålskorpus and the National Library's free texts at the same time. The search hits can be sorted by frequency over time and by distribution across corpora, which considerably simplifies the lexicographer's empirical work. The INESS infrastructure lets us search for syntactic constructions that are not easily searchable in a traditional text corpus. NorGramBank contains syntactically analysed material for Norwegian (about 80 million words in total).
We will show how we can extract "word sketch"-like information, such as the most common verb frames of a verb (gifte seg, gifte bort, gifte seg til) or typical examples of how that-clauses (at-setninger) are governed by prepositions (etter at, uten at). Relevant links: Revision Project for Bokmålsordboka and Nynorskordboka: https://ordbok.uib.no; Corpuscle (word-level search): https://clarino.uib.no/corpuscle; INESS (sentence-level search): https://clarino.uib.no/iness

    Translation-based Word Sense Disambiguation

    This thesis investigates the use of the translation-based Mirrors method (Dyvik, 2005, inter alia) for Word Sense Disambiguation (WSD) for Norwegian. Word Sense Disambiguation is the process of determining the relevant sense of an ambiguous word in context automatically. Automated WSD is relevant for Natural Language Processing systems such as machine translation (MT), information retrieval, information extraction and content analysis. The most successful WSD approaches to date are so-called supervised machine learning (ML) techniques, in which the system ‘learns’ the contextual characteristics of each sense from a training corpus that contains concrete examples of contexts in which a word sense typically occurs. This approach suffers from a knowledge acquisition problem since word senses are not overtly available in corpus text. First, we therefore need a sense inventory which is computationally tractable. Subjectively defined sense distinctions have been the norm in WSD research (especially the Princeton WordNet, Fellbaum, 1998). But WSD studies increasingly show that the WordNet senses are too fine-grained for efficient WSD, which has made WordNet less attractive for machine-learned WSD. Ide and Wilks (2006) recommend instead to approximate word senses by way of cross-lingual sense definitions. Second, we need a method for sense-tagging context examples with the relevant sense given the context. Preparing such sense-tagged training corpora manually is costly and time-consuming, in particular because statistical methods require large amounts of training examples, and automated methods are therefore desirable. This thesis introduces an experimental lexical knowledge source which derives word senses and relations between word senses on the basis of translational correspondences in a parallel corpus, resulting in a structured semantic network (Dyvik, 2009). 
The Mirrors method is applicable to any language pair for which a parallel corpus and word alignment are available. The appeal of the Mirrors method and its translational basis for lexical semantics is that it offers an objective and consistent (and hence testable) criterion, as opposed to the traditional subjective judgements in lexicon classification (cf. the Princeton WordNet). But due to the lack of intersubjective “gold standards” for lexical semantics, it is not an easy task to evaluate the Mirrors method. The main research question of this thesis may thus be formulated as follows: are the translation-based senses and semantic relations in the Mirrors method linguistically motivated from a monolingual point of view? To this end, this thesis proposes to use the monolingual task of WSD as a practical framework to evaluate the usefulness of the Mirrors method as a lexical knowledge source. This is motivated by the idea that a well-defined end-user application may provide a stable framework within which the benefits and drawbacks of a resource or a system can be demonstrated (e.g. Ng & Lee, 1996; Stevenson & Wilks, 2001; Yarowsky & Florian, 2002; Specia et al., 2009). The innovative aspect of applying the Mirrors method to WSD is two-fold: first, the Mirrors method is used to obtain sense-tagged data automatically (using cross-lingual data), providing a SemCor-like corpus which allows us to exploit semantically analysed context features in a subsequent WSD classifier. Second, we will test whether training on semantically analysed context features, based on information from the Mirrors method, means that the system resolves other instances than a ‘traditional’ classifier trained on words. In the absence of existing data sets for WSD for Norwegian, an automatically sense-tagged parallel corpus and a manually verified lexical sample of fifteen target words were developed for Norwegian as part of this thesis. 
The proposed automatic sense-tagging method is based on the Mirrors sense inventory and on the translational correspondents of each word occurrence. The sense-tagger provides a partially semantically analysed context; partially, because the translation-based sense-tagger can only sense-tag tokens that were successfully word-aligned. The sense-tagged English-Norwegian Parallel Corpus (the ENPC) is comparable in size to the existing SemCor. The sense-tagged material formed the basis for a series of controlled experiments, in which the knowledge source is varied but where we maintain the same experimental framework in terms of the classification algorithm, data sets, lexical sample and sense inventory. First, a WSD classifier is trained on the actually co-occurring context WORDS. This knowledge source functions as a point of reference to indicate how well a traditional word-based classifier could be expected to perform, given our specific data sample and using the Mirrors sense inventory. Second, two Mirrors-derived knowledge sources were tentatively implemented, both of which attempt to generalise from the actually occurring context words as a means of alleviating the sparse data problem in WSD. For instance, if the noun phone was found to co-occur with the ambiguous noun bill in the ‘invoice’ sense, and if the classifier can generalise from this to include words that are semantically close to phone, such as telephone, this means that the presence of only one of them during learning could make both of them ‘known’ to the classifier at classification time. In other words, it might be desirable to study not only word co-occurrences, as unanalysed and isolated units, but also how words enter into relations with other words (classes of words) in the structured network that constitutes the vocabulary of a language. 
In ML terms, it might be interesting to build a WSD model which learns, not how a word sense correlates with isolated words, but rather how a word sense correlates with certain classes of semantically related words. Such a tool for generalisation is clearly desirable in the face of sparse data and in view of the fact that most content words have a relatively low frequency even in larger text corpora. The first of the two Mirrors-based knowledge sources rests on so-called SEMANTIC-FEATURES that are shared between word senses in the Mirrors network. Since SEMANTIC-FEATURES may include a very high number of related words, a second knowledge source was also developed, RELATED-WORDS, which attempts to select a stricter class of near-related word senses in the wordnet-like Mirrors network. The results indicated that the gain in abstracting from context words to classes of semantically related word senses was only marginal, in that the two Mirrors-based knowledge sources only knew marginally more of the context words at classification time compared to a traditional word-based classifier. Regarding classification accuracy, the Mirrors-based SEMANTIC-FEATURES seemed to suffer from including too broad semantic information and performed significantly worse than the other two knowledge sources. The Mirrors-based RELATED-WORDS, on the other hand, was as good as, and sometimes better than, the traditional word model, but the differences were not found to be statistically significant. Although unfortunate for the purpose of enriching a traditional WSD model with Mirrors-derived information, the lack of a difference between the traditional word model and RELATED-WORDS nevertheless provides promising indications with regard to the plausibility of the Mirrors method.
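
The generalisation idea in the abstract above (seeing phone during training makes telephone 'known' at classification time) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `RELATED` table is an invented stand-in for the Mirrors network, and the helper names `expand`, `train` and `known_features` do not come from the thesis.

```python
from collections import Counter

# Hypothetical toy relatedness table. In the thesis this information would
# come from the Mirrors network, which is not reproduced here.
RELATED = {
    "phone": {"telephone"},
    "telephone": {"phone"},
}


def expand(context_words, related):
    """Return the context words plus every word listed as related to them."""
    expanded = set(context_words)
    for word in context_words:
        expanded |= related.get(word, set())
    return expanded


def train(examples, related=None):
    """Count how often each feature co-occurs with each sense.

    `examples` is a list of (sense, context_words) pairs. When `related`
    is given, context words are expanded before counting.
    """
    counts = {}
    for sense, context in examples:
        features = expand(context, related) if related else set(context)
        counts.setdefault(sense, Counter()).update(features)
    return counts


def known_features(counts, context):
    """Context words at classification time that were seen during training."""
    seen = set().union(*counts.values())
    return seen & set(context)


examples = [("invoice", ["phone", "pay"])]
word_model = train(examples)              # plain context words only
related_model = train(examples, RELATED)  # context words plus related words
```

Comparing `known_features(word_model, ...)` with `known_features(related_model, ...)` on the same test context gives a rough analogue of the "how many context words are known at classification time" comparison reported in the abstract.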

    Creation of Shared Language Resource Repository in the Nordic and Baltic Countries

    Proceeding volume: 8. The META-NORD project has contributed to an open infrastructure for language resources (data and tools) under the META-NET umbrella. This paper presents the key objectives of META-NORD and reports on the results achieved in the first year of the project. META-NORD has mapped and described the national language technology landscape in the Nordic and Baltic countries in terms of language use, language technology and resources, and main actors in academia, industry, government and society; identified and collected the first batch of language resources in the Nordic and Baltic countries; and documented, processed, linked, and upgraded the identified language resources to agreed standards and guidelines. The three horizontal multilingual actions in META-NORD are outlined in this paper: linking and validating Nordic and Baltic wordnets, harmonising multilingual Nordic and Baltic treebanks, and consolidating multilingual terminology resources across European countries. This paper also touches upon intellectual property rights for the sharing of language resources. Peer reviewed.

    Translation-based Word Sense Disambiguation: Appendices

    No full text
    Appendix 1: Norwegian ambiguous lemmas in the ENPC where at least two senses have a frequency greater than or equal to 10.
    Appendix 2: English ambiguous lemmas in the ENPC where at least two senses have a frequency greater than or equal to 10.
    Appendix 3: Results from model selection (cross-validation) in Chapter 9 with knowledge source=WORDS. The target words are ordered alphabetically. Evaluated with 5-fold cross validation and Overall Accuracy (measured as total recall). The best accuracy in each group is marked in bold-face (in case of ties, the model with the smallest context window is selected).
    Appendix 4: Results from model selection (cross-validation) in Chapter 9 with knowledge source=SEMANTIC-FEATURES. The target words are ordered alphabetically. Evaluated with 5-fold cross validation and Overall Accuracy (measured as total recall). The best accuracy in each group is marked in bold-face (in case of ties, the model with the smallest context window is selected).
    Appendix 5: Results from model selection (cross-validation) in Chapter 9 with knowledge source=RELATED-WORDS. The target words are ordered alphabetically. Evaluated with 5-fold cross validation and Overall Accuracy (measured as total recall). The best accuracy in each group is marked in bold-face (in case of ties, the model with the smallest context window is selected).
    Appendix 6: Results from model selection (cross-validation) in Chapter 10 with knowledge source=WORDS. The target words are ordered alphabetically. Evaluated with 5-fold cross validation and Overall Accuracy (measured as total recall). The best accuracy in each group is marked in bold-face (in case of ties, the model with the smallest context window is selected).
    Appendix 7: Results from model selection (cross-validation) in Chapter 10 with knowledge source=SEMANTIC-FEATURES. The target words are ordered alphabetically. Evaluated with 5-fold cross validation and Overall Accuracy (measured as total recall). The best accuracy in each group is marked in bold-face (in case of ties, the model with the smallest context window is selected).
    Appendix 8: Results from model selection (cross-validation) in Chapter 10 with knowledge source=RELATED-WORDS. The target words are ordered alphabetically. Evaluated with 5-fold cross validation and Overall Accuracy (measured as total recall). The best accuracy in each group is marked in bold-face (in case of ties, the model with the smallest context window is selected).
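
The selection procedure described in these appendices (5-fold cross-validation, overall accuracy, ties broken in favour of the smallest context window) might look roughly like the sketch below. It is an illustration under assumptions, not the thesis code: the fold layout, the pooling of correct predictions across folds, and the function names are all hypothetical.

```python
def k_fold_indices(n_items, k=5):
    """Split range(n_items) into k contiguous folds of near-equal size."""
    folds, start = [], 0
    for i in range(k):
        size = n_items // k + (1 if i < n_items % k else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds


def cross_validate(items, labels, fit, predict, k=5):
    """Overall accuracy: correct predictions pooled over all held-out folds.

    `fit(train_items, train_labels)` returns a model and
    `predict(model, item)` returns a label; both are supplied by the caller.
    """
    correct = 0
    for fold in k_fold_indices(len(items), k):
        held_out = set(fold)
        train_x = [x for i, x in enumerate(items) if i not in held_out]
        train_y = [y for i, y in enumerate(labels) if i not in held_out]
        model = fit(train_x, train_y)
        correct += sum(predict(model, items[i]) == labels[i] for i in fold)
    return correct / len(items)


def select_window(accuracy_by_window):
    """Pick the best accuracy; on ties, prefer the smallest context window."""
    return min(accuracy_by_window, key=lambda w: (-accuracy_by_window[w], w))
```

Here `accuracy_by_window` maps a context window size to the cross-validated accuracy obtained with it, so `select_window` encodes only the tie-breaking rule stated in the appendices.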

    Fra speilmetoden til automatisk ekstrahering av et betydningstagget korpus for WSD-formål

    This thesis addresses the lack of sense-annotated corpora as a background resource for Word Sense Disambiguation (WSD). The most promising approach to WSD is generally considered to be corpus-based, supervised machine learning. In this approach, a sense-tagged training corpus provides example instances which illustrate the relation between a given word sense and its typical context. However, supervised learning has proven to be limited as a larger-scale alternative, because sense-tagged corpora need to be manually tagged, which is costly and time-consuming. Consequently, it is desirable to investigate methods to overcome this knowledge acquisition bottleneck. This thesis suggests a method which automatically extracts a finite, sense-tagged corpus. Although the method is only tested on one ambiguous lemma within this thesis, it is in principle expected to be applicable for extracting sense-tagged corpora for all ambiguous words within the vocabulary of a given language. The presented method is based on translational correspondences in a parallel corpus, sorted by meaning by a "semantic mirroring" method (Dyvik, 1998/2002). The chief goal of the thesis is to explore the presented method's potential as an alternative to manual sense-tagging of corpora. The results are first evaluated manually. Then follows a practical evaluation, in which the automatically sense-tagged corpus is applied as training material for a supervised learning algorithm. The results indicate that the presented approach is methodologically promising and has good potential for further exploration.
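
As a rough illustration of sense-tagging by translational correspondence, the sketch below assigns a sense to each word-aligned token by looking up its translation in a sense inventory. The inventory contents, the sense labels and the function name `sense_tag` are invented for this example; the actual inventory in the thesis is derived by semantic mirroring and is not reproduced here.

```python
# Hypothetical sense inventory: each sense of a source lemma is characterised
# by the set of target-language translations grouped under it. The entries
# are toy data, not taken from the thesis.
SENSE_INVENTORY = {
    "bill": {
        "bill_1": {"regning"},  # the 'invoice' sense
        "bill_2": {"nebb"},     # the 'beak' sense
    }
}


def sense_tag(aligned_tokens, inventory):
    """Tag each (lemma, translation) pair with a sense, or None.

    A token stays untagged when it has no aligned translation or when the
    translation is not listed under any sense of the lemma.
    """
    tagged = []
    for lemma, translation in aligned_tokens:
        sense = None
        if translation is not None:
            for candidate, translations in inventory.get(lemma, {}).items():
                if translation in translations:
                    sense = candidate
                    break
        tagged.append((lemma, sense))
    return tagged


corpus = [("bill", "regning"), ("bill", None), ("bill", "nebb"), ("pay", "betale")]
tagged_corpus = sense_tag(corpus, SENSE_INVENTORY)
```

Tokens without a word alignment remain untagged, so a corpus tagged this way is only partially annotated; the tagged pairs could then serve as training examples for a supervised classifier.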
