1,813 research outputs found

    Multilingual Lexicon Extraction under Resource-Poor Language Pairs

    In general, bilingual and multilingual lexicons are important resources in many natural language processing fields, such as information retrieval and machine translation. Such lexicons are usually extracted from bilingual (e.g., parallel or comparable) corpora with external seed dictionaries. However, few such corpora and bilingual seed dictionaries are publicly available for many language pairs, such as Korean–French. It is therefore important that such resources be made publicly available or easily accessible for these language pairs, even when only monolingual resources can be assumed. This thesis presents efficient approaches for extracting bilingual single-/multi-word lexicons for resource-poor language pairs such as Korean–French and Korean–Spanish. The goal of this thesis is to present several efficient methods of extracting translated single-/multi-words from bilingual corpora based on a statistical method. Three approaches for single words and one approach for multi-words are proposed. The first approach is the pivot context-based approach (PCA). The PCA uses a pivot language to connect the source and target languages. It builds context vectors from two parallel corpora sharing one pivot language and calculates their similarity scores to choose the best translation equivalents. The approach can reduce the effort required when using a seed dictionary for translation by using parallel corpora rather than comparable corpora. The second approach is the extended pivot context-based approach (EPCA). This approach gathers similar context vectors for each source word to augment its context, on the assumption that similar vectors can enrich contexts; for example, young and youth can augment the context of baby. In the investigation described here, such similar vectors were collected with similarity measures such as cosine similarity. The third approach for single words uses a competitive neural network algorithm, namely self-organizing maps (SOMs). The SOM-based approach (SA) uses synonym vectors rather than context vectors to train two different SOMs (i.e., source and target SOMs) in different ways: the source SOM is trained in an unsupervised way, while the target SOM is trained in a supervised way. The fourth approach is the constituent-based approach (CTA), which deals with multi-word expressions (MWEs). This approach reinforces the PCA for multi-words (PCAM). It extracts bilingual MWEs taking all constituents of the source MWEs into consideration. The PCAM first identifies MWE candidates by pointwise mutual information and then adds them to the input data as single units in order to use the PCA directly. The experimental results show that the proposed approaches generally perform well for resource-poor language pairs, particularly Korean–French and Korean–Spanish. The PCA and SA demonstrated good performance for such language pairs, whereas the EPCA did not perform as well as expected. The CTA performs well even when word contexts are insufficient, and the experimental results show that it significantly outperforms the PCAM. In the future, homonyms (i.e., homographs such as lead or tear) should be considered, and the domains of the bilingual corpora should be identified. In addition, more parts of speech, such as verbs, adjectives, or adverbs, could be tested; in this thesis, only nouns are discussed for simplicity. Finally, thorough error analysis should also be conducted.
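
    A minimal sketch of the two computational steps named above: ranking translation candidates by cosine similarity over pivot-language context vectors (as in the PCA), and scoring MWE candidates with pointwise mutual information (as in the PCAM). The data are toy values and all names are illustrative, not taken from the thesis.

        # Illustrative sketch of PCA-style candidate ranking and PMI scoring.
        from collections import Counter
        from math import log2, sqrt

        def cosine(u, v):
            """Cosine similarity between two sparse context vectors (Counters)."""
            dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
            norm = sqrt(sum(c * c for c in u.values())) * sqrt(sum(c * c for c in v.values()))
            return dot / norm if norm else 0.0

        def rank_translations(src_vec, tgt_vecs):
            """Rank target candidates against a source word's pivot-language context vector."""
            return sorted(((t, cosine(src_vec, v)) for t, v in tgt_vecs.items()),
                          key=lambda pair: pair[1], reverse=True)

        def pmi(pair_count, x_count, y_count, n_tokens):
            """Pointwise mutual information; high values flag MWE candidates."""
            return log2((pair_count / n_tokens) / ((x_count / n_tokens) * (y_count / n_tokens)))

        # Toy pivot (English) context counts for one Korean source word and two French candidates.
        src = Counter({"baby": 4, "cry": 2, "sleep": 3})
        candidates = {"bébé": Counter({"baby": 5, "sleep": 2}),
                      "lit": Counter({"sleep": 4, "room": 3})}
        print(rank_translations(src, candidates))              # best candidate first
        print(pmi(pair_count=30, x_count=120, y_count=90, n_tokens=100_000))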
    Contents:
    Abstract; List of Abbreviations; List of Tables; List of Figures; Acknowledgement
    Chapter 1 Introduction: 1.1 Multilingual Lexicon Extraction; 1.2 Motivations and Goals; 1.3 Organization
    Chapter 2 Background and Literature Review: 2.1 Extraction of Bilingual Translations of Single-Words (2.1.1 Context-based approach; 2.1.2 Extended approach; 2.1.3 Pivot-based approach); 2.2 Extraction of Bilingual Translations of Multi-Word Expressions (2.2.1 MWE identification; 2.2.2 MWE alignment); 2.3 Self-Organizing Maps; 2.4 Evaluation Measures
    Chapter 3 Pivot Context-Based Approach: 3.1 Concept of Pivot-Based Approach; 3.2 Experiments (3.2.1 Resources; 3.2.2 Results); 3.3 Summary
    Chapter 4 Extended Pivot Context-Based Approach: 4.1 Concept of Extended Pivot Context-Based Approach; 4.2 Experiments (4.2.1 Resources; 4.2.2 Results); 4.3 Summary
    Chapter 5 SOM-Based Approach: 5.1 Concept of SOM-Based Approach; 5.2 Experiments (5.2.1 Resources; 5.2.2 Results); 5.3 Summary
    Chapter 6 Constituent-Based Approach: 6.1 Concept of Constituent-Based Approach; 6.2 Experiments (6.2.1 Resources; 6.2.2 Results); 6.3 Summary
    Chapter 7 Conclusions and Future Work: 7.1 Conclusions; 7.2 Future Work
    References

    Natural language processing

    Beginning with the basic issues of NLP, this chapter aims to chart the major research activities in this area since the last ARIST chapter in 1996 (Haas, 1996), including: (i) natural language text processing systems, such as text summarization, information extraction, and information retrieval, including domain-specific applications; (ii) natural language interfaces; (iii) NLP in the context of the WWW and digital libraries; and (iv) the evaluation of NLP systems.

    Syntactic Verifier as a Filter to Compound Unit Recognizer


    Distributional effects and individual differences in L2 morphology learning

    Second language (L2) learning outcomes may depend on the structure of the input and learners' cognitive abilities. This study tested whether less predictable input might facilitate learning and generalization of L2 morphology while evaluating contributions of statistical learning ability, nonverbal intelligence, phonological short-term memory, and verbal working memory. Over three sessions, 54 adults were exposed to a Russian case-marking paradigm with a balanced or skewed item distribution in the input. Whereas statistical learning ability and nonverbal intelligence predicted learning of trained items, only nonverbal intelligence also predicted generalization of case-marking inflections to new vocabulary. Neither measure of temporary storage capacity predicted learning. Balanced, less predictable input was associated with higher accuracy in generalization, but only in the initial test session. These results suggest that individual differences in pattern extraction play a more sustained role in L2 acquisition than instructional manipulations that vary the predictability of lexical items in the input.

    Creating a Korean Engineering Academic Vocabulary List (KEAVL): Computational Approach

    With a growing number of international students in South Korea, the need for materials for studying Korean for academic purposes is becoming increasingly pressing. Engineering colleges in Korea attract the largest number of international students (Korean National Institute for International Education, 2018). However, despite the availability of technical vocabulary lists for some engineering sub-fields, a list of vocabulary common to the majority of engineering sub-fields has not yet been built. This study therefore aimed to create a list of Korean academic engineering vocabulary for non-native Korean speakers that may help future or first-year engineering students and engineers working in Korea. To compile this list, a corpus of Korean textbooks and research articles from 12 major engineering sub-fields, named the Corpus of Korean Engineering Academic Texts (CKEAT), was built. Then, to analyze the corpus and compile the preliminary list, I designed a Python-based tool called KWordList. KWordList lemmatizes all words in the corpus while excluding general Korean vocabulary included in the Korean Learner's List (Jo, 2003). For the remaining words, KWordList calculates the range, frequency, and dispersion (in this study, Gries's (2008) deviation of proportions, DP) and excludes words that do not pass the study's criteria (range ≥ 6, frequency ≥ 100, DP ≤ 0.5). The final version of the list, called the Korean Engineering Academic Vocabulary List (KEAVL), includes 830 lemmas (318 intermediate and 512 advanced). For each word, the collocations that occur more than 30 times in the corpus are provided. A comparison of the coverage of the Korean Academic Vocabulary List (Shin, 2004) and KEAVL on the Corpus of Korean Engineering Academic Texts showed that KEAVL covers more lemmas in the corpus; moreover, only 313 lemmas from the Korean Academic Vocabulary List (Shin, 2004) passed the study's criteria. KEAVL may therefore be more efficient for engineering students' vocabulary training than the Korean Academic Vocabulary List and may be used in developing engineering Korean teaching materials and curricula. The KWordList program written for the study is openly available (https://github.com/HelgaKr/KWordList) and can be used by other researchers, teachers, and even students.
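
    A minimal sketch of the filtering step just described, assuming the corpus has already been lemmatized and split into one sub-corpus (token list) per engineering sub-field; the thresholds mirror the abstract, but the code is illustrative rather than KWordList's actual implementation.

        # Illustrative range/frequency/dispersion filter; DP follows Gries (2008).
        from collections import Counter

        def dp(counts_per_part, part_sizes):
            """Deviation of proportions: ~0 = spread evenly across parts, ~1 = confined to one part."""
            total, whole = sum(counts_per_part), sum(part_sizes)
            if total == 0:
                return 1.0
            return 0.5 * sum(abs(c / total - s / whole)
                             for c, s in zip(counts_per_part, part_sizes))

        def passes_criteria(lemma, subcorpora):
            """Apply the study's cut-offs: range >= 6, frequency >= 100, DP <= 0.5."""
            parts = list(subcorpora.values())          # one lemma list per sub-field
            counts = [Counter(p)[lemma] for p in parts]
            rng = sum(1 for c in counts if c > 0)      # sub-fields the lemma occurs in
            freq = sum(counts)                         # total corpus frequency
            return rng >= 6 and freq >= 100 and dp(counts, [len(p) for p in parts]) <= 0.5

    Lemmas passing this filter would then be compiled into the final list, each paired with its collocations that occur more than 30 times in the corpus.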

    Methodologies for the Automatic Location of Academic and Educational Texts on the Internet

    Traditionally, online databases of web resources have been compiled by a human editor or through the submissions of authors or interested parties. Considerable resources are needed to maintain a constant level of input and relevance in the face of increasing material quantity and quality, and much of what is in databases is of an ephemeral nature. These pressures dictate that many databases stagnate after an initial period of enthusiastic data entry. The solution to this problem would seem to be the automatic harvesting of resources; however, this process necessitates the automatic classification of resources as 'appropriate' to a given database, a problem only solved by complex analysis of text content. This paper outlines the component methodologies necessary to construct such an automated harvesting system, including a number of novel approaches. In particular, the paper looks at the specific problems of automatically identifying academic research work and Higher Education pedagogic materials. Where appropriate, experimental data are presented from searches in the field of Geography as well as the Earth and Environmental Sciences. In addition, appropriate software is reviewed where it exists, and future directions are outlined.
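
    As a hedged sketch of the classification step this abstract alludes to, the snippet below trains a simple supervised text classifier to label harvested pages as appropriate or not for a subject database; scikit-learn and the toy examples are assumptions for illustration, not the methodology the paper itself specifies.

        # Toy content-based "appropriateness" filter for harvested web resources.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        # Hypothetical labelled examples: 1 = academic/pedagogic, 0 = other.
        docs = [
            "Lecture notes on fluvial geomorphology for undergraduates",
            "Peer-reviewed study of sediment transport in upland catchments",
            "Cheap holiday packages to coastal resorts, book now",
            "Departmental staff contact list and office opening hours",
        ]
        labels = [1, 1, 0, 0]

        classifier = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
        classifier.fit(docs, labels)
        print(classifier.predict(["Seminar reading list on environmental science methods"]))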

    Extraction of Clinical Information from Unstructured Text for Pharmacovigilance (์•ฝ๋ฌผ ๊ฐ์‹œ๋ฅผ ์œ„ํ•œ ๋น„์ •ํ˜• ํ…์ŠคํŠธ ๋‚ด ์ž„์ƒ ์ •๋ณด ์ถ”์ถœ ์—ฐ๊ตฌ)

    Ph.D. dissertation, Seoul National University, Graduate School of Convergence Science and Technology, Department of Applied Bioengineering, February 2023. Advisor: ์ดํ˜•๊ธฐ.
    Pharmacovigilance is a scientific activity to detect, evaluate, and understand the occurrence of adverse drug events or other problems related to drug safety. However, concerns have been raised over the quality of the drug safety information used for pharmacovigilance, and there is a need to secure new data sources for acquiring such information. Meanwhile, the rise of pre-trained language models based on the transformer architecture has accelerated the application of natural language processing (NLP) techniques in diverse domains. In this context, I defined two pharmacovigilance problems as NLP tasks and provided baseline models for them: 1) extracting comprehensive drug safety information from adverse drug event narratives reported through a spontaneous reporting system (SRS), and 2) extracting drug-food interaction information from the abstracts of biomedical articles. I developed annotation guidelines and performed manual annotation, demonstrating that strong NLP models for extracting clinical information from unstructured free text can be trained by fine-tuning transformer-based language models on a high-quality annotated corpus. Finally, I discuss issues to consider when developing annotation guidelines for extracting clinical information related to pharmacovigilance. The annotated corpora and the NLP models in this dissertation can streamline pharmacovigilance activities by enhancing the data quality of reported drug safety information and expanding the available data sources.
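
    A minimal sketch of the modelling step the abstract describes (fine-tuning a transformer-based language model for token-level extraction of drug safety information); the checkpoint and label set below are placeholders, not the dissertation's actual KAERS-BERT configuration.

        # Hedged sketch: one NER fine-tuning step with a placeholder encoder.
        import torch
        from transformers import AutoModelForTokenClassification, AutoTokenizer

        labels = ["O", "B-DRUG", "I-DRUG", "B-ADE", "I-ADE"]   # assumed label scheme
        checkpoint = "bert-base-multilingual-cased"            # placeholder, not KAERS-BERT

        tokenizer = AutoTokenizer.from_pretrained(checkpoint)
        model = AutoModelForTokenClassification.from_pretrained(checkpoint,
                                                                num_labels=len(labels))

        # One toy example; real inputs are the annotated adverse-event narratives.
        enc = tokenizer("Rash developed after starting amoxicillin.", return_tensors="pt")
        gold = torch.zeros_like(enc["input_ids"])              # all-"O" labels for the sketch

        out = model(**enc, labels=gold)            # cross-entropy loss over token labels
        out.loss.backward()                        # one gradient step (optimizer omitted)
        print(out.loss.item(), out.logits.shape)   # logits: (batch, seq_len, num_labels)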
    Contents:
    Chapter 1: 1.1 Contributions of this dissertation; 1.2 Overview of this dissertation; 1.3 Other works
    Chapter 2: 2.1 Pharmacovigilance; 2.2 Biomedical NLP for pharmacovigilance (2.2.1 Pre-trained language models; 2.2.2 Corpora to extract clinical information for pharmacovigilance)
    Chapter 3: 3.1 Motivation; 3.2 Proposed Methods (3.2.1 Data source and text corpus; 3.2.2 Annotation of ADE narratives; 3.2.3 Quality control of annotation; 3.2.4 Pretraining KAERS-BERT; 3.2.6 Named entity recognition; 3.2.7 Entity label classification and sentence extraction; 3.2.8 Relation extraction; 3.2.9 Model evaluation; 3.2.10 Ablation experiment); 3.3 Results (3.3.1 Annotated ICSRs; 3.3.2 Corpus statistics; 3.3.3 Performance of NLP models to extract drug safety information; 3.3.4 Ablation experiment); 3.4 Discussion; 3.5 Conclusion
    Chapter 4: 4.1 Motivation; 4.2 Proposed Methods (4.2.1 Data source; 4.2.2 Annotation; 4.2.3 Quality control of annotation; 4.2.4 Baseline model development); 4.3 Results (4.3.1 Corpus statistics; 4.3.2 Annotation quality; 4.3.3 Performance of baseline models; 4.3.4 Qualitative error analysis); 4.4 Discussion; 4.5 Conclusion
    Chapter 5: 5.1 Issues around defining a word entity; 5.2 Issues around defining a relation between word entities; 5.3 Issues around defining entity labels; 5.4 Issues around selecting and preprocessing annotated documents
    Chapter 6: 6.1 Dissertation summary; 6.2 Limitations and future works (6.2.1 Development of end-to-end information extraction models from free texts to databases based on existing structured information; 6.2.2 Application of the in-context learning framework in clinical information extraction)
    Chapter 7: 7.1 Annotation Guideline for "Extraction of Comprehensive Drug Safety Information from Adverse Event Narratives Reported through Spontaneous Reporting System"; 7.2 Annotation Guideline for "Extraction of Drug-Food Interactions from the Abstracts of Biomedical Articles"