
    Diagnosing and Improving the False Positive Bias of Hate Speech Classification Models (ํ˜์˜ค ๋ฐœ์–ธ ๋ถ„๋ฅ˜ ๋ชจ๋ธ์˜ ๊ฑฐ์ง“ ์–‘์„ฑ ํŽธํ–ฅ ์ง„๋‹จ ๋ฐ ๊ฐœ์„  ์—ฐ๊ตฌ)

    Thesis (Master's) -- Seoul National University Graduate School: Department of Linguistics, College of Humanities, February 2022. Advisor: ์‹ ํšจํ•„ (Hyopil Shin).

    As the damage caused by hate speech in anonymous online spaces has grown significantly, research on the detection of hate speech is being actively conducted. Recently, deep learning-based hate speech classifiers have shown strong performance, but they tend to fail to generalize to out-of-domain data. I focus on the problem of False Positive detections and build adversarial test sets from three different domains to diagnose this issue. I show that a BERT-based classification model trained on an existing Korean hate speech corpus produces False Positives because it is over-sensitive to specific words that are highly correlated with hate speech in the training data. Next, I present two approaches to address the problem: a data-centric approach that adds data to correct the imbalance of the training dataset, and a model-centric approach that regularizes the model using post-hoc explanations. Both methods reduce False Positives without compromising overall model quality. In addition, I show that strategically adding negative samples from a domain similar to the test set can be a cost-efficient way of greatly reducing False Positives. Using Sampling and Occlusion (SOC; Jin et al., 2020) explanations, I qualitatively demonstrate that both approaches help the model make better use of contextual information.

    Table of contents: Abstract; List of Figures; List of Tables; Chapter 1. Introduction (1.1 Hate Speech Detection, 1.2 False Positives in Hate Speech Detection, 1.3 Purpose of Research); Chapter 2. Background (2.1 Domain Adaptation, 2.2 Measuring and Mitigating False Positive Bias of Hate Speech Classifiers: 2.2.1 Measuring Model Bias on Social Identifiers, 2.2.2 Mitigating Model Bias on Social Identifiers); Chapter 3. Dataset; Chapter 4. Quantifying Bias (4.1 Baseline Model, 4.2 Selecting Neutral Keywords, 4.3 Test Datasets, 4.4 Quantifying Bias of the Baseline Model); Chapter 5. Experiments (5.1 Bias Mitigation: 5.1.1 Bias Mitigation through Train Data Augmentation, 5.1.2 Model Regularization Using SOC Explanation; 5.2 Results: 5.2.1 Evaluation Metric, 5.2.2 Experimental Results, 5.2.3 Visualizing Effects of Mitigation); Chapter 6. Conclusion; References; Korean Abstract.
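    The model-centric approach described above, regularizing the classifier so that identity-related neutral keywords alone do not drive a hate-speech prediction, can be sketched in a few lines of PyTorch. The toy classifier, vocabulary, keyword ids and the single-word occlusion score below are assumptions made for illustration; they stand in for the thesis' BERT model and its SOC-based explanations and only show how an explanation-style penalty can be added to the training loss.

    import torch
    import torch.nn as nn

    # Toy vocabulary; id 1 stands for a neutral identity term that should not,
    # on its own, push the classifier towards the "hate" class.
    vocab = {"<pad>": 0, "identity_term": 1, "news": 2, "great": 3, "awful": 4}
    neutral_keyword_ids = [1]

    class ToyClassifier(nn.Module):
        def __init__(self, vocab_size, dim=16):
            super().__init__()
            self.emb = nn.Embedding(vocab_size, dim, padding_idx=0)
            self.fc = nn.Linear(dim, 2)                  # classes: 0 = clean, 1 = hate

        def forward(self, ids):
            mask = (ids != 0).float().unsqueeze(-1)      # ignore padding when pooling
            pooled = (self.emb(ids) * mask).sum(1) / mask.sum(1).clamp(min=1)
            return self.fc(pooled)

    def occlusion_importance(model, ids, keyword_id):
        # Importance of a keyword = change in the "hate" logit when it is masked out.
        logits = model(ids)[:, 1]
        occluded = ids.clone()
        occluded[occluded == keyword_id] = 0             # replace the keyword with <pad>
        return (logits - model(occluded)[:, 1]).abs().mean()

    model = ToyClassifier(vocab_size=len(vocab))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
    criterion = nn.CrossEntropyLoss()
    alpha = 0.1                                          # strength of the penalty

    batch = torch.tensor([[1, 2, 3, 0], [1, 2, 4, 0]])   # toy token-id sentences
    labels = torch.tensor([0, 0])                        # both sentences are non-hate

    # Regularized objective: task loss + penalty on the keyword's importance.
    loss = criterion(model(batch), labels)
    loss = loss + alpha * sum(occlusion_importance(model, batch, k)
                              for k in neutral_keyword_ids)
    loss.backward()
    optimizer.step()

    In the thesis the penalty comes from Sampling and Occlusion explanations computed over a BERT model rather than from this crude single-token occlusion, but the overall shape of a regularized loss (task loss plus a weighted importance penalty on selected keywords) is what the sketch is meant to convey.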

    A DEEP LEARNING APPROACH FOR SENTIMENT ANALYSIS

    Sentiment Analysis refers to the process of computationally identifying and categorizing opinions expressed in a piece of text, in order to determine whether the writer's attitude towards a particular topic or product is positive, negative, or neutral. The views expressed, and related concepts such as feelings, judgments, and emotions, have recently become a subject of study and research in both academia and industry. Unfortunately, language comprehension of user comments, especially in social networks, is inherently complex for computers. The ways in which humans express themselves in natural language are nearly unlimited, and informal text is riddled with typos, misspellings, badly constructed syntax, and platform-specific symbols (e.g. hashtags on Twitter), which greatly complicate the task. Recently, deep learning approaches have emerged as powerful computational models that discover intricate semantic representations of texts automatically from data, without hand-crafted feature engineering. These approaches have improved the state of the art in many Sentiment Analysis tasks, including sentiment classification of sentences or documents, sentiment lexicon learning, and more complex problems such as cyberbullying detection. The contributions of this work are twofold. First, for the general Sentiment Analysis problem, we propose a semi-supervised neural network model, based on Deep Belief Networks, able to deal with data uncertainty in text sentences in Italian. We test this model against several movie-review datasets from the literature, adopting a vectorized representation of text (Word2Vec) and exploiting Natural Language Processing (NLP) pre-processing methods. Second, assuming that the cyberbullying phenomenon can be treated as a particular Sentiment Analysis problem, we propose an unsupervised approach to automatic cyberbullying detection in social networks, based both on a Growing Hierarchical Self-Organizing Map (GHSOM) and on a new task-specific feature model, showing that our solution achieves promising results with respect to classical supervised approaches.
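    As a rough illustration of the Word2Vec representation step mentioned in the abstract, the sketch below trains word vectors on a few invented Italian review snippets, averages them into document vectors and fits a simple classifier. The tiny corpus is made up, and scikit-learn's LogisticRegression is only a stand-in for the semi-supervised Deep Belief Network actually proposed in the thesis.

    import numpy as np
    from gensim.models import Word2Vec
    from sklearn.linear_model import LogisticRegression

    # Invented Italian movie-review snippets with sentiment labels (1 = positive).
    docs = [("un film bellissimo e commovente".split(), 1),
            ("trama noiosa e recitazione pessima".split(), 0),
            ("attori straordinari e regia ottima".split(), 1),
            ("pessimo film una perdita di tempo".split(), 0)]

    sentences = [tokens for tokens, _ in docs]
    w2v = Word2Vec(sentences, vector_size=50, window=3, min_count=1, epochs=50)

    def doc_vector(tokens):
        # Average the Word2Vec vectors of the known tokens (zeros if none are known).
        vecs = [w2v.wv[t] for t in tokens if t in w2v.wv]
        return np.mean(vecs, axis=0) if vecs else np.zeros(w2v.vector_size)

    X = np.stack([doc_vector(tokens) for tokens, _ in docs])
    y = np.array([label for _, label in docs])

    clf = LogisticRegression().fit(X, y)        # stand-in for the thesis' DBN
    print(clf.predict([doc_vector("film bellissimo".split())]))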

    Adaptive Analysis and Processing of Structured Multilingual Documents

    Digital document processing is becoming popular for applications in office and library automation, bank and postal services, publishing houses, and communication management. In recent years, the demand for tools capable of searching written and spoken sources of multilingual information has increased tremendously, and the bilingual dictionary is one of the important resources for providing the required information. Processing and analysis of bilingual dictionaries raise the challenge of dealing with many different scripts, some of which are unknown to the designer. A framework is presented to adaptively analyze and process structured multilingual documents, where adaptability is applied to every step. The proposed framework involves: (1) general word-level script identification using Gabor filters; (2) font classification using the grating cell operator; (3) general word-level style identification using a Gaussian mixture model; (4) an adaptable Hindi OCR based on generalized Hausdorff image comparison; (5) retargetable OCR with automatic training sample creation and its application to different scripts; and (6) bootstrapping entry segmentation, which segments each page into functional entries for parsing. Experimental results on different scripts, such as Chinese, Korean, Arabic, Devanagari, and Khmer, demonstrate that the proposed framework can save significant human effort by making each phase adaptive.
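    Step (1), word-level script identification with Gabor filters, can be illustrated roughly as follows. The filter-bank frequencies and orientations, the random stand-in word images and the nearest-centroid classifier are all assumptions made for this sketch, not the framework's actual design.

    import numpy as np
    from skimage.filters import gabor

    def gabor_features(word_image,
                       frequencies=(0.1, 0.2, 0.3),
                       thetas=(0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)):
        # Mean Gabor response magnitude for each (frequency, orientation) pair.
        feats = []
        for f in frequencies:
            for theta in thetas:
                real, imag = gabor(word_image, frequency=f, theta=theta)
                feats.append(np.sqrt(real ** 2 + imag ** 2).mean())
        return np.array(feats)

    # Toy "word images": random textures standing in for scanned word snippets.
    rng = np.random.default_rng(0)
    train = {"latin": [rng.random((32, 96)) for _ in range(3)],
             "arabic": [rng.random((32, 96)) for _ in range(3)]}

    # One mean feature vector (centroid) per script.
    centroids = {script: np.mean([gabor_features(img) for img in imgs], axis=0)
                 for script, imgs in train.items()}

    def identify_script(word_image):
        # Assign the script whose centroid is closest in feature space.
        feats = gabor_features(word_image)
        return min(centroids, key=lambda s: np.linalg.norm(feats - centroids[s]))

    print(identify_script(rng.random((32, 96))))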

    A Sound Approach to Language Matters: In Honor of Ocke-Schwen Bohn

    The contributions in this Festschrift were written by Ocke's current and former PhD students, colleagues and research collaborators. The Festschrift is divided into six sections, moving from the smallest building blocks of language, through gradually expanding objects of linguistic inquiry, to the highest levels of description, all of which have formed a part of Ocke's career in connection with his teaching and/or his academic productions: "Segments", "Perception of Accent", "Between Sounds and Graphemes", "Prosody", "Morphology and Syntax" and "Second Language Acquisition". Each one of these illustrates a sound approach to language matters.

    A framework for ancient and machine-printed manuscripts categorization

    Document image understanding (DIU) has attracted a lot of attention and has become an active field of research. Although the ultimate goal of DIU is extracting the textual information of a document image, many steps are involved in the process, such as categorization, segmentation and layout analysis. All of these steps are needed in order to obtain an accurate result from character recognition or word recognition of a document image. One of the important steps in DIU is document image categorization (DIC), which is needed in many situations, such as when document images are written or printed in more than one script, font or language. This step provides useful information for the recognition system and helps reduce its error by allowing a category-specific Optical Character Recognition (OCR) system or word recognition (WR) system to be incorporated. This research focuses on the problem of DIC across different categories of scripts, styles and languages, and establishes a framework for flexible representation and feature extraction that can be adapted to many DIC problems. The current methods for DIC have many limitations and drawbacks that restrict their practical usage. We propose an efficient framework for categorization of document images based on patch representation and Non-negative Matrix Factorization (NMF). This framework is flexible and can be adapted to different categorization problems. Many methods exist for script identification of document images, but few of them address the problem in handwritten manuscripts, and they have many limitations and drawbacks. Therefore, our first goal is to introduce a novel method for script identification of ancient manuscripts. The proposed method is based on a patch representation in which the patches are extracted using the skeleton map of a document image. This representation overcomes the current methods' limitation of being tied to a fixed level of layout. The proposed feature extraction scheme, based on Projective Non-negative Matrix Factorization (PNMF), is robust against noise and handwriting variation and can be used for different scripts. The proposed method has higher performance than state-of-the-art methods and can be applied to different levels of layout. The current methods for font (style) identification are mostly designed for machine-printed document images, and many of them can only be used at a specific level of layout. Therefore, we propose a new method for font and style identification of printed and handwritten manuscripts based on patch representation and Non-negative Matrix Tri-Factorization (NMTF). The images are represented by overlapping patches obtained from the foreground pixels. The positions of these patches are set based on the skeleton map to reduce the number of patches. Non-negative Matrix Tri-Factorization is used to learn bases for each font (style), and these bases are then used to classify a new image based on minimum representation error. The proposed method can easily be extended to new fonts, as the bases for each font are learned separately from the other fonts. This method is tested on two datasets of machine-printed and ancient manuscripts, and the results confirm its performance compared to state-of-the-art methods. Finally, we propose a novel method for language identification of printed and handwritten manuscripts based on patch representation and Non-negative Matrix Tri-Factorization (NMTF). The current methods for language identification are based either on textual data obtained by an OCR engine or on image data encoded and compared with textual data. The OCR-based methods need a lot of processing, and the current image-based methods are not applicable to cursive scripts such as Arabic. In this work we introduce a new method for language identification of machine-printed and handwritten manuscripts based on patch representation and NMTF. The patch representation provides the components of the Arabic script (letters) that cannot be extracted simply by segmentation methods. NMTF is then used for dictionary learning and for generating codebooks that are used to represent a document image with a histogram. The proposed method is tested on two datasets of machine-printed and handwritten manuscripts and compared to n-gram features (text-based), texture features and codebook features (image-based) to validate its performance. The proposed methods are robust against variation in handwriting, changes in font (handwriting style) and the presence of degradation, and they are flexible enough to be used at various levels of layout (from a text line to a paragraph). The methods in this research have been tested on datasets of handwritten and machine-printed manuscripts and compared to state-of-the-art methods. All of the evaluations show the efficiency, robustness and flexibility of the proposed methods for categorization of document images. As mentioned before, the proposed strategies provide a framework for efficient and flexible representation and feature extraction for document image categorization. This framework can be applied to different levels of layout, information from different levels of layout can be merged and mixed, and the framework can be extended to more complex situations and different tasks.
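    The "one basis per category, classify by minimum representation error" scheme described above can be sketched compactly. Standard scikit-learn NMF stands in here for the Projective NMF and Non-negative Matrix Tri-Factorization variants used in the work, and random matrices stand in for real skeleton-anchored patch vectors.

    import numpy as np
    from sklearn.decomposition import NMF

    rng = np.random.default_rng(1)
    PATCH_DIM = 64                      # e.g. flattened 8x8 binary patches
    N_COMPONENTS = 5

    # Toy training data: one non-negative patch matrix (rows = patches) per category.
    train_patches = {"script_A": rng.random((200, PATCH_DIM)),
                     "script_B": rng.random((200, PATCH_DIM))}

    # Learn a separate basis for each category.
    models = {cat: NMF(n_components=N_COMPONENTS, init="nndsvda", max_iter=500).fit(X)
              for cat, X in train_patches.items()}

    def classify(patches):
        # Pick the category whose learned basis reconstructs the patches best.
        errors = {}
        for cat, model in models.items():
            W = model.transform(patches)              # encode with the basis fixed
            errors[cat] = np.linalg.norm(patches - W @ model.components_)
        return min(errors, key=errors.get)

    query = rng.random((50, PATCH_DIM))   # patches from an unseen document image
    print(classify(query))

    Because each category's basis is learned independently, adding a new font or script only requires fitting one more model, which mirrors the extensibility argument made above.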

    Sensing of complex buildings and reconstruction into photo-realistic 3D models

    The 3D reconstruction of indoor and outdoor environments has received interest only recently, as companies began to recognize that reconstructed models are a way to generate revenue through location-based services and advertisements. A great amount of research has been done in the field of 3D reconstruction, and one of the latest and most promising applications is Kinect Fusion, which was developed by Microsoft Research. Its strong points are the real-time, intuitive 3D reconstruction, the interactive frame rate, the level of detail in the models, and the availability of the hardware and software for researchers and enthusiasts. A representative effort towards 3D reconstruction is the Point Cloud Library (PCL), a large-scale, open project for 2D/3D image and point cloud processing. In December 2011, PCL made available an implementation of Kinect Fusion, namely KinFu, which emulates the functionality provided in Kinect Fusion. However, both implementations have two major limitations: 1. The real-time reconstruction takes place only within a cube with a size of 3 meters per axis. The cube's position is fixed at the start of execution, and any object outside of this cube is not integrated into the reconstructed model. Therefore the volume that can be scanned is always limited by the size of the cube. It is possible to manually align many small cubes into a single large model, but this is a time-consuming and difficult task, especially when the meshes have complex topologies and high polygon counts, as is the case with the meshes obtained from KinFu. 2. The output mesh does not have any color textures. There are some attempts to add color to the output point cloud; however, the resulting effect is not photo-realistic. Applying photo-realistic textures to a model can enhance the user experience, even when the model has a simple topology. The main goal of this project is to design and implement a system that captures large indoor environments and generates photo-realistic large indoor 3D models in real time. This report describes an extended version of the KinFu system whose extensions overcome the scalability and texture reconstruction limitations using commodity hardware and open-source software. The complete hardware setup used in this project is worth €2,000, which is comparable to the cost of a single professional laser scanner. The software is released under the BSD license, which makes it completely free to use and commercialize. The system has been integrated into the open-source PCL project. The immediate benefits are three-fold: the system becomes a potential industry standard, it is maintained and extended by many developers around the world at no additional cost to the VCA group, and it can reduce application development time by reusing numerous state-of-the-art algorithms.
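    One common way to remove the first limitation, the fixed 3-metre cube, is to allocate the reconstruction volume lazily instead of preallocating it. The sketch below illustrates that general idea with a hash map of TSDF voxel blocks; it is a conceptual illustration only, not the extended KinFu described here (which is C++/CUDA code inside PCL), and the voxel sizes, truncation distance and simple running-average fusion rule are assumptions.

    import numpy as np

    VOXEL_SIZE = 0.01      # 1 cm voxels
    BLOCK_SIZE = 16        # 16^3 voxels per block
    TRUNCATION = 0.04      # truncation distance for the signed distance values

    blocks = {}            # (bx, by, bz) -> per-block TSDF and weight grids

    def block_key(point):
        # Integer coordinates of the block containing a 3D point (in metres).
        return tuple((point // (VOXEL_SIZE * BLOCK_SIZE)).astype(int))

    def integrate_point(surface_point, sdf):
        # Fuse one truncated signed-distance observation into the lazy volume.
        key = block_key(surface_point)
        if key not in blocks:                      # allocate only where needed
            blocks[key] = {"tsdf": np.ones((BLOCK_SIZE,) * 3),
                           "weight": np.zeros((BLOCK_SIZE,) * 3)}
        block = blocks[key]
        voxel = tuple(((surface_point / VOXEL_SIZE) % BLOCK_SIZE).astype(int))
        d = np.clip(sdf / TRUNCATION, -1.0, 1.0)   # truncated, normalised distance
        w = block["weight"][voxel]
        block["tsdf"][voxel] = (block["tsdf"][voxel] * w + d) / (w + 1)  # running average
        block["weight"][voxel] = w + 1

    # Points far apart still integrate fine; no fixed cube bounds the scan.
    integrate_point(np.array([0.10, 0.20, 1.50]), sdf=0.003)
    integrate_point(np.array([8.75, -3.40, 12.0]), sdf=-0.010)
    print(len(blocks), "voxel blocks allocated")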

    Developing a unified feature-based model for L2 lexical and syntactic processing

    Research on lexical processing shows that the lexical representations of L2 speakers are less developed, so frequency and vocabulary size affect the way they use lexical information. Specifically, reduced access to lexical features hinders the processing system of L2 speakers from working efficiently, which has an impact on their ability to build syntactic structures in a native-like manner. The present research project aims to construct and test a unified model that explains how lexical and sentence processing interact. First, it develops and validates a productive vocabulary task for L2 Italian to measure vocabulary size. The task, called I-Lex, is based on the existing LEX30 for English and uses frequency to determine lexical knowledge. Then, adopting the formalism of Head-Driven Phrase Structure Grammar, a framework that associates all the information relevant to the grammar with the lexicon, the project develops a model that explains the effects of lexical access on syntactic processing. The model is tested in two empirical studies on L2 speakers of Italian. The first study, using an Oral Elicited Imitation task and the I-Lex productive vocabulary task, investigates the effects of frequency and vocabulary size on cleft sentences. The second study, using the same productive vocabulary task and a Self-paced Reading task, investigates frequency and vocabulary effects on relative clauses. The results reveal that frequency and vocabulary size interact with the ability of L2 speakers to process both cleft sentences and relative clauses, providing evidence that accessing lexical features is a crucial stage in processing syntactic structures. Based on the results, a feature-based lexical network model is constructed. The model describes how lexical access and the activation of structural links between words can be described using the same set of lexical features. In the last chapter, the model is applied to the results of the two studies.
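    As a toy reading of the claim that lexical access and structure building draw on the same set of lexical features, the sketch below gives each lexical entry HPSG-style features, makes retrieval probability depend on frequency and on a vocabulary-size measure, and lets a structure be built only when every word's features are actually retrieved. All entries, numbers and thresholds are invented for illustration and are not taken from the thesis or its data.

    import math
    import random

    # Toy lexicon: each entry bundles HPSG-style features with a corpus frequency.
    LEXICON = {
        "che":     {"features": {"HEAD": "comp", "MARKS": "relative-clause"}, "freq": 90000},
        "vede":    {"features": {"HEAD": "verb", "SUBCAT": ["NP", "NP"]},     "freq": 4000},
        "ragazzo": {"features": {"HEAD": "noun", "AGR": "3sg-masc"},          "freq": 2500},
    }

    def access_probability(word, vocabulary_size):
        # Chance of retrieving the full feature bundle: higher for frequent words
        # and for speakers with a larger (toy) vocabulary-size score.
        strength = math.log10(LEXICON[word]["freq"]) * (vocabulary_size / 5000)
        return min(1.0, strength / 5)

    def parse_relative_clause(words, vocabulary_size, seed=42):
        # Structure building succeeds only if every word's features are accessed.
        random.seed(seed)
        retrieved = {}
        for w in words:
            if random.random() <= access_probability(w, vocabulary_size):
                retrieved[w] = LEXICON[w]["features"]
        if all(w in retrieved for w in words):
            return "parsed", retrieved
        return "breakdown", retrieved

    # With this seed, the lower vocabulary-size score leads to a retrieval failure
    # and a parsing breakdown, while the higher score yields a full parse.
    print(parse_relative_clause(["ragazzo", "che", "vede"], vocabulary_size=2000))
    print(parse_relative_clause(["ragazzo", "che", "vede"], vocabulary_size=5000))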

    The Effect of Video Modeling and Social Skill Instruction on the Social Skills of Adolescents with High Functioning Autism and Asperger's Syndrome

    Research conducted on video modeling has shown that these strategies are most effective when they include specific strategies to address conversation skills. Social skills research has also shown that teaching social skills to adolescents in group settings may be more effective than presenting them on an individual basis. Adolescents with Asperger's Syndrome (AS) and High Functioning Autism (HFA) participated in a 12-week Social Skills Training (SST) program. In addition to pre- and post-study measures, conversation skills data were collected before and after the application of the independent variable (video modeling). Follow-up interviews were also conducted with participants, secondary participants, and parents of the primary participants. A two-week baseline phase established pre-existing social and conversation skills and enabled the measurement of changes over the course of the 12-week program; participants then attended weekly social skills training and received the video modeling treatment, using videos found on YouTube. After post-intervention data were collected, additional data were collected with participants and secondary participants (neuro-typical peers) as a measure of treatment generalization. This study proposed that presenting social skills videos found on YouTube would be effective in increasing levels of initiation, responses and conversation skills, thereby increasing communication effectiveness and reducing social rejection by peers. Although some gains in conversational skill levels were observed for most participants in the study, significant increases in conversation skill levels were not observed in either the ASD-only group setting or the mixed ASD/neuro-typical group setting.

    Advances in automatic terminology processing: methodology and applications in focus

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

    The information and knowledge era in which we are living creates challenges in many fields, and terminology is no exception. The challenges include an exponential growth in the number of specialised documents that are available, in which terms are presented, and in the number of newly introduced concepts and terms, which is already beyond our (manual) capacity. A promising solution to this 'information overload' would be to employ automatic or semi-automatic procedures that enable individuals and/or small groups to efficiently build high-quality terminologies from their own resources, terminologies which closely reflect their individual objectives and viewpoints. Automatic terminology processing (ATP) techniques have already proved to be quite reliable and can save human time in terminology processing. However, they are not without weaknesses, one of which is that these techniques often consider terms to be independent lexical units satisfying some criteria, when terms are, in fact, integral parts of a coherent system (a terminology). This observation is supported by the discussion of the notion of terms and terminology and the review of existing approaches in ATP presented in this thesis. In order to overcome the aforementioned weakness, we propose a novel methodology in ATP which is able to extract a terminology as a whole. The proposed methodology is based on knowledge patterns automatically extracted from glossaries, which we consider to be valuable but overlooked resources. These automatically identified knowledge patterns are used to extract terms, their relations and descriptions from corpora. The extracted information can facilitate the construction of a terminology as a coherent system. The study also aims to discuss applications of ATP, and describes an experiment in which ATP is integrated into a new NLP application: multiple-choice test item generation. The successful integration of the system shows that ATP is a viable technology and should be exploited more by other NLP applications.
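    The central extraction step, matching knowledge patterns against a corpus to pull out terms, their relations and descriptions, can be illustrated with a few hand-written patterns. In the thesis the knowledge patterns are extracted automatically from glossaries; the regular expressions, the three-sentence corpus and the relation labels below are placeholders chosen only to show the shape of such a pipeline.

    import re

    KNOWLEDGE_PATTERNS = [
        # (regex, relation label) -- illustrative placeholders, not the thesis' patterns
        (re.compile(r"(?P<term>[A-Z][\w -]+?) is a (?:type|kind) of (?P<hyper>[\w -]+)"), "is_a"),
        (re.compile(r"(?P<term>[A-Z][\w -]+?), also known as (?P<syn>[\w -]+),"), "synonym_of"),
        (re.compile(r"(?P<term>[A-Z][\w -]+?) is defined as (?P<definition>[^.]+)\."), "defined_as"),
    ]

    corpus = (
        "Terminology extraction is a type of information extraction. "
        "Automatic terminology processing, also known as ATP, covers several tasks. "
        "A term is defined as a lexical unit that designates a concept in a domain."
    )

    def extract(text):
        # Return (term, relation, value) triples for every pattern match.
        triples = []
        for pattern, relation in KNOWLEDGE_PATTERNS:
            for match in pattern.finditer(text):
                groups = match.groupdict()
                term = groups.pop("term")
                value = next(iter(groups.values()))
                triples.append((term.strip(), relation, value.strip()))
        return triples

    for triple in extract(corpus):
        print(triple)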