213 research outputs found
Compiling and annotating a learner corpus for a morphologically rich language: CzeSL, a corpus of non-native Czech
Learner corpora, linguistic collections documenting a language as used by learners, provide an important empirical foundation for language acquisition research and teaching practice. This book presents CzeSL, a corpus of non-native Czech, against the background of theoretical and practical issues in current learner corpus research. Languages with rich morphology and relatively free word order, including Czech, are particularly challenging for the analysis of learner language. The authors address both the complexity of learner error annotation, describing three complementary annotation schemes, and the complexity of describing non-native Czech in terms of standard linguistic categories. The book discusses in detail the practical aspects of corpus creation: the process of collection and annotation itself, the supporting tools, the resulting data, their formats and search platforms. The chapter on use cases exemplifies the usefulness of learner corpora for teaching, language acquisition research, and computational linguistics. Any researcher developing learner corpora will surely appreciate the concluding chapter listing lessons learned and pitfalls to avoid.
Fine-grained Subjectivity and Sentiment Analysis: Recognizing the intensity, polarity, and attitudes of private states
Private states (mental and emotional states) are part of the information that is conveyed in many forms of discourse. News articles often report emotional responses to news stories; editorials, reviews, and weblogs convey opinions and beliefs. This dissertation investigates the manual and automatic identification of linguistic expressions of private states in a corpus of news documents from the world press. A term for the linguistic expression of private states is subjectivity. The conceptual representation of private states used in this dissertation is that of Wiebe et al. (2005). As part of this research, annotators are trained to identify expressions of private states and their properties, such as the source and the intensity of the private state. This dissertation then extends the conceptual representation of private states to better model the attitudes and targets of private states. The inter-annotator agreement studies conducted for this dissertation show that the various concepts in the original and extended representation of private states can be reliably annotated. Exploring the automatic recognition of various types of private states is also a large part of this dissertation. Experiments are conducted that focus on three types of fine-grained subjectivity analysis: recognizing the intensity of clauses and sentences, recognizing the contextual polarity of words and phrases, and recognizing the attribution levels where sentiment and arguing attitudes are expressed. Various supervised machine learning algorithms are used to train automatic systems to perform each of these tasks. These experiments result in automatic systems for performing fine-grained subjectivity analysis that significantly outperform baseline systems.
An Investigation of Digital Reference Interviews: A Dialogue Act Approach
The rapid increase of computer-mediated communications (CMCs) in various forms such as micro-blogging (e.g. Twitter), online chatting (e.g. digital reference) and community-based question-answering services (e.g. Yahoo! Answers) characterizes a recent trend in web technologies, often referred to as the social web. This trend highlights the importance of supporting linguistic interactions in people's online information-seeking activities in daily life, something that web search engines still lack because of the complexity of this human behavior. The presented research consists of an investigation of the information-seeking behavior of digital reference services through analysis of discourse semantics, called dialogue acts, and experimentation with automatic identification of dialogue acts using machine-learning techniques. The data was an online chat reference transaction archive, provided by the Online Computing Library Center (OCLC). Findings of the discourse analysis include supporting evidence for some of the existing theories of information-seeking behavior. They also suggest a new way of analyzing the progress of information-seeking interactions using dialogue act analysis. The machine learning experimentation produced promising results and demonstrated the possibility of practical applications of the DA analysis for further research across disciplines.
Dataset for Automated Fact Checking in Czech Language
Our work examines the existing datasets for the task of automated fact-verification of textual claims and proposes two methods of their acquisition in the low-resource Czech language. It first delivers a large-scale FEVER CS dataset of 127K annotated claims by applying machine translation methods to a dataset available in English. It then designs a set of human-annotation experiments for collecting a novel dataset in Czech, using the ČTK Archive corpus as a knowledge base, and conducts them with a group of 163 students of FSS CUNI, yielding a dataset of 3,295 cross-annotated claims with a 4-way Fleiss' Kappa-agreement of 0.63. It then demonstrates the suitability of the dataset for training Czech natural language inference models by training an XLM-RoBERTa model that scores 85.5% micro-F1 in the task of classifying claim veracity given textual evidence.
Classifying Attitude by Topic Aspect for English and Chinese Document Collections
The goal of this dissertation is to explore the design of tools to help users make sense of subjective information in English and Chinese by comparing attitudes on aspects of a topic in English and Chinese document collections. This involves two coupled challenges: topic aspect focus and attitude characterization. The topic aspect focus is specified by using information retrieval techniques to obtain documents on a topic that are of interest to a user and then allowing the user to designate a few segments of those documents to serve as examples for aspects that she wishes to see characterized. A novel feature of this work is that the examples can be drawn from documents in two languages (English and Chinese). A bilingual aspect classifier which applies monolingual and cross-language classification techniques is used to assemble automatically a large set of document segments on those same aspects. A test collection was designed for aspect classification by annotating consecutive sentences in documents from the Topic Detection and Tracking collections as aspect instances. Experiments show that classification effectiveness can often be increased by using training examples from both languages.
Attitude characterization is achieved by classifiers which determine the subjectivity and polarity of document segments. Sentence attitude classification is the focus of the experiments in the dissertation because the best presently available test collection for Chinese attitude classification (the NTCIR-6 Chinese Opinion Analysis Pilot Task) is focused on sentence-level classification. A large Chinese sentiment lexicon was constructed by leveraging existing Chinese and English lexical resources, and an existing character-based approach for estimating the semantic orientation of other Chinese words was extended. A shallow linguistic analysis approach was adopted to classify the subjectivity and polarity of a sentence. Using the large sentiment lexicon with appropriate handling of negation, and leveraging sentence subjectivity density, sentence positivity and negativity, the resulting sentence attitude classifier was more effective than the best previously reported systems.