
    A Statistical Approach to Grammatical Error Correction

    PhD thesis (Doctor of Philosophy)

    Beyond topic-based representations for text mining

    A massive amount of online information is natural language text: newspapers, blog articles, forum posts and comments, tweets, scientific literature, government documents, and more. While all kinds of online information are useful in general, textual information is especially important; it is the most natural, most common, and most expressive form of information. Text representation plays a critical role in application tasks like classification or information retrieval, since the quality of the underlying feature space directly impacts each task's performance. Because of this importance, many different approaches have been developed for generating text representations. By far the most common way to generate features is to segment text into words and record their n-grams. While simple term features perform relatively well in topic-based tasks, not all downstream applications are topical in nature or can be captured by words alone. For example, determining the native language of an English essay writer depends on more than just word choice. Methods that compete with topic-based representations (such as neural networks) are often not interpretable or rely on massive amounts of training data. This thesis proposes three novel contributions for generating and analyzing a large space of non-topical features. First, structural parse tree features are based solely on the structural properties of a parse tree, ignoring all of the syntactic categories in the tree. An important advantage of these "skeletons" over regular syntactic features is that they can capture global tree structures without causing problems of data sparseness or overfitting. Second, SyntacticDiff explicitly captures differences in a text document with respect to a reference corpus, creating features that are easily explained as weighted word edit differences. These edit features are especially useful because they are derived from information not present in the current document, capturing a type of comparative feature. Third, Cross-Context Lexical Analysis (CCLA) is a general framework for analyzing similarities and differences in both term meaning and representation with respect to different, potentially overlapping partitions of a text collection. The representations analyzed by CCLA are not limited to topic-based features.
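
    The "skeleton" idea above can be illustrated with a short sketch. The following is a minimal, assumed implementation (not the thesis code): it drops every syntactic category label from a constituency parse and counts the branching shapes of its subtrees, so only global tree structure survives as a feature. Using nltk.Tree as the input format is an assumption for the example.

```python
# Minimal sketch of label-free "skeleton" features; illustration only.
from collections import Counter
from nltk import Tree  # assumes nltk is installed

def skeleton(node):
    """Encode only the branching shape of a (sub)tree, ignoring all labels."""
    if isinstance(node, str):              # a token leaf
        return "()"
    return "(" + "".join(skeleton(child) for child in node) + ")"

def skeleton_features(tree):
    """Count the skeleton of every subtree as a structural feature."""
    counts = Counter()

    def visit(node):
        if isinstance(node, str):
            return
        counts[skeleton(node)] += 1
        for child in node:
            visit(child)

    visit(tree)
    return counts

parse = Tree.fromstring("(S (NP (DT the) (NN cat)) (VP (VBD sat)))")
print(skeleton_features(parse))   # e.g. '(())' counts every pre-terminal node
```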

    Computational Models of Problems with Writing of English as a Second Language Learners

    Learning a new language is a challenging endeavor. As a student attempts to master the grammar, usage, and mechanics of the new language, they make many mistakes. Detailed feedback and corrections from language tutors are invaluable to student learning, but it is time-consuming to provide such feedback. In this thesis, I investigate the feasibility of building computer programs to help reduce the effort required of English as a Second Language (ESL) tutors. Specifically, I consider three problems: (1) whether a program can identify areas that may need the tutor’s attention, such as places where the learners have used redundant words; (2) whether a program can auto-complete a tutor’s corrections by inferring the location and reason for the correction; (3) for detecting misuses of prepositions, a common ESL error type, whether a program can automatically construct a set of potential corrections by finding words that are more likely to be confused with each other (known as a confusion set). The viability of these programs depends on whether aspects of the English language and common ESL mistakes can be described by computational models. For each task, building computational models faces unique challenges: (1) In highlighting redundant areas, it is difficult to precisely define “redundancy” in a computer’s language. (2) In auto-completing tutors’ annotations, it is difficult for computers to correctly interpret how many writing problems were addressed during revision. (3) In confusion set construction, it is difficult to infer which words are more likely to be confused with a given word. To address these challenges, this thesis presents different model alternatives for each task. Empirical experiments demonstrate the degree to which computational models can help with detecting and correcting ESL writing problems.
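
    One common way to build a confusion set, sketched below purely as an illustration (not necessarily the thesis's method), is distributional similarity: prepositions that appear in similar contexts are treated as likely substitutes for one another. The toy corpus, window size, and preposition list are assumptions.

```python
# Illustrative confusion-set construction from context-vector similarity.
from collections import Counter, defaultdict
import math

PREPOSITIONS = {"in", "on", "at", "to", "for", "with", "by", "of"}

def context_vectors(sentences, window=2):
    """Count the words appearing within a small window around each preposition."""
    vectors = defaultdict(Counter)
    for tokens in sentences:
        for i, tok in enumerate(tokens):
            if tok not in PREPOSITIONS:
                continue
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[tok][tokens[j]] += 1
    return vectors

def cosine(a, b):
    shared = set(a) & set(b)
    num = sum(a[w] * b[w] for w in shared)
    den = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return num / den if den else 0.0

def confusion_set(target, vectors, k=3):
    """Return the k prepositions whose contexts are most similar to the target's."""
    scores = [(cosine(vectors[target], vec), w)
              for w, vec in vectors.items() if w != target]
    return [w for _, w in sorted(scores, reverse=True)[:k]]

sentences = [s.split() for s in [
    "she waited at the station",
    "she waited in the station",
    "he arrived at the office",
    "the book is on the table",
]]
print(confusion_set("at", context_vectors(sentences)))
```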

    A pilot study in an application of text mining to learning system evaluation

    Text mining concerns discovering and extracting knowledge from unstructured data. It transforms textual data into a usable, intelligible format that facilitates classifying documents, finding explicit relationships or associations between documents, and clustering documents into categories. Given a collection of survey comments evaluating the civil engineering learning system, a text mining technique is applied to discover and extract knowledge from the comments. This research focuses on a systematic way to apply a software tool, SAS Enterprise Miner, to the survey data. The purpose is to categorize the comments into different groups in an attempt to identify the major concerns of the users or students. Each group is associated with a set of key terms, which allows the evaluators of the learning system to grasp the main ideas from those summarized terms without having to go through a potentially huge amount of data.

    Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan languages

    The Proceedings of the Seventh International Conference Formal Approaches to South Slavic and Balkan Languages contain 17 papers that were presented at the conference, organised in Dubrovnik, Croatia, 4-6 October 2010.

    Artificial Intelligence for Multimedia Signal Processing

    Artificial intelligence technologies are being actively applied to broadcasting and multimedia processing. A great deal of research has been conducted in a wide variety of fields, such as content creation, transmission, and security, and over the past two to three years these efforts have aimed at improving the compression efficiency of image, video, speech, and other data in areas related to MPEG media processing technology. Additionally, technologies such as media creation, processing, editing, and scenario creation are very important areas of research in multimedia processing and engineering. This book collects topics that range broadly across advanced computational intelligence algorithms and technologies for emerging multimedia signal processing: computer vision, speech/sound/text processing, and content analysis/information mining.

    Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process

    Hans Hjelm. Cross-language Ontology Learning: Incorporating and Exploiting Cross-language Data in the Ontology Learning Process. NEALT Monograph Series, Vol. 1 (2009), 159 pages. © 2009 Hans Hjelm. Published by the Northern European Association for Language Technology (NEALT), http://omilia.uio.no/nealt. Electronically published at Tartu University Library (Estonia), http://hdl.handle.net/10062/10126

    Multi modal multi-semantic image retrieval

    The rapid growth in the volume of visual information, e.g. images and video, can overwhelm users’ ability to find and access the specific visual information of interest to them. In recent years, ontology knowledge-based (KB) image information retrieval techniques have been adopted in order to extract knowledge from these images and enhance retrieval performance. A KB framework is presented to promote semi-automatic annotation and semantic image retrieval using multimodal cues (visual features and text captions). In addition, a hierarchical structure for the KB allows metadata to be shared and supports multi-semantics (polysemy) for concepts. The framework builds up an effective knowledge base pertaining to a domain-specific image collection, e.g. sports, and is able to disambiguate and assign high-level semantics to ‘unannotated’ images. Local feature analysis of visual content, namely using Scale Invariant Feature Transform (SIFT) descriptors, has been deployed in the ‘Bag of Visual Words’ (BVW) model as an effective method to represent visual content information and to enhance its classification and retrieval. Local features are more useful than global features, e.g. colour, shape or texture, as they are invariant to image scale, orientation and camera angle. An innovative approach is proposed for the representation, annotation and retrieval of visual content using a hybrid technique based upon unstructured visual words and upon a (structured) hierarchical ontology KB model. The structural model facilitates the disambiguation of unstructured visual words and a more effective classification of visual content, compared to a vector space model, by exploiting local conceptual structures and their relationships. The key contributions of this framework in using local features for image representation include: first, a method to generate visual words using the semantic local adaptive clustering (SLAC) algorithm, which takes the term weight and the spatial locations of keypoints into account; consequently, the semantic information is preserved. Second, a technique is used to detect domain-specific ‘non-informative visual words’, which are ineffective at representing the content of visual data and degrade its categorisation ability. Third, a method to combine an ontology model with a visual word model to resolve synonym (visual heterogeneity) and polysemy problems is proposed. The experimental results show that this approach can efficiently discover semantically meaningful visual content descriptions and recognise specific events, e.g. sports events, depicted in images. Since discovering the semantics of an image is an extremely challenging problem, one promising approach to enhance visual content interpretation is to use any associated textual information that accompanies an image as a cue to predict its meaning, by transforming this textual information into a structured annotation, e.g. using XML, RDF, OWL or MPEG-7. Although text and images are distinct types of information representation and modality, there are some strong, invariant, implicit connections between images and any accompanying text information. Semantic analysis of image captions can be used by image retrieval systems to retrieve selected images more precisely. To do this, Natural Language Processing (NLP) is first exploited to extract concepts from image captions. Next, an ontology-based knowledge model is deployed in order to resolve natural language ambiguities. To deal with the accompanying text information, two methods to extract knowledge from textual information are proposed. First, metadata can be extracted automatically from text captions and restructured with respect to a semantic model. Second, the use of Latent Semantic Indexing (LSI) in relation to a domain-specific ontology-based knowledge model enables the combined framework to tolerate ambiguities and variations (incompleteness) in metadata. The use of the ontology-based knowledge model allows the system to find indirectly relevant concepts in image captions and thus leverage these to represent the semantics of images at a higher level. Experimental results show that the proposed framework significantly enhances image retrieval and narrows the semantic gap between lower-level machine-derived and higher-level human-understandable conceptualisation.
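
    The general Bag of Visual Words idea referenced above can be sketched briefly. The following is an assumed, minimal illustration (not the thesis's SLAC-based pipeline): SIFT descriptors are clustered with k-means into a visual vocabulary, and each image is represented as a normalised histogram over those visual words. It assumes OpenCV (>= 4.4, for SIFT_create) and scikit-learn.

```python
# Minimal bag-of-visual-words sketch; illustration of the general BVW idea only.
import cv2
import numpy as np
from sklearn.cluster import KMeans

def sift_descriptors(image_paths):
    """Extract SIFT descriptors per image (empty array if none are found)."""
    sift = cv2.SIFT_create()
    per_image = []
    for path in image_paths:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, desc = sift.detectAndCompute(img, None)
        per_image.append(desc if desc is not None else np.empty((0, 128), np.float32))
    return per_image

def bag_of_visual_words(image_paths, n_words=200):
    """Cluster all descriptors into n_words 'visual words', histogram each image."""
    per_image = sift_descriptors(image_paths)
    vocabulary = KMeans(n_clusters=n_words, random_state=0).fit(np.vstack(per_image))
    histograms = []
    for desc in per_image:
        words = vocabulary.predict(desc) if len(desc) else np.array([], dtype=int)
        hist = np.bincount(words, minlength=n_words).astype(float)
        histograms.append(hist / hist.sum() if hist.sum() else hist)  # L1-normalise
    return np.array(histograms), vocabulary
```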

    Iterated learning framework for unsupervised part-of-speech induction

    Computational approaches to linguistic analysis have been used for more than half a century. The main tools come from the field of Natural Language Processing (NLP) and are based on rule-based or corpora-based (supervised) methods. Despite the undeniable success of supervised learning methods in NLP, they have two main drawbacks: on the practical side, it is expensive to produce the manual annotation (or the rules) required, and it is not easy to find annotators for less common languages. A theoretical disadvantage is that the computational analysis produced is tied to a specific theory or annotation scheme. Unsupervised methods offer the possibility to expand our analyses into more resource-poor languages and to move beyond conventional linguistic theories. They are a way of observing patterns and regularities emerging directly from the data and can provide new linguistic insights. In this thesis I explore unsupervised methods for inducing parts of speech across languages. I discuss the challenges in evaluating unsupervised learning and, at the same time, by looking at the historical evolution of part-of-speech systems, I make the case that the compartmentalised, traditional pipeline approach of NLP is not ideal for the task. I present a generative Bayesian system that makes it easy to incorporate multiple diverse features spanning different levels of linguistic structure, such as morphology, lexical distribution, syntactic dependencies and word alignment information, allowing for the examination of cross-linguistic patterns. I test the system using features provided by unsupervised systems in a pipeline mode (where the output of one system is the input to another) and show that the performance of the baseline (distributional) model increases significantly, reaching and in some cases surpassing the performance of state-of-the-art part-of-speech induction systems. I then turn to the unsupervised systems that provided these sources of information (morphology, dependencies, word alignment) and examine the way that part-of-speech information influences their inference. Having established a bi-directional relationship between each system and my part-of-speech inducer, I describe an iterated learning method, where each component system is trained using the output of the other system in each iteration. The iterated learning method improves the performance of both component systems in each task. Finally, using this iterated learning framework, and by using parts of speech as the central component, I produce chains of linguistic structure induction that combine all the component systems to offer a more holistic view of NLP. To show the potential of this multi-level system, I demonstrate its use ‘in the wild’. I describe the creation of a vastly multilingual parallel corpus based on 100 translations of the Bible in a diverse set of languages. Using the multi-level induction system, I induce cross-lingual clusters and provide some qualitative results of my approach. I show that it is possible to discover similarities between languages that correspond to ‘hidden’ morphological, syntactic or semantic elements.
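
    The iterated learning loop described in this abstract can be sketched schematically. In the sketch below, the component systems and their train/annotate interface are assumptions introduced for illustration, not the thesis's actual models: each system is simply retrained on the other system's latest output.

```python
# Schematic sketch of an iterated learning loop between two induction systems.
def iterated_learning(corpus, pos_inducer, aux_system, iterations=5):
    """Alternately retrain a POS inducer and an auxiliary system (e.g. a
    morphology or dependency inducer) on each other's latest output.
    Both objects are assumed to expose train(corpus, extra) and annotate(corpus)."""
    pos_tags, aux_labels = None, None
    for _ in range(iterations):
        pos_inducer.train(corpus, extra=aux_labels)   # aux output becomes features
        pos_tags = pos_inducer.annotate(corpus)
        aux_system.train(corpus, extra=pos_tags)      # POS clusters become features
        aux_labels = aux_system.annotate(corpus)
    return pos_tags, aux_labels
```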

    Examination and utilization of rare features in text classification of injury narratives

    Thanks to advances in computing and information technology, analyzing injury surveillance data with statistical machine learning methods has grown in popularity, complexity, and quality over recent years. During that same time, researchers have recognized the limitations of statistical text analysis with limited training data. In response to the two primary challenges for statistical text analysis, dimensionality reduction and sparse data, many studies have focused on improving machine learning algorithms. Less research has been done, though, to examine and improve statistical machine learning methods in text classification from a linguistic perspective. This study addresses this research gap by examining the importance of extreme-frequency words in classifying injury narratives. The results indicate that adhering to the common practice of removing frequently-occurring prepositions from the text significantly decreased the classification performance for certain categories. Removing low-frequency words significantly improved the classification performance for Multinomial Naive Bayes (MNB) and helped alleviate the problem of overfitting small categories for Logistic Regression (LR), but did not have any significant effect for the Support Vector Machine (SVM). As a way to utilize low-frequency words, classic word normalization or grouping methods such as stemming and lemmatization are often used in the text preprocessing stage. Despite their popularity, these classic grouping methods are not without limitations. The proposed Type M+S Word Grouping Method automatically groups rare and unseen words morphologically and semantically using unlabeled data. Several experiments were conducted to evaluate the grouping effect for three classifiers (MNB, SVM, LR) in three train-test scenarios (1:9, 1:1, 9:1) on injury surveillance data with a half-million narratives classified into 30 external cause categories. The experimental results show that the proposed method, optionally paired with three add-on methods (two-word sequence tagging, reviewed tagging, and a Naive Bayes-weighted classifier), resulted in better classification performance than stemming and lemmatization. The overall classification performance for small categories with limited training data was improved for MNB (5.5%), SVM (4%), and LR (11.2%), to an extent comparable to increasing the size of the labeled training set by a factor of 3.6 for MNB, 2.3 for SVM, and 5.2 for LR. Some improvement was also observed for medium-sized categories (1.7%), while performance on large categories remained nearly unchanged (0.1%). The overall results support the conclusion that the proposed method of decision support is a promising approach for incorporating expert knowledge that improves machine learning for classifying injury narratives with reduced manual effort. The results also suggest that simply increasing the size of a training dataset would not achieve the level of performance that the proposed method can, because of the inherent limitations of linear classifiers in acquiring from the narratives the fundamental concepts and classification rules that human experts know from the definitions of injuries.
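
    The frequency-cutoff preprocessing discussed above can be made concrete with a short sketch. The following assumed setup (not the study's actual pipeline) uses scikit-learn to drop rare words via a document-frequency threshold while keeping frequent function words such as prepositions, before training a Multinomial Naive Bayes classifier.

```python
# Sketch of a rare-word frequency cutoff ahead of a Multinomial Naive Bayes model.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def build_classifier(min_word_frequency=2, keep_stopwords=True):
    # min_df drops words appearing in fewer than min_word_frequency narratives;
    # stop words (including prepositions) are kept by default, since removing
    # frequent prepositions can hurt performance on some categories.
    vectorizer = CountVectorizer(
        min_df=min_word_frequency,
        stop_words=None if keep_stopwords else "english",
    )
    return make_pipeline(vectorizer, MultinomialNB())

# Hypothetical usage with labelled narratives:
# clf = build_classifier().fit(train_narratives, train_categories)
# predicted = clf.predict(test_narratives)
```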