    Morphological annotation of Korean with Directly Maintainable Resources

    This article describes an exclusively resource-based method of morphological annotation of written Korean text. Korean is an agglutinative language. Our annotator is designed to process text before the operation of a syntactic parser. In its present state, it annotates one-stem words only. The output is a graph of morphemes annotated with accurate linguistic information. The granularity of the tagset is 3 to 5 times higher than that of usual tagsets. A comparison with a reference annotated corpus showed that it achieves 89% recall without any corpus training. The language resources used by the system are lexicons of stems, transducers of suffixes, and transducers for the generation of allomorphs. All can be easily updated, which allows users to control the evolution of the system's performance. It has been claimed that morphological annotation of Korean text could only be performed by a morphological analysis module accessing a lexicon of morphemes. We show that it can also be performed directly with a lexicon of words and without applying morphological rules at annotation time, which speeds up annotation to 1,210 words/s. The lexicon of words is obtained from the maintainable language resources through a fully automated compilation process.
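
    The central claim, annotation by direct lookup in a word lexicon compiled offline rather than by online morphological analysis, can be illustrated with a minimal Python sketch. The lexicon format, the romanized placeholder morphemes, and the compile_word_lexicon helper below are illustrative assumptions, not the paper's actual resources, which are stem lexicons and finite-state transducers.

        # Minimal sketch of lookup-based annotation from a precompiled word lexicon.
        # Offline step (hypothetical): combine stems with suffix alternatives to build
        # a lexicon mapping each inflected word to its analyses (morpheme sequences).
        def compile_word_lexicon(stems, suffixes):
            lexicon = {}
            for stem, stem_tag in stems:
                for suffix, suffix_tag in suffixes:
                    word = stem + suffix
                    lexicon.setdefault(word, []).append([(stem, stem_tag), (suffix, suffix_tag)])
            return lexicon

        # Annotation time: no morphological rules are applied, only dictionary lookup,
        # which is what allows high throughput.
        def annotate(tokens, lexicon):
            return [(tok, lexicon.get(tok, [])) for tok in tokens]

        # Toy example with romanized placeholders instead of real Korean morphemes.
        stems = [("mek", "VV:eat")]
        suffixes = [("essta", "EP:past+EF:decl"), ("nunta", "EF:pres.decl")]
        lexicon = compile_word_lexicon(stems, suffixes)
        print(annotate(["mekessta", "meknunta"], lexicon))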

    A Lexicon of Connected Components for Arabic Optical Text Recognition

    Arabic is a cursive script in which character segmentation is difficult. Hence, we suggest a unit that is discrete in nature, viz. the connected component, for Arabic text recognition. A lexicon listing valid Arabic connected components is necessary for any system that uses such a unit. Here, we produce and analyze a comprehensive lexicon of connected components. A lexicon can be extracted from corpora or synthesized from morphemes; we follow both approaches and merge their results. In addition, generating a lexicon of connected components requires extra tokenization and point-normalization steps to keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it to a lexicon of connected components, and finally to a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% from the word lexicon.
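
    A minimal sketch of the connected-component segmentation step follows, assuming a set of Arabic letters that do not join to the following letter; the exact letter inventory, the tokenization step, and the point-normalization step used by the authors are not reproduced, so the set below is an illustrative assumption.

        # Sketch: split an Arabic word into connected components (cursive segments).
        # The set of letters that do not join to the following letter is assumed here.
        NON_JOINING = set("اأإآدذرزوؤء")

        def connected_components(word):
            components, current = [], ""
            for ch in word:
                current += ch
                # A non-joining letter closes the current connected component.
                if ch in NON_JOINING:
                    components.append(current)
                    current = ""
            if current:
                components.append(current)
            return components

        # Example: a five-letter word splitting into three connected components.
        print(connected_components("مدرسة"))   # ['مد', 'ر', 'سة']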

    Generating a Malay sentiment lexicon based on wordnet

    A sentiment lexicon is a list of positive and negative words. In opinion mining, a sentiment lexicon is an important resource for the text polarity classification task in a sentiment analysis model. Studies in Malay sentiment analysis are increasing as the volume of sentiment data on social media grows; consequently, demand for a Malay sentiment lexicon is high. However, developing a Malay sentiment lexicon is difficult due to the scarcity of Malay language resources, and various approaches and techniques have been used to generate sentiment lexicons. The objective of this paper is to develop a Malay sentiment lexicon generation algorithm based on WordNet. The method maps WordNet Bahasa to the English WordNet to obtain the offset values of a seed set of sentiment words. The seed set is then expanded through synonym and antonym semantic relations in the English WordNet. The best result achieves 86.58% agreement with human annotators and 91.31% F1-measure in word polarity classification. These results show the effectiveness of the proposed algorithm for generating a Malay sentiment lexicon based on WordNet.
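
    The expansion step can be sketched with NLTK's English WordNet interface: synonyms inherit a seed word's polarity and antonyms receive the opposite polarity. The seed words below and the simple first-come precedence rule are illustrative assumptions; the mapping from WordNet Bahasa to English WordNet offsets is taken as already done.

        # Sketch of seed-set expansion over WordNet synonym/antonym relations.
        from nltk.corpus import wordnet as wn

        def expand_seeds(seeds):
            """seeds: dict mapping an English seed word to +1 / -1 polarity."""
            lexicon = dict(seeds)
            for word, polarity in seeds.items():
                for synset in wn.synsets(word):
                    for lemma in synset.lemmas():
                        # Synonyms inherit the seed's polarity.
                        lexicon.setdefault(lemma.name(), polarity)
                        # Antonyms receive the opposite polarity.
                        for ant in lemma.antonyms():
                            lexicon.setdefault(ant.name(), -polarity)
            return lexicon

        expanded = expand_seeds({"good": 1, "bad": -1})
        print(len(expanded), expanded["bad"])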

    Sentiment Classification for Film Reviews in Gujarati Text Using Machine Learning and Sentiment Lexicons

    In this paper, two techniques for sentiment classification of Gujarati text film reviews are proposed: Gujarati Lexicon Sentiment Analysis (GLSA) and Gujarati Machine Learning Sentiment Analysis (GMLSA). Five different datasets were produced to validate the accuracy of the machine learning-based and lexicon-based methods. The lexicon-based approach employs a sentiment lexicon known as GujSentiWordNet, which identifies sentiments with a sentiment score for feature generation, while the machine learning-based approach uses five classifiers: logistic regression (LR), random forest (RF), k-nearest neighbors (KNN), support vector machine (SVM), and naive Bayes (NB), with TF-IDF and count vectorizers for feature selection. Experiments were carried out and the results were compared using accuracy, precision, recall, and F-score as performance evaluation criteria. According to the test results, the machine learning-based technique improved accuracy by 3 to 10% on average compared to the lexicon-based approach.
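
    As a rough illustration of the machine learning branch, the sketch below wires a TF-IDF vectorizer to one of the listed classifiers with scikit-learn. The two toy review strings and labels are invented; the paper's five Gujarati film-review datasets and its full classifier comparison are not reproduced here.

        # Sketch: TF-IDF features feeding a logistic regression classifier.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        reviews = ["ફિલ્મ ખૂબ સરસ હતી", "ફિલ્મ કંટાળાજનક હતી"]   # toy Gujarati snippets
        labels = ["positive", "negative"]

        model = make_pipeline(TfidfVectorizer(), LogisticRegression())
        model.fit(reviews, labels)
        print(model.predict(["ખૂબ સરસ ફિલ્મ"]))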

    Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding

    Text simplification has emerged as an increasingly useful application of AI for bridging the communication gap in specialized fields such as medicine, where the lexicon is often dominated by technical jargon and complex constructs. Despite notable progress, methods in medical simplification sometimes result in generated text of lower quality and diversity. In this work, we explore ways to further improve the readability of text simplification in the medical domain. We propose (1) a new unlikelihood loss that encourages generation of simpler terms and (2) a reranked beam search decoding method that optimizes for simplicity, which together achieve better performance on readability metrics on three datasets. This study's findings offer promising avenues for improving text simplification in the medical field. Comment: EMNLP 2023 Findings.
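
    The unlikelihood idea can be sketched as a penalty term that discourages probability mass on a designated set of "complex" vocabulary ids. The penalty set, clamping, and weighting below are assumptions for illustration, not the authors' exact formulation, and in training such a term would typically be added to the usual cross-entropy objective.

        # Sketch of a token-level unlikelihood penalty over "complex" token ids.
        import torch
        import torch.nn.functional as F

        def unlikelihood_loss(logits, complex_token_ids):
            """logits: (batch, seq_len, vocab). Grows as the model puts mass on penalised ids."""
            probs = F.softmax(logits, dim=-1)
            penalised = probs[..., complex_token_ids]            # (batch, seq_len, |C|)
            # -log(1 - p): near zero when the model avoids the penalised tokens.
            return -torch.log1p(-penalised.clamp(max=1 - 1e-6)).sum(dim=-1).mean()

        logits = torch.randn(2, 5, 100)                          # toy batch of logits
        print(unlikelihood_loss(logits, torch.tensor([7, 42])))  # hypothetical "complex" ids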

    Signaling coherence relations in text generation: A case study of German temporal discourse markers

    This thesis addresses the question of discourse marker choice in automatic (multilingual) text generation (MLG), in particular the issue of signaling temporal coherence relations on the linguistic surface by means of discourse markers such as nachdem, als, and bevor. Current text generation systems do not pay attention to the fine-grained differences in meaning (semantic and pragmatic) between similar discourse markers. Yet choosing the appropriate marker in a given context requires detailed knowledge of the function and form of a wide range of discourse markers, and a generation architecture that integrates discourse marker choice into the overall generation process. This thesis makes contributions to these two distinct areas of research. (1) Linguistic description and representation: the thesis provides a comprehensive analysis of the semantic, pragmatic and syntactic properties of German temporal discourse markers. The results are merged into a functional classification of German temporal conjunctive relations (following the Systemic Functional Linguistics (SFL) approach to language). This classification is compared to existing accounts for English and Dutch. Further, the thesis addresses the question of the nature of coherence relations and proposes a paradigmatic description of coherence relations along three dimensions (ideation, interpersonal, textual), yielding composite coherence relations. (2) Discourse marker choice in text generation: the thesis proposes a discourse marker lexicon as a generic resource for storing discourse marker meaning and usage, and defines the shape of individual lexicon entries and the global organisation of the lexicon. Sample entries for German and English temporal discourse markers are given. Finally, a computational model for automatic discourse marker choice that exploits the discourse marker lexicon is presented.
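
    As an informal illustration of what a discourse marker lexicon entry might hold, the data class below groups semantic, pragmatic and syntactic information for one marker. The field names and the sample values for nachdem are assumptions for exposition, not the thesis's actual entry format.

        # Illustrative shape of a discourse marker lexicon entry (hypothetical fields).
        from dataclasses import dataclass, field

        @dataclass
        class DiscourseMarkerEntry:
            marker: str
            language: str
            relation: str                      # coherence relation signalled
            semantic_features: dict = field(default_factory=dict)
            pragmatic_features: dict = field(default_factory=dict)
            syntax: dict = field(default_factory=dict)

        nachdem = DiscourseMarkerEntry(
            marker="nachdem",
            language="de",
            relation="temporal:anteriority",
            semantic_features={"order": "subordinate-clause event precedes matrix-clause event"},
            pragmatic_features={"register": "neutral"},
            syntax={"category": "subordinating conjunction"},
        )
        print(nachdem.marker, nachdem.relation)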

    Conceptually related lexicon clustering based on word context association mining.

    Automatic lexicon generation is a useful task in learning text fragment patterns. In our previous work, we focused on text fragment pattern learning through the fuzzy grammar method, whose inputs include a predefined lexicon and text fragments that represent the expression of the grammar class to be learned. However, the bottleneck in creating fuzzy grammars, as with other text learners, often lies in the knowledge acquisition phase, because text annotation is labour-intensive and demands skill and background knowledge of the text. For this reason, a semi-automated technique called the automatic Terminal Grammar Recommender (TGR) is devised to identify conceptually related lexicons in the texts and to create terminal grammars from them by mining associations of word contexts. The approach recognizes that there is a degree of local structure within such text, and the technique exploits this local structure without the large computational overhead of deeper analysis. We report a comparison of the associative words detected by TGR against the definitions of the General Inquirer, a content category tool, on data from the European Central Bank. Our findings show that the proposed method reduces the manual effort of identifying conceptually similar lexicons to form terminal grammars. On average, 54.85% of the generated terminal grammar clusters match the General Inquirer, indicating that at least half of the expensive effort of constructing a conceptually related lexicon is saved. This hints at the potential of word context association mining in automated conceptual lexicon generation.
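
    The word-context association idea can be sketched as grouping words that occur in the same immediate contexts. The toy sentences, the context window of one word on each side, and the sharing criterion are illustrative assumptions, not the TGR technique itself.

        # Sketch: relate words that share immediate (left, right) contexts.
        from collections import defaultdict
        from itertools import combinations

        sentences = [
            "interest rate rises sharply",
            "exchange rate rises sharply",
            "interest rate falls slightly",
            "exchange rate falls slightly",
        ]

        contexts = defaultdict(set)
        for sent in sentences:
            toks = sent.split()
            for i, w in enumerate(toks):
                left = toks[i - 1] if i > 0 else "<s>"
                right = toks[i + 1] if i + 1 < len(toks) else "</s>"
                contexts[w].add((left, right))

        # Words are treated as conceptually related here if they share a context.
        for a, b in combinations(sorted(contexts), 2):
            shared = contexts[a] & contexts[b]
            if shared:
                print(a, "~", b, shared)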

    Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition

    In this work we present a framework for the recognition of natural scene text. Our framework does not require any human-labelled data, and performs word recognition on the whole image holistically, departing from the character-based recognition systems of the past. The deep neural network models at the centre of this framework are trained solely on data produced by a synthetic text generation engine -- synthetic data that is highly realistic and sufficient to replace real data, giving us infinite amounts of training data. This excess of data exposes new possibilities for word recognition models, and here we consider three models, each one "reading" words in a different way: via 90k-way dictionary encoding, character sequence encoding, and bag-of-N-grams encoding. In the scenarios of language-based and completely unconstrained text recognition we greatly improve upon state-of-the-art performance on standard datasets, using our fast, simple machinery and requiring zero data-acquisition costs.
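
    The bag-of-N-grams encoding, the third of the three reading models, can be sketched as a binary presence vector of character n-grams over a fixed n-gram vocabulary. The tiny vocabulary built from three words below is an illustrative assumption; the paper's encoding operates over a far larger n-gram set.

        # Sketch: encode a word as presence/absence of character n-grams.
        def char_ngrams(word, n_values=(1, 2, 3)):
            grams = set()
            for n in n_values:
                grams.update(word[i:i + n] for i in range(len(word) - n + 1))
            return grams

        NGRAM_VOCAB = sorted(char_ngrams("text") | char_ngrams("test") | char_ngrams("next"))

        def encode(word):
            grams = char_ngrams(word)
            return [1 if g in grams else 0 for g in NGRAM_VOCAB]

        print(encode("text"))   # binary vector over the toy n-gram vocabulary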