Morphological annotation of Korean with Directly Maintainable Resources
This article describes an exclusively resource-based method of morphological
annotation of written Korean text. Korean is an agglutinative language. Our
annotator is designed to process text before the operation of a syntactic
parser. In its present state, it annotates one-stem words only. The output is a
graph of morphemes annotated with accurate linguistic information. The
granularity of the tagset is 3 to 5 times finer than that of usual tagsets. A
comparison with a reference annotated corpus showed that it achieves 89% recall
without any corpus training. The language resources used by the system are
lexicons of stems, transducers of suffixes and transducers of generation of
allomorphs. All can be easily updated, which lets users control how the
system's performance evolves. It has been claimed that
morphological annotation of Korean text could only be performed by a
morphological analysis module accessing a lexicon of morphemes. We show that it
can also be performed directly with a lexicon of words and without applying
morphological rules at annotation time, which speeds up annotation to 1,210
words/s. The lexicon of words is obtained from the maintainable language
resources through a fully automated compilation process.
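The 89% recall figure above is measured against a reference annotated corpus. A minimal sketch of such a recall measure, using hypothetical Korean morpheme annotations (stem and suffix tags are illustrative, not from the paper), might look like:

```python
def recall(predicted, reference):
    """Fraction of reference annotations recovered by the annotator."""
    predicted_set = set(predicted)
    return sum(1 for ann in reference if ann in predicted_set) / len(reference)

# Hypothetical morpheme annotations for one word: (romanized form, POS tag).
reference = [("mek", "VV"), ("ess", "EP"), ("ta", "EF")]
predicted = [("mek", "VV"), ("ess", "EP")]

print(round(recall(predicted, reference), 2))  # 0.67
```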
A Lexicon of Connected Components for Arabic Optical Text Recognition
Arabic is a cursive script in which character segmentation is difficult. Hence, we suggest a unit that is discrete in nature, viz. the connected component, for Arabic text recognition. A lexicon listing valid Arabic connected components is necessary for any system that uses such a unit. Here, we produce and analyze a comprehensive lexicon of connected components.
A lexicon can be extracted from corpora or synthesized from morphemes. We follow both approaches and merge their results. In addition, generating a lexicon of connected components involves extra tokenization and point-normalization steps to keep the size of the lexicon tractable. We produce a lexicon of surface words, reduce it into a lexicon of connected components, and finally into a lexicon of point-normalized connected components. The lexicon of point-normalized connected components contains 684,743 entries, a decrease of 97.17% from the word lexicon.
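The reported reduction is a straightforward percent-decrease between the two lexicon sizes. A sketch of that computation, where the final lexicon size is from the abstract but the surface-word lexicon size is an assumed round figure for illustration:

```python
def percent_decrease(old_size, new_size):
    """Percentage reduction when going from a larger lexicon to a smaller one."""
    return 100.0 * (old_size - new_size) / old_size

cc_lexicon = 684_743        # point-normalized connected-component lexicon (from the abstract)
word_lexicon = 24_200_000   # surface-word lexicon size: assumed for illustration
print(f"{percent_decrease(word_lexicon, cc_lexicon):.2f}%")  # 97.17%
```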
Generating a Malay sentiment lexicon based on wordnet
A sentiment lexicon is a list of positive and negative words. In opinion mining, a sentiment lexicon is one of the important resources for the text polarity classification task in a sentiment analysis model. Studies in Malay sentiment analysis are increasing as the volume of sentiment data grows on social media, so the demand for a Malay sentiment lexicon is high. However, Malay sentiment lexicon development is a difficult task due to the scarcity of Malay language resources, and various approaches and techniques have been used to generate sentiment lexicons. The objective of this paper is to develop a Malay sentiment lexicon generation algorithm based on WordNet. In this study, the method maps WordNet Bahasa onto the English WordNet to obtain the offset values of a seed set of sentiment words. The seed set is then expanded through the synonym and antonym semantic relations in the English WordNet. The best result achieves 86.58% agreement with human annotators and a 91.31% F1-measure in word polarity classification. The results show the effectiveness of the proposed algorithm in generating a Malay sentiment lexicon based on WordNet.
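Seed-set expansion over synonym and antonym relations is a standard technique for WordNet-based lexicon generation: synonyms inherit the seed's polarity and antonyms flip it. A minimal sketch of this propagation, using toy English stand-ins rather than actual WordNet synsets:

```python
from collections import deque

def expand_seeds(seeds, synonyms, antonyms):
    """Propagate polarity from a seed set: synonyms keep polarity, antonyms flip it."""
    polarity = dict(seeds)  # word -> +1 (positive) or -1 (negative)
    queue = deque(polarity)
    while queue:
        word = queue.popleft()
        for syn in synonyms.get(word, []):
            if syn not in polarity:
                polarity[syn] = polarity[word]
                queue.append(syn)
        for ant in antonyms.get(word, []):
            if ant not in polarity:
                polarity[ant] = -polarity[word]
                queue.append(ant)
    return polarity

# Toy relations for illustration only; a real system would query WordNet.
synonyms = {"good": ["fine"], "bad": ["poor"]}
antonyms = {"good": ["bad"]}
print(expand_seeds({"good": 1}, synonyms, antonyms))
# {'good': 1, 'fine': 1, 'bad': -1, 'poor': -1}
```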
Sentiment Classification for Film Reviews in Gujarati Text Using Machine Learning and Sentiment Lexicons
In this paper, two techniques for sentiment classification of Gujarati text film reviews are proposed: Gujarati Lexicon Sentiment Analysis (GLSA) and Gujarati Machine Learning Sentiment Analysis (GMLSA). Five different datasets were produced to validate the accuracy of the machine learning-based and lexicon-based methods. The lexicon-based approach employs a sentiment lexicon known as GujSentiWordNet, which identifies sentiments with a sentiment score for feature generation, while the machine learning-based approach uses five classifiers: logistic regression (LR), random forest (RF), k-nearest neighbors (KNN), support vector machine (SVM), and naive Bayes (NB), with TF-IDF and count vectorization for feature extraction. Experiments were carried out and the results were compared using accuracy, precision, recall, and F-score as performance evaluation criteria. According to the test results, the machine learning-based technique improved accuracy by 3 to 10% on average compared to the lexicon-based approach.
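A lexicon-based classifier of the GLSA kind scores a review by summing per-token sentiment values and taking the sign. A minimal sketch, with a toy romanized stand-in for a GujSentiWordNet-style lexicon (the words and scores are illustrative, not from the paper):

```python
def lexicon_sentiment(tokens, lexicon):
    """Lexicon-based polarity: sum per-token sentiment scores; the sign gives the label."""
    score = sum(lexicon.get(tok, 0.0) for tok in tokens)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Toy lexicon for illustration only.
lexicon = {"saras": 0.8, "kharab": -0.7, "majedar": 0.6}
print(lexicon_sentiment(["film", "saras", "majedar"], lexicon))  # positive
print(lexicon_sentiment(["film", "kharab"], lexicon))            # negative
```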
Medical Text Simplification: Optimizing for Readability with Unlikelihood Training and Reranked Beam Search Decoding
Text simplification has emerged as an increasingly useful application of AI
for bridging the communication gap in specialized fields such as medicine,
where the lexicon is often dominated by technical jargon and complex
constructs. Despite notable progress, methods in medical simplification
sometimes result in the generated text having lower quality and diversity. In
this work, we explore ways to further improve the readability of text
simplification in the medical domain. We propose (1) a new unlikelihood loss
that encourages generation of simpler terms and (2) a reranked beam search
decoding method that optimizes for simplicity, which achieve better performance
on readability metrics on three datasets. This study's findings offer promising
avenues for improving text simplification in the medical field. Comment: EMNLP 2023 Findings
Signaling coherence relations in text generation: A case study of German temporal discourse markers
This thesis addresses the question of discourse marker choice in automatic (multilingual) text generation (MLG), in particular the issue of signaling temporal coherence relations on the linguistic surface by means of discourse markers such as nachdem, als, bevor. Current text generation systems do not pay attention to the fine-grained differences in meaning (semantic and pragmatic) between similar discourse markers. Yet, choosing the appropriate marker in a given context requires detailed knowledge of the function and form of a wide range of discourse markers, and a generation architecture that integrates discourse marker choice into the overall generation process. This thesis makes contributions to these two distinct areas of research. (1) Linguistic description and representation: The thesis provides a comprehensive analysis of the semantic, pragmatic and syntactic properties of German temporal discourse markers. The results are merged into a functional classification of German temporal conjunctive relations (following the Systemic Functional Linguistics (SFL) approach to language). This classification is compared to existing accounts for English and Dutch. Further, the thesis addresses the question of the nature of coherence relations and proposes a paradigmatic description of coherence relations along three dimensions (ideational, interpersonal, textual), yielding composite coherence relations. (2) Discourse marker choice in text generation: The thesis proposes a discourse marker lexicon as a generic resource for storing discourse marker meaning and usage, and defines the shape of individual lexicon entries and the global organisation of the lexicon. Sample entries for German and English temporal discourse markers are given. Finally, a computational model for automatic discourse marker choice that exploits the discourse marker lexicon is presented.
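A discourse marker lexicon of the kind proposed pairs each marker with the features of the coherence relation it can signal, so the generator can select a marker by feature match. The entry shape below is a hypothetical sketch, not the thesis's actual entry format:

```python
# Hypothetical discourse-marker lexicon entries (illustrative shape only).
LEXICON = [
    {"marker": "nachdem", "relation": "temporal", "order": "after",        "syntax": "subordinator"},
    {"marker": "bevor",   "relation": "temporal", "order": "before",       "syntax": "subordinator"},
    {"marker": "als",     "relation": "temporal", "order": "simultaneous", "syntax": "subordinator"},
]

def choose_marker(relation, order):
    """Pick the first marker whose features match the coherence relation to signal."""
    for entry in LEXICON:
        if entry["relation"] == relation and entry["order"] == order:
            return entry["marker"]
    return None

print(choose_marker("temporal", "before"))  # bevor
```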
Conceptually related lexicon clustering based on word context association mining.
Automatic lexicon generation is a useful task in learning text fragment patterns. In our previous work we focused on text fragment pattern learning through the fuzzy grammar method, whose inputs include a predefined lexicon and text fragments that represent expressions of the grammar class to be learned. However, the bottleneck in fuzzy grammar creation, as with other text learners, often lies in the knowledge acquisition phase, because text annotation is labour-intensive and demands skills and background knowledge of the text. For this reason, a semi-automated technique called the automatic Terminal Grammar Recommender (TGR) is devised to identify conceptually related lexicons in the texts and to create terminal grammars by mining associations of word contexts. The approach recognizes that there is a degree of local structure within such text, and the technique exploits this local structure without the large computational overhead of deeper analysis. Results from comparing the associative words detected by TGR with the definitions of a content category tool called the General Inquirer, on European Central Bank data, are reported. Our findings show that the proposed method reduces the manual effort of identifying conceptually similar lexicons to form terminal grammars. The average match between generated terminal grammar clusters and the General Inquirer is 54.85%, which indicates that at least half the expensive effort to construct a conceptually related lexicon is saved. This hints at the potential of word context association mining in automated conceptual lexicon generation.
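Mining word-context associations typically means counting which words co-occur in the same local context and scoring the pairs, e.g. by pointwise mutual information. A minimal sketch of such scoring over toy financial sentences (the exact association measure used by TGR is not specified here, so PMI is an assumption):

```python
import math
from collections import Counter
from itertools import combinations

def association_scores(sentences):
    """PMI between word pairs that co-occur in the same sentence (the local context)."""
    word_counts, pair_counts = Counter(), Counter()
    for sent in sentences:
        tokens = set(sent.split())
        word_counts.update(tokens)
        for a, b in combinations(sorted(tokens), 2):
            pair_counts[(a, b)] += 1
    n = len(sentences)
    return {(a, b): math.log(c * n / (word_counts[a] * word_counts[b]))
            for (a, b), c in pair_counts.items()}

sentences = ["interest rate rises", "interest rate falls",
             "stock price falls", "stock price rises"]
scores = association_scores(sentences)
print(round(scores[("interest", "rate")], 3))  # 0.693  (= ln 2, strongly associated)
print(round(scores[("rate", "rises")], 3))     # 0.0    (chance-level co-occurrence)
```

Pairs that always appear together ("interest"/"rate", "price"/"stock") score above zero, forming candidate conceptually related clusters.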
Synthetic Data and Artificial Neural Networks for Natural Scene Text Recognition
In this work we present a framework for the recognition of natural scene
text. Our framework does not require any human-labelled data, and performs word
recognition on the whole image holistically, departing from the character based
recognition systems of the past. The deep neural network models at the centre
of this framework are trained solely on data produced by a synthetic text
generation engine -- synthetic data that is highly realistic and sufficient to
replace real data, giving us infinite amounts of training data. This excess of
data exposes new possibilities for word recognition models, and here we
consider three models, each one "reading" words in a different way: via 90k-way
dictionary encoding, character sequence encoding, and bag-of-N-grams encoding.
In the scenarios of language based and completely unconstrained text
recognition we greatly improve upon state-of-the-art performance on standard
datasets, using our fast, simple machinery and requiring zero data-acquisition
costs.
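The bag-of-N-grams encoding represents a word as a binary presence vector over a fixed vocabulary of character N-grams, which the network then predicts holistically. A minimal sketch with a tiny hypothetical N-gram vocabulary:

```python
def char_ngrams(word, n_max=3):
    """All character n-grams of a word, lengths 1..n_max."""
    return {word[i:i + n]
            for n in range(1, n_max + 1)
            for i in range(len(word) - n + 1)}

def bag_of_ngrams(word, vocab):
    """Binary presence vector of the word's n-grams over a fixed vocabulary."""
    grams = char_ngrams(word)
    return [1 if g in grams else 0 for g in vocab]

vocab = ["sp", "ot", "spo", "xyz"]  # tiny illustrative vocabulary
print(bag_of_ngrams("spot", vocab))  # [1, 1, 1, 0]
```

In the paper's setting the vocabulary is far larger, and the model outputs one independent presence probability per N-gram rather than a hard 0/1.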