Automatic rule learning exploiting morphological features for named entity recognition in Turkish
Named entity recognition (NER) is one of the basic tasks in automatic extraction of information from natural language texts. In this paper, we describe an automatic rule learning method that exploits different features of the input text to identify the named entities located in the natural language texts. Moreover, we explore the use of morphological features for extracting named entities from Turkish texts. We believe that the developed system can also be used for other agglutinative languages. The paper also provides a comprehensive overview of the field by reviewing the NER research literature. We conducted our experiments on the TurkIE dataset, a corpus of articles collected from different Turkish newspapers. Our method achieved an average F-score of 91.08% on the dataset. The results of the comparative experiments demonstrate that the developed technique is successfully applicable to the task of automatic NER and that exploiting morphological features can significantly improve NER for Turkish, an agglutinative language. © The Author(s) 2011
Named Entity Recognition in Turkish with Bayesian Learning and Hybrid Approaches
Named entity recognition is one of the significant textual information extraction tasks. In this paper, we present two approaches for named entity recognition on Turkish texts. The first is a Bayesian learning approach which is trained on a considerably limited training set. The second approach comprises two hybrid systems based on joint utilization of this Bayesian learning approach and a previously proposed rule-based named entity recognizer. All three proposed approaches achieve promising performance rates. This paper is significant as it reports the first use of the Bayesian approach for the task of named entity recognition on Turkish texts, for which practical approaches in particular are still insufficient.
Named Entity Recognition on Turkish Tweets
Various recent studies show that the performance of named entity recognition (NER) systems developed for well-formed text types drops significantly when applied to tweets. The only existing study for the highly inflected agglutinative language Turkish reports a drop in F-Measure from 91% to 19% when ported from news articles to tweets. In this study, we present a new named entity-annotated tweet corpus and a detailed analysis of the various tweet-specific linguistic phenomena. We perform comparative NER experiments with a rule-based multilingual NER system adapted to Turkish on three corpora: a news corpus, our new tweet corpus, and another tweet corpus. Based on the analysis and the experimentation results, we suggest system features required to improve NER results for social media like Twitter.
JRC.G.2-Global security and crisis managemen
Experiments to Improve Named Entity Recognition on Turkish Tweets
Social media texts are significant information sources for several
application areas including trend analysis, event monitoring, and opinion
mining. Unfortunately, existing solutions for tasks such as named entity
recognition that perform well on formal texts usually perform poorly when
applied to social media texts. In this paper, we report on experiments that
have the purpose of improving named entity recognition on Turkish tweets, using
two different annotated data sets. In these experiments, starting with a
baseline named entity recognition system, we adapt its recognition rules and
resources to better fit Twitter language by relaxing its capitalization
constraint and by diacritics-based expansion of its lexical resources, and we
employ a simplistic normalization scheme on tweets to observe the effects of
these on the overall named entity recognition performance on Turkish tweets.
The evaluation results of the system with these different settings are provided
with discussions of these results.
Comment: appears in Proceedings of the EACL Workshop on Language Analysis for
Social Media, 201
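The two adaptations named in the abstract above, relaxing the capitalization constraint and diacritics-based expansion of lexical resources, can be illustrated with a minimal sketch. It is not the authors' system: the function names, the character mapping, and the single-token lookup are all assumptions made for illustration, and the sample lexicon entry is invented.

```python
# Hypothetical mapping from Turkish letters with diacritics to ASCII look-alikes,
# as commonly typed in informal tweets.
ASCII_FOLD = str.maketrans("çğıöşüÇĞİÖŞÜ", "cgiosuCGIOSU")

def strip_diacritics(text):
    """Replace Turkish-specific letters with their ASCII counterparts."""
    return text.translate(ASCII_FOLD)

def expand_lexicon(entries):
    """Return the entries plus a diacritics-free variant of each entry."""
    expanded = set(entries)
    expanded.update(strip_diacritics(e) for e in entries)
    return expanded

def match_tokens(tokens, lexicon, require_capital=True):
    """Return tokens found in the lexicon.

    With require_capital=False the capitalization constraint is relaxed.
    Note: str.lower() is not Turkish-aware (I/ı, İ/i), a simplification here.
    """
    keys = {e.lower() for e in lexicon}
    hits = []
    for tok in tokens:
        if require_capital and not tok[:1].isupper():
            continue  # strict mode: entity candidates must start uppercase
        if tok.lower() in keys:
            hits.append(tok)
    return hits

tokens = "bugun besiktas maci var".split()
lexicon = expand_lexicon({"Beşiktaş"})
print(match_tokens(tokens, lexicon, require_capital=False))
```

In this sketch the lowercase, diacritics-free token is only found when both adaptations are active; with the original lexicon or with the capitalization constraint in place, nothing matches.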
Automating information extraction task for Turkish texts
Ankara : The Department of Computer Engineering and the Institute of Engineering and Science of Bilkent University, 2011.
Thesis (Ph. D.) -- Bilkent University, 2011.
Includes bibliographical references leaves 85-97.
Throughout history, mankind has often suffered from a lack of necessary resources.
In today’s information world, the challenge can sometimes be a wealth
of resources. That is to say, an excessive amount of information implies the need
to find and extract necessary information. Information extraction can be defined
as the identification of selected types of entities, relations, facts or events in a set
of unstructured text documents in a natural language.
The goal of our research is to build a system that automatically locates and
extracts information from Turkish unstructured texts. Our study focuses on
two basic Information Extraction (IE) tasks: Named Entity Recognition and
Entity Relation Detection. Named Entity Recognition, finding named entities
(persons, locations, organizations, etc.) located in unstructured texts, is one of
the most fundamental IE tasks. The Entity Relation Detection task tries to identify
relationships between entities mentioned in text documents.
Using a supervised learning strategy, the developed systems start with a set
of examples collected from a training dataset and generate the extraction rules
from the given examples by using a carefully designed coverage algorithm. Moreover,
several rule filtering and rule refinement techniques are utilized to maximize
generalization and accuracy at the same time. In order to obtain accurate generalization,
we use several syntactic and semantic features of the text, including:
orthographical, contextual, lexical and morphological features. In particular,
morphological features of the text are effectively used in this study to increase
the extraction performance for Turkish, an agglutinative language. Since the system
does not rely on handcrafted rules/patterns, it does not suffer heavily from the
domain adaptability problem.
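As a rough illustration of how a coverage algorithm can generate rules from labeled examples, the following sketch uses generic greedy sequential covering. It is not the thesis's algorithm: the single-feature rules, the precision threshold, the feature names (such as `morph:Prop`), and the toy examples are all assumptions.

```python
def learn_rules(examples, min_precision=0.9):
    """Greedy sequential covering over single-feature rules.

    examples: list of (features, label) pairs, where features is a set of
    strings and label is True for a named-entity token.
    """
    uncovered = [feats for feats, label in examples if label]
    candidates = {f for feats in uncovered for f in feats}
    rules = []
    while uncovered:
        best, best_cover = None, 0
        for feat in sorted(candidates):  # sorted for deterministic tie-breaking
            labels = [label for feats, label in examples if feat in feats]
            precision = sum(labels) / len(labels)
            cover = sum(1 for feats in uncovered if feat in feats)
            # Rule filtering: keep only sufficiently precise rules.
            if precision >= min_precision and cover > best_cover:
                best, best_cover = feat, cover
        if best is None:
            break  # no acceptable rule covers the remaining positives
        rules.append(best)
        uncovered = [feats for feats in uncovered if best not in feats]
    return rules

# Toy training data with hypothetical orthographic, contextual and
# morphological features.
examples = [
    ({"cap", "morph:Prop"}, True),
    ({"cap", "morph:Prop", "prev:sayın"}, True),
    ({"cap", "prev:sayın"}, True),
    ({"cap", "sent_start"}, False),
    ({"morph:Noun"}, False),
]
print(learn_rules(examples))
```

On this toy data, capitalization alone is rejected as too imprecise, while the morphological and contextual features together cover all positive examples.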
The results of the conducted experiments show that (1) the developed systems
are successfully applicable to the Named Entity Recognition and Entity Relation
Detection tasks, and (2) exploiting morphological features can significantly improve
the performance of information extraction from Turkish, an agglutinative
language.
Tatar, Serhan
Ph.D
Impact of Tokenization on Language Models: An Analysis for Turkish
Tokenization is an important text preprocessing step to prepare input tokens
for deep language models. WordPiece and BPE are de facto methods employed by
important models, such as BERT and GPT. However, the impact of tokenization can
be different for morphologically rich languages, such as Turkic languages,
where many words can be generated by adding prefixes and suffixes. We compare
five tokenizers at different granularity levels, i.e. their outputs vary from
smallest pieces of characters to the surface form of words, including a
Morphological-level tokenizer. We train these tokenizers and pretrain
medium-sized language models using the RoBERTa pretraining procedure on the Turkish
split of the OSCAR corpus. We then fine-tune our models on six downstream
tasks. Our experiments, supported by statistical tests, reveal that the
Morphological-level tokenizer performs competitively with de facto
tokenizers. Furthermore, we find that increasing the vocabulary size improves
the performance of Morphological and Word-level tokenizers more than that of de
facto tokenizers. The ratio of the number of vocabulary parameters to the total
number of model parameters can be empirically chosen as 20% for de facto
tokenizers and 40% for other tokenizers to obtain a reasonable trade-off
between model size and performance.
Comment: submitted to ACM TALLI
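The ratio of vocabulary parameters to total model parameters mentioned in the abstract above can be worked through with a small sketch. It assumes that "vocabulary parameters" means a single token embedding matrix of size vocabulary × hidden dimension; the abstract does not define the term, and the function names and numbers below are hypothetical.

```python
def vocab_ratio(vocab_size, hidden_dim, other_params):
    """Share of parameters held by the token embedding matrix.

    Assumes vocabulary parameters = vocab_size * hidden_dim (one embedding
    matrix); other_params counts every remaining model parameter.
    """
    vocab_params = vocab_size * hidden_dim
    return vocab_params / (vocab_params + other_params)

def vocab_size_for_ratio(target_ratio, hidden_dim, other_params):
    """Vocabulary size that yields the target ratio, solved from
    V*d / (V*d + other) = r, i.e. V = r * other / ((1 - r) * d)."""
    return round(target_ratio * other_params / ((1 - target_ratio) * hidden_dim))

# Hypothetical small model: hidden size 100, 400,000 non-embedding parameters.
print(vocab_ratio(vocab_size=1000, hidden_dim=100, other_params=400_000))
print(vocab_size_for_ratio(0.2, hidden_dim=100, other_params=400_000))
```

Under these assumptions, moving the target from 20% to 40% at a fixed model body requires a noticeably larger vocabulary, which is the model size versus performance trade-off the abstract describes.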