
    Measuring Word Frequency in Language Teaching Textbooks Using LexiTürk

    Vocabulary is a fundamental component of language use, and research into its interaction with other aspects of language competence is an essential topic in language teaching. There are strong relationships between vocabulary and measures of language proficiency: learners with larger vocabularies perform better across a variety of language skills than those with smaller vocabularies. Vocabulary knowledge can therefore be considered inextricably tied to overall language proficiency, which suggests that the number of words we know affects how much of a text we can comprehend. The more often a word is used, the more polysemy and irregular morphology it is likely to have. How widely a word is used is one of its quantifiable properties, and on the basis of this attribute a word's prevalence and frequency can serve as a guiding reference. Analyzing how often a specific word or phrase recurs is the most basic kind of corpus analysis. Words frequently occur together, forming collocations, colligations, and other word combinations, also known as chunks, n-grams, or lexical bundles; exploring such patterns is another kind of corpus analysis. Frequency is particularly useful when deciding which words should be prioritized for language learners: if this model is adopted, foreign-language students are taught essentially the most frequently used words. Developing and applying a measurement and analysis tool can help developers and researchers with the dilemma of preferences that arises from the unavoidable reliance on a word corpus. This study discusses building a corpus and then shaping a language textbook's content for teaching with the salient, high-frequency words from that corpus, and presents a tool created by the author to the scientific community.
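
As a rough illustration of the frequency and n-gram counting the abstract describes, the sketch below counts word and bigram frequencies in a small text. The tokenizer and the sample sentence are placeholder assumptions for illustration, not part of LexiTürk itself.

```python
from collections import Counter
import re

def tokenize(text):
    # Crude tokenizer: lowercase and split on non-letters. A real Turkish
    # tool would also need locale-aware casing and morphological analysis.
    return [t for t in re.split(r"[^a-zçğıöşü]+", text.lower()) if t]

def ngrams(tokens, n):
    # Sliding window of n consecutive tokens (chunks / lexical bundles).
    return zip(*(tokens[i:] for i in range(n)))

corpus = "kelime bilgisi dil kullanımının temel bir bileşenidir ve kelime bilgisi önemlidir"
tokens = tokenize(corpus)

word_freq = Counter(tokens)               # single-word frequency list
bigram_freq = Counter(ngrams(tokens, 2))  # two-word bundles

print(word_freq.most_common(3))
print(bigram_freq.most_common(3))
```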

    Issue Report Validation in an Industrial Context

    Effective issue triaging is crucial for software development teams to improve software quality, and thus customer satisfaction. Validating issue reports manually can be time-consuming, hindering the overall efficiency of the triaging process. This paper presents an approach to automating the validation of issue reports in order to accelerate the issue triaging process in an industrial setting. We work on 1,200 randomly selected issue reports in the banking domain, written in Turkish, an agglutinative language, meaning that new words can be formed by the linear concatenation of suffixes to express entire sentences. We manually label these reports for validity and extract the relevant patterns indicating that they are invalid. Since the issue reports we work on are written in an agglutinative language, we use morphological analysis to extract the features. Using the proposed feature extractors, we apply a machine-learning-based approach to predict the issue reports' validity, achieving an F1-score of 0.77. Accepted for publication in the Proceedings of the 31st ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE '23).
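
The abstract's pipeline shape (feature extraction, a supervised classifier, F1 evaluation) could be sketched as below. This is a generic stand-in assuming scikit-learn, with character n-gram features as a rough proxy for the paper's morphological features, and made-up placeholder reports and labels.

```python
# Minimal stand-in for the pipeline shape: features + classifier + F1.
# The paper derives features from Turkish morphological analysis; character
# n-grams are used here only as a crude proxy for sub-word structure.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import f1_score

train_reports = ["uygulama hata veriyor", "yetki talebi",
                 "ekran donuyor", "şifre sıfırlama isteği"]
train_labels = [1, 0, 1, 0]   # 1 = valid report, 0 = invalid (placeholders)
test_reports = ["rapor ekranı hata veriyor", "erişim talebi"]
test_labels = [1, 0]

clf = make_pipeline(
    TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4)),  # sub-word features
    LogisticRegression(max_iter=1000),
)
clf.fit(train_reports, train_labels)
print(f1_score(test_labels, clf.predict(test_reports)))
```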

    A tree-based approach for English-to-Turkish translation

    In this paper, we present our English-to-Turkish translation methodology, which adopts a tree-based approach. Our approach relies on tree analysis and the application of structural modification rules to obtain the target-side (Turkish) trees from the source-side (English) ones. We also use morphological analysis to get candidate root words and apply tree-based rules to obtain the agglutinated target words. Compared to earlier work on English-to-Turkish translation using phrase-based models, we obtain higher BLEU scores in the current study: our syntactic subtree permutation strategy, combined with a word replacement algorithm, provides a 67% relative improvement, from a baseline of 12.8 to 21.4 BLEU, averaged over 10-fold cross-validation. As future work, improvements in choosing the correct senses and structural rules are needed. This work was supported by TUBITAK project 116E104.
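
To make the idea of structural modification rules concrete, here is a toy sketch (assuming NLTK's Tree class, which the paper does not necessarily use) of a single rule that permutes a verb–object subtree so an English SVO tree approaches Turkish SOV order. A real system would apply many such rules plus morphological generation of the agglutinated target words.

```python
from nltk import Tree

def svo_to_sov(tree):
    # Toy structural rule: inside a two-child VP, move the verb after its
    # object, approximating Turkish subject-object-verb order.
    if isinstance(tree, Tree):
        if tree.label() == "VP" and len(tree) == 2:
            tree[0], tree[1] = tree[1], tree[0]  # permute the subtrees
        for child in tree:
            svo_to_sov(child)
    return tree

t = Tree.fromstring("(S (NP I) (VP (V read) (NP (DT the) (N book))))")
print(" ".join(svo_to_sov(t).leaves()))  # -> "I the book read"
```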

    MS-TR: A Morphologically Enriched Sentiment Treebank and Recursive Deep Models for Compositional Semantics in Turkish

    Recursive deep models have been used as powerful models to learn compositional representations of text for many natural language processing tasks. However, they require structured input (i.e. a sentiment treebank) that encodes sentences in a tree-based structure, enabling them to learn the latent semantics of words using recursive composition functions. In this paper, we present our contributions and efforts toward the construction of a Turkish sentiment treebank. We introduce MS-TR, a Morphologically Enriched Sentiment Treebank, built for training recursive deep models to address compositional sentiment analysis for Turkish, one of the well-known morphologically rich languages (MRLs). We propose semi-supervised automatic annotation, as a distant-supervision approach, using the morphological features of words to infer the polarity of the inner nodes of MS-TR as positive or negative. The proposed annotation model has four annotation levels: morph-level, stem-level, token-level, and review-level. Each level's contribution was tested on three domain datasets: product reviews, movie reviews, and essays from the Turkish Natural Corpus. Comparative results were obtained with the Recursive Neural Tensor Network (RNTN) model operating over MS-TR and with conventional machine learning methods. Experiments show that RNTN achieved much better accuracy than the baseline methods, which cannot accurately capture aggregated sentiment information.
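
For readers unfamiliar with recursive composition, the sketch below implements the standard RNTN composition function from the literature with NumPy over a tiny hand-built tree. The dimensions, random weights, and example words are illustrative assumptions; in the actual model the parameters are trained and a classifier predicts sentiment at every inner node.

```python
import numpy as np

d = 4                                    # embedding size (illustrative)
rng = np.random.default_rng(0)
W = rng.normal(size=(d, 2 * d))          # standard recursive weight
V = rng.normal(size=(d, 2 * d, 2 * d))   # tensor slices of the RNTN

def compose(a, b):
    # RNTN composition: p = tanh(x^T V x + W x), with x = [a; b].
    x = np.concatenate([a, b])
    tensor_term = np.array([x @ V[k] @ x for k in range(d)])
    return np.tanh(tensor_term + W @ x)

def encode(node, embed):
    # node is either a word (leaf) or a (left, right) pair of subtrees.
    if isinstance(node, str):
        return embed[node]
    return compose(encode(node[0], embed), encode(node[1], embed))

embed = {w: rng.normal(size=d) for w in ["film", "çok", "güzel"]}
print(encode(("film", ("çok", "güzel")), embed))  # root vector of the tree
```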

    Natural Language Parsing: Progress and Challenges

    Natural language parsing is the task of automatically obtaining the syntactic structure of sentences written in a human language. Parsing is a crucial step for language processing systems that need to extract meaning from text or speech, and thus a key technology of artificial intelligence. This article presents an outline of the current state of the art in this field, as well as reflections on the main challenges that, in the author's opinion, it is currently facing: limitations in accuracy on especially difficult languages and domains, psycholinguistic adequacy, and speed. Funding: Xunta de Galicia, ED431B 2017/01; Ministerio de Economía y Competitividad, FFI2014-51978-C2-2-R; Ministerio de Economía y Competitividad, TIN2017-85160-C2-1-.

    Formation Control of Multiple Autonomous Mobile Robots Using Turkish Natural Language Processing

    People use natural language to express their thoughts and wishes. As robots come to inhabit various human environments, such as homes, offices, and hospitals, the need for human–robot communication is increasing, and one of the best ways to achieve this communication is the use of natural languages. Natural language processing (NLP) is the most important approach enabling robots to understand natural languages and improve human–robot interaction, and the amount of research on NLP has increased considerably in recent years. In this study, commands were given to a multiple-mobile-robot system in Turkish, and the robots were required to fulfill these orders. Turkish is classified as an agglutinative language: words combine different morphemes, each carrying a specific meaning, to create complex words, and Turkish conveys grammatical relationships, tense, aspect, mood, and other semantic nuances by adding various suffixes to a root or base form. Because of this agglutinative structure, it is very difficult to decode Turkish sentence structure in a way that robots can understand. Parsing of a given command, path planning, path tracking, and formation control were carried out. In the path-planning phase, the A* algorithm was used to find the optimal path, and a PID controller was used to follow the generated path with minimum error. A leader–follower approach was used to control the multiple robots, with a platoon formation chosen as the multi-robot formation. The proposed method was validated on a known map containing obstacles, demonstrating the system's ability to navigate the robots to the desired locations while maintaining the specified formation. The study used Turtlebot3 robots within the Gazebo simulation environment, providing a controlled and replicable setting for comprehensive experimentation. The results affirm the feasibility and effectiveness of employing NLP techniques for the formation control of multiple mobile robots, offering a robust method for further research and development on human–robot interaction.
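
Of the components listed, the A* planning step is the easiest to illustrate in isolation. Below is a generic A* sketch on a small occupancy grid with a Manhattan heuristic; the grid, start, and goal are made-up inputs, whereas the study ran on a known map in the Gazebo simulation.

```python
import heapq
import itertools

def astar(grid, start, goal):
    # A* over a 4-connected occupancy grid (1 = obstacle, 0 = free).
    # Manhattan distance is an admissible heuristic for unit-cost moves.
    h = lambda p: abs(p[0] - goal[0]) + abs(p[1] - goal[1])
    tie = itertools.count()  # tiebreaker so the heap never compares cells
    open_set = [(h(start), 0, next(tie), start, None)]
    came_from, closed = {}, set()
    while open_set:
        _, g, _, cur, parent = heapq.heappop(open_set)
        if cur in closed:
            continue
        closed.add(cur)
        came_from[cur] = parent
        if cur == goal:
            path = []
            while cur is not None:       # walk parents back to the start
                path.append(cur)
                cur = came_from[cur]
            return path[::-1]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dr, cur[1] + dc)
            if (0 <= nxt[0] < len(grid) and 0 <= nxt[1] < len(grid[0])
                    and grid[nxt[0]][nxt[1]] == 0 and nxt not in closed):
                heapq.heappush(open_set, (g + 1 + h(nxt), g + 1, next(tie), nxt, cur))
    return None  # goal unreachable

grid = [[0, 0, 0],
        [1, 1, 0],
        [0, 0, 0]]
print(astar(grid, (0, 0), (2, 0)))
```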

    A Matrix-Based Heuristic Algorithm for Extracting Multiword Expressions from a Corpus

    This paper describes an algorithm for automatically extracting multiword expressions (MWEs) from a corpus. The algorithm is node-based, i.e. it extracts MWEs that contain an item specified by the user, using a fixed window size around the node. The main idea is to detect the frequency anomalies that occur at the starting and ending points of an n-gram that constitutes an MWE. This is achieved by locally comparing matrices of observed frequencies to matrices of expected frequencies and determining, for each individual input, one or more sub-sequences that have the highest probability of being an MWE. Top-performing sub-sequences are then combined in a score-aggregation and ranking stage, producing a single list of score-ranked MWE candidates without having to indiscriminately generate all possible sub-sequences of the input strings. The knowledge-poor and computationally efficient algorithm attempts to solve certain recurring problems in MWE extraction, such as the inability to deal with MWEs of arbitrary length, the repetitive counting of nested n-grams, and excessive sensitivity to frequency. Evaluation results show that the best-performing version generates top-50 precision values between 0.71 and 0.88 on Turkish and English data and performs better than the baseline method even at n = 1000.
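
The core intuition of comparing observed to expected frequencies can be sketched in a few lines using a PMI-style score over bigrams. This is only the underlying idea; the paper's node-centered, matrix-based algorithm handles arbitrary-length sub-sequences, and the toy text here is a placeholder.

```python
from collections import Counter
from math import log2

def pmi_bigrams(tokens, min_count=2):
    # Compare each bigram's observed frequency with the frequency expected
    # if its two words occurred independently; high ratios flag MWE candidates.
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    scores = {}
    for (w1, w2), obs in bigrams.items():
        if obs >= min_count:
            expected = unigrams[w1] * unigrams[w2] / n
            scores[(w1, w2)] = log2(obs / expected)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

tokens = "kick the bucket and kick the bucket again".split()
print(pmi_bigrams(tokens, min_count=2))
```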

    A large vocabulary online handwriting recognition system for Turkish

    Handwriting recognition in general, and online handwriting recognition in particular, has been an active research area for several decades. Most of the research has focused on English and, more recently, on other scripts such as Arabic and Chinese. There has been a lack of research on the recognition of Turkish text, and this work fills that gap with a state-of-the-art recognizer for the first time. It contains the design and implementation details of a complete system for the recognition of Turkish isolated words. Based on hidden Markov models (HMMs), the system comprises pre-processing, feature extraction, optical modeling, and language modeling modules. It first considers the recognition of unconstrained handwriting with a limited vocabulary size and then evolves into a large-vocabulary system. Turkish script has many similarities with other Latin scripts, such as English, which makes it possible to adapt strategies that work for them; however, some issues particular to Turkish must be considered separately. Two challenging issues in the recognition of Turkish text are delayed strokes, which introduce an extra source of variation in the sequence order of the handwritten input, and the high out-of-vocabulary (OOV) rate of Turkish when words are used as vocabulary units in the decoding process. This work examines these problems and alternative solutions in depth and proposes solutions suited particularly to Turkish script. For delayed-stroke handling, a clear definition of delayed strokes is first developed, and several handling methods based on that definition are then evaluated extensively on the UNIPEN and Turkish datasets. The best results are obtained by removing all delayed strokes, with recognition accuracy increases of up to 2.13 and 2.03 percentage points over the respective English and Turkish baselines. The overall system performance is assessed as 86.1% with a 1,000-word lexicon and 83.0% with a 3,500-word lexicon on the UNIPEN dataset, and 91.7% on the Turkish dataset. To address the high OOV rate, alternative decoding vocabularies are designed with grammatical sub-lexical units, and statistical bi-gram and tri-gram language models are applied during decoding. The best performance, 67.9%, is obtained with a large stem-ending vocabulary expanded with a bi-gram model on the Turkish dataset; this result is superior to the accuracy of the word-based vocabulary (63.8%) at the same 95% coverage of the BOUN Web Corpus.
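
As background for the HMM-based optical modeling, here is a small NumPy Viterbi decoder over a toy discrete HMM. The probabilities are made up for illustration; a real recognizer decodes continuous feature sequences with per-character models combined with the language model described above.

```python
import numpy as np

def viterbi(obs, pi, A, B):
    # Viterbi decoding: the core operation behind HMM-based recognizers.
    # pi: initial state probs, A[i, j]: transition, B[i, o]: emission.
    n_states, T = len(pi), len(obs)
    delta = np.zeros((T, n_states))          # best log-prob ending in each state
    psi = np.zeros((T, n_states), dtype=int) # backpointers
    delta[0] = np.log(pi) + np.log(B[:, obs[0]])
    for t in range(1, T):
        scores = delta[t - 1][:, None] + np.log(A)  # scores[i, j]: i -> j
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + np.log(B[:, obs[t]])
    states = [int(delta[-1].argmax())]       # backtrack best state sequence
    for t in range(T - 1, 0, -1):
        states.append(int(psi[t][states[-1]]))
    return states[::-1]

pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3], [0.4, 0.6]])
B = np.array([[0.5, 0.5], [0.1, 0.9]])
print(viterbi([0, 1, 1], pi, A, B))
```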