Clusterization by the K-means method when K is unknown
There are various methods for clustering objects, used in different areas of machine learning. Among the many clustering methods, K-means is one of the most popular. The method has both pros and cons. Its main advantage is the rather high speed of clustering. Its main disadvantage is that the number of clusters must be known before the experiment. This paper describes a new clustering method based on K-means. The suggested method is also quite fast; however, it does not require the user to know the exact number of clusters in advance. The user only has to define the range within which the number of clusters lies. In addition, the suggested method makes it possible to limit the radius of clusters, which allows finding the objects that express the criteria of a cluster most distinctly and accurately, and it also allows limiting the number of objects in each cluster to a certain range.
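The abstract does not describe how the authors search the range of cluster counts, so the following is only a minimal sketch of one common approach, not the paper's method: run plain K-means for every K in a user-given range and keep the K with the best mean silhouette score. The function names, the farthest-point initialization, the silhouette criterion, and the sample points are all assumptions made for illustration.

```python
import math


def kmeans(points, k, max_iter=100):
    """Plain Lloyd's algorithm with a deterministic farthest-point start."""
    points = [tuple(p) for p in points]
    centers = [points[0]]
    while len(centers) < k:
        # Next seed: the point farthest from all seeds chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        updated = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            # Keep the old center if a cluster lost all of its members.
            updated.append(
                tuple(sum(xs) / len(members) for xs in zip(*members)) if members else centers[j]
            )
        if updated == centers:
            break
        centers = updated
    return labels


def mean_silhouette(points, labels):
    """Average silhouette score; higher means tighter, better separated clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1 or len(clusters) == 1:
            continue  # convention: such points contribute a score of 0
        # Mean distance to the other members of the same cluster.
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # Smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, q) for q in other) / len(other)
            for key, other in clusters.items()
            if key != lab
        )
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def choose_k(points, k_min, k_max):
    """Try every K in [k_min, k_max] and keep the best-scoring labelling."""
    best_k, best_labels, best_score = None, None, -math.inf
    for k in range(k_min, k_max + 1):
        labels = kmeans(points, k)
        score = mean_silhouette(points, labels)
        if score > best_score:
            best_k, best_labels, best_score = k, labels, score
    return best_k, best_labels


# Hypothetical sample data: three well-separated groups of points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (20, 0), (20, 1), (21, 0)]
best_k, labels = choose_k(points, 2, 5)
print(best_k, labels)
```

The radius and cluster-size limits mentioned in the abstract are not covered by this sketch; they could be applied as a filter on the distances to each final center.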
Voice Identification Using Classification Algorithms
This article discusses classification algorithms for the problem of identifying a person by voice using machine learning methods. The MFCC algorithm was used for speech preprocessing. To solve the problem, a comparative analysis of five classification algorithms was carried out. In the first experiment, the support vector method (0.90) and the multilayer perceptron (0.83) showed the best results. In the second experiment, a multilayer perceptron combined with the Robust scaler method was proposed for personal identification and reached an accuracy of 0.93. Therefore, a multilayer perceptron can be used to solve this problem, taking into account the specifics of the speech signal.
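The abstract names the Robust scaler method without defining it. Assuming it refers to the usual median and interquartile-range scaling (the default behaviour of scikit-learn's `RobustScaler`), the idea can be sketched with the standard library alone. The function name and the sample feature values are hypothetical.

```python
import statistics


def robust_scale(values):
    """Center a feature on its median and divide by the interquartile range."""
    # The "inclusive" method interpolates linearly between data points.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("interquartile range is zero; the feature is nearly constant")
    median = statistics.median(values)
    return [(v - median) / iqr for v in values]


# Hypothetical feature column with one outlier (100).
print(robust_scale([1, 2, 3, 4, 100]))  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

Because the median and the quartiles ignore extreme values, the outlier does not compress the remaining values the way min-max scaling would.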
Continuous Speech Recognition of Kazakh Language
This article describes methods for creating a continuous speech recognition system for the Kazakh language. Compared with other languages, research on Kazakh speech recognition began relatively recently, after the country gained independence, and Kazakh belongs to the low-resource languages. A large amount of data is required to create a reliable system and evaluate it accurately. A database has been created for the Kazakh language, consisting of speech signals and corresponding transcriptions. The continuous speech corpus comprises recordings of 200 speakers of different genders and ages, together with a pronunciation vocabulary of the language. Traditional models and deep neural networks have been used to train the system. As a result, a word error rate (WER) of 30.01% has been obtained.
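For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between the recognized text and the reference transcription, divided by the number of reference words. The sketch below shows that standard definition; the function name and the sample strings are illustrative and do not come from the paper's evaluation setup.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```

Under this definition, a WER of 30.01% means roughly 30 word errors per 100 reference words.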
Persian sentences to phoneme sequences conversion based on recurrent neural networks
Grapheme-to-phoneme conversion is one of the main subsystems of Text-to-Speech (TTS) systems. Converting a sequence of written words to the corresponding phoneme sequence is more challenging for Persian than for other languages, because the standard orthography of the language omits short vowels and the pronunciation of words depends on their position in a sentence. Common approaches used in commercial Persian TTS systems have several modules and complicated models for natural language processing and homograph disambiguation, which make implementation harder and reduce the overall precision of the system. In this paper we define grapheme-to-phoneme conversion as a sequential labeling problem and use modified Recurrent Neural Networks (RNN) to create a smart and integrated model for this purpose. The recurrent networks are modified to be bidirectional and equipped with Long Short-Term Memory (LSTM) blocks to capture most of the past and future contextual information for decision making. The experiments conducted in this paper show that, in addition to having a unified structure, the bidirectional RNN-LSTM performs well in recognizing the pronunciation of Persian sentences, with a precision of more than 98 percent.
Grammatical Categories Determination for Turkish and Kazakh Languages Based on Machine Learning Algorithms and Fulfilling Dictionaries of Link Grammar Parser
This research aims to identify parts of speech for the Kazakh and Turkish languages in an information retrieval system. The proposed algorithms are based on machine learning techniques. We consider the binary classification of words according to parts of speech, and we study the most popular, well-known machine learning algorithms. We defined 7 dictionaries and tagged 135 million words in Kazakh, and 9 dictionaries and 50 million words in Turkish.
The main problem considered in the paper is to create algorithms for compiling the dictionaries of the so-called Link Grammar Parser (LGP) system, in particular for the Kazakh and Turkish languages, using machine learning techniques.
The research focuses on reviewing and comparing machine learning algorithms and methods that have achieved results on various natural language processing tasks, such as determining grammatical categories.
For the LGP system to operate, a dictionary is created in which a connector is specified for each word, that is, the type of link that can be formed with this word. The authors considered methods of filling in LGP dictionaries using machine learning.
The complexity of natural language processing does not, however, rule out identifying narrower tasks that can already be solved algorithmically, for example determining parts of speech or splitting texts into logical groups. Still, some features of natural languages significantly reduce the effectiveness of these solutions. For instance, taking into account all word forms of each word in Kazakh and Turkish increases the complexity of text processing by an order of magnitude.