360 research outputs found
Named Entity Recognition and Text Compression
Import 13/01/2017
In recent years, social networks have become very popular. It is easy for users
to share their data using online social networks. Since data on social networks is
idiomatic, irregular, brief, and includes acronyms and spelling errors, dealing with
such data is more challenging than dealing with news or formal texts. With the huge
volume of posts each day, effective extraction and processing of these data will bring
great benefit to information extraction applications.
This thesis proposes a method to normalize Vietnamese informal text in social
networks. This method has the ability to identify and normalize informal text
based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram
model. After normalization, the data will be processed by a named entity
recognition (NER) model to identify and classify the named entities in these data.
In our NER model, we use six different types of features to recognize named entities
categorized in three predefined classes: Person (PER), Location (LOC), and
Organization (ORG).
When viewing social network data, we found that these data are very
large and grow daily. This raises the challenge of how to reduce their size. Because
of the size of the data to be normalized, we use a trigram dictionary that is quite
big, so we also need to reduce its size. To deal with this challenge, in this
thesis, we propose three methods to compress text files, especially Vietnamese
text. The first method is a syllable-based method relying on the structure of
Vietnamese morphosyllables, consonants, syllables and vowels. The second method
is trigram-based Vietnamese text compression based on a trigram dictionary. The
last method is based on an n-gram sliding window, in which we use five dictionaries
for unigrams, bigrams, trigrams, four-grams and five-grams. This method achieves
a promising compression ratio of around 90% and can be used for any size of text file.
460 - Katedra informatiky
vyhově
n-Gram-based text compression
We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. Its compression ratio compares favorably with those of state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them based on n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigrams to five-grams to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigrams to five-grams, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
Web of Science
art. no. 948364
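The dictionary-based encoding described in this abstract can be illustrated with a minimal sketch. It rests on several assumptions that are not taken from the paper: tokens are split on whitespace, the window greedily picks the longest n-gram found in a dictionary (the paper's rule for choosing the "best encoding stream" may differ), each code is kept as an `(n, index)` pair rather than packed into two to four bytes, and the function names `build_dictionaries`, `encode`, and `decode` are hypothetical.

```python
from typing import Dict, List, Tuple

Gram = Tuple[str, ...]
MAX_N = 5  # longest n-gram order mentioned in the abstract


def build_dictionaries(corpus: List[str], max_n: int = MAX_N) -> Dict[int, Dict[Gram, int]]:
    """Give every n-gram (n = 1..max_n) seen in the corpus an integer index."""
    dicts: Dict[int, Dict[Gram, int]] = {n: {} for n in range(1, max_n + 1)}
    for sentence in corpus:
        tokens = sentence.split()
        for n in range(1, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                # setdefault keeps the first index assigned to a repeated n-gram
                dicts[n].setdefault(gram, len(dicts[n]))
    return dicts


def encode(text: str, dicts: Dict[int, Dict[Gram, int]], max_n: int = MAX_N) -> List[Tuple[int, int]]:
    """Greedy longest-match encoding (an assumption, not the paper's exact rule)."""
    tokens = text.split()
    codes: List[Tuple[int, int]] = []
    i = 0
    while i < len(tokens):
        # Try the widest window first, then shrink it down to a unigram.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            gram = tuple(tokens[i:i + n])
            if gram in dicts[n]:
                codes.append((n, dicts[n][gram]))
                i += n
                break
        else:
            # This sketch has no fallback for out-of-dictionary tokens.
            raise KeyError(f"token not in unigram dictionary: {tokens[i]!r}")
    return codes


def decode(codes: List[Tuple[int, int]], dicts: Dict[int, Dict[Gram, int]]) -> str:
    """Invert the dictionaries and rebuild the single-space-separated text."""
    inverse = {n: {idx: gram for gram, idx in d.items()} for n, d in dicts.items()}
    tokens: List[str] = []
    for n, idx in codes:
        tokens.extend(inverse[n][idx])
    return " ".join(tokens)
```

In this sketch the saving comes from replacing a run of up to five tokens with one short code. A real implementation would also need a compact byte layout for the codes and a way to handle tokens missing from the dictionaries.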
Tonal placement in Tashlhiyt: How an intonation system accommodates to adverse phonological environments
In most languages, words contain vowels, elements of high intensity with rich harmonic structure, enabling the perceptual retrieval of pitch. By contrast, in Tashlhiyt, a Berber language, words can be composed entirely of voiceless segments. When an utterance consists of such words, the phonetic opportunity for the execution of intonational pitch movements is exceptionally limited. This book explores in a series of production and perception experiments how these typologically rare phonotactic patterns interact with intonational aspects of linguistic structure. It turns out that Tashlhiyt allows for a tremendously flexible placement of tonal events. Observed intonational structures can be conceived of as different solutions to a functional dilemma: the requirement to realise meaningful pitch movements in certain positions and the extent to which segments lend themselves to a clear manifestation of these pitch movements.
Znělostní kontrast ve vietnamské angličtině (Voicing contrast in Vietnamese English)
This thesis deals with the voicing contrast in Vietnamese-accented English. The theoretical part introduces the generally accepted phenomenon of voicing contrast and several theories aimed at generalizing the main tendencies in second language acquisition. The final part of the theoretical background addresses initial consonants in Vietnamese and Vietnamese English. The methodological section provides information about the informants, the recording, and data processing prior to the analysis itself. I also present graphs and tables illustrating the statistical calculations, using ANOVA and Tukey's post-hoc tests, that identify the relations among the measured units.
The results of this analysis show that in its initial stressed plosives, Vietnamese-accented English maintains a voicing contrast similar to that of a native English accent. The average Voice Onset Time values of lenis stops without prevoicing are slightly higher than those produced by American English (AmE) speakers, while the average values of voiced initial stops prove to be fairly close to their AmE equivalents. This affinity is attributed to the fact that prevoicing in Vietnamese exhibits strikingly similar values to AmE. The values for fortis initial plosives are shown to be higher in VE than AmE, due to the fact that in...
Department of the English Language and ELT Methodology (Ústav anglického jazyka a didaktiky), Faculty of Arts (Filozofická fakulta)
The optimality of word lengths. Theoretical foundations and an empirical study
Zipf's law of abbreviation, namely the tendency of more frequent words to be
shorter, has been viewed as a manifestation of compression, i.e. the
minimization of the length of forms -- a universal principle of natural
communication. Although the claim that languages are optimized has become
trendy, attempts to measure the degree of optimization of languages have been
rather scarce. Here we present two optimality scores that are dually normalized,
namely, they are normalized with respect to both the minimum and the random
baseline. We analyze the theoretical and statistical pros and cons of these and
other scores. Harnessing the best score, we quantify for the first time the
degree of optimality of word lengths in languages. This indicates that
languages are optimized to 62 or 67 percent on average (depending on the
source) when word lengths are measured in characters, and to 65 percent on
average when word lengths are measured in time. In general, spoken word
durations are more optimized than written word lengths in characters. Our work
paves the way to measure the degree of optimality of the vocalizations or
gestures of other species, and to compare them against written, spoken, or
signed human languages.
Comment: On the one hand, the article has been reduced: analyses of the law of
abbreviation and some of the methods have been moved to another article;
appendix B has been reduced. On the other hand, various parts have been
rewritten for clarity; new figures have been added to ease the understanding
of the scores; new citations added. Many typos have been corrected.
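The notion of a score normalized against both a minimum and a random baseline, as mentioned in the abstract above, can be sketched as follows. This is an illustrative assumption and not necessarily the exact definition used in the article. Here `mean_len` is the frequency-weighted mean word length, `min_len` is the mean obtained when the shortest lengths are assigned to the most frequent words, and `rand_len` is the expected mean when lengths are shuffled at random among words (which equals the plain average of the lengths). The function name `optimality_score` is hypothetical.

```python
from typing import Sequence


def optimality_score(freqs: Sequence[float], lengths: Sequence[float]) -> float:
    """Return (rand_len - mean_len) / (rand_len - min_len).

    1.0 means lengths are assigned as well as possible, about 0.0 means no
    better than a random assignment, and negative values mean worse than random.
    """
    if not freqs or len(freqs) != len(lengths):
        raise ValueError("freqs and lengths must be non-empty and of equal size")
    total = sum(freqs)
    probs = [f / total for f in freqs]

    # Observed mean length, weighted by relative frequency.
    mean_len = sum(p * l for p, l in zip(probs, lengths))
    # Minimum baseline: most frequent words receive the shortest lengths.
    min_len = sum(p * l for p, l in zip(sorted(probs, reverse=True), sorted(lengths)))
    # Random baseline: expected mean under a random permutation of the lengths.
    rand_len = sum(lengths) / len(lengths)

    if rand_len == min_len:
        raise ValueError("baselines coincide; the score is undefined")
    return (rand_len - mean_len) / (rand_len - min_len)
```

For example, frequencies `[4, 2, 1, 1]` with lengths `[1, 2, 3, 4]` are already optimally matched and score 1.0 under this sketch, while pairing the same frequencies with the reversed lengths scores below zero.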
Znělostní kontrast ve vietnamské angličtině (Voicing contrast in Vietnamese English)
This thesis deals with the voicing contrast in Vietnamese-accented English. The theoretical part introduces the generally accepted phenomenon of voicing contrast and several theories that aim to generalize the main tendencies in acquiring a second language. The final part of the theoretical background is devoted to Vietnamese and Vietnamese English, with a focus on initial consonants. The methodological section provides information about the informants, recording, and data processing prior to the analysis itself.
Furthermore, graphs and tables illustrate the statistical calculations using ANOVA and Tukey's post-hoc tests, which identify the aggregate and specific relations among the measured units. The results of the analysis show that Vietnamese-accented English maintains a voicing contrast in its initial stressed plosives comparable to that of a native English accent. The average Voice Onset Time values of the lenis stops without prevoicing are slightly higher, while the average values of voiced initial stops are similar or close to those produced by American English (AmE) speakers, which we attribute to the fact that prevoicing in Vietnamese exhibits strikingly similar values. The values for fortis initial plosives were higher due to such quality typical for Vietnamese...
Department of the English Language and ELT Methodology (Ústav anglického jazyka a didaktiky), Faculty of Arts (Filozofická fakulta)