360 research outputs found

    Named Entity Recognition and Text Compression

    Import 13/01/2017
    In recent years, social networks have become very popular, and it is easy for users to share their data through them. Because data on social networks is idiomatic, irregular, and brief, and includes acronyms and spelling errors, it is harder to process than news or other formal text. Given the huge volume of posts each day, effective extraction and processing of these data would greatly benefit information extraction applications. This thesis proposes a method to normalize Vietnamese informal text in social networks. The method identifies and normalizes informal text based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram model. After normalization, the data are processed by a named entity recognition (NER) model that identifies and classifies the named entities they contain. Our NER model uses six different types of features to recognize named entities in three predefined classes: Person (PER), Location (LOC), and Organization (ORG). Social network data are very large and grow daily, which raises the challenge of reducing their size. The trigram dictionary used for normalization is also quite big, so its size needs to be reduced as well. To address this, the thesis proposes three methods for compressing text files, especially Vietnamese text. The first is a syllable-based method relying on the structure of Vietnamese morphosyllables, consonants, syllables and vowels. The second is trigram-based Vietnamese text compression using a trigram dictionary. The last is based on an n-gram sliding window, with five dictionaries for unigrams, bigrams, trigrams, four-grams and five-grams. This method achieves a promising compression ratio of around 90% and can be used for text files of any size.
    460 - Katedra informatiky vyhově

    n-Gram-based text compression

    We propose an efficient method for compressing Vietnamese text using n-gram dictionaries. It achieves a significantly better compression ratio than state-of-the-art methods on the same dataset. Given a text, the proposed method first splits it into n-grams and then encodes them using n-gram dictionaries. In the encoding phase, we use a sliding window whose size ranges from bigram to five-gram to obtain the best encoding stream. Each n-gram is encoded in two to four bytes, depending on its corresponding n-gram dictionary. We collected a 2.5 GB text corpus from several Vietnamese news agencies to build n-gram dictionaries from unigram to five-gram, obtaining dictionaries with a total size of 12 GB. To evaluate our method, we collected a test set of 10 text files of different sizes. The experimental results indicate that our method achieves a compression ratio of around 90% and outperforms state-of-the-art methods.
    Web of Science; art. no. 948364
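The dictionary-based encoding described in this abstract can be illustrated with a toy sketch. It is not the authors' implementation. The function names, the greedy longest-match strategy, and the byte layout (one byte holding n plus a two-byte dictionary index, with unknown tokens stored as raw UTF-8) are all assumptions made for illustration. The abstract itself states only that a sliding window is used and that each n-gram takes two to four bytes.

```python
import struct

def build_dictionaries(corpus_tokens, max_n=5):
    """Assign an integer id to every n-gram (n = 1..max_n) seen in the corpus."""
    dictionaries = {n: {} for n in range(1, max_n + 1)}
    for n in range(1, max_n + 1):
        for i in range(len(corpus_tokens) - n + 1):
            gram = tuple(corpus_tokens[i:i + n])
            dictionaries[n].setdefault(gram, len(dictionaries[n]))
    return dictionaries

def encode(tokens, dictionaries, max_n=5):
    """Greedy longest-match encoding (an assumed strategy, toy byte layout)."""
    out = bytearray()
    i = 0
    while i < len(tokens):
        # Try the widest window first, then shrink it down to a unigram.
        for n in range(min(max_n, len(tokens) - i), 0, -1):
            idx = dictionaries[n].get(tuple(tokens[i:i + n]))
            if idx is not None:
                # 1 byte for n, 2 bytes for the index (ids must stay below 65536).
                out += struct.pack(">BH", n, idx)
                i += n
                break
        else:
            # Token not in any dictionary: marker 0, byte length, raw UTF-8.
            raw = tokens[i].encode("utf-8")
            out += struct.pack(">BH", 0, len(raw)) + raw
            i += 1
    return bytes(out)

def decode(data, dictionaries):
    """Invert encode() using the same dictionaries."""
    reverse = {n: {idx: gram for gram, idx in d.items()}
               for n, d in dictionaries.items()}
    tokens, pos = [], 0
    while pos < len(data):
        n, value = struct.unpack_from(">BH", data, pos)
        pos += 3
        if n == 0:
            tokens.append(data[pos:pos + value].decode("utf-8"))
            pos += value
        else:
            tokens.extend(reverse[n][value])
    return tokens

corpus = "tôi đi học ở trường đại học".split()
dicts = build_dictionaries(corpus)
text = "tôi đi học ở nhà".split()
packed = encode(text, dicts)
restored = decode(packed, dicts)
```

Here the first four syllables match a single four-gram entry, while the unseen syllable falls back to raw bytes. A real system would need an index width suited to dictionaries of the size the abstract reports.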

    Tonal placement in Tashlhiyt: How an intonation system accommodates to adverse phonological environments

    In most languages, words contain vowels, elements of high intensity with rich harmonic structure, enabling the perceptual retrieval of pitch. By contrast, in Tashlhiyt, a Berber language, words can be composed entirely of voiceless segments. When an utterance consists of such words, the phonetic opportunity for the execution of intonational pitch movements is exceptionally limited. This book explores in a series of production and perception experiments how these typologically rare phonotactic patterns interact with intonational aspects of linguistic structure. It turns out that Tashlhiyt allows for a tremendously flexible placement of tonal events. Observed intonational structures can be conceived of as different solutions to a functional dilemma: the requirement to realise meaningful pitch movements in certain positions and the extent to which segments lend themselves to a clear manifestation of these pitch movements.

    Znělostní kontrast ve vietnamské angličtině (Voicing contrast in Vietnamese English)

    This thesis deals with the voicing contrast in Vietnamese-accented English. The theoretical part introduces the generally accepted phenomenon of voicing contrast and several theories aimed at generalizing the main tendencies in second language acquisition. The final part of the theoretical background addresses initial consonants in Vietnamese and Vietnamese English (VE). The methodological section provides information about the informants, the recording, and data processing prior to the analysis itself. I also present graphs and tables illustrating the statistical calculations, using ANOVA and Tukey's post-hoc tests, that identify the relations among the measured units. The results of this analysis show that in its initial stressed plosives, Vietnamese-accented English maintains a voicing contrast similar to that of a native English accent. The average Voice Onset Time values of lenis stops without prevoicing are slightly higher than those produced by American English (AmE) speakers, while the average values of voiced initial stops prove to be fairly close to their AmE equivalents. This affinity is attributed to the fact that prevoicing in Vietnamese exhibits strikingly similar values to AmE. The values for fortis initial plosives are shown to be higher in VE than AmE, due to the fact that in...
    Department of the English Language and ELT Methodology / Ústav anglického jazyka a didaktiky; Filozofická fakulta / Faculty of Arts

    The optimality of word lengths. Theoretical foundations and an empirical study

    Zipf's law of abbreviation, namely the tendency of more frequent words to be shorter, has been viewed as a manifestation of compression, i.e. the minimization of the length of forms, a universal principle of natural communication. Although the claim that languages are optimized has become trendy, attempts to measure the degree of optimization of languages have been rather scarce. Here we present two optimality scores that are dually normalized, namely, they are normalized with respect to both the minimum and the random baseline. We analyze the theoretical and statistical pros and cons of these and other scores. Harnessing the best score, we quantify for the first time the degree of optimality of word lengths in languages. This indicates that languages are optimized to 62 or 67 percent on average (depending on the source) when word lengths are measured in characters, and to 65 percent on average when word lengths are measured in time. In general, spoken word durations are more optimized than written word lengths in characters. Our work paves the way to measure the degree of optimality of the vocalizations or gestures of other species, and to compare them against written, spoken, or signed human languages.
    Comment: On the one hand, the article has been reduced: analyses of the law of abbreviation and some of the methods have been moved to another article; appendix B has been reduced. On the other hand, various parts have been rewritten for clarity; new figures have been added to ease the understanding of the scores; new citations added. Many typos have been corrected.
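The abstract does not give the formulas for its scores, so the sketch below shows only one plausible way to normalize a mean word length against both a minimum and a random baseline. The function names and the specific form `(L_rand - L) / (L_rand - L_min)` are assumptions for illustration, not necessarily the scores defined in the article. Under this form, 1 means the shortest lengths go to the most frequent words, and 0 means no better than a random assignment of lengths.

```python
def mean_length(probs, lengths):
    """Expected word length: sum of p_i * l_i."""
    return sum(p * l for p, l in zip(probs, lengths))

def optimality_score(probs, lengths):
    """Illustrative dually normalized score (assumed form, see lead-in)."""
    observed = mean_length(probs, lengths)
    # Minimum baseline: pair the most probable words with the shortest lengths.
    minimum = mean_length(sorted(probs, reverse=True), sorted(lengths))
    # Random baseline: expected mean length when the same lengths are
    # shuffled uniformly at random over the words.
    random_baseline = sum(probs) * sum(lengths) / len(lengths)
    if random_baseline == minimum:
        raise ValueError("score undefined: minimum equals random baseline")
    return (random_baseline - observed) / (random_baseline - minimum)

probs = [0.5, 0.3, 0.2]                          # hypothetical word probabilities
best = optimality_score(probs, [1, 2, 3])        # frequent words are shortest
worst = optimality_score(probs, [3, 2, 1])       # frequent words are longest
```

With these made-up inputs the first assignment attains the minimum baseline and the second falls below the random baseline, which is why a score of this shape can be negative.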

    Znělostní kontrast ve vietnamské angličtině (Voicing contrast in Vietnamese English)

    This thesis deals with the voicing contrast in Vietnamese-accented English. The theoretical part introduces the generally accepted phenomenon of voicing contrast and several theories that aim to generalize the main tendencies in acquiring a second language. The final part of the theoretical background is devoted to Vietnamese and Vietnamese English, with a focus on initial consonants. The methodological section provides information about the informants, the recording, and data processing prior to the analysis itself. Graphs and tables illustrate the statistical calculations, using ANOVA and Tukey's post-hoc tests, that identify the aggregate and concrete relations among the measured units. The results of the analysis show that Vietnamese-accented English maintains a voicing contrast in its initial stressed plosives comparable to that of a native English accent. The average Voice Onset Time values of the lenis stops without prevoicing are slightly higher, while the average values of voiced initial stops prove to be similar or close to those produced by American English (AmE) speakers, which we attribute to the fact that prevoicing in Vietnamese exhibits strikingly similar values. The values for fortis initial plosives were shown to be higher due to such quality typical for Vietnamese...
    Department of the English Language and ELT Methodology / Ústav anglického jazyka a didaktiky; Filozofická fakulta / Faculty of Arts

    Tonal placement in Tashlhiyt

    In most languages, words contain vowels, elements of high intensity with rich harmonic structure, enabling the perceptual retrieval of pitch. By contrast, in Tashlhiyt, a Berber language, words can be composed entirely of voiceless segments. When an utterance consists of such words, the phonetic opportunity for the execution of intonational pitch movements is exceptionally limited. This book explores in a series of production and perception experiments how these typologically rare phonotactic patterns interact with intonational aspects of linguistic structure. It turns out that Tashlhiyt allows for a tremendously flexible placement of tonal events. Observed intonational structures can be conceived of as different solutions to a functional dilemma: the requirement to realise meaningful pitch movements in certain positions and the extent to which segments lend themselves to a clear manifestation of these pitch movements.