1 research outputs found
Named Entity Recognition and Text Compression
Import 13/01/2017In recent years, social networks have become very popular. It is easy for users
to share their data using online social networks. Since data on social networks is
idiomatic, irregular, brief, and includes acronyms and spelling errors, dealing with
such data is more challenging than that of news or formal texts. With the huge
volume of posts each day, effective extraction and processing of these data will bring
great benefit to information extraction applications.
This thesis proposes a method to normalize Vietnamese informal text in social
networks. This method has the ability to identify and normalize informal text
based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram
model. After normalization, the data will be processed by a named entity
recognition (NER) model to identify and classify the named entities in these data.
In our NER model, we use six different types of features to recognize named entities
categorized in three predefined classes: Person (PER), Location (LOC), and
Organization (ORG).
When viewing social network data, we found that the size of these data are very
large and increase daily. This raises the challenge of how to decrease this size. Due
to the size of the data to be normalized, we use a trigram dictionary that is quite
big, therefore we also need to decrease its size. To deal with this challenge, in this
thesis, we propose three methods to compress text files, especially in Vietnamese
text. The first method is a syllable-based method relying on the structure of
Vietnamese morphosyllables, consonants, syllables and vowels. The second method
is trigram-based Vietnamese text compression based on a trigram dictionary. The
last method is based on an n-gram slide window, in which we use five dictionaries
for unigrams, bigrams, trigrams, four-grams and five-grams. This method achieves
a promising compression ratio of around 90% and can be used for any size of text file.In recent years, social networks have become very popular. It is easy for users
to share their data using online social networks. Since data on social networks is
idiomatic, irregular, brief, and includes acronyms and spelling errors, dealing with
such data is more challenging than that of news or formal texts. With the huge
volume of posts each day, effective extraction and processing of these data will bring
great benefit to information extraction applications.
This thesis proposes a method to normalize Vietnamese informal text in social
networks. This method has the ability to identify and normalize informal text
based on the structure of Vietnamese words, Vietnamese syllable rules, and a trigram
model. After normalization, the data will be processed by a named entity
recognition (NER) model to identify and classify the named entities in these data.
In our NER model, we use six different types of features to recognize named entities
categorized in three predefined classes: Person (PER), Location (LOC), and
Organization (ORG).
When viewing social network data, we found that the size of these data are very
large and increase daily. This raises the challenge of how to decrease this size. Due
to the size of the data to be normalized, we use a trigram dictionary that is quite
big, therefore we also need to decrease its size. To deal with this challenge, in this
thesis, we propose three methods to compress text files, especially in Vietnamese
text. The first method is a syllable-based method relying on the structure of
Vietnamese morphosyllables, consonants, syllables and vowels. The second method
is trigram-based Vietnamese text compression based on a trigram dictionary. The
last method is based on an n-gram slide window, in which we use five dictionaries
for unigrams, bigrams, trigrams, four-grams and five-grams. This method achieves
a promising compression ratio of around 90% and can be used for any size of text file.460 - Katedra informatikyvyhově