2 research outputs found

    Vietnamese Text Classification Algorithm using Long Short Term Memory and Word2Vec

    Get PDF
    In the context of the ongoing forth industrial revolution and fast computer science development the amount of textual information becomes huge. So, prior to applying the seemingly appropriate methodologies and techniques to the above data processing their nature and characteristics should be thoroughly analyzed and understood. At that, automatic text processing incorporated in the existing systems may facilitate many procedures. So far, text classiļ¬cation is one of the basic applications to natural language processing accounting for such factors as emotionsā€™ analysis, subject labeling etc. In particular, the existing advancements in deep learning networks demonstrate that the proposed methods may fit the documentsā€™ classifying, since they possess certain extra efficiency; for instance, they appeared to be eļ¬€ective for classifying texts in English. The thorough study revealed that practically no research effort was put into an expertise of the documents in Vietnamese language. In the scope of our study, there is not much research for documents in Vietnamese. The development of deep learning models for document classiļ¬cation has demonstrated certain improvements for texts in Vietnamese. Therefore, the use of long short term memory network with Word2vec is proposed to classify text that improves both performance and accuracy. The here developed approach when compared with other traditional methods demonstrated somewhat better results at classifying texts in Vietnamese language. The evaluation made over datasets in Vietnamese shows an accuracy of over 90%; also the proposed approach looks quite promising for real applications

    Online content clustering using variant K-Means Algorithms

    Get PDF
    Thesis (MTech)--Cape Peninsula University of Technology, 2019We live at a time when so much information is created. Unfortunately, much of the information is redundant. There is a huge amount of online information in the form of news articles that discuss similar stories. The number of articles is projected to grow. The growth makes it difficult for a person to process all that information in order to update themselves on a subject matter. There is an overwhelming amount of similar information on the internet. There is need for a solution that can organize this similar information into specific themes. The solution is a branch of Artificial intelligence (AI) called machine learning (ML) using clustering algorithms. This refers to clustering groups of information that is similar into containers. When the information is clustered people can be presented with information on their subject of interest, grouped together. The information in a group can be further processed into a summary. This research focuses on unsupervised learning. Literature has it that K-Means is one of the most widely used unsupervised clustering algorithm. K-Means is easy to learn, easy to implement and is also efficient. However, there is a horde of variations of K-Means. The research seeks to find a variant of K-Means that can be used with an acceptable performance, to cluster duplicate or similar news articles into correct semantic groups. The research is an experiment. News articles were collected from the internet using gocrawler. gocrawler is a program that takes Universal Resource Locators (URLs) as an argument and collects a story from a website pointed to by the URL. The URLs are read from a repository. The stories come riddled with adverts and images from the web page. This is referred to as a dirty text. The dirty text is sanitized. Sanitization is basically cleaning the collected news articles. This includes removing adverts and images from the web page. The clean text is stored in a repository, it is the input for the algorithm. The other input is the K value. All K-Means based variants take K value that defines the number of clusters to be produced. The stories are manually classified and labelled. The labelling is done to check the accuracy of machine clustering. Each story is labelled with a class to which it belongs. The data collection process itself was not unsupervised but the algorithms used to cluster are totally unsupervised. A total of 45 stories were collected and 9 manual clusters were identified. Under each manual cluster there are sub clusters of stories talking about one specific event. The performance of all the variants is compared to see the one with the best clustering results. Performance was checked by comparing the manual classification and the clustering results from the algorithm. Each K-Means variant is run on the same set of settings and same data set, that is 45 stories. The settings used are, ā€¢ Dimensionality of the feature vectors, ā€¢ Window size, ā€¢ Maximum distance between the current and predicted word in a sentence, ā€¢ Minimum word frequency, ā€¢ Specified range of words to ignore, ā€¢ Number of threads to train the model. ā€¢ The training algorithm either distributed memory (PV-DM) or distributed bag of words (PV-DBOW), ā€¢ The initial learning rate. The learning rate decreases to minimum alpha as training progresses, ā€¢ Number of iterations per cycle, ā€¢ Final learning rate, ā€¢ Number of clusters to form, ā€¢ The number of times the algorithm will be run, ā€¢ The method used for initialization. The results obtained show that K-Means can perform better than K-Modes. The results are tabulated and presented in graphs in chapter six. Clustering can be improved by incorporating Named Entity (NER) recognition into the K-Means algorithms. Results can also be improved by implementing multi-stage clustering technique. Where initial clustering is done then you take the cluster group and further cluster it to achieve finer clustering results
    corecore