17 research outputs found

    Ensemble feature selection using weighted concatenated voting for text classification

    As the volume of high-dimensional data grows, relevant features are generally better selected by filter feature selection techniques, which offer improved generalization, faster training, dimensionality reduction, lower risk of overfitting, and improved model performance. However, the most widely used feature selection methods are unstable: a given method may choose different feature subsets that yield different classification accuracy. An appropriate hybrid can harness the locally relevant features and the discriminative power of filter methods for improved text classification, an approach that is lacking in past literature. In this paper, we propose a novel multi-univariate hybrid feature selection method (MUNIFES) for enhanced discriminative power between the features and the target class. The proposed method uses multi-iterative processes to select the best feature sets from each univariate feature selection method. MUNIFES employs an ensemble of the multi-filter discriminative strengths of the Chi-Square (Chi2), Analysis of Variance (ANOVA), and Infogain methods to select optimal feature subsets. To evaluate the proposed method, several experiments were performed on the 20newsgroup dataset and its variant (17newsgroup) with 10 classifiers (including ensemble, classification and optimization algorithms, and an Artificial Neural Network (ANN)), and the results were compared with state-of-the-art feature selection methods. The MUNIFES results indicated better classification accuracy.
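The idea of combining several univariate filter scores can be sketched with a simple weighted rank vote. This is not the MUNIFES procedure itself, which the abstract does not specify in detail; it is a Borda-style stand-in. The function name `weighted_vote`, the per-method weights, and all scores below are hypothetical.

```python
from collections import defaultdict

def weighted_vote(score_tables, weights, top_k):
    """Combine per-method feature scores by weighted, Borda-style rank voting."""
    votes = defaultdict(float)
    for method, scores in score_tables.items():
        # Rank this method's features, best score first.
        ranked = sorted(scores, key=scores.get, reverse=True)
        n = len(ranked)
        for position, feature in enumerate(ranked):
            # The top-ranked feature earns n points, the last one earns 1.
            votes[feature] += weights[method] * (n - position)
    # Highest total first; ties are broken alphabetically for determinism.
    ordered = sorted(votes, key=lambda f: (-votes[f], f))
    return ordered[:top_k]

# Hypothetical filter scores for four terms (not taken from the paper).
score_tables = {
    "chi2": {"goal": 9.1, "team": 7.4, "the": 0.2, "cpu": 5.0},
    "anova": {"goal": 6.3, "team": 6.9, "the": 0.1, "cpu": 4.8},
    "infogain": {"goal": 0.40, "team": 0.31, "the": 0.01, "cpu": 0.35},
}
weights = {"chi2": 1.0, "anova": 1.0, "infogain": 1.0}  # assumed equal weights

selected = weighted_vote(score_tables, weights, top_k=2)
print(selected)
```

Voting on ranks rather than raw scores sidesteps the fact that Chi2, ANOVA, and information gain produce values on different scales.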

    Hybrid Machine Translation with Multi-Source Encoder-Decoder Long Short-Term Memory in English-Malay Translation

    Statistical Machine Translation (SMT) and Neural Machine Translation (NMT) are the state-of-the-art approaches in machine translation (MT). The translation produced by SMT is based on the statistical analysis of text corpora, while NMT uses a deep neural network to model and generate a translation. SMT and NMT each have their strengths and weaknesses. SMT may produce better translations than NMT when only a small parallel text corpus is available. Nevertheless, when the amount of parallel text available is large, the quality of the translation produced by NMT is often higher than that of SMT. Studies have also shown that the translation produced by SMT is better than that of NMT when there is a domain mismatch between training and testing. SMT also has an advantage on long sentences. In addition, when a translation produced by NMT is wrong, it is very difficult to find the error. In this paper, we investigate a hybrid approach that combines SMT and NMT to perform English-to-Malay translation. The motivation for using hybrid machine translation is to combine the strengths of both approaches to produce a more accurate translation. Our approach uses the multi-source encoder-decoder long short-term memory (LSTM) architecture. The architecture uses two encoders: one to embed the sentence to be translated, and another to embed the initial translation produced by SMT. The translation from the SMT can be viewed as a “suggestion translation” to the neural MT. Our experiments show that the hybrid MT increases the BLEU scores of our best baseline machine translation in the computer science domain and the news domain from 21.21 and 48.35 to 35.97 and 61.81, respectively.

    Text Mining and Determinants of Sentiments towards the COVID-19 Vaccine Booster of Twitter Users in Malaysia

    Vaccination is the primary preventive measure against COVID-19 infection, and an additional vaccine dose is crucial to increase the immunity level of the community. However, public bias, as reflected on social media, may have a significant impact on the vaccination program. We aim to investigate attitudes towards the COVID-19 vaccine booster in Malaysia by using sentiment analysis. We retrieved 788 tweets containing COVID-19 vaccine booster keywords, identified the common booster-related topics discussed in the tweets by using latent Dirichlet allocation (LDA), and performed sentiment analysis to understand the determinants of sentiments towards receiving the vaccine booster in Malaysia. We identified three important LDA topics: (1) type of vaccination booster; (2) effects of vaccination booster; (3) vaccination program operation. The type of vaccination was further transformed into the attributes “az”, “pfizer”, “sinovac”, and “mix” for the assessment of determinants. The effect and type of vaccine booster were more strongly associated with the sentiments than the program operation topic, and “pfizer” and “mix” were the strongest determinants of the tweets’ sentiments, as selected by Boruta feature selection and validated by the performance of regression analysis. This study provides a comprehensive workflow to retrieve and identify important healthcare topics from social media.

    Evaluating LSTM Networks, HMM and WFST in Malay Part-of-Speech Tagging

    Long short-term memory (LSTM) networks have been gaining popularity in modeling sequential data for tasks such as phoneme recognition, speech translation, language modeling, speech synthesis, and chatbot-like dialog systems. This paper investigates attention-based encoder-decoder LSTM networks for Malay part-of-speech (POS) tagging and compares them with the weighted finite state transducer (WFST) and the hidden Markov model (HMM). The appeal of LSTM networks lies in their strength in modeling long-distance dependencies. Malay POS tagging is examined under two conditions: with and without morphological information. The experimental results show that LSTM networks trained without any explicit morphological knowledge perform nearly as well as the WFST and better than the HMM approach trained with morphological information.
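For readers unfamiliar with the HMM baseline mentioned above, the following is a minimal sketch of Viterbi decoding for POS tagging. It is a textbook formulation, not the paper's system; the tag set, the three-word Malay example, and every probability are hypothetical toy values, and the `floor` smoothing constant is an assumption.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-6):
    """Return the most probable tag sequence for a non-empty word list."""
    def lp(p):
        # Log-probability, with unseen events floored to avoid log(0).
        return math.log(p if p > 0 else floor)

    # best[i][t]: log-probability of the best path ending in tag t at word i.
    best = [{t: lp(start_p.get(t, 0)) + lp(emit_p[t].get(words[0], 0)) for t in tags}]
    back = [{}]
    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            # Pick the predecessor tag that maximises the path score into t.
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0))
            back[i][t] = prev

    # Follow the back-pointers from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Hypothetical toy model for "saya makan nasi" ("I eat rice").
tags = ["PRON", "VERB", "NOUN"]
start_p = {"PRON": 0.6, "NOUN": 0.3, "VERB": 0.1}
trans_p = {
    "PRON": {"VERB": 0.7, "NOUN": 0.2, "PRON": 0.1},
    "VERB": {"NOUN": 0.6, "PRON": 0.2, "VERB": 0.2},
    "NOUN": {"VERB": 0.5, "NOUN": 0.3, "PRON": 0.2},
}
emit_p = {
    "PRON": {"saya": 0.8},
    "VERB": {"makan": 0.7},
    "NOUN": {"nasi": 0.6, "makan": 0.05},
}

tagged = viterbi(["saya", "makan", "nasi"], tags, start_p, trans_p, emit_p)
print(tagged)
```

Because each step conditions only on the previous tag, an HMM of this kind cannot directly capture the long-distance dependencies that the abstract cites as the strength of LSTM networks.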

    Tien-Ping Tan, Bali Ranaivo-Malançon, Laurent Besacier, Yin-Lai Yeong, Keng Hoon Gan, and Enya Kong Tang

    A flexible query transformation framework for structured retrieval / Gan Keng Hoon

    In recent years, meaningful structured collections have emerged that can be exploited in search tasks. When searching these structured collections, the expressiveness of structured queries allows structures to be specified at the query layer in order to obtain more focused and precise search results. However, constructing such queries in an ad hoc search environment is difficult, as users need to be familiar with the syntax of the query languages. Heterogeneity in structure usage across different collections also hinders users from selecting an appropriate structure or concept when writing queries. In this thesis, we are motivated to automate the construction of these queries from keyword queries, which are more familiar to any user. The work of query transformation involves two main challenges. The first is to propose a generic framework that can be easily adapted to changes in the structured retrieval environment, such as retrieval systems, collections, and scoring models. The second is to propose a query interpretation within the framework that handles structural complexities in a collection. Since the usage of markup and structures in current structured collections can be loosely defined, these collections are now richer and more complex in their information structures, especially text-centric collections. Current works have yet to explore these newly emerging complex structures when capturing knowledge for query interpretation. To address these challenges, a flexible query transformation framework (FQT) is proposed. Flexibility is desired so that the framework can cater for various settings of the structured retrieval environment, e.g. different types of structured collections and structured query interfaces. The framework consists of a novel intermediate query representation that is central to the transformation process, i.e. a structure that captures the information needs of a query and the syntax of the query separately. Its main strength is that it allows the transformation to be generic and to cater for more than a single type of structured query. Supporting this intermediate query representation are the query interpretation and query construction algorithms. The former uses a context-based probabilistic approach for interpreting the source query, whereas the latter constructs the interpreted query into an intermediate query. Once a source query is interpreted and represented as an intermediate query, it can be easily mapped to a structured query language using a set of predefined query templates in a knowledge base. Lastly, experiments are carried out at the algorithm, application, and representation levels on both synthetic and real-world data sets to demonstrate the feasibility and scalability of the query transformation framework. The experimental results confirm that our framework is more effective in terms of query interpretation, especially when dealing with collections with complex structures. The framework is also able to represent various kinds of information needs and structured query languages with its proposed intermediate query representation. Better precision has also been achieved when the structured queries generated by the framework are applied in a structured retrieval task.
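The separation between information need and query syntax described above can be illustrated with a small sketch. The thesis does not give its data structures or templates in this abstract, so everything here is an assumption: the `IntermediateQuery` class, the `TEMPLATES` knowledge base, the `construct` function, and the choice of NEXI-style and XPath-style targets are all hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class IntermediateQuery:
    """Syntax-free view of a query: a target element plus (concept, term) pairs."""
    target: str
    constraints: tuple

# Hypothetical template knowledge base: one entry per target query language.
TEMPLATES = {
    "nexi": {
        "query": "//{target}[{conditions}]",
        "condition": "about(.//{concept}, {term})",
        "joiner": " and ",
    },
    "xpath": {
        "query": "//{target}[{conditions}]",
        "condition": "contains(.//{concept}, '{term}')",
        "joiner": " and ",
    },
}

def construct(query, language):
    """Render an intermediate query in the chosen language via its templates."""
    template = TEMPLATES[language]
    conditions = template["joiner"].join(
        template["condition"].format(concept=concept, term=term)
        for concept, term in query.constraints
    )
    return template["query"].format(target=query.target, conditions=conditions)

# An already-interpreted keyword query such as "tan tagging" (interpretation assumed).
iq = IntermediateQuery("article", (("author", "tan"), ("title", "tagging")))

nexi = construct(iq, "nexi")
xpath = construct(iq, "xpath")
print(nexi)
print(xpath)
```

In this arrangement, supporting another query language only requires a new template entry, while the interpretation step that produces the intermediate query is left unchanged. The sketch does not escape quotes inside terms, which a real template engine would need to do.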

    Extraction and Visualization of Tourist Attraction Semantics from Travel Blogs

    Travel blogs are a significant source for modeling human travel behavior and characterizing tourist destinations, owing to their rich geospatial and thematic content. However, the bulk of unstructured text requires extensive processing for an efficient transformation of data into knowledge. Existing works have studied tourist places, but their results lack a coherent outline and visualization of the semantic knowledge associated with tourist attractions. Hence, this work proposes place semantics extraction based on a fusion of content analysis and natural language processing (NLP) techniques. A weighted-sum equation model is then employed to construct a points-of-interest graph (POI graph) that integrates the extracted semantics with conventional frequency-based weighting of tourist spots and routes. The framework presents and visualizes massive blog text in a comprehensible manner, helping individuals with travel decision-making and helping tourism managers devise effective destination planning and management strategies.
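A weighted-sum model of the kind mentioned above can be sketched as follows. The abstract does not state the actual equation, so the form `alpha * frequency + (1 - alpha) * semantic` with min-max scaling is an assumption, as are the function names, the value of `alpha`, and the placeholder POI names and scores.

```python
def normalise(values):
    """Min-max scale a dict of values to [0, 1]; constant inputs map to 0."""
    lo, hi = min(values.values()), max(values.values())
    span = hi - lo
    return {k: (v - lo) / span if span else 0.0 for k, v in values.items()}

def poi_weights(frequency, semantic, alpha=0.5):
    """Blend mention frequency with a semantic score for each POI node."""
    f, s = normalise(frequency), normalise(semantic)
    return {k: alpha * f[k] + (1 - alpha) * s[k] for k in frequency}

# Hypothetical inputs: blog mention counts and extracted semantic scores.
frequency = {"poi_a": 40, "poi_b": 30, "poi_c": 10}
semantic = {"poi_a": 0.2, "poi_b": 0.9, "poi_c": 0.4}

weights = poi_weights(frequency, semantic, alpha=0.5)
for name, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {weight:.3f}")
```

The same blend could be applied to edges (routes between spots) by keying the dictionaries on pairs of POIs instead of single names.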

    Effects of the Hybrid CRITIC–VIKOR Method on Product Aspect Ranking in Customer Reviews

    Product aspect ranking is critical for prioritizing the most important aspects of a specific product or service, helping prospective customers select suitable products that meet their needs. However, given the voluminous customer reviews published on websites, customers are hindered from manually extracting and characterizing the specific aspects of the products they search for. A few multicriteria decision-making methods have been implemented to rank the most relevant product aspects. As weights greatly affect the ranking results of product aspects, this study used objective methods to find the degree of importance of a set of criteria and thereby overcome the limitations of subjective weighting. The growing popularity of online shopping has led to an exponential increase in the number of customer reviews available on various e-commerce websites. The sheer volume of these reviews makes it nearly impossible for customers to manually extract and analyze the specific aspects of the products they are interested in. This challenge highlights the need for automated techniques that can efficiently rank product aspects by their relevance and importance. Multicriteria decision-making techniques can address the issue of product aspect ranking. These techniques offer a methodical strategy for assessing and comparing product attributes against multiple criteria. The subjective nature of determining weights for each criterion raises serious issues because it might lead to bias and inconsistent ranking outcomes. The CRITIC–VIKOR method was adopted in the product aspect ranking process. The statistical findings, based on a benchmark dataset and evaluated using NDCG, demonstrate the superior performance of objective weighting, which reasonably reproduces subjective weighting results. The results also show that the product aspects ranked using CRITIC–VIKOR could serve as guidelines for prospective customers to make wise purchasing decisions.
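The combination of objective weighting and compromise ranking described above can be sketched with the standard textbook formulations of CRITIC and VIKOR. This is not the authors' exact pipeline: the aspect names, criteria, and scores are hypothetical, all criteria are assumed to be benefit-type (larger is better), and the code assumes no criterion column is constant.

```python
import math

def column(matrix, j):
    return [row[j] for row in matrix]

def mean(xs):
    return sum(xs) / len(xs)

def std(xs):
    mu = mean(xs)
    return math.sqrt(sum((x - mu) ** 2 for x in xs) / (len(xs) - 1))

def corr(xs, ys):
    """Pearson correlation between two equally long lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return cov / math.sqrt(
        sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys)
    )

def critic_weights(matrix):
    """CRITIC: weight = contrast (std) times conflict (1 - correlation), normalised."""
    m = len(matrix[0])
    norm = []
    for j in range(m):
        col = column(matrix, j)
        lo, hi = min(col), max(col)
        norm.append([(x - lo) / (hi - lo) for x in col])
    info = [
        std(norm[j]) * sum(1 - corr(norm[j], norm[k]) for k in range(m))
        for j in range(m)
    ]
    total = sum(info)
    return [c / total for c in info]

def vikor(matrix, weights, v=0.5):
    """VIKOR: return Q for each alternative; a lower Q means a better rank."""
    m = len(weights)
    best = [max(column(matrix, j)) for j in range(m)]
    worst = [min(column(matrix, j)) for j in range(m)]
    S, R = [], []
    for row in matrix:
        d = [weights[j] * (best[j] - row[j]) / (best[j] - worst[j]) for j in range(m)]
        S.append(sum(d))  # group utility
        R.append(max(d))  # individual regret
    s_lo, s_hi, r_lo, r_hi = min(S), max(S), min(R), max(R)
    return [
        v * (s - s_lo) / (s_hi - s_lo) + (1 - v) * (r - r_lo) / (r_hi - r_lo)
        for s, r in zip(S, R)
    ]

# Hypothetical aspects scored on three hypothetical criteria.
aspects = ["battery", "screen", "price"]
scores = [[9, 8, 7], [6, 5, 6], [3, 2, 1]]

w = critic_weights(scores)
q = vikor(scores, w)
ranking = [a for _, a in sorted(zip(q, aspects))]
print(ranking)
```

The parameter `v` trades off group utility against individual regret; 0.5 is a common default, but the value used in the study is not given in the abstract.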