    Weblog and short text feature extraction and impact on categorisation

    The characterisation and categorisation of weblogs and other short texts have become an important research theme in areas such as topic/trend detection and pattern recognition. The value of analysing and characterising short texts lies in identifying the features that distinguish them, thereby improving the input to the classification process. In this research work, we analyse a large number of text features and establish which combinations are useful for discriminating between the different genres of short text. Having identified the most promising features, we then confirm our findings by performing the categorisation task using three approaches: the Gaussian and SVM classifiers and the K-means clustering algorithm. Several hundred combinations of features were analysed in order to identify the best combinations, and the results confirmed the observations made. The novel aspect of our work is the detection of the best combination of individual metrics, which are identified as potential features to be used in the categorisation process.

    The research work of the third author is partially funded by the WIQ-EI (IRSES grant n. 269180) and DIANA APPLICATIONS (TIN2012-38603-C02-01), and done in the framework of the VLC/Campus Microcluster on Multimodal Interaction in Intelligent Systems.

    Perez Tellez, F.; Cardiff, J.; Rosso, P.; Pinto Avendaño, DE. (2014). Weblog and short text feature extraction and impact on categorisation. Journal of Intelligent and Fuzzy Systems. 27(5):2529-2544. https://doi.org/10.3233/IFS-141227