
    The impact of review valence, rating and type on review helpfulness: a text clustering and text categorization study

    Dissertation presented as partial requirement for obtaining the Master's degree in Information Management, with specialization in Marketing Intelligence.

    Consumers trust online reviews to help them make their purchasing decisions. Online reviews provide consumers with clues about the quality of the products they want to buy. Consumers rely on clues such as review helpfulness votes and ratings to infer product quality. In this study, we perform a text clustering and a text categorization analysis to uncover review characteristics and to predict the review rating, helpfulness votes and product price from the review corpus. We use a dataset of 72,878 reviews of unlocked mobile phones sold on Amazon.com. The main goal of this research is to understand the impact of review valence, rating and type on helpfulness votes on Amazon for unlocked mobile phones. This research also aims to understand the impact of price on customer satisfaction and the relationship between customer satisfaction and ratings. Our results suggest that positive reviews that emphasize the feature-level quality of the products receive more helpful votes than positive reviews that contain mainly subjective expressions, or negative reviews. Another important finding concerns the influence of product price: phones with a high price tend to receive more positive reviews and more helpful votes. These findings have important managerial and theoretical implications. To the best of our knowledge, our study is the first to analyze the effect of the combination of valence, rating and subjectivity of the review text on helpful votes.
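    The text clustering step described in this abstract can be illustrated with a minimal sketch: reviews become bag-of-words vectors and are grouped by cosine similarity with k-means. This is not the thesis's pipeline; the reviews, vocabulary and deterministic initialisation below are invented for illustration.

    ```python
    # A minimal sketch (not the thesis pipeline) of clustering review texts by
    # bag-of-words cosine similarity with k-means; reviews and initialisation
    # are invented for illustration.
    import math
    from collections import Counter

    reviews = [
        "great battery life and camera quality",
        "excellent camera and battery quality",
        "terrible screen broke after a week",
        "awful screen cracked after a week",
    ]

    vocab = sorted({w for r in reviews for w in r.split()})

    def vectorize(text):
        counts = Counter(text.split())
        return [counts[w] for w in vocab]

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb) if na and nb else 0.0

    def kmeans(vectors, iters=10):
        # simple deterministic initialisation: first and last review
        centroids = [list(vectors[0]), list(vectors[-1])]
        labels = [0] * len(vectors)
        for _ in range(iters):
            # assign each review to its most similar centroid
            labels = [max(range(len(centroids)),
                          key=lambda c: cosine(v, centroids[c]))
                      for v in vectors]
            # recompute each centroid as the mean of its members
            for c in range(len(centroids)):
                members = [v for v, l in zip(vectors, labels) if l == c]
                if members:
                    centroids[c] = [sum(col) / len(members)
                                    for col in zip(*members)]
        return labels

    labels = kmeans([vectorize(r) for r in reviews])
    ```

    On this toy data the two positive battery/camera reviews land in one cluster and the two negative screen reviews in the other; a real study would use a far richer representation and vocabulary.
    
    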

    A framework for smart traffic management using heterogeneous data sources

    A thesis submitted in partial fulfilment of the requirements of the University of Wolverhampton for the degree of Doctor of Philosophy.

    Traffic congestion constitutes a social, economic and environmental issue for modern cities, as it can negatively impact travel times, fuel consumption and carbon emissions. Traffic forecasting and incident detection systems are fundamental areas of Intelligent Transportation Systems (ITS) that have been widely researched in the last decade. These systems provide real-time information about traffic congestion and other unexpected incidents, which can help traffic management agencies activate strategies and notify users accordingly. However, existing techniques suffer from high false alarm rates and incorrect traffic measurements. In recent years, there has been increasing interest in integrating different types of data sources to achieve higher precision in traffic forecasting and incident detection, and a considerable amount of literature has grown around the influence of integrating data from heterogeneous sources into existing traffic management systems. This thesis presents a Smart Traffic Management framework for future cities. The proposed framework fuses different data sources and technologies to improve traffic prediction and incident detection; it is composed of two components: a social media component and a simulator component. The social media component consists of a text classification algorithm that identifies traffic-related tweets. These traffic messages are then geolocated using Natural Language Processing (NLP) techniques. Finally, to further analyse user emotions within the tweets, stress and relaxation strength detection is performed. The proposed text classification algorithm outperformed similar studies in the literature and proved more accurate than other machine learning algorithms on the same dataset. The stress and relaxation analysis detected a significant amount of stress in 40% of the tweets, while the remaining portion showed no associated emotions. This information can potentially be used for policy making in transportation, to understand users' perception of the transportation network. The simulator component proposes an optimisation procedure for determining missing roundabout and urban road flow distributions using constrained optimisation. Existing imputation methodologies have been developed on straight sections of highways, and their applicability to more complex networks has not been validated. This work presents a solution for the unavailability of roadway sensors in specific parts of the network and was able to predict the missing values with very low percentage error. The proposed imputation methodology can serve as an aid for existing traffic forecasting and incident detection methodologies, as well as for the development of more realistic simulation networks.
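    The tweet-classification idea described above can be sketched with a tiny multinomial Naive Bayes model that separates traffic-related tweets from others. This is not the thesis's algorithm; the training tweets and labels are invented for illustration.

    ```python
    # A minimal sketch (not the thesis classifier) of identifying traffic-related
    # tweets with multinomial Naive Bayes; the training data is invented.
    import math
    from collections import Counter, defaultdict

    train = [
        ("heavy traffic on the m6 near junction 10", "traffic"),
        ("accident blocking two lanes expect delays", "traffic"),
        ("road closed due to a crash congestion building", "traffic"),
        ("lovely sunny day at the park", "other"),
        ("just watched a great film tonight", "other"),
        ("new cafe opened in town great coffee", "other"),
    ]

    word_counts = defaultdict(Counter)  # per-class word frequencies
    class_counts = Counter()            # class frequencies for the prior
    for text, label in train:
        word_counts[label].update(text.split())
        class_counts[label] += 1

    vocab = {w for counts in word_counts.values() for w in counts}

    def predict(text):
        scores = {}
        for label in class_counts:
            # log prior plus Laplace-smoothed log likelihood of each word
            score = math.log(class_counts[label] / sum(class_counts.values()))
            total = sum(word_counts[label].values())
            for w in text.split():
                score += math.log((word_counts[label][w] + 1)
                                  / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)
    ```

    A classified traffic tweet would then be passed on to the geolocation and stress-detection stages the abstract describes.
    
    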

    A graph theoretical perspective for the unsupervised clustering of free text corpora

    This thesis introduces a robust end-to-end topic discovery framework that extracts a set of coherent topics stemming intrinsically from document similarities. Some topic clustering methods can support embedded vectors instead of the traditional Bag-of-Words (BoW) representation; some are free from the number-of-topics hyperparameter; and some others can extract multi-scale relations between topics. However, no topic clustering method supports all of these properties together. This thesis addresses this gap in the literature by designing a framework that supports any type of document-level features, especially embedded vectors. The framework does not require any uninformed decisions about the underlying data, such as the number of topics; instead, it extracts topics at multiple resolutions. To achieve this goal, we combine existing methods from natural language processing (NLP) for feature generation with graph theory, first for graph construction based on semantic document similarities, then for graph partitioning to extract the corresponding topics at multiple resolutions. Finally, we use methods from statistical machine learning to obtain highly generalisable supervised models that deploy topic classifiers for real-time topic extraction. Our applications on both a noisy and specialised corpus of medical records (descriptions of patient incidents within the NHS) and public news articles in everyday language show that our framework extracts coherent topics with better quantitative benchmark scores than other methods in most cases. The resulting multi-scale topics in both applications enable us to capture specific details more easily and to choose the relevant resolution for a given objective. This study contributes to the topic clustering literature by introducing a novel graph-theoretical perspective that provides a combination of new properties: multiple resolutions, independence from uninformed decisions about the corpus, and use of recent NLP features such as vector embeddings.
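    The graph-theoretic idea can be sketched very simply: documents become nodes, sufficiently similar pairs become edges, and the connected components at a given similarity threshold act as topics; sweeping the threshold gives a crude multi-resolution view. This is a toy stand-in for the thesis's graph partitioning, with invented documents and bag-of-words similarity in place of embeddings.

    ```python
    # A minimal sketch of topics as connected components of a document
    # similarity graph; documents and threshold are invented for illustration.
    import math
    from collections import Counter

    docs = [
        "patient fell in the ward corridor",
        "patient slipped and fell on wet floor",
        "wrong medication dose administered",
        "medication error incorrect dose given",
    ]

    def cosine(a, b):
        dot = sum(a[w] * b[w] for w in a)
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

    bows = [Counter(d.split()) for d in docs]

    def topics(threshold):
        n = len(bows)
        adj = {i: [] for i in range(n)}
        for i in range(n):
            for j in range(i + 1, n):
                if cosine(bows[i], bows[j]) >= threshold:
                    adj[i].append(j)
                    adj[j].append(i)
        # connected components via depth-first search
        seen, comps = set(), []
        for start in range(n):
            if start in seen:
                continue
            stack, comp = [start], set()
            while stack:
                node = stack.pop()
                if node in comp:
                    continue
                comp.add(node)
                stack.extend(adj[node])
            seen |= comp
            comps.append(comp)
        return comps
    ```

    At a loose threshold the falls-related and medication-related documents form two topics; at a stricter threshold every document stands alone, mimicking a finer resolution.
    
    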

    “It ain’t all good”: Machinic abuse detection and marginalisation in machine learning

    Online abusive language has been given increasing prominence as a societal problem over the past few years, as people increasingly communicate on online platforms. This prominence has brought growing academic attention to the issue, particularly within the field of Natural Language Processing (NLP), which has proposed multiple datasets and machine learning methods for the detection of text-based abuse. Recently, the disparate impacts of machine learning have received attention, showing that marginalised groups in society are disproportionately negatively affected by automated content moderation systems. Moreover, a number of challenges have been identified for abusive language detection technologies, including poor model performance across datasets and models' inability to contextualise potentially abusive speech within speaker intentions. This dissertation asks how NLP models for online abuse detection can address issues of generalisation and context. By critically examining the task of online abuse detection, I highlight how content moderation acts as a protective filter that seeks to maintain a sanitised environment. When automated content moderation systems are considered through this lens, it becomes clear that such systems are centred on the experiences of some bodies at the expense of others, often those who are already marginalised. To address this, I propose two modelling processes that (a) centre the mental and emotional states of the speaker by representing documents through the Linguistic Inquiry and Word Count (LIWC) categories they invoke, and (b) use Multi-Task Learning (MTL) to model abuse, such that the model aims to take into account the intentions of the speaker. I find that through the use of LIWC for representing documents, machine learning models for online abuse detection achieve improved classification scores on in-domain and out-of-domain datasets. Similarly, I show that through MTL, machine learning models gain improvements from a variety of auxiliary tasks that combine data for content moderation systems with data for related tasks such as sarcasm detection. Finally, I critique the machine learning pipeline in an effort to identify paths forward that bring into focus the people who are excluded from, and are likely to experience harms from, machine learning models for content moderation.
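    The representational idea behind the LIWC-based modelling can be sketched as follows: instead of raw words, a document is encoded by how often it invokes psycholinguistic categories. LIWC itself is a proprietary lexicon, so the tiny category word lists below are invented stand-ins for illustration only.

    ```python
    # A minimal sketch of category-based document representation; the category
    # names and word lists are invented stand-ins for the proprietary LIWC lexicon.
    CATEGORIES = {
        "anger":    {"hate", "stupid", "awful", "idiot"},
        "positive": {"love", "great", "nice", "thanks"},
        "social":   {"you", "they", "people", "friend"},
    }

    def categorize(text):
        """Return the count of each category's words occurring in the text."""
        tokens = text.lower().split()
        return {cat: sum(t in words for t in tokens)
                for cat, words in CATEGORIES.items()}

    vec = categorize("You people are awful and I hate this")
    ```

    The resulting category vector, rather than a bag of words, would then be the input to the abuse classifier, which is what lets the model generalise across datasets whose vocabularies differ.
    
    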

    Advanced document analysis and automatic classification of PDF documents

    This thesis explores the domain of document analysis and document classification within the PDF document environment. The main focus is the creation of a document classification technique that can identify the logical class of a PDF document and so provide the necessary information to document-class-specific algorithms (such as document understanding techniques). The thesis describes a page decomposition technique tailored to render the information contained in an unstructured PDF file into a set of blocks. The new technique is based on published research but contains many modifications that enable it to competently analyse the internal document model of PDF documents. A new level of document processing is presented: advanced document analysis. The aim of advanced document analysis is to extract information from the PDF file that can help identify the logical class of that file. A blackboard framework is used in a process of block labelling, in which the blocks created by earlier segmentation techniques are classified into one of eight basic categories. The blackboard's knowledge sources are programmed to find recurring patterns among the document's blocks and to formulate document-specific heuristics that can be used to tag those blocks. Meaningful document features are derived from three information sources: a statistical evaluation of the document's aesthetic components, a logic-based evaluation of the labelled document blocks, and an appearance-based evaluation of the labelled document blocks. The features are used to train and test a neural network classification system that identifies the recurring patterns among these features for four basic document classes: newspapers, brochures, forms and academic documents. In summary, this thesis shows that it is possible to classify a PDF document (which is logically unstructured) into a basic logical document class. This has important ramifications for document processing systems, which have traditionally relied upon a priori knowledge of the logical class of the document they are processing.
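    The step from labelled blocks to classifier input can be sketched as a simple feature extractor: each page becomes a vector of per-label area fractions. This is not the thesis's feature set; the block labels and the area-fraction features below are illustrative choices.

    ```python
    # A minimal sketch (not the thesis system) of turning labelled page blocks
    # into a feature vector for a document-class classifier; labels and
    # features are illustrative.
    from collections import Counter

    LABELS = ["title", "heading", "text", "image",
              "table", "rule", "caption", "form-field"]

    def block_features(blocks):
        """blocks: list of (label, area) pairs for one page.
        Returns the fraction of page block area carried by each label."""
        total = sum(area for _, area in blocks) or 1.0
        areas = Counter()
        for label, area in blocks:
            areas[label] += area
        return [areas[label] / total for label in LABELS]

    page = [("title", 10.0), ("text", 70.0), ("image", 20.0)]
    features = block_features(page)
    ```

    Such fixed-length vectors, computed over many pages, are the kind of input a neural network classifier can be trained on to separate classes like newspapers from forms.
    
    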

    A machine learning approach to detect insider threats in emails caused by human behaviour

    In recent years, there has been a significant increase in insider threats within organisations, and these have caused massive losses and damages. Because email communications are a crucial part of the modern-day working environment, many insider threats exist within organisations’ email infrastructure. It is well known that employees not only dispatch ‘business-as-usual’ emails, but also emails that are completely unrelated to company business, perhaps even involving malicious activity and unethical behaviour. Such insider threat activities are mostly carried out by employees who have legitimate access to their organisation’s resources, servers, and non-public data, yet abuse these privileges for personal gain or even to inflict malicious damage on the employer. The problem is that the high volume and velocity of email communication make it virtually impossible to minimise the risk of insider threat activities using techniques such as filtering and rule-based systems. The research presented in this dissertation suggests strategies to minimise the risk of insider threats via email systems by employing a machine-learning-based approach. This is done by studying and creating categories of malicious behaviours posed by insiders, and mapping these to phrases that would appear in email communications. Furthermore, a large email dataset is classified according to behavioural characteristics of employees. Machine learning algorithms are employed to identify commonly occurring insider threats and to group the occurrences according to insider threat classifications.

    Dissertation (MSc (Computer Science)), University of Pretoria, 2020.
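    The phrase-to-behaviour mapping described above can be sketched as a lookup that flags which threat categories an email's text invokes; a learned classifier would replace this in practice. The categories and phrases below are invented for illustration, not taken from the dissertation.

    ```python
    # A minimal sketch (not the dissertation's system) of mapping phrases that
    # may signal insider-threat behaviour to categories; phrases are invented.
    THREAT_PHRASES = {
        "data exfiltration": ["attached the customer list",
                              "send to my personal email"],
        "disgruntlement":    ["sick of this company",
                              "they will regret"],
    }

    def flag_email(body):
        """Return the set of threat categories whose phrases occur in the body."""
        lowered = body.lower()
        return {cat for cat, phrases in THREAT_PHRASES.items()
                if any(p in lowered for p in phrases)}

    hits = flag_email("I'm sick of this company. "
                      "I attached the customer list for you.")
    ```

    Emails flagged this way could then be grouped by category, which is the kind of labelled signal the abstract says the machine learning algorithms cluster and classify at scale.
    
    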

    Contributions to multivariate image processing

    Get PDF
    This memoir summarises my teaching and scientific activities towards obtaining the habilitation à diriger des recherches (the French accreditation to supervise research).

    Improving the accuracy of phylogenomic inference

    The explosion in the number of available sequences has allowed phylogenomics, the study of species relationships based on large multi-gene alignments, to flourish. Phylogenomics is undoubtedly an effective way to overcome the stochastic errors of single-gene phylogenies, but numerous problems remain despite clear progress in modelling the evolutionary process. In this thesis, we characterise certain aspects of poor model fit and study their impact on the accuracy of phylogenetic inference. In contrast to heterotachy, variation over time in the amino acid substitution process has so far received little attention. We show not only that this heterogeneity is widespread among animals, but also that its existence can harm the quality of phylogenomic inference. In the absence of an adequate model, removing heterogeneous columns, which are poorly handled by the model, can eliminate a reconstruction artefact. In a phylogenomic framework, the sequencing strategies used often mean that not all genes are present for all species. The controversy over the impact of the quantity of empty cells has recently been revived, but most studies of missing data are performed on small sets of simulated sequences. We therefore set out to quantify this impact in the case of a large alignment of real data. For a reasonable proportion of missing data, it appears that the incompleteness of the alignment affects the accuracy of the inference less than the choice of model does. Conversely, adding an incomplete sequence that breaks a long branch can restore, at least partially, an erroneous phylogeny. Because model violations remain the major limitation on the accuracy of phylogenetic inference, improving species and gene sampling is a useful alternative in the absence of an adequate model. We therefore developed sequence-selection software that builds reproducible datasets based on the quantity of data present, evolutionary rate and compositional bias. In the course of this study, we showed that human expertise still provides indispensable knowledge. The various analyses carried out for this thesis agree on the primary importance of the evolutionary model.
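    The column-filtering idea mentioned above can be sketched simply: score each alignment column by its amino-acid diversity (Shannon entropy) and drop the most heterogeneous columns. This is only an illustration of the principle; the toy alignment and the entropy cutoff below are invented, and the thesis's actual criterion for heterogeneous columns may differ.

    ```python
    # A minimal sketch of filtering heterogeneous alignment columns by Shannon
    # entropy; the alignment and cutoff are invented for illustration.
    import math
    from collections import Counter

    alignment = [  # one aligned amino-acid sequence per taxon
        "MKVLA",
        "MKVIA",
        "MRVLT",
        "MKVLG",
    ]

    def column_entropy(col):
        """Shannon entropy (bits) of the residue distribution in one column."""
        counts = Counter(col)
        n = len(col)
        return -sum((c / n) * math.log2(c / n) for c in counts.values())

    columns = list(zip(*alignment))
    entropies = [column_entropy(c) for c in columns]

    # keep columns whose entropy is at or below the cutoff
    cutoff = 1.0
    kept = [i for i, h in enumerate(entropies) if h <= cutoff]
    filtered = ["".join(seq[i] for i in kept) for seq in alignment]
    ```

    Here the last column, with three different residues across four taxa, exceeds the cutoff and is removed, which mirrors the abstract's point that discarding columns the model handles poorly can suppress reconstruction artefacts.
    
    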