5 research outputs found

    Cross Lingual Sentiment Analysis: A Clustering-Based Bee Colony Instance Selection and Target-Based Feature Weighting Approach

    The lack of sentiment resources in resource-poor languages poses challenges for machine-learning-based sentiment analysis. Cross-lingual and semi-supervised learning approaches are the most common ways to overcome this issue. However, the performance of existing methods degrades due to the poor quality of translated resources, data sparseness and, more specifically, language divergence. An integrated learning model that combines a semi-supervised model and an ensemble model while utilizing the available sentiment resources to tackle language-divergence-related issues is proposed. Additionally, to reduce the impact of translation errors and handle the instance selection problem, we propose a clustering-based bee-colony sample selection method for the optimal selection of the most distinguishing features representing the target data. To evaluate the proposed model, various experiments are conducted on an English-Arabic cross-lingual data set. Simulation results demonstrate that the proposed model outperforms the baseline approaches in terms of classification performance. Furthermore, the statistical outcomes indicate the advantages of the proposed training data sampling and target-based feature selection in reducing the negative effect of translation errors. These results show that the proposed approach achieves performance close to that of in-language supervised models.

    A Roadmap for Natural Language Processing Research in Information Systems

    Natural Language Processing (NLP) is now widely integrated into web and mobile applications, enabling natural interactions between humans and computers. Although many NLP studies have been published, none have comprehensively reviewed or synthesized the tasks most commonly addressed in NLP research. We conduct a thorough review of Information Systems (IS) literature to assess the current state of NLP research, and identify 12 prototypical tasks that are widely researched. Our analysis of 238 articles published in IS journals between 2004 and 2015 shows an increasing trend in NLP research, especially since 2011. Based on our analysis, we propose a roadmap for NLP research, and detail how it may be useful to guide future NLP research in IS. In addition, we employ Association Rules (AR) mining to investigate the co-occurrence of prototypical tasks and discuss insights from the findings.
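    The co-occurrence analysis mentioned in this abstract can be illustrated with a minimal sketch of pairwise association rules. Everything below is an assumption for illustration: the function name `pairwise_rules`, the thresholds, and the toy article data are hypothetical and are not taken from the paper, which does not state its mining parameters here.

```python
from itertools import combinations


def pairwise_rules(transactions, min_support=0.3, min_confidence=0.6):
    """Return (antecedent, consequent, support, confidence) for item pairs.

    support(A, B)     = share of transactions containing both A and B
    confidence(A -> B) = support(A, B) / support(A)
    """
    n = len(transactions)
    item_count = {}
    pair_count = {}
    for items in transactions:
        unique = sorted(set(items))
        for item in unique:
            item_count[item] = item_count.get(item, 0) + 1
        for a, b in combinations(unique, 2):
            pair_count[(a, b)] = pair_count.get((a, b), 0) + 1

    rules = []
    for (a, b), count in pair_count.items():
        support = count / n
        if support < min_support:
            continue
        # Each frequent pair can yield a rule in both directions.
        for lhs, rhs in ((a, b), (b, a)):
            confidence = count / item_count[lhs]
            if confidence >= min_confidence:
                rules.append((lhs, rhs, support, confidence))
    return sorted(rules)


# Hypothetical data: each set lists the NLP tasks one article addresses.
articles = [
    {"sentiment analysis", "text classification"},
    {"sentiment analysis", "text classification", "topic modeling"},
    {"topic modeling", "information extraction"},
    {"sentiment analysis", "text classification"},
]
rules = pairwise_rules(articles)
for lhs, rhs, support, confidence in rules:
    print(f"{lhs} -> {rhs}: support={support:.2f}, confidence={confidence:.2f}")
```

    In this toy data only the sentiment analysis and text classification pair clears the support threshold, so the sketch reports one rule in each direction. A full study would typically also mine larger itemsets, which this sketch omits.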

    Cross-lingual sentiment classification using semi-supervised learning

    Cross-lingual sentiment classification aims to utilize annotated sentiment resources in one language for text sentiment classification in another language. Automatic machine translation services are the most commonly used tools to directly project information from one language into another. However, differing term distributions between translated and original documents, translation errors, and the different intrinsic structure of documents in various languages lead to low performance in sentiment classification. Furthermore, because different languages use different linguistic terms, translated documents cannot cover all the vocabulary that exists in the original documents. The aim of this thesis is to propose an enhanced framework for cross-lingual sentiment classification that overcomes the aforementioned problems in order to improve classification performance. A combination of active learning and semi-supervised learning in both single-view and bi-view frameworks is proposed to incorporate unlabelled data from the target language in order to reduce term distribution divergence. Using bi-view documents can partially alleviate the negative effects of translation errors. Multi-view semi-supervised learning is also used to overcome the problem of low term coverage by employing multiple source languages. Features extracted from multiple source languages can cover more of the vocabulary in the test data and, consequently, more sentiment terms can be used in the classification process. Content similarities between labelled and unlabelled documents are used in a graph-based semi-supervised learning approach to incorporate the structure of documents in the target language into the learning process. Performance evaluation on sentiment data sets in four different languages confirms the effectiveness of the proposed approaches in comparison to well-known baseline classification methods.
The experiments show that incorporating unlabelled data from the target language can effectively improve classification performance. Experimental results also show that using multiple source languages in the multi-view learning model outperforms other methods. The proposed framework is flexible enough to be applied to any new language, and therefore it can be used to develop multilingual sentiment analysis systems.

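    Évaluation et amélioration du rendement de la formation en entreprise : vers une démarche basée sur la gestion des processus d’affaires [Evaluating and improving the return on corporate training: toward an approach based on business process management]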

    Training is a key strategy to develop employees’ skills. Businesses continue to invest in training and development, but they rarely have data to evaluate the results of this investment. Most companies use the Kirkpatrick/Phillips model to evaluate training. However, it emerges from the literature that companies have difficulties in using this model. There are three main barriers to the evaluation of training. The first barrier is the difficulty of isolating learning as a factor that affects the results. Another barrier is the lack of a useful assessment IT tool integrated with the learning management system (LMS), and the third barrier is the lack of standardized data to compare various learning functions. In this thesis, we propose a model to manage training projects in enterprises (Analysis, Modelling, Monitoring and Optimization - AM2O), based on Business Process Management (BPM). Such a scenario considers training activities as business processes. Our model is inspired by the BPM method through the definition and monitoring of performance indicators for managing all aspects of training.
It is based on the analysis and modeling of training needs to ensure alignment between training activities and business objectives. It allows the monitoring of training projects as well as the calculation of tangible and intangible benefits (without additional cost). In addition, it allows us to classify training projects according to criteria relative to the enterprise. Thus, given enough data, our approach could be used to optimize the yield of training through a series of simulations using machine learning algorithms: logistic regression, neural networks and co-training. Finally, we develop an IT tool, Enterprise Training Programs Evaluation and Optimization System - ETREOSys, to manage training programs in enterprises and to support decision-making. ETREOSys is a Web platform which uses cloud services such as virtual machines, data centers, a Content Delivery Network and NoSQL databases. AM2O and ETREOSys resolve the main problems related to the management and evaluation of training in enterprises, namely the difficulty of isolating the effects of training and the lack of IT tools.

    Semi-supervised machine learning techniques for classification of evolving data in pattern recognition

    The amount of data recorded and processed over recent years has increased exponentially. To create intelligent systems that can learn from this data, we need to be able to identify patterns hidden in the data itself, learn these patterns and predict future results based on our current observations. If we think about this system in the context of time, the data itself evolves and so does the nature of the classification problem. As more data become available, different classification algorithms are suitable for a particular setting. At the beginning of the learning cycle when we have a limited amount of data, online learning algorithms are more suitable. When truly large amounts of data become available, we need algorithms that can handle large amounts of data that might be only partially labeled as a result of the bottleneck in the learning pipeline from human labeling of the data. An excellent example of evolving data is gesture recognition, and it is present throughout our work. We need a gesture recognition system to work fast and with very few examples at the beginning. Over time, we are able to collect more data and the system can improve. As the system evolves, the user expects it to work better and not to have to become involved when the classifier is unsure about decisions. This latter situation produces additional unlabeled data. Another example of an application is medical classification, where experts’ time is a scarce resource and the amount of received data grows disproportionately to the amount of labeled data over time. Although the process of data evolution is continuous, we identify three main discrete areas of contribution in different scenarios. When the system is very new and not enough data are available, online learning is used to learn after every single example and to capture the knowledge very fast. With increasing amounts of data, offline learning techniques are applicable. 
Once the amount of data is overwhelming and the teacher cannot provide labels for all the data, we have another setup that combines labeled and unlabeled data. These three setups define our areas of contribution; and our techniques contribute in each of them with applications to pattern recognition scenarios, such as gesture recognition and sketch recognition. An online learning setup significantly restricts the range of techniques that can be used. In our case, the selected baseline technique is the Evolving TS-Fuzzy Model. The semi-supervised aspect we use is a relation between rules created by this model. Specifically, we propose a transductive similarity model that utilizes the relationship between generated rules based on their decisions about a query sample during the inference time. The activation of each of these rules is adjusted according to the transductive similarity, and the new decision is obtained using the adjusted activation. We also propose several new variations to the transductive similarity itself. Once the amount of data increases, we are not limited to the online learning setup, and we can take advantage of the offline learning scenario, which normally performs better than the online one because of the independence of sample ordering and global optimization with respect to all samples. We use generative methods to obtain data outside of the training set. Specifically, we aim to improve the previously mentioned TS Fuzzy Model by incorporating semi-supervised learning in the offline learning setup without unlabeled data. We use the Universum learning approach and have developed a method called UFuzzy. This method relies on artificially generated examples with high uncertainty (Universum set), and it adjusts the cost function of the algorithm to force the decision boundary to be close to the Universum data. 
We were able to prove the hypothesis behind the design of the UFuzzy classifier that Universum learning can improve the TS Fuzzy Model and have achieved improved performance on more than two dozen datasets and applications. With increasing amounts of data, we use the last scenario, in which the data comprises both labeled data and additional non-labeled data. This setting is one of the most common ones for semi-supervised learning problems. In this part of our work, we aim to improve the widely popular techniques of self-training (and its successor help-training) that are both meta-frameworks over regular classifier methods but require probabilistic representation of output, which can be hard to obtain in the case of discriminative classifiers. Therefore, we develop a new algorithm that uses the modified active learning technique Query-by-Committee (QbC) to sample data with high certainty from the unlabeled set and subsequently embed them into the original training set. Our new method allows us to achieve increased performance over both a range of datasets and a range of classifiers. These three works are connected by gradually relaxing the constraints on the learning setting in which we operate. Although our main motivation behind the development was to increase performance in various real-world tasks (gesture recognition, sketch recognition), we formulated our work as general methods in such a way that they can be used outside a specific application setup, the only restriction being that the underlying data evolve over time. Each of these methods can successfully exist on its own. The best setting in which they can be used is a learning problem where the data evolve over time and it is possible to discretize the evolutionary process. Overall, this work represents a significant contribution to the area of both semi-supervised learning and pattern recognition. 
It presents new state-of-the-art techniques that outperform baseline solutions, and it opens up new possibilities for future research.
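    The committee-based sampling idea described in this abstract (selecting unlabeled samples with high certainty and adding them to the training set) can be sketched roughly as follows. This is not the thesis's algorithm: it assumes, purely for illustration, that "high certainty" means unanimous agreement among committee members, and it uses simple nearest-centroid classifiers on different feature subsets as the committee. The names `centroid_classifier`, `committee_self_training` and `feature_views` are hypothetical.

```python
def centroid_classifier(X, y, features):
    """Fit a nearest-centroid classifier restricted to the given feature indices."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(features))
        for k, f in enumerate(features):
            acc[k] += row[f]
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        def squared_distance(c):
            return sum((row[f] - centroids[c][k]) ** 2
                       for k, f in enumerate(features))
        # sorted() makes tie-breaking deterministic.
        return min(sorted(centroids), key=squared_distance)

    return predict


def committee_self_training(X_lab, y_lab, X_unlab, feature_views, rounds=5):
    """Move unlabeled rows into the training set when the committee is unanimous."""
    X_lab, y_lab, pool = list(X_lab), list(y_lab), list(X_unlab)
    for _ in range(rounds):
        # Assumption: one committee member per feature subset.
        committee = [centroid_classifier(X_lab, y_lab, v) for v in feature_views]
        remaining, added = [], False
        for row in pool:
            votes = {member(row) for member in committee}
            if len(votes) == 1:  # unanimous vote is treated as high certainty
                X_lab.append(row)
                y_lab.append(votes.pop())
                added = True
            else:
                remaining.append(row)
        pool = remaining
        if not added or not pool:
            break
    return X_lab, y_lab, pool


# Toy data (hypothetical): two labeled points and three unlabeled ones.
X_new, y_new, leftover = committee_self_training(
    X_lab=[(0.0, 0.0), (1.0, 1.0)],
    y_lab=[0, 1],
    X_unlab=[(0.1, 0.2), (0.9, 0.8), (0.1, 0.9)],
    feature_views=[(0,), (1,), (0, 1)],
)
print(y_new, leftover)
```

    The ambiguous point `(0.1, 0.9)` stays in the unlabeled pool because the single-feature members disagree on it, which is the behavior a certainty-based sampler is meant to have. Note that no probabilistic output is needed from the base classifiers, only their votes.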