
    Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains

    There has been increased interest in devising learning techniques that combine unlabeled data with labeled data, i.e., semi-supervised learning. However, to the best of our knowledge, no study has been performed across various techniques and different types and amounts of labeled and unlabeled data. Moreover, most of the published work on semi-supervised learning techniques assumes that the labeled and unlabeled data come from the same distribution. It is possible for the labeling process to be associated with a selection bias such that the distributions of data points in the labeled and unlabeled sets are different. Not correcting for such bias can result in biased function approximation with potentially poor performance. In this paper, we present an empirical study of various semi-supervised learning techniques on a variety of datasets. We attempt to answer various questions, such as the effect of independence or relevance amongst features, the effect of the size of the labeled and unlabeled sets, and the effect of noise. We also investigate the impact of sample-selection bias on the semi-supervised learning techniques under study and implement a bivariate probit technique particularly designed to correct for such bias.

    Customers Behavior Modeling by Semi-Supervised Learning in Customer Relationship Management

    Leveraging the power of increasing amounts of data to analyze the customer base, in order to attract and retain the most valuable customers, is a major problem facing companies in this information age. Data mining technologies extract hidden information and knowledge from large volumes of data stored in databases or data warehouses, thereby supporting the corporate decision-making process. CRM uses data mining techniques (one of the elements of CRM) to interact with customers. This study investigates the use of one such technique, semi-supervised learning, for the management and analysis of customer-related data warehouses and information. The idea of semi-supervised learning is to learn not only from the labeled training data but also to exploit the structural information in additionally available unlabeled data. The proposed semi-supervised method is a model based on a feed-forward neural network trained by a back-propagation algorithm (multi-layer perceptron) in order to predict the category of an unknown (potential) customer. In addition, this technique can be used with Rapid Miner tools for both labeled and unlabeled data.

    Sentiment Analysis Using Machine Learning Techniques

    Before buying a product, people usually visit various shops in the market, ask about the product, its cost, and its warranty, and then finally buy the product based on the opinions they received on cost and quality of service. This process is time-consuming, and the chances of being cheated by the seller are higher, as there is nobody to advise the buyer where to get an authentic product at a proper cost. Nowadays, however, a good number of people depend on the online market to buy the products they require. This is because information about the products is available from multiple sources; online buying is thus comparatively cheap and also offers home delivery. Moreover, before placing an order for any product, customers very often refer to the comments or reviews of current users of the product, which help them decide on the quality of the product as well as the service provided by the seller. Similarly, there are quite a few specialists in the field of movies who watch a movie and then comment on its quality, i.e., whether or not to watch the movie, or rate it on a five-star scale. These reviews are mainly in text format and are sometimes tough to understand. Thus, these reviews need to be processed appropriately to obtain meaningful information. Classification of these reviews is one approach to extracting knowledge from them. In this thesis, different machine learning techniques are used to classify the reviews. Simulations and experiments are carried out to evaluate the performance of the proposed classification methods. It is observed that a good number of researchers have considered two different review datasets for sentiment classification, namely the aclIMDb and Polarity datasets. The IMDb dataset is divided into training and testing data.
Thus, the training data are used to train the machine learning algorithms, and the testing data are used to evaluate the trained models. On the other hand, the Polarity dataset does not have separate data for training and testing. Thus, the k-fold cross-validation technique is used to classify its reviews. Four different machine learning techniques (MLTs), viz. Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and Linear Discriminant Analysis (LDA), are used for the classification of these movie reviews. Different performance evaluation parameters are used to evaluate the performance of the machine learning techniques. It is observed that, among the above four machine learning algorithms, the RF technique yields classification results with higher accuracy. Secondly, n-gram based classification of reviews is carried out on the aclIMDb dataset..
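
The k-fold cross-validation mentioned in the abstract above can be sketched as follows. This is a minimal illustration using only the Python standard library, not the thesis's own code: the names `k_fold_indices`, `cross_validate`, `fit`, and `predict` are hypothetical, and the sketch assumes the reviews have already been shuffled, since it splits them in order.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split indices 0..n_samples-1 into k (train, test) pairs of near-equal size."""
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    folds = []
    start = 0
    for i in range(k):
        # The first `extra` folds receive one additional sample each.
        size = base + (1 if i < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        folds.append((train, test))
        start += size
    return folds


def cross_validate(
    features: Sequence,
    labels: Sequence,
    k: int,
    fit: Callable,
    predict: Callable,
) -> float:
    """Return the mean test accuracy over k folds.

    `fit(train_features, train_labels)` returns a model and
    `predict(model, feature)` returns a label; both are supplied by the caller.
    """
    accuracies = []
    for train, test in k_fold_indices(len(labels), k):
        model = fit([features[i] for i in train], [labels[i] for i in train])
        correct = sum(predict(model, features[i]) == labels[i] for i in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / k
```

Any classifier, such as the NB, SVM, RF, or LDA models named in the abstract, could be plugged in through the `fit` and `predict` callables. Every review is used for testing exactly once, which is why the approach suits a dataset without a predefined train/test split.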

    Transfer Learning using Computational Intelligence: A Survey

    Transfer learning aims to provide a framework to utilize previously-acquired knowledge to solve new but similar problems much more quickly and effectively. In contrast to classical machine learning methods, transfer learning methods exploit the knowledge accumulated from data in auxiliary domains to facilitate predictive modeling consisting of different data patterns in the current domain. To improve the performance of existing transfer learning methods and handle the knowledge transfer process in real-world systems, ..

    Using various Natural Language Processing Techniques to Automate Information Retrieval

    The existence of Natural Language Processing (NLP) provides numerous benefits, including the understanding and analysis of unstructured data, as well as the efficient and precise automation of real-time processes. Despite the fact that NLP began in the 1940s, the importance of having an application that uses the benefits of NLP has never been greater than in the last two decades. This is because as the number of people who have access to the internet or digital devices grows, so does the size of the data collected. Thus, NLP and automated processes play a significant role in the quality and performance of services that users encounter. Datasets are not always structured or automated. This is due to the size of the data or the companies' age in terms of data collection. Several studies have shown that unstructured data contains useful information that, when managed properly, can point businesses in the right direction. To address these issues, it is critical to combine NLP and Machine Learning (ML) or Deep Learning (DL) algorithms. In other words, algorithms can deal with structured, unstructured, or both types of data. The algorithms' contributions are to automatically learn the language pattern in the given text and use that pattern to identify the unseen or validation data. Hyperparameter optimization is also performed in both supervised and unsupervised types of machine learning to make the algorithms as flexible as possible while achieving the desired results. The goal of this thesis is to develop an automated system that classifies files using various NLP techniques in conjunction with the ML/DL algorithm that produces the best performance results. Autility AS is a young company focused on the digitalization of buildings. There are thousands of structured and unstructured files in Autility. Autility intends to use an automated system to extract information and classify files based on the system-code labeled "SYSTEMKODELISTE NS3451".
The "SYSTEMKODELISTE NS3451" is the "backbone" for the entire system creation process. The first part of the main "SYSTEMKODELISTE NS3451" from Norwegian Statsbygg is shown in figure 1. Only 12 rows of the standard "SYSTEMKODELISTE NS3451" are displayed. The labeled dataset produces models with an average accuracy of roughly 85%. However, because the dataset contains far more unstructured files than structured files, research into algorithms that handle both structured and unstructured data is critical. Because many of the files contained drawings of buildings and pictures, the results of semi-supervised algorithms indicated the importance of formal language. To ensure consistent performance and a system with less overfitting, text augmentation and hypertunneling are used. The assumptions made and the challenges faced are documented throughout this project. A few algorithms are presented in detail, along with their theoretical and mathematical concepts.

    Adaptive Semi-supervised Learning for Cross-domain Sentiment Classification

    We consider the cross-domain sentiment classification problem, where a sentiment classifier is to be learned from a source domain and to be generalized to a target domain. Our approach explicitly minimizes the distance between the source and the target instances in an embedded feature space. With the difference between source and target minimized, we then exploit additional information from the target domain by consolidating the idea of semi-supervised learning, for which we jointly employ two regularizations, entropy minimization and self-ensemble bootstrapping, to incorporate the unlabeled target data for classifier refinement. Our experimental results demonstrate that the proposed approach can better leverage unlabeled data from the target domain and achieve substantial improvements over baseline methods in various experimental settings. Comment: Accepted to EMNLP201
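
The entropy-minimization regularizer named in the abstract above can be illustrated with a short sketch. The abstract gives no formula, so this assumes the common form of the idea: the mean Shannon entropy of the classifier's predicted class distribution on unlabeled target instances, which is added to the training loss so that confident predictions are rewarded. The function names are hypothetical, and self-ensemble bootstrapping is not shown.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw class scores into probabilities (shifted for numerical stability)."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_prediction_entropy(batch_logits: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) of the predicted distributions for a batch.

    Lower values mean the classifier is more confident on these unlabeled inputs.
    """
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        # Skip zero probabilities, whose contribution p * log(p) tends to 0.
        total -= sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_logits)
```

A prediction split evenly between two classes gives the maximum entropy of log 2, while a confident prediction gives a value near zero. Minimizing this quantity on target-domain data therefore pushes the decision boundary away from regions where unlabeled examples are dense.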