Learning From Labeled And Unlabeled Data: An Empirical Study Across Techniques And Domains
There has been increased interest in devising learning techniques that
combine unlabeled data with labeled data, i.e., semi-supervised learning.
However, to the best of our knowledge, no study has been performed across
various techniques and different types and amounts of labeled and unlabeled
data. Moreover, most of the published work on semi-supervised learning
techniques assumes that the labeled and unlabeled data come from the same
distribution. It is possible for the labeling process to be associated with a
selection bias such that the distributions of data points in the labeled and
unlabeled sets are different. Not correcting for such bias can result in biased
function approximation with potentially poor performance. In this paper, we
present an empirical study of various semi-supervised learning techniques on a
variety of datasets. We attempt to answer various questions such as the effect
of independence or relevance amongst features, the effect of the size of the
labeled and unlabeled sets and the effect of noise. We also investigate the
impact of sample-selection bias on the semi-supervised learning techniques
under study and implement a bivariate probit technique particularly designed to
correct for such bias.
Customers Behavior Modeling by Semi-Supervised Learning in Customer Relationship Management
Leveraging the power of increasing amounts of data to analyze the customer base
in order to attract and retain the most valuable customers is a major problem
facing companies in this information age. Data mining technologies extract
hidden information and knowledge from large data stored in databases or data
warehouses, thereby supporting the corporate decision-making process. CRM uses
data mining techniques, which are one of its elements, to interact with customers.
This study investigates the use of one such technique, semi-supervised learning,
for the management and analysis of customer-related data warehouses and information.
The idea of semi-supervised learning is to learn not only from the labeled
training data, but to exploit also the structural information in additionally
available unlabeled data. The proposed semi-supervised method is a
feed-forward neural network (multi-layer perceptron) trained by the
back-propagation algorithm in order to predict the category of an unknown
(potential) customer. In addition, this technique can be used with
Rapid Miner tools for both labeled and unlabeled data.
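The general idea of learning from both labeled and unlabeled data can be illustrated with self-training (pseudo-labeling). The sketch below is not the neural-network method described in the abstract above; it swaps in a one-dimensional nearest-centroid classifier to keep the example short. The function names, the `margin` confidence rule, and the toy data are all assumptions made for illustration.

```python
def centroids(points, labels):
    """Return the mean of the points for each class label."""
    sums, counts = {}, {}
    for x, y in zip(points, labels):
        sums[y] = sums.get(y, 0.0) + x
        counts[y] = counts.get(y, 0) + 1
    return {y: sums[y] / counts[y] for y in sums}


def self_train(labeled_x, labeled_y, unlabeled_x, margin=1.0, max_rounds=10):
    """Self-training with a nearest-centroid classifier (illustrative only).

    An unlabeled point is pseudo-labeled when its nearest centroid is closer
    than the runner-up centroid by at least `margin` (a hypothetical
    confidence rule). Returns the final centroids and the points that were
    never labeled.
    """
    xs, ys = list(labeled_x), list(labeled_y)
    pool = list(unlabeled_x)
    for _ in range(max_rounds):
        cents = centroids(xs, ys)
        remaining = []
        for x in pool:
            ranked = sorted(cents, key=lambda y: abs(x - cents[y]))
            best = ranked[0]
            confident = (
                len(ranked) == 1
                or abs(x - cents[ranked[1]]) - abs(x - cents[best]) >= margin
            )
            if confident:
                xs.append(x)
                ys.append(best)
            else:
                remaining.append(x)
        if len(remaining) == len(pool):
            break  # no new pseudo-labels were added, so stop early
        pool = remaining
    return centroids(xs, ys), pool


if __name__ == "__main__":
    model, leftover = self_train([0.0, 10.0], ["a", "b"], [1.0, 9.0, 5.0])
    print(model, leftover)
```

In this toy run the points 1.0 and 9.0 are absorbed into classes "a" and "b", while the ambiguous point 5.0 stays unlabeled because it is equidistant from both centroids.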
Sentiment Analysis Using Machine Learning Techniques
Before buying a product, people usually visit various shops in the market, ask about the product, its cost, and its warranty, and then buy the product based on the opinions they received on cost and quality of service. This process is time-consuming, and the chance of being cheated by the seller is higher because nobody guides the buyer to where an authentic product can be bought at a fair price. Nowadays, however, a good number of people depend on the online market to buy the products they need. This is because information about the products is available from multiple sources; online buying is also comparatively cheap and offers home delivery. Before placing an order for any product, customers very often refer to the comments or reviews of current users of the product, which help them judge the quality of the product as well as the service provided by the seller. Similarly, quite a few specialists in the field of movies watch a movie and then comment on its quality, i.e., whether or not to watch it, or rate it on a five-star scale. These reviews are mainly in text format and are sometimes hard to understand. Thus, they need to be processed appropriately to obtain meaningful information. Classification of these reviews is one approach to extracting knowledge from them. In this thesis, different machine learning techniques are used to classify the reviews. Simulations and experiments are carried out to evaluate the performance of the proposed classification methods. Researchers have often considered two review datasets for sentiment classification, namely the aclIMDb and Polarity datasets. The IMDb dataset is divided into training and testing data.
Thus, the training data are used to train the machine learning algorithms, and the testing data are used to evaluate the trained models. The Polarity dataset, on the other hand, does not have separate data for training and testing, so k-fold cross-validation is used to classify its reviews. Four different machine learning techniques (MLTs), viz., Naive Bayes (NB), Support Vector Machine (SVM), Random Forest (RF), and Linear Discriminant Analysis (LDA), are used for the classification of these movie reviews. Different performance evaluation parameters are used to evaluate the performance of the machine learning techniques. It is observed that, among the above four machine learning algorithms, the RF technique yields the most accurate classification results. Secondly, n-gram based classification of reviews is carried out on the aclIMDb dataset..
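The k-fold cross-validation procedure mentioned above, for a dataset with no fixed train/test split, can be sketched as follows. This is a minimal illustration in plain Python, not the thesis's implementation: the helper names `k_fold_indices` and `cross_validate`, and the `fit`/`predict` callables, are hypothetical, and the folds are contiguous, so real data would normally be shuffled first.

```python
from typing import Callable, List, Sequence, Tuple


def k_fold_indices(n_samples: int, k: int) -> List[Tuple[List[int], List[int]]]:
    """Split range(n_samples) into k contiguous folds.

    Returns one (train_indices, test_indices) pair per fold. The first
    n_samples % k folds receive one extra sample.
    """
    if not 2 <= k <= n_samples:
        raise ValueError("k must be between 2 and n_samples")
    base, extra = divmod(n_samples, k)
    splits = []
    start = 0
    for fold in range(k):
        size = base + (1 if fold < extra else 0)
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        splits.append((train, test))
        start += size
    return splits


def cross_validate(
    samples: Sequence,
    labels: Sequence,
    k: int,
    fit: Callable,
    predict: Callable,
) -> float:
    """Return the mean accuracy over k folds.

    `fit(train_samples, train_labels)` returns a model and
    `predict(model, sample)` returns a label; both are supplied by the
    caller (for example, wrappers around an NB, SVM, RF, or LDA model).
    """
    accuracies = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx], [labels[i] for i in train_idx])
        correct = sum(predict(model, samples[i]) == labels[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    return sum(accuracies) / k
```

Every review is used for testing exactly once and for training in the remaining k - 1 folds, which is why the scheme suits a dataset that ships without a separate test set.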
Transfer Learning using Computational Intelligence: A Survey
Transfer learning aims to provide a framework to utilize previously acquired knowledge to solve new but similar problems much more quickly and effectively. In contrast to classical machine learning methods, transfer learning methods exploit the knowledge accumulated from data in auxiliary domains to facilitate predictive modeling consisting of different data patterns in the current domain. To improve the performance of existing transfer learning methods and handle the knowledge transfer process in real-world systems, ..
Using various Natural Language Processing Techniques to Automate Information Retrieval
Natural Language Processing (NLP) provides numerous benefits, including the understanding and analysis of unstructured data, as well as the efficient and precise automation of real-time processes. Although NLP began in the 1940s, the importance of having an application that uses the benefits of NLP has never been greater than in the last two decades. This is because as the number of people who have access to the internet or digital devices grows, so does the size of the data collected. Thus, NLP and automated processes play a significant role in the quality and performance of services that users encounter.
Datasets are not always structured or automated. This is due to the size of the data or the companies' age in terms of data collection. Several studies have shown that unstructured data contains useful information that, when managed properly, can point businesses in the right direction. To address these issues, it is critical to combine NLP and Machine Learning (ML) or Deep Learning (DL) algorithms. In other words, algorithms can deal with structured, unstructured, or both types of data. The algorithms' contribution is to automatically learn the language pattern in the given text and use that pattern to identify the unseen or validation data. Hyperparameter optimization is also performed for both supervised and unsupervised machine learning to make the algorithms as flexible as possible while achieving the desired results.
The goal of this thesis is to develop an automated system that classifies files using various NLP techniques in conjunction with the ML/DL algorithm that produces the best performance results. Autility AS is a young company focused on the digitalization of buildings. There are thousands of structured and unstructured files in Autility. Autility intends to use an automated system to extract information and classify files based on the system code list labeled "SYSTEMKODELISTE NS3451". The "SYSTEMKODELISTE NS3451" is the "backbone" for the entire system creation process. The first part of the main "SYSTEMKODELISTE NS3451" from Norwegian Statsbygg is shown in figure 1. Only 12 rows of the standard "SYSTEMKODELISTE NS3451" are displayed.
The labeled dataset produces models with an average accuracy of roughly 85%. However, because the dataset contains far more unstructured files than structured files, research into algorithms that handle both structured and unstructured data is critical. Because many of the files contained drawings of buildings and pictures, the results of semi-supervised algorithms indicated the importance of formal language. To ensure consistent performance and a system with less overfitting, text augmentation and hyperparameter tuning are used. The assumptions made and the challenges faced are documented throughout this project. A few algorithms are presented in detail, along with their theoretical and mathematical concepts.
Adaptive Semi-supervised Learning for Cross-domain Sentiment Classification
We consider the cross-domain sentiment classification problem, where a
sentiment classifier is to be learned from a source domain and to be
generalized to a target domain. Our approach explicitly minimizes the distance
between the source and the target instances in an embedded feature space. With
the difference between source and target minimized, we then exploit additional
information from the target domain by consolidating the idea of semi-supervised
learning, for which we jointly employ two regularizations -- entropy
minimization and self-ensemble bootstrapping -- to incorporate the unlabeled
target data for classifier refinement. Our experimental results demonstrate
that the proposed approach can better leverage unlabeled data from the target
domain and achieve substantial improvements over baseline methods in various
experimental settings. Comment: Accepted to EMNLP201
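The entropy-minimization regularizer named in the abstract above can be illustrated with a small numeric sketch. It assumes the quantity being penalized is the mean Shannon entropy of the classifier's predicted class distribution over unlabeled target examples. The function names and the logits are illustrative and not taken from the paper, whose exact loss formulation is not given here.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class scores."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_prediction_entropy(batch_logits: Sequence[Sequence[float]]) -> float:
    """Mean Shannon entropy (in nats) of the predicted class distributions.

    Adding this value to the training loss pushes the classifier toward
    confident predictions on unlabeled examples.
    """
    total = 0.0
    for logits in batch_logits:
        probs = softmax(logits)
        total += -sum(p * math.log(p) for p in probs if p > 0.0)
    return total / len(batch_logits)


if __name__ == "__main__":
    uncertain = [[0.0, 0.0]]    # uniform prediction over two classes
    confident = [[10.0, -10.0]]  # strongly prefers the first class
    print(mean_prediction_entropy(uncertain))
    print(mean_prediction_entropy(confident))
```

A uniform two-class prediction gives the maximum entropy, ln 2, while a confident prediction gives a value near zero, so minimizing this term favors decision boundaries that avoid dense regions of unlabeled target data.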