PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT
This study provides an efficient approach for using text data to calculate
patent-to-patent (p2p) technological similarity, and presents a hybrid
framework for leveraging the resulting p2p similarity for applications such as
semantic search and automated patent classification. We create embeddings using
Sentence-BERT (SBERT) based on patent claims. We leverage SBERT's efficiency in
creating embedding distance measures to map p2p similarity in large sets of
patent data. We deploy our framework for classification with a simple K-Nearest
Neighbors (KNN) model that predicts Cooperative Patent Classification (CPC) of
a patent based on the class assignment of the K patents with the highest p2p
similarity. We thereby validate that the p2p similarity captures their
technological features in terms of CPC overlap, and at the same time demonstrate the
usefulness of this approach for automatic patent classification based on text
data. Furthermore, the presented classification framework is simple and the
results easy to interpret and evaluate by end-users. In the out-of-sample model
validation, we are able to perform a multi-label prediction of all assigned CPC
classes on the subclass (663) level on 1,492,294 patents with an accuracy of
54% and F1 score > 66%, which suggests that our model outperforms the current
state-of-the-art in text-based multi-label and multi-class patent
classification. We furthermore discuss the applicability of the presented
framework for semantic IP search, patent landscaping, and technology
intelligence. We finally point towards a future research agenda for leveraging
multi-source patent embeddings, their appropriateness across applications, as
well as to improve and validate patent embeddings by creating domain-expert
curated Semantic Textual Similarity (STS) benchmark datasets.
Comment: 18 pages, 7 figures and 4 tables
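The nearest-neighbour classification step described in this abstract can be illustrated with a minimal sketch. It assumes cosine similarity as the embedding distance and a simple vote threshold (`min_share`, a hypothetical parameter) to decide which labels of the K neighbours are kept; the abstract does not specify either choice. The three-dimensional vectors and CPC labels are toy data, not SBERT output.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_labels(query, corpus, k=3, min_share=0.5):
    """Return labels carried by at least `min_share` of the k most similar items.

    corpus: list of (embedding, set_of_labels) pairs.
    The vote threshold is an assumption made for this sketch.
    """
    ranked = sorted(corpus, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(label for _, labels in neighbours for label in labels)
    return {label for label, n in votes.items() if n / len(neighbours) >= min_share}

# Toy "embeddings" with made-up CPC subclass assignments.
corpus = [
    ([1.0, 0.1, 0.0], {"G06F"}),
    ([0.9, 0.2, 0.0], {"G06F", "G06N"}),
    ([0.8, 0.0, 0.1], {"G06N"}),
    ([0.0, 1.0, 0.9], {"A61K"}),
]
predicted = predict_labels([1.0, 0.1, 0.05], corpus, k=3)
print(predicted)
```

Because the prediction is just a vote over retrieved neighbours, an end-user can inspect the K most similar patents to see why a class was assigned, which matches the interpretability argument made in the abstract.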
Productivity Measurement of Call Centre Agents using a Multimodal Classification Approach
Call centre channels play a cornerstone role in business communications and transactions, especially in challenging business situations. Operational efficiency, service quality, and resource productivity are core aspects of a call centre's competitive advantage in fast-moving markets. Performance evaluation in call centres is challenging because human evaluation is subjective, a massive volume of calls must be sorted manually, and evaluations are inconsistent across raters. These challenges reduce operational efficiency and lead to frustrated customers. This study aims to automate performance evaluation in call centres using various deep learning approaches. Calls recorded in a call centre are modelled and classified as high- or low-performance, that is, as productive or nonproductive calls.
The proposed conceptual model uses a deep learning network to model the recorded calls as text and speech. It is based on the following: 1) focus on the technical part of agent performance, 2) objective evaluation of the corpus, 3) extension of features for both text and speech, and 4) combination of the best-performing text and speech models in a multimodal structure. Accordingly, a diarisation algorithm separates the parts of the call where the agent is talking from those where the customer is talking. Manual annotation is also necessary to divide the modelling corpus into productive and nonproductive calls (supervised training). Krippendorff’s alpha was applied to limit subjectivity in the manual annotation. Arabic speech recognition is then developed to transcribe the speech into text. The text features are the words embedded using the embedding layer. For the speech features, several configurations use Mel Frequency Cepstral Coefficients (MFCC) augmented with Low-Level Descriptors (LLD) to improve classification accuracy. The data modelling architectures for speech and text are based on CNNs, BiLSTMs, and the attention layer. The multimodal approach builds on the resulting models and improves accuracy by concatenating the text and speech models using the joint representation methodology.
The main contributions of this thesis are:
• Developing an Arabic Speech recognition method for automatic transcription of speech into text.
• Designing several DNN architectures to improve performance evaluation using speech features based on MFCC and LLD.
• Developing a Max Weight Similarity (MWS) function to outperform the SoftMax function used in the attention layer.
• Proposing a multimodal approach for combining the text and speech models for best performance evaluation.
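The abstract above mentions Krippendorff’s alpha as the check on annotation subjectivity. A minimal sketch of the standard formula for nominal labels follows; it assumes each call is labelled by two or more annotators with categories such as productive and nonproductive. The function name and the toy labels are illustrative, and the thesis's actual annotation setup is not described in the abstract.

```python
from collections import Counter
from itertools import permutations

def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units: list of lists; each inner list holds the labels given to one item.
    Items with fewer than two labels are skipped.
    """
    coincidences = Counter()
    for labels in units:
        m = len(labels)
        if m < 2:
            continue  # a single rating carries no agreement information
        # Every ordered pair of ratings from different raters, weighted by 1/(m-1).
        for a, b in permutations(labels, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one item with two or more labels")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(totals[a] * totals[b]
                   for a in totals for b in totals if a != b) / (n - 1)
    if expected == 0:
        return 1.0  # only one category was ever used
    return 1.0 - observed / expected

# Toy annotations: two raters per call, "p" = productive, "n" = nonproductive.
agree = krippendorff_alpha_nominal([["p", "p"], ["n", "n"], ["p", "p"]])
disagree = krippendorff_alpha_nominal([["p", "n"], ["p", "n"]])
print(agree, disagree)
```

Values near 1 indicate that annotators agree far more than chance would predict, while values at or below 0 indicate agreement no better than chance.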
Consistency and trends of technological innovations: a network approach to the international patent classification data
Classifying patents by the technology areas they pertain to is important to enable information search and facilitate policy analysis and socio-economic studies. Based on the OECD Triadic Patent Family database, this study constructs a cohort network based on the grouping of IPC subclasses in the same patent families, and a citation network based on citations between subclasses of patent families citing each other. This paper presents a systematic analysis approach which obtains naturally formed network clusters identified using a Lumped Markov Chain method, extracts community keys traceable over time, and investigates two important community characteristics: consistency and changing trends. The results are verified against several other methods, including a recent study measuring patent text similarity. The proposed method contributes to the literature a network-based approach to study the endogenous community properties of an exogenously devised classification system. The application of this method may improve the accuracy and efficiency of the IPC search platform and help detect the emergence of new technologies.
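The cohort network described in this abstract links IPC subclasses that appear in the same patent family. A minimal sketch of that construction follows, assuming that an edge's weight is the number of families in which both subclasses occur; the abstract does not state the weighting scheme, and the family data below is made up. The Lumped Markov Chain clustering step is not shown.

```python
from collections import Counter
from itertools import combinations

def cohort_edges(families):
    """Count, for every pair of subclasses, the families in which both appear.

    families: iterable of lists of IPC subclass codes, one list per patent family.
    Returns a Counter keyed by alphabetically ordered (subclass, subclass) pairs.
    """
    edges = Counter()
    for subclasses in families:
        # Deduplicate within a family, then link every unordered pair once.
        for a, b in combinations(sorted(set(subclasses)), 2):
            edges[(a, b)] += 1
    return edges

# Hypothetical patent families with their IPC subclasses.
families = [
    ["H04L", "G06F"],
    ["G06F", "H04L", "H04W"],
    ["A61K", "A61P"],
]
edges = cohort_edges(families)
for pair, weight in sorted(edges.items()):
    print(pair, weight)
```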
An HMM-based synthetic view generator to improve the efficiency of ensemble systems
One of the most active areas of research in semi-supervised learning has been to study methods for constructing good ensembles of classifiers. Ensemble systems are techniques that create multiple models and then combine them to produce improved results. These systems usually produce more accurate solutions than a single model would. In particular, multi-view ensemble systems improve the accuracy of text classification because they optimize the functions to exploit different views of the same input data. However, despite being more promising than the single-view approaches, document datasets often have no natural multiple views available. This study proposes an algorithm to generate a synthetic view from a standard text dataset. The model generates a new view from the standard bag-of-words approach using an algorithm based on hidden Markov models (HMMs). To show the effectiveness of the proposed HMM-based synthetic view generation method, it has been integrated in a co-training ensemble system and tested with four text corpora: Reuters, 20 Newsgroups, TREC Genomics and OHSUMED. The results obtained are promising, showing a significant increase in the efficiency of the ensemble system compared to a single-view approach.
European Union | Ref. FP7/REGPOT-2012-2013.1, n.316265, BIOCAPS
Ministerio de Economía y Competitividad de España | Ref. TIN2013-47153-C3-3-R
Universidade de Vigo | Ref. 14VI0
Evolutionary Multiobjective Feature Selection for Sentiment Analysis
Sentiment analysis is one of the prominent research areas in data mining and knowledge discovery, which has proven to be an effective technique for monitoring public opinion. The big data era with a high volume of data generated by a variety of sources has provided enhanced opportunities for utilizing sentiment analysis in various domains. In order to take best advantage of the high volume of data for accurate sentiment analysis, it is essential to clean the data before the analysis, as irrelevant or redundant data will hinder extracting valuable information. In this paper, we propose a hybrid feature selection algorithm to improve the performance of sentiment analysis tasks. Our proposed sentiment analysis approach builds a binary classification model based on two feature selection techniques: an entropy-based metric and an evolutionary algorithm. We have performed comprehensive experiments in two different domains using a benchmark dataset, Stanford Sentiment Treebank, and a real-world dataset we have created based on World Health Organization (WHO) public speeches regarding COVID-19. The proposed feature selection model is shown to achieve significant performance improvements in both datasets, increasing classification accuracy for all utilized machine learning and text representation technique combinations. Moreover, it achieves over 70% reduction in feature size, which provides efficiency in computation time and space.
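The abstract names an entropy-based metric as one of its two feature selection techniques without saying which one. As an illustration only, the sketch below uses information gain, a common entropy-based score, to rank terms for a binary sentiment task; the evolutionary algorithm is not shown, and the documents and labels are toy data.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())

def information_gain(docs, labels, term):
    """Reduction in label entropy from knowing whether `term` occurs in a document.

    docs: list of sets of terms; labels: class label per document.
    """
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    total = len(labels)
    remainder = sum(len(part) / total * entropy(part)
                    for part in (with_term, without_term) if part)
    return entropy(labels) - remainder

# Toy corpus: 1 = positive, 0 = negative.
docs = [{"great", "film"}, {"great", "cast"}, {"dull", "film"}, {"dull", "plot"}]
labels = [1, 1, 0, 0]

vocabulary = set().union(*docs)
scores = {term: information_gain(docs, labels, term) for term in vocabulary}
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

Terms that split the classes cleanly ("great", "dull") score highest, while a term spread evenly across both classes ("film") scores zero and would be a candidate for removal.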
Using Machine Learning and Natural Language Processing to Review and Classify the Medical Literature on Cancer Susceptibility Genes
PURPOSE: The medical literature relevant to germline genetics is growing
exponentially. Clinicians need tools for monitoring and prioritizing the literature
to understand the clinical implications of the pathogenic genetic variants. We
developed and evaluated two machine learning models to classify abstracts as
relevant to the penetrance (risk of cancer for germline mutation carriers) or
prevalence of germline genetic mutations. METHODS: We conducted literature
searches in PubMed and retrieved paper titles and abstracts to create an
annotated dataset for training and evaluating the two machine learning
classification models. Our first model is a support vector machine (SVM) which
learns a linear decision rule based on the bag-of-ngrams representation of each
title and abstract. Our second model is a convolutional neural network (CNN)
which learns a complex nonlinear decision rule based on the raw title and
abstract. We evaluated the performance of the two models on the classification
of papers as relevant to penetrance or prevalence. RESULTS: For penetrance
classification, we annotated 3740 paper titles and abstracts and used 60% for
training the model, 20% for tuning the model, and 20% for evaluating the model.
The SVM model achieves 89.53% accuracy (percentage of papers that were
correctly classified) while the CNN model achieves 88.95% accuracy. For
prevalence classification, we annotated 3753 paper titles and abstracts. The
SVM model achieves 89.14% accuracy while the CNN model achieves 89.13%
accuracy. CONCLUSION: Our models achieve high accuracy in classifying abstracts
as relevant to penetrance or prevalence. By facilitating literature review,
this tool could help clinicians and researchers keep abreast of the burgeoning
knowledge of gene-cancer associations and keep the knowledge bases for clinical
decision support tools up to date.
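The bag-of-ngrams representation and the 60/20/20 split mentioned in this abstract can be sketched as follows. Whitespace tokenisation, lowercasing, the unigram-plus-bigram range, and the unshuffled split are assumptions made for the example; the abstract does not describe the authors' preprocessing.

```python
from collections import Counter

def bag_of_ngrams(text, max_n=2):
    """Count all word n-grams of length 1..max_n in a lowercased text."""
    tokens = text.lower().split()
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts

def split_60_20_20(items):
    """Split a sequence into training, tuning, and evaluation parts (60/20/20).

    Integer arithmetic avoids floating-point rounding surprises.
    A real pipeline would normally shuffle the items first.
    """
    n_train = len(items) * 6 // 10
    n_tune = len(items) * 2 // 10
    return (items[:n_train],
            items[n_train:n_train + n_tune],
            items[n_train + n_tune:])

bag = bag_of_ngrams("BRCA1 mutation carriers and BRCA1 penetrance")
train, tune, evaluate = split_60_20_20(list(range(3740)))
print(bag.most_common(3))
print(len(train), len(tune), len(evaluate))
```

A linear SVM trained on such count vectors learns one weight per n-gram, which is what makes its decision rule linear in this representation.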
Fine-Grained Product Class Recognition for Assisted Shopping
Assistive solutions for a better shopping experience can improve the quality
of life of people, in particular of visually impaired shoppers. We present
a system that visually recognizes the fine-grained product classes of items on
a shopping list, in shelf images taken with a smartphone in a grocery store.
Our system consists of three components: (a) We automatically recognize useful
text on product packaging, e.g., product name and brand, and build a mapping of
words to product classes based on the large-scale GroceryProducts dataset. When
the user populates the shopping list, we automatically infer the product class
of each entered word. (b) We perform fine-grained product class recognition
when the user is facing a shelf. We discover discriminative patches on product
packaging to differentiate between visually similar product classes and to
increase the robustness against continuous changes in product design. (c) We
continuously improve the recognition accuracy through active learning. Our
experiments show the robustness of the proposed method against cross-domain
challenges, and the scalability to an increasing number of products with
minimal re-training.
Comment: Accepted at ICCV Workshop on Assistive Computer Vision and Robotics
(ICCV-ACVR) 201