PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT
This study provides an efficient approach for using text data to calculate
patent-to-patent (p2p) technological similarity, and presents a hybrid
framework for leveraging the resulting p2p similarity for applications such as
semantic search and automated patent classification. We create embeddings using
Sentence-BERT (SBERT) based on patent claims. We leverage SBERT's efficiency in
creating embedding distance measures to map p2p similarity in large sets of
patent data. We deploy our framework for classification with a simple K-Nearest
Neighbors (KNN) model that predicts the Cooperative Patent Classification (CPC) of
a patent based on the class assignment of the K patents with the highest p2p
similarity. We thereby validate that the p2p similarity captures their
technological features in terms of CPC overlap, and at the same time demonstrate the
usefulness of this approach for automatic patent classification based on text
data. Furthermore, the presented classification framework is simple and the
results easy to interpret and evaluate by end-users. In the out-of-sample model
validation, we are able to perform a multi-label prediction of all assigned CPC
classes on the subclass (663) level on 1,492,294 patents with an accuracy of
54% and F1 score > 66%, which suggests that our model outperforms the current
state-of-the-art in text-based multi-label and multi-class patent
classification. We furthermore discuss the applicability of the presented
framework for semantic IP search, patent landscaping, and technology
intelligence. We finally point towards a future research agenda for leveraging
multi-source patent embeddings, their appropriateness across applications, as
well as to improve and validate patent embeddings by creating domain-expert
curated Semantic Textual Similarity (STS) benchmark datasets.
Comment: 18 pages, 7 figures and 4 tables
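The classification step described in this abstract (predicting a patent's CPC classes from the K most similar patents) can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional toy vectors stand in for SBERT claim embeddings, and the function names, the vote threshold `min_share`, and the example corpus are assumptions made for illustration only. The abstract does not say how neighbor labels are aggregated.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def predict_labels(query, corpus, k=3, min_share=0.5):
    """Multi-label KNN: keep every label carried by at least `min_share`
    of the k nearest neighbors (the threshold is an assumption).

    corpus: list of (embedding, set_of_cpc_labels) pairs.
    """
    neighbors = sorted(
        corpus, key=lambda item: cosine(query, item[0]), reverse=True
    )[:k]
    votes = Counter(label for _, labels in neighbors for label in labels)
    return {label for label, n in votes.items() if n / len(neighbors) >= min_share}


# Toy corpus: hypothetical embeddings paired with CPC subclass codes.
corpus = [
    ([1.0, 0.0, 0.0], {"A01B"}),
    ([0.9, 0.1, 0.0], {"A01B", "B60K"}),
    ([0.0, 1.0, 0.0], {"G06F"}),
    ([0.0, 0.9, 0.1], {"G06F", "H04L"}),
]
predicted = predict_labels([0.95, 0.05, 0.0], corpus, k=2)
```

With real data the embeddings would come from an SBERT model and the neighbor search from an approximate index rather than a full sort, but the voting logic would stay the same.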
Examination of machine learning methods for multi-label classification of intellectual property documents
This thesis explores the performance of a variety of machine learning techniques for the task of multi-label document classification applied to a corpus of United States patent grants. The rapidly rising number of patent applications in the past several decades has led to a rising need for enhanced automatic patent processing tools. The task of automated document classification in particular has been targeted as an important point of research. However, the development of adequate tools has been limited in part by the esoteric writing style particular to intellectual property and the overlapping categorizations of the branched hierarchical classification system employed by the CPC. A patent document corpus offers a large, publicly available training set consisting of both structured and unstructured data. The application of machine learning techniques to this corpus may help relieve the increasing need for highly trained human classifiers. The contributions of the present work are twofold. First, the present work constructed a patent document corpus by gathering 4500 patent documents from 2014 and 2015 and compiling structured and textual data relevant to an automated classification task. Second, it offers an examination of five different machine learning techniques as automated classifiers for patent documents by section. Test trials under different preprocessing conditions utilizing principal component analysis and word selection were applied in training supervised learning classifiers. It was found that principal component analysis of the patent documents without further feature selection yielded the greatest performance for all machine learning models. This approach also revealed an effect of dataset size: increasing the size of the training set increased the overall performance of the Decision Tree, Support Vector Machine, Logistic Regression, and Neural Net models. It was further found that some classifiers trained on data not subject to principal component analysis showed decreasing performance metrics with increasing data sizes.
The impact of metadata on the accuracy of automated patent classification
During the last decade, the advance of machine-learning tools and algorithms has resulted in tremendous progress in the automated classification of documents. However, many classifiers base their classification decisions solely on document text and ignore metadata (such as authors, publication date, and author affiliation). In this project, automated classifiers using the k-Nearest Neighbour algorithm were developed for the classification of patents into two different classification systems. Those using metadata (in this case inventor names, applicant names and International Patent Classification codes) were compared with those ignoring it. For one classification system, the use of metadata improved classification accuracy from 70.8% to 75.4%, a highly statistically significant gain. However, the results for the other classification system were inconclusive: while metadata could improve the quality of the classifier for some experiments (recall increased from 66.0% to 68.9%, which was a small but nonetheless significant improvement), experiments with different parameters showed that it could also lead to a deterioration of quality (recall dropping as low as 61.0%). The study shows that metadata can play an extremely useful role in the classification of patents. Nonetheless, it must not be used indiscriminately but only after careful evaluation of its usefulness.
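One way to picture how metadata can change a k-Nearest Neighbour decision is to blend a text similarity score with a metadata overlap score. The abstract does not describe how the study combined the two, so everything here is a hypothetical sketch: the Jaccard overlap, the linear `weight`, the function names, and the toy data are all assumptions.

```python
from collections import Counter


def jaccard(a, b):
    """Overlap between two metadata sets (e.g. inventor and applicant names)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def combined_similarity(text_sim, meta_a, meta_b, weight=0.3):
    """Linear blend of text similarity and metadata overlap (assumed form)."""
    return (1 - weight) * text_sim + weight * jaccard(meta_a, meta_b)


def knn_classify(query_meta, text_sims, training, k=3, weight=0.3):
    """training: list of (metadata_set, label); text_sims[i] is the
    precomputed text similarity between the query and training[i]."""
    scored = sorted(
        (
            (combined_similarity(text_sims[i], query_meta, meta, weight), label)
            for i, (meta, label) in enumerate(training)
        ),
        key=lambda pair: pair[0],
        reverse=True,
    )[:k]
    votes = Counter(label for _, label in scored)
    return votes.most_common(1)[0][0]


# Hypothetical data: the second document is closer in text,
# but the first shares inventor and applicant with the query.
training = [({"smith", "acme"}, "X"), ({"jones", "beta"}, "Y")]
text_sims = [0.60, 0.70]
query_meta = {"smith", "acme"}

text_only = knn_classify(query_meta, text_sims, training, k=1, weight=0.0)
with_meta = knn_classify(query_meta, text_sims, training, k=1, weight=0.5)
```

In this toy case the prediction flips from `Y` to `X` once metadata is weighted in. That shows why the weight needs evaluation: the same mechanism can override a correct text-based match.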
Advanced Text Analytics and Machine Learning Approach for Document Classification
Text classification is used in information extraction and retrieval, and it is considered an important step in managing the vast and growing number of records held in digital form. This thesis addresses the problem of classifying patent documents into fifteen different categories or classes, where some classes overlap with other classes for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management by developing a set of automated tools that can assist NASA in managing and marketing its portfolio of intellectual property (IP), and to enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied, such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms: Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM-based classification model.
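The abstract mentions adding synthetic data to counter class imbalance but does not name the technique. The sketch below shows one common idea, interpolating between existing minority-class feature vectors, under that caveat. It is a simplification: methods such as SMOTE interpolate toward nearest neighbors, whereas this version picks random pairs. The function name `oversample` and the toy vectors are illustrative assumptions.

```python
import random


def oversample(minority, target_size, seed=0):
    """Grow a minority class to `target_size` samples by linear
    interpolation between randomly chosen pairs of existing samples."""
    rng = random.Random(seed)  # seeded for reproducibility
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        a, b = rng.sample(minority, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return minority + synthetic


# Hypothetical two-dimensional feature vectors for an under-represented class.
minority = [[0.0, 1.0], [0.2, 0.8], [0.1, 0.9]]
balanced = oversample(minority, target_size=6)
```

Each synthetic point lies on a segment between two real samples, so it stays inside the region the minority class already occupies.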
OCRIS : online catalogue and repository interoperability study. Final report
The aims and objectives of OCRIS were to:
• Survey the extent to which repository content is in scope for institutional library OPACs, and the extent to which it is already recorded there;
• Examine the interoperability of OPAC and repository software for the exchange of metadata and other information;
• List the various services to institutional managers, researchers, teachers and learners offered respectively by OPACs and repositories;
• Identify the potential for improvements in the links (e.g. using link resolver technology) from repositories and/or OPACs to other institutional services, such as finance or research administration;
• Make recommendations for the development of possible further links between library OPACs and institutional repositories, identifying the benefits to relevant stakeholder groups.