43,018 research outputs found

PatentSBERTa: A Deep NLP based Hybrid Model for Patent Distance and Classification using Augmented SBERT

Author: Bekamiri Hamid
Hain Daniel S.
Jurowetzki Roman
Publication date: 17/10/2021

This study provides an efficient approach for using text data to calculate patent-to-patent (p2p) technological similarity, and presents a hybrid framework for leveraging the resulting p2p similarity for applications such as semantic search and automated patent classification. We create embeddings using Sentence-BERT (SBERT) based on patent claims. We leverage SBERT's efficiency in creating embedding distance measures to map p2p similarity in large sets of patent data. We deploy our framework for classification with a simple Nearest Neighbors (KNN) model that predicts Cooperative Patent Classification (CPC) of a patent based on the class assignment of the K patents with the highest p2p similarity. We thereby validate that the p2p similarity captures their technological features in terms of CPC overlap, and at the same time demonstrate the usefulness of this approach for automatic patent classification based on text data. Furthermore, the presented classification framework is simple and the results easy to interpret and evaluate by end-users. In the out-of-sample model validation, we are able to perform a multi-label prediction of all assigned CPC classes on the subclass (663) level on 1,492,294 patents with an accuracy of 54% and F1 score > 66%, which suggests that our model outperforms the current state-of-the-art in text-based multi-label and multi-class patent classification. We furthermore discuss the applicability of the presented framework for semantic IP search, patent landscaping, and technology intelligence. We finally point towards a future research agenda for leveraging multi-source patent embeddings, their appropriateness across applications, as well as to improve and validate patent embeddings by creating domain-expert curated Semantic Textual Similarity (STS) benchmark datasets.
Comment: 18 pages, 7 figures and 4 tables
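The nearest-neighbour step described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors are toy stand-ins for SBERT claim embeddings, and the function names and the `min_share` voting threshold are assumptions made only for illustration.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def predict_cpc(query, corpus, k=3, min_share=0.5):
    """Predict a set of CPC labels from the k most similar patents.

    corpus is a list of (embedding, set_of_cpc_labels) pairs.
    A label is kept when at least min_share of the k neighbours carry it
    (hypothetical voting rule, chosen for this sketch).
    """
    ranked = sorted(corpus, key=lambda item: cosine(query, item[0]), reverse=True)
    neighbours = ranked[:k]
    votes = Counter(label for _, labels in neighbours for label in labels)
    return {label for label, n in votes.items() if n / len(neighbours) >= min_share}


# Toy embeddings standing in for SBERT vectors of patent claims.
corpus = [
    ([1.0, 0.0, 0.1], {"G06F"}),
    ([0.9, 0.1, 0.0], {"G06F", "G06N"}),
    ([0.8, 0.2, 0.1], {"G06N"}),
    ([0.0, 1.0, 0.0], {"A61K"}),
    ([0.1, 0.9, 0.2], {"A61K", "C07D"}),
]
predicted = predict_cpc([1.0, 0.05, 0.05], corpus, k=3)
print(sorted(predicted))
```

Because each prediction traces back to K concrete neighbouring patents, an end-user can inspect those neighbours directly, which is the interpretability argument the abstract makes.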

arXiv.org e-Print Archive

VBN

Examination of machine learning methods for multi-label classification of intellectual property documents

Author: Hall John William
Publication date: 01/05/2017

This thesis explores the performance of a variety of machine learning techniques for the task of multi-label document classification applied to a corpus of United States patent grants. The rapidly rising number of patent applications in the past several decades has led to a rising need for enhanced automatic patent processing tools. The task of automated document classification in particular has been targeted as an important point of research. However, the development of adequate tools has been limited in part by the esoteric writing style particular to intellectual property and the overlapping categorizations of the branched hierarchical classification system employed by the CPC. A patent document corpus offers a large, publicly available training set consisting of both structured and unstructured data. The application of machine learning techniques to this corpus may help relieve the increasing need for highly trained human classifiers. The contributions of the present work are twofold. First, the present work constructed a patent document corpus by gathering 4500 patent documents from 2014 and 2015 and compiling structured and textual data relevant to an automated classification task. Second, it offers an examination of five different machine learning techniques as automated classifiers for patent documents by section. Test trials under different preprocessing conditions utilizing principal component analysis and word selection were applied in training supervised learning classifiers. It was found that principal component analysis of the patent documents without further feature selection yielded the greatest performance for all machine learning models. This approach also revealed an effect of dataset size where increasing the size of the training set increased the overall performance of Decision Tree, Support Vector Machine, Logistic Regression, and Neural Net models. It was further found that some classifiers trained on data not subject to principal component analysis showed decreasing performance metrics with increasing data sizes.

Illinois Digital Environment for Access to Learning and Scholarship Repository

The impact of metadata on the accuracy of automated patent classification

Author: Andrew MacFarlane
Georg Richter
Publication venue: Elsevier BV
Publication date: 01/03/2005

During the last decade, the advance of machine-learning tools and algorithms has resulted in tremendous progress in the automated classification of documents. However, many classifiers base their classification decisions solely on document text and ignore metadata (such as authors, publication date, and author affiliation). In this project, automated classifiers using the k-Nearest Neighbour algorithm were developed for the classification of patents into two different classification systems. Those using metadata (in this case inventor names, applicant names and International Patent Classification codes) were compared with those ignoring it. The use of metadata could significantly improve the classification of patents with one classification system, improving classification accuracy from 70.8% up to 75.4%, which was highly statistically significant. However, the results for the other classification system were inconclusive: while metadata could improve the quality of the classifier for some experiments (recall increased from 66.0% to 68.9%, which was a small but nonetheless significant improvement), experiments with different parameters showed that it could also lead to a deterioration of quality (recall dropping as low as 61.0%). The study shows that metadata can play an extremely useful role in the classification of patents. Nonetheless, it must not be used indiscriminately but only after careful evaluation of its usefulness.
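One way metadata could enter a k-Nearest Neighbour classifier is sketched below. The abstract does not say how the study combined text and metadata, so the weighted sum of text similarity and Jaccard overlap, the `weight` parameter, and all names and values here are assumptions for illustration only.

```python
from collections import Counter


def jaccard(a, b):
    """Jaccard overlap between two sets of metadata tokens."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(query_meta, training, k=3, weight=0.3):
    """Majority vote over the k training patents with the highest combined score.

    training is a list of (text_similarity_to_query, metadata_set, label).
    weight=0 ignores metadata entirely (text-only baseline).
    """
    def score(item):
        text_sim, meta, _ = item
        return (1 - weight) * text_sim + weight * jaccard(query_meta, meta)

    top = sorted(training, key=score, reverse=True)[:k]
    return Counter(label for _, _, label in top).most_common(1)[0][0]


# Hypothetical precomputed text similarities and metadata tokens.
training = [
    (0.80, {"inventor:smith"}, "A"),
    (0.78, {"inventor:jones"}, "A"),
    (0.75, {"inventor:lee", "applicant:acme"}, "B"),
    (0.74, {"inventor:lee"}, "B"),
    (0.70, {"applicant:acme"}, "B"),
]
query_meta = {"inventor:lee", "applicant:acme"}

text_only = knn_classify(query_meta, training, k=3, weight=0.0)
with_metadata = knn_classify(query_meta, training, k=3, weight=0.3)
print(text_only, with_metadata)
```

In this toy data the metadata flips the vote, which also shows why the abstract urges caution: a poorly chosen weight can push the classifier toward the wrong class just as easily.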

City Research Online

Crossref

Advanced Text Analytics and Machine Learning Approach for Document Classification

Author: Anne Chaitanya
Publication venue: ScholarWorks@UNO
Publication date: 19/05/2017

Text classification is used in information extraction and retrieval from a given text, and it has been considered an important step in managing the vast and growing number of records held in digital form. This thesis addresses the problem of classifying patent documents into fifteen different categories or classes, where some classes overlap with other classes for practical reasons. For the development of the classification model using machine learning techniques, useful features have been extracted from the given documents. The features are used to classify patent documents as well as to generate useful tag-words. The overall objective of this work is to systematize NASA’s patent management by developing a set of automated tools that can assist NASA in managing and marketing its portfolio of intellectual properties (IP), and to enable easier discovery of relevant IP by users. We have identified an array of methods that can be applied, such as k-Nearest Neighbors (kNN), two variations of the Support Vector Machine (SVM) algorithm, and two tree-based classification algorithms: Random Forest and J48. The major research steps in this work consist of filtering techniques for variable selection, information gain and feature correlation analysis, and training and testing potential models using effective classifiers. Further, the obstacles associated with the imbalanced data were mitigated by adding synthetic data wherever appropriate, which resulted in a superior SVM classifier-based model.
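The abstract does not name the technique used to add synthetic data. As a rough illustration of the general idea, the sketch below grows a minority class by interpolating between randomly chosen pairs of its feature vectors. This is a simplified, SMOTE-like stand-in (real SMOTE interpolates toward nearest neighbours), and the function name and data are hypothetical.

```python
import random


def oversample(minority, target_size, seed=0):
    """Return the minority samples plus interpolated synthetic samples.

    Each synthetic point lies on the line segment between two randomly
    chosen minority samples. Requires at least two minority samples.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    synthetic = []
    while len(minority) + len(synthetic) < target_size:
        a, b = rng.sample(minority, 2)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([x + t * (y - x) for x, y in zip(a, b)])
    return minority + synthetic


# Hypothetical two-dimensional feature vectors for an imbalanced problem.
majority = [[0.1 * i, 1.0] for i in range(10)]
minority = [[5.0, 5.0], [6.0, 5.5], [5.5, 6.0]]

balanced_minority = oversample(minority, target_size=len(majority))
print(len(majority), len(balanced_minority))
```

Synthetic samples should be generated from the training split only, so that the test data remains representative of the original class distribution.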

University of New Orleans

OCRIS : online catalogue and repository interoperability study. Final report

Author: Birrell Duncan
Dunsire Gordon
Menzies Kathleen
Publication venue: University of Strathclyde
Publication date: 01/01/2009

The aims and objectives of OCRIS were to:
• Survey the extent to which repository content is in scope for institutional library OPACs, and the extent to which it is already recorded there;
• Examine the interoperability of OPAC and repository software for the exchange of metadata and other information;
• List the various services to institutional managers, researchers, teachers and learners offered respectively by OPACs and repositories;
• Identify the potential for improvements in the links (e.g. using link resolver technology) from repositories and/or OPACs to other institutional services, such as finance or research administration;
• Make recommendations for the development of possible further links between library OPACs and institutional repositories, identifying the benefits to relevant stakeholder groups.

E-LIS

University of Strathclyde Institutional Repository