Portuguese patent classification: A use case of text classification using machine learning and transfer learning approaches
Project Work presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
Patent classification is one of the areas of Intellectual Property Analytics (IPA) and a growing use case, since the number of patent applications worldwide has been increasing over the years. More than ever, patents are being used as financial protection by companies, which also use patent databases to drive research and product innovation. The Instituto Nacional de Propriedade Industrial (INPI) is the government agency responsible for protecting industrial property rights in Portugal. INPI promoted a competition to explore technologies that address challenges related to industrial property, including the classification of patents, one of the critical phases of the patent grant process.
In this project, we used the dataset made available by INPI to explore traditional machine learning algorithms for classifying Portuguese patents and to evaluate the performance of transfer learning methods on this task. BERTimbau, a BERT architecture model pre-trained on a large Portuguese corpus, achieved the best results on the task, although its performance was only 4% higher than that of a LinearSVC model using TF-IDF feature engineering. In general, the model performs well, despite low scores for classes with few training samples. However, the analysis of misclassified samples showed that the specificity of the context has more influence on learning than the number of samples itself.
Patent classification is a challenging task not only because of 1) the hierarchical structure of the classification, but also because of 2) the way a patent is described, 3) the overlap of contexts, and 4) the underrepresentation of some classes. Nevertheless, it is an area of growing interest that can benefit from the new research that is revolutionizing machine learning applications, especially text mining.
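
The TF-IDF plus LinearSVC baseline mentioned in this abstract can be sketched roughly as follows with scikit-learn. This is a minimal illustration, not the authors' pipeline: the toy Portuguese texts, the section-level labels, the helper name build_baseline, and the vectorizer and C settings are all assumptions made for the example, and the INPI dataset itself is not reproduced here.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy data: invented patent-like snippets, each labeled with
# an IPC section letter. The real INPI dataset is not shown here.
train_texts = [
    "composição farmacêutica para tratamento de doenças inflamatórias",
    "comprimido contendo princípio ativo para uso medicinal",
    "motor de combustão interna com sistema de injeção de combustível",
    "bomba hidráulica com válvula de controlo de pressão",
    "método de transmissão de dados em rede sem fios",
    "circuito elétrico para carregamento de baterias",
]
train_labels = ["A", "A", "F", "F", "H", "H"]


def build_baseline():
    """Return an untrained TF-IDF + LinearSVC pipeline (assumed settings)."""
    return make_pipeline(
        # Word unigrams and bigrams with sublinear term-frequency scaling.
        TfidfVectorizer(ngram_range=(1, 2), sublinear_tf=True),
        # Linear support vector classifier, one-vs-rest for multiple classes.
        LinearSVC(C=1.0),
    )


model = build_baseline()
model.fit(train_texts, train_labels)

# Classify a new, equally invented snippet.
predictions = model.predict(["dispositivo de comunicação em rede sem fios"])
print(predictions[0])
```

A baseline of this kind is cheap to train, which is one reason it is a common reference point when judging whether a fine-tuned transformer is worth its extra cost.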
Deep Learning for Technical Document Classification
In large technology companies, the requirements for managing and organizing
technical documents created by engineers and managers have increased
dramatically in recent years, which has led to a higher demand for more
scalable, accurate, and automated document classification. Prior studies have
only focused on processing text for classification, whereas technical documents
often contain multimodal information. To leverage multimodal information for
document classification to improve the model performance, this paper presents a
novel multimodal deep learning architecture, TechDoc, which utilizes three
types of information, including natural language texts and descriptive images
within documents and the associations among the documents. The architecture
synthesizes the convolutional neural network, recurrent neural network, and
graph neural network through an integrated training process. We applied the
architecture to a large multimodal technical document database and trained the
model for classifying documents based on the hierarchical International Patent
Classification system. Our results show that TechDoc presents a greater
classification accuracy than the unimodal methods and other state-of-the-art
benchmarks. The trained model can potentially be scaled to millions of
real-world multimodal technical documents, which is useful for data and
knowledge management in large technology companies and organizations.
Comment: 16 pages, 8 figures, 9 tables
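
Both abstracts refer to the hierarchical structure of the International Patent Classification. As a small sketch of what that hierarchy looks like in practice, the snippet below splits an IPC symbol into its section, class, subclass, main group, and subgroup levels. The names IpcSymbol, parse_ipc, and hierarchy_labels are hypothetical helpers written for this illustration, the regular expression covers only the common "H04W 72/04" style of notation, and the example symbol is illustrative rather than taken from either paper.

```python
import re
from typing import List, NamedTuple


class IpcSymbol(NamedTuple):
    """One label per level of the IPC hierarchy, from coarse to fine."""
    section: str      # e.g. "H"
    ipc_class: str    # e.g. "H04"
    subclass: str     # e.g. "H04W"
    main_group: str   # e.g. "H04W 72/00"
    subgroup: str     # e.g. "H04W 72/04"


# Section letter A-H, two-digit class, subclass letter, group "/" subgroup.
_IPC_PATTERN = re.compile(r"^([A-H])(\d{2})([A-Z])\s*(\d{1,4})/(\d{2,6})$")


def parse_ipc(symbol: str) -> IpcSymbol:
    """Split an IPC symbol into the labels of its five hierarchy levels."""
    match = _IPC_PATTERN.match(symbol.strip().upper())
    if match is None:
        raise ValueError(f"not a recognised IPC symbol: {symbol!r}")
    section, cls, sub, group, subgroup = match.groups()
    subclass = f"{section}{cls}{sub}"
    return IpcSymbol(
        section=section,
        ipc_class=f"{section}{cls}",
        subclass=subclass,
        # Main groups are written with "/00" after the group number.
        main_group=f"{subclass} {group}/00",
        subgroup=f"{subclass} {group}/{subgroup}",
    )


def hierarchy_labels(symbol: str) -> List[str]:
    """Return the labels from section down to subgroup as a list."""
    return list(parse_ipc(symbol))


labels = hierarchy_labels("H04W 72/04")
print(labels)
```

A classifier can be trained against any one of these levels, and the number of candidate classes grows quickly as the level gets finer, which is part of why the hierarchy makes the task hard.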