
    Examination of machine learning methods for multi-label classification of intellectual property documents

    This thesis explores the performance of a variety of machine learning techniques for the task of multi-label document classification applied to a corpus of United States patent grants. The rapidly rising number of patent applications over the past several decades has created a growing need for enhanced automatic patent processing tools, and automated document classification in particular has been targeted as an important point of research. However, the development of adequate tools has been limited in part by the esoteric writing style particular to intellectual property and by the overlapping categorizations of the branched hierarchical classification system employed by the CPC. A patent document corpus offers a large, publicly available training set consisting of both structured and unstructured data, and the application of machine learning techniques to this corpus may help relieve the increasing need for highly trained human classifiers. The contributions of the present work are twofold. First, it constructs a patent document corpus by gathering 4,500 patent documents from 2014 and 2015 and compiling the structured and textual data relevant to an automated classification task. Second, it examines five different machine learning techniques as automated classifiers of patent documents by section. Supervised classifiers were trained under different preprocessing conditions using principal component analysis and word selection. Principal component analysis of the patent documents without further feature selection yielded the best performance for all machine learning models. This approach also revealed an effect of dataset size: increasing the size of the training set improved the overall performance of the Decision Tree, Support Vector Machine, Logistic Regression, and Neural Net models. By contrast, some classifiers trained on data not subject to principal component analysis showed decreasing performance as the dataset size increased.
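    As a rough illustration of the kind of pipeline this abstract describes, the sketch below vectorizes a few patent-like snippets, applies principal component analysis, and trains four of the classifiers named above using scikit-learn. The toy texts, CPC section labels, and parameter choices are assumptions made for illustration only, not the thesis's actual corpus, features, or settings.

        from sklearn.pipeline import Pipeline
        from sklearn.preprocessing import FunctionTransformer
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.decomposition import PCA
        from sklearn.linear_model import LogisticRegression
        from sklearn.svm import LinearSVC
        from sklearn.tree import DecisionTreeClassifier
        from sklearn.neural_network import MLPClassifier

        # Hypothetical toy inputs: patent-like snippets and a primary CPC section each.
        texts = [
            "a rotor blade assembly for a wind turbine",
            "a gearbox housing with a lubricant channel",
            "a pharmaceutical composition for treating inflammation",
            "an antibody formulation for oral administration",
            "a convolutional network for image segmentation",
            "a memory controller for non-volatile storage",
        ]
        sections = ["F", "F", "A", "A", "G", "G"]

        classifiers = {
            "decision_tree": DecisionTreeClassifier(random_state=0),
            "svm": LinearSVC(),
            "logistic_regression": LogisticRegression(max_iter=1000),
            "neural_net": MLPClassifier(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
        }

        for name, clf in classifiers.items():
            pipe = Pipeline([
                ("tfidf", TfidfVectorizer()),                           # bag-of-words / tf-idf features
                ("densify", FunctionTransformer(lambda x: x.toarray(),  # PCA needs a dense matrix
                                                accept_sparse=True)),
                ("pca", PCA(n_components=2)),                           # a real corpus would keep far more components
                ("clf", clf),
            ])
            pipe.fit(texts, sections)
            print(name, pipe.score(texts, sections))                    # accuracy on the toy training data

    A realistic run would hold out a test set and report per-section metrics; the point here is only the preprocess-then-classify structure the abstract describes.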

    Feature extraction and classification of movie reviews


    Taming Wild High Dimensional Text Data with a Fuzzy Lash

    The bag-of-words (BOW) model represents a corpus as a matrix whose elements are word frequencies. However, each row of the matrix is a very high-dimensional sparse vector. Dimension reduction (DR) is a popular way to address these sparsity and high-dimensionality issues. Among the strategies for developing DR methods, Unsupervised Feature Transformation (UFT) is a popular one that maps all words onto a new basis to represent the BOW. The recent growth of text data and its challenges imply that the DR area still needs new perspectives. Although a wide range of methods based on the UFT strategy has been developed, the fuzzy approach has not been considered for DR under this strategy. This research investigates the application of fuzzy clustering as a UFT-based DR method that collapses the BOW matrix to provide a lower-dimensional representation of the documents rather than of the words in a corpus. The quantitative evaluation shows that fuzzy clustering yields better performance and features than Principal Component Analysis (PCA) and Singular Value Decomposition (SVD), two popular DR methods based on the UFT strategy.
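    A minimal NumPy sketch of one way such a scheme could work is given below: words are fuzzy-clustered by their document-frequency profiles, and each document is then represented by its total membership in the word clusters. The toy matrix, the plain fuzzy c-means routine, and the projection step are assumptions made for illustration; they are not taken from the paper.

        import numpy as np

        def fuzzy_cmeans(X, c, m=2.0, n_iter=100, tol=1e-5, seed=0):
            """Plain fuzzy c-means; X has shape (n_points, n_dims)."""
            rng = np.random.default_rng(seed)
            U = rng.random((X.shape[0], c))
            U /= U.sum(axis=1, keepdims=True)            # memberships of each point sum to 1
            for _ in range(n_iter):
                Um = U ** m
                centers = (Um.T @ X) / Um.sum(axis=0)[:, None]
                d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2) + 1e-12
                U_new = 1.0 / d ** (2.0 / (m - 1.0))     # inverse-distance weighting
                U_new /= U_new.sum(axis=1, keepdims=True)
                if np.abs(U_new - U).max() < tol:
                    return U_new, centers
                U = U_new
            return U, centers

        # Toy document-term (BOW) matrix: rows = documents, columns = words.
        bow = np.array([[3, 1, 0, 0],
                        [2, 2, 0, 1],
                        [0, 0, 4, 2],
                        [0, 1, 3, 3]], dtype=float)

        # Cluster the words (columns) into c fuzzy groups, then collapse the BOW
        # matrix onto the word-membership matrix: each document gets c features.
        word_memberships, _ = fuzzy_cmeans(bow.T, c=2)   # shape (n_words, c)
        docs_reduced = bow @ word_memberships            # shape (n_docs, c)
        print(docs_reduced)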

    Transforming Graph Representations for Statistical Relational Learning

    Full text link
    Relational data representations have become an increasingly important topic due to the recent proliferation of network datasets (e.g., social, biological, information networks) and a corresponding increase in the application of statistical relational learning (SRL) algorithms to these domains. In this article, we examine a range of representation issues for graph-based relational data. Since the choice of relational data representation for the nodes, links, and features can dramatically affect the capabilities of SRL algorithms, we survey approaches and opportunities for relational representation transformation designed to improve the performance of these algorithms. This leads us to introduce an intuitive taxonomy for data representation transformations in relational domains that incorporates link transformation and node transformation as symmetric representation tasks. In particular, the transformation tasks for both nodes and links include (i) predicting their existence, (ii) predicting their label or type, (iii) estimating their weight or importance, and (iv) systematically constructing their relevant features. We motivate our taxonomy through detailed examples and use it to survey and compare competing approaches for each of these tasks. We also discuss general conditions for transforming links, nodes, and features. Finally, we highlight challenges that remain to be addressed.
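    To make task (i) above concrete, the fragment below scores candidate links in a toy undirected graph with the common-neighbors heuristic, a standard baseline for predicting link existence; the graph and the heuristic are illustrative stand-ins and are not drawn from the article itself.

        from itertools import combinations

        # Toy undirected graph as an adjacency dictionary.
        adj = {
            "a": {"b", "c"},
            "b": {"a", "c", "d"},
            "c": {"a", "b"},
            "d": {"b"},
        }

        def common_neighbors(u, v):
            """Score a candidate link by how many neighbors its endpoints share."""
            return len(adj[u] & adj[v])

        # Rank all non-edges: a higher score suggests a more likely missing link.
        candidates = [(u, v) for u, v in combinations(adj, 2) if v not in adj[u]]
        for u, v in sorted(candidates, key=lambda p: -common_neighbors(*p)):
            print(u, v, common_neighbors(u, v))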