Legal Document Classification: An Application to Law Area Prediction of Petitions to Public Prosecution Service
In recent years, there has been an increased interest in the application of
Natural Language Processing (NLP) to legal documents. The use of convolutional
and recurrent neural networks, along with word embedding techniques, has
produced promising results when applied to text classification problems,
such as sentiment analysis and topic segmentation of documents. This paper
proposes the use of NLP techniques for textual classification, with the purpose
of categorizing the descriptions of the services provided by the Public
Prosecutor's Office of the State of Paraná to the population in one of the
areas of law covered by the institution. Our main goal is to automate the
process of assigning petitions to their respective areas of law, with a
consequent reduction in the costs and time associated with this process, while
allowing the allocation of human resources to more complex tasks. In this
paper, we compare different approaches to word representations in the
aforementioned task, including document-term matrices and several word
embeddings. With regard to the classification models, we evaluated three
different families: linear models, boosted trees and neural networks. The best
results were obtained with a combination of Word2Vec trained on a
domain-specific corpus and a Recurrent Neural Network (RNN) architecture (more
specifically, an LSTM), leading to an accuracy of 90% and an F1-score of 85% in
the classification of eighteen categories (law areas).
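The reported metrics pair overall accuracy with an F1-score over the eighteen law areas. As a minimal sketch of how a macro-averaged F1-score is computed (a common choice for multi-class problems; the abstract does not state its exact averaging scheme), with toy labels standing in for the real categories:

```python
def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute per-class F1, then average over classes."""
    classes = sorted(set(y_true) | set(y_pred))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * precision * recall / (precision + recall)
                   if precision + recall else 0.0)
    return sum(f1s) / len(f1s)

# Toy example: three law areas standing in for the paper's eighteen.
y_true = ["civil", "criminal", "family", "civil", "criminal"]
y_pred = ["civil", "criminal", "civil", "civil", "family"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because per-class F1 is averaged without weighting, macro F1 penalises poor performance on rare law areas, which is why it can sit well below accuracy, as in the 90% vs. 85% figures above.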
Combining State-of-the-Art Models with Maximal Marginal Relevance for Few-Shot and Zero-Shot Multi-Document Summarization
In Natural Language Processing, multi-document summarization (MDS) poses many
challenges to researchers above those posed by single-document summarization
(SDS). These challenges include the increased search space and greater
potential for the inclusion of redundant information. While advancements in
deep learning approaches have led to the development of several advanced
language models capable of summarization, the variety of training data specific
to the problem of MDS remains relatively limited. Therefore, MDS approaches
which require little to no pretraining, known as few-shot or zero-shot
applications, respectively, could be beneficial additions to the current set of
tools available in summarization. To explore one possible approach, we devise a
strategy for combining state-of-the-art models' outputs using maximal marginal
relevance (MMR) with a focus on query relevance rather than document diversity.
Our MMR-based approach shows improvement over some aspects of the current
state-of-the-art results in both few-shot and zero-shot MDS applications while
maintaining state-of-the-art output quality across all available metrics.
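The selection strategy can be sketched with the standard MMR formulation, Score(s) = lam * Sim(s, Q) - (1 - lam) * max over selected s' of Sim(s, s'), where a high lam favours query relevance over diversity, as the paper emphasises. The bag-of-words cosine similarity and the lam value below are illustrative assumptions, not the paper's exact components:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in set(a) & set(b))
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_select(candidates, query, k=2, lam=0.8):
    """Greedy MMR: trade off query relevance against redundancy with the
    already-selected set. A high lam emphasises query relevance."""
    bows = {c: Counter(c.lower().split()) for c in candidates}
    qbow = Counter(query.lower().split())
    selected = []
    while candidates and len(selected) < k:
        def score(c):
            redundancy = max((cosine(bows[c], bows[s]) for s in selected),
                             default=0.0)
            return lam * cosine(bows[c], qbow) - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates = [c for c in candidates if c != best]
    return selected
```

In an MDS setting the candidates would be sentences drawn from several models' summaries, so the redundancy term suppresses near-duplicate content that different models tend to produce for the same source documents.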
Categorizing Natural Language-Based Customer Satisfaction: An Implementation Method Using Support Vector Machine and Long Short-Term Memory Neural Network
Analyzing natural language-based Customer Satisfaction (CS) is a tedious process, especially when large datasets must be categorized manually. Fortunately, the advent of supervised machine learning techniques has paved the way toward the design of efficient categorization systems for CS. This paper presents the feasibility of designing a text categorization model using two popular and robust algorithms – the Support Vector Machine (SVM) and the Long Short-Term Memory (LSTM) Neural Network – in order to automatically categorize complaints, suggestions, feedback, and commendations. The study found that, in terms of training accuracy, SVM achieved a best rating of 98.63% while LSTM achieved a best rating of 99.32%. These results indicate that the two algorithms are on par with each other in training accuracy, but SVM is significantly faster than LSTM, by approximately 35.47 s. The training performance of both algorithms is attributed to the limitations of the dataset size, the high dimensionality of the English and Tagalog languages, and the applicability of the feature engineering techniques used. Interestingly, based on the results of actual implementation, both algorithms were found to be 100% effective in predicting the correct CS categories. Hence, the choice between the two algorithms comes down to the available dataset and the skill applied in optimizing them through feature engineering techniques and in deploying them in actual text categorization applications.
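The abstract does not disclose the feature pipeline or the dataset. As a hedged illustration of supervised text categorization over the four CS categories, here is a minimal bag-of-words perceptron; it merely stands in for the study's actual SVM and LSTM models, and every training sample below is invented:

```python
from collections import defaultdict

def train_perceptron(samples, epochs=10):
    """Multi-class perceptron over bag-of-words features.
    weights maps (word, label) pairs to a score contribution."""
    weights = defaultdict(float)
    labels = sorted({y for _, y in samples})
    for _ in range(epochs):
        for text, y in samples:
            words = text.lower().split()
            pred = max(labels,
                       key=lambda l: sum(weights[(w, l)] for w in words))
            if pred != y:  # mistake-driven update
                for w in words:
                    weights[(w, y)] += 1.0
                    weights[(w, pred)] -= 1.0
    return weights, labels

def predict(weights, labels, text):
    words = text.lower().split()
    return max(labels, key=lambda l: sum(weights[(w, l)] for w in words))

# Invented toy samples covering the paper's four CS categories.
samples = [
    ("the service was slow and the staff was rude", "complaint"),
    ("please add more payment options", "suggestion"),
    ("the new interface is easier to navigate", "feedback"),
    ("excellent support thank you very much", "commendation"),
]
weights, labels = train_perceptron(samples)
```

An SVM replaces the mistake-driven update with a max-margin objective, and an LSTM replaces the bag-of-words features with a learned sequence representation, but the categorize-by-highest-score structure is the same.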
Adversarial command detection using parallel Speech Recognition systems
Personal Voice Assistants (PVAs) such as Apple's Siri, Amazon's Alexa, and Google Home are now commonplace. PVAs are susceptible to adversarial commands: an attacker can modify an audio signal such that humans do not notice the modification, yet the Speech Recognition (SR) system recognises a command of the attacker's choice. In this paper, we describe a defence against such adversarial commands. By running a second SR system in parallel to the PVA's main SR system, it is possible to detect adversarial commands: it is difficult for an attacker to craft an adversarial command that forces two different SR systems to recognise the same adversarial command while keeping the modification inaudible. We demonstrate the feasibility of this defence mechanism in practical setups. For instance, our evaluation shows that such a system can be tuned to detect 50% of adversarial commands without impacting normal PVA use.
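The detection step above can be sketched as transcript comparison: run the same audio through both SR systems and flag the command as adversarial when the transcripts diverge beyond a tunable threshold. The normalised word-level edit distance and the threshold value below are illustrative assumptions, not the paper's exact detection rule:

```python
def edit_distance(a, b):
    """Levenshtein distance between two token sequences (dynamic programming)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

def is_adversarial(transcript_main, transcript_second, threshold=0.5):
    """Flag audio as adversarial when the two SR systems' transcripts
    diverge beyond the threshold (normalised word edit distance).
    An adversarial signal crafted against one SR tends to decode
    differently, or not at all, on a second, independent SR."""
    a = transcript_main.lower().split()
    b = transcript_second.lower().split()
    if not a and not b:
        return False
    return edit_distance(a, b) / max(len(a), len(b)) > threshold
```

Tuning the threshold trades detection rate against false positives on benign commands, which is the 50%-detection operating point the evaluation refers to.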
Performance evaluation of keyword extraction methods and visualization for student online comments
Topic keyword extraction, a typical task in information retrieval, refers to extracting the core keywords of a document's topics. In an online environment, students often post comments in subject forums. The automatic and accurate extraction of keywords from these comments is beneficial to lecturers, particularly for repeatedly delivered subjects. In this paper, we compare the performance of traditional machine learning algorithms and two deep learning methods in extracting topic keywords from student comments posted in subject forums. For this purpose, we collected student comment data over a period of two years, manually tagging part of the raw data for our experiments. On this dataset, we comprehensively compared five typical algorithms: naïve Bayes, logistic regression, support vector machine, convolutional neural networks, and Long Short-Term Memory with Attention (Att-LSTM). Performance was measured using four evaluation metrics, and we further examined the extracted keywords through visualization. From the results of our experiments and the visualization, we conclude that the Att-LSTM method is the best approach for topic keyword extraction from student comments. Furthermore, the results from the algorithms and the visualization are, to some degree, symmetric. In particular, the topics extracted from comments posted at the same stages of different teaching sessions exhibit almost reflection symmetry.
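As a hedged illustration of the task itself (not one of the paper's five supervised algorithms, which are trained on manually tagged comments), a minimal unsupervised TF-IDF ranker can surface candidate topic keywords from a batch of comments; the sample comments below are invented:

```python
import math
from collections import Counter

def tfidf_keywords(comments, top_k=3):
    """Rank words in each comment by TF-IDF against the comment batch.
    Words common across many comments (low IDF) are pushed down,
    leaving topic-bearing terms; ties break alphabetically."""
    docs = [c.lower().split() for c in comments]
    n = len(docs)
    df = Counter(w for d in docs for w in set(d))  # document frequency
    results = []
    for d in docs:
        tf = Counter(d)
        scores = {w: (tf[w] / len(d)) * math.log(n / df[w]) for w in tf}
        ranked = sorted(scores, key=lambda w: (-scores[w], w))
        results.append(ranked[:top_k])
    return results

# Invented student comments from a hypothetical subject forum.
comments = [
    "the lectures on recursion were hard to follow",
    "more examples of recursion would help in the lectures",
    "assignment deadlines were too close together",
]
keywords = tfidf_keywords(comments)
```

A supervised extractor such as the paper's Att-LSTM instead learns which tokens annotators tagged as keywords, which is why it can outperform frequency-based rankings on domain-specific comment data.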