Bidirectional Long Short Term Memory Method and Word2vec Extraction Approach for Hate Speech Detection
Hate speech is currently a widely discussed topic in Indonesia, primarily on social media. Hate speech is communication that disparages a person or group based on characteristics such as race, ethnicity, gender, citizenship, religion, and organization. Twitter is one of the social media platforms people use to express their feelings and opinions through tweets, including tweets that contain expressions of hatred, because Twitter has a significant influence on the success or destruction of one's image. This study aims to classify Indonesian tweets as hate speech or not hate speech using the Bidirectional Long Short-Term Memory (BiLSTM) method and word2vec feature extraction with the continuous bag-of-words (CBOW) architecture. The BiLSTM is evaluated by calculating accuracy, precision, recall, and F-measure. Using word2vec with the CBOW architecture and BiLSTM, with 10 epochs, a learning rate of 0.001, and 200 neurons in the hidden layer, produces an accuracy of 94.66%, a precision of 99.08%, a recall of 93.74%, and an F-measure of 96.29%. In contrast, the BiLSTM with three layers has an accuracy of 96.93%. Adding one layer to the BiLSTM increased accuracy by 2.27 percentage points.
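The four evaluation measures named in this abstract can be illustrated with a minimal sketch. It assumes a binary task where "hate speech" is the positive class; the function name `classification_metrics` and the confusion counts are hypothetical and are not taken from the study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute accuracy, precision, recall and F-measure from confusion counts.

    tp/fp/fn/tn are true positives, false positives, false negatives and
    true negatives for the positive class (assumed here: "hate speech").
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F-measure is the harmonic mean of precision and recall.
    denom = precision + recall
    f_measure = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts for illustration only (not the study's data).
metrics = classification_metrics(tp=90, fp=10, fn=30, tn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

High precision with lower recall, the pattern reported in the abstract, means that few non-hate tweets are flagged while some hate tweets are missed.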
Neural Natural Language Processing for Long Texts: A Survey of the State-of-the-Art
The adoption of Deep Neural Networks (DNNs) has greatly benefited Natural
Language Processing (NLP) during the past decade. However, the demands of long
document analysis are quite different from those of shorter texts, while the
ever increasing size of documents uploaded on-line renders automated
understanding of long texts a critical area of research. This article has two
goals: a) it overviews the relevant neural building blocks, thus serving as a
short tutorial, and b) it surveys the state-of-the-art in long document NLP,
mainly focusing on two central tasks: document classification and document
summarization. Sentiment analysis for long texts is also covered, since it is
typically treated as a particular case of document classification.
Additionally, this article discusses the main challenges, issues and current
solutions related to long document NLP. Finally, the relevant, publicly
available, annotated datasets are presented, in order to facilitate further
research. Comment: 53 pages, 2 figures, 171 citations
Automatic movie analysis and summarisation
Automatic movie analysis is the task of applying Machine Learning methods to the
field of screenplays, movie scripts, and motion pictures to facilitate or enable various
tasks throughout the entirety of a movie’s life-cycle. From helping with making
informed decisions about a new movie script with respect to aspects such as its originality,
similarity to other movies, or even commercial viability, all the way to offering
consumers new and interesting ways of viewing the final movie, many stages in the
life-cycle of a movie stand to benefit from Machine Learning techniques that promise
to reduce human effort, time, or both. Within this field of automatic movie analysis,
this thesis addresses the task of summarising the content of screenplays, enabling users
at any stage to gain a broad understanding of a movie from greatly reduced data. The
contributions of this thesis are four-fold: (i) We introduce ScriptBase, a new large-scale
data set of original movie scripts, annotated with additional meta-information such as
genre and plot tags, cast information, and log- and tag-lines. To our knowledge, ScriptBase
is the largest data set of its kind, containing scripts and information for almost
1,000 Hollywood movies. (ii) We present a dynamic summarisation model for the
screenplay domain, which allows for extraction of highly informative and important
scenes from movie scripts. The extracted summaries allow for the content of the original
script to stay largely intact and provide the user with its important parts, while
greatly reducing the script-reading time. (iii) We extend our summarisation model
to capture additional modalities beyond the screenplay text. The model is rendered
multi-modal by introducing visual information obtained from the actual movie and by
extracting scenes from the movie, allowing users to generate visual summaries of motion
pictures. (iv) We devise a novel end-to-end neural network model for generating
natural language screenplay overviews. This model enables the user to generate short
descriptive and informative texts that capture certain aspects of a movie script, such as
its genres, approximate content, or style, allowing them to gain a fast, high-level understanding
of the screenplay. Multiple automatic and human evaluations were carried
out to assess the performance of our models, demonstrating that they are well-suited
for the tasks set out in this thesis, outperforming strong baselines. Furthermore, the
ScriptBase data set has started to gain traction, and is currently used by a number of
other researchers in the field to tackle various tasks relating to screenplays and their
analysis.
Unsupervised Opinion Summarization with Noising and Denoising
The supervised training of high-capacity models on large datasets containing
hundreds of thousands of document-summary pairs is critical to the recent
success of deep learning techniques for abstractive summarization.
Unfortunately, in most domains (other than news) such training data is not
available and cannot be easily sourced. In this paper we enable the use of
supervised learning for the setting where there are only documents available
(e.g., product or business reviews) without ground truth summaries. We create a
synthetic dataset from a corpus of user reviews by sampling a review,
pretending it is a summary, and generating noisy versions thereof which we
treat as pseudo-review input. We introduce several linguistically motivated
noise generation functions and a summarization model which learns to denoise
the input and generate the original review. At test time, the model accepts
genuine reviews and generates a summary containing salient opinions, treating
those that do not reach consensus as noise. Extensive automatic and human
evaluation shows that our model brings substantial improvements over both
abstractive and extractive baselines. Comment: ACL 202
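The synthetic-data idea described in this abstract (sample a review, treat it as the summary, and derive noisy pseudo-reviews from it) can be sketched roughly as follows. The two noise functions here, random token dropping and adjacent-token swapping, are simple stand-ins chosen for illustration; they are assumptions and not the linguistically motivated noise functions of the paper. All function names are hypothetical.

```python
import random


def drop_tokens(tokens, drop_prob, rng):
    """Randomly delete tokens; keep at least the first one so the text is not empty."""
    kept = [tok for tok in tokens if rng.random() >= drop_prob]
    return kept if kept else tokens[:1]


def swap_adjacent(tokens, swap_prob, rng):
    """Randomly swap neighbouring tokens to perturb word order."""
    noisy = list(tokens)
    i = 0
    while i < len(noisy) - 1:
        if rng.random() < swap_prob:
            noisy[i], noisy[i + 1] = noisy[i + 1], noisy[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return noisy


def make_synthetic_pair(reviews, num_noisy, rng, drop_prob=0.2, swap_prob=0.2):
    """Sample one review as the pseudo-summary and build noisy pseudo-reviews from it."""
    target = rng.choice(reviews)
    tokens = target.split()
    noisy_inputs = []
    for _ in range(num_noisy):
        noisy = swap_adjacent(drop_tokens(tokens, drop_prob, rng), swap_prob, rng)
        noisy_inputs.append(" ".join(noisy))
    return noisy_inputs, target


# Hypothetical toy corpus for illustration only.
reviews = [
    "the staff were friendly and the food was great",
    "slow service but the pizza was worth the wait",
]
rng = random.Random(0)
noisy_inputs, target = make_synthetic_pair(reviews, num_noisy=3, rng=rng)
print("target :", target)
for text in noisy_inputs:
    print("noisy  :", text)
```

A denoising model would then be trained to map each set of noisy inputs back to its target, which is what lets supervised training proceed without human-written summaries.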
Deep learning approaches to predict sea surface height above geoid in Pekalongan
Rising sea surface height is one of the world's vital issues in marine ecosystems because it greatly affects the ecosystems as well as the socio-economic life of the surrounding environment. Pekalongan is one area in Indonesia facing the effects of this phenomenon. This problem deserves to be explored further with complex approaches. One such approach is a neural network, which can perform forecasting more accurately. In neural networks, the time series approach can be used with Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU). By adding the bidirectional method to each of these two approaches, we identify the best method for performing the analysis. The best results were obtained by forecasting for 960 days using Vanilla BiGRU. The results can be interpreted from multiple perspectives. The forecasting results showed a fluctuating pattern similar to previous periods, so the pattern can be considered quite normal, which indicates that the terminal can continue to operate normally. Nevertheless, the forecasting results from this study are expected to serve as a reference for the government in preventing future dangers.
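Recurrent forecasters such as LSTM or GRU are usually trained on fixed-length windows cut from the series. The following is a minimal sketch of that windowing step only; the window length, horizon, function name, and sample values are assumptions for illustration and are not taken from the study.

```python
def make_windows(series, window, horizon=1):
    """Split a time series into (input window, target) pairs.

    Each input holds `window` consecutive values and each target holds the
    `horizon` values that immediately follow it.
    """
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        inputs.append(series[start:start + window])
        targets.append(series[start + window:start + window + horizon])
    return inputs, targets


# Hypothetical sea surface height readings (arbitrary units).
heights = [0.1, 0.2, 0.3, 0.4, 0.5]
inputs, targets = make_windows(heights, window=3, horizon=1)
for x, y in zip(inputs, targets):
    print(x, "->", y)
```

A bidirectional model reads each input window in both directions, but the target always lies after the window, so no future values leak into the inputs.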
Text summarization of online hotel reviews with sentiment analysis
The aim of this thesis is the creation of a system that summarizes positive and negative property reviews. To achieve this, an extractive summarization system that produces two summaries is proposed: one for the positive reviews and another for the negative ones. This is achieved with a classification system that feeds positive and negative reviews to the summarization system. To pursue our objective, a study of the different NLP methods, along with their pros and cons, was performed, leading to the conclusion that the use of transformers, and more specifically the combination of the BERT and GPT-2 architectures, would be the best approach. To obtain the TripAdvisor data that is on the StayForLong website, a crawling process was performed on StayForLong and TripAdvisor. The data consisted of a total of over 80000 reviews and over 175 properties, which we pre-processed, cleaned, and tokenized in order to work with BERT for the sentiment analysis and GPT-2 for the summarization. We then proceeded with an extensive analysis of the impact of the variables. Finally, we fine-tuned each of the models so that it performed at its best. To evaluate our two systems, we evaluated the binary sentiment classification system, with multi-modal BERT, obtaining a precision of 96%, and for the GPT-2 summarization system we applied the ROUGE-F1 metric, where we obtained an average of 57.5%.
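The abstract does not say which ROUGE variant was used. As a rough illustration of what a ROUGE F1 score measures, here is a simplified unigram (ROUGE-1) version based on whitespace tokenization. It is an assumption-laden sketch: standard ROUGE tooling adds options such as stemming, and the function name and example sentences are hypothetical.

```python
from collections import Counter


def rouge1_f1(candidate, reference):
    """Simplified ROUGE-1 F1: unigram overlap between a candidate and a reference summary."""
    cand_counts = Counter(candidate.lower().split())
    ref_counts = Counter(reference.lower().split())
    # Clipped overlap: each word counts at most as often as it appears in both texts.
    overlap = sum((cand_counts & ref_counts).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand_counts.values())
    recall = overlap / sum(ref_counts.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical summaries for illustration only.
generated = "the room was clean and quiet"
reference = "the room was very clean"
score = rouge1_f1(generated, reference)
print(f"ROUGE-1 F1: {score:.4f}")
```

Here four of the six generated words appear in the five-word reference, so precision is 4/6, recall is 4/5, and their harmonic mean is 8/11.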