1,884 research outputs found
A survey on technique for solving web page classification problem
Nowadays, the number of web pages on the World Wide Web keeps increasing with the growing popularity of the Internet. Web page classification is needed to organize this increasing number of web pages. Many web page classification techniques have been proposed by researchers, but there is no comprehensive survey of their performance. In this paper, we survey the different web page classification techniques together with the results they achieve, and review the existing work on web page classification. Based on the survey, we find that a neural network technique, namely the Convolutional Neural Network (CNN), produces a high F-measure and meets the real-time requirement for classification compared to the other machine learning techniques
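The core operation of the CNN classifiers the survey highlights is a 1-D convolution over token embeddings followed by max-over-time pooling. Below is a minimal NumPy sketch of that feature extractor only (not any surveyed system's implementation; the function name, shapes, and ReLU choice are assumptions):

```python
import numpy as np

def cnn_text_features(embeddings, filters):
    """1-D convolution over a token-embedding sequence followed by
    max-over-time pooling -- the core of a CNN text classifier.
    embeddings: (seq_len, dim); each filter W: (width, dim)."""
    seq_len, _ = embeddings.shape
    pooled = []
    for W in filters:
        width = W.shape[0]
        # Slide the filter over the sequence, applying ReLU to each window.
        acts = [np.maximum(0.0, np.sum(embeddings[i:i + width] * W))
                for i in range(seq_len - width + 1)]
        pooled.append(max(acts))  # max-over-time pooling
    return np.array(pooled)
```

A classifier would feed the pooled feature vector into a small fully connected layer; that part is omitted here.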
DEBACER: a method for slicing moderated debates
Subjects change frequently in moderated debates with several participants, such as in parliamentary sessions, electoral debates, and trials. Partitioning a debate into blocks with the same subject is essential for understanding. Often a moderator is responsible for defining when a new block begins, so the task of automatically partitioning a moderated debate can focus solely on the moderator's behavior. In this paper, we (i) propose a new algorithm, DEBACER, which partitions moderated debates; (ii) carry out a comparative study between conventional and BERTimbau pipelines; and (iii) validate DEBACER by applying it to the minutes of the Assembly of the Republic of Portugal. Our results show the effectiveness of DEBACER.
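The idea that block boundaries follow the moderator's interventions can be sketched as a simple transcript splitter (a hypothetical simplification, not the DEBACER algorithm itself, which also compares conventional and BERTimbau pipelines):

```python
def split_by_moderator(turns, moderator="Moderator"):
    """Split a list of (speaker, text) turns into blocks, starting a new
    block each time the moderator takes the floor."""
    blocks, current = [], []
    for speaker, text in turns:
        # A moderator turn closes the current block and opens a new one.
        if speaker == moderator and current:
            blocks.append(current)
            current = []
        current.append((speaker, text))
    if current:
        blocks.append(current)
    return blocks
```

In practice a classifier would decide whether a given moderator turn really introduces a new subject, rather than treating every intervention as a boundary.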
Factors Influencing the Surprising Instability of Word Embeddings
Despite the recent popularity of word embedding methods, there is only a
small body of work exploring the limitations of these representations. In this
paper, we consider one aspect of embedding spaces, namely their stability. We
show that even relatively high frequency words (100-200 occurrences) are often
unstable. We provide empirical evidence for how various factors contribute to
the stability of word embeddings, and we analyze the effects of stability on
downstream tasks. Comment: NAACL HLT 201
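A common way to quantify the stability this abstract studies is the overlap between a word's nearest neighbours in two independently trained embedding spaces. The sketch below assumes that metric (the function name, dict-of-vectors representation, and cosine similarity are illustrative choices, not necessarily the paper's exact setup):

```python
import numpy as np

def stability(word, emb_a, emb_b, k=10):
    """Stability of `word` as the fraction of its k nearest neighbours
    (by cosine similarity) shared between two embedding spaces."""
    def neighbours(emb):
        vocab = [w for w in emb if w != word]
        M = np.array([emb[w] for w in vocab])
        v = np.array(emb[word])
        sims = M @ v / (np.linalg.norm(M, axis=1) * np.linalg.norm(v))
        order = np.argsort(-sims)[:k]  # indices of the k most similar words
        return {vocab[i] for i in order}
    na, nb = neighbours(emb_a), neighbours(emb_b)
    return len(na & nb) / k
```

A stability of 1.0 means the word's neighbourhood is identical across runs; values near 0 indicate the instability the paper reports even for moderately frequent words.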
Blending Sentence Optimization Weights of Unsupervised Approaches for Extractive Speech Summarization
This paper evaluates the performance of two unsupervised approaches, Maximum Marginal Relevance (MMR) and a concept-based global optimization framework, for speech summarization. Automatic summarization is a very useful technique that helps users browse large amounts of data. This study focuses on automatic extractive summarization of a multi-dialogue speech corpus. We propose improved methods that blend the unsupervised approaches at the sentence level. Sentence-level information is leveraged to improve the linguistic quality of the selected summaries. First, sentence scores are used to filter sentences for concept extraction and concept-weight computation. Second, we pre-select a subset of candidate summary sentences according to their sentence weights. Last, we extend the optimization function to a joint optimization of concept and sentence weights to cover both important concepts and sentences. Our experimental results show that these methods improve system performance compared to the concept-based optimization baseline for both human transcripts and ASR output. The best scores are achieved by combining all three approaches, which are significantly better than the baseline system
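The MMR approach named above greedily selects sentences that are relevant to the document but non-redundant with sentences already chosen. A minimal sketch of that selection loop follows (similarity inputs and λ weighting are the standard MMR formulation; the function signature is an assumption, not this paper's code):

```python
def mmr_select(sim_to_doc, sim_matrix, n, lam=0.7):
    """Greedy Maximal Marginal Relevance.
    sim_to_doc[i]: relevance of sentence i to the whole document.
    sim_matrix[i][j]: similarity between sentences i and j.
    Returns indices of up to n selected sentences."""
    selected, candidates = [], list(range(len(sim_to_doc)))
    while candidates and len(selected) < n:
        def score(i):
            # Penalise redundancy with already-selected sentences.
            redundancy = max((sim_matrix[i][j] for j in selected), default=0.0)
            return lam * sim_to_doc[i] - (1 - lam) * redundancy
        best = max(candidates, key=score)
        selected.append(best)
        candidates.remove(best)
    return selected
```

The concept-based framework instead solves a global optimization over concept weights; blending the two, as the paper proposes, adds the sentence weights into that joint objective.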
Sinkhorn-Flow: Predicting Probability Mass Flow in Dynamical Systems Using Optimal Transport
Predicting how distributions over discrete variables vary over time is a
common task in time series forecasting. But whereas most approaches focus on
merely predicting the distribution at subsequent time steps, a crucial piece of
information in many settings is to determine how this probability mass flows
between the different elements over time. We propose a new approach to
predicting such mass flow over time using optimal transport. Specifically, we
propose a generic approach to predicting transport matrices in end-to-end deep
learning systems, replacing the standard softmax operation with Sinkhorn
iterations. We apply our approach to the task of predicting how communities
will evolve over time in social network settings, and show that the approach
improves substantially over alternative prediction methods. We specifically
highlight results on the task of predicting faction evolution in Ukrainian
parliamentary voting. Comment: A prior version of the work appeared in the Optimal Transport Workshop at NeurIPS 201
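The Sinkhorn iterations that replace the softmax can be sketched directly: exponentiate the logits, then alternately normalise rows and columns so the matrix approaches a doubly-stochastic transport matrix (a minimal NumPy illustration of the operation, not the paper's end-to-end system):

```python
import numpy as np

def sinkhorn(logits, n_iters=50):
    """Sinkhorn normalisation: alternately normalise rows and columns of
    exp(logits) so the result approaches a doubly-stochastic matrix.
    Each step is differentiable, so it can replace a row-wise softmax
    inside an end-to-end model."""
    M = np.exp(logits)
    for _ in range(n_iters):
        M = M / M.sum(axis=1, keepdims=True)  # rows sum to 1
        M = M / M.sum(axis=0, keepdims=True)  # columns sum to 1
    return M
```

A plain softmax would only make each row sum to 1; the column constraint is what lets the output be read as mass flow between elements, with no mass created or destroyed.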
PRADA: Practical Black-Box Adversarial Attacks against Neural Ranking Models
Neural ranking models (NRMs) have shown remarkable success in recent years,
especially with pre-trained language models. However, deep neural models are
notorious for their vulnerability to adversarial examples. Adversarial attacks
may become a new type of web spamming technique given our increased reliance on
neural information retrieval models. Therefore, it is important to study
potential adversarial attacks to identify vulnerabilities of NRMs before they
are deployed.
In this paper, we introduce the Adversarial Document Ranking Attack (ADRA)
task against NRMs, which aims to promote a target document in rankings by
adding adversarial perturbations to its text. We focus on the decision-based
black-box attack setting, where the attackers have no access to the model
parameters and gradients, but can only acquire the rank positions of the
partial retrieved list by querying the target model. This attack setting is
realistic in real-world search engines. We propose a novel Pseudo
Relevance-based ADversarial ranking Attack method (PRADA) that learns a
surrogate model based on Pseudo Relevance Feedback (PRF) to generate gradients
for finding the adversarial perturbations.
Experiments on two web search benchmark datasets show that PRADA can
outperform existing attack strategies and successfully fool the NRM with small,
indiscernible perturbations of text.
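The decision-based black-box setting described above — the attacker sees only rank positions, never gradients — can be illustrated with a greedy token-substitution loop. This is a deliberately simplified stand-in: PRADA itself trains a PRF-based surrogate model to obtain gradients, whereas the sketch below just queries ranks directly; all names and the candidate-substitution interface are assumptions:

```python
def greedy_rank_attack(doc_tokens, candidates, rank_of, max_edits=3):
    """Greedy decision-based attack sketch.
    candidates: {position: [replacement tokens]} to try.
    rank_of: black-box callable returning the document's rank position
    (lower is better) -- the only signal available to the attacker."""
    best, best_rank = list(doc_tokens), rank_of(doc_tokens)
    for _ in range(max_edits):
        improved = False
        for pos, subs in candidates.items():
            for sub in subs:
                trial = list(best)
                trial[pos] = sub
                r = rank_of(trial)
                if r < best_rank:  # keep any edit that promotes the document
                    best, best_rank, improved = trial, r, True
        if not improved:
            break
    return best, best_rank
```

The expensive part in practice is choosing good candidate substitutions while keeping the perturbation imperceptible; that is where PRADA's surrogate-model gradients come in.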