Recent Trends and Techniques in Text Detection and Text Localization in a Natural Scene: A Survey
Text information extraction from natural scene images is a growing area of research. Since text in natural scene images generally carries valuable details, detecting and recognizing scene text is essential for a variety of advanced computer vision applications. Considerable effort has been put into extracting text regions from scene images effectively and reliably. Because most text recognition applications demand robust algorithms for detecting and localizing text in a given scene image, researchers mainly focus on two important stages: text detection and text localization. This paper provides a review of various techniques for text detection and text localization.
Aggregated Text Transformer for Scene Text Detection
This paper explores a multi-scale aggregation strategy for scene text
detection in natural images. We present the Aggregated Text TRansformer (ATTR),
which is designed to represent texts in scene images with a multi-scale
self-attention mechanism. Starting from an image pyramid with multiple
resolutions, features are first extracted at the different scales with shared
weights and then fed into a Transformer encoder-decoder architecture. The
multi-scale image representations are robust and contain rich information on
text content of various sizes. The text Transformer aggregates these features
to learn the interactions across scales and improve the text representation.
The proposed method detects scene text by representing each text instance as
an individual binary mask, which is tolerant of curved text and regions with
dense instances. Extensive experiments on public scene text detection datasets
demonstrate the effectiveness of the proposed framework.
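The shared-weight pyramid stage described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the pyramid, the toy convolution kernel, and the flattening into per-scale token sequences are all assumptions standing in for the actual backbone and Transformer.

```python
import numpy as np

def make_pyramid(image, scales=(1.0, 0.5, 0.25)):
    # Nearest-neighbour subsampling as a stand-in for a proper resize.
    levels = []
    for s in scales:
        step = int(round(1 / s))
        levels.append(image[::step, ::step])
    return levels

def shared_conv(feat, kernel):
    # Valid 2-D convolution; the same `kernel` is reused at every
    # pyramid level (the "shared weights" of the abstract).
    h, w = feat.shape
    kh, kw = kernel.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(feat[i:i+kh, j:j+kw] * kernel)
    return out

rng = np.random.default_rng(0)
image = rng.standard_normal((32, 32))
kernel = rng.standard_normal((3, 3))

# One shared extractor across scales, then flatten each level into a
# token sequence that a Transformer encoder-decoder would consume.
tokens = [shared_conv(level, kernel).reshape(-1, 1)
          for level in make_pyramid(image)]
print([t.shape[0] for t in tokens])  # → [900, 196, 36]
```

Concatenating these per-scale sequences is what lets self-attention model interactions across resolutions, since every token can attend to tokens from every other scale.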
Multimodal Hate Speech Detection from Bengali Memes and Texts
Numerous works have employed machine learning (ML) and deep learning (DL)
techniques on textual data from social media to analyze anti-social behavior
such as cyberbullying, fake news propagation, and hate speech, mainly for
highly resourced languages like English. However, despite their diversity and
millions of native speakers, some languages such as Bengali remain
under-resourced, owing to a lack of computational resources for natural
language processing (NLP). Like English, Bengali social media content also
includes images along with text (e.g., multimodal content posted by embedding
short texts into images on Facebook), so textual data alone is not enough to
judge it (e.g., to determine whether it is hate speech). In such cases, images
can provide extra context for a proper judgment. This paper addresses hate
speech detection from multimodal Bengali memes and texts. We prepared the only
multimodal hate speech detection dataset of its kind for Bengali. We train
several neural architectures (i.e., neural networks such as
Bi-LSTM/Conv-LSTM with word embeddings, and EfficientNet + transformer
architectures such as monolingual Bangla BERT, multilingual BERT-cased/uncased,
and XLM-RoBERTa) to jointly analyze textual and visual information for hate
speech detection. The Conv-LSTM and XLM-RoBERTa models performed best on
texts, yielding F1 scores of 0.78 and 0.82, respectively. For memes, the
ResNet152 and DenseNet201 models yield F1 scores of 0.78 and 0.7,
respectively. The multimodal fusion of mBERT-uncased + EfficientNet-B1
performed best, yielding an F1 score of 0.80. Our study suggests that memes
are moderately useful for hate speech detection in Bengali, but none of the
multimodal models outperform unimodal models analyzing only textual data.
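A common way to combine a text encoder and an image encoder, as in the mBERT + EfficientNet fusion above, is late fusion: concatenate the two pooled feature vectors and pass them through a classifier head. The sketch below assumes hypothetical 768-d text and 1280-d image embeddings (typical output sizes for BERT-style encoders and EfficientNet-B1 pooled features); the random vectors and the single linear layer are illustrative stand-ins, not the paper's trained model.

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical unimodal embeddings: a 768-d text vector (e.g. from a
# BERT-style encoder) and a 1280-d image vector (e.g. EfficientNet-B1
# pooled features). Both are stand-ins, not real model outputs.
text_emb = rng.standard_normal(768)
image_emb = rng.standard_normal(1280)

def fuse_and_classify(text_vec, image_vec, w, b):
    # Late fusion: concatenate the modalities, then apply a linear
    # layer + sigmoid for a binary (hate / not-hate) score.
    fused = np.concatenate([text_vec, image_vec])
    logit = fused @ w + b
    return 1.0 / (1.0 + np.exp(-logit))

# Toy classifier-head weights over the 2048-d fused vector.
w = rng.standard_normal(768 + 1280) * 0.01
b = 0.0
score = fuse_and_classify(text_emb, image_emb, w, b)
print(0.0 < score < 1.0)  # probability-like score in (0, 1)
```

In practice both encoders and the fusion head are trained jointly (or the head is fine-tuned on frozen encoders), and the finding above suggests the image branch added little over the text branch alone for this task.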