Automatic Document Image Binarization using Bayesian Optimization
Document image binarization is often a challenging task due to various forms
of degradation. Although several binarization techniques exist in the
literature, the binarized image is typically sensitive to the control parameter
settings of the employed technique. This paper presents an automatic document
image binarization algorithm to segment the text from heavily degraded document
images. The proposed technique uses a two band-pass filtering approach for
background noise removal, and Bayesian optimization for automatic
hyperparameter selection for optimal results. The effectiveness of the proposed
binarization technique is empirically demonstrated on the Document Image
Binarization Competition (DIBCO) and the Handwritten Document Image
Binarization Competition (H-DIBCO) datasets.
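The "score a parameter setting, keep the best" loop at the heart of this approach can be illustrated with a toy stand-in: exhaustive search over a single global threshold using Otsu's inter-class-variance criterion. The paper itself tunes its band-pass filter and threshold hyperparameters with Bayesian optimization; this sketch only shows the evaluate-and-select pattern, not the paper's method.

```python
def otsu_threshold(pixels):
    """Pick the global threshold maximizing inter-class variance
    (Otsu's criterion) by exhaustive search over all 8-bit levels.
    `pixels` is a flat list of grayscale values in 0..255."""
    best_t, best_score = 0, -1.0
    for t in range(1, 256):
        bg = [p for p in pixels if p < t]   # darker class (ink)
        fg = [p for p in pixels if p >= t]  # brighter class (paper)
        if not bg or not fg:
            continue
        w0, w1 = len(bg) / len(pixels), len(fg) / len(pixels)
        mu0, mu1 = sum(bg) / len(bg), sum(fg) / len(fg)
        score = w0 * w1 * (mu0 - mu1) ** 2  # inter-class variance
        if score > best_score:
            best_score, best_t = score, t
    return best_t

def binarize(pixels, t):
    """Map dark pixels to 1 (ink) and bright pixels to 0 (background)."""
    return [1 if p < t else 0 for p in pixels]
```

A Bayesian optimizer would replace the exhaustive `for t in range(1, 256)` loop with a surrogate model that proposes only a handful of promising settings.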
Massively-Parallel Feature Selection for Big Data
We present the Parallel, Forward-Backward with Pruning (PFBP) algorithm for
feature selection (FS) in Big Data settings (high dimensionality and/or sample
size). To tackle the challenges of Big Data FS, PFBP partitions the data matrix
both in terms of rows (samples, training examples) and columns
(features). By employing the p-values of conditional independence
tests and meta-analysis techniques, PFBP manages to rely only on computations
local to a partition while minimizing communication costs. Then, it employs
powerful and safe (asymptotically sound) heuristics to make early, approximate
decisions, such as Early Dropping of features from consideration in subsequent
iterations, Early Stopping of consideration of features within the same
iteration, or Early Return of the winner in each iteration. PFBP provides
asymptotic guarantees of optimality for data distributions faithfully
representable by a causal network (Bayesian network or maximal ancestral
graph). Our empirical analysis confirms a super-linear speedup of the algorithm
with increasing sample size, linear scalability with respect to the number of
features and processing cores, while dominating other competitive algorithms in
its class.
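The meta-analytic pooling step can be sketched in isolation. One common choice for combining independent per-partition p-values (the abstract does not fix the method, so Fisher's method here is an illustrative assumption) is:

```python
import math

def fisher_combine(p_values):
    """Combine independent p-values with Fisher's method:
    X = -2 * sum(log p_i) follows a chi-square distribution with
    2k degrees of freedom under the null hypothesis. For even
    degrees of freedom the survival function has a closed form:
    P(X > x) = exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!."""
    k = len(p_values)
    stat = -2.0 * sum(math.log(max(p, 1e-300)) for p in p_values)
    x = stat / 2.0
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= x / i
        total += term
    return stat, math.exp(-x) * total
```

In a PFBP-like setting, each data partition would contribute one conditional-independence p-value per candidate feature, and a feature whose combined p-value stays large could be dropped early from subsequent iterations.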
Russo-Ukrainian War: Prediction and explanation of Twitter suspension
On 24 February 2022, Russia invaded Ukraine, starting what is now known as
the Russo-Ukrainian War and initiating online discourse on social media.
Twitter, as one of the most popular social networks (SNs), with an open and democratic character,
enables a transparent discussion among its large user base. Unfortunately, this
often leads to Twitter's policy violations, propaganda, abusive actions, civil
integrity violation, and consequently to user accounts' suspension and
deletion. This study focuses on the Twitter suspension mechanism and the
analysis of shared content and features of the user accounts that may lead to
this. Toward this goal, we have obtained a dataset containing 107.7M tweets,
originating from 9.8 million users, using the Twitter API. We extract the
categories of shared content of the suspended accounts and explain their
characteristics through the extraction of text embeddings in conjunction with
cosine similarity clustering. Our results reveal scam campaigns taking
advantage of trending topics regarding the Russo-Ukrainian conflict for
Bitcoin and Ethereum fraud, spam, and advertisement campaigns. Additionally, we
apply a machine learning methodology including a SHapley Additive exPlanations
(SHAP) explainability model to understand and explain how user accounts get suspended.
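The embedding-clustering step can be approximated with a minimal sketch: greedy, threshold-based grouping by cosine similarity. The study does not specify its exact clustering algorithm, so both the greedy assignment and the 0.9 threshold below are illustrative assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def greedy_cluster(embeddings, threshold=0.9):
    """Assign each embedding to the first cluster whose founding
    member is at least `threshold`-similar to it; otherwise open a
    new cluster. Returns a list of member-index lists."""
    clusters = []  # list of (representative_vector, member_indices)
    for i, e in enumerate(embeddings):
        for rep, members in clusters:
            if cosine(rep, e) >= threshold:
                members.append(i)
                break
        else:
            clusters.append((e, [i]))
    return [members for _, members in clusters]
```

Applied to tweet-text embeddings, each resulting cluster would group near-duplicate content, which is how repeated scam or spam campaigns surface.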
BotArtist: Twitter bot detection Machine Learning model based on Twitter suspension
Twitter, as one of the most popular social networks, offers a means for
communication and online discourse, which unfortunately has been the target of
bots and fake accounts, leading to the manipulation and spreading of false
information. Towards this end, we gather a challenging, multilingual dataset of
social discourse on Twitter, originating from 9M users regarding the recent
Russo-Ukrainian war, in order to detect the bot accounts and the conversation
involving them. We collect the ground truth for our dataset through the Twitter
API suspended-accounts collection, containing approximately 343K bot
accounts and 8M normal users. Additionally, we use a dataset provided by
Botometer-V3 with 1,777 Varol, 483 German, and 1,321 US accounts.
Besides the publicly available datasets, we also manage to collect two
independent datasets around popular discussion topics of the 2022 energy crisis
and the 2022 conspiracy discussions. Both of the datasets were labeled
according to the Twitter suspension mechanism. We build a novel ML model for
bot detection using the state-of-the-art XGBoost classifier, combined with a
high volume of tweets labeled according to the Twitter suspension mechanism as
ground truth. The model requires only a limited set of profile features,
allowing the dataset to be labeled at different time periods after collection,
as it is independent of the Twitter API. In comparison with Botometer, our
methodology achieves an average 11% higher ROC-AUC score over two real-case
scenario datasets.
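The ROC-AUC score used for the comparison above can be computed from scratch via its rank-statistic interpretation: the probability that a randomly chosen positive (bot) is scored above a randomly chosen negative (normal user), with ties counting one half. A minimal sketch:

```python
def roc_auc(labels, scores):
    """ROC-AUC via the Mann-Whitney formulation. `labels` holds
    0/1 class labels; `scores` holds the classifier's scores.
    O(P*N) pairwise version, fine for small examples."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count half
    return wins / (len(pos) * len(neg))
```

Production code would use a sort-based O(n log n) version (e.g. scikit-learn's `roc_auc_score`), but the pairwise form makes the metric's meaning explicit.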
Digitisation Processing and Recognition of Old Greek Manuscripts (the D-SCRIBE Project)
After many years of scholarly study, manuscript collections continue to be an important source of novel
information for scholars, concerning both the history of earlier times and the development of cultural
documentation over the centuries. The D-SCRIBE project aims to support and facilitate current and future efforts in
manuscript digitization and processing. It strives toward the creation of a comprehensive software product, which
can assist the content holders in turning an archive of manuscripts into a digital collection using automated
methods. In this paper, we focus on the problem of recognizing early Christian Greek manuscripts. We propose a
novel digital image binarization scheme for low-quality historical documents, allowing further content exploitation in
an efficient way. Based on the existence of closed cavity regions in the majority of characters and character
ligatures in these scripts, we propose a novel, segmentation-free, fast and efficient technique that assists the
recognition procedure by tracing and recognizing the most frequently appearing characters or character ligatures.
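The closed-cavity idea can be illustrated on a tiny binary glyph: flood-fill the background from the image border, and any background region left unreached is an enclosed cavity (the loop of an 'o' has one; an open 'c' has none). This is a deliberately simplified sketch; the project's actual technique operates on scanned manuscript images.

```python
def count_cavities(glyph):
    """Count enclosed background regions in a binary glyph given as
    a list of equal-length strings ('#' = ink, '.' = background)."""
    h, w = len(glyph), len(glyph[0])
    seen = [[False] * w for _ in range(h)]

    def flood(sr, sc):
        # Iterative 4-connected flood fill over background cells.
        stack = [(sr, sc)]
        while stack:
            r, c = stack.pop()
            if not (0 <= r < h and 0 <= c < w):
                continue
            if seen[r][c] or glyph[r][c] == '#':
                continue
            seen[r][c] = True
            stack += [(r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)]

    # Mark all background reachable from the image border.
    for r in range(h):
        flood(r, 0)
        flood(r, w - 1)
    for c in range(w):
        flood(0, c)
        flood(h - 1, c)

    # Any background cell still unseen starts a new enclosed cavity.
    cavities = 0
    for r in range(h):
        for c in range(w):
            if glyph[r][c] == '.' and not seen[r][c]:
                flood(r, c)
                cavities += 1
    return cavities
```

Counting cavities per glyph gives a cheap, segmentation-free shape cue of the kind the abstract describes for frequent Greek characters and ligatures.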
ICFHR2016 Handwritten Keyword Spotting Competition (H-KWS 2016)
The H-KWS 2016 competition, organized in the context of the ICFHR 2016 conference, aims at setting up an evaluation framework for benchmarking handwritten keyword spotting (KWS), examining both the Query by Example (QbE) and the Query by String (QbS) approaches. Each KWS approach was hosted in a different track, which in turn was split into two distinct challenges, namely a segmentation-based and a segmentation-free one, to accommodate the different perspectives adopted by researchers in the KWS field. In addition, the competition aims to evaluate the submitted training-based methods under different amounts of training data. Four participants submitted at least one solution to one of the challenges, according to the capabilities and/or restrictions of their systems. The data used in the competition consisted of historical German and English documents with their own characteristics and complexities. This paper presents the details of the competition, including the data, evaluation metrics, and results of the best run of each participating method.

This work was partially supported by the Spanish MEC under FPU grant FPU13/06281, by the Generalitat Valenciana under the Prometeo/2009/014 project grant ALMA-MATER, and through the EU projects HIMANIS (JPICH programme, Spanish grant Ref. PCIN-2015-068) and READ (Horizon-2020 programme, grant Ref. 674943).

Pratikakis, I.; Zagoris, K.; Gatos, B.; Puigcerver, J.; Toselli, A. H.; Vidal, E. (2016). ICFHR2016 Handwritten Keyword Spotting Competition (H-KWS 2016). IEEE. https://doi.org/10.1109/ICFHR.2016.0117
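KWS benchmarks of this kind are typically scored with ranked-retrieval metrics such as mean average precision (MAP); the exact measures used by H-KWS 2016 are detailed in the paper itself, so the following is a generic illustration, not the competition's official scorer.

```python
def average_precision(relevant, ranked):
    """AP for one query: `ranked` is the retrieved list (best first),
    `relevant` the set of ground-truth matches. Averages precision
    at each rank where a relevant item is retrieved."""
    hits, total = 0, 0.0
    for i, item in enumerate(ranked, start=1):
        if item in relevant:
            hits += 1
            total += hits / i  # precision at this rank
    return total / len(relevant) if relevant else 0.0

def mean_average_precision(queries):
    """MAP over (relevant_set, ranked_list) pairs, one per query."""
    return sum(average_precision(r, k) for r, k in queries) / len(queries)
```

For a QbE track, each query would be a word image and `ranked` the list of word-image candidates sorted by the system's matching score.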
Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study
Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues.
Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.
Findings: Transkribus has demonstrated that HTR is now a usable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure sustainability and scaling of the platform. However, funding and resourcing issues are identified.
Research limitations/implications: The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc.
Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field.
Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.
Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current applications of handwriting technology in the cultural heritage sector.
ICFHR 2012 Competition on Handwritten Document Image Binarization (H-DIBCO 2012)
H-DIBCO 2012 is the International Document Image Binarization
Competition dedicated to handwritten document images, organized
in conjunction with the ICFHR 2012 conference. The objective of the contest
is to identify current advances in handwritten document image
binarization using meaningful evaluation performance measures. This
paper reports on the contest details, including the evaluation measures
used as well as the performance of the 24 submitted methods, along with a
short description of each method.
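Two of the standard performance measures in DIBCO-style evaluations, pixel-level F-measure and PSNR, can be sketched for binary images as follows. The competitions' full protocols also include further measures (e.g. distance-based ones) not shown here.

```python
import math

def f_measure(gt, pred):
    """Pixel-level F-measure between ground-truth and predicted
    binarizations, given as flat 0/1 lists with 1 = ink."""
    tp = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 1)
    fp = sum(1 for g, p in zip(gt, pred) if g == 0 and p == 1)
    fn = sum(1 for g, p in zip(gt, pred) if g == 1 and p == 0)
    if tp == 0:
        return 0.0
    prec, rec = tp / (tp + fp), tp / (tp + fn)
    return 2 * prec * rec / (prec + rec)

def psnr(gt, pred):
    """Peak signal-to-noise ratio for binary images (peak value 1);
    higher is better, identical images give infinity."""
    mse = sum((g - p) ** 2 for g, p in zip(gt, pred)) / len(gt)
    return float('inf') if mse == 0 else 10 * math.log10(1.0 / mse)
```

F-measure rewards recovering ink pixels without hallucinating noise, while PSNR penalizes every flipped pixel equally regardless of class.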
- …