Matrix Factorization in Tropical and Mixed Tropical-Linear Algebras
Matrix Factorization (MF) has found numerous applications in Machine Learning
and Data Mining, including collaborative filtering recommendation systems,
dimensionality reduction, data visualization, and community detection.
Motivated by the recent successes of tropical algebra and geometry in machine
learning, we investigate two problems involving matrix factorization over the
tropical algebra. For the first problem, Tropical Matrix Factorization (TMF),
which has already been studied in the literature, we propose an improved
algorithm that avoids many of the local optima. The second formulation
considers the approximate decomposition of a given matrix into the product of
three matrices where a usual matrix product is followed by a tropical product.
This formulation has a very interesting interpretation in terms of the learning
of the utility functions of multiple users. We also present numerical results
illustrating the effectiveness of the proposed algorithms, as well as an
application to recommendation systems, with promising results.
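As a rough illustration of the algebra involved, the sketch below implements the max-plus (tropical) matrix product, (U ⊗ V)_{ij} = max_k (U_{ik} + V_{kj}), and the resulting approximation error; the NumPy implementation and function names are illustrative and not taken from the paper.

```python
import numpy as np

def tropical_matmul(A, B):
    # Max-plus (tropical) product: C[i, j] = max_k (A[i, k] + B[k, j]).
    return np.max(A[:, :, None] + B[None, :, :], axis=1)

def tmf_error(X, U, V):
    # Frobenius-norm error of approximating X by the tropical product of U and V.
    return np.linalg.norm(X - tropical_matmul(U, V))

rng = np.random.default_rng(0)
U = rng.uniform(-1, 1, (4, 2))
V = rng.uniform(-1, 1, (2, 5))
X = tropical_matmul(U, V)       # exactly representable by construction
print(tmf_error(X, U, V))       # 0.0

# The mixed formulation instead approximates X by a usual product followed by a
# tropical one, i.e. tropical_matmul(A @ B, C).
```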
Visual Speech-Aware Perceptual 3D Facial Expression Reconstruction from Videos
The recent state of the art on monocular 3D face reconstruction from image
data has made some impressive advancements, thanks to the advent of Deep
Learning. However, it has mostly focused on input coming from a single RGB
image, overlooking the following important factors: a) Nowadays, the vast
majority of facial image data of interest do not originate from single images
but rather from videos, which contain rich dynamic information. b) Furthermore,
these videos typically capture individuals in some form of verbal communication
(public talks, teleconferences, audiovisual human-computer interactions,
interviews, monologues/dialogues in movies, etc.). When existing 3D face
reconstruction methods are applied to such videos, the artifacts in the
reconstructed shape and motion of the mouth area are often severe, since they
do not match the speech audio well.
To overcome the aforementioned limitations, we present the first method for
visual speech-aware perceptual reconstruction of 3D mouth expressions. We do
this by proposing a "lipread" loss, which guides the fitting process so that
the elicited perception from the 3D reconstructed talking head resembles that
of the original video footage. We demonstrate that, interestingly, the lipread
loss is better suited for 3D reconstruction of mouth movements compared to
traditional landmark losses, and even direct 3D supervision. Furthermore, the
devised method does not rely on any text transcriptions or corresponding audio,
rendering it ideal for training on unlabeled datasets. We verify the effectiveness
of our method through exhaustive objective evaluations on three large-scale
datasets, as well as through a subjective evaluation with two web-based user studies.
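A minimal sketch of what such a lip-reading perceptual loss could look like, assuming a frozen, pretrained lip-reading feature extractor (the hypothetical `lipreader` below) that maps sequences of mouth-region crops to feature vectors; this is an illustration under those assumptions, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def lipread_loss(lipreader, rendered_mouth, original_mouth):
    # rendered_mouth, original_mouth: (B, T, C, H, W) mouth-region crops from the
    # rendered 3D talking head and from the original video frames, respectively.
    with torch.no_grad():                      # target features are fixed
        target = lipreader(original_mouth)     # (B, T, D) lip-reading features
    pred = lipreader(rendered_mouth)           # gradients flow back to the 3D parameters
    # Penalize perceptual mismatch between rendered and real mouth movements.
    return 1.0 - F.cosine_similarity(pred, target, dim=-1).mean()
```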
3D Facial Expressions through Analysis-by-Neural-Synthesis
While existing methods for 3D face reconstruction from in-the-wild images
excel at recovering the overall face shape, they commonly miss subtle, extreme,
asymmetric, or rarely observed expressions. We improve upon these methods with
SMIRK (Spatial Modeling for Image-based Reconstruction of Kinesics), which
faithfully reconstructs expressive 3D faces from images. We identify two key
limitations in existing methods: shortcomings in their self-supervised training
formulation, and a lack of expression diversity in the training images. For
training, most methods employ differentiable rendering to compare a predicted
face mesh with the input image, along with a plethora of additional loss
functions. This differentiable rendering loss not only has to provide
supervision for optimizing 3D face geometry, camera, albedo, and lighting at
once, which is an ill-posed problem, but also suffers from the domain gap
between the rendering and the input image, which further hinders learning.
Instead, SMIRK
replaces the differentiable rendering with a neural rendering module that,
given the rendered geometry of the predicted mesh and sparsely sampled pixels of
the input image, generates a face image. As the neural rendering obtains color
information from the sampled image pixels, supervision with the neural
rendering-based reconstruction loss can focus solely on the geometry. Further,
it enables us to
generate images of the input identity with varying expressions while training.
These are then utilized as input to the reconstruction model, with their known
geometry serving as ground-truth supervision. This effectively augments the training
data and enhances the generalization for diverse expressions. Our qualitative,
quantitative, and particularly our perceptual evaluations demonstrate that SMIRK
achieves new state-of-the-art performance in accurate expression
reconstruction. Project webpage: https://georgeretsi.github.io/smirk/
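A rough sketch of the neural-rendering supervision described above, under the assumption of a toy convolutional renderer that receives rasterized geometry maps of the predicted mesh plus a sparse set of input pixels; all module and function names are illustrative placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class NeuralRenderer(nn.Module):
    # Toy stand-in: maps geometry maps + sparsely sampled colors to an RGB image.
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(6, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 3, 3, padding=1), nn.Sigmoid(),
        )

    def forward(self, geometry_maps, sparse_rgb):
        # geometry_maps: (B, 3, H, W) rasterized normals/depth of the predicted mesh
        # sparse_rgb:    (B, 3, H, W) input pixels, zeroed outside the sampled locations
        return self.net(torch.cat([geometry_maps, sparse_rgb], dim=1))

def sparse_sample(image, keep_ratio=0.05):
    # Keep a random ~5% of the input pixels; zero out the rest.
    mask = (torch.rand(image.shape[0], 1, *image.shape[2:]) < keep_ratio).float()
    return image * mask

# Illustrative training step: color comes from the sampled pixels, so the
# reconstruction loss mainly constrains the predicted geometry.
renderer = NeuralRenderer()
image = torch.rand(2, 3, 64, 64)      # input frames (placeholder data)
geometry = torch.rand(2, 3, 64, 64)   # rasterized predicted-mesh geometry (placeholder)
recon = renderer(geometry, sparse_sample(image))
loss = F.l1_loss(recon, image)
loss.backward()
```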
WordStylist: Styled Verbatim Handwritten Text Generation with Latent Diffusion Models
Text-to-Image synthesis is the task of generating an image according to a
specific text description. Generative Adversarial Networks have been considered
the standard method for image synthesis virtually since their introduction;
today, however, Denoising Diffusion Probabilistic Models are setting a new
baseline, with remarkable results in Text-to-Image synthesis, among other
fields. Aside from its usefulness per se, such synthesis can also be particularly
relevant as a tool for data augmentation when training models for other document
image processing tasks. In this work, we present a latent diffusion-based method
for styled text-content-to-image generation at the word level. Our proposed method
manages to generate realistic word image samples from different writer styles,
by conditioning on writer-style class indices and text-content prompts, without
the need for adversarial training, writer recognition, or text recognition. We
gauge system performance with Fréchet Inception Distance, writer recognition
accuracy, and writer retrieval. We show that the proposed model produces samples
that are aesthetically pleasing, help boost text recognition performance, and
yield a writer retrieval score similar to that of real data.
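A compact sketch of the kind of conditioning the abstract describes, assuming a toy denoiser over word-image latents that receives a writer-style class index and an already-embedded text-content prompt; the architecture, noise schedule, and names are illustrative placeholders, not the WordStylist model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CondDenoiser(nn.Module):
    # Toy denoiser: predicts the noise added to a word-image latent, conditioned on
    # a writer-style class index and an embedded text-content prompt.
    def __init__(self, latent_dim=64, n_writers=300, text_dim=32):
        super().__init__()
        self.writer_emb = nn.Embedding(n_writers, 32)   # style given as a class index
        self.time_emb = nn.Embedding(1000, 32)          # diffusion timestep embedding
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 32 + text_dim + 32, 256), nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, t, writer_id, text_feat):
        cond = torch.cat([self.writer_emb(writer_id), text_feat, self.time_emb(t)], dim=-1)
        return self.net(torch.cat([z_t, cond], dim=-1))

# One DDPM-style training step on random placeholder data.
model = CondDenoiser()
z0 = torch.randn(8, 64)                      # clean word-image latents (placeholder)
t = torch.randint(0, 1000, (8,))
writer = torch.randint(0, 300, (8,))
text = torch.randn(8, 32)                    # embedded text-content prompt (placeholder)
alpha_bar = torch.cos(t.float() / 1000 * 1.5) ** 2   # simple noise-schedule stand-in
eps = torch.randn_like(z0)
z_t = alpha_bar.sqrt()[:, None] * z0 + (1 - alpha_bar).sqrt()[:, None] * eps
loss = F.mse_loss(model(z_t, t, writer, text), eps)  # predict the added noise
loss.backward()
```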
Transforming scholarship in the archives through handwritten text recognition: Transkribus as a case study
Purpose: An overview of the current use of handwritten text recognition (HTR) on archival manuscript material, as provided by the EU H2020 funded Transkribus platform. It explains HTR, demonstrates Transkribus, gives examples of use cases, highlights the effect HTR may have on scholarship, and evidences this turning point in the advanced use of digitised heritage content. The paper aims to discuss these issues.
Design/methodology/approach: This paper adopts a case study approach, using the development and delivery of the one openly available HTR platform for manuscript material.
Findings: Transkribus has demonstrated that HTR is now a usable technology that can be employed in conjunction with mass digitisation to generate accurate transcripts of archival material. Use cases are demonstrated, and a cooperative model is suggested as a way to ensure the sustainability and scaling of the platform. However, funding and resourcing issues are identified.
Research limitations/implications: The paper presents results from projects; further user studies could be undertaken involving interviews, surveys, etc.
Practical implications: Only HTR provided via Transkribus is covered; however, this is the only publicly available platform for HTR on individual collections of historical documents at the time of writing, and it represents the current state of the art in this field.
Social implications: The increased access to information contained within historical texts has the potential to be transformational for both institutions and individuals.
Originality/value: This is the first published overview of how HTR is used by a wide archival studies community, reporting and showcasing current applications of handwriting technology in the cultural heritage sector.
Transferable Deep Features for Keyword Spotting
Deep features, defined as the activations of hidden layers of a neural network, have given promising results when applied to various vision tasks. In this paper, we explore the usefulness and transferability of deep features in the context of keyword spotting (KWS). We use a state-of-the-art deep convolutional network to extract deep features. The optimal parameters concerning their application are subsequently studied: the impact of the choice of hidden layer, the impact of applying dimensionality reduction with a manifold learning technique, as well as the choice of dissimilarity measure used to retrieve relevant word images. Extensive numerical results show that deep features lead to state-of-the-art KWS performance, even when the test and training sets come from different document collections.
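A minimal query-by-example sketch of this kind of pipeline, assuming a generic CNN whose hidden-layer activations serve as "deep features" and cosine dissimilarity for retrieval; the network and function names are illustrative stand-ins (in practice the CNN would be pretrained), and the manifold-learning dimensionality reduction step is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative feature extractor standing in for a pretrained deep CNN; in practice
# one would load a trained network and read out the activations of a hidden layer.
feature_extractor = nn.Sequential(
    nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
    nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),      # hidden-layer activations -> (N, 64)
).eval()

@torch.no_grad()
def deep_features(word_images):
    # word_images: (N, 1, H, W) grayscale word crops -> L2-normalized descriptors.
    return F.normalize(feature_extractor(word_images), dim=1)

def rank_by_cosine(query_feat, collection_feats):
    # Cosine dissimilarity: smaller is more relevant; returns indices sorted by relevance.
    dissim = 1.0 - collection_feats @ query_feat
    return torch.argsort(dissim)

# Toy usage with random tensors standing in for segmented word images.
collection = deep_features(torch.rand(100, 1, 64, 160))
query = deep_features(torch.rand(1, 1, 64, 160))[0]
print(rank_by_cosine(query, collection)[:5])
```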