
    Visual Text Correction

    Videos, images, and sentences are mediums that can express the same semantics. One can imagine a picture by reading a sentence or can describe a scene with some words. However, even small changes in a sentence can cause a significant semantic inconsistency with the corresponding video/image. For example, changing the verb of a sentence may drastically change its meaning. There have been many efforts to encode a video/sentence and decode it as a sentence/video. In this research, we study a new scenario in which both the sentence and the video are given, but the sentence is inaccurate. A semantic inconsistency between the sentence and the video, or between the words of a sentence, can result in an inaccurate description. This paper introduces a new problem, called Visual Text Correction (VTC): finding and replacing an inaccurate word in the textual description of a video. We propose a deep network that can simultaneously detect an inaccuracy in a sentence and fix it by replacing the inaccurate word(s). Our method leverages the semantic interdependence of videos and words, as well as the short-term and long-term relations of the words in a sentence. In our formulation, part of a visual feature vector for every single word is dynamically selected through a gating process. Furthermore, to train and evaluate our model, we propose an approach to automatically construct a large dataset for the VTC problem. Our experiments and performance analysis demonstrate that the proposed method provides very good results and also highlight the general challenges in solving the VTC problem. To the best of our knowledge, this work is the first of its kind for the Visual Text Correction task.
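
    The gating idea described above can be pictured as follows: a gate computed from each word embedding decides which channels of the clip-level visual feature are mixed into that word's representation. The PyTorch snippet below is a minimal, illustrative sketch under assumed dimensions and layer names (GatedVisualWordEncoder, word_dim, visual_dim are our placeholders), not the authors' implementation.

        # Hedged sketch of per-word visual gating; shapes and names are illustrative.
        import torch
        import torch.nn as nn

        class GatedVisualWordEncoder(nn.Module):
            def __init__(self, word_dim=300, visual_dim=512):
                super().__init__()
                self.gate = nn.Linear(word_dim, visual_dim)            # one gate value per visual channel
                self.proj = nn.Linear(word_dim + visual_dim, word_dim)

            def forward(self, word_emb, visual_feat):
                # word_emb:    (batch, seq_len, word_dim)
                # visual_feat: (batch, visual_dim), one feature vector per video
                g = torch.sigmoid(self.gate(word_emb))                 # (batch, seq_len, visual_dim)
                v = g * visual_feat.unsqueeze(1)                       # dynamically selected visual channels per word
                return self.proj(torch.cat([word_emb, v], dim=-1))     # fused word/visual representation

        # usage sketch
        enc = GatedVisualWordEncoder()
        words = torch.randn(2, 12, 300)   # two 12-word sentences
        video = torch.randn(2, 512)       # one visual feature per video
        fused = enc(words, video)         # (2, 12, 300)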

    STEFANN: Scene Text Editor using Font Adaptive Neural Network

    Textual information in a captured scene plays an important role in scene interpretation and decision making. Though there exist methods that can successfully detect and interpret complex text regions present in a scene, to the best of our knowledge, there is no significant prior work that aims to modify the textual information in an image. The ability to edit text directly on images has several advantages, including error correction, text restoration, and image reusability. In this paper, we propose a method to modify text in an image at the character level. We approach the problem in two stages. First, the unobserved character (target) is generated from an observed character (source) being modified. We propose two different neural network architectures: (a) FANnet, to achieve structural consistency with the source font, and (b) Colornet, to preserve the source color. Next, we replace the source character with the generated character, maintaining both geometric and visual consistency with neighboring characters. Our method works as a unified platform for modifying text in images. We present the effectiveness of our method on the COCO-Text and ICDAR datasets both qualitatively and quantitatively. Comment: Accepted at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) 2020.
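
    The two-stage edit can be viewed as a small pipeline: generate the target glyph in the source font, transfer the source colour onto it, and paste the result back into the original bounding box. The sketch below is a hypothetical orchestration with placeholder functions (fannet, colornet, edit_character are names we introduce); the real FANnet and Colornet are trained networks and are not reproduced here.

        # Hypothetical two-stage character-edit pipeline; NOT the paper's code.
        import numpy as np

        def fannet(source_glyph: np.ndarray, target_char: str) -> np.ndarray:
            """Placeholder for FANnet: should return a glyph of `target_char`
            rendered in the source font; here it simply echoes the source shape."""
            return (source_glyph > 0.5).astype(np.float32)

        def colornet(source_patch: np.ndarray, target_glyph: np.ndarray) -> np.ndarray:
            """Placeholder for Colornet: transfers the source patch's mean colour
            onto the generated glyph."""
            colour = source_patch.reshape(-1, 3).mean(axis=0)
            return target_glyph[..., None] * colour

        def edit_character(image, box, target_char):
            """Replace the character inside `box` (x, y, w, h) with `target_char`,
            keeping geometry (same box) and appearance (colour transfer)."""
            x, y, w, h = box
            patch = image[y:y + h, x:x + w].astype(np.float32) / 255.0
            glyph = patch.mean(axis=-1)                 # grey-level "shape" of the source character
            new_glyph = fannet(glyph, target_char)      # stage 1: structural consistency with the source font
            new_patch = colornet(patch, new_glyph)      # stage 2: colour consistency with the source
            out = image.copy()
            out[y:y + h, x:x + w] = (new_patch * 255).astype(np.uint8)
            return out

        # usage sketch on a dummy image
        img = np.zeros((64, 256, 3), dtype=np.uint8)
        edited = edit_character(img, box=(10, 8, 24, 40), target_char="A")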

    Near visual function measured with a novel tablet application in patients with astigmatism

    Clinical relevance: While the clinical focus of performance metrics is traditionally based on visual acuity, research from the field of visual impairment has demonstrated that metrics such as reading speed and critical print size correlate much more strongly with subjective patient-reported outcomes and assessed ability in real-world tasks. Background: More recently, digital device use has increasingly replaced many paper-based tasks. This study therefore aimed to assess how standard acuity/contrast metrics and functional reading ability relate to real-world performance on an iPad-based reading task in astigmatic patients corrected with toric and mean spherical equivalent contact lenses. Methods: Thirty-four adult participants, with −0.75 to −1.50 D of refractive astigmatism, were enrolled in a double-masked cross-over study and fitted with toric and spherical equivalent contact lenses, in random order. A digital application was developed to assess zoom, contrast modifications, the distance at which the tablet was held, blink rate, and time to complete the reading task. High- and low-contrast near logMAR visual acuity were measured along with reading performance (critical print size and optimal reading speed). Results: The amount by which participants chose to increase tablet font size (zoom) was correlated with their high-contrast visual acuity with toric correction (r = 0.434, p = 0.010). With best sphere correction, zoom was associated with reading speed (r = −0.450, p = 0.008) and working distance (r = 0.522, p = 0.002). Text zoom was also associated with horizontal (toric: r = 0.898, p < 0.001; sphere: r = 0.880, p < 0.001) and vertical scrolling (toric: r = 0.857, p < 0.001; sphere: r = 0.846, p < 0.001). There was a significant negative association between the selection of text contrast and zoom (toric: r = −0.417, p = 0.0141; sphere: r = −0.385, p = 0.025). Conclusion: Real-world task performance allows a more robust assessment of visual function than standard visual metrics alone. Digital technology offers the opportunity to better understand the impact of different vision correction options on real-world task performance.
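
    For readers unfamiliar with the reported statistics, the r/p pairs above are Pearson correlations. A minimal illustration of how such a value is computed is shown below; the arrays are synthetic stand-ins, not the study's data.

        # Illustrative Pearson correlation, e.g. zoom vs. high-contrast acuity;
        # the data below are made up for the example.
        import numpy as np
        from scipy.stats import pearsonr

        rng = np.random.default_rng(0)
        logmar_acuity = rng.normal(0.0, 0.1, size=34)                # high-contrast near logMAR, 34 participants
        zoom = 1.0 + 2.0 * logmar_acuity + rng.normal(0, 0.1, 34)    # chosen font-size magnification

        r, p = pearsonr(logmar_acuity, zoom)
        print(f"r = {r:.3f}, p = {p:.3f}")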

    Collaborative Development and Evaluation of Text-processing Workflows in a UIMA-supported Web-based Workbench

    Challenges in creating comprehensive text-processing workflows include a lack of interoperability between individual components coming from different providers and/or a requirement that end users know programming techniques to compose such workflows. In this paper we demonstrate Argo, a web-based system that addresses these issues in several ways. It supports the widely adopted Unstructured Information Management Architecture (UIMA), which handles the problem of interoperability; it provides a web browser-based interface for developing workflows by drawing diagrams composed of a selection of available processing components; and it provides novel user-interactive analytics such as the annotation editor, which constitutes a bridge between automatic processing and manual correction. These features extend the target audience of Argo to users with limited or no technical background. Here, we focus specifically on the construction of advanced workflows, involving multiple branching and merging points, to facilitate various comparative evaluations. Together with the user-collaboration capabilities supported in Argo, we demonstrate several use cases including visual inspections, comparisons of multiple processing segments or complete solutions against a reference standard, inter-annotator agreement, and shared-task mass evaluations. Ultimately, Argo emerges as a one-stop workbench for defining, processing, editing and evaluating text-processing tasks.
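
    The comparative-evaluation workflows mentioned above boil down to running two or more processing branches over the same documents and scoring each branch against a reference standard at the merge point. The toy sketch below illustrates that shape only; the branch functions and the token-level accuracy metric are our own simplifications, not Argo's or UIMA's API.

        # Toy branching/merging evaluation: two annotators scored against a gold standard.
        def branch_a(tokens):
            # e.g. a dictionary-style tagger: any all-uppercase token is a gene mention
            return ["GENE" if t.isupper() else "O" for t in tokens]

        def branch_b(tokens):
            # e.g. a stricter heuristic that also requires a minimum length
            return ["GENE" if t.isupper() and len(t) > 4 else "O" for t in tokens]

        def score(pred, gold):
            """Token-level accuracy against the reference standard."""
            return sum(p == g for p, g in zip(pred, gold)) / len(gold)

        tokens = "The BRCA1 and TP53 genes were studied".split()
        gold   = ["O", "GENE", "O", "GENE", "O", "O", "O"]

        for name, branch in [("branch A", branch_a), ("branch B", branch_b)]:
            print(name, score(branch(tokens), gold))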

    Pattern of reading eye movements during monovision contact lens wear in presbyopes

    Monovision can be used as a method to correct presbyopia with contact lenses (CL), but its effect on reading behavior is still poorly understood. In this study, eye movements (EM) were recorded in fifteen presbyopic participants, naïve to monovision, whilst they read arrays of words, non-words, and text passages to assess whether monovision affected their reading. Three conditions were compared, using daily disposable CLs: baseline (near correction in both eyes), conventional monovision (distance correction in the dominant eye, near correction in the non-dominant eye), and crossed monovision (the reversal of conventional monovision). Behavioral measures (reading speed and accuracy) and EM parameters (single fixation duration, number of fixations, dwell time per item, percentage of regressions, and percentage of skipped items) were analyzed. When reading passages, no differences in behavioral or EM measures were seen in any comparison of the three conditions. The number of fixations and the dwell time increased significantly for both monovision and crossed monovision with respect to baseline only with word and non-word arrays. It appears that monovision did not appreciably alter visual processing when reading meaningful texts; some limited disruption of the EM pattern was observed only with arrays of unrelated or meaningless items, which require the reader to engage in more in-depth, controlled visual processing.
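
    The EM parameters listed above are simple aggregates over fixation records. The following toy calculation shows how reading speed, dwell time per item, fixations per item, and the percentage of regressions could be derived from one trial; the numbers are invented and this is not the study's analysis code.

        # Toy eye-movement aggregates from a single (invented) trial.
        fixations = [  # (word_index, fixation_duration_ms)
            (0, 210), (1, 180), (1, 150), (2, 240), (3, 200), (2, 120), (4, 230),
        ]
        n_words = 5
        trial_seconds = 2.4

        reading_speed_wpm = n_words / (trial_seconds / 60.0)

        dwell_ms, n_fix = {}, {}
        for word, dur in fixations:
            dwell_ms[word] = dwell_ms.get(word, 0) + dur     # dwell time per item
            n_fix[word] = n_fix.get(word, 0) + 1             # number of fixations per item

        moves = list(zip(fixations, fixations[1:]))
        regressions = sum(1 for (w1, _), (w2, _) in moves if w2 < w1)
        pct_regressions = 100.0 * regressions / len(moves)   # percentage of regressive saccades

        print(f"{reading_speed_wpm:.0f} wpm, regressions: {pct_regressions:.1f}%")
        print("dwell (ms):", dwell_ms, "fixations:", n_fix)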

    Woodpecker: Hallucination Correction for Multimodal Large Language Models

    Hallucination is a big shadow hanging over the rapidly evolving Multimodal Large Language Models (MLLMs), referring to the phenomenon that the generated text is inconsistent with the image content. To mitigate hallucinations, existing studies mainly resort to an instruction-tuning manner that requires retraining the models with specific data. In this paper, we pave a different way, introducing a training-free method named Woodpecker. Like a woodpecker healing trees, it picks out and corrects hallucinations in the generated text. Concretely, Woodpecker consists of five stages: key concept extraction, question formulation, visual knowledge validation, visual claim generation, and hallucination correction. Implemented in a post-remedy manner, Woodpecker can easily serve different MLLMs while remaining interpretable through the intermediate outputs of the five stages. We evaluate Woodpecker both quantitatively and qualitatively and show the huge potential of this new paradigm. On the POPE benchmark, our method obtains a 30.66%/24.33% improvement in accuracy over the baseline MiniGPT-4/mPLUG-Owl. The source code is released at https://github.com/BradyFU/Woodpecker. Comment: 16 pages, 7 figures. Code website: https://github.com/BradyFU/Woodpecker.
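
    The five stages lend themselves to a straightforward post-hoc pipeline: extract the concepts the answer commits to, turn them into verification questions, check those questions against the image, convert the evidence into explicit claims, and rewrite the answer. The sketch below only shows that control flow; every stage body is a placeholder heuristic (the real system queries an MLLM, an open-set detector, and a VQA model), so it is not the released Woodpecker code.

        # Schematic five-stage, training-free correction loop; stage bodies are stubs.
        def extract_key_concepts(answer: str) -> list[str]:
            # stage 1: objects/attributes the answer commits to (toy heuristic)
            words = [w.strip(".,") for w in answer.split()]
            return [w for w in words if w.istitle() and len(w) > 1]

        def formulate_questions(concepts: list[str]) -> list[str]:
            # stage 2: one verification question per concept
            return [f"Is there a {c.lower()} in the image?" for c in concepts]

        def validate_visual_knowledge(image, questions: list[str]) -> dict[str, bool]:
            # stage 3: normally answered by a detector / VQA model; stubbed here
            return {q: False for q in questions}

        def generate_visual_claims(evidence: dict[str, bool]) -> list[str]:
            # stage 4: turn the evidence into explicit claims about the image
            return [f"{q} -> {'yes' if ok else 'no'}" for q, ok in evidence.items()]

        def correct_hallucinations(answer: str, claims: list[str]) -> str:
            # stage 5: rewrite the answer to be consistent with the claims
            # (the real system prompts an LLM with the answer and the claims)
            return answer + " [revised against: " + "; ".join(claims) + "]"

        def woodpecker_style_correction(image, answer: str) -> str:
            concepts = extract_key_concepts(answer)
            questions = formulate_questions(concepts)
            evidence = validate_visual_knowledge(image, questions)
            claims = generate_visual_claims(evidence)
            return correct_hallucinations(answer, claims)

        print(woodpecker_style_correction(None, "A Dog sits beside a red Bicycle."))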

    The BURCHAK corpus: a Challenge Data Set for Interactive Learning of Visually Grounded Word Meanings

    We motivate and describe a new freely available human-human dialogue dataset for interactive learning of visually grounded word meanings through ostensive definition by a tutor to a learner. The data has been collected using a novel, character-by-character variant of the DiET chat tool (Healey et al., 2003; Mills and Healey, submitted) with a novel task in which a learner needs to learn invented visual attribute words (such as "burchak" for square) from a tutor. As such, the text-based interactions closely resemble face-to-face conversation and thus contain many of the linguistic phenomena encountered in natural, spontaneous dialogue. These include self- and other-correction, mid-sentence continuations, interruptions, overlaps, fillers, and hedges. We also present a generic n-gram framework for building user (i.e. tutor) simulations from this type of incremental data, which is freely available to researchers. We show that the simulations produce outputs that are similar to the original data (e.g. 78% turn-match similarity). Finally, we train and evaluate a Reinforcement Learning dialogue control agent for learning visually grounded word meanings, trained on the BURCHAK corpus. The learned policy shows comparable performance to a rule-based system built previously. Comment: 10 pages, The 6th Workshop on Vision and Language (VL'17).
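
    The n-gram user-simulation idea can be illustrated with a tiny bigram model over tutor turns: count word-to-word transitions in the corpus and sample new turns token by token. The snippet below uses an invented three-utterance corpus and our own function names; it is a sketch of the general technique, not the released framework.

        # Tiny bigram "tutor" simulation built from an invented corpus snippet.
        import random
        from collections import defaultdict

        tutor_turns = [
            "this one is a burchak",
            "no , that shape is a burchak",
            "yes , that is right",
        ]

        # count bigram transitions, with <s>/</s> as turn boundaries
        transitions = defaultdict(list)
        for turn in tutor_turns:
            tokens = ["<s>"] + turn.split() + ["</s>"]
            for a, b in zip(tokens, tokens[1:]):
                transitions[a].append(b)

        def simulate_tutor_turn(max_len=12, seed=0):
            rng = random.Random(seed)
            token, out = "<s>", []
            while len(out) < max_len:
                token = rng.choice(transitions[token])
                if token == "</s>":
                    break
                out.append(token)
            return " ".join(out)

        print(simulate_tutor_turn())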

    Vision and Reading Difficulties Part 5: Clinical protocol and the role of the eye-care practitioner

    This series of articles has described various aspects of the visual characteristics of reading difficulties and the background behind techniques, such as the use of coloured filters, for helping to reduce the difficulties that are experienced. The present article, which is the last in the series, aims to describe a clinical protocol that can be used by the busy eye-care practitioner for the investigation and management of such patients. It also describes the testing techniques that can be used for the various assessments. Warning: DO NOT LOOK AT FIGURE 7 IF YOU HAVE MIGRAINE OR EPILEPSY.