
    Document image processing using irregular pyramid structure


    Similarity Reasoning and Filtration for Image-Text Matching

    Image-text matching plays a critical role in bridging vision and language, and great progress has been made by exploiting the global alignment between image and sentence, or local alignments between regions and words. However, how to make the most of these alignments to infer more accurate matching scores is still underexplored. In this paper, we propose a novel Similarity Graph Reasoning and Attention Filtration (SGRAF) network for image-text matching. Specifically, vector-based similarity representations are first learned to characterize the local and global alignments more comprehensively, and the Similarity Graph Reasoning (SGR) module, built on a graph convolutional neural network, is then introduced to infer relation-aware similarities over both the local and global alignments. The Similarity Attention Filtration (SAF) module is further developed to integrate these alignments effectively by selectively attending to the significant and representative alignments while casting aside the interference of non-meaningful ones. We demonstrate the superiority of the proposed method by achieving state-of-the-art performance on the Flickr30K and MSCOCO datasets, and the good interpretability of the SGR and SAF modules with extensive qualitative experiments and analyses. Comment: 14 pages, 8 figures. Accepted by AAAI 2021.
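The attention-filtration idea can be pictured in a few lines. This is an illustrative stand-in, not the authors' SAF module: the real module learns projection weights, while here the attention over alignments is derived directly from the similarity scores themselves (`attention_filtration` and its `temperature` parameter are invented for the sketch).

```python
import numpy as np

def attention_filtration(similarities, temperature=1.0):
    """Aggregate per-alignment similarity scores with softmax attention.

    similarities: (n,) array of local/global alignment scores.
    Illustrative only: weights come from the scores, so strong
    alignments dominate and weak (noisy) ones are suppressed.
    """
    logits = similarities / temperature
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()              # softmax over alignments
    return float(weights @ similarities)  # attention-weighted matching score

scores = np.array([0.9, 0.8, 0.1])  # two meaningful alignments, one noisy
print(attention_filtration(scores))  # higher than the plain mean of 0.6
```

Because the noisy 0.1 alignment receives the smallest weight, the aggregated score sits closer to the meaningful alignments than a uniform average would.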

    Vision Grid Transformer for Document Layout Analysis

    Document pre-trained models and grid-based models have proven to be very effective on various tasks in Document AI. However, for the document layout analysis (DLA) task, existing document pre-trained models, even those pre-trained in a multi-modal fashion, usually rely on either textual or visual features. Grid-based models for DLA are multi-modal but largely neglect the effect of pre-training. To fully leverage multi-modal information and exploit pre-training techniques to learn better representations for DLA, in this paper we present VGT, a two-stream Vision Grid Transformer, in which a Grid Transformer (GiT) is proposed and pre-trained for 2D token-level and segment-level semantic understanding. Furthermore, a new dataset named D4LA, so far the most diverse and detailed manually annotated benchmark for document layout analysis, is curated and released. Experimental results illustrate that the proposed VGT model achieves new state-of-the-art results on DLA tasks, e.g. PubLayNet (95.7% → 96.2%), DocBank (79.6% → 84.1%), and D4LA (67.7% → 68.8%). The code and models, as well as the D4LA dataset, will be made publicly available at https://github.com/AlibabaResearch/AdvancedLiterateMachinery. Comment: Accepted by ICCV 2023.
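The "grid" input that grid-based document models consume can be illustrated with a small sketch. This is not the VGT pipeline; `build_token_grid`, the grid size, and the token ids are assumptions made for illustration, showing only how OCR tokens with box coordinates are scattered into a 2D layout-aware grid.

```python
import numpy as np

def build_token_grid(tokens, boxes, image_size, grid_size=(4, 4)):
    """Scatter OCR token ids into a coarse 2D grid aligned with the page.

    tokens: list of token ids; boxes: list of (x, y) box centers in pixels.
    Illustrative of grid-based document inputs, not VGT itself.
    """
    H, W = image_size
    gh, gw = grid_size
    grid = np.zeros(grid_size, dtype=np.int64)  # 0 = empty cell
    for tid, (x, y) in zip(tokens, boxes):
        r = min(int(y / H * gh), gh - 1)  # row from vertical position
        c = min(int(x / W * gw), gw - 1)  # column from horizontal position
        grid[r, c] = tid
    return grid

# A token near the top-left and one near the bottom-right of a 100x100 page.
grid = build_token_grid([7, 9], [(10, 10), (90, 90)], image_size=(100, 100))
```

The resulting 2D array preserves reading layout, which is what lets a transformer over grid cells reason about both text content and page geometry.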

    DavarOCR: A Toolbox for OCR and Multi-Modal Document Understanding

    This paper presents DavarOCR, an open-source toolbox for OCR and document understanding tasks. DavarOCR currently implements 19 advanced algorithms, covering 9 different task forms, and provides detailed usage instructions and trained models for each algorithm. Compared with previous open-source OCR toolboxes, DavarOCR has relatively more complete support for the sub-tasks of cutting-edge document understanding technology. To promote the development and application of OCR technology in academia and industry, we pay particular attention to modules that different technical sub-domains can share. DavarOCR is publicly released at https://github.com/hikopensource/Davar-Lab-OCR. Comment: Short paper. Accepted by ACM MM 2022.

    SYNTHESIZING DYSARTHRIC SPEECH USING MULTI-SPEAKER TTS FOR DYSARTHRIC SPEECH RECOGNITION

    Dysarthria is a motor speech disorder often characterized by reduced speech intelligibility through slow, uncoordinated control of the speech production muscles. Automatic Speech Recognition (ASR) systems may help dysarthric talkers communicate more effectively. However, robust dysarthria-specific ASR requires a significant amount of training speech, which is not readily available for dysarthric talkers. In this dissertation, we investigate dysarthric speech augmentation and synthesis methods. To better understand differences in the prosodic and acoustic characteristics of dysarthric spontaneous speech at varying severity levels, a comparative study between typical and dysarthric speech was conducted. These characteristics are important components for dysarthric speech modeling, synthesis, and augmentation. For augmentation, prosodic transformation and time-feature masking are proposed. For dysarthric speech synthesis, this dissertation introduces a modified neural multi-talker TTS that adds a dysarthria severity level coefficient and a pause insertion model to synthesize dysarthric speech at varying severity levels. In addition, we extend this work by using a label propagation technique to create more meaningful control variables, such as a continuous Respiration, Laryngeal and Tongue (RLT) parameter, even for datasets that only provide discrete dysarthria severity level information. This approach increases the controllability of the system, so we are able to generate dysarthric speech spanning a broader range. To evaluate their effectiveness for synthesizing training data, dysarthria-specific speech recognition was used. Results show that a DNN-HMM model trained on additional synthetic dysarthric speech achieves a WER improvement of 12.2% compared to the baseline, and that the addition of the severity level and pause insertion controls decreases WER by 6.5%, showing the effectiveness of adding these parameters.
Overall results on the TORGO database demonstrate that using synthetic dysarthric speech to increase the amount of dysarthric-patterned training speech has a significant impact on dysarthric ASR systems.
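One way to picture the severity-level coefficient described above is as a scalar that blends a "dysarthria direction" into a speaker embedding before synthesis. The sketch below is a hypothetical simplification; `condition_speaker_embedding` and both vectors are illustrative stand-ins, not the dissertation's trained TTS.

```python
import numpy as np

def condition_speaker_embedding(speaker_emb, severity, severity_emb):
    """Blend a severity direction into a speaker embedding.

    severity in [0, 1] scales a (here made-up) 'dysarthria direction';
    a multi-talker TTS would condition its decoder on the result.
    """
    return speaker_emb + severity * severity_emb

spk = np.zeros(4)                            # stand-in speaker embedding
sev_dir = np.array([1.0, 0.0, -1.0, 0.0])    # stand-in severity direction
mild = condition_speaker_embedding(spk, 0.2, sev_dir)
severe = condition_speaker_embedding(spk, 0.9, sev_dir)
```

Making the coefficient continuous, as the RLT extension does, means the synthesizer can be driven to any point along this direction rather than only the discrete severity labels seen in training.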

    Scene Based Text Recognition From Natural Images and Classification Based on Hybrid CNN Models with Performance Evaluation

    As with the recognition of captions, pictures, or overlaid text, which typically appear horizontally, multi-oriented text recognition in video frames is challenging because of the text's contrast with its background. Multi-oriented text normally denotes scene text, which makes recognition more demanding owing to the unfavourable characteristics of scene text. Hence, conventional text detection approaches may not give good results for multi-oriented scene text detection. Text detection in natural images has long been challenging, and significant progress has been made recently on this task. Most previous research, however, does not work well on blurred, low-resolution, and small-sized images, so a research gap remains in that area. Scene-based text detection is a key area due to its wide range of applications. One primary reason for the failure of earlier methods is that they could not generate precise alignments between feature areas and targets for such images. This research focuses on scene-based text detection with the aid of a YOLO-based object detector and a CNN-based classification approach. The experiments were conducted in MATLAB R2019a, and the networks used were ResNet50, InceptionResNetV2, and DenseNet201. The proposed methodology procured a maximum accuracy of 91% for Hybrid ResNet-YOLO, 81.2% for Hybrid InceptionResNetV2-YOLO, and 83.1% for Hybrid DenseNet201-YOLO, verified by comparison with existing work: ResNet-50 at 76.9%, ResNet-101 at 79.5%, and ResNet-152 at 82%.
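The detect-then-classify structure of such hybrid pipelines can be sketched generically. The stand-in `detector` and `classifier` callables below replace the YOLO and CNN models; only the glue logic (crop each proposed box, then label the crop) reflects the approach described.

```python
import numpy as np

def detect_then_classify(image, detector, classifier):
    """Two-stage pipeline: a detector proposes boxes, a classifier labels crops.

    detector: image -> list of (x0, y0, x1, y1) boxes.
    classifier: cropped array -> label.
    Both are stand-ins for trained YOLO / CNN models.
    """
    results = []
    for (x0, y0, x1, y1) in detector(image):
        crop = image[y0:y1, x0:x1]          # cut the proposed region out
        results.append(((x0, y0, x1, y1), classifier(crop)))
    return results

# Toy stand-ins: one fixed box, and a classifier keyed on mean intensity.
image = np.zeros((32, 32))
image[8:16, 8:24] = 1.0                     # a bright "text" region
boxes = lambda img: [(8, 8, 24, 16)]
label = lambda crop: "text" if crop.mean() > 0.5 else "background"
print(detect_then_classify(image, boxes, label))
```

Separating the two stages is what lets the detector backbone and the classification backbone (ResNet, Inception-ResNet, DenseNet) be swapped independently, as the accuracy comparison above does.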

    Natural Language Interpreter and Arithmetic Word Problem Solver

    The field of Natural Language Processing (NLP) is nowadays among the most studied and fastest-developing areas of Artificial Intelligence. Among its countless applications, it is worth noting that the classic test of machine intelligence, the Turing Test, detects human-like intelligence precisely through language-based chat intended to demonstrate sufficient mental capacities. In this sense, the computational analysis of language comprehension and production can be deemed of prominent importance. The ultimate objective of this work is to combine the results of language parsing with a notable strength of computers: the manipulation of numbers. Two principal tasks of this project can therefore be outlined. A parser for the natural language selected for this project, Catalan, finds a syntactic representation of a given sentence, and an arithmetic word problem solver links the established interpretation to the resolution of an arithmetic word problem posed in natural language. The work concludes with a discussion of the results, opportune enhancements for future work, and possible ways to address the issues and deficiencies encountered.
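A toy version of the second task, pairing shallow language analysis with arithmetic, can be sketched as keyword matching. The project itself works in Catalan from a full syntactic parse; `solve_simple_word_problem` and its keyword list are invented for this English-language illustration.

```python
import re

def solve_simple_word_problem(text):
    """Solve one-step add/subtract word problems by keyword matching.

    A toy illustration only: real solvers derive the operation from
    the sentence's syntactic representation, not from keywords.
    """
    numbers = [int(n) for n in re.findall(r"\d+", text)]
    if len(numbers) != 2:
        raise ValueError("expected exactly two numbers")
    a, b = numbers
    # Loss-flavoured keywords signal subtraction; otherwise add.
    if any(k in text.lower() for k in ("lost", "gave", "fewer", "left")):
        return a - b
    return a + b

print(solve_simple_word_problem("Anna had 7 apples and gave 2 to Joan."))  # 5
```

The gap between this sketch and a real solver, deciding the operation from structure rather than surface cues, is exactly where the syntactic parser earns its place in the pipeline.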

    Learning Semantic Information from Multimodal Data using Deep Neural Networks

    During the last decades, most collective information has been digitized to form an immense database distributed across the Internet. This can also be referred to as Big Data: a collection of data that is vast in volume and still growing with time. Nowadays, Big Data is everywhere. We might not even realize how much it affects our daily lives, as it is applied in many ways, ranging from online shopping, music streaming, TV streaming, travel and transportation, energy, and fighting crime to health care. Many organizations and companies collect and analyze large volumes of data to solve domain-specific problems or make business decisions. One of the powerful tools that can be used to extract value from Big Data is deep learning, a type of machine learning based on artificial neural networks, which are inspired by the structure and function of the human brain and learn from large amounts of data. Deep learning has been widely applied in many research fields, such as natural language processing, IoT applications, and computer vision. In this thesis, we introduce three deep neural networks used to learn semantic information from different types of data, and a design guideline to accelerate a neural network layer on a general-purpose computing platform. First, we focus on text data. We propose a new feature extraction technique to preprocess the dataset and optimize the original Restricted Boltzmann Machine (RBM) model to generate more meaningful topics that better represent a given document. Our proposed method improves the generated topic accuracy by up to 12.99% on the Open Movie, Reuters, and 20NewsGroup datasets. Moving from text to image data, and with additional click locations, we propose a human-in-the-loop automatic image labeling framework focusing on aerial images with fewer features for detection. The proposed model consists of two main parts: a prediction model and an adjustment model.
The user first provides click locations to the prediction model to generate a bounding box for a specific object. The bounding box is then fine-tuned by the adjustment model for more accurate size and location. A feedback-and-retrain mechanism is implemented that allows users to manually adjust the generated bounding box and provide feedback to incrementally train the adjustment network during runtime. This unique online learning feature enables the user to generalize the existing model to target classes not initially present in the training set, and gradually improves the specificity of the model to those new targets during online learning. Combining text and image data, we propose a Multi-region Attention-assisted Grounding network (MAGNet) framework that utilizes spatial attention networks for image-level visual-textual fusion, preserving local (word) and global (phrase) information to refine region proposals with an in-network Region Proposal Network (RPN) and detect single or multiple regions for a phrase query. Our framework is independent of external proposal generation systems, and without additional information it develops an understanding of the query phrase in relation to the image, achieving respectable results on Flickr30k Entities and a 12% improvement over the state of the art on the ReferIt game. Additionally, our model is capable of grounding multiple regions for a query phrase, which is more suitable for real-life applications. Although deep neural networks (DNNs) have become a powerful tool, they are highly expensive in both computational time and storage cost. To optimize and improve the performance of the network while maintaining accuracy, the block-circulant matrix (BCM) algorithm has been introduced. It has proven to be highly effective when implemented on customized hardware, such as FPGAs. However, its performance suffers on general-purpose computing platforms.
In certain cases, using the BCM does not improve the total computation time of the networks at all. To address this problem, we propose a parallel implementation of the BCM layer, and guidelines that generally lead to better implementation practice are provided. The guidelines run across popular implementation languages and packages, including Python, numpy, intel-numpy, tensorflow, and nGraph.
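The core trick behind BCM layers is that multiplying by a circulant matrix is a circular convolution, which the FFT computes in O(n log n) instead of the O(n²) of a dense matrix-vector product. A minimal numpy sketch of that kernel (`circulant_matvec` is an illustrative name, not an API from the thesis):

```python
import numpy as np

def circulant_matvec(first_col, x):
    """Multiply a circulant matrix by x via the FFT.

    The full matrix is defined by its first column, so a BCM layer only
    stores n values per n-by-n block; the product is a circular
    convolution, computed here in the frequency domain.
    """
    return np.real(np.fft.ifft(np.fft.fft(first_col) * np.fft.fft(x)))

c = np.array([1.0, 2.0, 3.0])   # defines the 3x3 circulant block
x = np.array([1.0, 0.0, 0.0])
# Multiplying by the first basis vector recovers the first column, c.
print(circulant_matvec(c, x))
```

Whether this beats a plain dense product on a CPU depends on n, the FFT implementation, and memory layout, which is exactly why the implementation guidelines across numpy, intel-numpy, and tensorflow matter.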