
    Symbolic and Visual Retrieval of Mathematical Notation using Formula Graph Symbol Pair Matching and Structural Alignment

    Large data collections containing millions of math formulae in different formats are available on-line, and retrieving math expressions from these collections is challenging. We propose a framework for retrieval of mathematical notation that uses symbol pairs extracted from visual and semantic representations of mathematical expressions in the symbolic domain for retrieval from text documents. We further adapt our model for retrieval of mathematical notation in images and lecture videos. Graph-based representations are used in each modality to describe math formulas. For symbolic formula retrieval, where the structure is known, we use symbol layout trees and operator trees. For image-based formula retrieval, where the structure is unknown, we use a more general Line-of-Sight graph representation. Paths in these graphs define symbol pair tuples that serve as the entries of our inverted index of mathematical notation. Our retrieval framework uses a three-stage approach: a fast selection of candidates in the first stage; a more detailed matching algorithm with similarity metric computation in the second stage; and, when relevance assessments are available, an optional third stage that uses linear regression over multiple similarity scores to estimate relevance for final re-ranking. Our model has been evaluated on large collections of documents, and preliminary results are presented for videos and cross-modal search. The proposed framework can be adapted to other domains, such as chemistry or technical diagrams, where two visually similar elements in a collection are usually related to each other.
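    As a rough illustration of the candidate-selection stage described above, the Python sketch below (hypothetical names, not the authors' implementation) indexes each formula by the symbol pairs found along its tree paths and ranks candidates by how many query pairs they share, using a simple harmonic-mean score.

    from collections import defaultdict

    # Hypothetical sketch: index formulas by (parent, child) symbol pairs taken
    # from paths of a symbol layout tree, then rank candidates by the fraction
    # of pairs they share with the query. Names and scoring are illustrative.
    class PairIndex:
        def __init__(self):
            self.postings = defaultdict(set)   # symbol pair -> formula ids
            self.formulas = {}                 # formula id -> its pair set

        def add(self, formula_id, edges):
            pairs = set(edges)                 # edges: list of (parent, child)
            self.formulas[formula_id] = pairs
            for p in pairs:
                self.postings[p].add(formula_id)

        def candidates(self, query_edges, top_k=10):
            qpairs = set(query_edges)
            hits = defaultdict(int)
            for p in qpairs:
                for fid in self.postings.get(p, ()):
                    hits[fid] += 1
            scored = []
            for fid, n in hits.items():
                r_query = n / len(qpairs)               # recall over query pairs
                r_cand = n / len(self.formulas[fid])    # recall over candidate pairs
                scored.append((2 * r_query * r_cand / (r_query + r_cand), fid))
            return sorted(scored, reverse=True)[:top_k]

    index = PairIndex()
    index.add("f1", [("+", "x"), ("+", "y")])                  # x + y
    index.add("f2", [("+", "x"), ("+", "z"), ("frac", "z")])   # toy second formula
    print(index.candidates([("+", "x"), ("+", "y")]))          # f1 should rank first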

    3D Robotic Sensing of People: Human Perception, Representation and Activity Recognition

    The robots are coming. Their presence will eventually bridge the digital-physical divide and dramatically impact human life by taking over tasks where our current society has shortcomings (e.g., search and rescue, elderly care, and child education). Human-centered robotics (HCR) is a vision to address how robots can coexist with humans and help people live safer, simpler and more independent lives. As humans, we have a remarkable ability to perceive the world around us, perceive people, and interpret their behaviors. Endowing robots with these critical capabilities in highly dynamic human social environments is a significant but very challenging problem in practical human-centered robotics applications. This research focuses on robotic sensing of people, that is, how robots can perceive and represent humans and understand their behaviors, primarily through 3D robotic vision. In this dissertation, I begin with a broad perspective on human-centered robotics by discussing its real-world applications and significant challenges. Then, I introduce a real-time perception system, based on the concept of Depth of Interest, to detect and track multiple individuals using a color-depth camera installed on moving robotic platforms. In addition, I discuss human representation approaches based on local spatio-temporal features, including new "CoDe4D" features that incorporate both color and depth information, a new "SOD" descriptor to efficiently quantize 3D visual features, and the novel AdHuC features, which are capable of representing the activities of multiple individuals. Several new algorithms to recognize human activities are also discussed, including the RG-PLSA model, which allows us to discover activity patterns without supervision, the MC-HCRF model, which can explicitly investigate certainty in latent temporal patterns, and the FuzzySR model, which is used to segment continuous data into events and probabilistically recognize human activities. Cognition models based on recognition results are also implemented for decision making, allowing robotic systems to react to human activities. Finally, I conclude with a discussion of future directions that will accelerate the upcoming technological revolution of human-centered robotics.
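    Purely as a toy illustration of depth-centric person localization (a simplified stand-in, not the dissertation's Depth of Interest pipeline), the Python sketch below keeps the pixels of a depth frame that lie within a band around an estimated person depth and returns their bounding box.

    import numpy as np

    # Toy stand-in for depth-based person segmentation: keep pixels close to an
    # estimated person depth and return the bounding box of the resulting blob.
    # This is an illustrative simplification, not the Depth of Interest method.
    def segment_near_depth(depth, center_m, band_m=0.4):
        """depth: HxW array in meters; keep pixels within +/- band_m of center_m."""
        mask = np.abs(depth - center_m) < band_m
        if not mask.any():
            return None
        ys, xs = np.nonzero(mask)
        return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

    frame = np.full((480, 640), 5.0)    # empty room, background at 5 m
    frame[100:400, 250:350] = 2.0       # a person-sized blob at 2 m
    print(segment_near_depth(frame, center_m=2.0))   # (250, 100, 349, 399)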

    Video Summarization Using Unsupervised Deep Learning

    In this thesis, we address the task of video summarization using unsupervised deep-learning architectures. Video summarization aims to generate a short summary by selecting the most informative and important frames (key-frames) or fragments (key-fragments) of the full-length video and presenting them in a temporally-ordered fashion. Our objective is to overcome observed weaknesses of existing video summarization approaches that utilize RNNs for modeling the temporal dependence of frames, related to: i) the small influence of the estimated frame-level importance scores on the created video summary, ii) the insufficiency of RNNs for modeling long-range frame dependence, and iii) the small number of parallelizable operations during the training of RNNs. To address the first weakness, we propose a new unsupervised network architecture, called AC-SUM-GAN, which formulates the selection of important video fragments as a sequence generation task and learns this task by embedding an Actor-Critic model in a Generative Adversarial Network. The feedback of a trainable Discriminator is used as a reward by the Actor-Critic model in order to explore a space of actions and learn a value function (Critic) and a policy (Actor) for video fragment selection. To tackle the remaining weaknesses, we investigate the use of attention mechanisms for video summarization and propose a new supervised network architecture, called PGL-SUM, that combines global and local multi-head attention mechanisms which take into account the temporal position of the video frames, in order to discover different modelings of the frames' dependencies at different levels of granularity. Based on the acquired experience, we then propose a new unsupervised network architecture, called CA-SUM, which estimates the frames' importance using a novel concentrated attention mechanism that focuses on non-overlapping blocks in the main diagonal of the attention matrix and takes into account the attentive uniqueness and diversity of the associated frames of the video. All the proposed architectures have been extensively evaluated on the most commonly used benchmark datasets, demonstrating their competitiveness against other approaches and documenting the contribution of our proposals to advancing the state of the art in video summarization. Finally, we make a first attempt at producing explanations for video summarization results. Inspired by relevant works in the Natural Language Processing domain, we propose an attention-based method for explainable video summarization and evaluate the performance of various explanation signals using our CA-SUM architecture and two benchmark datasets for video summarization. The experimental results indicate the advanced performance of explanation signals formed using the inherent attention weights, and demonstrate the ability of the proposed method to explain the video summarization results using clues about the focus of the attention mechanism.
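    As a side note on how frame-level importance scores typically become a summary, the Python sketch below (a generic illustration, not the exact procedure of AC-SUM-GAN, PGL-SUM, or CA-SUM) selects key-fragments with a 0/1 knapsack so that the chosen shots maximize total importance under a length budget.

    # Generic illustration: given per-frame importance scores and shot
    # boundaries, pick the shots that maximize total importance while the
    # summary stays under a length budget (a 0/1 knapsack over shots).
    def select_fragments(scores, shots, budget_ratio=0.15):
        values = [sum(scores[a:b]) for a, b in shots]    # importance per shot
        lengths = [b - a for a, b in shots]              # shot lengths in frames
        budget = int(len(scores) * budget_ratio)
        best = [[0.0] * (budget + 1) for _ in range(len(shots) + 1)]
        for i in range(1, len(shots) + 1):
            for c in range(budget + 1):
                best[i][c] = best[i - 1][c]              # skip shot i
                if lengths[i - 1] <= c:                  # or take it if it fits
                    take = best[i - 1][c - lengths[i - 1]] + values[i - 1]
                    best[i][c] = max(best[i][c], take)
        chosen, c = [], budget                           # backtrack the choice
        for i in range(len(shots), 0, -1):
            if best[i][c] != best[i - 1][c]:
                chosen.append(shots[i - 1])
                c -= lengths[i - 1]
        return sorted(chosen)

    scores = [0.1, 0.9, 0.8, 0.2, 0.1, 0.7, 0.6, 0.1, 0.1, 0.2]
    shots = [(0, 3), (3, 5), (5, 8), (8, 10)]
    print(select_fragments(scores, shots, budget_ratio=0.5))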

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Temporal Segmentation of Video Lectures: a speech-based optimization framework

    Video lectures are very popular nowadays. Following new teaching trends, students increasingly seek educational videos on the web for the most varied purposes: to learn something new, to review content for exams, or just out of curiosity. Unfortunately, finding specific content in this type of video is not an easy task. Many video lectures are long and cover several topics, and not all of these topics are relevant to the user who found the video. As a result, the user spends a great deal of time trying to find a topic of interest in the middle of content that is irrelevant to them. Temporal segmentation of video lectures into topics can solve this problem by allowing users to navigate in a non-linear way through all the topics of a video lecture. However, temporal video lecture segmentation is a time-consuming task and must be automated. For this reason, in this work we propose an optimization framework for the temporal video lecture segmentation problem. Our proposal uses only information from the teacher's speech, so it does not depend on any additional resources such as slides, textbooks, or manually generated subtitles. This makes our proposal versatile, as we can apply it to a wide range of video lectures, since it only requires the teacher's speech in the video. To do this, we formulate the problem as a linear programming model in which we combine prosodic and semantic features of speech that may indicate topic transitions. To optimize this model, we use an elitist genetic algorithm with local search. Through experiments, we evaluated different aspects of our approach, such as sensitivity to parameter variation and convergence behavior. We also show that our method outperforms state-of-the-art methods, in both Recall and F1-Score, on two different datasets of video lectures. Finally, we provide the implementation of our framework so that other researchers can contribute and reproduce our results. (Funding: CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior.)
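    To make the optimization step concrete, the Python sketch below shows an elitist genetic algorithm with one-point crossover, mutation, and a small hill-climbing local search; it is a simplified stand-in in which the combination of prosodic and semantic speech features is abstracted into a single score per candidate boundary, not the framework's actual objective.

    import random

    # Simplified stand-in for the optimization step: an elitist genetic
    # algorithm over a binary vector marking which candidate boundaries become
    # topic transitions. The real objective combines prosodic and semantic
    # speech features; here it is abstracted into one score per boundary plus
    # a penalty on the number of selected boundaries.
    def fitness(ind, boundary_scores, target_boundaries=4, penalty=0.5):
        picked = [s for bit, s in zip(ind, boundary_scores) if bit]
        return sum(picked) - penalty * abs(len(picked) - target_boundaries)

    def local_search(ind, boundary_scores):
        # simple hill climbing: flip one bit at a time and keep improvements
        best, best_fit = ind[:], fitness(ind, boundary_scores)
        for i in range(len(ind)):
            trial = best[:]
            trial[i] ^= 1
            f = fitness(trial, boundary_scores)
            if f > best_fit:
                best, best_fit = trial, f
        return best

    def evolve(boundary_scores, pop_size=30, generations=100, elite=2):
        n = len(boundary_scores)
        pop = [[random.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
        for _ in range(generations):
            pop.sort(key=lambda ind: fitness(ind, boundary_scores), reverse=True)
            nxt = [local_search(ind, boundary_scores) for ind in pop[:elite]]
            while len(nxt) < pop_size:
                a, b = random.sample(pop[:10], 2)      # parents among the best
                cut = random.randrange(1, n)
                child = a[:cut] + b[cut:]              # one-point crossover
                if random.random() < 0.2:              # mutation
                    child[random.randrange(n)] ^= 1
                nxt.append(child)
            pop = nxt
        best = max(pop, key=lambda ind: fitness(ind, boundary_scores))
        return [i for i, bit in enumerate(best) if bit]

    random.seed(0)
    scores = [0.1, 0.8, 0.2, 0.9, 0.1, 0.7, 0.3, 0.6]  # candidate boundary scores
    print(evolve(scores))                              # indices of chosen boundaries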

    Better Text Detection through Improved K-means-based Feature Learning

    In this thesis, we propose a different technique to initialize Convolutional K-means. We propose Visual Similarity Sampling (VSS) to collect 8×8 sample patches from images for convolutional feature learning. The algorithm uses within-class and between-class cosine similarity/dissimilarity measures to collect samples from both foreground and background; thus, VSS uses the local frequency of shapes within a character patch as a probability distribution for selecting them. We also show that initializing Convolutional K-means from samples with high between-class and within-class similarity produces a discriminative codebook. We test the codebook on detecting text in natural scenes. We show that using each sample's representative property within and between classes as the probability of selecting it as an initial cluster center helps achieve discriminative cluster centers, which we use as feature maps. One advantage of our work is that, since it is not problem-dependent, it can be applied to sample collection in other pattern recognition problems. The proposed algorithm helped improve the detection rate and simplify the learning process in both convolutional feature learning and text detection training.
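    The Python sketch below gives a rough feel for similarity-aware seeding of a convolutional K-means codebook; the greedy, diversity-driven seed picker is a simplification of the general idea, not the VSS algorithm itself.

    import numpy as np

    # Rough sketch: seed K-means for an 8x8 patch codebook by greedily picking
    # candidate patches that are least similar (cosine) to the seeds chosen so
    # far, then refine with ordinary cosine K-means. The selection rule is a
    # simplification of the idea, not the VSS algorithm itself.
    def cosine(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

    def pick_seeds(flat, k):
        seeds = [0]
        while len(seeds) < k:
            sims = [max(cosine(flat[i], flat[s]) for s in seeds)
                    for i in range(len(flat))]
            seeds.append(int(np.argmin(sims)))     # most dissimilar candidate
        return flat[seeds].copy()

    def convolutional_kmeans(patches, k, iters=20):
        flat = patches.reshape(len(patches), -1).astype(float)
        flat /= np.linalg.norm(flat, axis=1, keepdims=True) + 1e-8
        centers = pick_seeds(flat, k)
        for _ in range(iters):
            assign = np.argmax(flat @ centers.T, axis=1)   # cosine assignment
            for j in range(k):
                members = flat[assign == j]
                if len(members):
                    centers[j] = members.mean(axis=0)
        return centers.reshape(k, *patches.shape[1:])      # k filters of 8x8

    rng = np.random.default_rng(0)
    patches = rng.random((200, 8, 8))      # stand-in for sampled image patches
    filters = convolutional_kmeans(patches, k=16)
    print(filters.shape)                   # (16, 8, 8)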

    Cognitive Foundations for Visual Analytics


    Preface


    Front-Line Physicians' Satisfaction with Information Systems in Hospitals

    Day-to-day operations management in hospital units is difficult due to continuously varying situations, the several actors involved, and the vast number of information systems in use. The aim of this study was to describe front-line physicians' satisfaction with the existing information systems needed to support day-to-day operations management in hospitals. A cross-sectional survey was used, and data chosen with stratified random sampling were collected in nine hospitals. Data were analyzed with descriptive and inferential statistical methods. The response rate was 65% (n = 111). The physicians reported that information systems support their decision making to some extent, but they do not improve access to information, nor are they tailored for physicians. The respondents also reported that they need to use several information systems to support decision making and that they would prefer a single information system for accessing important information. Improved information access would better support physicians' decision making and has the potential to improve the quality of decisions and speed up the decision-making process.