13 research outputs found

    Structured video coding

    Thesis (M.S.), Massachusetts Institute of Technology, Dept. of Architecture, 1991. Includes bibliographical references (leaves 67-71). By Patrick Campbell McLean.

    Multimedia Applications of the Wavelet Transform

    This dissertation investigates novel applications of the wavelet transform in the analysis and compression of audio, still images, and video. Several surveys on the restoration of noisy audio signals have recently been published. Based on these, we have developed a wavelet-based denoising program for audio signals that allows flexible parameter settings. The multiscale property of the wavelet transform can be exploited to detect semantic structures in images: comparing the coefficients across scales allows a predominant structure to be extracted. This idea forms the basis of our semiautomatic edge detection algorithm. Empirical evaluations and the resulting recommendations follow. In the context of the teleteaching project Virtual University of the Upper Rhine Valley (VIROR), many lectures were transmitted between remote locations. We thus encountered the problem of scaling a video stream to different access bandwidths on the Internet. A substantial contribution of this dissertation is the introduction of the wavelet transform into hierarchical video coding and the recommendation of parameter settings based on empirical surveys. Furthermore, a prototype implementation demonstrates the feasibility, in principle, of a wavelet-based, nearly arbitrarily scalable application. Mathematical transformations are a commonly underestimated problem for students in their first semesters of study. Motivated by the VIROR project, we spent considerable time and effort exploring approaches to enhance mathematical topics with multimedia; both the technical design and the didactic integration into the curriculum are discussed. In a large field trial on "traditional teaching versus multimedia-enhanced teaching", the objective knowledge gained by the students was measured. On this basis, we can objectively rate the efficiency of our teaching modules as positive.

    Irish Machine Vision and Image Processing Conference Proceedings 2017


    Apprentissage structuré à partir de vidéos et langage

    The goal of this thesis is to develop models, representations and structured learning algorithms for the automatic understanding of complex human activities from instructional videos narrated with natural language. We first introduce a model that, given a set of narrated instructional videos describing a task, is able to generate a list of action steps needed to complete the task and locate them in the visual and textual streams. To that end, we formulate two assumptions. First, people perform actions when they mention them. Second, complex tasks are composed of an ordered sequence of action steps. Equipped with these two hypotheses, our model first clusters the textual inputs and then uses this output to refine the location of the action steps in the video. We evaluate our model on a newly collected dataset of instructional videos depicting 5 different complex goal-oriented tasks. We then present an approach to link actions and the manipulated objects. More precisely, we focus on actions that aim at modifying the state of a specific object, such as pouring a cup of coffee or opening a door. Such actions are an inherent part of instructional videos. Our method is based on the optimization of a joint cost between actions and object states under constraints. The constraints reflect our assumption that there is a consistent temporal order for the changes in object states and manipulation actions. We demonstrate experimentally that object states help localize actions and, conversely, that action localization improves object state recognition. All our models are based on discriminative clustering, a technique that makes it possible to leverage the readily available weak supervision contained in instructional videos. To deal with the resulting optimization problems, we take advantage of a well-suited optimization technique: the Frank-Wolfe algorithm. 
Motivated by the fact that scaling our approaches to thousands of videos is essential in the context of narrated instructional videos, we also present several improvements to make the Frank-Wolfe algorithm faster and more computationally efficient. In particular, we propose three main modifications to the Block-Coordinate Frank-Wolfe algorithm: gap-based sampling, away and pairwise Block Frank-Wolfe steps, and a solution to cache the oracle calls. We show the effectiveness of our improvements on four challenging structured prediction tasks.
The goal of this thesis is to develop models, suitable representations, and structured prediction algorithms for automatically analyzing complex human activities narrated in natural language. We first present a model that, given several instructional videos, can discover the list of actions needed to complete a task and localize these actions in both the video stream and the textual narration. The first assumption is that people perform actions at the moment they describe them. The second assumption is that these complex tasks are carried out following a precise order of actions. Our model is evaluated on a new dataset of instructional videos covering 5 complex tasks. We then propose to link actions with the manipulated objects. More precisely, we focus on a particular type of action that aims to modify the state of an object, for example pouring a cup of coffee or opening a door. This type of action is particularly important in the context of instructional videos. Our method consists in minimizing a joint objective between actions and objects. 
We demonstrate through numerical experiments that localizing actions helps to better recognize object states and, conversely, that modeling changes in object state helps to better determine when actions take place. All our models are based on discriminative clustering, a method that makes it possible to exploit the weak supervision contained in this type of video. This amounts to formulating an optimization problem that can be solved easily with the Frank-Wolfe algorithm, which is particularly well suited to the constraints considered. Motivated by the importance of being able to exploit the few thousand videos available online, we finally focus on making the Frank-Wolfe algorithm faster and more efficient when faced with large amounts of data. In particular, we propose three modifications to the Block-Coordinate Frank-Wolfe algorithm: adaptive sampling of the training examples, block versions of the 'away steps' and 'pairwise steps' found in the original algorithm, and a way to cache calls to the linear oracle.
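For readers unfamiliar with the Frank-Wolfe algorithm named in this abstract, the following is a minimal sketch of the classical (non-block) method on a toy problem. It is not the thesis's Block-Coordinate variant and includes none of its three modifications. The function name `frank_wolfe_simplex`, the quadratic objective, the probability-simplex constraint, and the 2/(k+2) step size are all assumptions chosen for illustration.

```python
def frank_wolfe_simplex(target, n_iters=2000):
    """Minimize ||x - target||^2 over the probability simplex (toy problem)."""
    n = len(target)
    x = [1.0 / n] * n  # feasible starting point: the simplex barycenter
    gap = float("inf")
    for k in range(n_iters):
        grad = [2.0 * (xi - ti) for xi, ti in zip(x, target)]
        # Linear minimization oracle: over the simplex, <grad, s> is minimized
        # at the vertex whose gradient coordinate is smallest.
        j = min(range(n), key=lambda i: grad[i])
        # Frank-Wolfe duality gap <grad, x - s> at the current iterate.
        gap = sum(g * xi for g, xi in zip(grad, x)) - grad[j]
        step = 2.0 / (k + 2.0)  # standard open-loop step size (assumption)
        # Convex combination of x and the chosen vertex keeps x feasible.
        x = [(1.0 - step) * xi for xi in x]
        x[j] += step
    return x, gap


x, gap = frank_wolfe_simplex([0.2, 0.3, 0.5])
print([round(v, 3) for v in x], gap)
```

Each iteration needs only one call to the linear oracle, which is why caching oracle calls and choosing which block to update, as the abstract describes, can matter at scale.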

    Neural Encoding and Decoding with Deep Learning for Natural Vision

    The overarching objective of this work is to bridge neuroscience and artificial intelligence to ultimately build machines that learn, act, and think like humans. In the context of vision, the brain enables humans to readily make sense of the visual world, e.g. by recognizing visual objects. Developing human-like machines requires understanding the working principles underlying human vision. In this dissertation, I ask how the brain encodes and represents dynamic visual information from the outside world, whether brain activity can be directly decoded to reconstruct and categorize what a person is seeing, and whether neuroscience theory can be applied to artificial models to advance computer vision. To address these questions, I used deep neural networks (DNNs) to establish encoding and decoding models describing the relationships between the brain and visual stimuli. Using the DNN, the encoding models were able to predict functional magnetic resonance imaging (fMRI) responses throughout the visual cortex given video stimuli; the decoding models were able to reconstruct and categorize the visual stimuli based on fMRI activity. To further advance the DNN model, I implemented a new bidirectional and recurrent neural network based on predictive coding theory. As a theory in neuroscience, predictive coding explains the interaction among feedforward, feedback, and recurrent connections. The results showed that this brain-inspired model significantly outperforms feedforward-only DNNs in object recognition. These studies have a positive impact on understanding the neural computations underlying human vision and on improving computer vision with knowledge from neuroscience.

    How to improve learning from video, using an eye tracker

    The initial trigger for this research on learning from video was the availability of log files from users of video material. The video modality is seen as attractive because it is associated with the relaxed mood of watching TV. The experiments in this research aim to gain more insight into the viewing patterns of students when they watch video. Students received an awareness instruction about possible alternative viewing behaviors to see whether this would enhance their learning effects. We found that:
    - the learning effects of students with a narrow viewing repertoire were smaller than those of students with a broad viewing repertoire or strategic viewers;
    - students with some basic knowledge of the topics covered in the videos benefited most from the use of possible alternative viewing behaviors, and students with low prior knowledge benefited the least;
    - the knowledge gain of students with low prior knowledge disappeared after a few weeks; knowledge construction seems worse when doing two things at the same time;
    - media players could offer more options to help students search for the content they want to view again;
    - there was no correlation between pervasive personality traits and the viewing behavior of students.
    The right use of video in higher education will lead to students and teachers who are more aware of their learning and teaching behavior, to better videos, to enhanced media players, and, finally, to higher learning effects that let users improve their learning from video.

    Image Registration Workshop Proceedings

    Automatic image registration has often been considered a preliminary step for higher-level processing, such as object recognition or data fusion. But with the unprecedented amounts of data that are being, and will continue to be, generated by newly developed sensors, automatic image registration has itself become an important research topic. This workshop presents a collection of very high quality work, grouped into four main areas: (1) theoretical aspects of image registration; (2) applications to satellite imagery; (3) applications to medical imagery; and (4) image registration for computer vision research.

    Ultracold atomic gases in artificial magnetic fields

    [no abstract]

    Eye quietness and quiet eye in expert and novice golf performance: an electrooculographic analysis

    Quiet eye (QE) is the final ocular fixation on the target of an action (e.g., the ball in golf putting). Camera-based eye-tracking studies have consistently found longer QE durations in experts than novices; however, mechanisms underlying QE are not known. To offer a new perspective we examined the feasibility of measuring the QE using electrooculography (EOG) and developed an index to assess ocular activity across time: eye quietness (EQ). Ten expert and ten novice golfers putted 60 balls to a hole 2.4 m away. Horizontal EOG (2 ms resolution) was recorded from two electrodes placed on the outer sides of the eyes. QE duration was measured using an EOG voltage threshold and comprised the sum of the pre-movement and post-movement initiation components. EQ was computed as the standard deviation of the EOG in 0.5 s bins from –4 to +2 s, relative to backswing initiation: lower values indicate less movement of the eyes, hence greater quietness. Finally, we measured club-ball address and swing durations. T-tests showed that total QE did not differ between groups (p = .31); however, experts had marginally shorter pre-movement QE (p = .08) and longer post-movement QE (p < .001) than novices. A group × time ANOVA revealed that experts had less EQ before backswing initiation and greater EQ after backswing initiation (p = .002). QE durations were inversely correlated with EQ from –1.5 to 1 s (rs = –.48 to –.90, ps = .03 to .001). Experts had longer swing durations than novices (p = .01) and, importantly, swing durations correlated positively with post-movement QE (r = .52, p = .02) and negatively with EQ from 0.5 to 1 s (r = –.63, p = .003). This study demonstrates the feasibility of measuring ocular activity using EOG and validates EQ as an index of ocular activity. Its findings challenge the dominant perspective on QE and provide new evidence that expert-novice differences in ocular activity may reflect differences in the kinematics of how experts and novices execute skills.
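The EQ index described in this abstract (standard deviation of the EOG in 0.5 s bins from –4 to +2 s around backswing initiation) can be sketched as follows. This is an illustrative reading of the abstract, not the authors' code: the function name `eye_quietness`, the 500 Hz sampling rate (inferred from the stated 2 ms resolution), and the use of the population rather than sample standard deviation are all assumptions.

```python
import statistics


def eye_quietness(eog, sample_rate_hz=500, t_start=-4.0, t_end=2.0, bin_s=0.5):
    """Return the standard deviation of an EOG trace in consecutive time bins.

    `eog` holds voltage samples spanning t_start..t_end seconds relative to
    backswing initiation. Lower values mean less eye movement.
    """
    samples_per_bin = round(bin_s * sample_rate_hz)
    n_bins = round((t_end - t_start) / bin_s)
    if len(eog) < n_bins * samples_per_bin:
        raise ValueError("signal is shorter than the requested time window")
    # Population standard deviation is an assumption; the abstract does not
    # say which estimator was used.
    return [
        statistics.pstdev(eog[i * samples_per_bin:(i + 1) * samples_per_bin])
        for i in range(n_bins)
    ]


# Synthetic trace: eyes moving during the first 0.5 s bin, still afterwards.
signal = [1.0, -1.0] * 125 + [0.0] * 2750
eq = eye_quietness(signal)
print(len(eq), eq[0], eq[1])
```

With the default window this yields 12 values per trial, one per 0.5 s bin, which could then be compared between groups across time.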

    Apprentissage supervisés sous contraintes

    As supervised learning occupies a larger and larger place in our everyday life, it is met with more and more constrained settings. Dealing with those constraints is key to fostering new progress in the field and expanding ever further the limits of machine learning, a likely necessary step to reach artificial general intelligence. Supervised learning is an inductive paradigm in which time and data are refined into knowledge, in the form of predictive models. These models can sometimes be, it must be conceded, opaque, memory demanding and energy consuming. Given this setting, a constraint can mean any number of things. Essentially, a constraint is anything that stands in the way of supervised learning, be it the lack of time, of memory, of data, or of understanding. Additionally, the scope of applicability of supervised learning is so vast it can appear daunting. Usefulness can be found in areas including medical analysis and autonomous driving, areas for which strong guarantees are required. All those constraints (time, memory, data, interpretability, reliability) might somewhat conflict with the traditional goal of supervised learning. In such a case, finding a balance between the constraints and the standard objective is problem-dependent, thus requiring generic solutions. Alternatively, concerns might arise after learning, in which case solutions must be developed under sub-optimal conditions, resulting in constraints adding up. An example of such a situation is trying to enforce reliability once the data is no longer available. After detailing the background in which this thesis is situated (what supervised learning is and why it is difficult, what algorithms will be used, where it fits in the broader scope of knowledge), we discuss four different scenarios. The first one is about trying to learn a good decision forest model of a limited size, without first learning a large model and then compressing it. 
For that, we have developed the Globally Induced Forest (GIF) algorithm, which mixes local and global optimizations to produce accurate predictions under memory constraints in reasonable time. More specifically, the global part makes it possible to sidestep the redundancy inherent in traditional decision forests. It is shown that the proposed method is more than competitive with standard tree-based ensembles under corresponding constraints, and can sometimes even surpass much larger models. The second scenario corresponds to the example given above: trying to enforce reliability without data. More specifically, the focus is on out-of-distribution (OOD) detection: recognizing samples which do not come from the original distribution the model was learned from. Tackling this problem with an utter lack of data is challenging. Our investigation focuses on image classification with convolutional neural networks. Indicators which can be computed alongside the prediction at little additional cost are proposed. These indicators prove useful, stable and complementary for OOD detection. We also introduce a surprisingly simple, yet effective summary indicator, shown to perform well across several networks and datasets. It can easily be tuned further as soon as samples become available. Overall, interesting results can be reached in all but the most severe settings, for which it was a priori doubtful that a data-free solution could be found. The third scenario relates to transferring the knowledge of a large model to a smaller one in the absence of data. To do so, we propose to leverage a collection of unlabeled data, which is easy to come by in domains such as image classification. Two schemes are proposed (and then analyzed) to provide optimal transfer. Firstly, we propose a biasing mechanism in the choice of unlabeled data to use so that the focus is on the more relevant samples. 
Secondly, we design a teaching mechanism, applicable to almost all pairs of large and small networks, which allows for a much better knowledge transfer between the networks. Overall, good results are obtainable in decent time provided the collection of data actually contains relevant samples. The fourth scenario tackles the problem of interpretability: what knowledge can be gleaned more or less indirectly from data. We discuss two subproblems. The first one is to showcase that GIFs (see above) can be used to derive intrinsically interpretable models. The second consists of a comparative study between methods and types of models (namely decision forests and neural networks) for the specific purpose of quantifying how important each variable is in a given problem. After a preliminary study on benchmark datasets, the analysis turns to a concrete biological problem: inferring gene regulatory networks from data. An ambivalent conclusion is reached: neural networks can be made to perform better than decision forests at predicting in almost all instances but struggle to identify the relevant variables in some situations. It would seem that better (motivated) methods need to be proposed for neural networks, especially in the face of highly non-linear problems.