21 research outputs found

    Learning visually grounded meaning representations

    Humans possess a rich semantic knowledge of words and concepts which captures the perceivable physical properties of their real-world referents and their relations. Encoding this knowledge or some of its aspects is the goal of computational models of semantic representation and has been the subject of considerable research in cognitive science, natural language processing, and related areas. Existing models have placed emphasis on different aspects of meaning, depending ultimately on the task at hand. Typically, such models have been used in tasks addressing the simulation of behavioural phenomena, e.g., lexical priming or categorisation, as well as in natural language applications, such as information retrieval, document classification, or semantic role labelling. A major strand of research popular across disciplines focuses on models which induce semantic representations from text corpora. These models are based on the hypothesis that the meaning of words is established by their distributional relation to other words (Harris, 1954). Despite their widespread use, distributional models of word meaning have been criticised as ‘disembodied’ in that they are not grounded in perception and action (Perfetti, 1998; Barsalou, 1999; Glenberg and Kaschak, 2002). This lack of grounding contrasts with many experimental studies suggesting that meaning is acquired not only from exposure to the linguistic environment but also from our interaction with the physical world (Landau et al., 1998; Bornstein et al., 2004). This criticism has led to the emergence of new models aiming at inducing perceptually grounded semantic representations. Essentially, existing approaches learn meaning representations from multiple views corresponding to different modalities, i.e. linguistic and perceptual input. 
To approximate the perceptual modality, previous work has relied largely on semantic attributes collected from humans (e.g., is round, is sour), or on automatically extracted image features. Semantic attributes have a long-standing tradition in cognitive science and are thought to represent salient psychological aspects of word meaning including multisensory information. However, their elicitation from human subjects limits the scope of computational models to a small number of concepts for which attributes are available. In this thesis, we present an approach which draws inspiration from the successful application of attribute classifiers in image classification, and represent images and the concepts depicted by them by automatically predicted visual attributes. To this end, we create a dataset comprising nearly 700K images and a taxonomy of 636 visual attributes and use it to train attribute classifiers. We show that their predictions can act as a substitute for human-produced attributes without any critical information loss. In line with the attribute-based approximation of the visual modality, we represent the linguistic modality by textual attributes which we obtain with an off-the-shelf distributional model. Having first established this core contribution of a novel modelling framework for grounded meaning representations based on semantic attributes, we show that these can be integrated into existing approaches to perceptually grounded representations. We then introduce a model which is formulated as a stacked autoencoder (a variant of multilayer neural networks), which learns higher-level meaning representations by mapping words and images, represented by attributes, into a common embedding space. 
In contrast to most previous approaches to multimodal learning using different variants of deep networks and data sources, our model is defined at a finer level of granularity—it computes representations for individual words and is unique in its use of attributes as a means of representing the textual and visual modalities. We evaluate the effectiveness of the representations learnt by our model by assessing its ability to account for human behaviour on three semantic tasks, namely word similarity, concept categorisation, and typicality of category members. With respect to the word similarity task, we focus on the model’s ability to capture similarity in both the meaning and appearance of the words’ referents. Since existing benchmark datasets on word similarity do not distinguish between these two dimensions and often contain abstract words, we create a new dataset in a large-scale experiment where participants are asked to give two ratings per word pair expressing their semantic and visual similarity, respectively. Experimental results show that our model learns meaningful representations which are more accurate than models based on individual modalities or different modality integration mechanisms. The presented model is furthermore able to predict textual attributes for new concepts given their visual attribute predictions only, which we demonstrate by comparing model output with human generated attributes. Finally, we show the model’s effectiveness in an image-based task on visual category learning, in which images are used as a stand-in for real-world objects

    Self-supervised Face Representation Learning

    This thesis investigates fine-tuning deep face features in a self-supervised manner for discriminative face representation learning, wherein we develop methods to automatically generate pseudo-labels for training a neural network. Most importantly, solving this problem helps advance the state of the art in representation learning and can benefit a variety of practical downstream tasks. Fortunately, there is a vast amount of video on the internet that machines can use to learn an effective representation. We present methods that can learn a strong face representation from large-scale data, be it in the form of images or video. While learning a good representation with a deep learning algorithm usually requires a large-scale dataset with manually curated labels, we propose self-supervised approaches that generate pseudo-labels by utilizing the temporal structure of the video data and similarity constraints to get supervision from the data itself. We aim to learn a representation that exhibits small distances between samples from the same person and large inter-person distances in feature space. Metric learning can achieve this, as it comprises a pull term, which pulls data points from the same class closer, and a push term, which pushes data points from different classes further apart. Metric learning for improving feature quality is useful but requires some form of external supervision to provide labels for same or different pairs. In the case of face clustering in TV series, we may obtain this supervision from tracks and other cues. Tracking acts as a form of high-precision clustering (grouping detections within a shot) and is used to automatically generate positive and negative pairs of face images. Inspired by this, we propose two variants of discriminative approaches: the Track-supervised Siamese network (TSiam) and the Self-supervised Siamese network (SSiam). 
In TSiam, we utilize the tracking supervision to obtain the pairs; additionally, we include negative training pairs for singleton tracks, i.e., tracks that are not temporally co-occurring. As supervision from tracking may not always be available, we propose SSiam, an effective approach that enables metric learning without any supervision by generating the required pairs automatically during training. In SSiam, we dynamically generate positive and negative pairs based on sorting distances (i.e., ranking) on a subset of frames and do not have to rely only on video- or track-based supervision. Next, we present Clustering-based Contrastive Learning (CCL), a new clustering-based representation learning approach that utilizes automatically discovered partitions obtained from a clustering algorithm (FINCH) as weak supervision, along with inherent video constraints, to learn discriminative face features. As annotating datasets is costly and difficult, using label-free, weak supervision obtained from a clustering algorithm as a proxy learning task is promising. Through our analysis, we show that creating positive and negative training pairs using clustering predictions helps improve the performance of video face clustering. We then propose face grouping on graphs (FGG), a method for unsupervised fine-tuning of deep face feature representations. We utilize a graph structure with positive and negative edges over a set of face tracks, based on the temporal structure of the video data and similarity-based constraints. Using graph neural networks, the features communicate over the edges, allowing each track's feature to exchange information with its neighbors and thus pushing each representation in a direction in feature space that groups all representations of the same person together and separates representations of different people. 
Having developed these methods to generate weak labels for face representation learning, we next propose to learn a compact yet effective representation that encodes face tracks in videos into compact descriptors and can complement the previous methods in learning a more powerful face representation. Specifically, we propose Temporal Compact Bilinear Pooling (TCBP) to encode the temporal segments in videos into a compact descriptor. TCBP is able to capture interactions between each element of the feature representation and every other element over a long-range temporal context. We integrated our previous methods TSiam, SSiam and CCL with TCBP and demonstrated that TCBP has excellent capabilities in learning a strong face representation. We further show that TCBP has exceptional transfer abilities to applications such as video classification and multimodal video clip representation that jointly encodes images, audio, video and text. All of these contributions are demonstrated on benchmark video clustering datasets: The Big Bang Theory, Buffy the Vampire Slayer and Harry Potter 1. We provide extensive evaluations on these datasets, achieving a significant boost in performance over the base features and in comparison to state-of-the-art results.
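The pull and push terms described in the abstract above, and the idea of deriving pairs from face tracks, can be sketched as follows. This is a minimal illustration, not the thesis's implementation: the margin-based contrastive loss is one standard form of metric learning, and the track dictionary layout (`start`, `end`, `feats`), the function names, and the rule that temporally overlapping tracks yield negative pairs are assumptions made for the example.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_loss(a, b, same_person, margin=1.0):
    """Margin-based contrastive loss for one pair (assumed form)."""
    d = euclidean(a, b)
    if same_person:
        # Pull term: penalise any distance between same-identity samples.
        return d ** 2
    # Push term: penalise different-identity samples closer than the margin.
    return max(0.0, margin - d) ** 2


def pairs_from_tracks(tracks):
    """Build training pairs from tracks (hypothetical data layout).

    Positives: two faces from the same track.
    Negatives: faces from two tracks that overlap in time, since one
    person cannot appear twice in the same frame.
    """
    positives, negatives = [], []
    for t in tracks:
        positives.extend(combinations(t["feats"], 2))
    for a, b in combinations(tracks, 2):
        overlap = a["start"] <= b["end"] and b["start"] <= a["end"]
        if overlap:
            negatives.extend((fa, fb) for fa in a["feats"] for fb in b["feats"])
    return positives, negatives
```

In a Siamese setup, both members of a pair would pass through the same network before the loss is computed; here the vectors stand in for those network outputs.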

    Visual and Camera Sensors

    This book includes 13 papers published in the Special Issue "Visual and Camera Sensors" of the journal Sensors. The goal of this Special Issue was to invite high-quality, state-of-the-art research papers dealing with challenging issues in visual and camera sensors.

    Vehicle Logo Recognition with Reduced-Dimension SIFT Vectors Using Autoencoders

    Vehicle logo recognition has become an important part of object recognition in recent years because of its use in surveillance applications. To achieve higher recognition rates, several methods have been proposed, such as the Scale Invariant Feature Transform (SIFT), convolutional neural networks, bag-of-words, and their variations. This paper proposes a fast logo recognition method based on reduced-dimension SIFT vectors obtained with autoencoders. The computational load is decreased by applying dimensionality reduction to the SIFT feature vectors: feature vectors of size 128 are reduced to 64 and 32 by two-layer neural networks called vanilla autoencoders. Publicly available vehicle logo images are used for testing. Results suggest that the proposed method needs half the memory of the original SIFT-based method and less processing time per image, in return for a decrease in accuracy of less than 20%.
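To illustrate what reducing a descriptor with a two-layer autoencoder involves, here is a small sketch. It is not the paper's model: it uses a purely linear encoder and decoder, plain stochastic gradient descent, and toy 4-dimensional vectors compressed to 2 dimensions instead of 128-dimensional SIFT vectors compressed to 64 or 32. All names and hyperparameters are assumptions chosen for the example.

```python
import random


def encode(x, w_enc):
    """Project a descriptor onto the lower-dimensional code."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]


def decode(h, w_dec):
    """Reconstruct the descriptor from its code."""
    return [sum(w * hi for w, hi in zip(row, h)) for row in w_dec]


def train_autoencoder(data, hidden, epochs=200, lr=0.05, seed=0):
    """Train a linear encoder/decoder pair to minimise reconstruction error."""
    rng = random.Random(seed)
    d = len(data[0])
    w_enc = [[rng.uniform(-0.1, 0.1) for _ in range(d)] for _ in range(hidden)]
    w_dec = [[rng.uniform(-0.1, 0.1) for _ in range(hidden)] for _ in range(d)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for x in data:
            h = encode(x, w_enc)
            y = decode(h, w_dec)
            err = [yi - xi for yi, xi in zip(y, x)]
            total += sum(e * e for e in err)
            # Gradient w.r.t. the code, computed before the decoder changes.
            grad_h = [sum(err[i] * w_dec[i][j] for i in range(d))
                      for j in range(hidden)]
            for i in range(d):
                for j in range(hidden):
                    w_dec[i][j] -= lr * err[i] * h[j]
            for j in range(hidden):
                for k in range(d):
                    w_enc[j][k] -= lr * grad_h[j] * x[k]
        losses.append(total / len(data))  # mean squared error per sample
    return w_enc, w_dec, losses


# Toy descriptors that lie on a 2-D subspace of a 4-D space.
data = [[a, b, a, b] for a in (0.2, 0.6, 1.0) for b in (0.1, 0.5, 0.9)]
w_enc, w_dec, losses = train_autoencoder(data, hidden=2)
code = encode(data[0], w_enc)  # half the length of the input vector
```

Only the encoder is needed at recognition time, so storing the codes instead of the original vectors halves the per-descriptor memory when the code has half the input dimension.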

    Large-scale document labeling using supervised sequence embedding

    A critical component in the computational treatment of automated document labeling is the choice of an appropriate representation. A proper representation captures specific phenomena of interest in the data while transforming it into a format appropriate for a classifier. For a text document, a popular choice is the bag-of-words (BoW) representation, which encodes the presence of unique words with non-zero weights such as TF-IDF. Extending this model to long, overlapping phrases (n-grams) results in an exponential explosion in the dimensionality of the representation. In this work, we develop a model that encodes long phrases in a low-dimensional latent space with a cumulative function of the individual words in each phrase. In contrast to BoW, the parameter space of the proposed model grows linearly with the length of the phrase. The proposed model requires only vector additions and multiplications with scalars to compute the latent representation of phrases, which makes it applicable to large-scale text labeling problems. Several sentiment classification and binary topic categorization problems will be used to empirically evaluate the proposed representation. The same model can also encode the relative spatial distribution of elements in higher-dimensional sequences. To verify this claim, the proposed model will be evaluated on a large-scale image classification dataset, where images are transformed into two-dimensional sequences of quantized image descriptors. Ph.D., Computer Science -- Drexel University, 201
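The contrast drawn in the abstract above can be made concrete with a short sketch. The first function counts how many distinct n-gram features a BoW model may need; the second shows one hypothetical way a phrase could be encoded using only vector additions and scalar multiplications. The position-weight scheme, the function names, and the toy vectors are assumptions for illustration and are not taken from the thesis.

```python
def ngram_feature_count(vocab_size, n):
    """Upper bound on distinct n-grams over a vocabulary: grows as V**n."""
    return vocab_size ** n


def embed_phrase(words, word_vectors, position_weights):
    """Encode a phrase as a position-weighted sum of word vectors.

    Uses only scalar-vector multiplication and vector addition.
    The weighting scheme is an assumption made for this sketch.
    """
    dim = len(next(iter(word_vectors.values())))
    out = [0.0] * dim
    for weight, word in zip(position_weights, words):
        vec = word_vectors[word]
        out = [o + weight * x for o, x in zip(out, vec)]
    return out


# Toy 2-D word vectors (hypothetical values).
word_vectors = {"not": [1.0, 0.0], "good": [0.0, 2.0]}
weights = [1.0, 0.5]  # one scalar per phrase position

print(ngram_feature_count(10_000, 3))                        # 1000000000000
print(embed_phrase(["not", "good"], word_vectors, weights))  # [1.0, 1.0]
print(embed_phrase(["good", "not"], word_vectors, weights))  # [0.5, 2.0]
```

Because the weights depend on position, the two word orders map to different points, while the number of position weights grows only linearly with phrase length.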

    Advances in Artificial Intelligence: Models, Optimization, and Machine Learning

    The present book contains all the articles accepted and published in the Special Issue “Advances in Artificial Intelligence: Models, Optimization, and Machine Learning” of the MDPI Mathematics journal, which covers a wide range of topics connected to the theory and applications of artificial intelligence and its subfields. These topics include, among others, deep learning and classic machine learning algorithms, neural modelling, architectures and learning algorithms, biologically inspired optimization algorithms, algorithms for autonomous driving, probabilistic models and Bayesian reasoning, intelligent agents and multiagent systems. We hope that the scientific results presented in this book will serve as valuable sources of documentation and inspiration for anyone willing to pursue research in artificial intelligence, machine learning and their widespread applications

    Modeling Visual Rhetoric and Semantics in Multimedia

    Recent advances in machine learning have enabled computer vision algorithms to model complicated visual phenomena with accuracies unthinkable a mere decade ago. Their high performance on a plethora of vision-related tasks has enabled computer vision researchers to begin to move beyond traditional visual recognition problems to tasks requiring higher-level image understanding. However, most computer vision research still focuses on describing what images, text, or other media literally portray. In contrast, in this dissertation we focus on learning how and why such content is portrayed. Rather than viewing media for its content, we recast the problem as understanding visual communication and visual rhetoric. For example, the same content may be portrayed in different ways in order to present the story the author wishes to convey. We thus seek to model not only the content of the media, but also its authorial intent and latent messaging. Understanding how and why visual content is portrayed a certain way requires understanding higher-level abstract semantic concepts which are themselves latent within visual media. By latent, we mean that the concept is not readily visually accessible within a single image (e.g., right vs. left political bias), in contrast to explicit visual semantic concepts such as objects. Specifically, we study the problems of modeling photographic style (how professional photographers portray their subjects), understanding visual persuasion in image advertisements, modeling political bias in multimedia (image and text) news articles, and learning cross-modal semantic representations. While most past research in vision and natural language processing studies the case where visual content and paired text are highly aligned (as in the case of image captions), we target the case where each modality conveys complementary information to tell a larger story. 
We particularly focus on the problem of learning cross-modal representations from multimedia exhibiting weak alignment between the image and text modalities. A variety of techniques are presented which improve modeling of multimedia rhetoric in real-world data and enable more robust artificially intelligent systems

    State of the art of audio- and video based solutions for AAL

    Working Group 3: Audio- and Video-based AAL Applications. Europe is facing increasingly crucial challenges regarding health and social care due to demographic change and the current economic context. The recent COVID-19 pandemic has stressed this situation even further, thus highlighting the need for taking action. Active and Assisted Living (AAL) technologies come as a viable approach to help face these challenges, thanks to the high potential they have in enabling remote care and support. Broadly speaking, AAL refers to the use of innovative and advanced Information and Communication Technologies to create supportive, inclusive and empowering applications and environments that enable older, impaired or frail people to live independently and stay active longer in society. AAL capitalizes on the growing pervasiveness and effectiveness of sensing and computing facilities to supply the persons in need with smart assistance, by responding to their necessities of autonomy, independence, comfort, security and safety. The application scenarios addressed by AAL are complex, due to the inherent heterogeneity of the end-user population, their living arrangements, and their physical conditions or impairment. Despite aiming at diverse goals, AAL systems should share some common characteristics. They are designed to provide support in daily life in an invisible, unobtrusive and user-friendly manner. Moreover, they are conceived to be intelligent, to be able to learn and adapt to the requirements and requests of the assisted people, and to synchronise with their specific needs. Nevertheless, to ensure the uptake of AAL in society, potential users must be willing to use AAL applications and to integrate them in their daily environments and lives. In this respect, video- and audio-based AAL applications have several advantages, in terms of unobtrusiveness and information richness. 
Indeed, cameras and microphones are far less obtrusive than wearable sensors, which may hinder one’s activities. In addition, a single camera placed in a room can record most of the activities performed in the room, thus replacing many other non-visual sensors. Currently, video-based applications are effective in recognising and monitoring the activities, the movements, and the overall conditions of the assisted individuals as well as in assessing their vital parameters (e.g., heart rate, respiratory rate). Similarly, audio sensors have the potential to become one of the most important modalities for interaction with AAL systems, as they can have a large range of sensing, do not require physical presence at a particular location and are physically intangible. Moreover, relevant information about individuals’ activities and health status can be derived from processing audio signals (e.g., speech recordings). Nevertheless, as the other side of the coin, cameras and microphones are often perceived as the most intrusive technologies from the viewpoint of the privacy of the monitored individuals. This is due to the richness of the information these technologies convey and the intimate setting where they may be deployed. Solutions able to ensure privacy preservation by context and by design, as well as to ensure high legal and ethical standards, are in high demand. After the review of the current state of play and the discussion in GoodBrother, we may claim that the first solutions in this direction are starting to appear in the literature. A multidisciplinary debate among experts and stakeholders is paving the way towards AAL ensuring ergonomics, usability, acceptance and privacy preservation. The DIANA, PAAL, and VisuAAL projects are examples of this fresh approach. This report provides the reader with a review of the most recent advances in audio- and video-based monitoring technologies for AAL. 
It has been drafted as a collective effort of WG3 to supply an introduction to AAL, its evolution over time and its main functional and technological underpinnings. In this respect, the report contributes to the field with the outline of a new generation of ethics-aware AAL technologies and a proposal for a novel comprehensive taxonomy of AAL systems and applications. Moreover, the report allows non-technical readers to gather an overview of the main components of an AAL system and how these function and interact with the end-users. The report illustrates the state of the art of the most successful AAL applications and functions based on audio and video data, namely (i) lifelogging and self-monitoring, (ii) remote monitoring of vital signs, (iii) emotional state recognition, (iv) food intake monitoring, activity and behaviour recognition, (v) activity and personal assistance, (vi) gesture recognition, (vii) fall detection and prevention, (viii) mobility assessment and frailty recognition, and (ix) cognitive and motor rehabilitation. For these application scenarios, the report illustrates the state of play in terms of scientific advances, available products and research projects. The open challenges are also highlighted. The report ends with an overview of the challenges, the hindrances and the opportunities posed by the uptake of AAL technologies in real-world settings. In this respect, the report illustrates the current procedural and technological approaches to cope with acceptability, usability and trust in AAL technology, by surveying strategies and approaches to co-design, to privacy preservation in video and audio data, to transparency and explainability in data processing, and to data transmission and communication. User acceptance and ethical considerations are also debated. Finally, the potential of the silver economy is reviewed.

    Adaptive Robot Systems in Highly Dynamic Environments: A Table Tennis Robot

    Background: Robotic table tennis systems offer an ideal platform for pushing camera-based robotic manipulation systems to the limit. The unique challenge arises from the fast-paced play and the wide variation in spin and speed between strokes. The range of scenarios under which existing table tennis robots are able to operate is, however, limited, requiring slow play with low rotational velocity of the ball (spin). Research Goal: We aim to develop a table tennis robot system with learning capabilities able to handle spin against a human opponent. Methods: The robot system presented in this thesis consists of six components: ball position detection, ball spin detection, ball trajectory prediction, stroke parameter suggestion, robot trajectory generation, and robot control. For ball detection, the camera images pass through a conventional image processing pipeline. The ball’s 3D positions are determined using iterative triangulation and these are then used to estimate the current ball state (position and velocity). We propose three methods for estimating the spin. The first two methods estimate spin by analyzing the movement of the logo printed on the ball on high-resolution images using either conventional computer vision or convolutional neural networks. 
The final approach involves analyzing the trajectory of the ball using Magnus force fitting. Once the ball’s position, velocity, and spin are known, the future trajectory is predicted by forward-solving a physical ball model involving gravitational, drag, and Magnus forces. With the predicted ball state at hitting time as state input, we train a reinforcement learning algorithm to suggest the racket state at hitting time (stroke parameter). We use the Reflexxes library to generate a robot trajectory to achieve the suggested racket state. Results: Quantitative evaluation showed that all system components achieve results as good as or better than comparable robots. Regarding the research goal of this thesis, the robot was able to (i) maintain stable counter-hitting rallies of up to 60 balls with a human player, (ii) return balls with different spin types (topspin and backspin) in the same rally, and (iii) learn multiple table tennis drills in just 200 strokes or fewer. Conclusion: Our spin detection system and reinforcement learning-based stroke parameter suggestion introduce significant algorithmic novelties. In contrast to previous work, our robot succeeds in more difficult spin scenarios and drills.
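The trajectory prediction step described above, forward-solving a ball model with gravitational, drag, and Magnus forces, can be sketched as below. This is an illustration under stated assumptions rather than the thesis's code: the lumped coefficients `K_DRAG` and `K_MAGNUS` are placeholder values, the explicit Euler integrator and step size are choices made for brevity, and bounces on the table are not modelled.

```python
import math

G = 9.81          # gravitational acceleration, m/s^2
K_DRAG = 0.1      # assumed lumped drag coefficient divided by mass, 1/m
K_MAGNUS = 0.01   # assumed lumped Magnus coefficient divided by mass


def cross(a, b):
    """Cross product of two 3-D vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def acceleration(v, omega, k_drag=K_DRAG, k_magnus=K_MAGNUS):
    """Acceleration from gravity, quadratic drag, and Magnus force."""
    speed = math.sqrt(sum(c * c for c in v))
    magnus = cross(omega, v)
    return (-k_drag * speed * v[0] + k_magnus * magnus[0],
            -k_drag * speed * v[1] + k_magnus * magnus[1],
            -G - k_drag * speed * v[2] + k_magnus * magnus[2])


def predict(p, v, omega, t_end, dt=0.001, **coeffs):
    """Step the ball state forward in time with explicit Euler integration."""
    for _ in range(int(round(t_end / dt))):
        a = acceleration(v, omega, **coeffs)
        p = tuple(pi + vi * dt for pi, vi in zip(p, v))
        v = tuple(vi + ai * dt for vi, ai in zip(v, a))
    return p, v


# Ball moving along +x from 1 m height; topspin is a rotation about +y here.
start, velocity = (0.0, 0.0, 1.0), (5.0, 0.0, 0.0)
p_flat, _ = predict(start, velocity, (0.0, 0.0, 0.0), 0.2)
p_top, _ = predict(start, velocity, (0.0, 150.0, 0.0), 0.2)
```

With this sign convention, the Magnus term for topspin points downward, so the topspin ball ends lower than the ball without spin after the same flight time.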