7 research outputs found

    A weakly-supervised approach for discovering common objects in airport video surveillance footage

    Object detection in video is a relevant task in computer vision. Current standard detectors are typically trained in a strongly supervised way, which requires a huge amount of labelled data. In contrast, in this paper we focus on object discovery in video sequences using sets of unlabelled data. We present an approach based on two region proposal algorithms (a pretrained Region Proposal Network and an Optical Flow Proposal) that produce regions of interest, which are then grouped using a clustering algorithm. As a result, our system requires no human collaboration except for assigning human-understandable labels to the discovered clusters. We evaluate our approach on a set of videos recorded in the outdoor area of an airport where aeroplanes park to load passengers and luggage (the apron area). Our experimental results suggest that an unsupervised approach is valid for automatic object discovery in video sequences, obtaining a CorLoc of 86.8 and a mAP of 0.374, compared to a CorLoc of 70.4 and a mAP of 0.683 achieved by a supervised Faster R-CNN trained and tested on the same dataset.
    Universidad de Málaga. Campus de Excelencia Internacional Andalucía Tech
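
    As a rough illustration of the discovery pipeline this abstract describes, the sketch below pools motion-based proposals with proposals from a pretrained network, embeds each region, and groups the regions by clustering. Here `rpn_proposals` and `embed` are hypothetical stand-ins for the pretrained Region Proposal Network and a feature extractor, and Farnebäck optical flow is an assumption, since the abstract does not name a specific flow method.

    ```python
    # Minimal sketch of proposal pooling + clustering, assuming OpenCV and
    # scikit-learn; not the authors' exact pipeline.
    import cv2
    import numpy as np
    from sklearn.cluster import KMeans

    def flow_proposals(prev_gray, next_gray, mag_thresh=2.0, min_area=400):
        """Boxes around coherently moving regions (an optical-flow proposal)."""
        flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                            0.5, 3, 15, 3, 5, 1.2, 0)
        mask = (np.linalg.norm(flow, axis=2) > mag_thresh).astype(np.uint8)
        n, _, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
        # stats rows: x, y, w, h, area; row 0 is the background component
        return [tuple(s[:4]) for s in stats[1:] if s[4] >= min_area]

    def discover_objects(frames, rpn_proposals, embed, n_clusters=10):
        """Pool RPN and flow proposals, embed each crop, group with k-means.
        A human only names the resulting clusters afterwards."""
        crops, feats = [], []
        for prev, cur in zip(frames, frames[1:]):
            boxes = rpn_proposals(cur) + flow_proposals(
                cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY),
                cv2.cvtColor(cur, cv2.COLOR_BGR2GRAY))
            for (x, y, w, h) in boxes:
                crop = cur[y:y + h, x:x + w]
                crops.append(crop)
                feats.append(embed(crop))   # e.g. a pretrained CNN embedding
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(np.stack(feats))
        return crops, labels                # clusters await human-readable names
    ```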

    Language-Driven Video Understanding

    Video understanding has advanced considerably in the past decade, from low-level tasks such as segmentation and tracking, which study objects as pixel-level segments or bounding boxes, to higher-level tasks such as activity recognition and classification, which assign a categorical action label to a video scene. Despite this progress, much of the work remains a proxy for an eventual task or application that requires a holistic view of the video: objects, actions, attributes, and other semantic components. In this dissertation, we argue that language can deliver the required holistic representation. It plays a significant role in video understanding by allowing machines to communicate with humans and to understand our requests, as shown in tasks such as text-to-video search and voice-guided robot manipulation, to name a few. Our language-driven video understanding focuses on two specific problems: video description and visual grounding. Our viewpoint differs from prior literature in two ways. First, we propose a bottom-up structured learning scheme that decomposes a long video into individual procedure steps and represents each step with a description. Second, we propose both explicit (i.e., supervised) and implicit (i.e., weakly-supervised and self-supervised) grounding between words and visual concepts, which enables interpretable modeling of the two spaces. We start by drawing attention to the shortage of large benchmarks on long-form video and language, and propose the largest-of-its-kind YouCook2 dataset and the ActivityNet-Entities dataset in Chaps. II and III. The remaining chapters revolve around the two main problems: video description and visual grounding. For video description, we first address the problem of decomposing a long video into compact and self-contained event segments in Chap. IV. Given an event segment, or a short video clip in general, we propose a non-recurrent approach (i.e., a Transformer) for video description generation in Chap. V, as opposed to prior RNN-based methods, and demonstrate superior performance. Moving forward, we note one potential issue in end-to-end video description generation: the lack of visual grounding ability and model interpretability that would allow humans to directly interact with machine vision models. To address this issue, we shift our focus from end-to-end, video-to-text systems to systems that explicitly capture the grounding between the two modalities, with a novel grounded video description framework in Chap. VI. Up to this point, all the methods are fully supervised, i.e., the model training signal comes directly from heavy and expensive human annotation. In the following chapter, we answer the question "Can we perform visual grounding without explicit supervision?" with a weakly-supervised framework in which models learn grounding from (weak) description signal. Finally, in Chap. VIII, we conclude the technical work by exploring a self-supervised grounding approach, vision-language pre-training, that implicitly learns visual grounding from web multi-modal data. This mimics how humans obtain commonsense from the environment through multi-modal interactions.
    PhD in Robotics, University of Michigan, Horace H. Rackham School of Graduate Studies
    https://deepblue.lib.umich.edu/bitstream/2027.42/155174/1/luozhou_1.pd
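
    To make the non-recurrent design concrete, here is a minimal PyTorch sketch of a Transformer-based clip captioner in the spirit of Chap. V; the `ClipCaptioner` module, all dimensions, and the vocabulary size are illustrative assumptions, not the dissertation's actual architecture.

    ```python
    # Minimal sketch: clip features attend to caption tokens via an
    # encoder-decoder Transformer instead of an RNN.
    import torch
    import torch.nn as nn

    class ClipCaptioner(nn.Module):
        def __init__(self, feat_dim=2048, d_model=512, vocab=10000, max_len=30):
            super().__init__()
            self.proj = nn.Linear(feat_dim, d_model)   # frame features -> model dim
            self.embed = nn.Embedding(vocab, d_model)  # caption token embeddings
            self.pos = nn.Parameter(torch.zeros(max_len, d_model))  # learned positions
            self.tf = nn.Transformer(d_model=d_model, num_encoder_layers=2,
                                     num_decoder_layers=2, batch_first=True)
            self.out = nn.Linear(d_model, vocab)

        def forward(self, frame_feats, tokens):
            # frame_feats: (B, T, feat_dim) clip features; tokens: (B, L) caption prefix
            src = self.proj(frame_feats)
            tgt = self.embed(tokens) + self.pos[: tokens.size(1)]
            causal = self.tf.generate_square_subsequent_mask(
                tokens.size(1)).to(tokens.device)
            return self.out(self.tf(src, tgt, tgt_mask=causal))  # (B, L, vocab) logits

    # Usage with dummy data: 2 clips of 16 frames, 12-token caption prefixes.
    logits = ClipCaptioner()(torch.randn(2, 16, 2048), torch.randint(0, 10000, (2, 12)))
    ```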

    Context-sensitive interpretation of natural language location descriptions : a thesis submitted in partial fulfilment of the requirements for the award of Doctor of Philosophy in Information Technology at Massey University, Auckland, New Zealand

    People frequently describe the locations of objects using natural language. Location descriptions may be either structured, such as 26 Victoria Street, Auckland, or unstructured. Relative location descriptions (e.g., building near Sky Tower) are a common form of unstructured location description, and use qualitative terms to describe the location of one object relative to another (e.g., near, close to, in, next to). Understanding the meaning of these terms is easy for humans, but much more difficult for machines, since the terms are inherently vague and context sensitive. In this thesis, we study the semantics (or meaning) of qualitative geospatial relation terms, specifically geospatial prepositions. Prepositions are one of the most common forms of geospatial relation term, and they are commonly used to describe the location of objects in the geographic (geospatial) environment, such as rivers, mountains, buildings, and towns. A thorough understanding of the semantics of geospatial relation terms is important because it enables more accurate automated georeferencing of text location descriptions than the use of place names only. Location descriptions that use geospatial prepositions are found in social media, web sites, blogs, and academic reports, and georeferencing can allow mapping of health, disaster, and biological data that is currently inaccessible to the public. Such descriptions have an unstructured format, so their analysis is not straightforward. The specific research questions that we address are:
    RQ1. Which geospatial prepositions (or groups of prepositions) and senses are semantically similar?
    RQ2. Is the role of context important in the interpretation of location descriptions?
    RQ3. Is the object distance associated with geospatial prepositions across a range of geospatial scenes and scales accurately predictable using machine learning methods?
    RQ4. Is human annotation a reliable form of annotation for the analysis of location descriptions?
    To address RQ1, we determine the nature and degree of similarity among geospatial prepositions by analysing data collected in a human subjects experiment, using clustering, extensional mapping, and t-distributed stochastic neighbour embedding (t-SNE) plots to form a semantic similarity matrix. In addition to calculating similarity scores among prepositions, we identify the senses of three groups of geospatial prepositions using Venn diagrams, t-SNE plots, and density-based clustering, and define the relationships between the senses. Furthermore, we use two text mining approaches to identify the degree of similarity among geospatial prepositions: bag of words and GloVe embeddings. Using these methods and further analysis, we identify semantically similar groups of geospatial prepositions, including: (1) beside, close to, near, next to, outside, and adjacent to; (2) across, over, and through; and (3) beyond, past, by, and off. The prepositions within these groups also share senses. Through is recognised as a specialisation of both across and over. Proximity and adjacency prepositions also have similar senses that express orientation and overlapping relations. Past, off, and by share a proximal sense, but beyond has a different sense from these, representing on the other side. Another finding is the more frequent use of the preposition close to for pairs of linear objects than near, which is used more frequently for non-linear ones. Also, next to is used to describe proximity more than touching (in contrast to other prepositions like adjacent to). Our application of text mining to identify semantically similar prepositions confirms that a geospatial corpus (NCGL) provides a better representation of the semantics of geospatial prepositions than a general corpus. We also found that GloVe embeddings provide adequate semantic similarity measures for more specialised geospatial prepositions, but less so for those that have more generalised applications and multiple senses.
    We explore the role of context (RQ2) by studying three sites in London that vary in size, nature, and context: Trafalgar Square, Buckingham Palace, and Hyde Park. We use the Google search engine to extract location descriptions that contain these three sites with 9 different geospatial prepositions (in, on, at, next to, close to, adjacent to, near, beside, outside) and calculate their acceptance profiles (the profile of the use of a preposition at different distances from the reference object) and acceptance thresholds (the maximum distance from a reference object at which a preposition can acceptably be used). We use these to compare prepositions and to explore the influence of different contexts. Our results show that near, in, and outside are used for larger distances, while beside, adjacent to, and at are used for smaller distances. Also, the acceptance threshold for close to is higher than for other proximity/adjacency prepositions such as next to, adjacent to, and beside. The acceptance threshold of next to is larger than that of adjacent to, which confirms the finding in Chapter 2 that next to describes a proximity rather than a touching spatial relation. We also found that relatum characteristics such as image schema affect the use of prepositions such as in, on, and at.
    We address RQ3 by developing a machine learning regression model (using the SMOReg algorithm) to predict the distance associated with the use of geospatial prepositions in specific expressions. We incorporate a wide range of input variables, including the similarity matrix of geospatial prepositions (RQ1); preposition senses; semantic information in the form of embeddings; characteristics of the located and reference objects in the expression, including their liquidity/solidity, scale, and geometry type; and contextual factors such as the density of features of different types in the surrounding area. We evaluate the model on two different datasets, achieving a 25% improvement over the best baseline on each.
    Finally, we consider the importance of annotation of geospatial location descriptions (RQ4). As annotated data is essential for the successful study of automated interpretation of natural language descriptions, we study the impact and accuracy of human annotation on different geospatial elements. Agreement scores show that human annotators can annotate geospatial relation terms (e.g., geospatial prepositions) with higher agreement than other geospatial elements.
    This thesis advances understanding of the semantics of geospatial prepositions, particularly considering their semantic similarity and the impact of context on their interpretation. We quantify the semantic similarity of a set of 24 geospatial prepositions; identify senses and the relationships among them for 13 geospatial prepositions; compare the acceptance thresholds of 9 geospatial prepositions and describe the influence of context on them; and demonstrate that richer semantic and contextual information can be incorporated in predictive models to interpret relative geospatial location descriptions more accurately.
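
    As a small illustration of the embedding-based side of RQ1, the sketch below computes a cosine-similarity matrix over a handful of the prepositions discussed above using pretrained GloVe vectors. The file path is illustrative, and averaging word vectors for multi-word prepositions such as close to is an assumption, not the thesis's exact procedure.

    ```python
    # Minimal sketch: preposition similarity from GloVe embeddings.
    import numpy as np

    def load_glove(path):
        """Parse a GloVe text file into a word -> vector dict."""
        vecs = {}
        with open(path, encoding="utf-8") as f:
            for line in f:
                word, *nums = line.split()
                vecs[word] = np.asarray(nums, dtype=float)
        return vecs

    def prep_vector(prep, vecs):
        # Average the word vectors of a (possibly multi-word) preposition.
        return np.mean([vecs[w] for w in prep.split()], axis=0)

    preps = ["near", "close to", "next to", "beside", "adjacent to", "beyond", "past"]
    glove = load_glove("glove.6B.100d.txt")        # hypothetical local path
    V = np.stack([prep_vector(p, glove) for p in preps])
    V /= np.linalg.norm(V, axis=1, keepdims=True)  # unit-normalise rows
    sim = V @ V.T                                  # cosine similarity matrix
    print(np.round(sim, 2))
    ```

    Rows and columns of the printed matrix follow the order of `preps`, so the pairwise scores can be compared directly against the proximity/adjacency groupings reported above.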

    Biological Relatives

    Thirty-five years after its initial success as a form of technologically assisted human reproduction, and five million miracle babies later, in vitro fertilization (IVF) has become a routine procedure worldwide. In Biological Relatives, Sarah Franklin explores how the normalization of IVF has changed how both technology and biology are understood. Drawing on anthropology, feminist theory, and science studies, Franklin charts the evolution of IVF from an experimental research technique into a global technological platform used for a wide variety of applications, including genetic diagnosis, livestock breeding, cloning, and stem cell research. She contends that despite its ubiquity, IVF remains a highly paradoxical technology that confirms the relative and contingent nature of biology while creating new biological relatives. Using IVF as a lens, Franklin presents a bold and lucid thesis linking technologies of gender and sex to reproductive biomedicine, contemporary bioinnovation, and the future of kinship.
    This title was made Open Access by libraries from around the world through Knowledge Unlatched.