
    Human-Tool-Interaction-Based Action Recognition Framework for Automatic Construction Operation Monitoring

    Monitoring activities on a construction jobsite is one of the most important tasks a construction management team performs every day. Management teams monitor activities to ensure that a project progresses as scheduled and that the crew works properly in a safe environment. However, site monitoring is often time-consuming. Various automated or semi-automated tracking approaches, such as radio-frequency identification, the Global Positioning System, ultra-wideband, barcodes, and laser scanning, have been introduced to better monitor activities on the construction site. However, deploying and maintaining such techniques requires substantial involvement from well-trained specialists and can be costly. As an alternative, object recognition and tracking have the advantage of requiring little human involvement and intervention. It remains a challenge, however, to recognize construction crew activities with existing methods, which suffer from high false-recognition rates. This research proposes a new approach for recognizing construction personnel activity from still images or video frames. The approach mimics the human thinking process under the assumption that a construction worker performs a given activity with a specific body pose while using a specific tool. It consists of two recognition tasks, construction worker pose recognition and tool recognition, connected in sequence through an interactive spatial relationship. The proposed method was implemented as a computer application in MATLAB and compared against a benchmark method, also implemented in MATLAB, that uses only construction worker body pose for activity recognition. Both methods were tested on the same sample set of 500 images covering more than 10 different construction activities. The experimental results show that the proposed framework achieved higher reliability (precision), lower sensitivity (recall), and overall better performance (F₁ score) than the benchmark method.
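
    To make the sequential design concrete, below is a minimal Python sketch of how the two recognition stages might interact. The function names, the (pose, tool) activity table, and the search-region heuristic are hypothetical illustrations, not the authors' MATLAB implementation.

        # Hypothetical (pose, tool) -> activity lookup table; an activity is
        # reported only when the recognized pose and tool are consistent.
        ACTIVITY_TABLE = {
            ("bending", "shovel"): "digging",
            ("standing", "trowel"): "plastering",
            ("kneeling", "hammer"): "nailing",
        }

        def expand_box(box, margin):
            """Grow an (x, y, w, h) box by a relative margin on every side."""
            x, y, w, h = box
            return (x - margin * w, y - margin * h,
                    w * (1 + 2 * margin), h * (1 + 2 * margin))

        def recognize_activity(frame, pose_model, tool_model):
            """Two-stage activity recognition (hypothetical model APIs)."""
            # Stage 1: body-pose recognition yields a pose label and worker box.
            pose_label, worker_box = pose_model.predict(frame)
            # Stage 2: search for a tool only near the worker, encoding the
            # interactive spatial relationship between the two stages.
            tool_label = tool_model.predict(frame, region=expand_box(worker_box, 0.25))
            return ACTIVITY_TABLE.get((pose_label, tool_label), "unknown")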

    Crowdsourcing in Computer Vision

    Computer vision systems require large amounts of manually annotated data to properly learn challenging visual concepts. Crowdsourcing platforms offer an inexpensive way to capture human knowledge and understanding for a vast number of visual perception tasks. In this survey, we describe the types of annotations computer vision researchers have collected using crowdsourcing, and how they have ensured that the data is of high quality while annotation effort is minimized. We begin by discussing data collection on both classic (e.g., object recognition) and recent (e.g., visual storytelling) vision tasks. We then summarize key design decisions for creating effective data collection interfaces and workflows, and present strategies for intelligently selecting the most important data instances to annotate. Finally, we conclude with some thoughts on the future of crowdsourcing in computer vision. (A 69-page meta-review of the field; Foundations and Trends in Computer Graphics and Vision, 2016.)
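
    As a small illustration of one quality-control strategy discussed in such surveys, the Python sketch below aggregates redundant crowd labels by majority vote; the data layout is assumed for illustration.

        from collections import Counter

        def aggregate_labels(annotations):
            """Majority-vote aggregation of redundant crowd labels.

            annotations maps item_id -> list of labels from different workers.
            Items with low agreement are natural candidates for expert review
            or re-annotation.
            """
            consensus, agreement = {}, {}
            for item_id, labels in annotations.items():
                label, count = Counter(labels).most_common(1)[0]
                consensus[item_id] = label
                agreement[item_id] = count / len(labels)
            return consensus, agreement

        # Example: three workers label each image.
        votes = {"img_01": ["cat", "cat", "dog"], "img_02": ["dog", "dog", "dog"]}
        labels, conf = aggregate_labels(votes)  # img_01 -> "cat" (0.67), img_02 -> "dog" (1.0)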

    Understanding perceived quality through visual representations

    The formatting of images can be considered an optimization problem whose cost function is a quality assessment algorithm. There is a trade-off between bit budget per pixel and quality: to maximize quality while minimizing the bit budget, we need to measure perceived quality. In this thesis, we focus on understanding perceived quality through visual representations based on visual system characteristics and color perception mechanisms. Specifically, we use the contrast sensitivity mechanisms of retinal ganglion cells and the suppression mechanisms of cortical neurons. We utilize color difference equations and color name distances to mimic pixel-wise color perception, and a bio-inspired model to formulate center-surround effects. Based on these formulations, we introduce two novel image quality estimators, PerSIM and CSV, and a new image quality-assistance method, BLeSS. We combine our findings from the visual system and color perception with data-driven methods to generate visual representations and measure their quality. The majority of existing data-driven methods require subjective scores or degraded images; in contrast, we follow an unsupervised approach that utilizes only generic images. We introduce a novel unsupervised image quality estimator, UNIQUE, and extend it with multiple models and layers to obtain MS-UNIQUE and DMS-UNIQUE. In addition to introducing quality estimators, we analyze the role of spatial pooling and boosting in image quality assessment. (Ph.D. thesis)
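
    As a rough illustration of pixel-wise color difference as a quality cue (a plain CIE76 baseline, not the thesis's PerSIM, CSV, or UNIQUE estimators), the Python sketch below pools per-pixel color differences between a reference and a distorted image.

        import numpy as np
        from skimage.color import rgb2lab  # RGB -> CIE Lab conversion

        def color_difference_score(reference, distorted):
            """Mean per-pixel CIE76 color difference between two RGB images.

            Inputs are float arrays in [0, 1] of shape (H, W, 3); lower
            scores mean the images are perceptually closer.
            """
            delta_e = np.linalg.norm(rgb2lab(reference) - rgb2lab(distorted), axis=-1)
            return delta_e.mean()  # spatial pooling by simple averaging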

    Hashing for Multimedia Similarity Modeling and Large-Scale Retrieval

    In recent years, the amount of multimedia data such as images, text, and video has been growing rapidly on the Internet. Motivated by this trend, this thesis is dedicated to exploiting hashing-based solutions that reveal correlations in multimedia data and support intra-media and inter-media similarity search over huge volumes of multimedia data. We start by investigating a hashing-based solution for audio-visual similarity modeling and apply it to the audio-visual sound source localization problem. We show that synchronized signals in the audio and visual modalities exhibit similar temporal patterns of change in certain feature spaces. We propose a permutation-based random hashing technique that captures the temporal order dynamics of audio and visual features by hashing them along the temporal axis into a common Hamming space. In this way, the audio-visual correlation problem is transformed into a similarity search problem in the Hamming space. Our hashing-based audio-visual similarity modeling shows superior performance in the localization and segmentation of sounding objects in videos. The success of the permutation-based hashing method motivates us to generalize and formally define the supervised ranking-based hashing problem and to study its application to large-scale image retrieval. Specifically, we propose an effective supervised learning procedure to learn optimized ranking-based hash functions that can be used for large-scale similarity search. Compared with the randomized version, the optimized ranking-based hash codes are much more compact and discriminative; moreover, the method can easily be extended to kernel space to discover more complex ranking structures that cannot be revealed in linear subspaces. Experiments on large image datasets demonstrate the effectiveness of the proposed method for image retrieval. We further study the ranking-based hashing method for the cross-media similarity search problem. Specifically, we propose two optimization methods to jointly learn two groups of linear subspaces, one for each media type, so that the features' ranking orders in different linear subspaces maximally preserve the cross-media similarities. Additionally, we develop this ranking-based hashing method in the cross-media context into a flexible hashing framework with a more general solution. Extensive experiments on several real-world datasets demonstrate that the proposed cross-media hashing method achieves superior cross-media retrieval performance against several state-of-the-art algorithms. Lastly, to make better use of the supervisory label information and to further improve the efficiency and accuracy of supervised hashing, we propose a novel multimedia discrete hashing framework that optimizes an instance-wise loss objective, as opposed to pairwise losses, using an efficient discrete optimization method. In addition, the proposed method decouples binary code learning and hash function learning into two separate stages, making it equally applicable to both single-media and cross-media search. Extensive experiments on both single-media and cross-media retrieval tasks demonstrate the effectiveness of the proposed method.
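
    To illustrate the flavor of permutation-based hashing of temporal signals, the Python sketch below uses a generic winner-take-all-style construction, not the thesis's exact method: shared random permutations turn two correlated feature tracks into codes whose agreement reflects shared temporal ordering.

        import numpy as np

        def wta_hash(sequence, permutations, window=4):
            """Permutation hash of a 1-D temporal signal.

            For each shared random permutation, record the position of the
            maximum among the first `window` permuted samples; matching code
            symbols indicate similar temporal ordering.
            """
            seq = np.asarray(sequence)
            return np.array([int(np.argmax(seq[p[:window]])) for p in permutations])

        def code_similarity(code_a, code_b):
            """Fraction of matching code symbols (Hamming-style similarity)."""
            return float(np.mean(code_a == code_b))

        rng = np.random.default_rng(0)
        T = 100                                          # frames per track
        perms = [rng.permutation(T) for _ in range(64)]  # shared permutations
        audio_energy = rng.random(T)                     # stand-in audio feature
        visual_motion = audio_energy + 0.05 * rng.random(T)  # correlated visual track
        sim = code_similarity(wta_hash(audio_energy, perms),
                              wta_hash(visual_motion, perms))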

    Thin-Slicing for Pose: Learning to Understand Pose without Explicit Pose Estimation

    We address the problem of learning a pose-aware, compact embedding that projects images with similar human poses close together in the embedding space. The embedding function is built on a deep convolutional network and trained with triplet-based rank constraints on real image data. This architecture allows us to learn a robust representation that captures differences in human poses by effectively factoring out variations in clothing, background, and imaging conditions in the wild. For a variety of pose-related tasks, the proposed pose embedding provides a cost-efficient and natural alternative to explicit pose estimation, circumventing the challenges of localizing body joints. We demonstrate the efficacy of the embedding on pose-based image retrieval and action recognition problems.
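
    For illustration, the triplet-based rank constraint can be written as a hinge loss on embedding distances. The plain-NumPy sketch below shows the objective only, not the paper's network or training pipeline.

        import numpy as np

        def triplet_loss(anchor, positive, negative, margin=0.2):
            """Hinge-style triplet rank loss on embedding vectors.

            anchor and positive depict similar human poses; negative depicts
            a different pose. The loss is zero once the positive is at least
            `margin` closer to the anchor than the negative is.
            """
            d_pos = np.sum((anchor - positive) ** 2)  # distance to same-pose image
            d_neg = np.sum((anchor - negative) ** 2)  # distance to different pose
            return max(0.0, d_pos - d_neg + margin)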