
    CML-MOTS: Collaborative Multi-task Learning for Multi-Object Tracking and Segmentation

    The advancement of computer vision has pushed visual analysis tasks from still images to the video domain. In recent years, video instance segmentation, which aims to track and segment multiple objects in video frames, has drawn much attention for its potential applications in emerging areas such as autonomous driving, intelligent transportation, and smart retail. In this paper, we propose an effective framework for instance-level visual analysis on video frames which can simultaneously conduct object detection, instance segmentation, and multi-object tracking. The core idea of our method is collaborative multi-task learning, achieved by a novel structure of associative connections among the detection, segmentation, and tracking task heads in an end-to-end learnable CNN. These additional connections allow information to propagate across multiple related tasks, benefiting all of them simultaneously. We evaluate the proposed method extensively on the KITTI MOTS and MOTS Challenge datasets and obtain quite encouraging results.
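
    A hypothetical sketch of such associative connections between task heads is given below in PyTorch; the layer names, channel sizes, and exact wiring are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of cross-task "associative connections": intermediate
# outputs of one head are projected back into the input of the next head.
# All names and dimensions are illustrative assumptions.
import torch
import torch.nn as nn

class CollaborativeHeads(nn.Module):
    def __init__(self, in_ch=256, emb_dim=128, num_classes=8):
        super().__init__()
        self.det_head = nn.Conv2d(in_ch, num_classes + 4, 1)  # class logits + box deltas
        self.seg_head = nn.Conv2d(in_ch, 1, 1)                # instance mask logits
        self.trk_head = nn.Conv2d(in_ch, emb_dim, 1)          # tracking embeddings
        # Associative connections: project one head's output into another's input.
        self.det_to_seg = nn.Conv2d(num_classes + 4, in_ch, 1)
        self.seg_to_trk = nn.Conv2d(1, in_ch, 1)

    def forward(self, feats):                                 # feats: (B, in_ch, H, W)
        det = self.det_head(feats)
        seg = self.seg_head(feats + self.det_to_seg(det))     # segmentation sees detection cues
        trk = self.trk_head(feats + self.seg_to_trk(seg))     # tracking sees mask cues
        return det, seg, trk
```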

    Instance Flow Based Online Multiple Object Tracking

    We present a method to perform online Multiple Object Tracking (MOT) of known object categories in monocular video data. Current tracking-by-detection MOT approaches build on top of 2D bounding box detections. In contrast, we exploit state-of-the-art instance-aware semantic segmentation techniques to compute 2D shape representations of target objects in each frame. We predict the position and shape of segmented instances in subsequent frames by exploiting optical flow cues. We define an affinity matrix between instances of subsequent frames which reflects locality and visual similarity. The instance association is solved by applying the Hungarian method. We evaluate different configurations of our algorithm using the MOT 2D 2015 train dataset. The evaluation shows that our tracking approach is able to track objects with high relative motion. In addition, we provide results of our approach on the MOT 2D 2015 test set for comparison with previous works, where we achieve a MOTA score of 32.1.
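
    A minimal sketch of the association step as described: solve a maximum-affinity assignment with the Hungarian method (here via SciPy's linear_sum_assignment; the affinity values and gating threshold are assumptions).

```python
# Minimal sketch: associate instances across frames by maximizing total
# affinity with the Hungarian method. The gating threshold is an assumption.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(affinity, min_affinity=0.3):
    """affinity: (num_prev, num_curr) matrix combining locality and appearance."""
    rows, cols = linear_sum_assignment(affinity, maximize=True)
    # Reject matches whose affinity is too low to be plausible.
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if affinity[r, c] >= min_affinity]

# Example: two previous instances vs. three current detections.
A = np.array([[0.90, 0.10, 0.00],
              [0.20, 0.05, 0.70]])
print(associate(A))  # -> [(0, 0), (1, 2)]
```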

    BURST: A Benchmark for Unifying Object Recognition, Segmentation and Tracking in Video

    Multiple existing benchmarks involve tracking and segmenting objects in video, e.g., Video Object Segmentation (VOS) and Multi-Object Tracking and Segmentation (MOTS), but there is little interaction between them due to the use of disparate benchmark datasets and metrics (e.g., J&F, mAP, sMOTSA). As a result, published works usually target a particular benchmark and are not easily comparable to one another. We believe that the development of generalized methods that can tackle multiple tasks requires greater cohesion among these research sub-communities. In this paper, we aim to facilitate this by proposing BURST, a dataset which contains thousands of diverse videos with high-quality object masks, and an associated benchmark with six tasks involving object tracking and segmentation in video. All tasks are evaluated using the same data and comparable metrics, which enables researchers to consider them in unison and hence more effectively pool knowledge from different methods across different tasks. Additionally, we demonstrate several baselines for all tasks and show that approaches for one task can be applied to another with a quantifiable and explainable performance difference. Dataset annotations and evaluation code are available at: https://github.com/Ali2500/BURST-benchmark

    ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking

    The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object tracking and segmentation. In this study, we convert the bounding boxes to masks in reference frames with the help of the Segment Anything Model (SAM) and Alpha-Refine, and then propagate the masks to the current frame, transforming the task from Video Object Tracking (VOT) to Video Object Segmentation (VOS). Furthermore, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. MSDeAOT efficiently propagates object masks from previous frames to the current frame using two feature scales with strides of 16 and 8. As a testament to the effectiveness of our design, we achieved 1st place in the EPIC-KITCHENS TREK-150 Object Tracking Challenge. Comment: Top 1 solution for EPIC-KITCHEN Challenge 2023: TREK-150 Single Object Tracking. arXiv admin note: text overlap with arXiv:2307.0201
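
    Converting a reference-frame bounding box to a mask with SAM, as the abstract describes, could look roughly like the sketch below; the checkpoint path, file names, and box coordinates are placeholders, and the Alpha-Refine stage is omitted.

```python
# Rough sketch: turn a VOT-style bounding box into an initial mask with SAM
# so a VOS method (e.g., AOT/MSDeAOT) can propagate it. Paths and the box
# are placeholders; the Alpha-Refine stage is omitted for brevity.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")  # placeholder checkpoint
predictor = SamPredictor(sam)

frame = cv2.cvtColor(cv2.imread("frame_0001.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(frame)

box = np.array([50, 40, 210, 180])  # XYXY box from the tracking annotation
masks, scores, _ = predictor.predict(box=box, multimask_output=False)
init_mask = masks[0]                # binary mask handed to the mask-propagation model
```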

    ZJU ReLER Submission for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation

    The Associating Objects with Transformers (AOT) framework has exhibited exceptional performance in a wide range of complex scenarios for video object segmentation. In this study, we introduce MSDeAOT, a variant of the AOT series that incorporates transformers at multiple feature scales. Leveraging the hierarchical Gated Propagation Module (GPM), MSDeAOT efficiently propagates object masks from previous frames to the current frame using a feature scale with a stride of 16. Additionally, we employ GPM at a more refined feature scale with a stride of 8, leading to improved accuracy in detecting and tracking small objects. Through the implementation of test-time augmentations and model ensemble techniques, we achieve the top-ranking position in the EPIC-KITCHEN VISOR Semi-supervised Video Object Segmentation Challenge. Comment: Top 1 solution for EPIC-KITCHEN Challenge 2023: Semi-Supervised Video Object Segmentation
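
    The test-time augmentation part can be sketched generically as below, assuming a user-supplied segment(frame) function that returns per-pixel mask logits; horizontal flipping is only one plausible augmentation, and the actual ensemble recipe is not specified in the abstract.

```python
# Generic sketch of test-time augmentation for segmentation: run the model
# on the frame and its horizontal flip, un-flip the second prediction, and
# average. `segment` is an assumed user-supplied function.
import torch

def tta_segment(segment, frame):                     # frame: (C, H, W) tensor
    logits = segment(frame)
    flipped = segment(torch.flip(frame, dims=[-1]))  # flip along the width axis
    logits_flip = torch.flip(flipped, dims=[-1])     # undo the flip on the prediction
    return (logits + logits_flip) / 2                # average the two views
```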

    Practical Uses of A Semi-automatic Video Object Extraction System

    Object-based technology is important for computer vision applications including gesture understanding, image recognition, augmented reality, etc. However, extracting the shape information of semantic objects from video sequences is a very difficult task, since this information is not explicitly provided within the video data. A tool for extracting semantic video objects is therefore indispensable for many advanced applications. We have developed a semi-automatic video object extraction system. Its performance measures, including evaluation against ground truth and an error metric, are presented, followed by some practical uses of the system. The principle at the basis of the semi-automatic extraction technique is the interaction of the user during some stages of the segmentation process, whereby the semantic information is provided directly by the user. After the user provides the initial segmentation of the semantic video objects, a tracking mechanism follows their temporal transformation in the subsequent frames, thus propagating the semantic information. Since the tracking tends to introduce boundary errors, the semantic information can be refreshed by the user at certain key frame locations in the video sequence. The tracking mechanism can operate in either the forward or the backward direction of the video sequence. The performance analysis of the results is described using single and multiple key frames, the Mean Error and "Last_Error" metrics, and forward versus backward extraction. To achieve the best performance, results from forward and backward extraction can be merged.
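
    One plausible way to merge forward and backward extraction results, sketched below, is to weight each pass by its distance to the nearest user-refreshed key frame; the actual merging rule used by the system is not given in the abstract.

```python
# Assumed merging rule (illustrative only): blend the forward and backward
# masks for frame t, trusting whichever pass started from the nearer key frame.
import numpy as np

def merge_masks(fwd_mask, bwd_mask, t, prev_key, next_key):
    """fwd_mask/bwd_mask: binary masks for frame t; prev_key < t < next_key."""
    w_fwd = (next_key - t) / (next_key - prev_key)  # closer to prev_key -> trust forward pass
    blended = w_fwd * fwd_mask.astype(float) + (1.0 - w_fwd) * bwd_mask.astype(float)
    return blended >= 0.5
```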

    Improving Multiple Object Tracking with Optical Flow and Edge Preprocessing

    In this paper, we present a new method for detecting road users in an urban environment which leads to an improvement in multiple object tracking. Our method takes a foreground image as input and improves the object detection and segmentation. This new image can be used as input to trackers that use foreground blobs from background subtraction. The first step is to create foreground images for all the frames in an urban video. Then, starting from the original blobs of the foreground image, we merge the blobs that are close to one another and that have similar optical flow. The next step is extracting the edges of the different objects to detect multiple objects that might be very close (and be merged in the same blob) and to adjust the size of the original blobs. At the same time, we use the optical flow to detect occlusion of objects that are moving in opposite directions. Finally, we decide which information to keep in order to construct a new foreground image with blobs that can be used for tracking. The system is validated on four videos of an urban traffic dataset. Our method improves the recall and precision metrics for the object detection task compared to the vanilla background subtraction method and improves the CLEAR MOT metrics in the tracking task for most videos.
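
    The blob-merging step could be sketched with OpenCV as follows, fusing connected components that are near each other and share a similar mean optical flow direction; the distance and angle thresholds are assumptions.

```python
# Sketch of the blob-merging step: fuse foreground blobs that are close
# together and move consistently. Thresholds are illustrative assumptions.
import cv2
import numpy as np

def merge_blobs(fg_mask, flow, dist_thresh=20.0, angle_thresh=0.5):
    """fg_mask: uint8 binary foreground image; flow: (H, W, 2) optical flow."""
    n, labels, _, centroids = cv2.connectedComponentsWithStats(fg_mask)
    merged = fg_mask.copy()
    for i in range(1, n):          # label 0 is the background
        for j in range(i + 1, n):
            dist = np.linalg.norm(centroids[i] - centroids[j])
            vi = flow[labels == i].mean(axis=0)  # mean motion of blob i
            vj = flow[labels == j].mean(axis=0)
            cos = vi @ vj / (np.linalg.norm(vi) * np.linalg.norm(vj) + 1e-6)
            angle = np.arccos(np.clip(cos, -1.0, 1.0))
            if dist < dist_thresh and angle < angle_thresh:
                # Bridge the two blobs so they form one connected component.
                cv2.line(merged, tuple(map(int, centroids[i])),
                         tuple(map(int, centroids[j])), 255, 3)
    return merged
```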

    Video Object Tracking Using Motion Estimation

    Real-time object tracking is a critical application. Object tracking is one of the most necessary steps for surveillance, augmented reality, smart rooms, perceptual user interfaces, object-based video compression, and driver assistance. While traditional segmentation methods using thresholding, background subtraction, and background estimation provide satisfactory results for detecting single objects, noise is produced in the case of multiple objects and in poor lighting conditions. Using the segmentation technique we can locate a target in the current frame: by minimizing the distance or maximizing the similarity coefficient we can find the exact location of the target. Target localization in the current frame is computationally complex in conventional algorithms. Searching for an object in the current frame with these algorithms starts from its location in the previous frame within the basin of attraction (roughly the square of the target area), computing a weighted average at each iteration and then comparing similarity coefficients for each new location. To overcome these difficulties, a new method is proposed for detecting and tracking multiple moving objects under night-time lighting conditions. The method integrates a wavelet-based contrast change detector with a locally adaptive thresholding scheme. In the initial stage, local contrast change over time is used to detect potential moving objects. To suppress false alarms, motion prediction and spatial nearest-neighbour data association are used. A change detector mechanism is implemented to detect changes in a video sequence and divide the sequence into scenes to be encoded independently. The change detector (CD) algorithm is efficient enough to detect abrupt cuts and helps divide the video file into sequences. With this we get a sufficiently good output with less noise, but in some cases noise remains prominent. Hence, a correlation method is used which relates two consecutive frames that differ enough to serve as the current and previous frames. This gives a considerably better result in poor lighting conditions and with multiple moving objects.
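
    The frame-correlation idea described above might be sketched as follows: compute the Pearson correlation between consecutive grayscale frames and treat a low value as "sufficient difference"; the threshold is an assumption.

```python
# Hedged sketch of the correlation test between consecutive frames; the
# 0.95 threshold is an illustrative assumption.
import numpy as np

def frames_differ(prev_frame, curr_frame, corr_thresh=0.95):
    """prev_frame/curr_frame: grayscale images of identical shape."""
    a = prev_frame.astype(float).ravel()
    b = curr_frame.astype(float).ravel()
    r = np.corrcoef(a, b)[0, 1]    # Pearson correlation of pixel intensities
    return r < corr_thresh         # True -> usable as previous/current frame pair
```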

    Design of networked visual monitoring systems

    We design and implement a networked visual monitoring system for surveillance. Instead of the usual periodic monitoring, the proposed system has an auto-tracking feature which captures the important characteristics of intruders. We integrate two schemes, namely image segmentation and histogram comparison, to accomplish auto-tracking. The developed image segmentation scheme is able to separate moving objects from the background in real time. Next, the corresponding object centroid and boundary are computed. This information is used to guide the motion of the tracking camera to track the intruders and then to take a series of shots, following a predetermined pattern. We have also developed a multiple-object tracking scheme, based on object color histogram comparison, to overcome object occlusion and disocclusion issues. The designed system can track multiple intruders or follow any particular intruder automatically. To achieve efficient transmission and storage, the captured video is compressed in the H.263 format. Queries based on time as well as on events are provided. Users can access the system from web browsers to view the monitored site or manipulate the tracking camera over the Internet. These features are of importance and value to surveillance.
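
    The color-histogram comparison used to re-identify objects across occlusion could be sketched with OpenCV as below; the HSV binning and the match threshold are assumptions.

```python
# Sketch of histogram-based object matching: compare normalized HSV
# histograms of two image patches. Bin counts and threshold are assumptions.
import cv2

def object_histogram(bgr_patch):
    hsv = cv2.cvtColor(bgr_patch, cv2.COLOR_BGR2HSV)
    hist = cv2.calcHist([hsv], [0, 1], None, [30, 32], [0, 180, 0, 256])
    return cv2.normalize(hist, hist).flatten()      # normalized hue/saturation histogram

def same_object(patch_a, patch_b, thresh=0.7):
    score = cv2.compareHist(object_histogram(patch_a),
                            object_histogram(patch_b),
                            cv2.HISTCMP_CORREL)     # correlation metric in [-1, 1]
    return score > thresh
```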