TVPR: Text-to-Video Person Retrieval and a New Benchmark
Most existing methods for text-based person retrieval focus on text-to-image
person retrieval. Nevertheless, due to the lack of dynamic information provided
by isolated frames, the performance is hampered when the person is obscured in
isolated frames or variable motion details are given in the textual
description. In this paper, we propose a new task called Text-to-Video Person
Retrieval (TVPR), which aims to effectively overcome the limitations of isolated
frames. Since no existing dataset or benchmark describes person videos
with natural language, we construct a large-scale cross-modal person video
dataset, termed the Text-to-Video Person Re-identification (TVPReid) dataset,
with detailed natural language annotations covering each person's
appearance, actions, and interactions with the environment. For this task,
we propose a Text-to-Video Person Retrieval Network
(TVPRN). Specifically, TVPRN acquires video representations by
fusing visual and motion representations of person videos, which can deal with
temporal occlusion and the absence of variable motion details in isolated
frames. Meanwhile, we employ pre-trained BERT to obtain caption
representations and use the relationship between caption and video
representations to identify the most relevant person videos. To evaluate the
effectiveness of the proposed TVPRN, we conduct extensive experiments on the
TVPReid dataset.
To the best of our knowledge, TVPRN is the first successful attempt to use
video for the text-based person retrieval task, and it achieves state-of-the-art
performance on the TVPReid dataset. The TVPReid dataset will be publicly available
to benefit future research.
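As a rough illustration of the retrieval step described in this abstract, the sketch below fuses the visual and motion features of each video and ranks videos by similarity to a caption embedding. Fusion by concatenation, cosine similarity as the matching score, and all names and toy vectors (`fuse`, `rank_videos`, `clip_a`, `clip_b`) are assumptions made for the example. The abstract does not specify how TVPRN fuses features or scores matches, and a real system would obtain the caption vector from a learned encoder such as BERT.

```python
import math

def fuse(visual, motion):
    """Fuse visual and motion features by concatenation (an assumption)."""
    return list(visual) + list(motion)

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_videos(caption_vec, videos):
    """Return video ids sorted from most to least similar to the caption."""
    scored = [(cosine(caption_vec, fuse(v, m)), vid) for vid, (v, m) in videos.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [vid for _, vid in scored]

# Toy data: each video has a 2-d visual and a 2-d motion feature (hypothetical values).
videos = {
    "clip_a": ([1.0, 0.0], [0.0, 1.0]),
    "clip_b": ([0.0, 1.0], [1.0, 0.0]),
}
# Hypothetical caption embedding already projected into the same 4-d space.
caption = [0.9, 0.1, 0.1, 0.8]
ranking = rank_videos(caption, videos)
```

With these toy vectors, `clip_a` ranks first because its fused vector points in nearly the same direction as the caption vector.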
Learning Person Re-identification Models from Videos with Weak Supervision
Most person re-identification methods, being supervised techniques, suffer
from the burden of massive annotation requirements. Unsupervised methods
overcome this need for labeled data, but perform poorly compared to the
supervised alternatives. In order to cope with this issue, we introduce the
problem of learning person re-identification models from videos with weak
supervision. The supervision is weak because it requires only
video-level labels, i.e., the identities of the persons who appear in the video,
in contrast to more precise frame-level annotations. Toward this goal, we propose a
multiple instance attention learning framework for person re-identification
using such video-level labels. Specifically, we first cast the video person
re-identification task into a multiple instance learning setting, in which
person images in a video are collected into a bag. The relations between videos
with similar labels can be utilized to identify persons. Building on this, we
introduce a co-person attention mechanism that mines the similarity
correlations between videos with person identities in common. The attention
weights are obtained based on all person images instead of person tracklets in
a video, making our learned model less affected by noisy annotations. Extensive
experiments demonstrate the superiority of the proposed method over the related
methods on two weakly labeled person re-identification datasets.
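The multiple instance setting described in this abstract, where person images from one video form a bag and attention weights are computed over all images, can be sketched with a generic attention pooling step. This is only an illustration under stated assumptions: the dot-product scoring, the fixed `query` vector, and the function names are hypothetical, and the paper's co-person attention (which relates different videos sharing identities) is not reproduced here.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(bag, query):
    """Pool a bag of instance features into one vector using attention weights.

    bag:   list of feature vectors, one per person image in the video
    query: vector used to score instances (hypothetical; learned in practice)
    """
    scores = [sum(f * q for f, q in zip(feat, query)) for feat in bag]
    weights = softmax(scores)
    dim = len(bag[0])
    pooled = [sum(w * feat[d] for w, feat in zip(weights, bag)) for d in range(dim)]
    return pooled, weights

# Toy bag: two images resemble the query identity, one does not.
bag = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
query = [1.0, 0.0]
pooled, weights = attention_pool(bag, query)
```

Instances that match the query receive larger weights, so an unrelated or noisy image in the bag contributes less to the pooled video-level representation.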
Hierarchical Trajectory Matching Technique for Wide-Area Multi-Pedestrian Tracking
Thesis (Ph.D.) -- Seoul National University Graduate School: College of Engineering, Department of Electrical and Computer Engineering, 2020. 8. Choi Jin-young.
The purpose of the wide-area tracking problem is to track pedestrians who appear across cameras with overlapping or non-overlapping views, regardless of the time interval or person density.
In single-camera tracking, data association based on the overlap of detection boxes is used to solve the tracking problem, but it still suffers from appearance ambiguity.
Wide-area tracking, however, requires a scheme that relies on the appearance similarity of people rather than on the overlap of detection boxes.
In this dissertation, we propose a tracking scheme for Wide-area Multi-Pedestrian Tracking (WaMuPeT).
To achieve WaMuPeT, we propose trajectory matching in overlapping camera settings (Ch. 3) and non-overlapping camera settings (Ch. 4), and robust trajectory matching in dense scene settings (Ch. 5).
For trajectory matching in overlapping camera settings (Ch. 3), we propose a novel deep-learning architecture for accurate 3-D localization and tracking of pedestrians using multiple cameras.
The architecture is composed of two networks: a detection network and a localization network.
The detection network yields the pedestrian detections and the localization network estimates the ground position of a pedestrian within its detection box.
In addition, an attentional pass filter is introduced to effectively connect the two networks.
Using the detection proposals and their 2-D grounding positions obtained from the two networks, a multi-camera multi-target 3-D localization and tracking algorithm is developed using a min-cost network flow approach.
The experiments show that the proposed method improves 3-D localization and tracking performance.
For trajectory matching in non-overlapping camera settings (Ch. 4), we propose a novel re-ranking method that uses a ranking-reflected metric to measure the similarity between two ordered sets of k-nearest neighbors (OKNN).
The proposed metric, ranking-reflected similarity (RSS), accounts for the ranks of the elements shared by the two OKNNs.
Using RSS, a re-ranking procedure is proposed that prioritizes gallery samples whose neighbors are similar to the probe's neighbors in terms of ranking order.
In the experiments, we show that the proposed method improves Re-ID accuracy when added on to state-of-the-art methods.
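The idea of comparing two ordered neighbor lists so that the ranks of shared elements matter can be sketched as follows. The weighting used here (the product of reciprocal ranks, normalized so identical lists score 1) is an assumption chosen for illustration and is not the RSS formula from the dissertation, which the abstract does not give. The names `rank_weighted_overlap` and `rerank` and the toy neighbor lists are likewise hypothetical.

```python
def rank_weighted_overlap(knn_a, knn_b):
    """Similarity of two ordered neighbor lists; shared items near the top count more.

    Assumes each list contains unique items, ordered from nearest to farthest.
    """
    pos_b = {item: rank for rank, item in enumerate(knn_b)}
    score = 0.0
    for rank_a, item in enumerate(knn_a):
        rank_b = pos_b.get(item)
        if rank_b is not None:
            score += 1.0 / ((rank_a + 1) * (rank_b + 1))
    k = min(len(knn_a), len(knn_b))
    norm = sum(1.0 / (r + 1) ** 2 for r in range(k))
    return score / norm if norm else 0.0

def rerank(probe_knn, gallery_knns):
    """Order gallery ids by how closely their neighbor lists match the probe's."""
    scored = sorted(
        gallery_knns.items(),
        key=lambda kv: rank_weighted_overlap(probe_knn, kv[1]),
        reverse=True,
    )
    return [gid for gid, _ in scored]

probe_knn = ["n1", "n2", "n3"]
gallery_knns = {
    "x": ["n1", "n2", "n3"],  # same neighbors, same order
    "y": ["n3", "n2", "n1"],  # same neighbors, reversed order
    "z": ["n7", "n8", "n9"],  # no shared neighbors
}
order = rerank(probe_knn, gallery_knns)
```

A plain set overlap would score `x` and `y` identically, since both share all three neighbors with the probe. The rank-weighted score prefers `x`, whose neighbors appear in the same order.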
For robust trajectory matching in dense scene settings (Ch. 5), we propose a novel multi-pedestrian tracking framework that generates robust trajectories in dense scenes.
The proposed tracking method is based on trajectory matching with a divide-and-conquer strategy.
In this strategy, short-term, mid-term, and long-term trajectories are generated by successive trajectory merging stages.
We also propose a novel deep-feature matching method called stable boundary selection (SBS).
In SBS matching, the detections are clustered by the group similarity of deep features, so that robust trajectories can be generated.
With the smoothing algorithms and the detection restoration algorithm, the proposed tracking method achieves state-of-the-art tracking accuracy on three public tracking datasets.
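The staged generation of short-term, mid-term, and long-term trajectories mentioned in the abstract can be illustrated with a simple greedy merge that is repeated with growing time-gap limits. Everything specific in this sketch is an assumption: the scalar appearance feature, the gap and distance thresholds, and the greedy rule stand in for the dissertation's actual matching, and SBS is not implemented.

```python
def merge_stage(tracklets, max_gap, max_dist):
    """Greedily append each tracklet to an earlier one that ends shortly before it.

    Two tracklets are merged when the frame gap is in (0, max_gap] and their
    appearance features differ by at most max_dist (both thresholds hypothetical).
    """
    merged = []
    for t in sorted(tracklets, key=lambda t: t["start"]):
        for m in merged:
            gap = t["start"] - m["end"]
            if 0 < gap <= max_gap and abs(t["feat"] - m["feat"]) <= max_dist:
                m["end"] = t["end"]
                m["ids"] = m["ids"] + t["ids"]
                break
        else:
            # No compatible earlier tracklet: start a new trajectory (copy the input).
            merged.append(dict(t, ids=list(t["ids"])))
    return merged

# Toy tracklets: A, B, C belong to one person; D, E to another (feat is a scalar stand-in).
tracklets = [
    {"ids": ["A"], "start": 0, "end": 10, "feat": 0.10},
    {"ids": ["B"], "start": 12, "end": 20, "feat": 0.12},
    {"ids": ["C"], "start": 28, "end": 40, "feat": 0.11},
    {"ids": ["D"], "start": 5, "end": 15, "feat": 0.90},
    {"ids": ["E"], "start": 60, "end": 70, "feat": 0.88},
]

short_term = merge_stage(tracklets, max_gap=2, max_dist=0.05)
mid_term = merge_stage(short_term, max_gap=10, max_dist=0.05)
long_term = merge_stage(mid_term, max_gap=50, max_dist=0.05)
```

Each stage tolerates a longer gap than the previous one, so nearby fragments are joined first and only the resulting longer trajectories are compared across large time intervals.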
Chapter 1 Introduction
1.1 Background
1.2 Related Works
1.2.1 Localization of Pedestrian Detection
1.2.2 Pedestrian Feature from Person Re-identification
1.2.3 Multi-Pedestrian Tracking
1.3 Contributions
1.4 Thesis Organization
Chapter 2 Problem Statements
2.1 Trajectory Matching in Overlapping Camera Settings
2.1.1 Challenges
2.1.2 Approach for the challenges
2.2 Trajectory Matching in Non-Overlapping Camera Settings
2.2.1 Challenges
2.2.2 Approach for the challenges
2.3 Robust Trajectory Matching in Dense Scene Settings
2.3.1 Challenges
2.3.2 Approach for the challenges
Chapter 3 Trajectory Matching in Overlapping Camera Settings
3.1 Overall Scheme
3.2 Network Design
3.3 MCMTT with Proposed Network
Chapter 4 Trajectory Matching in Non-overlapping Camera Settings
4.1 Overall Scheme
4.2 Proposed Method
4.2.1 Proposed Similarity Metric
4.2.2 Selection of A
4.2.3 Re-ranking Procedure
Chapter 5 Robust Trajectory Matching in Dense Scene Settings
5.1 Overall Scheme
5.2 Similarity Matrix Generation
5.3 Stable Boundary Selection
5.4 Trajectory Smoothing
5.5 Detection Restoration
5.6 Trajectory Merging Process
Chapter 6 Experiments
6.1 Dataset and Evaluation Metric
6.1.1 Trajectory Matching in Overlapping Camera Settings
6.1.2 Trajectory Matching in Non-overlapping Camera Settings
6.1.3 Robust Trajectory Matching in Dense Scene Settings
6.2 Results and Discussion
6.2.1 Trajectory Matching in Overlapping Camera Settings
6.2.2 Trajectory Matching in Non-overlapping Camera Settings
6.2.3 Robust Trajectory Matching in Dense Scene Settings
Chapter 7 Conclusions and Future Works
7.1 Concluding Remarks
7.2 Future Works
Abstract
Docto