[Paper note] Video-based Person Re-identification with Accumulative Motion Context
Highlight
- Two-stream architecture: spatial (appearance) + temporal (optical flow).
- A motion network pre-trained on optical flow predicts flow and is then trained end-to-end with the rest of the model.
- Fusion of motion and spatial features.
- Multi-loss: siamese re-id loss plus identity-classification loss.
Model
- Structure of the whole model (figure in the paper)
- Structure of the motion network, pre-trained on Lucas-Kanade (LK) or EpicFlow optical flow (figure in the paper)
- Structure of the spatial network (figure in the paper)
- Different spatial fusion methods: concatenate, sum, max (code sketch after this list)
- Different spatial fusion positions: after any layer in the spatial network
- Motion context accumulation: via a plain RNN, not an LSTM in this paper (sketched below)
- Multi-loss: siamese (distance) loss + classification (softmax) loss (sketched below)
- Pre-train the motion network on optical flow with a smoothed L1 loss, where $l = 1, 2, 3$ indexes optical-flow estimates at different resolutions:

$$L_{\text{motion}}^{(l)}(e^{(l)}, g^{(l)}) = \sum_{i,j,k} \text{smooth}_{L_1}\!\left(e^{(l)}_{i,j,k} - g^{(l)}_{i,j,k}\right)$$

$$\text{smooth}_{L_1}(\theta) = \begin{cases} 0.5\,\theta^2 & \text{if } |\theta| < 1 \\ |\theta| - 0.5 & \text{otherwise} \end{cases}$$
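A direct PyTorch translation of the smoothed-L1 pre-training loss above (a minimal sketch; the tensor names `e` and `g` for the estimated and ground-truth flow at one resolution level are illustrative):

```python
import torch

def motion_pretrain_loss(e: torch.Tensor, g: torch.Tensor) -> torch.Tensor:
    """Smoothed-L1 loss summed over all flow entries at one resolution level."""
    theta = e - g
    per_elem = torch.where(theta.abs() < 1, 0.5 * theta ** 2, theta.abs() - 0.5)
    return per_elem.sum()
```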
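The three spatial-fusion variants compared in the ablation reduce to one line each; a minimal sketch (function and argument names are mine, not from the paper's code):

```python
import torch

def fuse(spatial_feat: torch.Tensor, motion_feat: torch.Tensor,
         method: str = "concat") -> torch.Tensor:
    """Fuse same-shaped (B, C, H, W) feature maps from the two streams."""
    if method == "concat":
        return torch.cat([spatial_feat, motion_feat], dim=1)  # doubles the channels
    if method == "sum":
        return spatial_feat + motion_feat
    if method == "max":
        return torch.maximum(spatial_feat, motion_feat)
    raise ValueError(f"unknown fusion method: {method}")
```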
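A minimal sketch of the motion-context accumulation step, assuming per-frame fused features and temporal average pooling over the RNN outputs (the feature and hidden sizes are assumptions):

```python
import torch
import torch.nn as nn

class MotionContextAccumulator(nn.Module):
    """Vanilla RNN (not an LSTM) accumulating context over a sub-sequence."""
    def __init__(self, feat_dim: int = 1024, hidden_dim: int = 512):
        super().__init__()
        self.rnn = nn.RNN(feat_dim, hidden_dim, batch_first=True)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        out, _ = self.rnn(frame_feats)  # (B, T, feat_dim) -> (B, T, hidden_dim)
        return out.mean(dim=1)          # temporal pooling into one sequence feature
```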
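And a sketch of the joint objective, pairing a siamese contrastive loss on sequence-feature distances with softmax identity classification (the margin value and equal loss weighting are assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F

def multi_loss(feat_a, feat_b, same_id, logits_a, logits_b,
               label_a, label_b, margin: float = 2.0) -> torch.Tensor:
    """Siamese (contrastive) loss + identity-classification loss."""
    dist = F.pairwise_distance(feat_a, feat_b)
    siamese = torch.where(same_id.bool(),
                          dist.pow(2),                          # pull same IDs together
                          F.relu(margin - dist).pow(2)).mean()  # push different IDs apart
    ident = F.cross_entropy(logits_a, label_a) + F.cross_entropy(logits_b, label_b)
    return siamese + ident
```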
Experiment
- Datasets:
- iLIDS-VID: 300 IDs, 2 camera views, sequence lengths of 23~192 frames
- PRID-2011: 749 IDs, 2 camera views, sequence lengths of 5~675 frames
- Settings
- Input of spatial net: 64 × 32; input of motion net: 128 × 64
- Data augmentation in both training and test phases
- 10 repeated runs over different training/test splits
- Training sub-sequences of 16 frames (see the sampling sketch after these settings)
- Sequences capped at 128 frames for testing
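A minimal sketch of the sub-sequence sampling implied by these settings (a hypothetical helper; tracklets shorter than the target length are returned as-is):

```python
import random

def sample_subsequence(frames: list, length: int = 16) -> list:
    """Randomly crop a training sub-sequence; test time takes up to 128 frames."""
    start = random.randint(0, max(0, len(frames) - length))
    return frames[start:start + length]
```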
- Ablation study
- Motion information: compare LK and EpicFlow optical flow, used either as a direct input or to pre-train the motion network before end-to-end training. End-to-end training with EpicFlow supervision performs best.
- Spatial fusion method and location: concatenation fused at the Max-pooling2 layer performs best.
- Comparison with the state of the art: sets a new state of the art on PRID-2011.