[paper review] BEV-SUSHI: Multi-Target Multi-Camera 3D Detection and Tracking in Bird’s-Eye View


The paper covered in this post is BEV-SUSHI.

BEV-SUSHI is a multi-camera 3D detection and tracking paper. Unlike conventional late-association methods, it uses a GNN-based tracker. The underlying GNN tracker is SUSHI; see the post below.

[paper review] Unifying Short and Long-Term Tracking with Graph Hierarchies

Hello. In this post I review SUSHI, the tracking module from Unifying Short and Long-Term Tracking with Graph Hierarchies. Tracking is usually split into long-term association and short-term association, and this paper…

jaehoon-daddy.tistory.com

As the title suggests, BEV-SUSHI can be expected to fuse the multi-view inputs in BEV and then use SUSHI for long-term tracking.

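The figure contrasts late aggregation with early aggregation. In late aggregation, detection runs on each camera's image separately, the results go into a per-camera tracking algorithm (usually filter-based), and the resulting tracks are then matched across cameras with ReID to produce the overall tracking. In contrast, the early aggregation proposed in the paper first performs multi-view feature aggregation (that is, BEV fusion), then feeds the resulting 3D detections together with their ReID features into the tracking algorithm (SUSHI, which is learning-based) to track everything at once.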

Coordinate Systems and Projection

Because the setup is multi-view, a world coordinate system is defined, and the 2D coordinates of a 3D position projected into each camera are computed as follows.

K is the camera's intrinsic parameter matrix, and R|T is the transformation matrix built from the extrinsic parameters.
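Written out in the standard pinhole form, a homogeneous world point projects into camera $i$ as shown here. The notation is an assumption chosen to match the symbols above, not copied from the paper.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K^{i} \, [\, R^{i} \mid t^{i} \,]
    \begin{bmatrix} x \\ y \\ z \\ 1 \end{bmatrix}
```

Here $s$ is the depth of the point along the camera's optical axis, and dividing by $s$ gives the pixel coordinates $(u, v)$.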

import numpy as np

# 1. Define a 3D point in world coordinates
x_world = np.array([2.0, 1.0, 5.0, 1.0])  # homogeneous coordinates: [x, y, z, 1]

# 2. Define intrinsic matrix Kᶦ (camera-specific)
K = np.array([
    [1000, 0,    640],  # fx,  0, cx
    [0,    1000, 360],  # 0,  fy, cy
    [0,    0,    1]
])

# 3. Define extrinsic parameters: Rᶦ (rotation), tᶦ (translation)
R = np.eye(3)  # Identity (no rotation)
t = np.array([[0], [0], [0]])  # Zero translation

# 4. Combine [R | t] into a 3×4 extrinsic matrix
RT = np.hstack((R, t))  # shape: (3, 4)

# 5. Compute projection matrix Pᶦ = K × [R | t]
P = K @ RT  # shape: (3, 4)

# 6. Project the 3D point to 2D
u_homogeneous = P @ x_world  # shape: (3,)
u = u_homogeneous[:2] / u_homogeneous[2]  # normalize by depth (s)

# Print result
print("Projected 2D point (u, v):", u)

Multi-View 3D Object Detection

The backbone is BEVFormer, a BEV fusion method widely used in the autonomous-driving domain. It transforms the images from multiple cameras into BEV. Next, N reference points are sampled in BEV space and projected onto each camera's image plane using the projection described above, and image features are extracted at those locations.
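The reference-point step can be sketched roughly as follows. This is only an illustration, not BEVFormer's actual implementation: `project_points` and `hit_cameras` are hypothetical helper names, the camera parameters and image size are made up, and the y-axis is assumed to be the height axis of the pillar.

```python
import numpy as np

def project_points(points_world, K, R, t):
    """Project (N, 3) world points into one camera.

    Returns pixel coordinates (N, 2) and camera-frame depth (N,).
    """
    cam = points_world @ R.T + t            # world frame -> camera frame
    depth = cam[:, 2]
    uvw = cam @ K.T                         # camera frame -> homogeneous pixels
    # Guard against division by zero for points lying on the camera plane.
    safe_depth = np.where(np.abs(depth) > 1e-6, depth, 1e-6)
    return uvw[:, :2] / safe_depth[:, None], depth

def hit_cameras(ref_points, cameras, image_size):
    """Boolean mask (num_cameras, N): does point n land inside camera c's image?"""
    width, height = image_size
    hits = []
    for K, R, t in cameras:
        uv, depth = project_points(ref_points, K, R, t)
        inside = (
            (depth > 0)
            & (uv[:, 0] >= 0) & (uv[:, 0] < width)
            & (uv[:, 1] >= 0) & (uv[:, 1] < height)
        )
        hits.append(inside)
    return np.stack(hits)

# Made-up setup: one camera looking along +z, one rotated 180 degrees about y.
K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
front = (K, np.eye(3), np.zeros(3))
back = (K, np.diag([-1.0, 1.0, -1.0]), np.zeros(3))

# One BEV location (x=2, z=5) sampled at three assumed heights.
ref_points = np.array([[2.0, 0.0, 5.0], [2.0, 1.0, 5.0], [2.0, 2.0, 5.0]])

hits = hit_cameras(ref_points, [front, back], image_size=(1280, 720))
v_hit = [cam_idx for cam_idx, row in enumerate(hits) if row.any()]
print(hits)
print("cameras hit:", v_hit)
```

Only cameras that see at least one reference point take part in feature sampling for that BEV location, which keeps the attention sparse.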

Next comes SCA (spatial cross-attention). In its formula, $Q_{p}$ is the (learnable) query in BEV space, $V_{hit}$ is the set of cameras onto which the reference points project, and $P_{p,i,j}$ is the location obtained by projecting the j-th reference point of BEV position p onto camera i. Attn denotes deformable attention.

(The query is a learnable query vector in BEV, while the reference points are actual coordinates in BEV. Projecting those coordinates into each camera determines where features are sampled.)
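For reference, the spatial cross-attention defined in the BEVFormer paper can be written with these symbols as follows. $N_{ref}$ (the number of reference points per query) and $F^{i}$ (the feature map of camera $i$) are names taken from that paper rather than from this post.

```latex
\mathrm{SCA}(Q_p, F) = \frac{1}{|V_{hit}|}
  \sum_{i \in V_{hit}} \sum_{j=1}^{N_{ref}}
  \mathrm{Attn}\big(Q_p,\; P_{p,i,j},\; F^{i}\big)
```

The average over $|V_{hit}|$ means each BEV query only attends to the cameras that actually see its reference points.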

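TSA stands for temporal self-attention. It connects the BEV query at the current time t with the BEV from the previous time t-1 to integrate temporal information. According to the paper, this module is there to better capture object motion.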

After that, a DETR head serves as the decoder and predicts the 3D bboxes. Focal loss and L1 loss are used as the losses.
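As a reminder of what these two losses compute, here is a minimal NumPy sketch. The `alpha=0.25` and `gamma=2.0` defaults come from the original focal loss paper and are an assumption here, since the values used in BEV-SUSHI are not given in this post. The function names are illustrative as well.

```python
import numpy as np

def focal_loss(probs, targets, alpha=0.25, gamma=2.0):
    """Binary focal loss averaged over all entries.

    probs: predicted probabilities in (0, 1); targets: 0/1 labels.
    """
    probs = np.clip(probs, 1e-7, 1 - 1e-7)             # avoid log(0)
    p_t = np.where(targets == 1, probs, 1 - probs)     # probability of the true class
    alpha_t = np.where(targets == 1, alpha, 1 - alpha)
    # (1 - p_t)^gamma down-weights examples that are already classified well.
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))

def l1_loss(pred_boxes, gt_boxes):
    """Mean absolute error over all box parameters."""
    return float(np.mean(np.abs(pred_boxes - gt_boxes)))

targets = np.array([1, 0])
easy = focal_loss(np.array([0.9, 0.1]), targets)   # confident and correct
hard = focal_loss(np.array([0.1, 0.9]), targets)   # confident and wrong
print(easy < hard)
print(l1_loss(np.array([[1.0, 2.0]]), np.array([[1.0, 4.0]])))
```

Focal loss handles the classification side, where most queries are background, and L1 regresses the box parameters.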

Multi-View ReID Feature Extraction

SUSHI extracts embedding vectors with a ReID model that relies only on 2D detections, which cannot be applied directly to the 3D detection task in this paper. The traditional approach is to project the 3D bboxes from BEVFormer into 2D and run the ReID model on those projected boxes to get embedding vectors, but a 3D bbox projected into 2D comes out larger than the actual object, which degrades ReID quality.
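The size mismatch is easy to see with a small sketch. It assumes an axis-aligned 3D box that lies fully in front of the camera, with made-up intrinsics. `box3d_corners` and `project_box_to_2d` are hypothetical helpers, not code from the paper.

```python
import numpy as np

def box3d_corners(center, size):
    """Eight corners of an axis-aligned 3D box (center and size are 3-vectors)."""
    signs = np.array(
        [[sx, sy, sz] for sx in (-1, 1) for sy in (-1, 1) for sz in (-1, 1)]
    )
    return np.asarray(center, dtype=float) + 0.5 * signs * np.asarray(size, dtype=float)

def project_box_to_2d(corners, K, R, t):
    """Project the corners and take their enclosing 2D box [x1, y1, x2, y2].

    Assumes every corner has positive depth in the camera frame.
    """
    cam = corners @ R.T + t
    uv = (cam @ K.T)[:, :2] / cam[:, 2:3]
    x1, y1 = uv.min(axis=0)
    x2, y2 = uv.max(axis=0)
    return np.array([x1, y1, x2, y2])

K = np.array([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])

# A person-sized box, 0.6 m wide, 1.8 m tall and 0.6 m deep, 5 m from the camera.
corners = box3d_corners(center=(0.0, 0.0, 5.0), size=(0.6, 1.8, 0.6))
box = project_box_to_2d(corners, K, np.eye(3), np.zeros(3))

# A flat 0.6 m x 1.8 m silhouette at the same depth, for comparison.
flat_width = 1000.0 * 0.6 / 5.0
flat_height = 1000.0 * 1.8 / 5.0

print("projected 3D box (w, h):", box[2] - box[0], box[3] - box[1])
print("flat silhouette  (w, h):", flat_width, flat_height)
```

The enclosing box is driven by the corners nearest the camera, so it comes out wider and taller than the silhouette and pulls extra background into the ReID crop.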

The proposed method is 2D-3D detection association: a 2D detector (DINO + FAN: small) predicts 2D bboxes, which are then matched against the 2D projections of the 3D bboxes using the Hungarian algorithm.

The ReID model is SOLIDER, and the table below reports the metrics from testing that model.

To sum up, this part involves no learning. It is a heuristic that matches 2D and 3D detections.

Below is a rough example that implements the description above with NumPy and SciPy.

import numpy as np
from scipy.optimize import linear_sum_assignment

def iou(boxA, boxB):
    # box: [x1, y1, x2, y2]
    xA = max(boxA[0], boxB[0])
    yA = max(boxA[1], boxB[1])
    xB = min(boxA[2], boxB[2])
    yB = min(boxA[3], boxB[3])
    
    interArea = max(0, xB - xA + 1) * max(0, yB - yA + 1)
    areaA = (boxA[2] - boxA[0] + 1) * (boxA[3] - boxA[1] + 1)
    areaB = (boxB[2] - boxB[0] + 1) * (boxB[3] - boxB[1] + 1)
    return interArea / float(areaA + areaB - interArea + 1e-6)

def bottom_center(box):
    # box: [x1, y1, x2, y2]
    x_bc = (box[0] + box[2]) / 2
    y_bc = box[3]  # bottom y
    return np.array([x_bc, y_bc])

def compute_cost_matrix(boxes_3d_proj, boxes_2d, alpha=2.0):
    num_3d = len(boxes_3d_proj)
    num_2d = len(boxes_2d)
    # Large finite cost for invalid pairs (an all-inf row would make linear_sum_assignment raise)
    cost_matrix = np.full((num_3d, num_2d), 1e6)

    for i in range(num_3d):
        for j in range(num_2d):
            iou_val = iou(boxes_3d_proj[i], boxes_2d[j])
            if iou_val >= 0.1:
                bc_3d = bottom_center(boxes_3d_proj[i])
                bc_2d = bottom_center(boxes_2d[j])
                depth_penalty = 1.0 if bc_3d[1] >= bc_2d[1] else alpha
                cost_matrix[i, j] = depth_penalty * np.linalg.norm(bc_3d - bc_2d)

    return cost_matrix

# Example
projected_3d_boxes = [[50, 100, 150, 250], [300, 150, 380, 280]]  # 3D boxes projected to 2D
detected_2d_boxes = [[55, 110, 145, 240], [310, 160, 370, 270]]

cost_mat = compute_cost_matrix(projected_3d_boxes, detected_2d_boxes)
row_idx, col_idx = linear_sum_assignment(cost_mat)

for i, j in zip(row_idx, col_idx):
    if cost_mat[i, j] < 1e6:
        print(f"3D box {i} assigned to 2D box {j} with cost {cost_mat[i, j]:.2f}")

3D MTMC with GNNs

The GNN-based tracker uses SUSHI, the model mentioned at the start, as its baseline.

The graph consists of nodes (objects) and edges (links). Edges carry information from the ReID embeddings, the 3D object pose, and optionally velocity, and edge weights are computed from cosine similarity or distance. The GNN then updates these edge relations through message passing.
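Before any message passing, the initial edges and their features have to be built from the detections. A minimal sketch of that step follows. The feature layout (cosine similarity, 3D distance, frame gap) and the `max_dist` cutoff are assumptions for illustration, and `build_edges` is a hypothetical helper rather than SUSHI's actual code.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two 1-D embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def build_edges(reid, positions, frames, max_dist=2.0):
    """Link detections from an earlier frame to a later one if they are close in 3D.

    Returns edge_index with shape (2, E) and edge features with shape (E, 3):
    [ReID cosine similarity, 3D distance, frame gap].
    """
    src, dst, feats = [], [], []
    for i in range(len(reid)):
        for j in range(len(reid)):
            if frames[i] >= frames[j]:
                continue  # only connect forward in time
            dist = float(np.linalg.norm(positions[i] - positions[j]))
            if dist > max_dist:
                continue  # crude pruning of implausible links
            src.append(i)
            dst.append(j)
            feats.append([cosine_similarity(reid[i], reid[j]), dist, frames[j] - frames[i]])
    return np.array([src, dst]), np.array(feats, dtype=float)

# Three made-up detections: 0 and 1 look alike and are close, 2 is far away.
reid = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
positions = np.array([[0.0, 0.0, 0.0], [0.5, 0.0, 0.0], [10.0, 0.0, 0.0]])
frames = [0, 1, 1]

edge_index, edge_feats = build_edges(reid, positions, frames)
print(edge_index)
print(edge_feats)
```

In a learned tracker these raw features would typically go through an MLP to produce the edge embeddings that the GNN layers update.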

Below is example code for the GNN.

import torch
import torch.nn as nn

class GNNLayer(nn.Module):
    def __init__(self, node_dim, edge_dim):
        super().__init__()
        # edge update MLP: h_ij <- h_i, h_j, h_ij
        self.edge_mlp = nn.Sequential(
            nn.Linear(node_dim * 2 + edge_dim, edge_dim),
            nn.ReLU(),
            nn.Linear(edge_dim, edge_dim)
        )
        # node update MLP: h_i <- h_i, h_ij
        self.node_mlp = nn.Sequential(
            nn.Linear(node_dim + edge_dim, node_dim),
            nn.ReLU(),
            nn.Linear(node_dim, node_dim)
        )

    def forward(self, node_feats, edge_feats, edge_index):
        # node_feats: [N, node_dim]
        # edge_feats: [E, edge_dim]
        # edge_index: [2, E] (src, dst)

        src, dst = edge_index  # edges from src → dst
        h_src = node_feats[src]
        h_dst = node_feats[dst]
        h_edge = edge_feats

        # (1) Update edge features
        h_edge_input = torch.cat([h_src, h_dst, h_edge], dim=-1)
        h_edge_updated = self.edge_mlp(h_edge_input)

        # (2) Aggregate edge features to each node
        # sum over incoming edges
        agg_msg = torch.zeros_like(node_feats)
        agg_msg.index_add_(0, dst, self.node_mlp(torch.cat([h_dst, h_edge_updated], dim=-1)))

        # (3) Update node features
        node_feats = node_feats + agg_msg  # residual connection

        return node_feats, h_edge_updated


N = 5   # num nodes (3D detections)
E = 8   # num edges
node_dim = 128
edge_dim = 64

node_feats = torch.randn(N, node_dim)
# edge_feat = concat([ReID_cos_sim, dx, dy, dz]) + learnable part
edge_feats = torch.randn(E, edge_dim)
edge_index = torch.randint(0, N, (2, E))  # [2, E]


gnn_layers = nn.ModuleList([GNNLayer(128, 64) for _ in range(3)])

for layer in gnn_layers:
    node_feats, edge_feats = layer(node_feats, edge_feats, edge_index)

The paper proposes an architecture for long-term tracking, but beyond roughly 3000 to 4000 frames there is a risk of running into memory problems. To address this, it uses graph pruning, a hierarchical GNN structure, and a global block (sliding window = 1920). (Motion information is reportedly not used.)

With this, it reportedly handles up to 3840 frames on 32 GB of VRAM.
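To get a feel for the windowing, here is a small sketch of how a long sequence could be cut into windows of 1920 frames. The `stride` parameter is a hypothetical knob, and its default (non-overlapping windows) is an assumption, since the overlap actually used is not stated in this post.

```python
def split_into_windows(num_frames, window=1920, stride=None):
    """Split the frame range [0, num_frames) into (start, end) windows.

    stride defaults to the window size, i.e. non-overlapping windows (an assumption).
    """
    if stride is None:
        stride = window
    windows = []
    start = 0
    while start < num_frames:
        end = min(start + window, num_frames)
        windows.append((start, end))
        if end == num_frames:
            break  # the last window already reaches the end of the sequence
        start += stride
    return windows

print(split_into_windows(3840))
print(split_into_windows(3840, stride=960))
```

Each window keeps the graph at a bounded size, so peak memory depends on the window length rather than on the full sequence length.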