[paper review] Far3D: Expanding the Horizon for Surround-view 3D Object Detection


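Hello. In this post I review Far3D, a 3D detector that works on multi-view camera images.

Far3D was published at AAAI 24 and, at the time of writing, is SOTA on the nuScenes camera 3D detection leaderboard. It comes from a company called Megvii, which has been releasing a number of camera 3D detectors recently.

Recent camera 3D detectors fall broadly into two types: query-based methods and BEV-based methods. This paper takes the query-based route, with an architecture designed to perform well at long range too.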

Related post: [paper review] PETR review (3D detection w Cam), jaehoon-daddy.tistory.com

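First, a quick look back at PETR, which I covered in an earlier post.

PETR builds on DETR, the representative 2D detection architecture. A 2D backbone produces image features (2DF), the 2D features are inverse-projected into 3D space to obtain 3D features (3DF), and the two are aggregated through an encoding step.

The decoder then runs cross-attention between these features and learnable queries (as many as the maximum number of objects) to predict the class and bbox.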

Overview

Overall, features are extracted from the 2D images with a 2D backbone and an FPN. A separate depth network predicts a depth map, and the predictions of a YOLOX-style anchor-free head are inverse-projected into 3D space using that depth. The queries are built next: the 3D coordinates from the inverse projection go through a positional embedding, the matching 2D features and scores go through the SemEmbed module, and the two resulting features are summed to form the adaptive queries. These are concatenated with the global queries of the 3D detector head to form the full set of 3D queries. After that, [the queries are updated with self-attention, the perspective-aware aggregation step performs deformable cross-attention against the 2D features, and the output passes through an MLP.] The bracketed part is one round of modulation. It is repeated N times, after which bbox regression and classification are performed on each query.

Adaptive Query Generation

Extending the detection range to long range (about 150 m) makes the computation cost quite heavy and makes model convergence inefficient. To deal with this, the paper selects queries adaptively.

The features from the 2D backbone and FPN neck are fed into a YOLOX-style head and a depth estimation module (net), which yield 2D boxes and a depth map. Using the depth map and the 2D boxes, the 2D box centers are inverse-projected into 3D (only boxes with a score of 0.1 or higher).
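The inverse projection described above can be sketched roughly as follows, assuming a plain pinhole camera model. The function name `unproject_centers`, the intrinsics, and the detection tuples are all hypothetical and only for illustration. The camera-to-ego transform a real surround-view pipeline would also need is left out.

```python
def unproject_centers(dets, fx, fy, cx, cy, score_thr=0.1):
    """Lift 2D box centers to 3D camera coordinates using predicted depth.

    dets: list of (u, v, depth, score); u, v in pixels, depth in meters.
    Only detections with score >= score_thr are kept.
    """
    points = []
    for u, v, depth, score in dets:
        if score < score_thr:
            continue  # drop low-confidence 2D proposals
        # Pinhole model: u = fx * X / Z + cx, v = fy * Y / Z + cy
        x = (u - cx) * depth / fx
        y = (v - cy) * depth / fy
        points.append((x, y, depth))
    return points


# Hypothetical detections: (u, v, depth, score)
dets = [(900.0, 500.0, 50.0, 0.8), (100.0, 300.0, 20.0, 0.05)]
centers = unproject_centers(dets, fx=1000.0, fy=1000.0, cx=800.0, cy=450.0)
print(centers)  # the second detection is filtered out by the score threshold
```

Filtering by score before lifting keeps the number of adaptive queries small, which is the point of selecting them adaptively.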

The inverse-projected 3D centers go through a positional embedding, while the score and the 2D feature at each center are fed into SemEmbed, a module made of MLPs.

The two outputs are summed to produce the 3D adaptive queries.
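A minimal sketch of this query construction in plain Python follows. Everything here is an assumption made for illustration: the sinusoidal `pos_embed` stands in for the learned position encoder, `sem_embed` is a tiny two-layer MLP with random weights standing in for SemEmbed, and the dimensions are arbitrary.

```python
import math
import random


def pos_embed(center, num_freqs=2):
    """Sinusoidal embedding of a 3D point (stand-in for the learned encoder)."""
    out = []
    for coord in center:
        for k in range(num_freqs):
            out += [math.sin(coord / 10.0 ** k), math.cos(coord / 10.0 ** k)]
    return out  # length 3 * 2 * num_freqs


def linear(x, weight, bias):
    """y = W x + b on plain Python lists."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weight, bias)]


def init_linear(rng, n_in, n_out):
    """Random weights and zero bias for one linear layer."""
    weight = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weight, [0.0] * n_out


def sem_embed(feat, score, params):
    """Two-layer MLP over [2D feature, score]; hypothetical stand-in for SemEmbed."""
    (w1, b1), (w2, b2) = params
    hidden = [max(0.0, h) for h in linear(feat + [score], w1, b1)]  # ReLU
    return linear(hidden, w2, b2)


rng = random.Random(0)
feat_dim, embed_dim = 4, 12
params = (init_linear(rng, feat_dim + 1, 16), init_linear(rng, 16, embed_dim))

center = (5.0, 2.5, 50.0)      # inverse-projected 3D center
feat = [0.3, -0.1, 0.7, 0.2]   # 2D feature sampled at the box center
# Adaptive query = positional embedding + semantic embedding
query = [p + s for p, s in zip(pos_embed(center), sem_embed(feat, 0.8, params))]
print(len(query))
```

Both branches have to output the same dimension so that they can be summed element-wise.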

Perspective-aware Aggregation

Existing query-based approaches use a single-level feature map. The problem is that small, distant objects need high-resolution features, while the opposite case (large, nearby objects) calls for low-resolution features. To handle this varying situation, the paper uses 3D spatial deformable attention, a variant of deformable attention.

*Deformable attention reference: [CV / Detection] DETR-based Image Detectors, jaehoon-daddy.tistory.com

The 2D image features (F) are first enriched with a squeeze-and-excitation block. 3D deformable attention is then applied (PETR, by contrast, uses global attention).

In 3D deformable attention, features are sampled at the 3D reference point shifted by offsets, and those offsets are learned. The paper gives the exact formulation as an equation.
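The sampling step can be sketched roughly as below. This is a simplified illustration, not the paper's implementation: it uses one camera, one feature level, and a scalar feature map, and the offsets and attention weights are passed in directly rather than predicted from the query. All function names and values are hypothetical.

```python
import math


def project(point, fx, fy, cx, cy):
    """Project a 3D point in camera coordinates to pixel coordinates."""
    x, y, z = point
    return fx * x / z + cx, fy * y / z + cy


def bilinear(feat, u, v):
    """Bilinearly sample a 2D grid feat[row][col] at continuous (u, v)."""
    h, w = len(feat), len(feat[0])
    u = min(max(u, 0.0), w - 1.0)  # clamp to the feature-map border
    v = min(max(v, 0.0), h - 1.0)
    u0, v0 = int(math.floor(u)), int(math.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = feat[v0][u0] * (1 - du) + feat[v0][u1] * du
    bottom = feat[v1][u0] * (1 - du) + feat[v1][u1] * du
    return top * (1 - dv) + bottom * dv


def deform_sample(ref_point, offsets, weights, feat, cam):
    """Weighted sum of features sampled at projected (reference + offset) points."""
    out = 0.0
    for off, w in zip(offsets, weights):
        shifted = tuple(r + o for r, o in zip(ref_point, off))
        u, v = project(shifted, *cam)
        out += w * bilinear(feat, u, v)
    return out


# Toy feature map whose value equals the column index
feat = [[float(c) for c in range(4)] for _ in range(4)]
cam = (1.0, 1.0, 0.0, 0.0)  # hypothetical intrinsics at feature-map scale
result = deform_sample(
    ref_point=(1.0, 1.0, 1.0),
    offsets=[(0.0, 0.0, 0.0), (0.5, 0.0, 0.0)],
    weights=[0.5, 0.5],
    feat=feat,
    cam=cam,
)
print(result)  # 0.5 * 1.0 + 0.5 * 1.5 = 1.25
```

The offsets are applied in 3D before projection, so the same 3D shift covers fewer pixels for a distant reference point than for a near one.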

Range-modulated 3D Denoising

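Camera 3D detectors usually run into a lot of trouble at long range. This comes from two factors. The first is query density: at long distances a query is less likely to coincide with an actual object. The second is error propagation: a small error in 2D can be amplified when it is transformed into 3D space, and this gets worse the farther away the object is.

Because of these properties, a query near an object may be treated as noise, or the reverse may happen. Range-modulated 3D denoising is used to address this.

Depending on the value of α ∈ {0, 1}, either negative or positive queries are generated. P is the 3D center of the GT, and S is the box scale. Together these guide the offset constraint.

In short, queries are generated by adding noise to the GT objects. You can think of it as a kind of augmentation that perturbs the GT.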
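The idea of building noisy queries from GT boxes can be sketched as follows. The noise functions here are my own illustrative assumptions, not the paper's: positive queries are kept within a fraction of the box size, negative queries are pushed beyond the box extent, and a hypothetical `range_factor` widens the negative noise with distance.

```python
import random


def noisy_query(center, scale, alpha, rng, pos_ratio=0.5, neg_ratio=1.5):
    """Perturb a GT center to build one denoising query.

    alpha = 1 -> positive query: offset bounded by pos_ratio * box size.
    alpha = 0 -> negative query: offset of at least one box size, widened
                 with distance from the ego vehicle.
    All ratios and the range factor are illustrative assumptions.
    """
    dist = (center[0] ** 2 + center[1] ** 2) ** 0.5
    range_factor = 1.0 + dist / 100.0  # hypothetical range modulation
    out = []
    for c, s in zip(center, scale):
        if alpha == 1:
            offset = rng.uniform(-1.0, 1.0) * pos_ratio * s
        else:
            sign = rng.choice((-1.0, 1.0))
            offset = sign * rng.uniform(1.0, neg_ratio) * s * range_factor
        out.append(c + offset)
    return tuple(out)


rng = random.Random(0)
gt_center = (60.0, 80.0, 0.0)  # P: GT 3D center, 100 m from the ego vehicle
gt_scale = (4.0, 2.0, 1.5)     # S: box scale (length, width, height)
pos = noisy_query(gt_center, gt_scale, alpha=1, rng=rng)
neg = noisy_query(gt_center, gt_scale, alpha=0, rng=rng)
print(pos, neg)
```

During training the positive queries would be supervised to recover the GT box and the negative ones to predict background.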

Experiments

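These are the results on the Argoverse 2 (Argo) val set. Looking at mAP, it outperforms several LiDAR detectors.

These are the results on the nuScenes dataset.

As for implementation details, a ViT-Large pretrained on Objects365 and COCO is used as a backbone, and the FPN uses four levels of feature maps. VoVNet-99 pretrained with FCOS3D is also used as a backbone (it is a backbone, not the FPN, and it is the one that appears as img_backbone in the config below). The optimizer is AdamW with a weight decay of 0.01. The batch size is 8 and the lr is 2e-4.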
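Written as an mmdetection-style config fragment, the training settings quoted above would look roughly like this. It is a sketch only: the `total_batch_size` name is hypothetical, and how the batch of 8 is split across GPUs is not stated here.

```python
# Optimizer settings quoted above, in mmdetection-style dict form
optimizer = dict(
    type='AdamW',
    lr=2e-4,
    weight_decay=0.01,
)

# Hypothetical variable name; the per-GPU split is not specified
total_batch_size = 8
```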

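The reported results actually use temporal features. The paper skips the details and says that StreamPETR is used as the baseline. I will look at StreamPETR in a later post.

This is the config from the code. Reading it makes the structure easier to grasp.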

    img_backbone=dict(
        type='VoVNet', ###use checkpoint to save memory
        spec_name='V-99-eSE',
        norm_eval=True,
        frozen_stages=-1,
        input_ch=3,
        out_features=('stage2','stage3','stage4','stage5',)),
    img_neck=dict(
        type='FPN',  ###remove unused parameters 
        start_level=1,
        add_extra_convs='on_output',
        relu_before_extra_convs=True,
        in_channels=[256, 512, 768, 1024],
        out_channels=256,
        num_outs=4),
    img_roi_head=dict(
        type='YOLOXHeadCustom',
        num_classes=26,
        in_channels=256,
        strides=[8, 16, 32, 64],
        train_cfg=dict(assigner=dict(type='SimOTAAssigner', center_radius=2.5)),
        test_cfg=dict(score_thr=0.01, nms=dict(type='nms', iou_threshold=0.65)),
        pred_with_depth=True,
        depthnet_config=depthnet_config,
        reg_depth_level='p3',
        pred_depth_var=False,    # note 2d depth uncertainty
        loss_depth2d=dict(type='L1Loss', loss_weight=1.0),
        sample_with_score=True,  # note threshold
        threshold_score=0.1,
        topk_proposal=None,
        return_context_feat=True,
    ),
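To make the `strides=[8, 16, 32, 64]` and `num_outs=4` settings above more concrete, here is a small sketch that computes the spatial size of each FPN level. The helper name and the input resolution are hypothetical, and the input is assumed to be divisible by the largest stride.

```python
def fpn_feature_sizes(height, width, strides=(8, 16, 32, 64)):
    """Spatial size (H, W) of each FPN output level for a given input size."""
    return [(height // s, width // s) for s in strides]


# Hypothetical input resolution, divisible by the largest stride
sizes = fpn_feature_sizes(640, 960)
for stride, (h, w) in zip((8, 16, 32, 64), sizes):
    print(f"stride {stride:2d}: {h} x {w}")
```

The stride-8 level is the high-resolution map that small distant objects rely on, while the stride-64 level covers large nearby ones.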

The code below uses the config above to get the backbone, neck, and depth map features.

    def extract_img_feat(self, img, return_depth=False):
        """Extract multi-level image features, and optionally depth maps."""
        # Nothing to extract without images; check before touching img.
        if img is None:
            return None
        B = img.size(0)

        if img.dim() == 6:
            # Merge dims 1 and 2 so the tensor becomes 5-D.
            img = img.flatten(1, 2)
        if img.dim() == 5:
            # Fold the camera axis into the batch axis:
            # (B, N, C, H, W) -> (B * N, C, H, W).
            img = img.flatten(0, 1)
        if self.use_grid_mask:
            img = self.grid_mask(img)

        img_feats = self.img_backbone(img)
        if isinstance(img_feats, dict):
            img_feats = list(img_feats.values())
        if self.with_img_neck:
            img_feats = self.img_neck(img_feats)

        # Restore the per-camera axis on the selected feature levels.
        img_feats_reshaped = []
        for i in self.position_level:
            BN, C, H, W = img_feats[i].size()
            img_feats_reshaped.append(img_feats[i].view(B, BN // B, C, H, W))

        if not return_depth:
            return img_feats_reshaped
        # depths stays None when the model has no depth branch.
        depths = None
        if self.depth_branch is not None:
            depths = self.depth_branch(img_feats_reshaped)
        return img_feats_reshaped, depths
