Boostcamp AI Tech 2nd Cohort CV - Advanced Object Detection

주녁:-) 2022. 4. 5. 11:59

Advanced Object Detection 1

1. Cascade RCNN

1.1 Contribution

1.2 Motivation

Models trained with different IoU thresholds produce different results
The higher the input IoU, the better a model trained with a high IoU threshold performs

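Models trained with different IoU thresholds produce different results
For overall AP, the model trained with an IoU threshold of 0.5 performs best
However, as the IoU threshold of the AP metric rises (e.g. AP70, AP90), the models trained with IoU thresholds of 0.6 and 0.7 perform better
The range of box IoUs a model can handle depends on the IoU threshold it was trained with
As the graphs show, high quality detection requires training with a higher IoU threshold
However, simply raising the threshold lowers performance
Cascade RCNN is proposed to solve this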
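
To make the role of the training IoU threshold concrete, the following is a minimal sketch of how the threshold decides which proposals count as positive samples. The box format (x1, y1, x2, y2), the helper names, and the sample boxes are assumptions made for illustration; they do not come from the lecture.

```python
def iou(box_a, box_b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def count_positives(proposals, gt_box, threshold):
    """Number of proposals whose IoU with the ground truth reaches the threshold."""
    return sum(1 for p in proposals if iou(p, gt_box) >= threshold)


# Hypothetical ground truth and proposals shifted further and further to the right.
gt = (0, 0, 10, 10)
proposals = [(0, 0, 10, 10), (1, 0, 11, 10), (2, 0, 12, 10), (3, 0, 13, 10), (5, 0, 15, 10)]

for t in (0.5, 0.6, 0.7):
    print(t, count_positives(proposals, gt, t))
```

Raising the threshold leaves fewer positive samples from the same proposals, which is one way to read the performance drop noted above.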

1.3 Method

B0 is obtained from the RPN, projected onto the feature map, and passed through a head to obtain B1.

B1 is then projected in the same way to obtain B2, and this process repeats.
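
The B0 → B1 → B2 flow can be sketched as a loop over stages. This is a toy illustration, not the actual model: `toy_head` is a hypothetical stand-in for a learned box regressor (it simply moves each box halfway toward the ground truth), and the per-stage thresholds 0.5, 0.6, 0.7 are assumed here, chosen to match the thresholds discussed in the motivation.

```python
def iou(a, b):
    """IoU of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def toy_head(box, gt, step=0.5):
    """Hypothetical regressor: move every coordinate a fraction of the way to gt."""
    return tuple(c + step * (g - c) for c, g in zip(box, gt))


def cascade(b0, gt, thresholds=(0.5, 0.6, 0.7)):
    """Run the stages in order; each stage counts positives at its own
    threshold, then refines the boxes that feed the next stage."""
    boxes = list(b0)
    history = []
    for t in thresholds:
        positives = sum(1 for b in boxes if iou(b, gt) >= t)
        history.append((t, positives))
        boxes = [toy_head(b, gt) for b in boxes]  # B_k -> B_{k+1}
    return boxes, history


gt = (0, 0, 10, 10)
b0 = [(3, 0, 13, 10), (2, 0, 12, 10), (4, 0, 14, 10)]  # hypothetical RPN output
final_boxes, history = cascade(b0, gt)
print(history)
```

Because each stage receives boxes already refined by the previous one, the later stages still see enough positives even though their thresholds are higher.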

1.4 Result

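Shows that repeating box pooling improves performance (iterative)
Shows that repeating classifiers with different IoU thresholds improves performance (integral)
Shows that stacking RoI heads with different IoU thresholds as a cascade improves performance (cascade)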

2. Deformable Convolutional Networks(DCN)

2.1 Contribution

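Problems with CNNs

Convolutional neural networks sample in a fixed pattern, which limits how well they handle geometric transformations

Existing solutions

Geometric augmentation
Geometric invariant feature engineering

Proposed module

Deformable convolution

An offset is added to each sampling location of the convolution filter before the output is computed.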
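
As a rough sketch of the idea, the code below computes a single 3x3 deformable convolution output on a one-channel image. Each of the nine kernel taps reads the input at its regular grid position plus a (dy, dx) offset, using bilinear interpolation for fractional positions. The function names and the plain-list image format are assumptions for illustration; in the real module the offsets are predicted by an extra convolution layer, which is omitted here.

```python
import math


def bilinear(img, y, x):
    """Bilinearly sample img (list of rows) at fractional (y, x); zero outside."""
    h, w = len(img), len(img[0])
    y0, x0 = math.floor(y), math.floor(x)
    value = 0.0
    for yy in (y0, y0 + 1):
        for xx in (x0, x0 + 1):
            if 0 <= yy < h and 0 <= xx < w:
                weight = (1 - abs(y - yy)) * (1 - abs(x - xx))
                value += weight * img[yy][xx]
    return value


def deform_conv_at(img, kernel, cy, cx, offsets):
    """3x3 deformable convolution output centered at (cy, cx).

    offsets is a list of nine (dy, dx) pairs, one per kernel tap in
    row-major order. All-zero offsets give an ordinary convolution.
    """
    out = 0.0
    k = 0
    for i in (-1, 0, 1):
        for j in (-1, 0, 1):
            dy, dx = offsets[k]
            out += kernel[i + 1][j + 1] * bilinear(img, cy + i + dy, cx + j + dx)
            k += 1
    return out


img = [[float(y * 5 + x) for x in range(5)] for y in range(5)]
ones = [[1.0] * 3 for _ in range(3)]
regular = deform_conv_at(img, ones, 2, 2, [(0.0, 0.0)] * 9)
shifted = deform_conv_at(img, ones, 2, 2, [(0.0, 0.5)] * 9)
```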

2.2 Results

3. Transformer

3.1 Overview

Transformer

Solved the long-range dependency problem in NLP, and is now applied to vision as well.
Vision Transformer(ViT)
End-to-End Object Detection with Transformers(DETR)
Swin Transformer

Self Attention
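
For reference, a minimal sketch of scaled dot-product self-attention follows. It is a simplification and an assumption of this note: the query, key, and value projections are left as the identity, whereas a real Transformer layer learns separate weight matrices for each and uses multiple heads.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """x: list of token vectors. Returns (outputs, attention weights).

    Q = K = V = x here (identity projections, for illustration only).
    """
    d = len(x[0])
    scores = [
        [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in x]
        for q in x
    ]
    weights = [softmax(row) for row in scores]
    outputs = [
        [sum(w * v[c] for w, v in zip(w_row, x)) for c in range(d)]
        for w_row in weights
    ]
    return outputs, weights


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Every token attends to every other token, which is what gives the long-range dependency handling mentioned above, and also what makes the cost grow quadratically with the number of tokens.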

ViT Overview

Flatten 3D to 2D (split the image into patches)
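
A minimal sketch of this step, assuming the image is a nested list of shape (H, W, C) and that H and W are divisible by the patch size. The function name `patchify` is hypothetical.

```python
def patchify(image, patch):
    """Split an (H, W, C) image into non-overlapping patch x patch tiles and
    flatten each tile into one vector of length patch * patch * C."""
    h, w = len(image), len(image[0])
    if h % patch or w % patch:
        raise ValueError("image size must be divisible by the patch size")
    patches = []
    for py in range(0, h, patch):
        for px in range(0, w, patch):
            flat = []
            for y in range(py, py + patch):
                for x in range(px, px + patch):
                    flat.extend(image[y][x])
            patches.append(flat)
    return patches


# 4 x 4 RGB image where every pixel stores its own (y, x, 0) coordinates.
image = [[[y, x, 0] for x in range(4)] for y in range(4)]
patches = patchify(image, 2)
```

The 3D input (H, W, C) becomes a 2D array of shape (N, P·P·C), where N is the number of patches.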

Apply a learnable embedding

Add class embedding, position embedding

A class embedding ([CLS] token) is prepended to the embeddings produced above
A position embedding is added so the model can learn from each patch's position in the image
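
The embedding steps above can be sketched as follows. In ViT the projection weight, the [CLS] token, and the position embeddings are all learned parameters; here they are fixed example values, and the function names are assumptions for illustration.

```python
def linear_embed(patches, weight):
    """Project each flattened patch (length P) to dimension D using a
    P x D weight matrix (bias omitted for brevity)."""
    dim = len(weight[0])
    return [
        [sum(p * w_row[d] for p, w_row in zip(patch, weight)) for d in range(dim)]
        for patch in patches
    ]


def add_cls_and_position(embeddings, cls_token, pos_embeddings):
    """Prepend the [CLS] token, then add one position embedding per token."""
    tokens = [cls_token] + embeddings
    if len(tokens) != len(pos_embeddings):
        raise ValueError("need one position embedding per token, including [CLS]")
    return [
        [t + p for t, p in zip(tok, pos)]
        for tok, pos in zip(tokens, pos_embeddings)
    ]


patches = [[1.0, 2.0], [3.0, 4.0]]            # N = 2 patches, P = 2
weight = [[1.0, 0.0, 1.0], [0.0, 1.0, 1.0]]   # P x D with D = 3
cls_token = [0.0, 0.0, 0.0]
pos = [[10.0] * 3, [20.0] * 3, [30.0] * 3]    # N + 1 position embeddings

embedded = linear_embed(patches, weight)
tokens = add_cls_and_position(embedded, cls_token, pos)
```

The result has N + 1 tokens; the first one is the [CLS] token that the MLP head later reads.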

Transformer

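Embedding: the input to the Transformer

Predict

The class embedding vector is fed into the MLP head to produce the final result

Problems with ViT

The experiments in the ViT paper show that it needs a very large amount of training data to perform well
Transformers are inherently expensive to compute
It is generally hard to use as a backbone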

3.2 End-to-End Object Detection with Transformers (DETR)

Contribution

First to apply the Transformer to object detection
Uses the Transformer to remove the hand-crafted post-processing step of existing object detectors

Architecture

224 x 224 input image
7 x 7 feature map size
The 49 feature vectors are used as the encoder input
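
The numbers above follow from simple arithmetic, sketched here under the assumption that the CNN backbone has a total stride of 32 (the stride is not stated in the notes, but it is what turns 224 into 7). The helper names are hypothetical.

```python
def encoder_sequence_length(image_size, stride=32):
    """Feature-map side length and number of encoder tokens for a square image."""
    side = image_size // stride
    return side, side * side


def flatten_feature_map(feature_map):
    """Turn an (H, W, C) feature map into a sequence of H * W vectors."""
    return [vector for row in feature_map for vector in row]


side, length = encoder_sequence_length(224)
feature_map = [[[0.0] * 4 for _ in range(side)] for _ in range(side)]  # C = 4 as a placeholder
sequence = flatten_feature_map(feature_map)
print(side, length, len(sequence))
```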

Train

The ground truth is padded with "no object" entries to make up for the objects it lacks
The ground truth and the predictions are therefore matched N:N
Each of the N predictions is unique, so no post-processing is needed
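
A minimal sketch of the padding and one-to-one matching described above. DETR uses the Hungarian algorithm for the matching; this toy version instead brute-forces every permutation, which is only practical for very small N. The cost function (class mismatch plus L1 box distance) is a simplification assumed for illustration, and the names are hypothetical.

```python
from itertools import permutations

NO_OBJECT = None


def pad_targets(targets, n):
    """Pad the ground truth with 'no object' entries up to n slots."""
    return targets + [NO_OBJECT] * (n - len(targets))


def pair_cost(pred, target):
    """Simplified matching cost; matching to 'no object' costs nothing."""
    if target is NO_OBJECT:
        return 0.0
    pred_label, pred_box = pred
    tgt_label, tgt_box = target
    class_cost = 0.0 if pred_label == tgt_label else 1.0
    box_cost = sum(abs(p - t) for p, t in zip(pred_box, tgt_box))
    return class_cost + box_cost


def match(preds, targets):
    """Return, for each prediction, the single target it is assigned to."""
    n = len(preds)
    padded = pad_targets(targets, n)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(pair_cost(preds[i], padded[perm[i]]) for i in range(n)),
    )
    return [padded[j] for j in best]


preds = [
    ("cat", (0.5, 0.5, 0.2, 0.2)),
    ("dog", (0.1, 0.1, 0.1, 0.1)),
    ("cat", (0.9, 0.9, 0.1, 0.1)),
    ("dog", (0.3, 0.3, 0.3, 0.3)),
]
targets = [("dog", (0.1, 0.1, 0.1, 0.1)), ("cat", (0.5, 0.5, 0.2, 0.2))]
assigned = match(preds, targets)
```

Since every ground-truth object is assigned to exactly one prediction, duplicate detections are penalized during training rather than filtered afterwards.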

3.3 Swin Transformer

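Problems with ViT

The experiments in the ViT paper show that it needs a very large amount of training data to perform well
Transformers are inherently expensive to compute
It is generally hard to use as a backbone

Solution

Designed with a CNN-like structure
Uses the concept of a window to reduce the cost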
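
To see how much the window reduces the cost, the sketch below evaluates the complexity formulas given in the Swin Transformer paper: 4hwC² + 2(hw)²C for global multi-head self-attention and 4hwC² + 2M²hwC for window attention, where h × w is the number of patches, C the channel dimension, and M the window size. The example values (h = w = 56, C = 96, M = 7) are chosen for illustration.

```python
def msa_flops(h, w, c):
    """Global multi-head self-attention: quadratic in the number of patches h * w."""
    return 4 * h * w * c ** 2 + 2 * (h * w) ** 2 * c


def wmsa_flops(h, w, c, m):
    """Window attention: linear in h * w for a fixed window size m."""
    return 4 * h * w * c ** 2 + 2 * m ** 2 * h * w * c


h, w, c, m = 56, 56, 96, 7
global_cost = msa_flops(h, w, c)
window_cost = wmsa_flops(h, w, c, m)
print(global_cost, window_cost, round(global_cost / window_cost, 1))
```

The two formulas differ only in the attention term, which shrinks by a factor of hw / M².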

Architecture

Patch Partitioning

Linear Embedding

Swin Transformer Block

Whereas ViT uses Multi-Head Attention, Swin Transformer uses W-MSA (Window Multi-head Self Attention) and SW-MSA (Shifted Window Multi-head Self Attention).

Window Multi-Head Attention

Shifted Window Multi-Head Attention
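
A minimal sketch of the two partitioning schemes, using a grid of token ids instead of feature vectors. Attention would then be computed inside each window only. The shift is implemented as a cyclic roll by half the window size, as in the Swin Transformer paper; the function names are assumptions, and the attention masking that the real model applies after the roll is omitted.

```python
def window_partition(grid, m):
    """Split an H x W grid into non-overlapping m x m windows (row-major)."""
    h, w = len(grid), len(grid[0])
    windows = []
    for wy in range(0, h, m):
        for wx in range(0, w, m):
            windows.append(
                [grid[y][x] for y in range(wy, wy + m) for x in range(wx, wx + m)]
            )
    return windows


def cyclic_shift(grid, s):
    """Roll the grid up and left by s positions, wrapping around the edges."""
    h, w = len(grid), len(grid[0])
    return [[grid[(y + s) % h][(x + s) % w] for x in range(w)] for y in range(h)]


grid = [[y * 4 + x for x in range(4)] for y in range(4)]  # token ids 0..15
m = 2
regular_windows = window_partition(grid, m)                         # W-MSA
shifted_windows = window_partition(cyclic_shift(grid, m // 2), m)   # SW-MSA
print(regular_windows[0], shifted_windows[0])
```

In this example the first shifted window gathers one token from each of the four regular windows, which is how information passes between windows from one block to the next.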
