OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

Pranav Gupta1
Rishubh Singh2,3
Pradeep Shenoy3
Ravi Kiran Sarvadevabhatla1
1International Institute of Information Technology, Hyderabad
2Swiss Federal Institute of Technology (EPFL)
3Google Research

In ECCV 2024

[Paper]

The recipe for OLAF, our plug-and-play framework for enhanced multi-object multi-part scene parsing: (1) Augment the RGB input with object-based channels (fg/bg mask, boundary edges) obtained from frozen pre-trained models (M_O, M_E). (2) Use Low-level Dense Feature guidance from the segmentation encoder (LDF, shaded green). (3) Employ targeted weight adaptation for stable optimization with the augmented input. We show that following this recipe yields significant gains (up to 4.0 mIoU) across multiple architectures and multiple challenging datasets.

Video


Abstract

Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and the number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of 3.3 (Pascal-Parts-58) and 3.5 (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is 4.0. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets.
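The weight adaptation idea can be sketched as follows. The abstract does not spell out the exact scheme, so this is an illustrative assumption: a common way to let an RGB-pretrained network accept two extra input channels is to keep the pretrained first-layer filters for the RGB slice and initialize the weights for the new channels near zero, so the augmented input barely perturbs the pretrained features at the start of training. The function name and initialization scale are hypothetical, not from the paper.

```python
import numpy as np

def adapt_first_conv(w_rgb, extra_channels=2, init_scale=1e-4, seed=None):
    """Expand a pretrained first-conv kernel from 3 to 3+extra input channels.

    w_rgb : array of shape (out_c, 3, kH, kW) -- pretrained RGB weights.
    The new input-channel weights are drawn near zero so early optimization
    stays stable. (Illustrative sketch; the paper's exact adaptation may differ.)
    """
    rng = np.random.default_rng(seed)
    out_c, _, kh, kw = w_rgb.shape
    # Near-zero init for the fg and edge channels.
    w_new = rng.normal(0.0, init_scale, size=(out_c, extra_channels, kh, kw))
    # Concatenate along the input-channel axis -> (out_c, 5, kH, kW).
    return np.concatenate([w_rgb, w_new], axis=1)
```

With this expansion, the adapted layer initially computes (almost) the same response as the original RGB model, and the extra channels are learned gradually.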


Method

Illustration of OLAF's architectural integration with FLOAT (see the Methodology section of the OLAF paper). FLOAT's components are tagged with ★. The object masks from the output S_o of the object segmentation network τ_o are merged to obtain the foreground map fg. The output of the edge generation network τ_e is thresholded and filtered using fg to obtain the edge map edge. The resulting maps are stacked with the input image I to form the 5-channel input I′ for the part segmentation network υ. The interface between LDF and the encoder E_part, along with its architecture (top right), is also shown. Similar integrations of OLAF exist for U-Net-style and Transformer-style architectures.
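The construction of the 5-channel input I′ described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name, the binary-mask representation of S_o, and the edge threshold value are ours, not from the paper.

```python
import numpy as np

def build_augmented_input(image, object_masks, edge_logits, edge_thresh=0.5):
    """Stack fg and edge channels onto an RGB image (illustrative sketch).

    image        : (H, W, 3) RGB input I
    object_masks : (N, H, W) binary per-object masks from S_o
    edge_logits  : (H, W) raw output of the edge network tau_e
    Returns I'   : (H, W, 5) augmented input for the part network.
    """
    # Merge the per-object masks into a single foreground map fg.
    fg = (object_masks.sum(axis=0) > 0).astype(image.dtype)
    # Threshold the edge output and keep only edges inside the foreground.
    edge = (edge_logits > edge_thresh).astype(image.dtype) * fg
    # Stack RGB + fg + edge along the channel axis -> 5-channel input I'.
    return np.concatenate([image, fg[..., None], edge[..., None]], axis=-1)
```

In practice the fg and edge maps come from the frozen networks τ_o and τ_e; here they are passed in as plain arrays to keep the sketch self-contained.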





Paper and Supplementary Material

P. Gupta, R. Singh, P. Shenoy, R. Sarvadevabhatla
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
In European Conference on Computer Vision (ECCV), 2024.




This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.