OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing

Pranav Gupta1
Rishubh Singh2,3
Pradeep Shenoy3
Ravi Kiran Sarvadevabhatla1
1International Institute of Information Technology, Hyderabad
2Swiss Federal Institute of Technology (EPFL)
3Google Research

In ECCV 2024

[Paper]

The recipe for OLAF, our plug-and-play framework for enhanced multi-object multi-part scene parsing: (1) Augment the RGB input with object-based channels (fg/bg mask, boundary edges) obtained from frozen pre-trained models (M_O, M_E). (2) Use Low-level Dense Feature guidance from the segmentation encoder (LDF, shaded green). (3) Employ targeted weight adaptation for stable optimization with the augmented input. We show that following this recipe yields significant gains (up to 4.0 mIoU) across multiple architectures and multiple challenging datasets.

Video


Abstract

Multi-object multi-part scene segmentation is a challenging task whose complexity scales exponentially with part granularity and the number of scene objects. To address the task, we propose a plug-and-play approach termed OLAF. First, we augment the input (RGB) with channels containing object-based structural cues (fg/bg mask, boundary edge mask). We propose a weight adaptation technique which enables regular (RGB) pre-trained models to process the augmented (5-channel) input in a stable manner during optimization. In addition, we introduce an encoder module termed LDF to provide low-level dense feature guidance. This assists segmentation, particularly for smaller parts. OLAF enables significant mIoU gains of 3.3 (Pascal-Parts-58) and 3.5 (Pascal-Parts-108) over the SOTA model. On the most challenging variant (Pascal-Parts-201), the gain is 4.0. Experimentally, we show that OLAF's broad applicability enables gains across multiple architectures (CNN, U-Net, Transformer) and datasets.
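The weight adaptation idea can be sketched as follows. The abstract does not spell out the exact scheme, so this is an illustrative assumption: a common way to let an RGB-pretrained network accept two extra input channels is to keep the pretrained first-layer filters for the RGB slice and initialize the weights for the new channels near zero, so the augmented input barely perturbs the pretrained features at the start of training. The function name and initialization scale are hypothetical, not from the paper.

```python
import numpy as np

def adapt_first_conv(w_rgb, extra_channels=2, init_scale=1e-4, seed=None):
    """Expand a pretrained first-conv kernel from 3 to 3+extra input channels.

    w_rgb : array of shape (out_c, 3, kH, kW) -- pretrained RGB weights.
    The new input-channel weights are drawn near zero so early optimization
    stays stable. (Illustrative sketch; the paper's exact adaptation may differ.)
    """
    rng = np.random.default_rng(seed)
    out_c, _, kh, kw = w_rgb.shape
    # Near-zero init for the fg and edge channels.
    w_new = rng.normal(0.0, init_scale, size=(out_c, extra_channels, kh, kw))
    # Concatenate along the input-channel axis -> (out_c, 5, kH, kW).
    return np.concatenate([w_rgb, w_new], axis=1)
```

With this expansion, the adapted layer initially computes (almost) the same response as the original RGB model, and the extra channels are learned gradually.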


Method

Illustration of OLAF's architectural integration with FLOAT (see the Methodology section of the OLAF paper). FLOAT's components are tagged with ★. The object masks from the output S_o of the object segmentation network τ_o are merged to obtain the foreground map fg. The output of the edge generation network τ_e is thresholded and filtered using fg to obtain the edge map edge. The resulting maps are stacked with the input image I to form the 5-channel input I′ for the part segmentation network υ. The interface between LDF and the encoder E_part, along with its architecture (top right), is also shown. Similar integrations of OLAF exist for U-Net-style and Transformer-style architectures.
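The construction of the 5-channel input I′ described above can be sketched with NumPy. This is a minimal illustration under stated assumptions: the function name, the binary-mask representation of S_o, and the edge threshold value are ours, not from the paper.

```python
import numpy as np

def build_augmented_input(image, object_masks, edge_logits, edge_thresh=0.5):
    """Stack fg and edge channels onto an RGB image (illustrative sketch).

    image        : (H, W, 3) RGB input I
    object_masks : (N, H, W) binary per-object masks from S_o
    edge_logits  : (H, W) raw output of the edge network tau_e
    Returns I'   : (H, W, 5) augmented input for the part network.
    """
    # Merge the per-object masks into a single foreground map fg.
    fg = (object_masks.sum(axis=0) > 0).astype(image.dtype)
    # Threshold the edge output and keep only edges inside the foreground.
    edge = (edge_logits > edge_thresh).astype(image.dtype) * fg
    # Stack RGB + fg + edge along the channel axis -> 5-channel input I'.
    return np.concatenate([image, fg[..., None], edge[..., None]], axis=-1)
```

In practice the fg and edge maps come from the frozen networks τ_o and τ_e; here they are passed in as plain arrays to keep the sketch self-contained.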





Paper and Supplementary Material

P. Gupta, R. Singh, P. Shenoy, R. Sarvadevabhatla
OLAF: A Plug-and-Play Framework for Enhanced Multi-object Multi-part Scene Parsing
In European Conference on Computer Vision (ECCV), 2024.




This template was originally made by Phillip Isola and Richard Zhang for a colorful ECCV project; the code can be found here.