Wednesday, June 25, 2025

Enhanced Fine-Grained Image Segmentation of Mural Figures Using a Mamba-Based Vision Transformer

Mamba-based ViT Network for Mural Image Segmentation: A Deep Dive

Introduction

In recent years, image segmentation has seen rapid technical progress, with Vision Transformers (ViTs) gaining traction as a powerful tool for complex visual tasks. This article explores a novel Mamba-based ViT network designed specifically for mural image segmentation. Its architecture pairs a hybrid encoding layer with a multi-level pyramid spatial-channel convolution (SCconv) decoding layer, allowing the network to extract and reconstruct image features effectively and produce precise segmentation masks.

Overview of the Proposed Architecture

The overall structure of our mural image segmentation network comprises two main components: the M-ViT encoder and the Pyramid SCconv decoder.

Architecture Illustration

As depicted in Figure 1, the architecture begins with feature extraction through the M-ViT units, followed by reconstruction via the Pyramid SCconv modules.

Encoding Phase

In the encoding phase, the input image is partitioned into patches, which are flattened into a linear token sequence. The sequence then passes through four downsampling stages, each containing N M-ViT units. This hierarchical structure captures semantic features at progressively coarser scales, giving the network both fine detail and global context.
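
To make the encoding flow concrete, here is a minimal PyTorch sketch of the hierarchy described above. The embedding widths, stage depths, and the use of strided convolutions for patch embedding are illustrative assumptions, and the M-ViT units themselves are stubbed out with identity layers (they are detailed in the next section).

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Strided convolution that downsamples the image/feature map and
    flattens the result into a linear token sequence."""
    def __init__(self, in_ch, embed_dim, stride):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=3,
                              stride=stride, padding=1)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):                          # x: (B, C, H, W)
        x = self.proj(x)                           # (B, D, H/s, W/s)
        B, D, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)           # (B, H*W, D) token sequence
        return self.norm(x), H, W

class HierarchicalEncoder(nn.Module):
    """Four downsampling stages, each holding N M-ViT units (stubbed here)."""
    def __init__(self, dims=(64, 128, 256, 512), depths=(2, 2, 2, 2)):
        super().__init__()
        in_chs = (3,) + dims[:3]
        self.embeds = nn.ModuleList(
            PatchEmbed(c, d, stride=4 if i == 0 else 2)
            for i, (c, d) in enumerate(zip(in_chs, dims)))
        self.stages = nn.ModuleList(               # placeholders for M-ViT units
            nn.Sequential(*(nn.Identity() for _ in range(n))) for n in depths)

    def forward(self, x):
        feats = []
        for embed, stage in zip(self.embeds, self.stages):
            x, H, W = embed(x)                     # tokenize + downsample
            x = stage(x)                           # N M-ViT units per stage
            B, _, D = x.shape
            x = x.transpose(1, 2).reshape(B, D, H, W)
            feats.append(x)
        return feats                               # 1/4, 1/8, 1/16, 1/32 scales
```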

Decoding Phase

Once feature extraction is complete, the Pyramid SCconv decoder performs convolutional reconstruction on the intermediate feature maps from all four scales. This multi-level approach integrates information across the spatial and channel dimensions, upsampling everything to a unified scale for pixel-wise classification and, ultimately, an accurate mask output.
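
The sketch below shows the general multi-scale fusion pattern this describes; plain 1x1 convolutions stand in for the SCconv modules covered later, and the channel widths and class count (num_classes=6) are placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PyramidFusionHead(nn.Module):
    """Project each scale, upsample to the finest (1/4) resolution, fuse,
    and classify per pixel. 1x1 convs stand in for the SCconv modules."""
    def __init__(self, dims=(64, 128, 256, 512), out_dim=256, num_classes=6):
        super().__init__()
        self.reduce = nn.ModuleList(nn.Conv2d(d, out_dim, 1) for d in dims)
        self.fuse = nn.Sequential(
            nn.Conv2d(4 * out_dim, out_dim, 1),
            nn.BatchNorm2d(out_dim),
            nn.ReLU(inplace=True))
        self.classifier = nn.Conv2d(out_dim, num_classes, 1)

    def forward(self, feats):                      # feats: finest scale first
        size = feats[0].shape[2:]                  # unify at the 1/4 resolution
        ups = [F.interpolate(r(f), size=size, mode='bilinear',
                             align_corners=False)
               for r, f in zip(self.reduce, feats)]
        x = self.fuse(torch.cat(ups, dim=1))
        return self.classifier(x)                  # upsample logits for the mask
```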

M-ViT Encoder

The core of the encoder is the M-ViT unit, which performs the feature mapping essential to semantic segmentation tasks. As shown in Figure 2, each M-ViT unit comprises three modules: an Efficient Self-Attention module followed by a dual-branch structure that pairs a Mix-FFN module with a Mamba module.

Efficient Self-Attention

At each encoding stage, the Efficient Self-Attention module strengthens long-range dependencies in the feature map at reduced computational cost. By adopting a linear-complexity formulation, it achieves accuracy comparable to standard dot-product attention at a fraction of the compute.
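
The article does not give the exact formulation, but the description matches the sequence-reduction style of efficient attention used in SegFormer-like encoders. The sketch below assumes that design: keys and values are computed from a spatially reduced copy of the token grid, so the quadratic attention term shrinks by roughly the square of the reduction ratio.

```python
import torch
import torch.nn as nn

class EfficientSelfAttention(nn.Module):
    """Attention whose keys/values come from a spatially reduced copy of the
    feature map, cutting the O(N^2) cost by roughly a factor of r^2."""
    def __init__(self, dim, num_heads=8, reduction=4):
        super().__init__()
        self.sr = nn.Conv2d(dim, dim, kernel_size=reduction, stride=reduction)
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x, H, W):                    # x: (B, N, D), N = H * W
        B, N, D = x.shape
        kv = x.transpose(1, 2).reshape(B, D, H, W)
        kv = self.sr(kv).flatten(2).transpose(1, 2)  # (B, N / r^2, D)
        kv = self.norm(kv)
        out, _ = self.attn(x, kv, kv, need_weights=False)
        return out                                 # same shape as input tokens
```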

Mix-FFN and Mamba Module

Within the dual-branch structure, the Mix-FFN module incorporates positional information and strengthens nonlinear mapping through GELU activations, while the Mamba module models the serialized patch tokens with a state-space mechanism, providing efficient long-range context representation at low computational cost.
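
A common realization of Mix-FFN, and a plausible reading of the description here, places a 3x3 depthwise convolution between the two linear layers; it is this convolution that leaks positional information into the tokens. The sketch below assumes that design. The Mamba branch is omitted, since in practice it would wrap a state-space block from an external implementation such as the mamba_ssm package.

```python
import torch
import torch.nn as nn

class MixFFN(nn.Module):
    """Feed-forward block with a 3x3 depthwise convolution between the two
    linear layers; the convolution injects positional information, and GELU
    provides the nonlinearity."""
    def __init__(self, dim, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.fc1 = nn.Linear(dim, hidden)
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x, H, W):                    # x: (B, N, D), N = H * W
        x = self.fc1(x)
        B, N, C = x.shape
        x = x.transpose(1, 2).reshape(B, C, H, W)  # back to a 2-D grid
        x = self.dwconv(x).flatten(2).transpose(1, 2)
        return self.fc2(self.act(x))
```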

Pyramid SCconv Decoder

Transitioning to the decoding phase, the Pyramid SCconv decoder reconstructs high-quality segmentation results from the encoded feature maps.

Structure and Function

Each SCconv module includes two critical components: a spatially separable gating module (SSGM) that manages spatial information and a channel-separable convolution module (CSCM) that reduces channel-related redundancy. As shown in Figure 3, this arrangement supports effective multi-scale feature fusion and upsampling during decoding.

Feature Processing

When processing a two-dimensional feature map, the SSGM first applies normalization to stabilize training and improve learning. It then computes per-channel importance weights that suppress non-essential information and reduce redundancy, ultimately improving the quality of the output segmentation.
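
Taken literally, this suggests a normalization-derived gating scheme: group-normalize the map, convert the learned scale parameters into per-channel importance weights, and use them to softly gate the features. The sketch below is our simplified interpretation of the SSGM, not the paper's exact module.

```python
import torch
import torch.nn as nn

class SpatialGate(nn.Module):
    """Normalize first, then derive per-channel importance weights from the
    learned GroupNorm scales and use them as a soft gate that suppresses
    low-information responses (a simplified reading of the SSGM)."""
    def __init__(self, channels, groups=8):        # channels must divide by groups
        super().__init__()
        self.gn = nn.GroupNorm(groups, channels)

    def forward(self, x):                          # x: (B, C, H, W)
        xn = self.gn(x)
        gamma = self.gn.weight.abs()
        # Importance weights, rescaled to mean 1 so the sigmoid stays active.
        w = (gamma / gamma.sum() * gamma.numel()).view(1, -1, 1, 1)
        gate = torch.sigmoid(xn * w)               # soft suppression mask
        return x * gate
```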

Loss Function

A well-designed optimization objective is crucial for accurate segmentation. To this end, the network is trained with a composite loss function that combines Dice loss and cross-entropy loss.

Dice Loss

The Dice coefficient measures the overlap between the predicted segmentation P and the ground truth G: Dice = 2|P ∩ G| / (|P| + |G|). The corresponding loss (1 − Dice) is particularly effective at handling the class imbalance typical of segmentation tasks, since it scores each class by relative overlap rather than raw pixel counts.

Cross Entropy Loss

Complementing the Dice loss, cross-entropy loss penalizes the divergence between the predicted class probabilities and the ground-truth distribution at every pixel, anchoring overall classification accuracy.

Combined Loss Function

The final loss function combines these two terms, weighted by an empirically chosen λ (0.2 in this study), allowing the model to balance overall accuracy against fine-grained segmentation quality.
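
A sketch of this composite objective follows. The article specifies λ = 0.2 but not the exact combination rule, so L = CE + λ · Dice is an assumption; the Dice term uses the standard soft multi-class formulation.

```python
import torch
import torch.nn.functional as F

def dice_loss(logits, target, num_classes, eps=1e-6):
    """Soft multi-class Dice loss: 1 minus the mean per-class overlap."""
    probs = F.softmax(logits, dim=1)                             # (B, C, H, W)
    onehot = F.one_hot(target, num_classes).permute(0, 3, 1, 2).float()
    dims = (0, 2, 3)                                             # batch + spatial
    inter = (probs * onehot).sum(dims)
    union = probs.sum(dims) + onehot.sum(dims)
    return 1 - ((2 * inter + eps) / (union + eps)).mean()

def segmentation_loss(logits, target, num_classes, lam=0.2):
    """Composite objective; CE + lam * Dice is the assumed combination rule."""
    ce = F.cross_entropy(logits, target)
    return ce + lam * dice_loss(logits, target, num_classes)
```

For logits of shape (B, C, H, W) and an integer mask of shape (B, H, W), segmentation_loss(logits, mask, num_classes=C) yields a scalar ready for backpropagation.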


This Mamba-based ViT network exemplifies a carefully engineered approach to mural image segmentation, pairing contemporary sequence-modeling methods with a practical application in visual computing. Each component of the architecture is designed to improve segmentation performance, offering a promising direction for the ongoing challenge of fine-grained image segmentation.
