Overall Architecture of the HFMN
The High-Frequency Mamba Network (HFMN) is designed for image super-resolution (SR). The proposed model is structured into three essential modules: a shallow feature extraction module, a deep feature extraction module, and a reconstruction module. Together, these stages transform a low-resolution (LR) input into a high-quality, super-resolved image.
Shallow Feature Extraction Module
In the initial phase, the model applies a standard 3×3 convolution layer to perform shallow feature extraction on the input image. This operation is straightforward yet effective in capturing basic features and local patterns present within the image. Mathematically, the shallow feature extraction can be represented as:
\[
X_{SF} = H_{SF}(I_{LR})
\]
Here, \(I_{LR}\) denotes the input low-resolution image, while \(H_{SF}\) indicates the shallow feature extraction operation. The output, \(X_{SF}\), encapsulates the shallow features that are subsequently fed into the deeper layers.
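Since \(H_{SF}\) is a single convolution, it is easy to sketch in code. Below is a minimal PyTorch illustration; the feature width of 64 channels is an assumed value, not one specified here:

```python
import torch
import torch.nn as nn

# H_SF: a single 3x3 convolution mapping the RGB input I_LR to an
# n_feats-channel feature map X_SF (n_feats = 64 is an assumption).
shallow_extractor = nn.Conv2d(in_channels=3, out_channels=64,
                              kernel_size=3, padding=1)

i_lr = torch.randn(1, 3, 64, 64)   # dummy low-resolution input
x_sf = shallow_extractor(i_lr)     # X_SF keeps the spatial size of I_LR
print(x_sf.shape)                  # torch.Size([1, 64, 64, 64])
```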
Deep Feature Extraction Module
Upon extracting shallow features, the model feeds them into the deep feature extraction module. This module consists of a series of cascaded CNN-Mamba Fusion Groups (CMFGs), each made up of multiple CNN-Mamba Fusion Blocks (CMFBs).
This approach allows for effective dual-branch fusion of local and global information, vital for capturing the intricate details and overall context of the image. The deep feature extraction can be formally articulated as:
\[
X_{DF} = H_{DF}(X_{SF}) + X_{SF}
\]
Here, \(H_{DF}\) denotes the deep feature extraction operation; the residual connection adds the shallow features back to yield the deep features \(X_{DF}\).
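A minimal PyTorch sketch of this stage follows; the group count is an assumption, and the `CMFG` placeholder stands in for the fusion groups detailed in the following sections:

```python
import torch
import torch.nn as nn

class CMFG(nn.Module):
    """Placeholder for a CNN-Mamba Fusion Group: in the full model this
    stacks several CMFBs; here a single conv plus residual stands in."""
    def __init__(self, n_feats):
        super().__init__()
        self.body = nn.Conv2d(n_feats, n_feats, 3, padding=1)

    def forward(self, x):
        return self.body(x) + x

class DeepFeatureExtractor(nn.Module):
    """H_DF: cascaded CMFGs with a long residual from X_SF
    (n_groups = 4 is an assumed count)."""
    def __init__(self, n_feats=64, n_groups=4):
        super().__init__()
        self.groups = nn.Sequential(*[CMFG(n_feats) for _ in range(n_groups)])

    def forward(self, x_sf):
        # X_DF = H_DF(X_SF) + X_SF
        return self.groups(x_sf) + x_sf
```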
Reconstruction Module
Once the deep features have been formed, they are passed to the reconstruction module, which generates the final super-resolved image. The reconstruction operation, typically performed via pixel shuffle layers, is denoted as follows:
\[
I_{SR} = H_{RE}(X_{DF})
\]
In this representation, \(I_{SR}\) is the output super-resolved image, while \(H_{RE}\) characterizes the reconstruction process.
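A typical pixel-shuffle reconstruction head can be sketched as follows; the channel count and the final 3×3 projection to RGB are assumptions about details not stated above:

```python
import torch
import torch.nn as nn

# H_RE: expand channels by scale^2, rearrange with PixelShuffle, then
# project the features back to a 3-channel image.
def make_reconstructor(n_feats=64, scale=4):
    return nn.Sequential(
        nn.Conv2d(n_feats, n_feats * scale ** 2, 3, padding=1),
        nn.PixelShuffle(scale),               # (C*s^2, H, W) -> (C, s*H, s*W)
        nn.Conv2d(n_feats, 3, 3, padding=1),  # features -> I_SR
    )

x_df = torch.randn(1, 64, 64, 64)
i_sr = make_reconstructor()(x_df)
print(i_sr.shape)  # torch.Size([1, 3, 256, 256])
```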
The overall architecture of HFMN efficiently integrates these modular components, producing a robust framework for achieving high-quality images through super-resolution techniques.
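Putting the three stages together, the end-to-end data flow reduces to a short composition. In this sketch each stage is injected as an arbitrary shape-compatible module, so the class only fixes the wiring:

```python
import torch.nn as nn

class HFMN(nn.Module):
    """Top-level pipeline: shallow extraction, deep extraction (which
    already contains the long skip per the formula above), and
    reconstruction."""
    def __init__(self, h_sf, h_df, h_re):
        super().__init__()
        self.h_sf, self.h_df, self.h_re = h_sf, h_df, h_re

    def forward(self, i_lr):
        x_sf = self.h_sf(i_lr)   # X_SF
        x_df = self.h_df(x_sf)   # X_DF
        return self.h_re(x_df)   # I_SR
```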
CNN-Mamba Fusion Block
In the domain of super-resolution, traditional methods have predominantly relied on Convolutional Neural Networks (CNNs) or Transformer-based architectures. While CNNs excel at capturing local high-frequency details, they struggle to model the broader, global relationships in images. Conversely, Transformers model long-range relationships well, but their self-attention incurs substantial computational and memory costs on high-resolution inputs.
To bridge this gap, the HFMN introduces the CNN-Mamba Fusion Block (CMFB), a dual-branch fusion module that integrates local and global representations. Its global branch builds on the Mamba architecture, a state space model with linear computational complexity, which yields a significant increase in inference speed while maintaining performance comparable to traditional methods.
The CMFB particularly stands out due to its capability to perform local feature extraction through Local High-frequency Feature Blocks (LHFB) while executing global modeling via Mamba-based Attention Blocks (MAB). The outputs from both branches are subsequently fused through a Dual-information Interactive Attention Block (DIAB) to produce refined features suitable for reconstruction.
The operations of this innovative module can be represented as:
\[
\begin{array}{c}
X_{DIAB} = f(X_{LHFB}, X_{MAB}) \\
X_{out} = Conv_{3 \times 3}(X_{LHFB} + X_{MAB}) + X_{in}
\end{array}
\]
Here, \(f(\cdot)\) denotes the interactive attention mechanism of the DIAB, while the final 3×3 convolution further refines the fused features before the block-level residual connection.
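Read as code, the block runs the two branches in parallel, lets the DIAB exchange information between them, and closes with a convolution and residual. A sketch, with LHFB, MAB, and DIAB injected as the modules described below (how the DIAB output feeds the final sum is one reading of the equations above):

```python
import torch
import torch.nn as nn

class CMFB(nn.Module):
    """Sketch of a CNN-Mamba Fusion Block; lhfb/mab are shape-preserving
    modules and diab takes and returns the two branch tensors."""
    def __init__(self, n_feats, lhfb, mab, diab):
        super().__init__()
        self.lhfb, self.mab, self.diab = lhfb, mab, diab
        self.conv = nn.Conv2d(n_feats, n_feats, 3, padding=1)

    def forward(self, x_in):
        x_lhfb = self.lhfb(x_in)                   # local high-frequency branch
        x_mab = self.mab(x_in)                     # global Mamba branch
        x_lhfb, x_mab = self.diab(x_lhfb, x_mab)   # X_DIAB = f(X_LHFB, X_MAB)
        # X_out = Conv_3x3(X_LHFB + X_MAB) + X_in
        return self.conv(x_lhfb + x_mab) + x_in

class PassThrough(nn.Module):
    """Stand-in DIAB for a quick shape test."""
    def forward(self, a, b):
        return a, b

blk = CMFB(64, nn.Identity(), nn.Identity(), PassThrough())
print(blk(torch.randn(1, 64, 32, 32)).shape)  # torch.Size([1, 64, 32, 32])
```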
Dual-information Interactive Attention Block
To bolster interactions between local and global information, the HFMN incorporates an interactive fusion mechanism inspired by traditional self-attention approaches. This mechanism computes correlations between the query and key representations, enabling the model to refine features effectively across different domains.
The interaction between the outputs from both branches of the CMFB can be illustrated as:
\[
\begin{array}{c}
Attention_{1} = Softmax\left(\frac{\widehat{Q}_{C}\widehat{K}_{M}^{T}}{\sqrt{d}}\right) \cdot \widehat{V}_{M} = X_{C} \\
Attention_{2} = Softmax\left(\frac{\widehat{Q}_{M}\widehat{K}_{C}^{T}}{\sqrt{d}}\right) \cdot \widehat{V}_{C} = X_{M}
\end{array}
\]
Here, \(Attention_{1}\) extracts the high-frequency features processed through the LHFB, while \(Attention_{2}\) emphasizes the global contextual information from the MAB. The concatenation of both outputs further enhances the feature representation, supporting comprehensive image reconstruction.
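The two cross-attentions can be sketched directly from the formulas. The 1×1-convolution QKV projections, single-head attention, and flattening of the spatial grid into tokens are assumptions about details not given above (full spatial attention is quadratic in H·W, so a real implementation would likely window or downsample):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DIAB(nn.Module):
    """Sketch of the Dual-information Interactive Attention Block:
    queries are exchanged between the CNN (C) and Mamba (M) branches."""
    def __init__(self, dim):
        super().__init__()
        self.qkv_c = nn.Conv2d(dim, dim * 3, 1)  # Q_C, K_C, V_C
        self.qkv_m = nn.Conv2d(dim, dim * 3, 1)  # Q_M, K_M, V_M

    def forward(self, x_lhfb, x_mab):
        b, c, h, w = x_lhfb.shape
        flat = lambda t: t.flatten(2).transpose(1, 2)        # (B, HW, C)
        q_c, k_c, v_c = map(flat, self.qkv_c(x_lhfb).chunk(3, dim=1))
        q_m, k_m, v_m = map(flat, self.qkv_m(x_mab).chunk(3, dim=1))
        scale = c ** 0.5
        # Attention_1: CNN queries attend to Mamba keys/values -> X_C
        x_c = F.softmax(q_c @ k_m.transpose(1, 2) / scale, dim=-1) @ v_m
        # Attention_2: Mamba queries attend to CNN keys/values -> X_M
        x_m = F.softmax(q_m @ k_c.transpose(1, 2) / scale, dim=-1) @ v_c
        unflat = lambda t: t.transpose(1, 2).reshape(b, c, h, w)
        return unflat(x_c), unflat(x_m)
```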
Local High-frequency Feature Block
The architecture of the HFMN employs a specialized Local High-frequency Feature Block (LHFB) for capturing the high-frequency components critical to super-resolution tasks. Since CNNs are effective at exploiting high-frequency information, the LHFB is designed to extract and refine these details accurately.
The LHFB consists of two branches: a residual branch employing pixel attention (PA) and a high-frequency enhancement branch based on max pooling. The operations can be articulated as:
\[
\begin{array}{c}
X^{\prime} = Conv_{3 \times 3}(PA(Conv_{3 \times 3}(Conv_{1 \times 1}(X_{in})))) \\
X^{\prime\prime} = Conv_{3 \times 3}(Maxpool(X_{in})) \\
X_{out} = Conv_{1 \times 1}(Concat(X^{\prime}, X^{\prime\prime})) + X_{in}
\end{array}
\]
Here, max pooling and pixel attention extract and enhance high-frequency details from the original input, ensuring that each pixel retains critical contributions from all channels during computation.
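The equations translate almost line-for-line into PyTorch. The form of the pixel attention (1×1 conv plus sigmoid gate) and the 3×3 stride-1 max pooling are assumptions consistent with, but not fixed by, the text:

```python
import torch
import torch.nn as nn

class PixelAttention(nn.Module):
    """PA: a 1x1 conv + sigmoid yields a per-pixel, per-channel gate."""
    def __init__(self, dim):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 1)

    def forward(self, x):
        return x * torch.sigmoid(self.conv(x))

class LHFB(nn.Module):
    """Sketch of the Local High-frequency Feature Block."""
    def __init__(self, dim):
        super().__init__()
        self.conv1 = nn.Conv2d(dim, dim, 1)
        self.conv2 = nn.Conv2d(dim, dim, 3, padding=1)
        self.pa = PixelAttention(dim)
        self.conv3 = nn.Conv2d(dim, dim, 3, padding=1)
        self.pool = nn.MaxPool2d(3, stride=1, padding=1)  # keeps H, W
        self.conv4 = nn.Conv2d(dim, dim, 3, padding=1)
        self.fuse = nn.Conv2d(2 * dim, dim, 1)

    def forward(self, x_in):
        x1 = self.conv3(self.pa(self.conv2(self.conv1(x_in))))  # X'
        x2 = self.conv4(self.pool(x_in))                        # X''
        return self.fuse(torch.cat([x1, x2], dim=1)) + x_in     # X_out
```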
Mamba-based Attention Block
In line with contemporary advancements, the HFMN incorporates Mamba-based Attention Blocks (MABs) designed for efficient image processing. This module employs a depthwise convolution (DWConv) for local feature extraction alongside a structured state-space operator (SS2D) for long-range dependency modeling.
The computational flow within this module can be expressed as:
\[
\begin{array}{c}
X_{gate} = SiLU(Linear(LN(X_{in}))) \\
X^{\prime} = LN(SS2D(SiLU(DWConv(Linear(X_{in}))))) \\
X^{\prime\prime} = Linear(X_{gate} \odot X^{\prime}) + X_{in} \\
X_{out} = CA(Conv_{3 \times 3}(LN(X^{\prime\prime}))) + X^{\prime\prime}
\end{array}
\]
Through these operations, the MAB combines convolutional efficiency with state-space sequence modeling and channel attention, exhibiting a promising balance between performance and complexity in tasks like super-resolution.
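The block can be sketched by following the four equations, keeping tensors channel-last so `nn.Linear` acts per pixel. The SS2D selective-scan operator is injected rather than reimplemented (for a shape test, `nn.Identity()` can stand in), and the squeeze-and-excitation form of the channel attention `CA` is an assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChannelAttention(nn.Module):
    """CA: squeeze-and-excitation style gate over channels (assumed form)."""
    def __init__(self, dim, reduction=16):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim // reduction), nn.ReLU(),
            nn.Linear(dim // reduction, dim), nn.Sigmoid())

    def forward(self, x):                      # x: (B, H, W, C)
        w = self.mlp(x.mean(dim=(1, 2)))       # global average pool over H, W
        return x * w[:, None, None, :]

class MAB(nn.Module):
    """Sketch of the Mamba-based Attention Block, following the
    equations above in channel-last (B, H, W, C) layout."""
    def __init__(self, dim, ss2d):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.gate_proj = nn.Linear(dim, dim)
        self.in_proj = nn.Linear(dim, dim)
        self.dwconv = nn.Conv2d(dim, dim, 3, padding=1, groups=dim)
        self.ss2d = ss2d                       # injected 2-D selective scan
        self.norm2 = nn.LayerNorm(dim)
        self.out_proj = nn.Linear(dim, dim)
        self.norm3 = nn.LayerNorm(dim)
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.ca = ChannelAttention(dim)

    def forward(self, x_in):                   # (B, H, W, C)
        x_gate = F.silu(self.gate_proj(self.norm1(x_in)))      # X_gate
        y = self.in_proj(x_in).permute(0, 3, 1, 2)             # to (B, C, H, W)
        y = F.silu(self.dwconv(y)).permute(0, 2, 3, 1)         # to (B, H, W, C)
        x1 = self.norm2(self.ss2d(y))                          # X'
        x2 = self.out_proj(x_gate * x1) + x_in                 # X''
        z = self.conv(self.norm3(x2).permute(0, 3, 1, 2))
        return self.ca(z.permute(0, 2, 3, 1)) + x2             # X_out
```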
The architecture proposed within HFMN showcases a sophisticated blend of strategies and models, setting a significant precedent in the realm of image super-resolution by efficiently synthesizing high-quality outputs from low-resolution inputs.