Thursday, October 23, 2025

Enhancing Visible-Infrared Person Re-identification with High-Order Interaction and Wavelet Convolution Networks


An In-Depth Look at HIW-Net: High-order Interaction and Wavelet Convolution Network for Visible-Infrared Person Re-identification

Introduction to HIW-Net

Visible-infrared person re-identification matches images of the same person across visible-light and infrared cameras, a core task in surveillance and security applications. The High-order Interaction and Wavelet Convolution Network (HIW-Net) addresses this challenge by combining feature interaction with wavelet convolution, improving the ability to tell individuals apart across the two modalities: visible and infrared images.

Architecture Overview

Figure 4 illustrates the architecture of HIW-Net. It uses ResNet50 as its backbone and consists of two processing passes. In the first pass, both visible and infrared images are fed into the network, and the ResNet50 backbone extracts primitive features from them. These features are then aggregated across layers, which helps reduce modality differences and feature loss during processing.

Pass One: Feature Extraction and Interaction

During the first pass, the input images move through multiple stages. The initial stage applies basic feature interaction (FI) to capture low-order features. Later stages apply third-order primitive feature interaction (TPFI), which lets features interact along both the channel and spatial dimensions, so the network keeps a representation that spans several feature levels.

The wavelet convolution module then decomposes the aggregated features into different frequency bands. Each band is processed with convolutions of different kernel sizes, enabling the network to capture a wider range of feature information. The loss at this point combines several components: Cross-Entropy Loss (L_{ce}), Triplet Loss (L_{tri}), Orthogonality Loss (L_{ort}), and Center-Guided Pair Mining Loss (L_{cpm}).

Pass Two: Shape Processing

The second pass in HIW-Net processes shape images derived from both the visible and infrared sources. It follows the same computation as the first pass and yields a separate shape loss. This two-pass design lets the model focus on shape-related features independently, enabling a more nuanced understanding of each subject.

Third-Order Primitive Feature Interaction

The third-order primitive feature interaction is a focal point in HIW-Net’s architecture. Here, features are categorized into low-order, high-order, and primitive features based on their position within the network’s layers. The channel interaction phase processes these features in tandem through several convolutions, which results in a similarity matrix that ultimately enhances the feature aggregation across different stages.

Channels and Features

The process begins with the interaction between the primitive and high-order features. Several convolutions are applied to each feature type, producing compact outputs, and these outputs are multiplied as matrices to compute the channel similarity. The result is the aggregated high-order feature, which combines low-order, high-order, and primitive features.
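The channel step described here can be sketched in NumPy. This is a minimal illustration, not the authors' implementation: it assumes both feature maps already share the shape (C, H, W), uses a single 1x1 convolution per input, and adds a softmax and a residual connection, none of which the article specifies. The names `conv1x1` and `channel_interaction` are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def conv1x1(feat, weight):
    """1x1 convolution: mixes channels at every pixel.

    feat: (C_in, H, W), weight: (C_out, C_in).
    """
    return np.einsum("oc,chw->ohw", weight, feat)

def channel_interaction(primitive, high, w_p, w_h):
    """Hypothetical channel interaction between two (C, H, W) feature maps."""
    c, h, w = high.shape
    p = conv1x1(primitive, w_p).reshape(c, h * w)  # (C, HW)
    q = conv1x1(high, w_h).reshape(c, h * w)       # (C, HW)
    # Matrix multiplication yields a (C, C) channel similarity matrix.
    sim = softmax(q @ p.T / np.sqrt(h * w), axis=-1)
    # Re-weight the primitive channels and add them to the high-order feature.
    mixed = (sim @ p).reshape(c, h, w)
    return high + mixed, sim
```

Each row of `sim` sums to 1, so every output channel is a convex combination of the projected primitive channels added to the high-order feature.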

Spatial Interaction

Similar to the channel interaction, spatial interaction is performed on the high-order and low-order features. The key distinction here lies in aligning the spatial dimensions, ensuring that the processed low-order and primitive features maintain consistency with the high-order features.
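A comparable sketch for the spatial step, again under stated assumptions: the low-order map is taken to have the same channel count as the high-order map but a larger spatial size, and it is aligned by non-overlapping average pooling. The pooling choice, the softmax, and the residual connection are illustrative guesses, and `avg_pool` and `spatial_interaction` are hypothetical names.

```python
import numpy as np

def avg_pool(feat, factor):
    """Downsample a (C, H, W) map by an integer factor using
    non-overlapping average pooling."""
    c, h, w = feat.shape
    blocks = feat.reshape(c, h // factor, factor, w // factor, factor)
    return blocks.mean(axis=(2, 4))

def spatial_interaction(low, high):
    """Hypothetical spatial interaction between a low-order and a
    high-order feature map with equal channel counts."""
    c, h, w = high.shape
    factor = low.shape[1] // h
    low_aligned = avg_pool(low, factor)      # now (C, H, W), same as `high`
    q = high.reshape(c, h * w).T             # (HW, C)
    k = low_aligned.reshape(c, h * w)        # (C, HW)
    logits = q @ k / np.sqrt(c)              # (HW, HW) position similarity
    logits -= logits.max(axis=1, keepdims=True)
    sim = np.exp(logits)
    sim /= sim.sum(axis=1, keepdims=True)
    # Gather low-order information for every high-order position.
    mixed = (sim @ k.T).T.reshape(c, h, w)
    return high + mixed, sim
```

The difference from the channel case is the shape of the similarity matrix: it is (HW, HW), relating spatial positions rather than channels, which is why the spatial sizes must match first.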

Wavelet Convolution for Diverse Feature Mining

The wavelet convolution module is HIW-Net’s approach to diverse feature extraction. It uses a decomposition strategy that keeps the parameter count low while emphasizing multi-frequency features. As a result, the effective receptive field (ERF) expands substantially even with fewer parameters.

Application of Wavelet Transform

The wavelet convolution module serves as a drop-in replacement for depth-wise convolutions. By integrating low-frequency subbands, it makes the model more robust to high-frequency noise and emphasizes structural information over textural detail. HIW-Net incorporates the module without requiring major changes to the existing framework.
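To make the frequency-band idea concrete, here is a minimal single-level 2D Haar transform in NumPy. The article does not name the wavelet family HIW-Net uses, so Haar is an assumption chosen for simplicity, and `haar_dwt2` and `haar_idwt2` are hypothetical helper names. Naming conventions for the three detail bands vary between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an (H, W) array
    with even H and W. Returns four half-resolution subbands."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-frequency average
    lh = (a - b + c - d) / 2.0            # differences between columns
    hl = (a + b - c - d) / 2.0            # differences between rows
    hh = (a - b - c + d) / 2.0            # diagonal differences
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds the full-resolution array."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=ll.dtype)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x
```

A pixel-level checkerboard, the highest-frequency pattern an image can contain, lands entirely in the `hh` band and leaves `ll` untouched, which illustrates why emphasizing the low-frequency subband suppresses high-frequency noise. In a wavelet convolution, each subband would be filtered with its own small kernel before the inverse transform recombines them.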

Loss Functions: Key Components for Training

Training HIW-Net relies on several loss functions that are combined to optimize the model for person re-identification.

Identity Classification Loss

The Identity Classification Loss (L_{id}), the cross-entropy term listed earlier as the Cross-Entropy Loss, classifies pedestrian features by identity. This ensures that the model learns to distinguish individuals, which reduces the negative impact of modality differences.
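A standard cross-entropy over identity logits looks like the following NumPy sketch. The function name `identity_loss` is hypothetical, and the article does not say whether HIW-Net adds refinements such as label smoothing, so none are included.

```python
import numpy as np

def identity_loss(logits, labels):
    """Mean cross-entropy.

    logits: (N, num_ids) classifier outputs.
    labels: (N,) integer identity indices.
    """
    # Log-softmax computed stably by shifting with the row maximum.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Pick the log-probability assigned to each sample's true identity.
    return -log_probs[np.arange(len(labels)), labels].mean()
```

With uniform logits over K identities the loss equals log K, and it approaches zero as the classifier grows confident in the correct identity.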

Triplet Loss

Triplet loss (L_{tri}) serves to refine the model’s feature space. By narrowing the distance between anchor-positive pairs (same identity) while broadening the gap between anchor-negative pairs (different identities), this loss function enhances overall discriminative power.
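The margin-based form of triplet loss can be written in a few lines of NumPy. The margin of 0.3 is a common choice in re-identification work and is an assumption here, as is the use of Euclidean distance; the article gives neither.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.3):
    """Hinge-style triplet loss over batches of (N, D) embeddings."""
    d_ap = np.linalg.norm(anchor - positive, axis=1)  # same-identity distance
    d_an = np.linalg.norm(anchor - negative, axis=1)  # different-identity distance
    # Penalize triplets where the negative is not at least `margin` farther away.
    return np.maximum(d_ap - d_an + margin, 0.0).mean()
```

The loss is zero once every negative sits at least `margin` farther from the anchor than the positive does, so only violating triplets contribute to the gradient.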

Orthogonality Loss

To maintain feature diversity, HIW-Net adopts orthogonality loss (L_{ort}). This helps decrease redundancy among feature vectors obtained from different branches, ensuring that the model captures unique and varied traits from the dataset.
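One common way to penalize redundancy between branches is to push the Gram matrix of the normalized branch features toward the identity. The sketch below uses that formulation as an assumption; the article does not give the exact definition of (L_{ort}), and `orthogonality_loss` is a hypothetical name.

```python
import numpy as np

def orthogonality_loss(branch_feats):
    """branch_feats: (B, D) array, one feature vector per branch
    for a single sample."""
    # Normalize so the Gram matrix holds cosine similarities.
    f = branch_feats / np.linalg.norm(branch_feats, axis=1, keepdims=True)
    gram = f @ f.T
    # Off-diagonal entries measure overlap between different branches.
    off_diag = gram - np.eye(len(f))
    return (off_diag ** 2).sum()
```

Mutually orthogonal branch features give a loss of zero, while two identical branches contribute the maximum penalty for that pair.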

Center-Guided Pair Mining Loss

The Center-Guided Pair Mining Loss (L_{cpm}) expands on the notion of diverse feature generation. It promotes minimized distances among intra-class samples while simultaneously ensuring adequate separation of inter-class samples, thereby facilitating robust modeling for cross-modal scenarios.
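The article describes only the goal of this loss, not its formula. The following is therefore a generic center-based illustration of the same pull-and-push idea across modalities, not the (L_{cpm}) definition used by HIW-Net. The function names, the margin value, and the use of per-identity mean centers are all assumptions.

```python
import numpy as np

def class_centers(feats, labels):
    """Mean feature per identity. Returns (sorted ids, (num_ids, D) centers)."""
    ids = np.unique(labels)
    return ids, np.stack([feats[labels == i].mean(axis=0) for i in ids])

def center_guided_loss(vis, vis_labels, ir, ir_labels, margin=1.0):
    """Illustrative cross-modal center loss, not the paper's formulation."""
    ids, c_vis = class_centers(vis, vis_labels)
    ids_ir, c_ir = class_centers(ir, ir_labels)
    assert np.array_equal(ids, ids_ir), "both modalities must cover the same ids"
    # Distances between every visible center and every infrared center.
    dist = np.linalg.norm(c_vis[:, None, :] - c_ir[None, :, :], axis=2)
    same = np.eye(len(ids), dtype=bool)
    pull = dist[same].mean()  # same identity: bring the two modalities together
    # Different identities: keep centers at least `margin` apart.
    push = np.maximum(margin - dist[~same], 0.0).mean() if len(ids) > 1 else 0.0
    return pull + push
```

The pull term shrinks the gap between the visible and infrared representations of one person, while the push term only activates when centers of different people come closer than the margin.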

Total Loss Function

The total loss function (L_{total}) combines the modal and shape losses through a weighted relationship. The parameter (\lambda) adjusts this balance and determines the proportion of shape-enhanced features within the network. A value of (\lambda = 0.9) is reported to be particularly effective.
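Written out, the description suggests a structure like the one below. Two things here are assumptions: the unweighted sum of the four terms within a pass, and the placement of (\lambda) on the shape term. The article states only that the modal and shape losses are combined through a weighted relationship, and the symbols L_{modal} and L_{shape} are introduced here for illustration.

```latex
\[
\begin{aligned}
L_{modal} &= L_{id} + L_{tri} + L_{ort} + L_{cpm} \\
L_{total} &= L_{modal} + \lambda \, L_{shape}, \qquad \lambda = 0.9
\end{aligned}
\]
```

Under this reading, (\lambda = 0.9) would mean the shape pass contributes almost as strongly as the first pass; other weightings, such as a convex combination of the two losses, are equally compatible with the text.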

HIW-Net combines high-order feature interaction, wavelet convolution, and a multi-term loss into a single framework for visible-infrared person re-identification. Each component contributes to a more robust cross-modality representation, aimed at improving accuracy in this challenging setting.
