CMMF-Net: A Generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization
Thermal infrared (TIR) images remain unaffected by variations in light and atmospheric conditions, which makes them extensively utilized in diverse nocturnal traffic scenarios. However, challenges pertaining to low contrast and absence of chromatic information persist. The technique of image colorization emerges as a pivotal solution aimed at ameliorating the fidelity of TIR images. This enhancement is conducive to facilitating human interpretation and downstream analytical tasks. Because of the blurred and intricate features of TIR images, extracting and processing their feature information accurately through image-based approaches alone becomes challenging for networks. Hence, we propose a multi-modal model that integrates text features from TIR images with image features to jointly perform TIR image colorization. A vision transformer (ViT) model is employed to extract features from the original TIR images. Concurrently, we manually observe and summarize the textual descriptions of the images, and then input these descriptions into a pretrained contrastive language-image pretraining (CLIP) model to capture text-based features. These two sets of features are then fed into a cross-modal interaction (CI) module to establish the relationship between text and image. Subsequently, the text-enhanced image features are processed through a U-Net network to generate the final colorized images. Additionally, we utilize a comprehensive loss function to ensure the network's ability to generate high-quality colorized images. The effectiveness of the methodology put forward in this study is evaluated using the KAIST dataset. The experimental results showcase the superior performance of our CMMF-Net method in comparison to other methodologies for the task of TIR image colorization.
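For orientation, the following is a minimal PyTorch-style sketch of the pipeline the abstract describes: ViT image tokens and CLIP-encoded text features are fused in a cross-modal interaction step, and the text-enhanced image features are decoded into a color image. The module names, dimensions, cross-attention fusion, and the toy decoder are illustrative assumptions, not the authors' released implementation.

```python
# Illustrative sketch only: an assumed outline of the CLIP-guided multi-modal fusion
# flow described in the abstract, not the paper's actual code.
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Fuse text features into image features with cross-attention (assumed design)."""

    def __init__(self, dim=768, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # Image tokens attend to text tokens; the residual keeps the original image signal.
        fused, _ = self.attn(query=img_tokens, key=text_tokens, value=text_tokens)
        return self.norm(img_tokens + fused)


class ToyDecoder(nn.Module):
    """Toy stand-in for the U-Net generator: fused tokens -> 3-channel color image."""

    def __init__(self, dim=768, out_size=224):
        super().__init__()
        self.out_size = out_size
        self.proj = nn.Linear(dim, 256)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(256, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, tokens):
        b, n, d = tokens.shape
        h = w = int(n ** 0.5)  # assume a square patch-token grid with no CLS token
        x = self.proj(tokens).transpose(1, 2).reshape(b, 256, h, w)
        x = self.up(x)
        return nn.functional.interpolate(
            x, size=(self.out_size, self.out_size), mode="bilinear", align_corners=False
        )


class CMMFSketch(nn.Module):
    def __init__(self, dim=768):
        super().__init__()
        # Placeholder encoders: the paper uses a ViT for the TIR image and a pretrained
        # CLIP text encoder for the manually written descriptions; both would plug in here.
        self.image_encoder = nn.Linear(dim, dim)
        self.text_encoder = nn.Linear(dim, dim)
        self.ci = CrossModalInteraction(dim)
        self.decoder = ToyDecoder(dim)

    def forward(self, img_tokens, text_tokens):
        img = self.image_encoder(img_tokens)
        txt = self.text_encoder(text_tokens)
        fused = self.ci(img, txt)          # text-enhanced image features
        return self.decoder(fused)         # colorized output


if __name__ == "__main__":
    model = CMMFSketch()
    img_tokens = torch.randn(1, 196, 768)   # e.g. 14x14 ViT patch tokens
    text_tokens = torch.randn(1, 16, 768)   # e.g. CLIP-encoded caption tokens
    print(model(img_tokens, text_tokens).shape)  # torch.Size([1, 3, 224, 224])
```

The comprehensive loss and any adversarial components mentioned in the title are omitted here, since the record does not detail them.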
| Main Authors: | Jiang, Qian; Zhou, Tao; He, Youwei; Ma, Wenjun; Hou, Jingyu; Ahmad Shahrizan, Abdul Ghani; Miao, Shengfa; Jing, Xin |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | OAE Publishing Inc., 2025 |
| Subjects: | T Technology (General); TA Engineering (General). Civil engineering (General); TJ Mechanical engineering and machinery; TK Electrical engineering. Electronics. Nuclear engineering; TS Manufactures |
| Online Access: | http://umpir.ump.edu.my/id/eprint/43832/ http://umpir.ump.edu.my/id/eprint/43832/1/CMMF-Net_A%20Generative%20network%20based%20on%20CLIP-guided.pdf |
| author | Jiang, Qian; Zhou, Tao; He, Youwei; Ma, Wenjun; Hou, Jingyu; Ahmad Shahrizan, Abdul Ghani; Miao, Shengfa; Jing, Xin |
|---|---|
| building | UMP Institutional Repository |
| description | Thermal infrared (TIR) images remain unaffected by variations in light and atmospheric conditions, which makes them extensively utilized in diverse nocturnal traffic scenarios. However, challenges pertaining to low contrast and absence of chromatic information persist. The technique of image colorization emerges as a pivotal solution aimed at ameliorating the fidelity of TIR images. This enhancement is conducive to facilitating human interpretation and downstream analytical tasks. Because of the blurred and intricate features of TIR images, extracting and processing their feature information accurately through image-based approaches alone becomes challenging for networks. Hence, we propose a multi-modal model that integrates text features from TIR images with image features to jointly perform TIR image colorization. A vision transformer (ViT) model is employed to extract features from the original TIR images. Concurrently, we manually observe and summarize the textual descriptions of the images, and then input these descriptions into a pretrained contrastive language-image pretraining (CLIP) model to capture text-based features. These two sets of features are then fed into a cross-modal interaction (CI) module to establish the relationship between text and image. Subsequently, the text-enhanced image features are processed through a U-Net network to generate the final colorized images. Additionally, we utilize a comprehensive loss function to ensure the network's ability to generate high-quality colorized images. The effectiveness of the methodology put forward in this study is evaluated using the KAIST dataset. The experimental results showcase the superior performance of our CMMF-Net method in comparison to other methodologies for the task of TIR image colorization. |
| format | Article |
| id | ump-43832 |
| institution | Universiti Malaysia Pahang |
| language | English |
| publishDate | 2025 |
| publisher | OAE Publishing Inc. |
| citation | Jiang, Qian and Zhou, Tao and He, Youwei and Ma, Wenjun and Hou, Jingyu and Ahmad Shahrizan, Abdul Ghani and Miao, Shengfa and Jing, Xin (2025) CMMF-Net: A Generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization. Intelligence and Robotics, 5 (1). pp. 34-49. ISSN 2770-3541. (Published, peer reviewed, CC BY 4.0) https://doi.org/10.20517/ir.2025.03 |
| title | CMMF-Net: A Generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization |
| topic | T Technology (General); TA Engineering (General). Civil engineering (General); TJ Mechanical engineering and machinery; TK Electrical engineering. Electronics. Nuclear engineering; TS Manufactures |
| url | http://umpir.ump.edu.my/id/eprint/43832/ http://umpir.ump.edu.my/id/eprint/43832/1/CMMF-Net_A%20Generative%20network%20based%20on%20CLIP-guided.pdf |