CMMF-Net: A Generative network based on CLIP-guided multi-modal feature fusion for thermal infrared image colorization

Thermal infrared (TIR) images are largely unaffected by variations in illumination and atmospheric conditions, which makes them widely used in nocturnal traffic scenarios. However, they suffer from low contrast and a lack of chromatic information. Image colorization is therefore a key technique for improving the fidelity of TIR images, aiding both human interpretation and downstream analytical tasks. Because TIR images are blurred and their features are intricate, networks that rely on image information alone struggle to extract and process these features accurately. We therefore propose a multi-modal model that integrates text features describing the TIR images with image features to jointly perform TIR image colorization. A vision transformer (ViT) extracts features from the original TIR images. Concurrently, textual descriptions of the images are manually observed and summarized, then fed into a pretrained contrastive language-image pretraining (CLIP) model to obtain text features. The two sets of features are passed to a cross-modal interaction (CI) module that establishes the relationship between text and image, and the text-enhanced image features are then processed by a U-Net to generate the final colorized images. A comprehensive loss function further encourages the network to produce high-quality colorized images. The proposed method is evaluated on the KAIST dataset, and the experimental results show that CMMF-Net outperforms competing methods for TIR image colorization.
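
The pipeline outlined above combines a ViT image encoder, a pretrained CLIP text encoder, a cross-modal interaction module, and a U-Net generator. Below is a minimal, self-contained PyTorch sketch of this kind of CLIP-guided fusion, included for illustration only: the tiny transformer standing in for the ViT, the random tensor standing in for CLIP text embeddings, the lightweight decoder standing in for the U-Net, and all layer sizes are assumptions of this sketch, not the authors' implementation.

# Hypothetical sketch of a CLIP-guided multi-modal colorization pipeline.
# Not the authors' code: encoder/decoder choices and dimensions are assumed.
import torch
import torch.nn as nn


class CrossModalInteraction(nn.Module):
    """Fuse text features into image features via cross-attention."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, img_tokens, text_tokens):
        # Image tokens attend to text tokens (queries: image, keys/values: text).
        fused, _ = self.attn(img_tokens, text_tokens, text_tokens)
        return self.norm(img_tokens + fused)


class TinyColorizer(nn.Module):
    """Patch-embed TIR image -> fuse with text -> decoder -> RGB."""
    def __init__(self, dim=256, patch=16, img_size=256, text_dim=512):
        super().__init__()
        self.patch_embed = nn.Conv2d(1, dim, kernel_size=patch, stride=patch)
        enc_layer = nn.TransformerEncoderLayer(dim, nhead=4, batch_first=True)
        self.vit = nn.TransformerEncoder(enc_layer, num_layers=2)  # stand-in for the ViT backbone
        self.text_proj = nn.Linear(text_dim, dim)  # project CLIP text features to image dim
        self.ci = CrossModalInteraction(dim)
        # Lightweight upsampling decoder standing in for the U-Net generator.
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(dim, 128, 4, stride=4), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=4), nn.ReLU(),
            nn.Conv2d(64, 3, 3, padding=1), nn.Tanh(),
        )

    def forward(self, tir, text_feat):
        x = self.patch_embed(tir)                      # (B, dim, H/p, W/p)
        tokens = x.flatten(2).transpose(1, 2)          # (B, N, dim)
        tokens = self.vit(tokens)
        text_tokens = self.text_proj(text_feat)        # (B, T, dim)
        tokens = self.ci(tokens, text_tokens)          # text-enhanced image tokens
        feat = tokens.transpose(1, 2).reshape(x.shape)
        return self.decoder(feat)                      # (B, 3, H, W)


if __name__ == "__main__":
    model = TinyColorizer()
    tir = torch.randn(2, 1, 256, 256)   # single-channel TIR images
    text = torch.randn(2, 4, 512)       # stand-in for CLIP text embeddings
    print(model(tir, text).shape)       # torch.Size([2, 3, 256, 256])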

Bibliographic Details
Main Authors: Jiang, Qian, Zhou, Tao, He, Youwei, Ma, Wenjun, Hou, Jingyu, Ahmad Shahrizan, Abdul Ghani, Miao, Shengfa, Jing, Xin
Format: Article
Language: English
Published: OAE Publishing Inc. 2025
Subjects: T Technology (General); TA Engineering (General). Civil engineering (General); TJ Mechanical engineering and machinery; TK Electrical engineering. Electronics. Nuclear engineering; TS Manufactures
Published in: Intelligence and Robotics, 5 (1), pp. 34-49. ISSN 2770-3541
DOI: https://doi.org/10.20517/ir.2025.03
Repository: UMP Institutional Repository, Universiti Malaysia Pahang
Online Access:http://umpir.ump.edu.my/id/eprint/43832/
http://umpir.ump.edu.my/id/eprint/43832/1/CMMF-Net_A%20Generative%20network%20based%20on%20CLIP-guided.pdf