Photorealistic Video Colorization Using Gated Color Guidance and Cross-Frame Consistency
Nozomi Okada
Faculty of Science, University of Auckland, Auckland, New Zealand
Chika Sakamoto
Faculty of Science, University of Auckland, Auckland, New Zealand
Keywords: Video Colorization, Temporal Consistency, Feature Propagation, Deep Learning
Abstract
Video colorization remains a challenging problem in computer vision: it demands both accurate spatial colorization and temporal consistency across sequential frames. Previous approaches frequently suffer from visual artifacts, notably color bleeding, temporal flickering, and semantic mismatch, which together degrade the photorealism of the output. To mitigate these issues, this paper introduces a framework for photorealistic video colorization that combines gated color guidance with a cross-frame consistency mechanism. The gated color guidance module selectively incorporates prior color information from exemplar frames, dynamically weighting the relevance of reference colors according to deep semantic features. Concurrently, the cross-frame consistency module employs recurrent feature propagation to keep temporal variations imperceptible to the human visual system, thereby suppressing flickering artifacts. In experiments on standard benchmark datasets, the proposed architecture improves on prior methods in both quantitative metrics and qualitative visual assessments. Ablation studies validate the contributions of both the gating mechanism and the temporal consistency regularization. This research provides a foundation for future applications in film restoration, historical archive digitization, and automated video enhancement.
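The abstract does not specify the exact form of the gate or of the recurrent propagation, so the following is only a minimal per-pixel sketch under stated assumptions. It assumes the gate is a sigmoid of the cosine similarity between target and exemplar semantic features, and it uses an exponential moving average as a simple stand-in for recurrent feature propagation. All function names and the `sharpness` and `momentum` parameters are hypothetical and are not taken from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two feature vectors given as sequences of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as carrying no semantic evidence
    return dot / (norm_a * norm_b)


def gate_weight(target_feat, ref_feat, sharpness=5.0):
    """Assumed gate: sigmoid of scaled semantic similarity, a value in (0, 1).

    A high value means the exemplar color is trusted for this pixel.
    """
    similarity = cosine_similarity(target_feat, ref_feat)
    return 1.0 / (1.0 + math.exp(-sharpness * similarity))


def gated_color(pred_ab, ref_ab, gate):
    """Blend the network's own chroma prediction with the exemplar chroma."""
    return tuple(gate * r + (1.0 - gate) * p for p, r in zip(pred_ab, ref_ab))


def propagate(prev_state, current, momentum=0.8):
    """Stand-in for recurrent propagation: exponential moving average over frames."""
    if prev_state is None:
        return tuple(current)  # first frame has no history to smooth against
    return tuple(momentum * s + (1.0 - momentum) * c
                 for s, c in zip(prev_state, current))


if __name__ == "__main__":
    # Hypothetical features and chroma values for one pixel over three frames.
    ref_feat, ref_ab = [1.0, 0.0], (40.0, -20.0)
    frames = [([1.0, 0.1], (10.0, 0.0)),
              ([0.9, 0.2], (12.0, 1.0)),
              ([0.0, 1.0], (11.0, 0.5))]
    state = None
    for target_feat, pred_ab in frames:
        g = gate_weight(target_feat, ref_feat)
        state = propagate(state, gated_color(pred_ab, ref_ab, g))
        print(round(g, 3), tuple(round(v, 2) for v in state))
```

In this sketch, a pixel whose semantic features match the exemplar takes most of its chroma from the reference, a mismatched pixel falls back to the network's own prediction, and the moving average damps frame-to-frame jumps. A learned gate and a learned recurrent unit would replace both hand-set functions in a real model.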