Efficient Edge Video Analytics with Region-Aware Enhancement and Temporal Consistency
Jisoo Kang, Pierre Faure, Youngjae Bang
Faculty of Engineering, Built Environment and Information Technology, University of Pretoria, Pretoria
Keywords: Edge Computing, Video Analytics, Region-Aware Processing, Temporal Consistency
Abstract
The rapid growth of connected vision sensors has reshaped automated surveillance, intelligent transportation systems, and industrial monitoring. Streaming continuous high-definition video to a centralized cloud is increasingly untenable because of bandwidth constraints, transmission latency, and privacy concerns. Edge computing offers an alternative by moving computation closer to the data source. However, edge devices typically have limited compute capacity and tight thermal budgets, which makes running complex deep neural networks on them difficult. This paper presents a framework for efficient edge video analytics built on two components. First, a region-aware enhancement mechanism allocates computation selectively to spatial regions of high analytical value and discards irrelevant background, reducing spatial redundancy. Second, a temporal consistency module exploits the continuity between consecutive frames: it propagates high-level semantic features from previous frames to the current frame using lightweight motion estimation, which avoids redundant computation and keeps the analytical outputs smooth and stable. In evaluations on standard video analytics datasets, the proposed method substantially improves processing speed and bandwidth utilization without compromising analytical accuracy.
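The two components described in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the abstract does not say how regions are scored or how motion is estimated, so the sketch assumes frame-difference thresholding over fixed-size blocks as a stand-in for region selection, and a single global integer motion vector as a stand-in for lightweight motion estimation. The names `changed_blocks` and `propagate` are hypothetical.

```python
from typing import List, Tuple

Grid = List[List[float]]  # a 2D frame or feature map stored as nested lists


def changed_blocks(prev: Grid, curr: Grid, block: int,
                   threshold: float) -> List[Tuple[int, int]]:
    """Return (block_row, block_col) indices whose mean absolute
    difference between two frames exceeds `threshold`.

    Assumption: frame difference approximates "analytical value".
    """
    rows, cols = len(curr), len(curr[0])
    selected = []
    for br in range(0, rows, block):
        for bc in range(0, cols, block):
            total, count = 0.0, 0
            for r in range(br, min(br + block, rows)):
                for c in range(bc, min(bc + block, cols)):
                    total += abs(curr[r][c] - prev[r][c])
                    count += 1
            if total / count > threshold:
                selected.append((br // block, bc // block))
    return selected


def propagate(prev_feat: Grid, dy: int, dx: int, fill: float = 0.0) -> Grid:
    """Shift a feature map by one global integer motion vector (dy, dx).

    Cells with no source in the previous map receive `fill`.
    Assumption: a single global vector replaces per-region motion.
    """
    rows, cols = len(prev_feat), len(prev_feat[0])
    out = [[fill] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            sr, sc = r - dy, c - dx  # source cell in the previous map
            if 0 <= sr < rows and 0 <= sc < cols:
                out[r][c] = prev_feat[sr][sc]
    return out
```

In a pipeline of this shape, only the blocks returned by `changed_blocks` would be passed to the expensive network, while the remaining area would reuse the output of `propagate` from the previous frame.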