Transformer-Based Spatial-Temporal Models for Comprehensive Scene Understanding, Object Tracking, and Autonomous Decision Support
Amelia Paredes
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Eleanor Sterling
Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA, USA
Keywords: Scene Understanding, Vision Transformers, Object Tracking, Decision Support, Autonomous Systems
Abstract
Integrating scene understanding, object tracking, and decision support into a single computational framework remains a major challenge in autonomous systems. Traditional approaches rely on disjointed pipelines in which convolutional neural networks extract spatial features, recursive algorithms handle temporal tracking, and isolated heuristic models make downstream decisions. This fragmentation introduces cascading errors, added latency, and poor context sharing between stages. In this paper, we propose a unified Transformer-based architecture that jointly processes spatial-temporal representations for holistic scene understanding, continuous target tracking, and proactive decision support. By applying self-attention across both spatial positions and temporal frames, the model captures global contextual dependencies without the restricted receptive fields of conventional convolutions. A multi-head prediction module projects the shared latent embeddings into semantic segmentation masks, object bounding boxes, and action policy probabilities. In extensive empirical evaluations on standard large-scale driving datasets, the integrated spatial-temporal Transformer reduces inference latency and outperforms state-of-the-art disjointed architectures on quantitative metrics in all three tasks. These findings underscore the value of global representation learning in complex dynamic environments and provide a foundation for the next generation of autonomous robotic and vehicular control systems.
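The shared-embedding idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a single attention head, random untrained weights, no positional or temporal encodings, and mean pooling before the box and policy heads. All names (`forward`, `W_seg`, `W_box`, `W_act`) and all sizes are hypothetical and chosen only for illustration.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random weights; a stand-in for learned parameters (assumption)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


def self_attention(tokens, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over all tokens."""
    q = [matvec(Wq, t) for t in tokens]
    k = [matvec(Wk, t) for t in tokens]
    v = [matvec(Wv, t) for t in tokens]
    d = len(q[0])
    out = []
    for qi in q:
        scores = [sum(a * b for a, b in zip(qi, kj)) / math.sqrt(d) for kj in k]
        weights = softmax(scores)
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(d)])
    return out


def forward(frames, params):
    """frames: T frames, each a list of P patch embeddings of dimension D."""
    # Flatten space and time into one sequence so that every patch can
    # attend to every other patch in every frame.
    tokens = [patch for frame in frames for patch in frame]
    latent = self_attention(tokens, params["Wq"], params["Wk"], params["Wv"])

    # Head 1: per-token class probabilities (a coarse segmentation map).
    seg = [softmax(matvec(params["W_seg"], z)) for z in latent]

    # Heads 2 and 3 read a mean-pooled summary of the shared embeddings.
    dim = len(latent[0])
    pooled = [sum(z[c] for z in latent) / len(latent) for c in range(dim)]
    # One box as four sigmoid-squashed values in (0, 1).
    box = [1.0 / (1.0 + math.exp(-x)) for x in matvec(params["W_box"], pooled)]
    # Probabilities over a small discrete action set.
    policy = softmax(matvec(params["W_act"], pooled))
    return seg, box, policy


rng = random.Random(0)
T, P, D = 3, 4, 8                  # frames, patches per frame, embedding size
NUM_CLASSES, NUM_ACTIONS = 5, 3    # hypothetical label and action counts
params = {
    "Wq": rand_matrix(D, D, rng),
    "Wk": rand_matrix(D, D, rng),
    "Wv": rand_matrix(D, D, rng),
    "W_seg": rand_matrix(NUM_CLASSES, D, rng),
    "W_box": rand_matrix(4, D, rng),
    "W_act": rand_matrix(NUM_ACTIONS, D, rng),
}
frames = [[[rng.gauss(0.0, 1.0) for _ in range(D)] for _ in range(P)]
          for _ in range(T)]
seg, box, policy = forward(frames, params)
print(len(seg), len(box), len(policy))
```

The point of the sketch is that all three heads read the same attended embeddings, so context computed once across space and time is reused by segmentation, tracking, and decision outputs. A practical model would add positional and temporal encodings, multiple attention heads and layers, and per-object queries for the box head.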