Implementing Causal Observability for Practical Site Reliability Engineering in Cloud-Native Distributed Systems

Author(s)

Oreoluwa Omoike 1,*

1. Department of Computer Science, Olabisi Onabanjo University, Ogun State, Nigeria

* Corresponding author.

DOI: https://doi.org/10.5815/ijwmt.2026.02.02

Received: 25 Dec. 2025 / Revised: 26 Jan. 2026 / Accepted: 14 Mar. 2026 / Published: 8 Apr. 2026

Index Terms

Causal Observability, Cloud-Native Systems, Granger Causality, Bayesian Networks, Kubernetes, Site Reliability Engineering, Predictive Monitoring, DevOps, OpenTelemetry, Cybersecurity.

Abstract

This paper presents a Causal Observability Framework designed to enhance the reliability and performance of cloud-native distributed systems through structured integration with the DevOps pipeline. The framework unifies three interdependent components: real-time telemetry collection, dual-domain causal tracing, and probabilistic causal inference. The causal tracing layer combines a time-domain vector autoregressive Granger causality model with a frequency-domain extension based on the discrete Fourier transform. The causal inference layer employs Bayesian network propagation, updated online via the Expectation-Maximisation algorithm, to compute posterior downstream failure probabilities from upstream anomaly observations. Validation was conducted through a controlled, three-replicate experimental study on a seven-service AI-powered recommendation application deployed across a dual-provider, six-node Kubernetes cluster (AWS EKS and GCP GKE) under three traffic profiles ranging from 50 to 500 requests per second. Against a conventional threshold-based monitoring baseline, the proposed framework achieved a 35% reduction in incident response time (70 minutes to 45 minutes), a 40% reduction in mean time to recovery (50 minutes to 30 minutes), a 1.5 percentage-point improvement in system availability (98.0% to 99.5%), a 61% reduction in false-positive alert rate (18% to 7%), and a 63% relative improvement in root-cause localisation accuracy (54% to 88%). All five improvements were statistically significant at p < 0.05 via paired t-test. The fault-injection scenario demonstrated a nine-minute early-warning lead time over conventional detection. Seven formal equations underpin the methodology, spanning Granger vector autoregression, F-test inference, AIC-based lag selection, normalised causality scoring, frequency-domain spectral causality, Bayesian posterior propagation, and expected detection lead time.
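
As an illustration of the time-domain step named in the abstract, the following is a minimal sketch of a pairwise Granger F-test between two metric series. It is not the paper's implementation: the lag is fixed rather than chosen by AIC, the fit is bivariate rather than a full vector autoregression, no p-value is computed, and the function names, series names, and synthetic data are all hypothetical.

```python
import random


def ols_rss(rows, targets):
    """Residual sum of squares of a least-squares fit (normal equations)."""
    k = len(rows[0])
    # Augmented matrix [X^T X | X^T y].
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, targets))]
        for i in range(k)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, k):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, k + 1):
                aug[r][c] -= factor * aug[col][c]
    # Back substitution for the coefficient vector.
    beta = [0.0] * k
    for i in reversed(range(k)):
        rest = sum(aug[i][j] * beta[j] for j in range(i + 1, k))
        beta[i] = (aug[i][k] - rest) / aug[i][i]
    return sum(
        (t - sum(b * v for b, v in zip(beta, r))) ** 2
        for r, t in zip(rows, targets)
    )


def granger_f(cause, effect, lag=1):
    """F statistic for 'cause Granger-causes effect' at a fixed lag (assumed)."""
    n = len(effect)
    targets = effect[lag:]
    # Restricted model: intercept plus the effect's own lagged values.
    restricted = [
        [1.0] + [effect[t - i] for i in range(1, lag + 1)]
        for t in range(lag, n)
    ]
    # Unrestricted model: additionally includes lagged values of the cause.
    unrestricted = [
        row + [cause[t - i] for i in range(1, lag + 1)]
        for row, t in zip(restricted, range(lag, n))
    ]
    rss_r = ols_rss(restricted, targets)
    rss_u = ols_rss(unrestricted, targets)
    dof = len(targets) - (2 * lag + 1)
    return ((rss_r - rss_u) / lag) / (rss_u / dof)


# Synthetic, hypothetical telemetry: the downstream series follows the
# upstream series with a one-step delay plus a small amount of noise.
random.seed(0)
upstream = [random.gauss(0.0, 1.0) for _ in range(200)]
downstream = [0.0] + [
    0.8 * upstream[t - 1] + random.gauss(0.0, 0.1) for t in range(1, 200)
]

f_forward = granger_f(upstream, downstream)  # upstream -> downstream
f_reverse = granger_f(downstream, upstream)  # downstream -> upstream
print(f"F(upstream -> downstream) = {f_forward:.1f}")
print(f"F(downstream -> upstream) = {f_reverse:.1f}")
```

On data constructed this way, the forward statistic should be far larger than the reverse one, which is the asymmetry a Granger-based tracing layer relies on to orient an edge between two services. A production version would compare the statistic against the F distribution to obtain a p-value and would select the lag by an information criterion such as AIC.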

Cite This Paper

Oreoluwa Omoike, "Implementing Causal Observability for Practical Site Reliability Engineering in Cloud-Native Distributed Systems", International Journal of Wireless and Microwave Technologies (IJWMT), Vol.16, No.2, pp. 16-25, 2026. DOI: 10.5815/ijwmt.2026.02.02
