A Failure Detector for Crash Recovery Systems in Cloud

Pages: 9-16



Bharati Sinha 1,*, Awadhesh Kumar Singh 1, Poonam Saini 2

1. National Institute of Technology, Kurukshetra, India, 136119

2. Punjab Engineering College, Chandigarh, India, 160012

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2019.07.02

Received: 2 Apr. 2019 / Revised: 10 May 2019 / Accepted: 19 May 2019 / Published: 8 Jul. 2019

Index Terms

Failure detectors, Cloud computing, Crash recovery systems, Machine Repairman model


Cloud computing has brought remarkable scalability and elasticity to the distributed computing paradigm. It provides implicit fault tolerance through virtual machine (VM) migration. However, VM migration requires heavy replication and incurs storage overhead as well as loss of computation. In early cloud infrastructure, these costs were negligible under light load; today, with an exploding task population, they cause considerable performance degradation in the cloud. Fault detection and recovery are therefore gaining attention in the cloud research community. Failure detectors (FDs) are modules deployed at the nodes to perform fault detection. This paper proposes a failure detector that handles crash-recoverable nodes, with system recovery performed by a designated checkpoint in the event of a failure. We use the Machine Repairman model to estimate the recovery latency. The simulation experiments were carried out using CloudSim Plus.
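As a rough illustration of how a Machine Repairman model can yield a recovery latency estimate, the sketch below evaluates the textbook single-repairman, finite-source queue (M/M/1//N). It is not taken from the paper: the mapping of cloud nodes to "machines" and of the recovery mechanism to the "repairman", the exponential failure and repair times, and all names (`machine_repairman`, `n_nodes`, `failure_rate`, `repair_rate`) are assumptions made for this example.

```python
from math import factorial


def machine_repairman(n_nodes, failure_rate, repair_rate):
    """Steady-state metrics of the classic M/M/1//N machine repairman model.

    Assumptions (illustrative, not from the paper):
      - each of n_nodes fails independently at rate failure_rate,
      - a single repair facility recovers one node at a time at rate repair_rate,
      - failure and repair times are exponentially distributed.
    """
    rho = failure_rate / repair_rate
    # Unnormalised probability of k failed nodes: N!/(N-k)! * rho^k
    weights = [
        factorial(n_nodes) // factorial(n_nodes - k) * rho ** k
        for k in range(n_nodes + 1)
    ]
    total = sum(weights)
    probs = [w / total for w in weights]

    # The repair facility is busy whenever at least one node is down.
    utilization = 1.0 - probs[0]
    throughput = repair_rate * utilization  # recoveries per unit time
    mean_failed = sum(k * p for k, p in enumerate(probs))
    # Little's law: mean time a failed node spends waiting for and in recovery.
    mean_recovery_latency = mean_failed / throughput

    return {
        "utilization": utilization,
        "throughput": throughput,
        "mean_failed": mean_failed,
        "mean_recovery_latency": mean_recovery_latency,
    }


if __name__ == "__main__":
    # Hypothetical parameters: 10 nodes, one failure per 100 time units per node,
    # mean recovery service time of 2 time units.
    metrics = machine_repairman(n_nodes=10, failure_rate=0.01, repair_rate=0.5)
    for name, value in metrics.items():
        print(f"{name}: {value:.4f}")
```

With a single node there is no queueing, so the estimated latency reduces to the mean repair time `1 / repair_rate`; with more nodes it grows as failed nodes wait for the shared repair facility.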

Cite This Paper

Bharati Sinha, Awadhesh Kumar Singh, Poonam Saini, "A Failure Detector for Crash Recovery Systems in Cloud", International Journal of Information Technology and Computer Science (IJITCS), Vol. 11, No. 7, pp. 9-16, 2019. DOI: 10.5815/ijitcs.2019.07.02

