Enhanced PROBCONS for Multiple Sequence Alignment in Cloud Computing

Full Text (PDF, 672KB), PP.38-47



Eman M. Mohamed 1,*, Hamdy M. Mousa 1, Arabi E. Keshk 1

1. Faculty of Computers and Information, Menoufia University, Egypt

* Corresponding author.

DOI: https://doi.org/10.5815/ijitcs.2019.09.05

Received: 15 May 2019 / Revised: 20 May 2019 / Accepted: 23 May 2019 / Published: 8 Sep. 2019

Index Terms

Bioinformatics, Multiple sequence alignment, Protein features, PROBCONS


Multiple protein sequence alignment (MPSA) aims to identify the similarity among multiple protein sequences with high accuracy. MPSA becomes a critical bottleneck for large-scale protein sequence data sets, so it is vital for existing MPSA tools to run in a parallelized design. Combining MPSA tools with cloud computing can improve both speed and accuracy for large-scale data sets. PROBCONS is a progressive MPSA tool based on probabilistic consistency and hidden Markov models. It achieves the maximum expected accuracy, but it is time-consuming. In this paper, the proposed approach first clusters the large set of protein sequences into groups of structurally similar sequences. This classification is based on secondary structure, longest common subsequence (LCS), and amino acid features. PROBCONS is then run in parallel on the clusters. The last step merges the PROBCONS alignments of the clusters into the final alignment. The proposed algorithm is implemented on the Amazon Elastic Compute Cloud (EC2). The proposed algorithm achieved the highest alignment accuracy. Feature-based classification captures protein sequence, structure, and function; these features strongly affect accuracy and reduce the running time needed to produce the final alignment result.
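As a rough illustration of two of the features named above, the following Python sketch computes an LCS-based similarity and an amino acid composition vector, then groups sequences before alignment. It is a minimal sketch under stated assumptions, not the authors' classifier: the function names, the greedy grouping rule, and the 0.6 threshold are hypothetical, and the secondary structure feature and the PROBCONS call itself are omitted.

```python
from collections import Counter

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length normalized by the shorter sequence (assumed normalization)."""
    if not a or not b:
        return 0.0
    return lcs_length(a, b) / min(len(a), len(b))


def composition(seq: str) -> list:
    """Fraction of each standard amino acid in seq."""
    counts = Counter(seq)
    n = len(seq) or 1
    return [counts.get(aa, 0) / n for aa in AMINO_ACIDS]


def greedy_cluster(seqs, threshold=0.6):
    """Hypothetical grouping rule: a sequence joins the first cluster whose
    representative (its first member) is similar enough, else starts a new one."""
    clusters = []
    for s in seqs:
        for cluster in clusters:
            if lcs_similarity(s, cluster[0]) >= threshold:
                cluster.append(s)
                break
        else:
            clusters.append([s])
    return clusters
```

In a pipeline of the kind the abstract describes, each resulting cluster would then be aligned independently (for example, one worker per cluster), and the per-cluster alignments merged afterwards.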

Cite This Paper

Eman M. Mohamed, Hamdy M. Mousa, Arabi E. Keshk, "Enhanced PROBCONS for Multiple Sequence Alignment in Cloud Computing", International Journal of Information Technology and Computer Science (IJITCS), Vol.11, No.9, pp.38-47, 2019. DOI: 10.5815/ijitcs.2019.09.05

