Building and Annotating a Codeswitched Hate Speech Corpora

Full Text (PDF, 1132KB), PP.33-52



Edward Ombui 1,*, Lawrence Muchemi 2, Peter Wagacha 2

1. School of Science and Technology, Africa Nazarene University, Nairobi, Kenya

2. School of Computing and Informatics, University of Nairobi, Nairobi, Kenya

* Corresponding author.


Received: 23 Feb. 2020 / Revised: 26 Aug. 2020 / Accepted: 3 Apr. 2021 / Published: 8 Jun. 2021

Index Terms

Annotation scheme, hate speech, dataset, distancing language, code-switching


Presidential campaign periods are a major trigger event for hate speech on social media in almost every country. A systematic review of previous studies found few publicly available annotated datasets and hardly any evidence of theoretical underpinning for the annotation schemes used in hate speech identification. This situation stifles the development of empirically useful data for research, especially in supervised machine learning. This paper describes the methodology used to develop a multidimensional hate speech framework based on the components of the duplex theory of hate [1]: distance, passion, commitment to hate, and hate as a story. An annotation scheme based on this framework was then used to annotate a random sample of ~51k tweets drawn from ~400k tweets collected during the August and October 2017 presidential campaign period in Kenya. The result is a gold-standard codeswitched dataset that can be used for comparative and empirical studies in supervised machine learning. Classifiers trained on this dataset could provide real-time monitoring of hate speech spikes on social media and inform data-driven decision-making by relevant government security agencies.

Cite This Paper

Edward Ombui, Lawrence Muchemi, Peter Wagacha, "Building and Annotating a Codeswitched Hate Speech Corpora", International Journal of Information Technology and Computer Science (IJITCS), Vol.13, No.3, pp.33-52, 2021. DOI:10.5815/ijitcs.2021.03.03


[1] R. Sternberg, K. Sternberg, The Duplex Theory of Hate I: The Triangular Theory of the Structure of Hate. In The Nature of Hate, Cambridge Univ. Press. (2008) 51–77.
[2] A. Des Forges, Leave None To Tell The Story: Genocide in Rwanda, New York Hum. Rights Watch. (1999).
[3] R. King, G.M. Sutton, High Times for Hate Crime: Explaining the Temporal Clustering of Hate Motivated Offending, Criminology. 51 (2013) 71–94.
[4] E. Ombui, L. Muchemi, P. Wagacha, Hate Speech Detection in Code-switched Text Messages, in: 3rd Int. Symp. Multidiscip. Stud. Innov. Technol., IEEE, Ankara, 2019.
[5] D.M. Blei, A.Y. Ng, M.I. Jordan, Latent dirichlet allocation, J. Mach. Learn. Res. 3 (2003) 993–1022.
[6] E. Ombui, M. Karani, L. Muchemi, Annotation Framework for Hate Speech Identification in Tweets: Case Study of Tweets during Kenyan Elections, in: IST-2019, 2019.
[7] P. Burnap, M.L. Williams, Us and them: identifying cyber hate on Twitter across multiple protected characteristics, EPJ Data Sci. (2016).
[8] Z. Waseem, D. Hovy, Hateful symbols or hateful people? Predictive features for hate speech detection on Twitter, in: Proc. NAACL-HLT, 2016: pp. 88–93.
[9] C. Nobata, J. Tetreault, A. Thomas, Y. Mehdad, Y. Chang, Abusive Language Detection in Online User Content, in: 25th Int. Conf. World Wide Web, 2016: pp. 145–153.
[10] P. Fortuna, J. Rocha da Silva, J. Soler-Company, L. Wanner, S. Nunes, A Hierarchically-Labeled Portuguese Hate Speech Dataset, in: Proc. Third Work. Abus. Lang. Online, ACL, 2019: pp. 94–104.
[11] R.P. de Pelle, V.P. Moreira, Offensive Comments in the Brazilian Web: a dataset and baseline results, in: 6th Brazilian Work. Soc. Netw. Anal. Min., 2017.
[12] B. Ross, M. Rist, G. Carbonell, B. Cabrera, N. Kurowsky, M. Wojatzki, Measuring the Reliability of Hate Speech Annotations: The Case of the European Refugee Crisis, arXiv:1701.08118. 1 (2017).
[13] E. Fersini, P. Rosso, M. Anzovino, Overview of the Task on Automatic Misogyny Identification at IberEval 2018, in: Third Work. Eval. Hum. Lang. Technol. Iber. Lang., 2018.
[14] F. Poletto, M. Stranisci, M. Sanguinetti, V. Patti, C. Bosco, Hate speech annotation: Analysis of an Italian Twitter corpus, in: CEUR WS, 2018: pp. 1–6.
[15] R. Kumar, A.K. Ojha, S. Malmasi, M. Zampieri, Benchmarking Aggression Identification in Social Media, in: Proc. First Work. Trolling, Aggress. Cyberbullying, ACL, 2018: pp. 1–11.
[16] D. Gamal, M. Alfonse, E.-S.M. El-Horbaty, A.-B.M. Salem, Twitter Benchmark Dataset for Arabic Sentiment Analysis, Int. J. Mod. Educ. Comput. Sci. 11(1) (2019) 33–38.
[17] A.A. Alsolamy, M.A. Siddiqui, I.H. Khan, A Corpus Based Approach to Build Arabic Sentiment Lexicon, Int. J. Inf. Eng. Electron. Bus. 11(6) (2019) 16–23.
[18] A.K. Tegegnie, A.N. Tarekegn, T.A. Alemu, A Comparative Study of Flat and Hierarchical Classification for Amharic News Text Using SVM, Int. J. Inf. Eng. Electron. Bus. 9(3) (2017) 36–42.
[19] T. Davidson, D. Warmsley, M. Macy, I. Weber, Automated Hate Speech Detection and the Problem of Offensive Language, in: ICWSM, 2017.
[20] P. Gupta, A. Kamra, R. Thakral, M. Aggarwal, S. Bhatti, V. Jain, A Proposed Framework to Analyze Abusive Tweets on the Social Networks, Int. J. Mod. Educ. Comput. Sci. 10(1) (2018) 46–56.
[21] Z. Waseem, Are You a Racist or Am I Seeing Things? Annotator Influence on Hate Speech Detection on Twitter, in: EMNLP Work. NLP CSS, 2016: pp. 138–142.
[22] W. Warner, J. Hirschberg, Detecting Hate Speech on the World Wide Web, in: Lang. Soc. Media (LSM 2012), 2012.
[23] I. Kwok, Y. Wang, Locate the hate: Detecting tweets against blacks, AAAI. (2013).
[24] Three Kenyan politicians arrested over “hate speech,” Telegr. (2010).
[25] Kenyan authorities arrest blogger after posts on alleged official corruption, CPJ. (2018).
[26] P. Cavazos-Rehg, M.J. Krauss, S. Sowles, S. Connolly, C. Rosas, M. Bharadwaj, L. Bierut, A content analysis of depression-related tweets, Comput. Hum. Behav. 54 (2016) 351–357.
[27] M. Karani, E. Ombui, A. Gichamba, The Design and Development of a Custom Text Annotator, in: IEEE Africon, 2019.
[28] K. Krippendorff, Computing Krippendorff’s Alpha-Reliability, Univ. Pennsylvania Sch. (2011).
[29] Y. Chen, Y. Zhou, S. Zhu, H. Xu, Detecting offensive language in social media to protect adolescent online safety, in: Fourth ASE/IEEE Int. Conf. Soc. Comput. (SocialCom 2012), Amsterdam, 2012.
[30] W. Clyne, S. Pezaro, K. Deeny, R. Kneafsey, Using Social Media to Generate and Collect Primary Data: The #ShowsWorkplaceCompassion Twitter Research Campaign, JMIR Public Heal. Surveill. 4 (2018) e41.
[31] W. Ahmed, P. Bath, G. Demartini, Using Twitter as a data source: An overview of ethical, legal and methodological challenges, in: Second (Ed.), Ethics Online Res. Adv. Res. Ethics Integr., Emerald, 2017: pp. 79–107.
[32] Twitter Privacy Policy, Twitter, Inc. (2018). (accessed October 26, 2019).
[33] Y.R. Tausczik, J.W. Pennebaker, The Psychological Meaning of Words: LIWC and Computerized Text Analysis Methods, J. Lang. Soc. Psychol. 1 (2010).
[34] E. Alpaydin, Introduction to Machine Learning, 2nd Edition, The MIT Press, London, 2010.
[35] M.L. Williams, P. Burnap, Cyberhate on Social Media in the aftermath of Woolwich: A Case Study in Computational Criminology and Big Data, Br. J. Criminol. 56 (2016) 211–238.
[36] Twitter, About Twitter’s APIs, (n.d.). (accessed November 25, 2020).
[37] P. Burnap, M.L. Williams, Cyber Hate Speech on Twitter: An Application of Machine Classification and Statistical Modeling for Policy and Decision Making, Policy & Internet. 2 (2015) 223–242.
[38] P. Badjatiya, S. Gupta, M. Gupta, V. Varma, Deep Learning for Hate Speech Detection in Tweets, in: 2017 Int. World Wide Web Conf. Comm., 2017.
[39] S. Joshi, D. Deshpande, Twitter Sentiment Analysis System, Int. J. Comput. Appl. 180 (2018).
[40] A. Schmidt, M. Wiegand, A Survey on Hate Speech Detection using Natural Language Processing, SocialNLP@EACL. (2017).
[41] The Constitution of Kenya, 2010, Laws of Kenya, Kenya, 2010.
[42] KLR, National Cohesion and Integration Act No. 12 of 2008, National Council for Law, Kenya, 2012.
[43] M. Makinen, M.W. Kuira, Social Media and Post-Election Crisis in Kenya, Inf. Commun. Technol. - Africa. 13 (2008).
[44] NCIC, Functions of the Commission, (2019). (accessed September 16, 2019).
[45] R. Damary, NCIC deploys peace monitors to arrest triggers of election chaos, Star. (2017).
[46] Kenet, Kenya Education Network, (2018). (accessed September 16, 2016).
[47] J.-M. Xu, K.-S. Jun, X. Zhu, A. Bellmore, Learning from bullying traces in social media, in: Proc. 2012 Conf. North Am. Chapter Assoc. Comput. Linguist., Association for Computational Linguistics, 2012: pp. 656–666.