I. Introduction
In this era of digital communication, hateful content appears not only in social media comments, posts, and text messages but also in voice messages and video content [1]. Such content fuels cyberbullying, rioting, fraud, loss of reputation, and even murder. Hate Crime Statistics (2019) reports that 15,588 law enforcement agencies submitted data on criminal incidents, offenders, victims, and hate crime locations. These agencies reported 7,314 hate crime incidents involving 8,559 offenses, of which 57.6% were motivated by race/ethnicity/ancestry bias, 20.1% by religion, and 16.7% by sexual orientation [2]. Another study found that online hate crimes frequently begin online and affect victims offline; victims in that report also described being mistreated online.

Research further shows that hateful intent is conveyed through voice tone and facial expression, making both crucial cues for discerning hatred. In many cases, text, facial expressions, or vocal information alone is insufficient to detect a hateful conversation. Yet almost all current research on hate speech identification relies on text data only. This study proposes integrating audio, video, and text features, taking into account all the modalities through which hate speech can be delivered, in order to detect it more precisely. The final decision is produced by a hard-voting (majority-voting) ensemble that combines the predictions of the image, audio, and text models to determine the final output.
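As a concrete illustration of this fusion step, the sketch below shows how a hard-voting (majority-voting) ensemble could combine the three modality-level predictions. The function name, argument names, and binary labels (1 = hate, 0 = non-hate) are illustrative assumptions, not the exact implementation described later in this work.

```python
from collections import Counter

def majority_vote(text_pred: int, audio_pred: int, image_pred: int) -> int:
    """Hard-voting ensemble: return the label predicted by most modalities."""
    votes = [text_pred, audio_pred, image_pred]
    # Counter.most_common(1) yields the label with the highest vote count;
    # with three voters and binary labels, a tie is impossible.
    return Counter(votes).most_common(1)[0][0]

# Example: the text and audio models flag hate (1), the image model does not (0);
# the majority vote therefore labels the sample as hate speech.
print(majority_vote(text_pred=1, audio_pred=1, image_pred=0))  # -> 1
```

An odd number of voters is a deliberate design choice here: with three modality models and binary labels, the majority vote always yields a unique decision without needing a tie-breaking rule.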