Gomal University , Dera Ismail Khan , Pakistan
This study identifies hate speech in low-resource social media text, specifically Urdu tweets, using a machine learning-based stacking ensemble. The dataset consisted of 8,800 tweets, half labeled Hateful and half labeled No-Hate. Preprocessing accounted for the characteristics of Urdu by normalizing characters, removing frequent words, and filtering punctuation. Features were extracted with TF-IDF over unigrams and bigrams, with the vocabulary restricted to 5,000 terms. Logistic Regression, Multinomial Naive Bayes, and a Support Vector Classifier served as the base learners, and a second Logistic Regression model served as the meta-learner in the final layer of the ensemble. The data was split 80% for training and 20% for testing. Compared with baseline classifiers and ensemble approaches (Random Forest, Gradient Boosting, AdaBoost, Bagging, Soft Voting, and Hard Voting), the proposed machine learning-based stacking ensemble achieved a high accuracy of 86.53%, a precision of 85.45%, a recall of 86.96%, and an F1-score of 86.20%. The results indicate that a machine learning-based stacking ensemble is effective for identifying hate speech in Urdu tweets.
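The pipeline summarized above can be sketched with scikit-learn as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `build_stacking_pipeline` and `evaluate_stacking`, the linear SVC kernel, the 5-fold internal cross-validation, the random seed, and the `max_iter` value are all hypothetical choices, and the Urdu-specific preprocessing (character normalization, frequent-word removal, punctuation filtering) is assumed to have been applied to the texts beforehand.

```python
from sklearn.ensemble import StackingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC


def build_stacking_pipeline(cv=5):
    """TF-IDF (unigrams + bigrams, at most 5,000 terms) feeding a stacking ensemble."""
    vectorizer = TfidfVectorizer(ngram_range=(1, 2), max_features=5000)
    base_learners = [
        ("lr", LogisticRegression(max_iter=1000)),
        ("mnb", MultinomialNB()),
        # Assumption: the kernel is not stated in the abstract; linear is used here.
        ("svc", SVC(kernel="linear", random_state=42)),
    ]
    # Logistic Regression is used again as the meta-learner; it is trained on
    # out-of-fold predictions of the base learners (cv folds, an assumed value).
    stack = StackingClassifier(
        estimators=base_learners,
        final_estimator=LogisticRegression(max_iter=1000),
        cv=cv,
    )
    return make_pipeline(vectorizer, stack)


def evaluate_stacking(texts, labels, test_size=0.2, seed=42):
    """Fit on an 80/20 stratified split; labels are 1 (Hateful) and 0 (No-Hate)."""
    x_train, x_test, y_train, y_test = train_test_split(
        texts, labels, test_size=test_size, stratify=labels, random_state=seed
    )
    model = build_stacking_pipeline()
    model.fit(x_train, y_train)
    predicted = model.predict(x_test)
    return {
        "accuracy": accuracy_score(y_test, predicted),
        "precision": precision_score(y_test, predicted),
        "recall": recall_score(y_test, predicted),
        "f1": f1_score(y_test, predicted),
    }
```

Calling `evaluate_stacking(tweets, labels)` with a list of preprocessed tweet strings and matching 0/1 labels returns the four metrics reported in the abstract; the values obtained will depend on the data and on the assumed settings above.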
This is an open access article distributed under the Creative Commons Attribution Non-Commercial License (CC BY-NC), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
The statements, opinions and data contained in the journal are solely those of the individual authors and contributors and not of the publisher and the editor(s). We stay neutral with regard to jurisdictional claims in published maps and institutional affiliations.