Friday, October 24, 2025

Enhanced Error Analysis of NLP Datasets: A Data-Centric Framework Utilizing Explainable AI

Exploring Emotion Classification in Arabic Text Data: Experimental Results and Insights

Understanding the Dataset

In the quest to analyze emotional responses in Arabic text, exploratory data analysis (EDA) proved invaluable. This phase characterized the dataset and informed the optimal pre-processing strategy. An intriguing finding was that the distribution of emotion labels—anger, fear, joy, love, none, sadness, surprise, and sympathy—was remarkably balanced, as illustrated in Fig. 6. This balance provided a solid foundation for subsequent analyses.
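
As a minimal sketch of this check (assuming the tweets and labels are loaded into a pandas DataFrame with hypothetical `text` and `emotion` columns; the file name is illustrative), the balance of the label distribution can be inspected as follows:

```python
import pandas as pd

# Hypothetical loading step: the dataset is assumed to be a CSV with
# "text" and "emotion" columns; adjust the path and column names to the actual data.
df = pd.read_csv("arabic_emotion_tweets.csv")

# Absolute and relative frequency of each emotion label.
label_counts = df["emotion"].value_counts()
label_shares = df["emotion"].value_counts(normalize=True).round(3)

print(pd.concat([label_counts, label_shares], axis=1, keys=["count", "share"]))
```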

Using word clouds, as displayed in Fig. 7, we gained deeper insights into the textual content. The word “الاولمبياد” (Olympics) emerged as the most frequently occurring term, predominantly falling within the ‘none’ class. This suggests potential spurious correlations during model training, hinting that the word’s presence didn’t always reflect genuine emotional content.
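
A simple way to surface such candidate terms, independent of the word-cloud rendering, is to count token frequencies per class. The sketch below assumes the same hypothetical DataFrame as above and plain whitespace tokenization, which is a simplification rather than the study's exact procedure:

```python
from collections import Counter

def top_terms_per_class(df, n=10):
    """Return the n most frequent whitespace-separated tokens for each emotion class."""
    tops = {}
    for label, group in df.groupby("emotion"):
        counts = Counter(tok for text in group["text"] for tok in str(text).split())
        tops[label] = counts.most_common(n)
    return tops

for label, terms in top_terms_per_class(df).items():
    print(label, terms)
```

Terms that dominate a single class, as “الاولمبياد” does for ‘none’, are natural candidates for spurious correlations.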

Interestingly, the ‘none’ class, while defined as encompassing sentences carrying no discernible emotion, revealed a nuanced reality. Annotators often included instances that didn’t neatly fit other emotion categories—examples include sarcastic praise of a rival team or disdain for a sports uniform. Such instances underscore the complexity of emotional expression in text, calling for careful consideration during classification.

Emoji Usage and Its Implications

Upon delving into emoji usage, a clear pattern emerged: about 18% of tweets included emojis, with notable differences across emotion classes (shown in Fig. 8). Tweets labeled as ‘anger’ had markedly lower emoji usage, whereas ‘love’ and ‘fear’ tweets featured higher emoji incidence, with ‘love’ tweets reaching 34%. Surprisingly, even the ‘none’ class exhibited around 20% emoji usage, suggesting that emojis play a meaningful role in text interpretation, conveying nuances that words alone might miss.
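
A rough way to reproduce this kind of per-class emoji statistic (a sketch assuming the third-party `emoji` package and the same hypothetical DataFrame as above) is:

```python
import emoji

# Flag tweets that contain at least one emoji, then compute the share per emotion class.
df["has_emoji"] = df["text"].apply(lambda t: emoji.emoji_count(str(t)) > 0)
emoji_rate = df.groupby("emotion")["has_emoji"].mean().round(3)
print(emoji_rate)
```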

This analysis informed several pre-processing steps. Emojis were categorized according to their emotional connotations and treated as standalone words to enhance classification accuracy. Text normalization followed, filtering out non-Arabic characters and unnecessary punctuation, while a mild ISRI stemmer from the NLTK library balanced readability with effective word reduction. Finally, MARBERT was employed to generate word embeddings, providing richer input for subsequent classification.
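
A condensed sketch of such a pipeline is shown below. It assumes the `emoji`, `nltk`, and `transformers` packages, uses NLTK's ISRIStemmer as a stand-in for the "mild" stemming described above, maps emojis to standalone name tokens via `demojize` (one possible way to treat them as words), and pulls the public UBC-NLP/MARBERT checkpoint with mean pooling as an embedding choice; none of these details are guaranteed to match the study's exact configuration.

```python
import re
import emoji
import torch
from nltk.stem.isri import ISRIStemmer
from transformers import AutoModel, AutoTokenizer

stemmer = ISRIStemmer()
tokenizer = AutoTokenizer.from_pretrained("UBC-NLP/MARBERT")
marbert = AutoModel.from_pretrained("UBC-NLP/MARBERT")

def preprocess(text: str) -> str:
    # Turn each emoji into a standalone word-like token (e.g. "red_heart")
    # so its emotional signal survives normalization as its own "word".
    text = emoji.demojize(str(text), delimiters=(" ", " "))
    # Keep Arabic letters, Latin letters (the demojized names), underscores and whitespace.
    text = re.sub(r"[^\u0600-\u06FFA-Za-z_\s]", " ", text)
    # Apply the ISRI stemmer to Arabic tokens only.
    tokens = [stemmer.stem(tok) if re.search(r"[\u0600-\u06FF]", tok) else tok
              for tok in text.split()]
    return " ".join(tokens)

def embed(text: str) -> torch.Tensor:
    """Mean-pooled MARBERT embedding for a single pre-processed tweet."""
    inputs = tokenizer(preprocess(text), return_tensors="pt",
                       truncation=True, max_length=128)
    with torch.no_grad():
        hidden = marbert(**inputs).last_hidden_state
    return hidden.mean(dim=1).squeeze(0)
```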

Performance Insights: MARBERT vs. Other Models

The performance evaluation of various models—GRU, LR, NB, and MARBERT—yielded compelling results, as summarized in Table 2. MARBERT consistently outperformed its counterparts, achieving superior accuracy. Strikingly, MARBERT on raw data performed 4% better than its pre-processed counterpart, although the trade-off became apparent when considering interpretability.

A detailed analysis of 200 misclassified samples using explainable AI (XAI) tools unveiled noteworthy insights. Raw data introduced susceptibility to noise, particularly from exaggerated Arabic words in which characters are repeated for emphasis. Such variations undermined prediction consistency, while normalization techniques successfully mitigated them.
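
One simple normalization step for this kind of noise is to collapse long runs of the same character; the sketch below is a generic rule of this type, not necessarily the exact one used in the study:

```python
import re

def collapse_repeats(text: str, max_run: int = 2) -> str:
    """Collapse exaggerated character repetitions, e.g. 'جمييييل' -> 'جمييل'."""
    return re.sub(rf"(.)\1{{{max_run},}}", r"\1" * max_run, text)

print(collapse_repeats("رااااائع"))  # -> "راائع": the long run is reduced to two characters
```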

With regard to explainability, the XAI tools—LIME and SHAP—exposed flaws when using raw data, particularly with punctuation and emoji misinterpretation. While raw data improved accuracy, it significantly hindered interpretability, leading to the decision to adopt pre-processed data across all models, prioritizing clarity in model predictions.

Comparing XAI Techniques

The experiments that aimed to contrast LIME and SHAP unveiled significant differences in their consistency and robustness. SHAP exhibited greater stability in its outputs compared to LIME, which showed variability due to its inherent randomness. Reruns of LIME often resulted in weight fluctuations, particularly for samples where multiple labels had closely competing probabilities.
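
The rerun experiment described here can be reproduced with a few lines of LIME. The sketch below assumes a `predict_proba` function that maps a list of raw texts to an array of class probabilities (e.g. a wrapper around the fine-tuned MARBERT classifier, not shown) and simply collects feature weights across repeated explanations:

```python
from lime.lime_text import LimeTextExplainer

class_names = ["anger", "fear", "joy", "love", "none", "sadness", "surprise", "sympathy"]
explainer = LimeTextExplainer(class_names=class_names)

def lime_weights(text, predict_proba, label, runs=5, num_features=10):
    """Collect LIME feature weights over several reruns to gauge explanation stability."""
    all_runs = []
    for _ in range(runs):
        exp = explainer.explain_instance(text, predict_proba,
                                         labels=[label], num_features=num_features)
        all_runs.append(dict(exp.as_list(label=label)))
    return all_runs

# Words whose weights fluctuate noticeably across reruns indicate unstable explanations.
```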

In terms of robustness, the removal of minor-weighted words typically had little effect on LIME’s output. SHAP, by contrast, adapted dynamically, occasionally assigning previously non-weighted words new significance after minor removals, thereby reflecting a more holistic view of the input.

The handling of repeated words proved particularly enlightening; LIME aggregated repeated words into single weights, while SHAP assigned distinct values to each. This could lead to increased complexity in SHAP’s output, cluttering visual interpretations without adding substantial meaning.
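
The per-token behavior of SHAP can be observed directly when it wraps a Hugging Face text-classification pipeline; the sketch below uses a hypothetical fine-tuned checkpoint path ("path/to/marbert-emotion") as a placeholder, since the actual model artifact is not specified here:

```python
import shap
from transformers import pipeline

# "path/to/marbert-emotion" is a hypothetical fine-tuned MARBERT classifier checkpoint.
clf = pipeline("text-classification", model="path/to/marbert-emotion", top_k=None)

explainer = shap.Explainer(clf)              # SHAP wraps the pipeline with a text masker
explanation = explainer(["مثال نصي للتفسير"])

# Every token occurrence receives its own attribution vector (one value per class),
# which is why repeated words appear multiple times in SHAP's output.
for token, values in zip(explanation[0].data, explanation[0].values):
    print(token, values)
```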

Time complexity further influenced the choice of XAI techniques—while both LIME and SHAP had quadratic scaling characteristics, the perturbation factor for LIME made it considerably slower. This highlighted a practical limitation when analyzing larger datasets or more computationally expensive models.

Identifying Distinctive Anomaly Patterns

In pursuing distinctive anomaly patterns within the dataset, several key trends emerged, warranting further exploration. For instance, the high frequency of the term “الاوليمبياد” led to spurious correlations with the ‘none’ label, as evidenced by its overrepresentation during Olympic events. Replacing this term in misclassified tweets significantly improved label accuracy, underscoring the need for careful attention to dataset specifics.
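
A hedged sketch of such a replacement experiment is shown below; the neutral placeholder "حدث" ("event") and the spelling variants in the pattern are illustrative choices, not necessarily those used in the study:

```python
import re

# Spelling variants of the Olympics term observed in the data (illustrative).
OLYMPICS_PATTERN = re.compile("الاوليمبياد|الاولمبياد")

def replace_olympics(text: str, placeholder: str = "حدث") -> str:
    """Swap the Olympics term for a neutral placeholder before re-prediction."""
    return OLYMPICS_PATTERN.sub(placeholder, text)

# Re-score previously misclassified tweets after the replacement and compare the labels.
```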

Contextual misunderstandings surfaced as a significant source of misclassification. For example, tweets containing Islamic prayers often misled classifiers into associating them with ‘love’ rather than ‘sympathy,’ revealing underlying biases rooted in linguistic and cultural contexts. The potential for such biases further emphasizes the necessity of diverse training datasets that encapsulate varied emotional expressions and contexts.

Dialect discrepancies added another layer of complexity to classification accuracy. Tweets featuring dialects distinct from the training data exhibited higher misclassification rates, highlighting the necessity of dialect-aware models to capture regional nuances effectively.

Additionally, short statements showed marked sensitivity to slight lexical modifications. For example, a minor change in word choice shifted a prediction from ‘sympathy’ to ‘fear’, raising concerns over model stability in the face of subtle linguistic cues.

Analyzing Classifier Performance

The comparative analysis of classifiers—Naive Bayes, Logistic Regression, GRU, and MARBERT—revealed distinct behavioral patterns. Naive Bayes operated on an independence assumption that made it vulnerable to misclassifications driven by repeated word occurrences. For instance, repeated instances of “حب” (love) could overshadow the true sentiment of ‘joy’, highlighting a critical shortcoming of the model.
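
The mechanism is easy to see numerically: under a multinomial Naive Bayes model, each occurrence of a token adds its log-likelihood again, so repetition shifts the class score linearly. The per-class token probabilities below are illustrative assumptions, not values estimated from the dataset:

```python
import numpy as np

# Illustrative per-class probabilities of the token "حب" (assumptions, not fitted values).
log_p_love = np.log(0.05)
log_p_joy = np.log(0.01)

for count in (1, 3, 5):
    # Each additional occurrence adds the same log-likelihood gap to the love-vs-joy score.
    margin = count * (log_p_love - log_p_joy)
    print(f'"حب" repeated {count}x shifts the love-vs-joy log-odds by {margin:.2f}')
```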

Logistic Regression, with its performance nestled between Naive Bayes and GRU, excelled in situations where clear emotional indicators were present yet faltered in identifying subtler emotional expressions.

GRU, while superior in contextual modeling compared to logistic approaches, still exhibited biases towards named entities, neglecting the essential emotional content of the tweets. Misclassifications often stemmed from this over-attention to specific identifiers rather than emotional nuance.

In contrast, MARBERT consistently demonstrated the highest accuracy, revealing a nuanced understanding of Arabic emotional expressions. It exhibited remarkable contextual awareness, although still grappling with biases stemming from the training data.

These analyses underscore the complexities and challenges facing emotion classification in Arabic text, particularly as they relate to model interpretability, contextual understanding, and linguistic diversity.
