Friday, October 24, 2025

Open Source Arabic Dataset for Natural Language Processing Research

Clustering Results

The exploration of clustering results with traditional algorithms involves several steps: preprocessing, matrix transformation, and application of the clustering algorithms themselves. The first experiment applied a traditional clustering pipeline to the proposed dataset. The preprocessing stage, termed Preprocessing 1, focused on essential cleaning tasks: removing stop words and eliminating diacritics such as tashkeel and harakat, the tatweel elongation character, and the shadda mark. The cleaned text was then transformed into a TF-IDF matrix, which weights each term's frequency within a document against its rarity across the corpus.
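The Preprocessing 1 pipeline can be sketched as follows. This is a minimal illustration, not the study's implementation: the stop-word list here is a tiny placeholder, and diacritic stripping is done with stdlib regular expressions over the relevant Unicode ranges.

```python
import re
from sklearn.feature_extraction.text import TfidfVectorizer

# Arabic diacritics (tashkeel/harakat, including shadda U+0651) and tatweel U+0640.
DIACRITICS = re.compile(r"[\u064B-\u0652]")
TATWEEL = re.compile(r"\u0640")

# Tiny illustrative stop-word list; the study's actual list is not reproduced here.
STOP_WORDS = {"في", "من", "على", "إلى"}

def preprocess(text: str) -> str:
    """Preprocessing 1 sketch: strip diacritics and tatweel, then drop stop words."""
    text = TATWEEL.sub("", DIACRITICS.sub("", text))
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)

docs = ["ذَهَبَ الولد إلى المدرسة", "كتابٌ جديدٌ في المكتبة"]
cleaned = [preprocess(d) for d in docs]

# TF-IDF matrix: rows = documents, columns = vocabulary terms.
X = TfidfVectorizer().fit_transform(cleaned)
print(X.shape)
```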

Following this, the K-means and DBSCAN clustering algorithms were applied. The K-means results are reported in Table 9, the DBSCAN results in Table 10, and the mini-batch K-means results in Table 11.
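Running the three algorithms with scikit-learn looks roughly like this. The synthetic 2-D blobs below stand in for the TF-IDF matrix purely for illustration; the cluster counts and eps values echo the ranges mentioned in the text but are not the study's exact configuration.

```python
import numpy as np
from sklearn.cluster import KMeans, DBSCAN, MiniBatchKMeans

# Illustrative stand-in for the TF-IDF matrix: three synthetic 2-D blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.3, (30, 2)) for c in (0.0, 4.0, 8.0)])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
mbk = MiniBatchKMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
db = DBSCAN(eps=0.7, min_samples=5).fit(X)  # the study compared eps=0.7 and eps=1

print(len(set(km.labels_)), len(set(db.labels_) - {-1}))  # DBSCAN marks noise as -1
```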

Measurement Metrics

Clustering quality was first assessed with the Davies-Bouldin index, where a lower value indicates more compact, better-separated clusters. The experiments revealed clear contrasts: configuring DBSCAN with an eps value of 0.7 achieved a strong Davies-Bouldin score of 1.545, whereas the same algorithm scored a far worse 13.591 with eps set to 1. K-means performed best at k=10, with a score of 1.866.
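The metric is available directly in scikit-learn. As a quick sanity check on its direction (the data below is synthetic and unrelated to the study), two compact, well-separated blobs should score low:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import davies_bouldin_score

# Two well-separated synthetic blobs: lower Davies-Bouldin means better clustering.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
score = davies_bouldin_score(X, labels)
print(round(score, 3))
```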

Beyond the Davies-Bouldin index, the Silhouette Coefficient was also used to assess clustering quality. This metric ranges from -1 to 1, with higher scores indicating better-defined clusters. K-means with k=7 recorded a Silhouette Coefficient of 0.097, a modest value that points to weak cluster separation. DBSCAN scores were predominantly negative except at eps=0.7, underscoring the difficulty of clustering this data.
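The Silhouette Coefficient has the same one-line interface; on the same kind of synthetic well-separated blobs (again illustrative data, not the study's corpus) it approaches 1:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (50, 2)), rng.normal(5.0, 0.3, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# In [-1, 1]; higher is better. A metric= argument (e.g. "cosine",
# common for TF-IDF vectors) can replace the Euclidean default.
sil = silhouette_score(X, labels)
print(round(sil, 3))
```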

Another dimension of validation was external assessment via the Adjusted Rand Index (ARI), which measures agreement between the produced clusters and reference labels while correcting for chance. The K-means algorithm at k=9 exhibited a commendable ARI of 0.419, suggesting substantial agreement with the dataset's category labels.
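A small example shows the two properties that make ARI useful here: it is invariant to how cluster IDs are numbered, and it drops below 1 as soon as the partitions disagree. The toy labels below are for illustration only.

```python
from sklearn.metrics import adjusted_rand_score

truth = [0, 0, 0, 1, 1, 1]
pred_same = [1, 1, 1, 0, 0, 0]  # identical partition, renamed IDs
pred_off = [0, 0, 1, 1, 1, 1]   # one point moved to the other cluster

print(adjusted_rand_score(truth, pred_same))  # 1.0 for a perfect match
print(adjusted_rand_score(truth, pred_off))   # below 1.0
```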

Enhanced Preprocessing Techniques

In the second experimental scenario, additional preprocessing, termed Preprocessing 2, aimed to refine the data further. This stage involved removing stop words, stripping diacritics, and performing Arabic letter normalization using libraries such as PyArabic. The refined output was then subjected to the same traditional clustering algorithms.
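The normalization step can be sketched with the standard library. The study uses PyArabic; the exact variant set handled below (alef forms, alef maqsura, teh marbuta) is an assumption about which common normalizations were applied, not a statement of the paper's rules.

```python
import re

# Stdlib sketch of common Arabic normalization (PyArabic offers equivalents).
ALEF_VARIANTS = re.compile(r"[\u0622\u0623\u0625]")  # آ أ إ

def normalize(text: str) -> str:
    text = ALEF_VARIANTS.sub("\u0627", text)  # alef variants -> bare alef ا
    text = text.replace("\u0649", "\u064A")   # alef maqsura ى -> ya ي
    text = text.replace("\u0629", "\u0647")   # teh marbuta ة -> ha ه
    return text

print(normalize("أحمد مشى إلى المدرسة"))
```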

For the Davies-Bouldin index (where lower is better), mini-batch K-means delivered the better results among the centroid-based methods: most k-values yielded scores just above 2.5, while standard K-means registered between 3.5 and 5.5. DBSCAN again produced contrasting outcomes, scoring 1.672 at eps=0.7 but rising to a far worse 13.1 with eps=1.

The Silhouette Coefficient told a similar story: K-means peaked at approximately 0.1 under cosine distance, while DBSCAN produced mainly negative scores across the distance metrics tried.

Final Preprocessing and Experimentation

In the last phase of experimentation, an advanced set of preprocessing methods, termed Preprocessing 3, combined the removal of stop words and diacritics with Arabic normalization and stemming via ARLSTem. This more aggressive cleaning aimed to pave the way for higher-quality clustering. The prepared dataset was again transformed into a TF-IDF matrix.
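ARLSTem itself is available in NLTK (as nltk.stem.arlstem.ARLSTem). To keep this sketch dependency-free, the snippet below is a deliberately simplified light-stemmer that only strips a few common prefixes and suffixes; the real ARLSTem applies a much fuller rule set, so treat this as an illustration of the idea, not the algorithm.

```python
# Hypothetical light-stemming sketch; ARLSTem's actual rules are more elaborate.
PREFIXES = ("وال", "بال", "كال", "فال", "ال")
SUFFIXES = ("ات", "ون", "ين", "ها", "ة")

def light_stem(word: str) -> str:
    for p in PREFIXES:  # longest prefixes listed first
        if word.startswith(p) and len(word) - len(p) >= 3:
            word = word[len(p):]
            break
    for s in SUFFIXES:
        if word.endswith(s) and len(word) - len(s) >= 3:
            word = word[:-len(s)]
            break
    return word

print(light_stem("المدرسة"))
```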

The performance of K-means, DBSCAN, and mini-batch K-means was again reported across various tables. In addition, the bio-inspired algorithms Particle Swarm Optimization (PSO) and Grey Wolf Optimization (GWO) were applied to the clustering task, proving effective alongside the traditional methods, with their scores outlined in the corresponding tables.
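One common way to use PSO for clustering, and this is a generic sketch, not the paper's implementation, is to let each particle encode a full set of candidate centroids and to use an internal index such as Davies-Bouldin as the fitness to minimize. The swarm parameters below are conventional illustrative values.

```python
import numpy as np
from sklearn.metrics import davies_bouldin_score

def fitness(X, centroids):
    """Davies-Bouldin index of the partition induced by the centroids (lower is better)."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    labels = d.argmin(axis=1)
    if len(set(labels)) < 2:
        return np.inf  # degenerate partition: everything in one cluster
    return davies_bouldin_score(X, labels)

def pso_cluster(X, k, n_particles=15, iters=40, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    # Each particle is a (k, dim) centroid set sampled inside the data's bounding box.
    pos = rng.uniform(X.min(0), X.max(0), (n_particles, k, X.shape[1]))
    vel = np.zeros_like(pos)
    pbest, pbest_f = pos.copy(), np.array([fitness(X, p) for p in pos])
    gbest, gbest_f = pbest[pbest_f.argmin()].copy(), pbest_f.min()
    for _ in range(iters):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        for i, p in enumerate(pos):
            f = fitness(X, p)
            if f < pbest_f[i]:
                pbest[i], pbest_f[i] = p.copy(), f
                if f < gbest_f:
                    gbest, gbest_f = p.copy(), f
    return gbest, gbest_f

# Synthetic demo data: two well-separated blobs.
rng = np.random.default_rng(2)
X = np.vstack([rng.normal(0.0, 0.3, (40, 2)), rng.normal(4.0, 0.3, (40, 2))])
centroids, score = pso_cluster(X, k=2)
print(round(score, 3))
```

GWO can be dropped into the same skeleton by replacing the velocity update with the wolf-hierarchy position update while keeping the identical fitness function.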

Cluster Quality Assessment

The internal measures were revealing throughout these tests. The lowest Davies-Bouldin scores, signaling the best clustering, were recorded at 1.6 for DBSCAN; mini-batch K-means consistently achieved scores around 2.2, while K-means typically registered around 4. Taken together, these findings identified K-means with k=7 as the most reliable algorithm for the dataset used.

The Silhouette Coefficient results remained ambiguous, however, hinting at significant room for improvement in delineating distinct clusters. Despite the overall success of K-means, the fact that the optimal k differed from the number of categories highlights how the inherent features of the dataset shape clustering behavior.

These analyses reaffirm the crucial role of preprocessing, algorithm selection, and evaluation metrics in the success of clustering. The findings contribute meaningfully to the discussion of optimal clustering approaches for Arabic text, offering researchers and practitioners concrete reference points for future experiments.
