The Impact of Text Augmentation on Machine Learning Classification Models
Introduction
In the rapidly evolving field of machine learning, classification plays an integral role, especially in processing textual information. The recent literature shows growing interest in applying a variety of machine learning models to classification tasks. This article explores the effects of text augmentation on classification accuracy using traditional machine learning algorithms, rather than more complex deep learning architectures.
Selecting Traditional Machine Learning Models
In our research, we focused on six traditional machine learning algorithms: Multi-Layer Perceptron (MLP), Random Forest (RF), Gradient-Boosted Trees (GBT), K-Nearest Neighbors (kNN), Decision Trees (DT), and Naive Bayes (NB). We chose these models because they typically require less training time and less data than deep learning models, which makes it practical to run the many experiments needed to determine whether text data augmentation improves classification outcomes.
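For readers who want to reproduce the setup, all six algorithms are available in scikit-learn. A minimal sketch of the lineup; the settings shown are library defaults rather than the tuned values from Table 3, and the Naive Bayes variant is our assumption:

    # The six traditional classifiers, instantiated with scikit-learn defaults.
    from sklearn.neural_network import MLPClassifier
    from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
    from sklearn.neighbors import KNeighborsClassifier
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.naive_bayes import GaussianNB  # assumed variant; works on dense LSA features

    models = {
        "MLP": MLPClassifier(),
        "RF": RandomForestClassifier(),
        "GBT": GradientBoostingClassifier(),
        "kNN": KNeighborsClassifier(),
        "DT": DecisionTreeClassifier(),
        "NB": GaussianNB(),
    }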
Methodological Framework
For our experiments, we utilized two well-established techniques for transforming text into vector representations: Bag of Words (BoW) and sBERT embeddings. Each model underwent hyperparameter optimization to ensure a fair comparison; see Table 3 for an overview of the hyperparameter ranges searched.
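To make the BoW route concrete, a minimal sketch using scikit-learn's CountVectorizer (the sBERT route is sketched in the section on sBERT representations below):

    # Bag of Words: sparse token-count vectors, one column per vocabulary term.
    from sklearn.feature_extraction.text import CountVectorizer

    texts = ["first example document", "second example document"]  # placeholder corpus
    vectorizer = CountVectorizer()
    X_bow = vectorizer.fit_transform(texts)
    print(X_bow.shape)  # (n_documents, vocabulary_size)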
All algorithms were trained and tested using stratified k-fold cross-validation with k = 5, ensuring consistent evaluation conditions. Model performance was assessed using four metrics: accuracy, recall, precision, and F1 score.
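A minimal sketch of this evaluation loop, assuming scikit-learn; macro averaging for the multi-class metrics is our assumption:

    # Stratified 5-fold cross-validation with the four reported metrics.
    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import StratifiedKFold, cross_validate

    X, y = make_classification(n_samples=200, random_state=0)  # stand-in data
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
    scoring = ["accuracy", "recall_macro", "precision_macro", "f1_macro"]
    scores = cross_validate(RandomForestClassifier(random_state=0), X, y,
                            cv=cv, scoring=scoring)
    print({m: scores[f"test_{m}"].mean() for m in scoring})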
Training Details
Throughout our experiments, a total of 15,296 models were trained and evaluated in search of the highest-accuracy configuration for each algorithm. The datasets used are detailed in Table 2.
Preliminary Research: Dimensionality Reduction for BoW Models
A significant challenge when processing text data is the high dimensionality of the resulting vectors. Based on prior research by Stefanovic and Kurasova, we implemented several filters to preprocess our textual data: removing numbers and punctuation, converting tokens to lowercase, eliminating stop words, and enforcing a minimum token length of three characters.
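A minimal sketch of these filters, as a regex-based implementation of our own rather than the authors' exact pipeline:

    # Preprocessing: lowercase, drop numbers and punctuation, drop stop words,
    # and keep only tokens of three or more characters.
    import re
    from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS

    def preprocess(text):
        tokens = re.findall(r"[a-z]+", text.lower())  # lowercasing strips digits/punctuation
        return [t for t in tokens
                if len(t) >= 3 and t not in ENGLISH_STOP_WORDS]

    print(preprocess("The 2 quick brown foxes jumped!"))
    # -> ['quick', 'brown', 'foxes', 'jumped']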
Initially, the BoW technique produced vectors of roughly 10,000 dimensions. Such high dimensionality can stall model training, especially during hyperparameter optimization. Consequently, we conducted preliminary research using Latent Semantic Analysis (LSA) to reduce dimensionality, a common practice in high-dimensional data analysis.
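In scikit-learn terms, LSA amounts to a truncated SVD of the BoW matrix; a minimal sketch with a synthetic stand-in for the ~10,000-dimensional input:

    # LSA dimensionality reduction via truncated SVD.
    from scipy.sparse import random as sparse_random
    from sklearn.decomposition import TruncatedSVD

    X_bow = sparse_random(500, 10_000, density=0.01, random_state=0)  # stand-in BoW matrix
    lsa = TruncatedSVD(n_components=40, random_state=0)  # 40 dims, per the experiments below
    X_reduced = lsa.fit_transform(X_bow)
    print(X_reduced.shape)  # (500, 40)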
Experimenting with Dimensionality
In this preliminary phase, we trained MLP, RF, and kNN models both with and without dimensionality reduction. Reducing the data to 40 dimensions significantly improved accuracy for MLP and kNN, while the RF model remained stable. A paired t-test confirmed that the differences for MLP and kNN were statistically significant, so we adopted the dimensionality reduction strategy for the main research phase.
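A minimal sketch of the statistical check, assuming per-fold accuracies from the 5-fold runs; the numbers below are illustrative, not the paper's:

    # Paired t-test on per-fold accuracies with and without LSA reduction.
    from scipy.stats import ttest_rel

    acc_raw = [0.81, 0.80, 0.82, 0.79, 0.81]  # illustrative per-fold scores
    acc_lsa = [0.86, 0.85, 0.87, 0.84, 0.86]
    stat, p = ttest_rel(acc_lsa, acc_raw)
    print(f"t = {stat:.2f}, p = {p:.4f}")  # p < 0.05 -> statistically significant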
Main Research: Gen-AI Augmented Dataset Impact on Classification Accuracy
Armed with the reduced-dimensionality datasets, we ran the principal experiments with all six machine learning models and the BoW method. On the original, unaugmented datasets, accuracy ranged from 82.28% (kNN) to 87.60% (RF) for two-class tasks, with slightly lower scores for three-class tasks.
Subsequently, we employed Gen-AI tools for text augmentation. Accuracy noticeably increased for four of the six algorithms: MLP, RF, GBT, and kNN. In contrast, DT and NB showed decreased accuracy after augmentation.
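The article does not reproduce the prompts used; purely as an illustration, a paraphrase-style augmentation loop with the OpenAI Python SDK might look like the following, where the client, model name, and prompt are all our assumptions:

    # Illustrative only: paraphrase-based augmentation via an LLM API.
    from openai import OpenAI

    client = OpenAI()  # reads OPENAI_API_KEY from the environment

    def augment(text):
        resp = client.chat.completions.create(
            model="gpt-4o-mini",  # placeholder model name, not the paper's setup
            messages=[{"role": "user",
                       "content": f"Paraphrase, preserving meaning:\n{text}"}],
        )
        return resp.choices[0].message.content

    original_texts = ["an example training document"]  # placeholder corpus
    augmented = [augment(t) for t in original_texts]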
Comparative Analysis of Results
The comparative results for the original dataset and the augmented subsets are striking. As shown in Figure 3, the highest accuracy was achieved when the ChatGPT and Copilot tools were used together; this combination yielded the largest performance gains across multiple model configurations. Augmentation with the Gemini tool showed the least impact.
Hyperparameter Insights
Augmentation with the ChatGPT and Copilot tools often pushed accuracy above 90%. As Table 6 indicates, the tuned hyperparameters varied with the data: MLP benefited from more training iterations and neurons, while RF achieved its best results with ensembles of 100-850 trees, depending on dataset size.
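A sketch of the kind of search behind Table 6; only the RF tree count range (100-850) comes from the text, and the other values are illustrative:

    # Grid search over the hyperparameters discussed above.
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import GridSearchCV
    from sklearn.neural_network import MLPClassifier

    rf_search = GridSearchCV(RandomForestClassifier(random_state=0),
                             param_grid={"n_estimators": [100, 350, 600, 850]},
                             cv=5, scoring="accuracy")
    mlp_search = GridSearchCV(MLPClassifier(random_state=0),
                              param_grid={"hidden_layer_sizes": [(50,), (100,)],  # assumed
                                          "max_iter": [200, 500, 1000]},          # assumed
                              cv=5, scoring="accuracy")
    # rf_search.fit(X, y); print(rf_search.best_params_)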
Exploring sBERT Representations
In a parallel approach, we examined sBERT embeddings, which produce vectors with 384 dimensions. Interestingly, the accuracy achieved with sBERT generally lagged behind that of the BoW technique, with notable drops for the DT and NB algorithms in particular.
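A minimal sketch with the sentence-transformers library; the all-MiniLM-L6-v2 checkpoint is our assumption, chosen because it matches the reported 384 dimensions:

    # sBERT embeddings: dense 384-dimensional sentence vectors.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed checkpoint
    embeddings = model.encode(["first example document", "second example document"])
    print(embeddings.shape)  # (2, 384)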
Augmented Data Impact Observations
Augmentation nonetheless produced remarkable accuracy improvements, especially for kNN: on subsets augmented with the ChatGPT and Copilot tools, kNN recorded gains exceeding 20%. Conversely, certain NB configurations performed worse, often declining after augmentation.
Final Insights on Accuracy Metrics
Table 9 and Table 10 present the final accuracy results of our experiments; the highest scores were predominantly achieved with kNN. MLP models also performed strongly, especially when hyperparameter optimization allowed a higher number of training iterations.
Conclusion
Our exploration of text augmentation reveals a productive interplay between traditional machine learning models and Gen-AI-based augmentation techniques. The findings suggest that such augmentation can enhance the classification performance of classical models. As the research landscape continues to evolve, understanding these dynamics will be crucial for future developments in machine learning methodology.