Saturday, July 19, 2025

Predicting Air Quality and Pollution with Machine Learning Insights

Understanding the Journey of Air Quality Index Prediction Through Machine Learning

Introduction to Air Quality Index (AQI)

Air quality is a growing concern worldwide, especially in urban areas where pollution levels can soar. To report these levels in a standardized way, many countries have adopted the Air Quality Index (AQI). The AQI gives a simplified representation of air quality and alerts the public when pollutant levels may pose health risks. This article describes a research study that predicts the AQI using a structured, three-stage framework.

Stage 1: Data Preparation and Processing

The foundation of any predictive model is its data. This study used the Real-time Dataset of Air Pollution Monitoring Generated Using IoT, available on Mendeley Data. Collected hourly in Gazipur, Bangladesh, over a full year (January 1, 2022, to December 31, 2022), the dataset covers six pollutants: PM2.5, PM10, Carbon Monoxide (CO), Nitrogen Dioxide (NO2), Sulfur Dioxide (SO2), and Ozone (O3).

The AQI was calculated following the methodology of the U.S. Environmental Protection Agency (EPA), which uses linear interpolation, together with the national air quality breakpoints established by Bangladesh’s Department of Environment. The sub-index for each pollutant was derived using the following formula:

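\[
I_p = \frac{I_{high} - I_{low}}{C_{high} - C_{low}}\left(C_p - C_{low}\right) + I_{low}
\]

Here, \(I_p\) denotes the AQI value specific to a pollutant, while \(C_p\) represents its measured concentration. \(C_{low}\) and \(C_{high}\) are the concentration breakpoints that bracket \(C_p\), and \(I_{low}\) and \(I_{high}\) are the index values corresponding to those breakpoints.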
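The interpolation can be sketched in a few lines of Python. This is a minimal illustration, not the study's implementation: the function name `sub_index` and the breakpoint values in `EXAMPLE_BREAKPOINTS` are placeholders invented for the example and are not the Bangladesh Department of Environment table.

```python
# Placeholder breakpoint bands: (C_low, C_high, I_low, I_high).
# These numbers are invented for illustration, NOT the official table.
EXAMPLE_BREAKPOINTS = [
    (0.0, 50.0, 0, 50),
    (50.0, 100.0, 51, 100),
    (100.0, 200.0, 101, 200),
]


def sub_index(concentration, breakpoints):
    """Linearly interpolate a pollutant sub-index I_p from its concentration C_p."""
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= concentration <= c_high:
            slope = (i_high - i_low) / (c_high - c_low)
            return slope * (concentration - c_low) + i_low
    raise ValueError(f"concentration {concentration} is outside the breakpoint table")


# A concentration of 75 falls in the second placeholder band.
example_value = sub_index(75.0, EXAMPLE_BREAKPOINTS)
```

With a real breakpoint table, each pollutant would get its own list of bands, and the same function would be called once per pollutant per hour.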

Outlier Detection and Data Normalization

Data quality is essential. The study used box plots to identify and remove outliers in the pollutant concentration values, with each plot showing the distribution of a pollutant’s measurements in its original units. After outlier removal, the dataset was normalized with min-max scaling to bring all variables onto a comparable scale, preparing it for the machine learning algorithms.
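As a rough sketch of these two steps, the snippet below applies the conventional box-plot rule (discard points beyond 1.5 times the interquartile range from the quartiles) and then min-max scaling. The 1.5 whisker factor, the function names, and the sample readings are assumptions for illustration; the study does not state its exact threshold, and different tools compute quartiles slightly differently.

```python
import statistics


def remove_boxplot_outliers(values, whisker=1.5):
    """Keep values inside [Q1 - whisker*IQR, Q3 + whisker*IQR]."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    lower, upper = q1 - whisker * iqr, q3 + whisker * iqr
    return [v for v in values if lower <= v <= upper]


def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant column carries no spread; map everything to 0.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical hourly PM2.5 readings with one obvious spike.
pm25 = [35.0, 40.0, 38.0, 42.0, 37.0, 41.0, 39.0, 400.0]
cleaned = remove_boxplot_outliers(pm25)
scaled = min_max_scale(cleaned)
```

Scaling after outlier removal matters: a single extreme reading would otherwise compress all other values into a narrow band near zero.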

The dataset was divided into 80% for training the models and 20% for testing them. To improve generalizability, this division was repeated multiple times and combined with tenfold cross-validation, which partitions the data into ten folds and uses each fold once for validation while training on the remaining nine.
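The partitioning logic can be illustrated with index lists alone. In this sketch the helper names, the fixed random seed, and the choice to run cross-validation on the training portion only are assumptions; the source does not spell out how the split and the folds were combined.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle indices 0..n_samples-1 and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(indices, k=10):
    """Yield (train, validation) index lists; each index is validated exactly once."""
    folds = [indices[i::k] for i in range(k)]
    for i, validation in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation


# 100 hypothetical samples: 80 for training, 20 held out for testing.
train_idx, test_idx = train_test_split_indices(n_samples=100)
folds = list(k_fold_indices(train_idx, k=10))
```

Repeating the split with different seeds, as the study describes, would amount to calling `train_test_split_indices` several times and averaging the resulting scores.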

Feature Importance Evaluation

To pinpoint the most impactful variables for AQI prediction, a Random Forest algorithm was employed. This technique effectively captures nonlinear interactions between variables. The analysis uncovered PM2.5 as the most influential input with an importance score of 12.6654, followed by PM10 and CO. Although PM2.5 and PM10 showed a moderate correlation, both were retained due to their unique contributions.

Stage 2: AQI Calculation According to Standards

The AQI was calculated from the breakpoints listed in Table 1, as set by the Department of Environment in Bangladesh. A sub-index was computed for each pollutant, and the overall AQI was taken as the maximum of the six sub-indices. Taking the maximum ensures that the pollutant of greatest concern at a given time determines the reported air quality level.
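The maximum-of-sub-indices rule is simple enough to show directly. The sub-index values below are hypothetical and the function name `overall_aqi` is chosen for this example only.

```python
def overall_aqi(sub_indices):
    """Return the overall AQI and the pollutant whose sub-index determines it."""
    dominant = max(sub_indices, key=sub_indices.get)
    return sub_indices[dominant], dominant


# Hypothetical sub-index values for a single hour.
hourly_sub_indices = {
    "PM2.5": 162.0,
    "PM10": 118.0,
    "CO": 41.0,
    "NO2": 35.0,
    "SO2": 12.0,
    "O3": 57.0,
}
aqi, dominant_pollutant = overall_aqi(hourly_sub_indices)
```

Returning the dominant pollutant alongside the index is a convenience added here; it makes it easy to see which pollutant drives the reported value in a given hour.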

Stage 3: Developing and Evaluating Machine Learning Models

For the final stage, the study used MATLAB’s Regression Learner app, which simplifies the development and analysis of predictive models. Several regression techniques were employed, including Gaussian Process Regression (GPR), Ensemble Regression (ER), Support Vector Machines (SVM), Regression Trees (RT), and Kernel Regression (KAR), each chosen for its theoretical suitability and prior successes.

Hyperparameter Tuning

The models underwent hyperparameter tuning and were trained on standardized inputs from the three leading pollutants: PM2.5, CO, and PM10. To guard against overfitting, tenfold cross-validation was conducted, so that each model was evaluated on data it had not seen during training.

Performance Evaluation of Machine Learning Models

Evaluating model performance is essential for gauging the effectiveness of machine learning in AQI prediction. The Regression Learner app reports the three key metrics used here:

  1. Mean Absolute Error (MAE): The average absolute deviation between predicted and actual values, showing how close predictions are to observations on average.
    \[
    \mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|x_i - y_i\right|
    \]

  2. Root Mean Square Error (RMSE): The square root of the mean squared difference between predicted and actual values. Because errors are squared before averaging, large errors weigh more heavily than they do in MAE.
    \[
    \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(x_i - y_i\right)^{2}}
    \]

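  3. Coefficient of Determination (R²): The proportion of variance in the actual data that is captured by the model’s predictions. Values typically range between 0 and 1, with higher values indicating better performance; the value can be negative when a model fits worse than simply predicting the mean.
    \[
    R^{2} = 1 - \frac{\sum_{i=1}^{n}\left(x_i - y_i\right)^{2}}{\sum_{i=1}^{n}\left(x_i - \overline{x}\right)^{2}}
    \]
    Here \(x_i\) is an actual value, \(y_i\) the corresponding prediction, and \(\overline{x}\) the mean of the actual values.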

Together, these metrics provide a comprehensive insight into the model’s performance, guiding the assessment of how well different models can predict AQI.
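For readers who want to check the three metrics by hand, they can be computed with the Python standard library as follows. The AQI values are made up for the example, and the function names are chosen here rather than taken from any tool used in the study.

```python
import math


def mae(actual, predicted):
    """Mean absolute error between actual and predicted values."""
    return sum(abs(x - y) for x, y in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Square root of the mean squared error."""
    squared = sum((x - y) ** 2 for x, y in zip(actual, predicted))
    return math.sqrt(squared / len(actual))


def r_squared(actual, predicted):
    """One minus the ratio of residual to total sum of squares."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((x - y) ** 2 for x, y in zip(actual, predicted))
    ss_tot = sum((x - mean_actual) ** 2 for x in actual)
    return 1 - ss_res / ss_tot


# Hypothetical actual and predicted AQI values.
actual_aqi = [100.0, 150.0, 200.0, 250.0]
predicted_aqi = [110.0, 140.0, 210.0, 240.0]

scores = {
    "MAE": mae(actual_aqi, predicted_aqi),
    "RMSE": rmse(actual_aqi, predicted_aqi),
    "R2": r_squared(actual_aqi, predicted_aqi),
}
```

In this example every prediction is off by exactly 10, so MAE and RMSE coincide; they diverge once some errors are much larger than others.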


The exploration of AQI prediction through machine learning not only showcases the power of technology in environmental research but also emphasizes the importance of robust data preparation and model evaluation processes in achieving accurate and reliable predictions. This structured approach sets a precedent for future studies in environmental modeling, reinforcing the importance of interdisciplinary methods in addressing global challenges like air quality.
