Proposed Methodology for Predicting Traffic Accident Severity
Overview of Methodology
The proposed methodology follows a two-step approach to analyzing the severity of traffic accidents, illustrated in Fig. 1. The first step identifies and ranks the dominant parameters affecting traffic accidents. The second step builds five machine learning models to predict traffic accident severity: Boosted Regression Trees (BRT), Artificial Neural Networks (ANN), Support Vector Machines (SVM), Naïve Bayes (NVB), and Logistic Regression (LGR). Each step builds on the previous one, providing a framework for understanding and potentially mitigating traffic accidents.
Data Description
The dataset spans 2018 to 2022 and covers 14 cities in the Eastern Province of Saudi Arabia. It includes 9,548 traffic accidents involving 17,100 vehicles, which led to 2,527 fatalities and 8,020 injuries during the analyzed period. The figures were sourced from the General Directorate of Traffic, a division of the Ministry of Interior in Saudi Arabia, and are based on official accident reports from police departments across the region.
Types of Accidents
The accidents are categorized into 19 distinct types. The most frequent type is moving vehicle accidents, accounting for 3,665 cases, followed by vehicle overturns and run-over accidents, both of which resulted in significant fatalities and injuries.
| Accident Type | Cases | Fatalities | Injuries |
|---|---|---|---|
| Moving Vehicle Accidents | 3,665 | 871 | 3,272 |
| Vehicle Overturns | 640 | 1,809 | |
| Run-Over Accidents | 503 | 1,794 | |
| Waste Container Accidents | 0 | 0 | |
| Bridge Falls | 2 | | |
| Hit Traffic Light Accidents | 0 | 0 | |
This categorization allows for the identification of targeted prevention measures.
Causative Factors
A second table summarizes 39 causative factors behind the accidents. Key culprits include:
- Swerve Driving: 28.1% of accidents, resulting in 765 fatalities.
- Distracted Driving: 18.8% of accidents, resulting in 453 fatalities.
- Pedestrian Violations: 10.2% of accidents, resulting in 255 fatalities.
By understanding these factors, stakeholders can work towards implementing more effective safety measures.
Feature Selection and Ranking of Parameters
Feature selection improves the efficiency and effectiveness of machine learning models. This study uses three algorithms to rank and identify the critical parameters associated with traffic accidents: Minimum Redundancy Maximum Relevance (MRMR), Chi-square, and Kruskal-Wallis.
The top-ranked parameters, according to MRMR, Chi-square, and Kruskal-Wallis, consist of:
- Number of Vehicles Involved
- Accident City
- Number of Injured People
Notably, the correlation matrix indicates no strong relationships between the parameters, reducing the risk of collinearity.
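As a rough illustration of how a chi-square ranking works, the sketch below scores two categorical features against a binary severity label and sorts them by their statistic. The records, the feature names, and the helper `chi_square_stat` are hypothetical and are not taken from the study's dataset. MRMR and Kruskal-Wallis would produce rankings in a similar way, each with its own scoring function.

```python
from collections import Counter

def chi_square_stat(feature, label):
    """Pearson chi-square statistic for two equal-length categorical sequences."""
    n = len(feature)
    observed = Counter(zip(feature, label))   # joint counts per (feature, label) cell
    f_totals = Counter(feature)               # row totals
    l_totals = Counter(label)                 # column totals
    stat = 0.0
    for f, f_count in f_totals.items():
        for l, l_count in l_totals.items():
            expected = f_count * l_count / n  # count expected under independence
            stat += (observed[(f, l)] - expected) ** 2 / expected
    return stat

# Hypothetical toy records (not from the study's dataset).
severity = ["fatal", "fatal", "fatal", "injury", "injury", "injury", "injury", "fatal"]
features = {
    "vehicles_involved": ["2+", "2+", "2+", "1", "1", "1", "1", "2+"],
    "weekday": ["mon", "tue", "mon", "tue", "mon", "tue", "mon", "tue"],
}

scores = {name: chi_square_stat(values, severity) for name, values in features.items()}
ranking = sorted(scores, key=scores.get, reverse=True)  # highest statistic first
print(ranking)
```

A larger statistic indicates a stronger association between the feature and the severity label, so in this toy example `vehicles_involved` ranks above `weekday`.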
Machine Learning Models Employed
Support Vector Machine (SVM)
SVM models are recognized for their efficiency in handling complex data. An SVM constructs an optimal hyperplane that separates data points into categories while maximizing the margin between them. A wider margin tends to reduce classification errors on unseen data, which is valuable when predicting the severity of traffic accidents.
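The hyperplane and margin described above can be sketched for the linear case as follows. The weight vector, bias, and points are hypothetical values chosen for illustration; the study's kernel and fitted parameters are not given in the text, so this shows only how a trained linear SVM classifies a point and how its margin width is computed.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the predicted side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def margin_width(w):
    """Distance between the two supporting hyperplanes: 2 / ||w||."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

# Hypothetical hyperplane and points (illustrative values only).
w, b = [3.0, 4.0], -5.0
points = [[3.0, 1.0], [0.0, 0.0]]

labels = [1 if decision_value(w, b, p) >= 0 else -1 for p in points]
print(labels, margin_width(w))
```

Training an SVM amounts to choosing `w` and `b` so that the margin is as wide as possible while the training points fall on the correct side.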
Artificial Neural Networks (ANN)
Drawing inspiration from the human brain, ANNs excel at learning complex patterns from data. The Feedforward Neural Network (FFNN) employed in this study consists of multiple layers, and backpropagation adjusts the weights between them to improve accuracy. This adaptability makes ANNs particularly useful for varied datasets.
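To make the role of backpropagation concrete, the sketch below trains a very small feedforward network (two inputs, two sigmoid hidden units, one sigmoid output) on a single sample. The architecture, initial weights, learning rate, and squared-error loss are all assumptions made for illustration; the text does not specify the study's network configuration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Return the hidden activations and the output of a one-hidden-layer network."""
    h = [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b) for row, b in zip(W1, b1)]
    y = sigmoid(sum(w * hi for w, hi in zip(W2, h)) + b2)
    return h, y

def train_step(x, target, W1, b1, W2, b2, lr=0.5):
    """One backpropagation update on the squared error 0.5 * (y - target)^2."""
    h, y = forward(x, W1, b1, W2, b2)
    # Error signal at the output pre-activation.
    delta_out = (y - target) * y * (1.0 - y)
    # Error signals propagated back to each hidden unit (uses the old W2).
    delta_hid = [delta_out * W2[j] * h[j] * (1.0 - h[j]) for j in range(len(h))]
    W2_new = [W2[j] - lr * delta_out * h[j] for j in range(len(h))]
    b2_new = b2 - lr * delta_out
    W1_new = [[W1[j][i] - lr * delta_hid[j] * x[i] for i in range(len(x))]
              for j in range(len(h))]
    b1_new = [b1[j] - lr * delta_hid[j] for j in range(len(h))]
    return W1_new, b1_new, W2_new, b2_new

# Hypothetical initial weights and a single normalized sample.
W1, b1 = [[0.1, -0.2], [0.3, 0.4]], [0.0, 0.0]
W2, b2 = [0.5, -0.5], 0.0
x, target = [1.0, 0.5], 1.0

losses = []
for _ in range(50):
    _, y = forward(x, W1, b1, W2, b2)
    losses.append(0.5 * (y - target) ** 2)
    W1, b1, W2, b2 = train_step(x, target, W1, b1, W2, b2)
print(losses[0], losses[-1])
```

The loss after the final update is lower than at the start, which is the effect the weight adjustments are meant to achieve.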
Boosted Regression Trees (BRT)
BRT splits the dataset at each node by choosing the division that minimizes the Gini index. The resulting trees can be read as "if-then" rules, which makes the classification process more transparent.
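As a minimal sketch of the splitting criterion, the code below computes the Gini index of a node and the size-weighted Gini index of one candidate split. The labels and the rule "vehicles involved >= 2" are hypothetical; a tree learner would evaluate many such candidate rules and keep the one with the lowest weighted value.

```python
from collections import Counter

def gini(labels):
    """Gini index of a node: 1 minus the sum of squared class proportions."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def split_gini(left, right):
    """Size-weighted Gini index of a candidate split into two child nodes."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)

# Hypothetical node, split by the rule "vehicles involved >= 2".
left = ["fatal", "fatal", "fatal", "injury"]      # rule is true
right = ["injury", "injury", "injury", "injury"]  # rule is false

parent_gini = gini(left + right)
child_gini = split_gini(left, right)
print(parent_gini, child_gini)
```

Because the weighted value after the split is lower than the parent's, this rule makes the child nodes purer and would be a reasonable candidate for an "if-then" rule.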
Naïve Bayes
Although Naïve Bayes assumes that features are independent given the class, it has proven effective in various classification tasks, particularly where correlations between features are not strong. It applies Bayes’ theorem to update prior class probabilities with the observed evidence.
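The sketch below shows the calculation for one accident record: each class prior is multiplied by the per-feature likelihoods (the independence assumption), and the products are normalized into posterior probabilities. All probabilities and feature names are hypothetical and are not estimated from the study's data.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior P(class | observed), assuming conditionally independent features."""
    joint = {}
    for cls, prior in priors.items():
        p = prior
        for feature, value in observed.items():
            p *= likelihoods[cls][feature][value]  # P(feature = value | class)
        joint[cls] = p
    total = sum(joint.values())
    return {cls: p / total for cls, p in joint.items()}

# Hypothetical probabilities (illustrative only).
priors = {"fatal": 0.25, "non_fatal": 0.75}
likelihoods = {
    "fatal": {"cause": {"swerve": 0.6, "other": 0.4},
              "vehicles": {"1": 0.5, "2+": 0.5}},
    "non_fatal": {"cause": {"swerve": 0.2, "other": 0.8},
                  "vehicles": {"1": 0.5, "2+": 0.5}},
}

posterior = naive_bayes_posterior(priors, likelihoods,
                                  {"cause": "swerve", "vehicles": "2+"})
print(posterior)
```

In this example the evidence "swerve" is three times as likely under the fatal class, which offsets the lower prior for that class.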
Logistic Regression
Binary Logistic Regression is employed due to its ability to handle binary outcomes, such as whether an accident results in a fatality. It is particularly adept at managing categorical explanatory variables.
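A minimal sketch of how a fitted binary logistic model turns categorical inputs into a fatal or non-fatal prediction is given below. The coefficients, intercept, one-hot encoded variables, and the 0.5 decision threshold are hypothetical; the study's fitted values are not reported in this section.

```python
import math

def predict_fatal_probability(coefs, intercept, x):
    """Logistic model: P(fatal) = 1 / (1 + exp(-(intercept + coefs . x)))."""
    z = intercept + sum(c * xi for c, xi in zip(coefs, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients for two one-hot encoded categorical variables,
# e.g. [cause == swerve, city == A]; the names are illustrative only.
coefs = [1.2, -0.3]
intercept = -0.5

p = predict_fatal_probability(coefs, intercept, [1, 1])
prediction = int(p >= 0.5)  # 1 = fatal, 0 = non-fatal
print(round(p, 3), prediction)
```

One-hot encoding is one common way to feed categorical explanatory variables into the model; each coefficient then shifts the log-odds of a fatality for its category.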
Data Processing and Performance Evaluation
Data Normalization
Normalization rescales all parameters to a common range so that no parameter dominates the others simply because of its larger numeric range. According to the methodology, it also reduces model complexity and helps training reach the global minimum more efficiently.
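The normalization formula is not reproduced in this section. As an assumption, the sketch below uses min-max scaling, a common choice that maps each parameter linearly onto the range 0 to 1; the study may have used a different formula. The column values are hypothetical.

```python
def min_max_normalize(values):
    """Rescale values linearly to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical column: number of vehicles involved per accident.
vehicles = [1, 2, 3, 5]
normalized = min_max_normalize(vehicles)
print(normalized)
```

Applying the same scaling to every parameter puts counts, codes, and other inputs on a comparable footing before model training.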
Performance Metrics
To evaluate the efficacy of each model, five metrics—accuracy, sensitivity, specificity, precision, and geometric mean (G-mean)—are employed. Each metric sheds light on different aspects of model performance, especially in the context of the often imbalanced nature of traffic accident data.
- Accuracy: The overall rate of correct predictions.
- Sensitivity: The ability to identify true positives, crucial for recognizing fatal accidents.
- Specificity: The ability to identify true negatives.
- Precision: The reliability of positive predictions.
- G-Mean: The geometric mean of sensitivity and specificity, which gives a balanced evaluation for datasets with major class imbalances.
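The five metrics follow directly from the counts of a binary confusion matrix. The sketch below uses the standard definitions, with fatal accidents as the positive class; the counts are hypothetical and were chosen to mimic an imbalanced test set.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Five metrics from binary confusion-matrix counts (positive = fatal)."""
    sensitivity = tp / (tp + fn)   # true positive rate
    specificity = tn / (tn + fp)   # true negative rate
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": tp / (tp + fp),
        "g_mean": math.sqrt(sensitivity * specificity),
    }

# Hypothetical counts: 100 fatal and 900 non-fatal accidents.
metrics = evaluate(tp=30, fp=20, tn=880, fn=70)
print(metrics)
```

With these counts the accuracy is 0.91 even though only 30% of fatal accidents are detected, which is why sensitivity and G-mean are reported alongside accuracy for imbalanced data.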
Through this structured methodology, the study aims to systematically uncover insights into traffic accident severity and to support informed strategies for improving road safety.

