Dataset Construction in the Age of Deep Learning
An Overview of Data Acquisition and Labeling
In machine learning, the work begins with a crucial step: dataset construction. For our project, data acquisition and labeling followed the ethical guidelines of the World Medical Association Declaration of Helsinki. With approval from multiple ethics committees, including those of prominent hospitals in China, we laid the foundation for a robust dataset aimed at advancing research in ophthalmology.
The PSMM Dataset
Our research produced the PSMM dataset, which draws on five sub-sources: ShenzhenEye, SUSTech, LishuiR, Zhongshan, and LishuiZ. Together, they provide a diverse set of ultra-widefield (UWF) images collected from several hospitals over different periods between 2019 and 2023.
- ShenzhenEye: Contains 38,922 UWF images from 4,003 patients, collected from Shenzhen Eye Hospital between January 1, 2019, and December 31, 2023.
- SUSTech: Comprises 2,835 images from 226 patients, collected from Southern University of Science and Technology Hospital between January 1, 2023, and June 30, 2023.
- LishuiR: Has 938 images from 155 patients, gathered from Lishui People’s Hospital from 2021 to 2023.
- Zhongshan: Encompasses 456 images from 85 patients at Zhongshan Ophthalmic Center.
- LishuiZ: Includes 220 images from 91 patients from Lishui Central Hospital, collected over a timeframe similar to that of LishuiR.
Upon integration, the PSMM dataset features a total of 43,371 UWF images.
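As a quick sanity check, the per-source counts listed above can be summed to confirm the stated total. The patient total and the images-per-patient figure below are derived values, and they assume that no patient appears in more than one sub-source, which the text does not state.

```python
# Image and patient counts per sub-source, copied from the list above.
sources = {
    "ShenzhenEye": {"images": 38_922, "patients": 4_003},
    "SUSTech": {"images": 2_835, "patients": 226},
    "LishuiR": {"images": 938, "patients": 155},
    "Zhongshan": {"images": 456, "patients": 85},
    "LishuiZ": {"images": 220, "patients": 91},
}

total_images = sum(s["images"] for s in sources.values())
# Assumption: patients are not shared between hospitals.
total_patients = sum(s["patients"] for s in sources.values())
images_per_patient = total_images / total_patients

print(total_images)    # 43371, matching the stated total
print(total_patients)  # 4560 under the no-overlap assumption
print(round(images_per_patient, 1))
```

ShenzhenEye alone contributes roughly 90% of the images, so any statistic computed on the pooled data is dominated by that source.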
Data Processing: The Blueprint for Model Training
To ensure quality and reliability, we established a careful data processing pipeline. It de-identified the data to preserve patient privacy and centered the region of interest by cropping away the unnecessary black borders around each image. The images were then resized to a common size to streamline model training.
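The cropping and resizing steps can be sketched as follows. This is a minimal illustration, not the pipeline actually used: the function names, the intensity threshold of 10, the nearest-neighbour interpolation, and the 32x32 target size are all assumptions chosen for the example.

```python
import numpy as np

def crop_black_border(image: np.ndarray, threshold: int = 10) -> np.ndarray:
    """Crop away outer rows and columns whose pixels are all <= threshold."""
    # For colour images, a pixel counts as content if any channel is bright.
    gray = image.max(axis=2) if image.ndim == 3 else image
    mask = gray > threshold
    if not mask.any():
        return image  # entirely black: nothing sensible to crop
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return image[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]

def resize_nearest(image: np.ndarray, size: tuple) -> np.ndarray:
    """Nearest-neighbour resize to (height, width)."""
    out_h, out_w = size
    in_h, in_w = image.shape[:2]
    row_idx = np.arange(out_h) * in_h // out_h
    col_idx = np.arange(out_w) * in_w // out_w
    return image[row_idx][:, col_idx]

# Synthetic example: a bright 60x60 region inside a black 100x120 frame.
demo = np.zeros((100, 120, 3), dtype=np.uint8)
demo[20:80, 30:90] = 200
cropped = crop_black_border(demo)
resized = resize_nearest(cropped, (32, 32))
print(cropped.shape, resized.shape)
```

In practice an image library with proper interpolation would likely replace `resize_nearest`; it is written out here only to keep the sketch dependency-light.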
Following this, we structured the dataset in the PASCAL Visual Object Classes Challenge (PASCAL VOC) 2007 format. This layout not only follows a widely used convention but also makes the dataset easier to use in a range of deep learning tasks.
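For readers unfamiliar with the VOC 2007 layout: images live in a JPEGImages directory, per-image XML files in Annotations, and split definitions in ImageSets/Main, where each per-class file lists an image id followed by 1 (present) or -1 (absent). The sketch below writes such split files for a multi-label setting. Which parts of the VOC layout the PSMM dataset actually populates is not stated in the text; the helper name, image ids, and class names here are hypothetical.

```python
import tempfile
from pathlib import Path

def write_voc_image_sets(root, split, labels, classes):
    """Write VOC-style ImageSets/Main files for one split.

    labels maps image id -> set of positive class names (multi-label).
    """
    main_dir = Path(root) / "ImageSets" / "Main"
    main_dir.mkdir(parents=True, exist_ok=True)
    image_ids = sorted(labels)
    # <split>.txt lists every image id in the split, one per line.
    (main_dir / f"{split}.txt").write_text("".join(f"{i}\n" for i in image_ids))
    # <class>_<split>.txt marks each image as 1 (present) or -1 (absent).
    for cls in classes:
        lines = []
        for image_id in image_ids:
            flag = 1 if cls in labels[image_id] else -1
            lines.append(f"{image_id} {flag:2d}\n")
        (main_dir / f"{cls}_{split}.txt").write_text("".join(lines))
    return main_dir

# Hypothetical ids and class names, for illustration only.
demo_root = Path(tempfile.mkdtemp()) / "PSMM_VOC"
demo_labels = {
    "img_0001": {"staphyloma"},
    "img_0002": set(),
    "img_0003": {"staphyloma", "maculopathy"},
}
demo_main = write_voc_image_sets(
    demo_root, "train", demo_labels, ["staphyloma", "maculopathy"]
)
print((demo_main / "staphyloma_train.txt").read_text())
```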
Partitioning the Dataset
Data partitioning is a critical step in preparing datasets for machine learning. Because our dataset contains multiple images per patient, we used a stratified partitioning approach so that every partition reflects the underlying data distribution. The resulting training, development, and test sets follow a 7:1.5:1.5 ratio, that is, 70%, 15%, and 15% of the data.
Although the learning task is multi-label, we assigned a single class label to each patient for the purpose of partitioning, which makes stratification much simpler. The aim is for each split to stay representative of the cases a model would meet in practice, so that results on the development and test sets remain meaningful.
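A stratified 7:1.5:1.5 split over patients can be sketched as below. The sketch assumes that splitting is done at the patient level, so that all images from one patient land in the same partition, and that each patient carries one class label as described above; the function name, the seed, and the toy labels are hypothetical.

```python
import random
from collections import defaultdict

def split_patients(patient_labels, ratios=(0.7, 0.15, 0.15), seed=0):
    """Stratified split of patient ids into train/dev/test.

    patient_labels maps patient id -> a single class label.
    """
    by_label = defaultdict(list)
    for pid, label in patient_labels.items():
        by_label[label].append(pid)

    rng = random.Random(seed)
    splits = {"train": [], "dev": [], "test": []}
    for label in sorted(by_label):
        pids = sorted(by_label[label])
        rng.shuffle(pids)  # seeded, so the split is reproducible
        n_train = round(len(pids) * ratios[0])
        n_dev = round(len(pids) * ratios[1])
        splits["train"] += pids[:n_train]
        splits["dev"] += pids[n_train:n_train + n_dev]
        splits["test"] += pids[n_train + n_dev:]  # remainder goes to test
    return splits

# Toy data: 200 patients across three hypothetical classes (100/60/40).
toy_labels = {f"p{i:03d}": 0 for i in range(100)}
toy_labels.update({f"p{i:03d}": 1 for i in range(100, 160)})
toy_labels.update({f"p{i:03d}": 2 for i in range(160, 200)})

splits = split_patients(toy_labels)
print({name: len(pids) for name, pids in splits.items()})
```

Each image then inherits the partition of its patient, which keeps images of the same eye or person from leaking between training and evaluation sets.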
Ensuring Accurate Annotations: Rigor in Labeling
Annotation is the backbone of successful machine learning applications. We engaged two junior ophthalmologists to label our UWF images, with a senior specialist conducting thorough reviews. This rigorous quality assurance mechanism helped filter out distorted or damaged images. Each image was evaluated for the presence of posterior staphyloma and categorized based on distinct myopic maculopathy features.
Practical Considerations
An essential aspect of the annotation process was retaining findings that can confound diagnosis, such as laser scars and choroidal nevi. Keeping these atypical findings lets the dataset mirror real-world clinical scenarios more closely, at the cost of some risk of label noise.
To evaluate inter-rater reliability, we computed Cohen’s kappa coefficients, which ranged from 0.78 to 0.86 across categories. On the commonly used Landis and Koch scale, this corresponds to substantial to almost perfect agreement between annotators.
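Cohen's kappa compares the observed agreement between two raters with the agreement expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below computes it for one category; the function name and the ten toy labels are invented for illustration and are not taken from the PSMM annotations.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty item list")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    count_a = Counter(rater_a)
    count_b = Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)

# Toy example: presence (1) or absence (0) of a finding in ten images.
junior_1 = [1, 1, 1, 1, 0, 0, 0, 0, 1, 0]
junior_2 = [1, 1, 1, 0, 0, 0, 0, 0, 1, 1]
kappa = cohens_kappa(junior_1, junior_2)
print(round(kappa, 2))  # 0.6: observed 0.8, chance 0.5
```

In the toy example the raters agree on 8 of 10 images, yet kappa is only 0.6 because half of that agreement would be expected by chance alone.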
The Importance of Ethical Compliance
The ethical implications of our work cannot be overstated. All data handling adhered strictly to patient privacy laws and ethical guidelines. Because the study was retrospective and all images were de-identified, the requirement for informed consent was waived.
Concluding Steps for Robust Model Development
With our dataset in place, carefully annotated, and partitioned, we were ready to develop our model, RealMNet. It combines lightweight frameworks such as TinyViT with enhancements such as cost-sensitive calibration and classifier adaptations, with the aim of addressing the many facets of pathologic myopia.
In the subsequent sections, we’ll dive into the intricacies of our model architecture, the experimental protocols employed, and the evaluation metrics that allow us to assess performance comprehensively. Stay tuned for an in-depth exploration of these elements that together drive our mission toward understanding and diagnosing ocular pathologies more effectively.