Monitoring Your Baby: A Machine Learning Journey
As a parent, the peace of mind that comes from knowing your baby is safe can be priceless. Imagine being able to monitor your baby’s well-being, even from another room. This isn’t just a dream; it’s becoming a reality thanks to advances in technology. I had the opportunity to contribute to a pioneering project that aimed to provide parents with a mobile app to remotely monitor their babies. Our focus was on developing a machine learning model that could accurately detect when a baby is crying. Let me take you through the fascinating challenges and insights I encountered as the machine learning engineer on this project.
Setting the Baseline
When embarking on this journey, the initial steps were daunting. I had a clear goal: detect whether a baby was crying or not. But I faced a big question: where do I start, and what data do I need? The first task in a machine learning project is to “get a number” quickly: build a simple end-to-end system and measure its performance as a baseline. Fortunately, someone had previously tackled this problem and published their work on GitHub. Their code and training data offered a valuable jumping-off point, allowing me to focus on the intricacies rather than reinventing the wheel.
Understanding the Machine Learning System
The system I stumbled upon centered on processing audio files. It took 5-second audio clips, typically in .wav format, decoded the audio into samples, and divided the samples into frames: short chunks about 10 milliseconds long. For each frame, we computed various features, such as the spectral roll-off and 13 Mel-frequency cepstral coefficients (MFCCs), using the Python library librosa. This step yielded a matrix of per-frame features for each audio clip.
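To make this concrete, here is a rough sketch of that per-frame extraction with librosa. The sample rate, hop length, and exact feature set are my assumptions for illustration, not necessarily what the original repository used:

```python
import librosa
import numpy as np

def extract_frame_features(path, sr=22050, frame_ms=10):
    """Load a 5-second clip and compute per-frame audio features with librosa."""
    y, _ = librosa.load(path, sr=sr, duration=5.0)
    hop_length = int(sr * frame_ms / 1000)  # ~10 ms between frames
    # 13 Mel-frequency cepstral coefficients per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    # Spectral roll-off per frame
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop_length)
    # Stack into a single (n_features, n_frames) matrix
    return np.vstack([mfcc, rolloff])
```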
The crux of the model was feature averaging. Instead of working with the raw waveform or the full frame-by-frame feature matrix, we averaged each feature across frames, reducing every clip to a single vector of numbers. For instance, if we computed 18 features per frame, each audio clip ultimately became a clean vector of 18 values. While this abstraction significantly reduced our data size, it also posed a risk: a substantial loss of information that could hinder the classifier’s ability to distinguish between different audio classes.
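Building on the hypothetical extract_frame_features above, the averaging step is essentially a one-liner:

```python
def clip_to_vector(path):
    """Collapse the (n_features, n_frames) matrix into one vector per clip."""
    features = extract_frame_features(path)  # e.g. shape (14, n_frames)
    return features.mean(axis=1)             # average each feature across frames
```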
The final step in the process employed a Support Vector Classifier (SVC) to categorize the audio clips into one of four classes: crying baby, laughing baby, noise, and silence.
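Roughly, that classification stage looked like the sketch below, assuming a hypothetical labeled_clips list of (wav_path, class_index) pairs and the clip_to_vector helper above; the actual training script may have differed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

CLASSES = ["crying_baby", "laughing_baby", "noise", "silence"]

# labeled_clips is assumed to be a list of (wav_path, class_index) pairs.
X = np.array([clip_to_vector(path) for path, _ in labeled_clips])
y = np.array([label for _, label in labeled_clips])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```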
Initial Results and Challenges
The early results were encouraging, with a train accuracy of 100% and a test accuracy of 98%. However, this success was misleading, as several factors prevented us from deploying the model effectively. The primary concern was that the SVC, implemented in scikit-learn, could not run on Android, and our goal was to ensure that the app would function seamlessly on both Android and iOS.
To tackle this, I opted to replace the SVC with a simple yet effective two-layer neural network built in Keras. The transition proved fruitful, yielding a test accuracy of 96% without intricate hyperparameter tuning. Keras models can also be deployed smoothly on Android through TensorFlow’s conversion tool, tflite_convert.
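For reference, a two-layer Keras model on those averaged feature vectors can be as small as the sketch below. The layer width, feature count, and training settings are illustrative assumptions rather than the exact configuration we shipped; the commented tflite_convert call is TensorFlow’s standard CLI for producing a .tflite file:

```python
import tensorflow as tf

NUM_FEATURES = 14  # assumed: 13 MFCCs + spectral roll-off
NUM_CLASSES = 4    # crying baby, laughing baby, noise, silence

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=30, validation_data=(X_test, y_test))

# Save the model, then convert it for Android from the shell:
#   tflite_convert --keras_model_file=cry_detector.h5 \
#                  --output_file=cry_detector.tflite
model.save("cry_detector.h5")
```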
Overcoming Feature Engineering Obstacles
One major hurdle arose in the realm of feature extraction. While we had successfully moved the classification component to the mobile app, the audio feature extraction process still relied heavily on librosa, a Python library incompatible with Android’s Java and Kotlin environments. We attempted to rewrite parts of librosa in Java; however, the absence of strict equivalents for critical packages like NumPy and SciPy made this effort laborious and ultimately futile.
To resolve this problem, I turned to TensorFlow’s audio recognition capabilities. By splitting the input audio into frames and computing only the MFCC features with TensorFlow operations, I obtained a spectrogram: a one-channel representation of the audio that can be treated like an image and fed into a Convolutional Neural Network (CNN). Since the feature extraction now lived in TensorFlow rather than in a Python-only library, it could travel to the device together with the model. This approach not only allowed us to harness the power of image recognition techniques but also simplified the entire feature extraction process.
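A minimal sketch of that idea, using only TensorFlow signal-processing ops, might look like this; the frame sizes, number of mel bins, and CNN layers are illustrative assumptions:

```python
import tensorflow as tf

def audio_to_mfcc_image(waveform, sample_rate=16000):
    """Turn a 1-D float waveform into a one-channel (time, n_mfcc, 1) 'image'."""
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160)  # ~25 ms windows, ~10 ms hop
    spectrogram = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate)
    mel = tf.tensordot(spectrogram, mel_matrix, 1)
    log_mel = tf.math.log(mel + 1e-6)
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]
    return mfccs[..., tf.newaxis]  # add a channel axis -> image-like tensor

NUM_CLASSES = 4  # or 2 for "crying baby" vs. "other", depending on the dataset

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(None, 13, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```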
Addressing Dataset Limitations
Despite our progress, we soon encountered another challenge related to the dataset. The repo we started from contained only 400 examples, all overly simplistic, studio-quality recordings. While the results on that data were impressive, they raised concerns about generalizability: our aim was to develop a model that could perform reliably in the unpredictable noise of the real world.
To mitigate this, I turned to Google’s AudioSet and scoured YouTube for recordings of crying babies and random sounds. While I successfully amassed a larger, more diverse dataset, I faced a considerable obstacle: annotating the audio accurately. Labeling segments of audio, especially audio pulled from videos, required significant human effort to ensure that only relevant portions were included.
Fortunately, AudioSet provided robust annotations, enabling me to gather 1,000 samples each for the classes "crying baby" and "other." This left me with two datasets: a small, clean set of studio recordings and a larger, real-world collection from YouTube.
Training and Deployment on Android
When I eventually trained the CNN on the small dataset, overfitting was apparent: training accuracy reached 100% while test accuracy plummeted to around 80%. However, this more complex model proved much better suited to the realistic, noisy YouTube data, which reflected actual environmental conditions.
Upon testing on an Android device, I hoped to see reasonable predictions. Instead, the model recognized everything as a crying baby, even cheerful tunes like “Happy Birthday.” The issue stemmed from a mismatch between the audio captured and processed by the app and the raw YouTube audio the model had been trained on.
To resolve this, I re-recorded the entire dataset through the Android app, ensuring the model trained on the kind of audio it would encounter in real scenarios. The outcome was encouraging: not only did the model produce reasonable predictions on Android, but the added background noise also further regularized the system. The final model achieved a commendable 95% accuracy on the test set.
Future Considerations
This project has laid a robust foundation for future enhancements. Once the application MVP is released, I will gain access to user data that reflects actual conditions with real babies. This data will provide invaluable insights, helping us refine the model to handle situations we hadn’t previously analyzed. For instance, there may be recurring challenges not adequately represented in our current datasets.
The world of machine learning is filled with unexpected twists, and I look forward to the adventures that await as we continue to innovate for parents everywhere.
Photo by Dakota Corbin on Unsplash.