Monitoring Your Baby: A Machine Learning Journey
As a parent, the peace of mind that comes from knowing your baby is safe can be priceless. Imagine being able to monitor your baby’s well-being, even from another room. This isn’t just a dream; it’s becoming a reality thanks to advances in technology. I had the opportunity to contribute to a pioneering project that aimed to provide parents with a mobile app to remotely monitor their babies. Our focus was on developing a machine learning model that could accurately detect when a baby is crying. Let me take you through the fascinating challenges and insights I encountered as the machine learning engineer on this project.
Setting the Baseline
When embarking on this journey, the initial steps were daunting. I had a clear goal: detect whether a baby was crying or not. But I faced a big question: where do I start, and what data do I need? The first task in a machine learning project is to “get a number” quickly: build a simple end-to-end system and measure its performance as a baseline. Fortunately, someone had previously tackled this problem and published their work on GitHub. Their code and training data offered a valuable jumping-off point, allowing me to focus on the intricacies rather than reinventing the wheel.
Understanding the Machine Learning System
The system I stumbled upon centered on processing audio files. It took 5-second audio clips, typically in .wav format, decoded the audio into samples, and divided the samples into frames: short chunks about 10 milliseconds long. For each frame, we computed various features, such as the spectral roll-off and 13 Mel-frequency cepstral coefficients (MFCCs), using the Python library librosa. This step yielded a matrix of per-frame features for each audio clip.
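To make this concrete, here is a rough sketch of that per-frame extraction with librosa. The sample rate, hop length, and exact feature set are my assumptions for illustration, not necessarily what the original repository used:

```python
import librosa
import numpy as np

def extract_frame_features(path, sr=22050, frame_ms=10):
    """Load a 5-second clip and compute per-frame audio features with librosa."""
    y, _ = librosa.load(path, sr=sr, duration=5.0)
    hop_length = int(sr * frame_ms / 1000)  # ~10 ms between frames
    # 13 Mel-frequency cepstral coefficients per frame
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, hop_length=hop_length)
    # Spectral roll-off per frame
    rolloff = librosa.feature.spectral_rolloff(y=y, sr=sr, hop_length=hop_length)
    # Stack into a single (n_features, n_frames) matrix
    return np.vstack([mfcc, rolloff])
```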
The crux of the model was feature averaging. Instead of working with the raw waveform or the full frame-by-frame feature matrix, we averaged each feature across frames, reducing every clip to a single vector of numbers. For instance, if we computed 18 features per frame, each audio clip ultimately became a clean vector of 18 values. While this abstraction significantly reduced our data size, it also posed a risk: a substantial loss of information that could hinder the classifier’s ability to distinguish between different audio classes.
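Building on the hypothetical extract_frame_features above, the averaging step is essentially a one-liner:

```python
def clip_to_vector(path):
    """Collapse the (n_features, n_frames) matrix into one vector per clip."""
    features = extract_frame_features(path)  # e.g. shape (14, n_frames)
    return features.mean(axis=1)             # average each feature across frames
```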
The final step in the process employed a Support Vector Classifier (SVC) to categorize the audio clips into one of four classes: crying baby, laughing baby, noise, and silence.
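Roughly, that classification stage looked like the sketch below, assuming a hypothetical labeled_clips list of (wav_path, class_index) pairs and the clip_to_vector helper above; the actual training script may have differed:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

CLASSES = ["crying_baby", "laughing_baby", "noise", "silence"]

# labeled_clips is assumed to be a list of (wav_path, class_index) pairs.
X = np.array([clip_to_vector(path) for path, _ in labeled_clips])
y = np.array([label for _, label in labeled_clips])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)

clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)
print("test accuracy:", clf.score(X_test, y_test))
```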
Initial Results and Challenges
The early results were encouraging, with a train accuracy of 100% and a test accuracy of 98%. However, this success was misleading, as several factors prevented us from deploying the model effectively. The primary concern was that the SVC, implemented in scikit-learn, could not run on Android, and our goal was to ensure that the app would function seamlessly on both Android and iOS.
To tackle this, I opted to replace the SVC with a simple yet effective two-layer neural network built in Keras. The transition proved fruitful, yielding a test accuracy of 96% without intricate hyperparameter tuning. Keras models can also be deployed smoothly on Android through TensorFlow’s conversion tool, tflite_convert.
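For reference, a two-layer Keras model on those averaged feature vectors can be as small as the sketch below. The layer width, feature count, and training settings are illustrative assumptions rather than the exact configuration we shipped; the commented tflite_convert call is TensorFlow’s standard CLI for producing a .tflite file:

```python
import tensorflow as tf

NUM_FEATURES = 14  # assumed: 13 MFCCs + spectral roll-off
NUM_CLASSES = 4    # crying baby, laughing baby, noise, silence

model = tf.keras.Sequential([
    tf.keras.layers.Dense(64, activation="relu", input_shape=(NUM_FEATURES,)),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.fit(X_train, y_train, epochs=30, validation_data=(X_test, y_test))

# Save the model, then convert it for Android from the shell:
#   tflite_convert --keras_model_file=cry_detector.h5 \
#                  --output_file=cry_detector.tflite
model.save("cry_detector.h5")
```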
Overcoming Feature Engineering Obstacles
One major hurdle arose in the realm of feature extraction. While we had successfully moved the classification component to the mobile app, the audio feature extraction process still relied heavily on librosa, a Python library incompatible with Android’s Java and Kotlin environments. We attempted to rewrite parts of librosa in Java; however, the absence of strict equivalents for critical packages like NumPy and SciPy made this effort laborious and ultimately futile.
To resolve this problem, I turned to TensorFlow’s audio recognition capabilities. By splitting the input audio into frames and computing only the MFCC features with TensorFlow operations, I obtained a spectrogram: a one-channel representation of the audio that can be treated like an image and fed into a Convolutional Neural Network (CNN). Since the feature extraction now lived in TensorFlow rather than in a Python-only library, it could travel to the device together with the model. This approach not only allowed us to harness the power of image recognition techniques but also simplified the entire feature extraction process.
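A minimal sketch of that idea, using only TensorFlow signal-processing ops, might look like this; the frame sizes, number of mel bins, and CNN layers are illustrative assumptions:

```python
import tensorflow as tf

def audio_to_mfcc_image(waveform, sample_rate=16000):
    """Turn a 1-D float waveform into a one-channel (time, n_mfcc, 1) 'image'."""
    stft = tf.signal.stft(waveform, frame_length=400, frame_step=160)  # ~25 ms windows, ~10 ms hop
    spectrogram = tf.abs(stft)
    mel_matrix = tf.signal.linear_to_mel_weight_matrix(
        num_mel_bins=40,
        num_spectrogram_bins=spectrogram.shape[-1],
        sample_rate=sample_rate)
    mel = tf.tensordot(spectrogram, mel_matrix, 1)
    log_mel = tf.math.log(mel + 1e-6)
    mfccs = tf.signal.mfccs_from_log_mel_spectrograms(log_mel)[..., :13]
    return mfccs[..., tf.newaxis]  # add a channel axis -> image-like tensor

NUM_CLASSES = 4  # or 2 for "crying baby" vs. "other", depending on the dataset

model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(16, 3, activation="relu", input_shape=(None, 13, 1)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(32, 3, activation="relu"),
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])
```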
Addressing Dataset Limitations
Despite our progress, we soon encountered another challenge related to the dataset. The repo we started from contained only 400 examples, all overly simplistic, studio-quality recordings. While the results on that data were impressive, they raised concerns about generalizability: our aim was to develop a model that could perform reliably in the unpredictable noise of the real world.
To mitigate this, I turned to Google’s AudioSet and scoured YouTube for recordings of crying babies and random sounds. While I successfully amassed a larger, more diverse dataset, I faced a considerable obstacle: annotating the audio accurately. Labeling segments of audio, especially audio pulled from videos, required significant human effort to ensure that only relevant portions were included.
Fortunately, AudioSet provided robust annotations, enabling me to gather 1,000 samples each for the classes "crying baby" and "other." This left me with two datasets: a small, clean set of studio recordings and a larger, real-world collection from YouTube.
Training and Deployment on Android
When I eventually trained the CNN on the small dataset, overfitting was apparent: training accuracy reached 100% while test accuracy plummeted to around 80%. However, this more complex model proved much better suited to the realistic, noisy YouTube data, which reflected actual environmental conditions.
Upon testing on an Android device, I hoped to see reasonable predictions. Instead, the model recognized everything as a crying baby, even cheerful tunes like “Happy Birthday.” The issue stemmed from a mismatch between the audio captured and processed by the app and the raw YouTube audio the model had been trained on.
To resolve this, I re-recorded the entire dataset through the Android app, ensuring the model trained on the kind of audio it would encounter in real scenarios. The outcome was encouraging: not only did the model produce reasonable predictions on Android, but the added background noise also further regularized the system. The final model achieved a commendable 95% accuracy on the test set.
Future Considerations
This project has laid a robust foundation for future enhancements. Once the application MVP is released, I will gain access to user data that reflects actual conditions with real babies. This data will provide invaluable insights, helping us refine the model to handle situations we hadn’t previously analyzed. For instance, there may be recurring challenges not adequately represented in our current datasets.
The world of machine learning is filled with unexpected twists, and I look forward to the adventures that await as we continue to innovate for parents everywhere.
Photo by Dakota Corbin on Unsplash.