TechLabs Aachen
Sep 23, 2023

This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Winter Term 2022/23)

1) Introduction

Driving while drowsy is a serious problem that affects millions of people worldwide. In a 24/7 society marked by work pressure, longer commutes and rapid advances in technology, many people do not get the sleep they need.

In 2017, the National Highway Traffic Safety Administration (NHTSA) estimated that 91,000 police-reported crashes in the US involved drowsy drivers. These crashes led to an estimated 50,000 injuries and nearly 800 deaths[1]. With the rise of autonomous vehicles, there has been an increased focus on driver monitoring systems that can detect drowsiness and other dangerous driving behaviors.

By alerting drivers when they are becoming drowsy, these systems can help prevent accidents and save lives. Moreover, monitoring driver alertness is an implicit requirement of the forthcoming Society of Automotive Engineers (SAE) level of conditionally automated driving (level 3), because handing over vehicle control to a drowsy driver is unsafe[2].

The increasing significance of real-time driver drowsiness detection has led to extensive research and development aimed at finding efficient and accurate methods and algorithms in this field.

One promising solution is the use of convolutional neural networks (CNNs). CNNs are a type of deep learning algorithm that can learn to identify patterns in images and other visual data. By training a CNN on a large dataset of images and videos of drowsy drivers, it is possible to create a system that can accurately detect when a driver is becoming drowsy.

Contents

  • Dataset Overview
  • Data preprocessing
  • CNN Approach
  • Results and Metrics
  • Landmark Detection Approach
  • Conclusion and further improvements

2) Dataset Overview

For our project, we used the Driver Drowsiness Dataset (DDD) from Kaggle (link at the end of the blog post). It contains extracted and cropped faces of drivers taken from the videos of the Real-Life Drowsiness Dataset (RLDD): frames were extracted from the videos and stored as images, from which the facial regions of interest were then cropped.

The dataset has the following properties:

  • RGB images
  • 2 classes (Drowsy & Non Drowsy)
  • image size: 227 x 227
  • 41,793 images in total
  • 28 participants
  • every participant has their own label (for example: C for drowsy and c for non drowsy)

[Figure: Visualization of sample images in the dataset]

3) Data preprocessing

Data preprocessing is done to clean and transform the data into the form required by the model, for better analysis and accuracy. Depending on the given data, several methods can be used, e.g. data cleaning, feature extraction, normalisation etc.

It should be noted that data preprocessing is a very important step because it affects the correctness of all subsequent results. Even with a solid neural network architecture, poor preprocessing will likely lead to poor or unusable outputs.

Tools and libraries used

We mainly used PyTorch for preprocessing the data and for creating and training the CNN model. Furthermore, we utilized Google Colab to run the model on GPUs, which has the advantage of an easy environment setup.

Details about the data

Because of the large size of the DDD dataset and the long training times we expected, we considered converting the images to grayscale or scaling them down. However, since these transformations could discard information valuable for training, we opted to retain the full-size RGB images.

[Figure: The number of images per person varies (person ID on the x axis)]

From the graph it is clear that the images are unevenly distributed among the participants. For example, person ‘A’ has 1411 images in Drowsy and 1252 in Non Drowsy, while person ‘ZA’ has 621 images in Drowsy and 1054 in Non Drowsy.

Separating the data

Next, the data needed to be split into train and test sets. For effective testing, we removed three people entirely (for example persons C, N and K) from both the drowsy and non drowsy folders, because we wanted to analyze how the neural network behaves on images of people it has never seen. The remaining data (all images except those of the three held-out people) were split into 80% training data and 20% test data. For the final test dataset, the images of the three held-out people were then combined with that 20%, so that the test set contains both people who also appear in the training data and entirely new people. This allows an effective comparison of how the model predicts people it has already seen versus totally new faces.
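A minimal sketch of this split could look as follows (the folder layout, the random seed and the way the person ID is encoded in the file name are our assumptions for illustration, not the project's actual code):

```python
import os
import random

DATA_DIR = "DDD"               # assumed layout: DDD/Drowsy/... and DDD/Non Drowsy/...
HELD_OUT = {"C", "N", "K"}     # three participants reserved entirely for testing
random.seed(0)                 # reproducible split

train_files, test_files = [], []
for cls in ("Drowsy", "Non Drowsy"):
    for fname in os.listdir(os.path.join(DATA_DIR, cls)):
        person = fname.split("_")[0].upper()   # hypothetical: ID prefix in the file name
        path = os.path.join(DATA_DIR, cls, fname)
        if person in HELD_OUT:
            test_files.append((path, cls))     # completely unseen people
        elif random.random() < 0.8:
            train_files.append((path, cls))    # 80% of the remaining images
        else:
            test_files.append((path, cls))     # 20% joins the held-out people
```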

Standardisation

After this, standardisation was done to remove bias in the data: from each pixel value the mean is subtracted and the result is divided by the standard deviation. For RGB datasets, the mean and standard deviation are calculated per channel.

The mean and standard deviation are calculated on the train dataset only, and the same values are then applied to standardise the test data as well. Standardisation is not fitted on the whole dataset, to avoid any information from the test data influencing the training.
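In PyTorch, this amounts to computing the per-channel statistics on the training images only and reusing them for the test set. A sketch, where `plain_train_dataset` is a placeholder for the training images loaded with just `ToTensor` (i.e. unnormalised tensors in [0, 1]):

```python
import torch
from torch.utils.data import DataLoader
from torchvision import transforms

def channel_stats(dataset):
    """Per-channel mean/std over a dataset of (3, H, W) tensors in [0, 1]."""
    loader = DataLoader(dataset, batch_size=64)
    n, total, total_sq = 0, torch.zeros(3), torch.zeros(3)
    for images, _ in loader:
        n += images.numel() // 3                     # pixels per channel, accumulated
        total += images.sum(dim=(0, 2, 3))
        total_sq += (images ** 2).sum(dim=(0, 2, 3))
    mean = total / n
    std = (total_sq / n - mean ** 2).sqrt()
    return mean, std

# statistics come from the training split only ...
train_mean, train_std = channel_stats(plain_train_dataset)
# ... and the same transform is then applied to train and test alike
normalize = transforms.Normalize(train_mean.tolist(), train_std.tolist())
```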

Loading of data

PyTorch provides Dataset and DataLoader classes that make it easy to load the data for training and testing. These modules also help with data transformations, shuffling, batching, etc. We used a batch size (the number of samples processed at one time) of 64.
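A typical setup with these classes could look like this (the `data/train` folder with one subfolder per class is an assumption so that `ImageFolder` can infer the labels; `normalize` is the transform from the previous step):

```python
from torch.utils.data import DataLoader
from torchvision import datasets, transforms

transform = transforms.Compose([
    transforms.ToTensor(),   # PIL image -> (3, 227, 227) float tensor in [0, 1]
    normalize,               # train-set statistics from the step above
])

# assumed layout: one subfolder per class, so ImageFolder can infer the labels
train_dataset = datasets.ImageFolder("data/train", transform=transform)
train_loader = DataLoader(train_dataset, batch_size=64, shuffle=True)
```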

4) CNN Approach

To achieve our goal, we used a powerful type of neural network called a Convolutional Neural Network (CNN). A CNN is specifically designed to work with visual data, making it an excellent choice for image classification tasks. The CNN architecture we used consisted of several layers that are able to automatically extract relevant features from the input images.

[Figure: Layers and parameters of our neural network]

Each of the four convolutional layers in the CNN applies a set of filters to the image to identify patterns and features. Generally, the first few convolutional layers detect low-level features like edges, while deeper layers identify higher-level features like the eyes, ears or nose of a participant. The output of each layer is passed through a non-linear activation function (we used ReLU) to allow the model to learn non-linear relationships.

To further improve the performance and stability of the network, we used batch normalization and pooling layers between the convolutional layers.

Batch normalization ensures that the inputs have similar statistical properties, thereby improving the stability of the network and reducing the number of training epochs required.

Pooling layers reduce the size of the output from the convolutional layers. This, in turn, reduces the computational cost of the network and can also help in preventing overfitting.

Finally, the output is passed through several fully connected layers with batch normalization and ReLU activation functions. The final layer is a fully connected layer with two output nodes, representing the two possible classes of drowsy and non-drowsy drivers.

Our CNN is trained with a batch size of 64 and has a total of 30,732 parameters (the weights and biases learned during training).
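For illustration, a network of this general shape can be sketched as below. The channel widths, kernel sizes and the pooling scheme are our guesses and will not reproduce the exact 30,732-parameter count from the figure:

```python
import torch.nn as nn

def conv_block(in_ch, out_ch):
    """Conv -> BatchNorm -> ReLU -> MaxPool, one block per convolutional layer."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(),
        nn.MaxPool2d(2),                 # halves the spatial resolution
    )

model = nn.Sequential(
    conv_block(3, 8),                    # channel widths here are illustrative guesses
    conv_block(8, 16),
    conv_block(16, 32),
    conv_block(32, 32),
    nn.AdaptiveAvgPool2d(1),             # collapse the remaining spatial dims
    nn.Flatten(),
    nn.Linear(32, 16),
    nn.BatchNorm1d(16),
    nn.ReLU(),
    nn.Linear(16, 2),                    # two outputs: drowsy / not drowsy
)
```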

5) Results and Metrics

Accuracy

We trained and tested our model with four different sets of three people removed ((C, N, K), (O, P, ZB), (E, G, D), (M, U, ZA)). The removed people were chosen on the basis of characteristics that made them substantially different from the others, e.g. glasses, beards, bad lighting or reflections. When evaluating the model we looked at test accuracy: the percentage of correctly classified images in the test set. The graph below shows the test accuracies at each of the four epochs for each of the four removal combinations.

[Figure: Test accuracies over 4 epochs for the different sets of three people removed]

We can see that even though accuracy is quite high, never dropping below 80%, the general trend is downwards, suggesting that despite our best efforts some overfitting is taking place.

Misclassifications per person

To understand the results further, we looked into the percentage of incorrectly classified images of each person, particularly looking at the images the model hadn’t seen during training. In the graph below we can see a detailed breakdown of accuracy by person in every epoch and identify the people which are most likely to be misclassified.

[Figure: Misclassification rates per person over multiple epochs; first row: persons C, N, K only in the test dataset; second row: persons O, P, ZB only in the test dataset]

From the image, we can see that the accuracy of the model for people left out of the training data (red bars) initially improves between epoch one and two or three and then decreases, suggesting that this is when overfitting starts to really set in.

This can largely be attributed to the limitations of the dataset. Even though it contains 41,793 images, it covers only 28 participants, each in a similar pose across all of their images. Since the model has not seen a large enough variety of people during training, it tends to misclassify more images of new people.

We also assume that the network unfortunately tends to build person-specific classifiers for the people in the training dataset. It is also visible that certain people appearing only in the test dataset (for example O, bottom right) are harder to classify correctly than others (for example ZB, bottom right).

F1 score

The F1 score is a machine learning metric that can be used in classification models. It is the harmonic mean of precision and recall. Precision can be seen as a measure of quality, and recall as a measure of quantity. Higher precision means that the model returns more relevant results than irrelevant ones, and high recall means that the model returns most of the relevant results (whether or not irrelevant ones are also returned).

F1 = 2 · (Precision · Recall) / (Precision + Recall)
  • A model will obtain a high F1 score if both Precision and Recall are high
  • A model will obtain a low F1 score if both Precision and Recall are low
  • A model will obtain a medium F1 score if one of Precision and Recall is low and the other is high
[Figure: F1 scores over 4 epochs for the different sets of three people removed]

The F1 scores show trends similar to the accuracies across the epochs.
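For reference, precision, recall and F1 can be computed from raw predictions like this (a generic sketch, not our project code):

```python
def f1_score(preds, labels, positive=1):
    """F1 = 2 * Precision * Recall / (Precision + Recall) for one class."""
    tp = sum(p == positive and l == positive for p, l in zip(preds, labels))
    fp = sum(p == positive and l != positive for p, l in zip(preds, labels))
    fn = sum(p != positive and l == positive for p, l in zip(preds, labels))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0
```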

6) Landmark Detection Approach

Initially, for our driver drowsiness project, we considered using facial landmark detection to track facial expression, head position, and eye outline. Facial landmark detection involves detecting specific predefined points on the face (68 in our application), such as the tip of the nose, the eyebrows, the mouth corners, etc. The obtained characteristic points (x, y) can then be used by machine learning algorithms to determine which relationships between those points correspond to drowsiness or the lack thereof.

In our approach we used the dlib library, which provides face detectors and shape predictors for landmark detection. Using a pretrained CNN face detector, we could detect a face in more than 97% of our images. The CNN face detector outputs the rectangle in which the face is located; the shape predictor then uses this rectangle to estimate the 68 landmarks shown below.
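With dlib, this pipeline looks roughly as follows. The two model files are dlib's publicly available pretrained weights and must be downloaded separately; the image path is a placeholder:

```python
import cv2
import dlib

# dlib's pretrained CNN face detector and 68-point shape predictor
detector = dlib.cnn_face_detection_model_v1("mmod_human_face_detector.dat")
predictor = dlib.shape_predictor("shape_predictor_68_face_landmarks.dat")

img = cv2.imread("driver.png")                 # placeholder sample image
rgb = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)     # dlib expects RGB

detections = detector(rgb, 1)                  # 1 = upsample once to find small faces
if detections:
    rect = detections[0].rect                  # bounding box of the first face
    shape = predictor(rgb, rect)               # estimate the 68 landmarks in the box
    landmarks = [(shape.part(i).x, shape.part(i).y) for i in range(68)]
```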

[Figure: The 68 predefined landmarks to be identified by the shape predictor]

[Figure: Detected landmarks on a sample image]

The drowsiness prediction is based on a simple calculation of the eye aspect ratio, which combines the 12 landmark points of both eyes (6 per eye) into one value. The eye aspect ratio measures how open the eyes are: if it drops below an experimentally determined threshold, the person is labelled as drowsy.
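Concretely, for the six landmarks p1…p6 of one eye, EAR = (‖p2 − p6‖ + ‖p3 − p5‖) / (2 ‖p1 − p4‖). A sketch of the calculation, reusing the `landmarks` list from the dlib example above (the threshold value below is a placeholder; the real one was found experimentally):

```python
import numpy as np

def eye_aspect_ratio(eye):
    """eye: six (x, y) points of one eye, in the 68-landmark ordering."""
    p1, p2, p3, p4, p5, p6 = (np.array(p, dtype=float) for p in eye)
    vertical = np.linalg.norm(p2 - p6) + np.linalg.norm(p3 - p5)
    horizontal = np.linalg.norm(p1 - p4)
    return vertical / (2.0 * horizontal)

# in the 68-point scheme, the eyes are landmarks 36-41 and 42-47
left_ear = eye_aspect_ratio(landmarks[36:42])
right_ear = eye_aspect_ratio(landmarks[42:48])
ear = (left_ear + right_ear) / 2
is_drowsy = ear < 0.2          # placeholder threshold
```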

Using only the eye aspect ratio for classification leads to a mediocre accuracy of around 64%. We assume that the small number of participants in the dataset have very differently shaped eyes, leading to inaccuracies. Future work should use all 68 landmarks for classification.

7) Conclusion and further improvements

We created two approaches to detect driver drowsiness from images. In contrast to the common use of popular pretrained models and transfer learning, both approaches were built from the ground up. Our CNN approach achieves an accuracy and F1 score of around 83% when faced with a difficult test dataset; further improvements are needed to optimize the outcome. Our landmark detection approach achieves only mediocre accuracy, but using more landmarks as a source of information could lead to more accurate results. Even with very simple means we already reached 64% accuracy on a challenging dataset.

Possibilities to improve

  • It would be great to enhance the existing dataset by including more participants to avoid the negative effects mentioned above.
  • The data preprocessing of the CNN approach could be improved by using augmentations like RandomHorizontalFlip or RandomRotation to reduce the uniformity of the images passed to the neural network (see the sketch after this list).
  • Using slightly modified pretrained CNN models like VGG, AlexNet or ResNet could further improve the detection accuracy.
  • The current landmark detection approach only considers the eye aspect ratio for classification. Apart from the eyes, drowsiness also affects other regions of the face, so using all available landmarks could be the next step; lines and angles between the landmarks can provide more information and increase performance.
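For the augmentation idea mentioned in the second point, a minimal torchvision sketch:

```python
from torchvision import transforms

augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),   # mirror half of the training images
    transforms.RandomRotation(degrees=10),    # small random tilt
    transforms.ToTensor(),
])
```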

Special Thanks

We would like to thank our mentor Jöran Rixen for his great support throughout our project. He helped us a lot with his expertise in Deep Learning and enhanced our understanding of the challenges and solutions of data-driven technology. We would also like to thank TechLabs Aachen for giving us the opportunity and platform to work on this interesting project. The application of Data Science and Deep Learning in practice was a valuable experience for us, as was meeting other tech-enthusiastic people.

Team Members

Christian Sinn, Aishwarya Bose, Maroof Mohammed Abdul Aziz, Misha Abramovich

Github Repository of the project: https://github.com/mattac22/TechLabs_DriverDrowsinessDetection

Driver Drowsiness Dataset: https://www.kaggle.com/datasets/ismailnasri20/driver-drowsiness-dataset-ddd


TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact journey.ac@techlabs.org so that the corresponding item can be removed.
