This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Winter Term 2022/23)
Blog post by Group 1
Introduction
Hey there! Let us tell you about our project analyzing Deutsche Bahn (DB), the German railway company. As you may know, millions of people rely on Deutsche Bahn to get around every day, but like any transportation system, it experiences delays and cancellations that can cause some headaches. That's where we come in. Our team wanted to identify the root causes of these incidents, support proactive measures to prevent them, and help travelers decide on the best time to take a DB train. To do this, we used data analysis and machine learning to analyze past delays and cancellations and to predict future delays.
Method + approach
The general roadmap we agreed on for our approach was:
- Deciding on the scale of the project
- Finding a suitable dataset
- Cleaning the found data
- Performing data exploration
- Preprocessing of the data for the prediction
Deciding on the scale of the project
The first decision we had to make was on what scale we wanted to pursue these ideas. We agreed to work exclusively on long-distance trains in Germany, mainly operated by DB, but also by companies with a smaller share of the German long-distance railway network, such as Flixtrain or Thalys. Since long-distance trains often have international destinations, some European cities outside of Germany are also partly included in our analysis.
Finding a suitable dataset
After making sure we were all on the same page regarding the idea of the project, we went to work on finding suitable data. First, we collected historical data on delays and cancellations from the Deutsche Bahn API (via Zugfinder.net). The dataset consisted of variables including train type, route, time of day, and delay duration. While it provided valuable insights into the factors contributing to delays and cancellations, it also had some limitations. For example, it lacked critical variables, such as weather conditions and passenger feedback, that could have influenced the accuracy of the analysis. The candidate datasets each came with unique, desirable features and drawbacks. We decided that for our project, recent data covering a shorter period was more important than old data covering a longer one.
Cleaning the found data
The final dataset we chose to work with was divided into arrival and departure information, so we merged the two into a single dataframe. The delay for each train was not given directly, so we had to derive it from the scheduled and actual arrival and departure times. With that done, we could drop features the dataset originally had but that we did not need, such as the train number of every listed train, keeping only the features we were interested in. We also excluded or adjusted columns with redundant or inconsistent information to make the dataset as clean as possible for the next steps.
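As a rough illustration of this step, here is a minimal pandas sketch; the file and column names are hypothetical, not the dataset's actual ones:

```python
import pandas as pd

# Hypothetical file and column names; the real dataset differs in detail.
arrivals = pd.read_csv("arrivals.csv")      # scheduled_arrival, actual_arrival, ...
departures = pd.read_csv("departures.csv")  # scheduled_departure, actual_departure, ...

# Merge arrival and departure information into one dataframe,
# matching rows on train number and date.
df = arrivals.merge(departures, on=["train_number", "date"])

# Derive the delay in minutes from scheduled vs. actual arrival time.
df["arrival_delay"] = (
    pd.to_datetime(df["actual_arrival"]) - pd.to_datetime(df["scheduled_arrival"])
).dt.total_seconds() / 60

# Keep only the features we want to work with and drop the rest
# (e.g. the train number is no longer needed after the merge).
df = df[["train_type", "station", "date", "arrival_delay"]]
```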
Performing data exploration
Now that we had a clean dataset, it was time for data exploration. We thought about which features we wanted insights into in order to get a better overall impression of the data we were working with. Selected examples of the results appear in the data analysis part of this blog, a bit further down. The full results and the tools we used are accessible on a Streamlit page, linked further down as well.
Preprocessing the data for the prediction
After the data exploration was done, it was time for the final phase of our project. Because the prediction models we used can't handle strings, we first had to encode every categorical feature of our dataset as an integer value. We also split our dataset into a training and a testing part, so that we could measure how well the resulting model predicted in the end. Finally, we decided which features to use for the prediction and which tools, such as linear regression, might be the best choice for this particular task to make the results as accurate as possible. More details follow in the machine learning part of the blog.
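A minimal sketch of the encoding step, assuming hypothetical column names; scikit-learn's LabelEncoder maps each distinct string to an integer (the train/test split itself is shown in the machine learning section below):

```python
from sklearn.preprocessing import LabelEncoder

# Encode every string-valued feature as integers so the models can handle it.
# The column names here are illustrative.
for col in ["train_type", "station"]:
    df[col] = LabelEncoder().fit_transform(df[col])
```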
Data Analysis
During our data analysis, we found that certain factors, such as train type, route, and time of day, had a significant impact on the frequency of incidents. However, our current project did not include weather conditions, passenger volumes, or the impact of major events in Germany and worldwide. We hope to explore these in future work: we suspect a strong relationship between delays, cancellations, and weather conditions, so such an analysis could provide valuable insights into how weather affects train operations. Our main goal was to analyze data on Deutsche Bahn's operations and turn it into useful information. We looked at things like which stations have the most delays, how long trains are delayed, and when delays are most likely to happen. We also analyzed the cancellation percentage and the delays and cancellations by station, in order to find the worst and best stations. By understanding these patterns, Deutsche Bahn can figure out what's causing the delays and take steps to improve its services. For the analysis we used several libraries, including pandas, numpy, matplotlib, and seaborn.
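To give a flavour of how such questions translate into pandas, here is a minimal sketch; the column names (station, arrival_delay, cancelled) are placeholders rather than the dataset's actual ones:

```python
# Average delay per station, sorted; the two ends of this series give
# the "best" and "worst" stations.
delay_by_station = df.groupby("station")["arrival_delay"].mean().sort_values()
best_stations = delay_by_station.head(5)
worst_stations = delay_by_station.tail(5)

# Share of cancelled trains, assuming a boolean 'cancelled' column.
cancellation_pct = df["cancelled"].mean() * 100
```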
Visualisations
To analyse the data as well as possible, we made different kinds of charts, such as:
- Histograms
- Bar Charts
- Pie Charts
- Heatmaps
- Box Plots
Below are some of the visualisations we made:
This is a bar chart showing the number of departures by station.
This is a heatmap showing the number of departures by station on each day of the week.
This is a histogram showing the delay in minutes for each day during the month of May.
This pie chart shows the percentage of delays across the whole dataset.
This heatmap shows the average delay per hour and station.
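As an example of how a chart like the last heatmap could be built, here is a sketch with seaborn, again assuming hypothetical column names (station, hour, arrival_delay):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pivot into a station x hour grid of average delays.
pivot = df.pivot_table(index="station", columns="hour",
                       values="arrival_delay", aggfunc="mean")

plt.figure(figsize=(12, 6))
sns.heatmap(pivot, cmap="Reds", cbar_kws={"label": "average delay (min)"})
plt.title("Average delay per hour and station")
plt.tight_layout()
plt.show()
```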
Streamlit
To put together everything we worked on, we used Streamlit, an open-source Python library that lets users quickly create interactive web applications with just a few lines of Python code. It is designed to simplify building and sharing data-focused web applications. You can find the link to our Streamlit web application here: Click here to visit our web app. The code repository can be accessed here: The github repository
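As a toy example of how little code a Streamlit app needs (this is a sketch, not our actual app; the file name and columns are made up):

```python
import pandas as pd
import streamlit as st

st.title("Deutsche Bahn delay explorer")

df = pd.read_csv("delays.csv")  # hypothetical file

# Interactive widget: pick a station, see its average delay per hour.
station = st.selectbox("Station", sorted(df["station"].unique()))
per_hour = df[df["station"] == station].groupby("hour")["arrival_delay"].mean()
st.bar_chart(per_hour)
```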
Machine Learning
For our machine learning models, we first imported the necessary libraries: pandas for data manipulation, numpy for numerical operations, scikit-learn for the machine learning tasks, matplotlib for data visualisation, and pickle for saving the trained models. Next, we split the data into training and testing sets using the train_test_split function from scikit-learn. This is an important step: it lets us detect overfitting to the training data and evaluate the model's performance on unseen data.
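In outline, that step looked roughly like the following sketch (the variable names and the 80/20 split ratio are illustrative):

```python
from sklearn.model_selection import train_test_split

# Features and target (names illustrative).
X = df.drop(columns=["arrival_delay"])
y = df["arrival_delay"]

# Hold out part of the data so the models are evaluated on unseen samples.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```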
Models
For our project we implemented different types of machine learning models described below with their scores:
Logistic Regression
We first used the mean_squared_error function to compute the mean squared error between the predicted and the actual values. The model is trained with the fit() method, and the predict() method is then used to make predictions on the testing data, after which the mean squared error and the R-squared score are calculated to evaluate its performance. The mean squared error (MSE) measures the average squared difference between predicted and actual values; the root mean squared error (RMSE) is its square root and expresses the average difference in the same units as the target variable; and R-squared (R²) measures how well the model fits the data. This first model had an MSE of 462.42, an RMSE of 21.5, and an R² of -14.07, so the metrics indicated that it was not performing well. To improve the performance, we transformed the arrival delay into a categorical variable, split the dataset into training and testing sets, and trained a logistic regression model on the training set to predict the delay category. We then evaluated it using several metrics: accuracy, precision, recall, and the F1 score. Accuracy measures the proportion of correctly classified instances; precision measures the proportion of true positives among all instances predicted as positive; recall measures the proportion of true positives among all instances that are actually positive; and the F1 score is the harmonic mean of precision and recall. The logistic regression model reached an accuracy of 0.49, meaning it was not very accurate.
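A hedged sketch of the classification variant described above; the delay-category thresholds are invented for illustration, not the ones we actually used:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score

# Bucket the continuous arrival delay into categories
# (thresholds here are illustrative).
bins = [-float("inf"), 5, 15, float("inf")]
labels = ["minor", "moderate", "severe"]
y_train_cat = pd.cut(y_train, bins=bins, labels=labels)
y_test_cat = pd.cut(y_test, bins=bins, labels=labels)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train_cat)
y_pred = model.predict(X_test)

print("accuracy :", accuracy_score(y_test_cat, y_pred))
print("precision:", precision_score(y_test_cat, y_pred, average="weighted"))
print("F1 score :", f1_score(y_test_cat, y_pred, average="weighted"))
```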
Confusion matrix of the logistic regression algorithm
KNN
Afterwards, we implemented the K-Nearest Neighbours (KNN) algorithm. We first standardised the features using the StandardScaler class from the scikit-learn library, transforming the data to a mean of 0 and a standard deviation of 1, which helps improve the accuracy of KNN. Next, we created the KNN classifier, setting the number of neighbours to 5 and n_jobs to -1 to use all available CPU cores, and trained it on the standardised training data. After training, we used the model to predict the target variable for the testing set. The accuracy_score function gave us a value of 0.51. We also generated a confusion matrix using the confusion_matrix function, which showed that the model correctly predicted 39% of the data with minor delays, 42% with moderate delays, and 63% with severe delays. The pickle library was used to save the trained KNN model so we could reuse it later without retraining. Overall, the model achieved an accuracy of 0.51 and an F1 score of 0.51, meaning there is still room for improvement.
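Sketched with the settings mentioned above (the categorical targets y_train_cat and y_test_cat come from the previous sketch):

```python
import pickle

from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.neighbors import KNeighborsClassifier
from sklearn.preprocessing import StandardScaler

# Standardise features to mean 0 and standard deviation 1.
scaler = StandardScaler()
X_train_s = scaler.fit_transform(X_train)
X_test_s = scaler.transform(X_test)

# 5 neighbours, all available CPU cores.
knn = KNeighborsClassifier(n_neighbors=5, n_jobs=-1)
knn.fit(X_train_s, y_train_cat)

y_pred = knn.predict(X_test_s)
print(accuracy_score(y_test_cat, y_pred))
print(confusion_matrix(y_test_cat, y_pred))

# Persist the trained model so it can be reused without retraining.
with open("knn_model.pkl", "wb") as f:
    pickle.dump(knn, f)
```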
Confusion matrix of the KNN algorithm
Random Forest Classifier
For the RandomForestClassifier, we first defined a parameter grid for the number of trees and performed a grid search over it. We set random_state to 42 and n_estimators, the number of trees in the forest, to 11. Again, we trained the model on the training set and saved it using the pickle library. We then used the trained model to predict the target variable for the testing set. The accuracy_score function gave us an accuracy of 0.51, the confusion matrix helped us evaluate the model's performance, and the F1 score came out at 0.51. We also generated a classification report using the classification_report function.
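A sketch of the grid search and the final model; the candidate grid values other than 11 are assumptions:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report
from sklearn.model_selection import GridSearchCV

# Grid search over the number of trees (candidate values are illustrative).
param_grid = {"n_estimators": [5, 11, 25, 50]}
grid = GridSearchCV(RandomForestClassifier(random_state=42),
                    param_grid, cv=5, scoring="accuracy")
grid.fit(X_train, y_train_cat)

# Final model with the settings mentioned in the text.
rf = RandomForestClassifier(n_estimators=11, random_state=42)
rf.fit(X_train, y_train_cat)

print(classification_report(y_test_cat, rf.predict(X_test)))
```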
Confusion matrix of the Random Forest algorithm
Gradient Boosting Classifier
To use this algorithm, we defined the classifier, set the number of trees to 11, and fit it on our training data. The accuracy of this model was 0.495 and the F1 score was 0.4; the average precision was 0.52 and the recall 0.50.
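And the gradient boosting counterpart, as a sketch (random_state is our added assumption, not mentioned in the text):

```python
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, f1_score

# 11 trees, as in the text; random_state is an assumption.
gb = GradientBoostingClassifier(n_estimators=11, random_state=42)
gb.fit(X_train, y_train_cat)

y_pred = gb.predict(X_test)
print("accuracy:", accuracy_score(y_test_cat, y_pred))
print("F1 score:", f1_score(y_test_cat, y_pred, average="weighted"))
```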
Confusion matrix of the Gradient Boosting algorithm
Issues/Challenges
We also faced some challenges during the project. A significant one was the limited availability of data. We had access to historical data on delays and cancellations of Deutsche Bahn trains, but the dataset was incomplete and lacked critical variables such as weather conditions, train maintenance logs, and passenger feedback. These missing variables could have affected the accuracy of the analysis and hindered the model's ability to predict future incidents. Another challenge was the lack of domain expertise: our knowledge was limited not only in data science and machine learning but also in the railway domain itself, which limited our ability to interpret the data accurately.
Project Result + Conclusion
We examined four different models for target value classification: Logistic Regression, Random Forest, Gradient Boosting, and Support Vector Machines. With an accuracy of 77%, Logistic Regression, Gradient Boosting, and Support Vector Machines performed comparably to one another, while Random Forest's accuracy of 74% indicates a marginally worse performance. Looking at the F1 scores, however, Random Forest scored the best with 73%, while the other models' F1 scores ranged from 67% to 69%. The same four models were also applied to multiclass classification. The outcomes, however, fell short of those of the first classification task. With accuracy scores ranging from 47% to 51%, all models performed about equally. The F1 values of Logistic Regression, Gradient Boosting, and Support Vector Machines ranged from 31% to 42%, while Random Forest obtained the highest F1 score of 50%. The project's most significant finding is that the models perform better in the simpler classification than in multiclass classification, perhaps because anticipating train delays in particular categories is more difficult than predicting delays in general. The fact that Random Forest fared well in both tasks is a further indication that it would be a strong option for predicting train delays. In terms of areas of improvement, we could explore additional feature engineering techniques and incorporate weather data to better understand its impact on train delays. This would likely require obtaining additional datasets and performing more advanced preprocessing to ensure compatibility with our existing data.
Finally, this project was a wonderful introduction to machine learning and gave us the chance to assess how well various algorithms predict train delays. The findings demonstrated that although some models outperformed others, there is no single "optimal" model for this task; the best choice may depend on the particular specifications and constraints of the project. Overall, our findings demonstrate the potential of machine learning algorithms to predict train delays with reasonable accuracy. Such models could be used to improve the overall efficiency and reliability of train transportation systems.
This project opened our eyes to the endless possibilities of data science and machine learning. Getting this far would not have been possible without our mentor Hachem Sfar, who was always available to answer our questions, critique our work, and offer guidance throughout the entire project. Thank you for reading.
Team members
- Anne Santiago
- Fidel Gatimu
- Julius Thiemeyer
Mentor
Hachem Sfar