Aachen Transportation

TechLabs Aachen
12 min read · May 20, 2021

This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Winter Term 2020).

Introduction

Discussions about alternative means of transportation in Aachen, past and present, are ubiquitous. One example is the plan for the "Campusbahn", which was to connect Aachen's city center with the university clinic and the adjacent campus, underlining the ongoing debate about and motivation for improved public transport in Aachen. It is therefore not surprising that startups such as upBUS are developing future public transportation strategies for more efficient and flexible routes and vehicles to, as they state themselves, "make traffic suck less". To contribute to this subject, we wanted to analyze the Aachen transport network ourselves. This project explores and visualizes the Aachen public transport system based on the open data provided by the AVV. A central goal was to analyze and visualize key characteristics of the transportation system.

GTFS file format

The General Transit Feed Specification (GTFS), also known as GTFS Static or static transit, defines a common format for public transportation schedules and associated geographic information. GTFS "feeds" let public transit agencies publish their transit data and let developers write applications that consume that data in an interoperable way. A GTFS feed is a set of text files collected in a ZIP archive, each modeling a particular aspect of transit information: agency, stop, route, trip, stop time, calendar date, shape, and frequency information (see the scheme below). Shape files contain geographic data as ordered coordinate sequences.

The GTFS files were downloaded from the AVV website, loaded into Python through a GTFS library, and further manipulated with Pandas to extract and visualize key data. A typical GTFS file covers about one year of schedules; to obtain data for multiple years, several GTFS files were stitched together.
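A minimal sketch of this import step, assuming the feed has been downloaded from the AVV website (the file name is a placeholder):

import gtfs_kit as gk

# Read the GTFS ZIP archive; distances are reported in kilometers.
feed = gk.read_feed("avv_gtfs.zip", dist_units="km")

# Each GTFS text file becomes a pandas DataFrame on the feed object.
print(feed.stops.head())
print(feed.routes.head())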

Methodology

Data extraction:

The raw data for our analyses came from the AVV open data website. Additional data on the Aachen population was derived from the open data service of the city as well as from other existing open source projects (refer to Resources).

Computation of population data:

We used a combination of classic data clean-up and visualization methods. To deal with geospatial file formats, a library for processing General Transit Feed Specifications ("gtfs_kit", provided by MrCagney) was used. The library includes functions for analyzing GTFS and shape files based on the GeoPandas library. Data clean-up and preparation were done with Pandas and GeoPandas functions.

The computation of stop and route data was performed with functions from the "gtfs_kit" library. These functions take the feed and a date as required parameters, among others, and return stop, trip, and route data respectively. We explain a couple of them in the following.

Generating an enriched data set and Key data computation:

A master enriched data set was generated using the "compute_trip_stats" and "compute_route_stats" functions. It contains, per "route_id"/"route_name" and day, the overall number of trips, the service distance, the mean trip distance, and the mean trip duration. The number of routes per day was computed by counting the number of rows per day. Additionally, the number of stops per route was computed separately from the route data and then mapped onto the "route_id" column of the enriched dataset.
The generated dataframes were combined over different time intervals into a master dataset, which was further cleaned for outliers.
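The following sketch illustrates both computations, assuming a dataframe route_stats as returned by compute_route_stats (one row per route and date); the stop-counting approach via stop_times is our assumption:

# Number of routes per day = number of rows per date.
num_routes_per_day = route_stats.groupby("date").size()

# Stops per route: count the unique stops over each route's trips ...
stops_per_route = (
    feed.stop_times.merge(feed.trips[["trip_id", "route_id"]], on="trip_id")
    .groupby("route_id")["stop_id"]
    .nunique()
)

# ... and map the counts onto the "route_id" column of the enriched dataset.
route_stats["num_stops"] = route_stats["route_id"].map(stops_per_route)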

Stop stats data:

The details of each stop can be generated using the "compute_stop_stats" function available in the "gtfs_kit" library.
gtfs_kit.stops.compute_stop_stats(feed: Feed, dates: List[str], stop_ids: Optional[List[str]] = None, headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pandas.core.frame.DataFrame
The function above computes stats such as the number of trips, the number of routes, and the average time between two buses/trains passing a stop (also known as headway) for the given stops and dates (YYYYMMDD date strings) within the desired time window. Data for all stops were generated for a four-year interval. Cleaning was required because the same stop was sometimes represented by different "stop_id" numbers and slightly different stop names. The data before and after cleaning are shown in the figure below.
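A usage sketch for this function; the name normalization used to collapse duplicate stops is our assumption about the clean-up, not a gtfs_kit feature:

import gtfs_kit as gk

# Dates are YYYYMMDD strings; the default window of 07:00-19:00 applies.
stop_stats = gk.stops.compute_stop_stats(feed, dates=["20210216"])

# Clean-up sketch: attach stop names and merge stops whose normalized names match.
stop_stats = stop_stats.merge(feed.stops[["stop_id", "stop_name"]], on="stop_id")
stop_stats["stop_name"] = stop_stats["stop_name"].str.strip().str.lower()
per_stop = stop_stats.groupby("stop_name")[["num_trips", "num_routes"]].sum()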

In the interactive dashboard you can explore the stop activity in and around Aachen.

Trip-stats function:

This function takes the feed and, optionally, the "route_ids" of interest, and returns a dataframe containing trip data for all dates covered by the feed.
gtfs_kit.trips.compute_trip_stats(feed: Feed, route_ids: Optional[List[str]] = None, *, compute_dist_from_shapes: bool = False) → pandas.core.frame.DataFrame

This is further utilized to calculate the route data.

Route-stats computation:

The function below takes a feed, the "trip_stats" data computed above, a list of dates, and headway start and end times, and outputs, for each route, the corresponding number of trips, mean trip distance and duration, and service distance and duration.
gtfs_kit.routes.compute_route_stats(feed: Feed, trip_stats_subset: pandas.core.frame.DataFrame, dates: List[str], headway_start_time: str = '07:00:00', headway_end_time: str = '19:00:00', *, split_directions: bool = False) → pandas.core.frame.DataFrame
The generated dataframes were stitched together over different time intervals to generate a master dataset, which was further cleaned for outliers, for example in the mean trip distances.
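A sketch of the whole trip-stats/route-stats pipeline; the yearly route_stats_* dataframes in the last step are placeholders for the results computed from the individual feeds:

import pandas as pd
import gtfs_kit as gk

# Trip stats are computed once per feed and then fed into the route stats.
trip_stats = gk.trips.compute_trip_stats(feed)

# Compute route stats for all dates covered by the feed.
dates = feed.get_dates()
route_stats = gk.routes.compute_route_stats(feed, trip_stats, dates)

# Stitch the per-feed results (assumed to be computed as above) into a master set.
master = pd.concat(
    [route_stats_2018, route_stats_2019, route_stats_2020, route_stats_2021],
    ignore_index=True,
)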
The dashboard represents the comparison of routes and trips over the years.

Outlier Detection:

A dataset may always contain outliers or inconsistencies, and this was true for our dataset as well. While our dataset has many attributes, we concentrated on those that were most important for the further analysis and visualization. There are various approaches to finding and removing outliers, such as the Z-score method, DBSCAN, Isolation Forest, and the interquartile range. We chose the interquartile range (IQR) method, as it is one of the simplest.

With the IQR method, lower and upper bounds are defined for the dataset. First, the lower and upper quartiles (the 25th and 75th percentiles) are computed from the values of the attribute in question. The interquartile range is the difference between the upper and lower quartile (IQR = Q3 − Q1). The bounds are then set to the lower quartile minus the IQR multiplied by an outlier constant, and the upper quartile plus the same product; the constant is chosen by the user based on the statistics of the data, with 1.5 being a common choice. Data lying outside these bounds are considered outliers and can either be removed permanently or replaced by a statistical value such as the mean, median, or mode.
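A minimal pandas sketch of this rule; the dataframe and column names are illustrative:

import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return the lower and upper fences of the IQR rule; k is the outlier constant."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

lower, upper = iqr_bounds(master["num_trips"])
outliers = master[(master["num_trips"] < lower) | (master["num_trips"] > upper)]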

Data Pre-processing:

In our case, the columns "num_trips", "num_routes", "mean_trip_distance" and "shape_id" were the most important in the data pre-processing step, because they are used in the subsequent data analytics and visualization.

  1. num_trips/num_routes:

The figure below shows the outlier detection and replacement before (left) and after data cleaning (right).

"Number of trips" is an important attribute for our analytics, and we used the approach described in the previous section. For the route_id "17683_714", for instance, the average number of trips lies around 40; before the data processing, however, some values deviated strongly from this average. These values were identified using the above method and replaced by the median of the dataset. The data is first filtered by "route_id" and day of week, so that the median is computed only over that particular subset; the outlier value is then replaced by this median.
The computation for the number of routes is carried out similarly (see the sketch below). In this case, however, the data for finding the median is grouped by "agency_name" and "DayOfWeek", as that dataset has slightly different attributes.
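A sketch of the median replacement, reusing iqr_bounds from above; the grouping columns follow the text, the remaining details are assumed:

def replace_outliers_with_group_median(df, value_col, group_cols):
    # Flag outliers via the IQR fences, then overwrite them with the median
    # of their own group (e.g. the same route on the same weekday).
    lower, upper = iqr_bounds(df[value_col])
    is_outlier = (df[value_col] < lower) | (df[value_col] > upper)
    group_median = df.groupby(group_cols)[value_col].transform("median")
    df.loc[is_outlier, value_col] = group_median[is_outlier]
    return df

master = replace_outliers_with_group_median(master, "num_trips", ["route_id", "DayOfWeek"])
master = replace_outliers_with_group_median(master, "num_routes", ["agency_name", "DayOfWeek"])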

  2. mean_trip_distance:

Mean trip distance is another attribute that is very useful for our analysis. It is calculated from the latitudes and longitudes in the shape file of the data set; the computation is implemented in the "gtfs_kit" library, one of the major libraries we utilized in our project.
When we analysed the dataset, several values deviated from the standard distance of their particular route_id.
Here, a similar approach as above was used with slight changes: the basic IQR logic remains the same, but the median was replaced with the mode. The reason is that for a given "route_id" only very few outliers were present, so we could use the value that occurs most often.
The data for route_id "17750_106" prior to and after outlier replacement can be seen in the figures below (left: before data cleaning, right: after data cleaning).

The mean_trip_distance is around 155 km, but some values lie below 80 km. These values were therefore replaced with the most frequent value (155 km).
The same logic was applied when replacing the shape_id, since the shape_id is the primary value determining the mean_trip_distance of a particular route_id. In the above example, the most frequent shape_id among the rows with a mean_trip_distance of 155 km is chosen as the replacement value. Since the outliers only have to be removed once, the final result was satisfying even though the computation time was high.
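A sketch of the mode-based replacement for both columns; the per-route loop matches the high computation time mentioned above, but the details are assumed:

for route_id, grp in master.groupby("route_id"):
    # The most frequent distance of this route is taken as the reference value.
    mode_dist = grp["mean_trip_distance"].mode().iloc[0]
    lower, upper = iqr_bounds(grp["mean_trip_distance"])
    bad = grp.index[(grp["mean_trip_distance"] < lower) | (grp["mean_trip_distance"] > upper)]
    master.loc[bad, "mean_trip_distance"] = mode_dist
    # The shape_id is replaced by the most frequent shape_id among the rows
    # that carry the modal distance.
    mode_shape = grp.loc[grp["mean_trip_distance"] == mode_dist, "shape_id"].mode().iloc[0]
    master.loc[bad, "shape_id"] = mode_shape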

Affirming an outlier:

The judgement of whether a value is an outlier depends entirely on the dataset being analysed. The "route_id" "17750_106" from the example above corresponds to the train "RE 4", which travels between Aachen and Dortmund. The distance can be checked on a map and amounts to roughly 160 km, so any value much higher or much lower than this is considered an outlier. The other routes were checked in the same way.
In the case of "num_of_trips" or "num_of_routes", the value should be the same as on corresponding days, since schedules and trips are pre-planned and static, unless there was a special event or public holiday on that particular day. The only other reason for differing values is that some of the GTFS files were not properly updated and/or were corrupt.

Plotting shape files:

The shape files imported and processed by gtfs_kit list the longitude and latitude of each coordinate on a separate line, assigned to a shape ID (column "shape_id") and ordered by an incremental sequence number (column "shape_pt_sequence"). Plotting these points directly resulted in undesired interconnected lines. Therefore, the latitudes and longitudes were first isolated and transformed into a GeoDataFrame; the coordinates were then grouped by shape ID, converted to linestrings, and assigned to a "geometry" column, so that this column contained all coordinates of a single shape ID as one linestring. The geometry column was merged into the master dataframe so that it could be filtered by other parameters such as agency IDs or route short names, and then simply plotted. For the visualization, the background was chosen to be black, and the transparency was adjusted so that the most used tracks appear brighter than others (figure below).
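A sketch of this conversion with GeoPandas and Shapely; the column names follow the GTFS shapes.txt specification:

import geopandas as gpd
from shapely.geometry import LineString

# Order the shape points and build one linestring per shape ID.
shapes = feed.shapes.sort_values(["shape_id", "shape_pt_sequence"])
lines = shapes.groupby("shape_id").apply(
    lambda g: LineString(list(zip(g["shape_pt_lon"], g["shape_pt_lat"])))
)
gdf = gpd.GeoDataFrame({"shape_id": lines.index, "geometry": lines.values}, crs="EPSG:4326")

# Black background; a low alpha makes frequently used tracks appear brighter.
ax = gdf.plot(color="white", linewidth=0.5, alpha=0.15, figsize=(10, 10))
ax.set_facecolor("black")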
Further, the routes were plotted on top of shape files containing information about districts of Aachen and their population.

Problems and Challenges

Several challenges had to be tackled during the project. The data sets were partly incomplete or contained outliers, which had to be removed prior to further processing. As no real-time data was available for Aachen, a correlation analysis with real-life events was not possible. In addition, the archive data was limited to four years, so no long-term development could be visualized. We contacted the AVV to ask whether the scheduled data is adjusted retrospectively based on actual operations, which is not the case. Due to these aspects we had to adapt our project goals: we could not discover the impact of certain events on the Aachen transport network, and we could not investigate the network's delays, because the data is not updated after a service is completed.

Results

From the enriched dataset, key statistics such as the routes with the highest mean trip distance, the highest service distance, and the longest service duration could be identified.

  • Did you know that the longest bus service originating from Aachen is the SB66 from Aachen to Monschau, covering 33 km?
  • Did you know that the bus route with the most stops around Aachen is route 21 from Linert Friedhof to Palenberg Bahnhof, with 90 bus stops?
  • Did you know that route 21 also has the highest mean trip duration, at 1.35 hours?
  • The bus route with the most trips per day is route 51, with 169 trips per day.
  • The longest train route (excluding ICE) originating from Aachen is the RE1 from Aachen Hbf to Hamm (Westf), with a distance of 204 km.

To see more key insights feel free to explore our dashboards via the links in this blogpost!

Route and trip information:

The route and trip information was classified by the different agencies that operate in and around Aachen and by the mode of transportation (bus, train, tram, etc.). Data was sampled on a daily basis and trends were analysed. The numbers of routes and trips operating on weekdays were considerably higher than on weekends. The final dataset was the result of multiple mappings of route IDs, route names, and agency details.

Interactive dashboard

Stop data based information:

The data cleaning and subsequent computation of the stop data enabled the visualization of bus stop frequency by stop color and size on a map of Aachen. Check out an interactive version on our dashboard here:

Additionally, we investigated the number of buses per hour on a given day. This visualization clearly shows at which time of day a given station is busy or quiet. Below you can see this graph for Bushof on 16 February 2021: the peak hours are between 7 am and 9 am and between 3 pm and 7 pm, corresponding to the usual commuting hours to and from work.

Population based visualizations:

In addition to the open data provided by the AVV, we analyzed relevant population data on Aachen to find possible correlations and create a combined visualization. The population map shows the total number of inhabitants per district (figure below). The districts with postal codes 52066 and 52074 (roughly Laurensberg and Rothe Erde) have the most inhabitants, indicating the need to cover and connect these districts with public transportation. Both the available population and transportation data sets covered only the last four years and showed only minor changes in the total number of inhabitants and none in the district distribution. Hence, no analysis of long-term population development and a likely expansion of public transportation in response to population growth was possible.

In the next step, the population visualization was converted to the number of inhabitants per square kilometer of each district and merged with a second data set of all public transportation routes in and around Aachen. The resulting plot visualizes the spatial arrangement of transportation routes across the districts, notably including even remote destinations (e.g. Hamm and Bonn via train connections). The opacity of the lines indicates routes frequently used by several transportation lines, whereas more transparent routes are used by only one.

Conclusions and outlook

Due to the nature of the data, correlation with real-time events was limited. Nevertheless, the complexity of GTFS feeds made them a suitable data set for handling large amounts of data, visualizing geospatial data, performing data pre-processing and statistical analysis, and visualizing the outcomes. In future projects, the analysis of real-time data would be an important asset to reveal interesting correlations. For this purpose, the code presented here can easily be transferred to the GTFS feeds of other agencies.

Team members: Shuaib Abdulmajeed, Belman Akshay Tantry, Philipp Demling, Bjorn van Wijk, Niels Rohmert, Laura Grabowski, Mentor: Roney Mathew

Resources

TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact journey.ac@techlabs.org so that the corresponding item can be removed.

