Grünenthal: Availability Prediction of Production Lines

TechLabs Aachen
7 min read · Nov 10, 2021


This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Summer Term 2021)

Introduction.

Our team set out this semester to develop an availability prediction model for production lines (PLs) at a pharmaceutical company. The team’s approach consisted of using the Python programming language and its open-source libraries in the context of Data Science (DS) and Machine Learning (ML) to complete the task at hand.

The main use of such a model is to indicate in advance whether a PL will be available during the next shift. This helps the engineers plan where to direct the production load and minimizes losses for the company caused by unexpected downtime.

The partnerships team of TechLabs Aachen contacted Grünenthal to collaborate on this semester’s DS and Artificial Intelligence (AI) projects. This way, we obtained eight real-world datasets containing information about the operations, states and anonymised employee usage of real PLs. Our project started with cleaning the datasets and continued through a typical ML workflow until the model was trained and evaluated.

Method.

Fig. 1: Data Science Lifecycle.[1]

The team tried to follow the steps of the DS lifecycle diagram (Fig. 1).[1] We began by establishing milestones and setting deadlines for the different stages of the project. First, the provided data was cleaned. Second, an exploratory data analysis (EDA) was conducted to understand the features and the relations between them.

During this phase, each member of the team worked individually on a different dataset and then presented their findings to the other team members so that everyone got a firm grasp on the data in general. The following Python libraries were mostly utilized during this phase:

  • Pandas for dataframe manipulation and cleaning;
  • Seaborn for statistical data visualization;
  • NumPy for statistical analysis and data manipulation;
  • Matplotlib.pyplot for data visualization.

After achieving a good understanding of the data at hand, we removed the empty, redundant, illogical or insignificant features that would not enrich our ML algorithms. At this stage of the project, some members of the team dedicated their efforts to reading research papers and studying methods used in similar projects in order to clarify and improve our own approach. Meanwhile, the rest of the team focused on engineering the features of a single PL within a limited timeframe.

We focused on engineering a Pandas dataframe that contained as much pertinent information on the PLs as possible and could be fed to ML algorithms effectively. For this, we had to find an efficient way of merging the datasets without losing important data or exponentially inflating its size. We agreed on a shift-wise discretization, since the dataset containing the target value was already discretized this way.
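
As an illustration, here is a minimal sketch of such a merge in Pandas. The dataframes, column names and values are hypothetical stand-ins for two shift-wise aggregated datasets:

```python
import pandas as pd

# Toy stand-ins for two datasets already aggregated to one row per (date, shift).
states_per_shift = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-04"],
    "shift": ["early", "late"],
    "failure_time": [0.5, 1.2],
})
downtime_per_shift = pd.DataFrame({
    "date": ["2021-01-04", "2021-01-04"],
    "shift": ["early", "late"],
    "scheduled_down_time": [0.0, 2.0],
})

# An inner join on the shared (date, shift) key keeps only shifts present in
# both sources and avoids blowing up the row count on mismatched keys.
merged = pd.merge(states_per_shift, downtime_per_shift,
                  on=["date", "shift"], how="inner")
```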

However, the dataset containing the target value had no time feature, only a date feature, which required us to figure out the start and end times of the shifts. This was done using histograms with 24 bins over all lines to determine at which hour most employees start working (blue) or end their work (red) (Fig. 2).

Fig. 2: Histograms for the determination of the shift timeframes.
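
Histograms like these can be reproduced with a few lines of Matplotlib. In this sketch, the file and column names for the anonymised employee usage data are assumptions:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical file and column names for the anonymised usage dataset.
usage = pd.read_csv("employee_usage.csv",
                    parse_dates=["login_time", "logout_time"])

fig, ax = plt.subplots()
ax.hist(usage["login_time"].dt.hour, bins=24, range=(0, 24),
        color="blue", alpha=0.6, label="work start")
ax.hist(usage["logout_time"].dt.hour, bins=24, range=(0, 24),
        color="red", alpha=0.6, label="work end")
ax.set_xlabel("Hour of day")
ax.set_ylabel("Number of employees")
ax.legend()
plt.show()
```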

The information about the shift times was then used to group the data of the other datasets into the shift-wise discretization; data that could not be assigned to any shift was dropped (<0.2% of the data was lost).
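
A minimal sketch of this grouping, assuming shift boundaries of 06:00, 14:00 and 22:00 read off the histograms (the boundaries, file and column names are hypothetical):

```python
import pandas as pd

def assign_shift(ts: pd.Timestamp) -> str:
    """Map a timestamp to a shift using the assumed boundaries."""
    if 6 <= ts.hour < 14:
        return "early"
    if 14 <= ts.hour < 22:
        return "late"
    return "night"

# Hypothetical event-level dataset with one row per unplanned stop.
events = pd.read_csv("machine_states.csv", parse_dates=["timestamp"])
events = events.dropna(subset=["timestamp"])  # unassignable rows are dropped
events["shift"] = events["timestamp"].map(assign_shift)
events["date"] = events["timestamp"].dt.date

# One row per (date, shift): total failure time and failure count.
per_shift = events.groupby(["date", "shift"]).agg(
    failure_time=("downtime_h", "sum"),
    down_frequency=("downtime_h", "count"),
).reset_index()

# MTBF = (working time - failure time) / number of failures (see the feature
# list below), here with an assumed 8-hour shift duration.
per_shift["mtbf"] = (8.0 - per_shift["failure_time"]) / per_shift["down_frequency"]
```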

Since in a real-life scenario many features of the current shift are not known a priori (e.g. failure time, states), we had to shift the indices of these features such that, for example, the late shift is predicted using only data from the early shift.
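
One way to realize this in Pandas is to lag every column that is only known after the fact by one row, continuing from the per_shift dataframe sketched above:

```python
# Sort chronologically; "early" < "late" < "night" also sorts alphabetically.
per_shift = per_shift.sort_values(["date", "shift"]).reset_index(drop=True)

# Features that are only known once a shift is over.
lagged = ["failure_time", "down_frequency", "mtbf"]

# Shift them down by one row: row i now carries the values of shift i-1,
# while the target column keeps the availability of shift i.
per_shift[lagged] = per_shift[lagged].shift(1)
per_shift = per_shift.dropna(subset=lagged)  # first shift has no predecessor
```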

While sharing the gained insights and inspecting the engineered dataset in a meeting, the whole team decided to focus on regression models for our project. By starting small and gradually moving towards more complexity, we were able to build the models based on the key insights gained in the previous steps.

Once clean, engineered datasets with all the relevant information merged were available (approximately 470 data points per PL), the team proceeded with the previously agreed models. Based on the literature research, we came to the conclusion that at least five ML algorithms [2] are suitable for our case: Linear Regression (LR), Support Vector Regression (SVR), Random Forest (RF), Decision Tree (DT) and Elastic Net (ENet).

We used the scikit-learn open-source library to apply these algorithms to our data; a minimal sketch follows the feature list below. The dataframe that we used for the model contained the following features:

  • Date, as the index;
  • Failure time, which is the amount of time (in hours) that the PL stopped for unexpected reasons;
  • Unpredicted down frequency, which is the number of times a PL failed during a shift;
  • Scheduled down time (in hours);
  • Scheduled down frequency during a shift;
  • Shift, represented in two binary columns, one for the early shift and one for the night shift, with both checked to represent the late shift;
  • Shift duration (in hours);
  • Mean Time Between Failures (MTBF), calculated by subtracting the total failure time from the total working time and dividing by the number of failures;[3]
  • Availability during the next shift (in percent), as the target value.
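
As referenced above, here is a minimal sketch of the model comparison with scikit-learn, assuming the engineered dataframe df holds the features listed and a hypothetical target column availability_next_shift:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import ElasticNet, LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor

# Hypothetical target column; all remaining columns serve as inputs.
X = df.drop(columns=["availability_next_shift"])
y = df["availability_next_shift"]

models = {
    "LR": LinearRegression(),
    "SVR": SVR(),
    "RF": RandomForestRegressor(random_state=0),
    "DT": DecisionTreeRegressor(random_state=0),
    "ENet": ElasticNet(random_state=0),
}

for name, model in models.items():
    # Cross-validated RMSE and R² for each candidate model.
    rmse = -cross_val_score(model, X, y, cv=5,
                            scoring="neg_root_mean_squared_error").mean()
    r2 = cross_val_score(model, X, y, cv=5, scoring="r2").mean()
    print(f"{name}: RMSE = {rmse:.3f}, R² = {r2:.3f}")
```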

Problems.

The development of the model wasn’t straightforward in any phase. When we began cleaning and exploring the provided datasets, we were faced with some common data-cleaning issues, such as:

  • columns that didn’t contain enough data to be relevant or usable for our algorithms;
  • internal codes for certain actions, states and processes whose meaning we didn’t know, so that we couldn’t make use of the provided values;
  • inputs that contradicted other metrics;
  • features that represented the same information repeatedly.

The team’s first approach to this kind of data was to try to infer its meaning; however, due to the lack of substantiation, we finally decided to drop most of the values in question. We also decided to use only three of the eight provided datasets, since the discarded information could be derived from the remaining data.

Collecting all the data from the selected datasets proved harder than expected, since the datasets had different sizes due to different time discretizations. Merging was thus connected with either a loss of data or a large increase in dataset size because of the required multi-indexing.

In the end, the central problem we faced was the absence of strong correlations between the features and the target value (availability). The fact that the available datasets only covered January to June made this worse: with so little data, the time feature could not be used to find correlations, since no seasonality could be discovered.

Project Results.

As mentioned previously, five ML algorithms were compared for this regression problem. The results showed that LR and ENet perform better than the other models: their mean root mean square error (RMSE) scores are 0.153 and 0.152, respectively.

Fig. 3: Models’ RMSEs for a PL.

At the same time, LR and ENet also replicated the observed outcomes more closely. The mean coefficient of determination (R²) for these two models varies between 0.56 and 0.58. Figure 4 shows a more detailed comparison of the used ML models based on R².

Fig. 4: Models’ R² for a PL.

We consider the presented performance metrics to be only mediocre. Models with such results therefore cannot be deployed in an ML pipeline for the production environment.

Conclusion and Outlook.

“The goal is to turn data into information, and information into insight.” — Carly Fiorina (CEO:1999–2005, Hewlett Packard)

At the start of this project, we set out the goal to successfully predict the availability of individual PLs. Although we did not fully meet the outcomes we intended to achieve, the learnings helped us understand the challenges faced while handling real-world production data.

The weak correlations between individual features, coupled with an equivocal understanding of the production process, also strongly emphasized the importance of having sufficient knowledge of the business processes before implementing DS models.

The challenges the team faced during EDA, merging, feature engineering and the identification of suitable input/output features can serve as starting points for other DS enthusiasts to improve on. Moreover, production data stretching over a longer time period would further assist future prediction models by:

  1. Providing more data.
  2. Increasing the possibility of pattern recognition (seasonality, machine behavior, etc.).

To conclude, our project could serve as a basis for applications in the field of ‘Predictive Maintenance’, subject to sufficient data and a clear understanding of the business activities. Our overall journey with TechLabs has been interesting, fun-filled and challenging, and has helped us improve our skills as ‘techies’. Thank you!

References.

  1. https://docs.microsoft.com/en-us/azure/architecture/data-science-process/overview

The team:

  • Daniel Strohmeier
  • Elvis Gonsalves, LinkedIn
  • Dmitry Oleynikov, LinkedIn
  • Moritz Kappes
  • Oliver von Keitz Bravo

Mentor:

TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact journey.ac@techlabs.org so that the corresponding item can be removed.
