Quality Prediction in a Mining Process
This project was carried out as part of the TechLabs “Digital Shaper Program” in Aachen (Summer Term 2022).
Introduction
Mineral processing technologies aim to separate valuable minerals from unwanted minerals and contaminants. One of these technologies is froth flotation, which exploits physicochemical surface characteristics, such as hydrophobicity and hydrophilicity, to achieve this separation [PU 2020]. Being able to predict the final ore quality and the influences on it during the process makes advanced computer algorithms a promising tool for improving mining processes. To check the viability of these tools, this short paper describes the steps and results of applying various algorithms to a mining process to predict the final quality of iron ore.
The dataset was provided on Kaggle, where some code and discussions about it were already available. The main idea of the project is to predict the final quality of the iron concentrate. The input is an ore feed containing iron and silica; the output is the iron and silica concentrate, and the higher the iron concentrate, the better. The final concentrate is already recorded in the file, so the goal is to use Data Science tools to predict the final concentrate and compare the prediction with the real values contained in the dataset.
Dataset and Motivation
We obtained the dataset from the Kaggle challenge.
- Link: https://www.kaggle.com/datasets/edumagalhaes/quality-prediction-in-a-mining-process?resource=download
- Reading the Dataset:
- Rows and Columns in the dataset:
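The original post showed these two steps as screenshots. A minimal pandas sketch covering both, assuming the CSV keeps its Kaggle file name and (as in the Kaggle file) uses commas as decimal separators:

```python
import pandas as pd

# Reading the dataset
df = pd.read_csv("MiningProcess_Flotation_Plant_Database.csv", decimal=",")

# Rows and columns in the dataset
print(df.shape)    # (number of rows, 24 columns)
print(df.head())
```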
Knowledge of the underlying correlations and the ability to predict output quality without time-consuming laboratory examination would allow for easier optimisation of the process parameters and could bring great value to the field of mineral processing.
Data Pre-processing
As a first step, an exploratory study was carried out, involving plots, correlation analysis, and data cleaning. After these first steps, several machine learning models were developed and evaluated. Already while plotting the main features, some important issues appeared.
To use the date column in the dataframe, we changed the datatype of the ‘date’ column to “datetime64[ns]”.
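The conversion was shown as a screenshot in the original post; in pandas it amounts to:

```python
# Convert the 'date' column to datetime64[ns]
df["date"] = pd.to_datetime(df["date"])
print(df["date"].dtype)   # datetime64[ns]
```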
Issue: Missing/Constant values and Discontinuous time instances
One of the major issues we faced with the dataset was discontinuity in the observations and long stretches of constant values, i.e., few unique values per parameter across consecutive readings.
For example, we can see the highlighted time periods in the below plot:
Solution: Two Interpretations of the Dataset
We devised two approaches to solve the above issue:
Method-1: Per-hour averaged dataset
We took the mean of all the values within each hour (we had ~180 observations for the same hour), which gave us 4097 unique hourly values.
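The original snippet was shown as a screenshot; a minimal pandas sketch of the step:

```python
# Average the ~180 readings that fall into each hour
df_hourly = (
    df.groupby(pd.Grouper(key="date", freq="H"))
      .mean()
      .dropna()        # drop hours without observations
      .reset_index()
)
print(len(df_hourly))  # 4097 unique hourly values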
Method-2: Interpolated dataset
We interpolated the existing dataset to obtain 180 uniformly spaced values for each hour present (the code was written in Spyder).
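A sketch of this step, assuming 180 values per hour correspond to one value every 20 seconds and taking the hourly-averaged frame from Method-1 as the starting point:

```python
# Upsample to a 20-second grid (180 values per hour) and interpolate
# linearly between the hourly values
df_interp = (
    df_hourly.set_index("date")
             .resample("20S")
             .interpolate(method="linear")
             .reset_index()
)
```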
This resulted in a more gradual change in values rather than the step-like behaviour of the original dataset, illustrated below for a randomly selected 12-hour period.
Exploration of Features
Number and Types of Features Available
- Number of Features: 24
- Type of Features:
- % Iron and Silica Feed are quality measures of the iron ore pulp before it is fed into the froth flotation plant.
- Starch flow, Amina flow, and Ore Pulp properties are the most important variables (control variables) as they impact ore quality at the end of the process.
- Airflow and column level from flotation columns 1–7 measure the process parameters.
- % Iron and Silica Concentrate are the final iron ore pulp quality measures, drawn from laboratory results.
Correlation of Different Features
To check the correlations between the features, a heat map was plotted. As it shows, there are no strong correlations between the features, except between the iron and silica concentrates.
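The heat map can be reproduced with seaborn; a sketch (figure size and colour map are our choices):

```python
import matplotlib.pyplot as plt
import seaborn as sns

# Pairwise Pearson correlations of all numeric columns
corr = df_hourly.drop(columns=["date"]).corr()
plt.figure(figsize=(14, 10))
sns.heatmap(corr, cmap="coolwarm", center=0)
plt.title("Correlation of different features")
plt.tight_layout()
plt.show()
```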
Model Engineering using Machine Learning Algorithms
Algorithm-1: Long Short-Term Memory (LSTM)
Long Short-Term Memory (LSTM) networks are a type of recurrent neural network (RNN) capable of learning order dependence in sequence prediction problems. This model is widely used for time-series data since it can better preserve sequential information. Hence, it proved quite useful for predicting the silica concentration (impurity) in iron ore.
For this project, an LSTM model with 2 hidden layers from the TensorFlow library was used. 22 variables defined the feature set for this model. A drop rate of 20% was used between the neural network layers to reduce the chance of overfitting. A ‘lecun_uniform’ kernel initialiser was employed, and an Adam optimiser with a learning rate of 0.001 was applied to compile the model. The test dataset was used to validate the model. The model predicts the Silica Concentrate with a mean squared error (MSE) of 2% and a variance score (R² score) of 42%.
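A minimal TensorFlow/Keras sketch of such a network; the window length and layer widths are our assumptions, since only the number of hidden layers, the drop rate, the initialiser, and the optimiser are stated above:

```python
import tensorflow as tf
from tensorflow.keras import Sequential
from tensorflow.keras.layers import LSTM, Dropout, Dense

n_steps, n_features = 24, 22   # assumed window length; 22 input variables

model = Sequential([
    LSTM(64, kernel_initializer="lecun_uniform", return_sequences=True,
         input_shape=(n_steps, n_features)),
    Dropout(0.2),                        # 20% drop rate between layers
    LSTM(32, kernel_initializer="lecun_uniform"),
    Dropout(0.2),
    Dense(1),                            # predicted % Silica Concentrate
])

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
              loss="mse")
# model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=50)
```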
Algorithm-2: XGBoost
Gaining popularity due to its versatility and good performance for regression as well as classification problems, XGBoost was chosen as the second Machine Learning algorithm.
In line with the other algorithms, the cleaned dataset containing average values for each hour was used. The train-test split was set to a ratio of 8:2, leaving 2098 data points for training and 524 data points for testing the predictions. Fine-tuning was carried out using the GridSearchCV module of sklearn with the tuning parameters shown in the figure below.
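A sketch of this setup; the grid values below are illustrative stand-ins for the ranges shown in the figure, and X/y denote the hourly feature matrix and the % Silica Concentrate target:

```python
from sklearn.model_selection import train_test_split, GridSearchCV
from xgboost import XGBRegressor

X = df_hourly.drop(columns=["date", "% Silica Concentrate"])
y = df_hourly["% Silica Concentrate"]

# 8:2 train-test split: 2098 training and 524 test points
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Illustrative grid; the values actually tried are shown in the figure
param_grid = {
    "colsample_bytree": [0.3, 0.6, 1.0],
    "learning_rate": [0.02, 0.075, 0.1],
    "max_depth": [1, 5, 20],
    "n_estimators": [700, 800],
}

search = GridSearchCV(XGBRegressor(), param_grid,
                      scoring="neg_root_mean_squared_error", cv=5)
search.fit(X_train, y_train)
print(search.best_params_)
```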
The parameters leading to the lowest RMSE were identified as colsample_bytree = 0.6, learning_rate = 0.02, max_depth = 1 and n_estimators = 800. Combining these with the parameters alpha = 5 and early_stopping_rounds=15 to combat possible overfitting resulted in an RMSE of 0.161158 (0.117016 when using Iron Concentrate). For an illustration of the resulting prediction, predictions and test data are plotted below.
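Combining these parameters into the final model could look like the sketch below; note that passing early_stopping_rounds to the constructor assumes xgboost ≥ 1.6 (older versions take it in fit()):

```python
# Final model with the parameters identified by the grid search
model = XGBRegressor(colsample_bytree=0.6, learning_rate=0.02, max_depth=1,
                     n_estimators=800, alpha=5, early_stopping_rounds=15)
model.fit(X_train, y_train, eval_set=[(X_test, y_test)], verbose=False)

rmse = ((model.predict(X_test) - y_test) ** 2).mean() ** 0.5
print("RMSE:", rmse)
```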
These results show that the scaling of the predictions is not ideal. One explanation could be the very limited amount of data remaining after discarding parts with low data quality and averaging the 180 values per hour into a single value. On the other hand, the lack of Silica Concentrate values does not leave many options. It should also be noted that a scaling factor lower than 1 was applied, leading to an effectively lower RMSE than it would be at the original scale.
One attempt at increasing the prediction quality was made by interpolating between the constant Silica and Iron Concentrate values, assuming the recorded value is correct for the first time step of each hour. This increased the available data size from 2622 rows to 463673 rows. Another grid search was carried out, resulting in the parameters colsample_bytree = 0.3, learning_rate = 0.075, max_depth = 20, n_estimators = 700, alpha = 5 and early_stopping_rounds = 15 for this dataset. Again, parts with sustained periods of constant Concentrate values despite variable Feed values (marked in the plot below) were excluded.
The model as described before was trained on a randomly sampled 80% of the dataset, with 20% allocated for testing. The resulting RMSE of 0.193189 is higher than before; however, since this dataset was not scaled down, the prediction is effectively better. This can also be seen in the plot below, showing the same comparison as for the averaged values.
Algorithm-3: Random Forests
Random forest is a supervised learning model based on ensemble learning: it combines different decision tree models into a more powerful prediction model by introducing feature bagging, i.e., training each tree on random subsets of the features. To capture the time dependency of the dataset, a Silica Lag column was introduced.
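A sketch of how such a lag column can be built with pandas (the column name follows the dataset):

```python
# Use the previous hour's silica value as an additional, time-aware feature
df_hourly["Silica Lag"] = df_hourly["% Silica Concentrate"].shift(1)
df_hourly = df_hourly.dropna()   # the first row has no lag value
```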
Firstly, the data were split into training and test data, with 80% and 20% respectively. We then used the Random Forest Regressor model from the scikit-learn library. As this algorithm is computationally heavy, the number of estimators was lowered from the default 100 to 50. After some hyper-parameter tuning, the model was created with the following hyper-parameters.
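A sketch of this setup with scikit-learn; apart from n_estimators = 50, the hyper-parameter values below are placeholders for the tuned ones, and X/y include the Silica Lag column introduced above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# 80/20 split between training and test data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

rf = RandomForestRegressor(n_estimators=50,   # lowered from the default 100
                           max_depth=10,      # placeholder for the tuned value
                           random_state=42, n_jobs=-1)
rf.fit(X_train, y_train)
```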
Here is a visualisation of the model. It shows that at each node of each tree, a random feature from a subset of our dataset is selected and, depending on its value, the node is split into further nodes. In our first model, the feature selected for the root node is the Silica Lag, with a deciding value of 0.551. As each tree grows deeper, more features are used. Since each tree in the forest uses feature randomness and bagging, the final prediction is the average over all trees.
The result of the first model is as follows. As shown, the mean absolute error and the root mean squared error increased significantly between the training and test sets. This could be due to the small dataset we used and may indicate overfitting.
Since the preliminary training result was really promising, we decided to take a deeper look into the model and investigated its feature importances.
Since the Silica Lag (last column) has a high feature importance, we wanted to see how strongly the model depends on this feature. Thus, we dropped the Silica Lag column. Here is the result of the second model. The training set shows good results, with a high R² score of 90.8%. However, the test set performed worse than expected, with a negative R² score. Without the time-dependent Silica Lag feature, our second model therefore fits the dataset very poorly.
Algorithm-4: Linear Regression
Linear regression is a predictive analytics technique that uses historical and current data to estimate future results. It is suitable for many applications, for example in medicine, business, and economics. Linear regression relates one or more input variables to a target by fitting a linear relationship between them.
For this process and dataset, a linear regression with 22 variables was tried after cleaning and preprocessing the data. This model was built in an exploratory way; it would require further modification and a deeper understanding of the model and the process. Nevertheless, some initial exploratory results could be plotted.
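A minimal scikit-learn sketch of this exploratory fit, reusing an 80/20 train-test split of the 22 preprocessed features:

```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score

lr = LinearRegression()
lr.fit(X_train, y_train)      # X_train: 22 preprocessed feature columns

y_pred = lr.predict(X_test)
print("MSE:", mean_squared_error(y_test, y_pred))
print("R2: ", r2_score(y_test, y_pred))
```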
The plot does not show clear linearity. One possible reason is the small coefficient values of the 22 features: the coefficients are small in magnitude even though the features relate strongly to the iron concentrate. Another point to note is the X axis: although it is not at the right scale, the predictions were plotted on the same data as the real values.
The model achieves the following results: an MSE of 3% and an R² of 26%.
Algorithm-5: Support Vector Regression
Support Vector Regression (SVR) falls under the category of supervised learning algorithms and is used for predicting continuous values. SVR uses the same principle as Support Vector Machines (SVMs) to find the best-fit line, with a given margin of tolerance.
We used SVR on the hourly-averaged dataset after removing all the constant values (removal of readings that did not change for more than 25 hours). The code is as below:
- Libraries needed for SVR:
- Splitting the data into Features and Targets:
- Splitting the dataset into training and testing data:
- Scaling the data for better results:
- Using the SVR model:
- Calculating the Score and Error:
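Putting the listed steps together, a sketch of the SVR pipeline (column names follow the Kaggle dataset; the kernel and other SVR parameters are our assumptions):

```python
# Libraries needed for SVR
from sklearn.svm import SVR
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, r2_score

# Splitting the data into features and target
X = df_hourly.drop(columns=["date", "% Silica Concentrate"])
y = df_hourly["% Silica Concentrate"]

# Splitting the dataset into training and testing data
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

# Scaling the data for better results
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)

# Using the SVR model
svr = SVR(kernel="rbf")
svr.fit(X_train, y_train)

# Calculating the score and error
y_pred = svr.predict(X_test)
print("R2 score:", r2_score(y_test, y_pred))
print("RMSE:", mean_squared_error(y_test, y_pred) ** 0.5)
```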
The RMSE was 0.10.
Outcome
Importance of Features
During the fitting process, the importance of the features was analysed.
- When the feature of Iron Concentrate was not dropped, the fitting places high importance on this feature.
- When it is excluded, higher importance is placed on the Silica Lag feature.
- For the interpolated data-frame without
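For the tree-based models, these importances can be read directly from the fitted estimator; a sketch using the random forest and feature matrix from above:

```python
import matplotlib.pyplot as plt
import pandas as pd

# Rank the features by the importance the fitted forest assigns to them
importances = pd.Series(rf.feature_importances_, index=X.columns)
importances.sort_values().plot(kind="barh", figsize=(8, 6))
plt.xlabel("Feature importance")
plt.tight_layout()
plt.show()
```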
Future Prospects
The biggest challenge in fitting these models was the low data quality. For a more accurate prediction, more data without long periods of missing values would be necessary. On that basis, a better comparison of the models could be made to determine which is most suitable for this application. Further engineered features like the Silica Lag could also improve the overall accuracy. Transforming existing features such as pH values and flow rates was also discussed but not included.
Other prospects could be a more elaborate tuning of the used algorithms and expanding the used algorithms beyond the five described in this post. Moreover, over- and under-fitting were not considered in detail, and evaluating this might be beneficial for the final model selection and tuning.
Once a robust and accurate model has been created, quick online estimation would be a final step with immense impact, especially when coupled with fast feedback on changes to the process parameters.
Conclusion
For this project, real time-series data from a flotation plant were used to predict the resulting concentration of silica in the concentrated output based on input concentrations and operational parameters. Predictions were carried out using five algorithms: LSTM, XGBoost, Linear Regression, Random Forests, and Support Vector Regression.
After comparing these five machine learning methods on the available dataset, we concluded that LSTM and XGBoost perform best on the per-hour averaged dataset and the interpolated dataset, respectively. These conclusions are based on the state of data engineering that we could achieve in the allotted timespan. The other methods, Random Forests, Linear Regression, and Support Vector Regression, can also be applied to both the per-hour averaged and the interpolated dataset, given more refined data cleaning and improved hyper-parameters.
More varied datasets can also be included as a part of the improved prediction methods, including different weight adjustments for various features.
Last but not least, we would like to thank TechLabs Aachen and our mentor, H. Ritter, for this amazing opportunity and the guidance throughout the project phase.
TechLabs Aachen e.V. reserves the right not to be responsible for the topicality, correctness, completeness or quality of the information provided. All references are made to the best of the authors’ knowledge and belief. If, contrary to expectation, a violation of copyright law should occur, please contact journey.ac@techlabs.org so that the corresponding item can be removed.