Predicting Volcanic Eruptions from Seismic Behavior

Soumya De
Analytics Vidhya
May 12, 2021

Developing a web-based machine learning application to predict volcanic eruptions on the basis of seismic signals

Volcano Acatenango, Antigua, Guatemala: Photo by Aaron Thomas on Unsplash

What if we could forecast volcanic eruptions in the same way we predict weather? A single reliable forecast would aid in the planning of adequate emergency response, potentially saving thousands of lives.

Several organizations, including NASA, USGS, and INGV, are advocating for this cause. Their primary goal is to build a better understanding of the Earth’s structural and geological behavior while minimizing the threats that come with it.

According to studies, seismic indicators are an excellent predictor of a volcano’s impending eruption. The problem with seismic signals, though, is that they are difficult to interpret. In current systems that make use of these signals, eruptions can be estimated only a few minutes in advance; longer-term predictions remain difficult.

In this blog, we will go through the end-to-end process of creating a web app (from problem understanding to deployment) that can forecast the approximate time of eruption of a volcano based on seismic signals collected by sensors mounted near the volcano.

Road Map

This section lays out the application’s agenda and is intended to provide a high-level overview of the main stages of this case study.

The topics that will be covered in this blog are stated below.

  • Pre-requisites
  • Data collection
  • Problem understanding
  • Exploratory data analysis
  • Previous works
  • Developing a benchmark model
  • Developing a custom stacked ensemble model
  • Web-app development using Streamlit
  • Web-app deployment on Heroku platform

Let’s begin by looking at some of the ideas, resources, and platforms needed to follow this blog smoothly.

1. Pre-requisites

Here is a list of pre-requisites that are needed (or rather, helpful) in building similar data-related projects.

  • Kaggle Account - Don’t have an account? Register yourself at kaggle.com
  • Using Git - There is a nice blog by HubSpot that gives a brilliant introduction to using Git and GitHub.
  • Virtual environments, Python, pip, and the basics of machine learning

Note: This blog focuses mainly on developing a custom ML model and deploying it as a web-based application, rather than on machine learning itself.

Don’t worry if some of these pre-requisites are missing from your toolbox; please go through the blog anyway. One can always come back and learn what is required!

2. Data Collection

Now that we have gone through the agenda and the pre-requisites, we can finally get our hands on the data.

The data was released by the Istituto Nazionale di Geofisica e Vulcanologia (INGV), Italy’s National Institute of Geophysics and Volcanology, through a Kaggle competition. The data can be accessed through the link provided below.

Source: Kaggle INGV Volcanic Eruption Prediction Data

Fig 2: Snapshot of Kaggle INGV competition Data Explorer

The download is a zip file which, upon extraction, yields a directory structure as shown below:
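As a rough sketch, the archive can also be extracted programmatically (the file name below assumes the competition slug and is illustrative):

import os
import zipfile

# Hypothetical archive name; adjust to whatever Kaggle names your download
with zipfile.ZipFile("predict-volcanic-eruptions-ingv-oe.zip") as zf:
    zf.extractall("data")

# Expected contents: train.csv, sample_submission.csv and the train/ and test/ folders
print(os.listdir("data"))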

3. Understanding the problem

3.1 Overview

The case we’re working on is based on a competition hosted by Kaggle. The data that is supplied includes measurements taken by seismometers placed in various locations around an active volcano.

Each measurement is a collection of ten time series captured at regular intervals from ten seismometers over a duration of ten minutes. Each measurement is associated with a value, called time to eruption, that represents the time elapsed between the end of the measurement and the start of an eruption.

The problem is that we don’t know much else about the data, such as what the sensors’ signals mean or where the sensors are located relative to one another. In this study, we will use machine learning methods to derive useful features from the raw data and attempt to model the relationship between the data and the corresponding time-to-eruption values.

3.2 Key constraints

The key constraints are that the model must have low latency (predictions within seconds, or at most a few minutes, are acceptable), must be reliable enough that impending eruptions are not overlooked, and must not raise false alarms.

3.3 Use of machine learning

The aim of machine learning is to understand the nature of data and capture its essence with mathematical modelling, which in turn can be used to comprehend or predict similar data in the future.

In this case study, we will try to model the time-to-eruption value from its multi-dimensional seismic predictors using supervised machine learning algorithms. Since this target variable is continuous, the modelling task is regression, and the key performance metric used to measure model performance at different stages of this case study is mean absolute error (MAE).
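For reference, MAE is simply the average absolute difference between predicted and true values. A minimal illustration with scikit-learn (the numbers are made up):

import numpy as np
from sklearn.metrics import mean_absolute_error

y_true = np.array([6250, 49046087, 2000000])   # time_to_eruption values (centiseconds)
y_pred = np.array([10000, 48000000, 2500000])  # hypothetical model predictions

# MAE = mean(|y_true - y_pred|); lower is better
print(mean_absolute_error(y_true, y_pred))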

But machine learning is not only about the modelling itself; it spans the whole process, from collecting, understanding, cleaning, and transforming the data to deploying the developed model.

Let’s proceed step by step…

4. Exploratory data analysis

We’ve downloaded the data to our device and are now ready to dig in. Let us take a look at what we have here.

  • We have two csv files (train.csv and sample_submission.csv) and two directories (train, test).
  • The train directory contains 4431 csv files, which serve as our training data.
  • train.csv has two columns: segment_id and time_to_eruption. Each value of the time_to_eruption variable is associated with a certain segment_id, which indicates the training file that the particular time_to_eruption value belongs to.
  • Similarly, the test directory includes 4520 test files, and the sample_submission.csv file includes 4520 segment_ids (the test file names), but this time the time_to_eruption values are unknown.

Let’s take a look at what those particular csv files in the train or test folders hold.

Fig 5: First few samples of two random csv files from the train and test data

So, for each segment_id (one data point), we have a total of 60001 readings from 10 sensors. The multivariate time series for each segment was recorded at regular intervals over 10 minutes (600 seconds), so we can assume the readings were taken every hundredth of a second, i.e., every 10 milliseconds (one centisecond). We also have a variable called time_to_eruption associated with each segment_id, representing the time elapsed between the end of each 10-minute segment and the start of an eruption.
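A quick sanity check of a single segment confirms this shape (the segment id below is hypothetical):

import pandas as pd

# Each segment file is named <segment_id>.csv inside train/ or test/
segment = pd.read_csv("data/train/1000015382.csv")  # hypothetical segment id

print(segment.shape)          # (60001, 10): 60001 readings x 10 sensors
print(list(segment.columns))  # sensor_1 ... sensor_10, as seen in Fig 5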

Let’s take a look at the distribution of the variable time to eruption.
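A minimal sketch of how this distribution can be plotted (assuming train.csv has been read into a DataFrame; the path is illustrative):

import pandas as pd
import matplotlib.pyplot as plt

train = pd.read_csv("data/train.csv")  # columns: segment_id, time_to_eruption

# Density histogram with 24 bins, as discussed below
train["time_to_eruption"].plot.hist(bins=24, density=True)
plt.xlabel("time_to_eruption (centiseconds)")
plt.show()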

The minimum value of time_to_eruption: 6250
which is roughly 0 hours 1 minute

The maximum value of time_to_eruption: 49046087
which is roughly 136 hours 14 minutes

The average bin width of the histogram is 2043326.5416666667,
which in hour-minute format is 5 hrs 40 mins
  • The time_to_eruption density plot above shows that the distribution is roughly uniform between the variable’s minimum (0.0006e7) and maximum (4.9e7) values.
  • We used 24 bins in the histogram. As time to eruption ranges from 1 minute to 136 hours (5+ days), the bin width comes out to roughly 6 hours (5 hrs 40 mins, to be precise). In other words, the first bin contains data whose time to eruption varies from 0 to 6 hrs, the second bin from 6 to 12 hrs, and so on.
  • The number of data points (the density) in the last two bins is lower than in the other bins. As a result, relatively few segments have a time to eruption that falls within the last 12 hours.

Visualization of the multivariate time series

Let us begin by plotting the data for the segment with the smallest value for the time to eruption variable.

The plot above visualizes the sensor values of the segment with the smallest time_to_eruption value (6250, or ~1 minute).

  • At around the 4th minute [~25000 cs] of the 10-minute recording span, sudden spikes appear in most of the sensors.
  • Spikes are also seen at the 8th minute [~50000 cs] in sensor 1 and sensor 7.
  • Sensor 4 has captured some critical tremors.
  • Sensors 2, 3, and 8 have entirely missed capturing anything. This absence could be simply coincidental, the result of a technical error, or the result of a hidden pattern.

The PDFs of the sensor signals for the same segment are given below.

  • The above plot depicts the density distribution of each individual sensor for the segment with the lowest time_to_eruption value.
  • The sensor readings (as viewed visually) are very similar to normal distributions.
  • We can also examine the sensor density distribution for other segments and see whether they are also characterized as normal.

We’ve already seen that some segments have completely missing sensor columns. We’ll now try to figure out which sensors completely skipped which segments.

We can see from the plot above:

  • Sensor 2 skipped 18.84 percent of segments (835), the highest among all sensors, followed by sensor 5 (13.43 percent), sensor 8 (9.93 percent), and sensor 3 (9.82 percent).
  • Sensors 1, 7, 9, and 10 have missed fewer segments.
  • For sensors 4 and 6, we have data for all training segments.

Let’s look at the range of values for each sensor now. We would do this by determining the maximum and minimum values for each sensor for all training segments.
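A sketch of how these ranges can be accumulated across all training segments (paths are illustrative; columns a sensor missed entirely simply never update the running values):

import glob
import pandas as pd

global_min, global_max = {}, {}

for path in glob.glob("data/train/*.csv"):
    seg = pd.read_csv(path)
    for col in seg.columns:
        global_min[col] = min(global_min.get(col, float("inf")), seg[col].min())
        global_max[col] = max(global_max.get(col, float("-inf")), seg[col].max())

print(global_min)
print(global_max)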

We can deduce from the above output that all sensor values range from -32767 to 32767. And now we’ll sort the train segments by time to eruption and examine some of the segments with low and high time to eruption values.

Let’s look at some time series visualizations of segments with low time to eruption values.

The plot above depicts a segment whose eruption is approximately 4 minutes in the future (with respect to the time when the data was captured). Let’s take a look at some more data with low time-to-eruption values.

The visualization for the second and third segments (2nd and 3rd wrt time to eruption) is shown in the above two plots, and we can see that:

  • Around 30000 cs, the signals in the first plot exhibit significant tremors in all sensors.
  • The signals in the second plot also have peaks and falls between 10000 and 20000 cs and between 40000 and 50000 cs, but they are not as clearly distinguishable as in the first plot.
  • This data’s highs and lows may be useful for forecasting eruptions.

Let’s take a look at several other segments that have higher time to eruption values. We will now plot the data with a much larger time to eruption value and see how it can be comparable to the plots above (with a much smaller time to eruption).

The two plots above show the sensor values from the last two segments (in terms of time to eruption):

  • Sensors 1 and 5 failed to catch the time series for the segments with the highest time to eruption value.
  • Peaks and falls between specific time intervals (within the given 10 minutes) can be used as time series features.
  • The absence of an entire sequence by a specific sensor or group of sensors may also indicate time to eruption.

We’ll now produce sensor value density plots for some random segments from the training data to examine their distribution patterns.

From the above two plots, we can conclude:

  • Although the majority of the sensor distributions appear to be normal, not all of them are. We can verify this by examining the distributions of sensors 6 & 10 (in the first plot) and sensors 5 & 6 (in the second plot).
  • However, the majority of them resemble the bell curve of the normal distribution. As a result, we consider adding features derived from the sensor distributions, such as mean, standard deviation, different quantile values, skewness, and kurtosis, to each of the data segments.

5. Related Works

Let’s take a brief look at some previous approaches to this challenge. This section is not part of the development process, but it offers a broader viewpoint by examining past implementations of ideas in this direction; some of them are briefly discussed below:

5.1 Mel Spectrogram + Blended ResNets (link)

Fig 2: Diagram of the solution pipeline

This method employs a blended ResNet model applied to visualizations derived from the signals. The technique is divided into two parts:

  • Sensor signals from each segment are translated into Mel spectrograms, which serve as a visual representation of the signals.
  • These spectrograms are then fed to a weighted average (blend) of several ResNet and tree-based models (Fig 2).

5.2 Two-way stacking using Gradient Boosted Trees (link)

The steps followed in this methodology are given below:

  • Feature engineering
  • Validation and feature selection
  • First-level models: three models are used in the first level of the stacked ensemble, namely LightGBM, XGBoost, and a Neural Network.
  • Second-level models: two models based on the first-level LightGBM were created at the second level of stacking.
  • Average blending: finally, the predictions of the two second-level models were averaged to obtain a better LB (leaderboard) score.

5.3 Deep learning model using LSTM layer (link)

This approach employs the LSTM recurrent neural network architecture. The suggested method is divided into two stages:

  • Feature engineering: four types of features were incorporated in the dataset, namely temporal, frequential, cepstral, and spectral features.
  • Deep learning model: an LSTM layer is followed by three Conv1D layers. Flattening is then performed using a Flatten layer, and the output of the Flatten layer is fed to a fully connected NN (consisting of three dense layers of 64, 32, and 1 unit(s), respectively).

6. Developing a benchmark model

We’ve looked deeply enough into our data to begin some simple modelling. Each data point, as we know, is a 10-dimensional time series, and each component series can be thought of as a sensor value distribution corresponding to a particular data point or segment.

6.1 Descriptive statistics of each segment

In this section, we will develop some fundamental descriptive features from the sensor value distributions. This feature set will include: mean, standard deviation, minimum, maximum, different quantile values (30th, 60th, 80th, 90th), skewness, and kurtosis for each sensor distribution.
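A sketch of how these statistics can be computed for one segment with pandas (the helper name is ours, and missing values are imputed with zero, as discussed later):

import pandas as pd

def segment_features(segment: pd.DataFrame) -> dict:
    # Descriptive statistics per sensor column; NaNs imputed with zero
    segment = segment.fillna(0)
    features = {}
    for col in segment.columns:
        s = segment[col]
        features[f"{col}_mean"] = s.mean()
        features[f"{col}_std"] = s.std()
        features[f"{col}_min"] = s.min()
        features[f"{col}_max"] = s.max()
        for q in (0.30, 0.60, 0.80, 0.90):
            features[f"{col}_q{int(q * 100)}"] = s.quantile(q)
        features[f"{col}_skew"] = s.skew()
        features[f"{col}_kurtosis"] = s.kurtosis()
    return features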

Correlation coefficient plot

We can now investigate the correlation of the features with the target variable, time to eruption.

Note: The image is a small part of the actual one, check out the full image by clicking on this link.

The above plot depicts the correlation between the features and the target variable.

  • All of the basic features are correlated with the target (positively or negatively), each to a greater or lesser extent.
  • The features representing each sensor’s minimum and 30th quantile are positively correlated with the time to eruption variable.
  • The features representing standard deviation, maximum, and the 60th, 80th, and 90th quantiles are negatively correlated with the target.
  • The correlation magnitude is very low for certain features, such as mean, skew, and kurtosis.

Relative feature importance plot

The plot below depicts the relative importance of the drawn features, as calculated by the Random Forest algorithm (a sketch of this computation follows the list below).

  • The standard deviations of sensor 2 and sensor 5 are much more important than the rest of the features.
  • The majority of the skewness and mean features are of comparatively low value.
  • Since most of these features are important in some way, we will use them to build our benchmark model.
  • However, we will disregard the mean and skewness features of all sensor distributions while modelling.
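A sketch of that computation, on stand-in data (the real feature matrix is built from the steps above):

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Stand-ins for the real feature matrix and targets
rng = np.random.RandomState(42)
X_train, y_train = rng.randn(200, 12), rng.rand(200) * 5e7
feature_names = [f"feat_{i}" for i in range(X_train.shape[1])]

rf = RandomForestRegressor(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)

# Impurity-based relative importances; one value per feature, summing to 1
for name, score in sorted(zip(feature_names, rf.feature_importances_),
                          key=lambda p: p[1], reverse=True)[:5]:
    print(f"{name}: {score:.4f}")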

Since we are keeping track of the min and max of each sensor for each segment, we don’t need a separate indicator for missed sensors, as it would be redundant information. We have imputed the missing values with zero, so a segment entirely missed by a sensor will have a zero in its min as well as its max features.

Machine learning models on basic descriptive features

The first set of features for the given segments has been drawn out. In this section, we incorporate various ML models and check the usefulness of these features.

In this section, we will go through tiny code fragments and their associated results, avoiding complete code walkthroughs. The implementation code for replicating the case study is linked at the end of this blog (the corresponding GitHub repository).

  • Elastic Net

In this experiment, we examine the error measure for various values of alpha (a constant that determines the amount of regularization applied) and train an elastic net model with the alpha that produces the least error. Instead of feeding the model the raw features, we scale the data first and then train the model.
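A minimal sketch of this sweep, on stand-in data (the alpha grid and split sizes are illustrative):

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Stand-ins for the real feature matrix and targets
rng = np.random.RandomState(0)
X, y = rng.randn(400, 12), rng.rand(400) * 5e7
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.25, random_state=0)

# Scale first, then sweep the regularization strength alpha
scaler = StandardScaler().fit(X_train)
X_train_s, X_cv_s = scaler.transform(X_train), scaler.transform(X_cv)

best_alpha, best_mae = None, float("inf")
for alpha in [1e-3, 1e-2, 1e-1, 1.0, 10.0, 100.0]:
    model = ElasticNet(alpha=alpha, max_iter=5000).fit(X_train_s, y_train)
    mae = mean_absolute_error(y_cv, model.predict(X_cv_s))
    if mae < best_mae:
        best_alpha, best_mae = alpha, mae

print(f"best alpha: {best_alpha}, validation MAE: {best_mae:.2f}")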

  • Random Forest Regressor

Next, we investigate the error measure of a Random Forest model for various numbers of base learners and train the Random Forest with the optimal number of base learners.

  • XGBoost Regressor

We will repeat the experiment that we did with the Random Forest model and train a gradient-boosted regressor with the optimal number of base learners.
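A sketch of this sweep for the Random Forest, reusing the toy split from the ElasticNet sketch above (the XGBoost version is analogous, swapping in xgboost.XGBRegressor):

from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# X_train, X_cv, y_train, y_cv: as in the ElasticNet sketch above
for n in [100, 300, 500, 1000]:
    rf = RandomForestRegressor(n_estimators=n, random_state=42, n_jobs=-1)
    rf.fit(X_train, y_train)
    mae = mean_absolute_error(y_cv, rf.predict(X_cv))
    print(f"n_estimators={n}: validation MAE {mae:.2f}")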

Summary:

  • In terms of validation MAE, both XGBoost and Random Forest outperform ElasticNet, but they both overfit the training data.
  • Between the two, Random Forest outperforms XGBoost, since the XGBoost regressor overfits the train set more severely and has a higher validation MAE than the RF model.
  • We can now proceed with hyperparameter tuning of the Random Forest and XGBoost models using RandomizedSearchCV in order to reduce overfitting.

Hyperparameter Tuning

  • Random Forest Regressor
For the hyperparameter-tuned RandomForestRegressor with parameters {'n_estimators': 1500, 'min_samples_split': 12, 'min_samples_leaf': 5, 'max_samples': 0.8, 'max_features': 0.6, 'max_depth': 12}:
Mean Absolute Error for train data: 4383116.024559133
Mean Absolute Error for cv data: 5500394.4091687715
  • XGBoost Regressor
For the hyperparameter-tuned XGBoostRegressor with parameters {'subsample': 0.4, 'n_estimators': 800, 'max_depth': 9, 'learning_rate': 0.05, 'colsample_bytree': 0.8}:
Mean Absolute Error for train data: 19707.49871692089
Mean Absolute Error for cv data: 4116339.266495798
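A sketch of such a tuning run for the Random Forest (the search space below loosely mirrors the tuned parameters reported above and is illustrative):

import numpy as np
from scipy.stats import randint, uniform
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Stand-ins for the real feature matrix and targets
rng = np.random.RandomState(0)
X_train, y_train = rng.randn(300, 12), rng.rand(300) * 5e7

param_dist = {
    "n_estimators": randint(500, 2000),
    "max_depth": randint(4, 16),
    "min_samples_split": randint(2, 20),
    "min_samples_leaf": randint(1, 10),
    "max_features": uniform(0.3, 0.6),  # samples from [0.3, 0.9)
    "max_samples": uniform(0.5, 0.4),   # samples from [0.5, 0.9)
}

search = RandomizedSearchCV(
    RandomForestRegressor(random_state=42),
    param_distributions=param_dist,
    n_iter=10,
    scoring="neg_mean_absolute_error",
    cv=3,
    random_state=42,
    n_jobs=-1,
)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)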

Summary:

Note: HPT = Hyperparameter-Tuned

6.2 Descriptive statistics wrt time steps of each segment

In this section, we will build features similar to those derived in the previous section, but with a minor difference: we first divide each segment into six parts (each containing 10000 timesteps), and then compute simple statistics (such as mean, maximum, and minimum) for each of these six parts individually to obtain the new features.
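A sketch of this windowed featurization (the helper name and the chosen statistics are illustrative):

import pandas as pd

def windowed_features(segment: pd.DataFrame, window: int = 10000) -> dict:
    # Split each sensor column into windows of `window` timesteps and
    # compute simple statistics (mean, min, max) for each window
    segment = segment.fillna(0)
    features = {}
    n_windows = len(segment) // window  # 60001 rows -> 6 full windows (last row dropped)
    for col in segment.columns:
        values = segment[col].to_numpy()
        for w in range(n_windows):
            chunk = values[w * window:(w + 1) * window]
            features[f"{col}_w{w}_mean"] = chunk.mean()
            features[f"{col}_w{w}_min"] = chunk.min()
            features[f"{col}_w{w}_max"] = chunk.max()
    return features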

Correlation coefficient plot

We will now look into the correlation of the features with respect to the target variable, time to eruption.

Note: The image is a small part of the actual one, check out the full image by clicking on this link.

The above plot depicts the correlation between the features and the target variable.

  • The majority of the features (except those related to mean, kurtosis, and skewness) are relevant to the target variable.
  • Features such as maximum, standard deviation, and the 60th, 80th, and 90th quantile values are negatively correlated with the target,
  • whereas the minimum and 30th quantile features are positively correlated.

Relative feature importance plot

The relative feature importance of the new features is given by the plot below.

Note: The image is a small part of the actual one, check out the full image by clicking on this link.

  • The standard deviation of sensor_2_60 has much higher importance than the rest of the features.
  • Since many of these features are important in some way, we will use them to build our benchmark model.
  • However, we will disregard the mean, kurtosis, and skewness features of the given sensor distribution this time.

Machine learning models on basic descriptive features over different time windows

The second set of features for each segment has now been developed. In this section, we will use the same ML models as in the previous section to assess the effectiveness of these features.

In this part we carry out the same set of experiments as in the previous section with ElasticNet, RandomForest, and XGBoost. The results of these experiments are summarized below.

Summary:

A comparison of the MAE scores with the previous feature set shows that the time-step-based descriptive features did not contribute significantly to the featurization process. As a result, we did not do any hyperparameter tuning in this set of experiments.

6.3 Machine learning models on above two feature sets

We can now take both feature sets together and feed them into a feature selection module to determine the optimal number of features that are truly useful in predicting the time to eruption.
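The exact selection method is not spelled out here; as one possible sketch, a univariate selector can be swept over candidate feature counts (on stand-in data again):

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split

# Stand-ins for the combined feature matrix (both feature sets) and targets
rng = np.random.RandomState(0)
X, y = rng.randn(400, 120), rng.rand(400) * 5e7
X_train, X_cv, y_train, y_cv = train_test_split(X, y, test_size=0.25, random_state=0)

best_k, best_mae = None, float("inf")
for k in range(10, X_train.shape[1] + 1, 10):
    selector = SelectKBest(f_regression, k=k).fit(X_train, y_train)
    model = RandomForestRegressor(n_estimators=200, random_state=42, n_jobs=-1)
    model.fit(selector.transform(X_train), y_train)
    mae = mean_absolute_error(y_cv, model.predict(selector.transform(X_cv)))
    if mae < best_mae:
        best_k, best_mae = k, mae

print(f"optimal number of features: {best_k} (validation MAE {best_mae:.2f})")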

The above plot shows that the optimal number of features is 60. We will now repeat the same series of experiments for the three previously used models and summarize the results below.

Summary:

The performance on the validation set has degraded after hyperparameter tuning, but the problem of overfitting has been greatly reduced. Before moving forward, let us take a look at the mean absolute error values obtained so far.

Summary:

However, if we evaluate the model’s performance using the following criteria:

  • fitting the train set
  • model performance on validation set

We think both the HPT RandomForest on the descriptive-stats feature set (Model 1) and the HPT RandomForest on the selected descriptive-stats feature set (Model 2) serve as good benchmarks. We have used both to obtain results on the public test data and submitted the result files to the competition site.

Given below are the Kaggle scores obtained on the private and public test set:

  • Considering the time taken to generate predictions for the test set and the performance score generated by Kaggle, Model_1 performs better.
  • Hence we will choose Model_1 (Basic Descriptive Features + RandomForestRegressor) as the benchmark/baseline.

7. Developing a custom ensemble model

This section presents a custom ensemble algorithm to model our data. The model uses a stacking mechanism, i.e., obtaining predictions from base models as features and using these features to train a meta-learner that models the relationship between the obtained features and the target variable.

The process of modelling on the train set is stated as follows:

  • Split the train set equally into two sets: D1 and D2.
  • Perform sampling with replacement on D1 to create k samples: d1, d2, d3, …, dk.
  • Train k base models (decision trees in our case) using these k samples.
  • Pass D2 through the k trained base learners to obtain a k-dimensional feature set.
  • Using this k-dimensional feature set along with the D2 targets, train a meta-learner. Any regression algorithm can be used as the meta-learner.

We have defined this custom ensemble regressor as a python class; its definition is given below.
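(The embedded definition is not reproduced here; below is a minimal sketch of such a class, following the five steps above. The class name, defaults, and bootstrap sample size are illustrative.)

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

class CustomStackedEnsemble:
    # Stacked ensemble: k bootstrapped decision trees + a meta-learner

    def __init__(self, meta_learner, n_base_learners=100, sample_frac=0.8, random_state=42):
        self.meta_learner = meta_learner
        self.n_base_learners = n_base_learners
        self.sample_frac = sample_frac
        self.random_state = random_state
        self.base_learners = []

    def fit(self, X, y):
        # X, y are expected as numpy arrays
        rng = np.random.RandomState(self.random_state)
        # Step 1: split the train set equally into D1 and D2
        X1, X2, y1, y2 = train_test_split(X, y, test_size=0.5, random_state=self.random_state)
        # Steps 2 & 3: draw k bootstrap samples from D1 and fit a decision tree on each
        n_rows = int(self.sample_frac * len(X1))
        self.base_learners = []
        for _ in range(self.n_base_learners):
            idx = rng.choice(len(X1), size=n_rows, replace=True)  # sampling with replacement
            tree = DecisionTreeRegressor(random_state=self.random_state)
            tree.fit(X1[idx], y1[idx])
            self.base_learners.append(tree)
        # Step 4: pass D2 through the k base learners to get a k-dimensional feature set
        meta_features = np.column_stack([t.predict(X2) for t in self.base_learners])
        # Step 5: train the meta-learner on these features against the D2 targets
        self.meta_learner.fit(meta_features, y2)
        return self

    def predict(self, X):
        meta_features = np.column_stack([t.predict(X) for t in self.base_learners])
        return self.meta_learner.predict(meta_features)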

We tested a variety of meta-learners, including SVM, Decision Tree, and other tree-based models, and found that XGBoost with 1300 base learners outperformed the others. As a result, we can fix the number of base learners at 1300, set the meta-learner to XGBoost, and conduct a grid search on the meta-learner only.

With number of base learners = 1300 and meta-learner = XGBoost, the best MAE scores we get are:
Mean Absolute Error for train data: 1992397.2213964183
Mean Absolute Error for cv data: 5385850.012225198

Now that we have designed the custom stacked ensemble and obtained its scores, let us take a look at the mean absolute error values obtained so far.

Summary:

This custom model is saved in Python pickle format so that it can be used later for deployment. For the time being, we will use the model to evaluate its performance on the public test set. The MAE scores obtained on the public and private test sets are given below:

8. Streamlit web application

In this segment, we will use the custom ensemble model trained on the data to construct a web-app interface for user interaction using Streamlit, an open-source python library for building and sharing web apps.

Installation :

pip install streamlit

For the sake of modularity, we have separated our code into multiple python (.py) files for this case study, as follows:

  • model.py — to define the custom model class
  • predict.py — to define the final predict function and a function to return the error measure
  • helper.py — to define any additional functions required by the app
  • app.py — contains the main streamlit code that will help to render the app on a web-browser

One could simply put everything in the app.py file. In this section we will look at app.py only; the contents of the rest of the python files were explained in the model development section.

Let’s get started…

Step 1: Set a title, cover image for the web-app and create a sidebar for additional information about the app.

Fig #: title, cover & sidebar

Note: The text under About in the sidebar is written in a txt file: about.txt, which is located in the same directory as the app.py file.
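A sketch of this step (file names and text are illustrative):

import streamlit as st

# Title and cover image for the app
st.title("Predicting Volcanic Eruptions from Seismic Behavior")
st.image("cover.jpg", use_column_width=True)

# Sidebar with additional information, read from about.txt
st.sidebar.header("About")
with open("about.txt") as f:
    st.sidebar.info(f.read())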

Step 2: Create a file uploader object for handling csv file uploads for testing

Fig #: File uploader
Fig #: file uploader prediction output
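A sketch of the uploader logic (the predict helper is assumed to be the final predict function defined in predict.py):

import pandas as pd
import streamlit as st

from predict import predict  # assumed helper from predict.py

uploaded = st.file_uploader("Upload a segment csv file", type=["csv"])
if uploaded is not None:
    segment = pd.read_csv(uploaded)
    st.success(f"Predicted time to eruption: {predict(segment)}")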

Step 3: Create an option for visualizing the uploaded multi-dimensional time series

Fig #: Visualization output
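Continuing the sketch above, the uploaded segment can be plotted sensor by sensor:

import matplotlib.pyplot as plt
import streamlit as st

# `segment` is the DataFrame parsed from the uploaded file in the previous step
if st.checkbox("Visualize the uploaded time series"):
    fig, axes = plt.subplots(5, 2, figsize=(12, 10), sharex=True)
    for ax, col in zip(axes.ravel(), segment.columns):
        ax.plot(segment[col], linewidth=0.5)
        ax.set_title(col)
    st.pyplot(fig)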

So far, so good; we’ve gone through the main components of our web application. Check app.py for the complete code.

Before deploying the web app online, one can verify the application locally by running the following command in the environment prompt.

streamlit run app.py

Learn more about Streamlit from the outstanding video tutorials by JCharis on YouTube.

9. Heroku deployment

Now that we’ve completed the majority of our tasks, we have an app that runs locally, where only the developer can see or use it.

Here, Heroku comes into play. The Heroku platform executes the user’s applications in virtual containers on a reliable and consistent runtime environment. To deploy the developed app onto Heroku servers, we have to go through three major steps:

  • Creating the files needed by Heroku for the app to run successfully.
  • Making the code accessible to the Heroku platform via GitHub
  • Connecting the code repo with the container

Step 1: Creating the files required by Heroku

So far, our project’s main directory includes:

  • custEnsemblexgb.pkl — serialized version of our final model
  • model.py — custom model definition
  • predict.py — functions required for predicting and scoring
  • helper.py — additional helper functions
  • app.py — contains the script for the streamlit frontend

Heroku expects more detail, such as which Python version to load, which dependencies are needed, and which file to run in the deployed environment. All of the information expected by the platform is provided by the following files, which, like the others, should be present in the main directory:

  • setup.sh — this file contains the commands to create a new ~/.streamlit directory and write the environment initialization parameters:
mkdir -p ~/.streamlit/
echo "\
[server]\n\
port = $PORT\n\
enableCORS = false\n\
headless = true\n\
\n\
" > ~/.streamlit/config.toml
  • Procfile — needed to launch the web app; it runs the setup script and then the streamlit app
web: sh setup.sh && streamlit run app.py
  • runtime.txt — specifies the python version required by the app
python-3.7.10

Note: Check Heroku Python Support before creating the runtime.txt file.

  • requirements.txt — specifies the version requirements for the other dependencies
streamlit==0.80.0
pandas==1.1.0
numpy==1.18.5
scikit-learn==0.22.2.post1
matplotlib==3.3.0
xgboost==1.1.1

Double-check your package versions by verifying the environment’s package list (e.g., with pip freeze).

Step 2: Making the code accessible to Heroku via GitHub

  • Create an empty GitHub repo
  • Initialize the main directory with git init
  • Push the code into the created repo
  • Make sure to add a README.md and a LICENSE file

Note: If you’re new to git, please check out this fantastic blog (already mentioned in pre-requisites)

Step 3: Connecting the repo with the Heroku container

Once the code has been pushed, you are ready to build an app from the Heroku dashboard. For this we have to:

  • Create a new app from the Heroku dashboard. Assign a suitable name, choose a region in the follow-up form, and hit Create app.
  • Once we are on the dashboard of the created app, select the deployment method as GitHub. Then deploy the particular repo branch where the application files are present, in our case it will be master.
  • Wait for Heroku to set up the container and initialize the environment. After completion, it will automatically display a successful build message and the link to the deployed web app.

To learn more about Heroku, refer to the official documentation.

10. Discussions

  • In this case study, we had the opportunity to get our hands on seismic data provided by INGV through Kaggle.
  • We had, basically, two sets of data: train (4431 segments) and test (4520 segments), consisting of data recorded by 10 sensors.
  • First, we drew features such as the minimum, maximum, and other distribution properties of each sensor distribution for each data point.
  • A similar set of features was drawn next; this time there were more features, because instead of drawing them from the whole sensor distribution, we collected them over smaller time windows.
  • Then, three ML algorithms (Elastic Net, Random Forest, XGBoost) were applied to the combination of the two feature sets, and their performance was compared.
  • We have also designed a custom stacked ensemble model and trained it with the obtained features.
  • Finally, we took the serialized version of our model and created a basic Streamlit web interface prior to actually deploying it to the Heroku network.

11. Future Work

  • We should consider richer feature engineering techniques, such as transforming the time domain to the frequency domain, and also consider open-source libraries like tsfresh to draw features that are relevant for time series data.
  • We should also try more complex modelling techniques (like ANNs, DL).
  • We could look for pre-trained networks for feature extraction that are already capable of accomplishing similar tasks on time series data.
  • Add more features to the application and optimize its live performance.
