Sensorial-based ML Applications in Modern Industrial Product-Lines

Shahar Gino
19 min read · Nov 24, 2021


When Machine-Learning (ML) meets modern sensorial-based product-lines

TL;DR

Modern industrial product-lines typically comprise dozens of sensors which continuously sample each of the evolving products along their generation path. From a data-scientist’s point of view, that is a gold-mine: an enormous, ever-growing dataset which holds multiple time-series vectors. In some cases, the data might even be annotated with binning results, quality-test labels, etc.

This article acts as a technical overview of common Machine-Learning (ML) applications in the modern industrial product-lines domain. It doesn’t get too deep into coding details, but rather focuses on architectural aspects and highlights several qualitative discussions.

Agenda

  1. Problem Formulation
  2. Exploratory Data Analysis (EDA)
  3. Features Engineering
  4. Clusters Analysis
  5. Anomaly Detection
  6. Time Forecasting
  7. Quality (QA) Prediction

1. Problem Formulation

Each industrial batch goes through different phases (aka states) along the product-line path. Each phase is monitored at a given sample-rate by multiple sensors. Data from all sensors is typically collected and managed by a designated BI ETL pipeline. BI, or Business Intelligence, refers to a set of processes and technologies used to transform raw data into meaningful, usable information and to obtain actionable insights into business operations and opportunities. BI ETL refers to the processes by which data is extracted from multiple sources, transformed into the proper format or structure for querying, reporting and analysis, and loaded into a data warehouse or another type of centralized data repository.

The BI ETL interacts with the AI (Artificial Intelligence) core, which holds all of the data-science related analysis. The BI provides the input data, and the AI eventually submits back the corresponding machine-learning insights. The communication between the AI and the BI is commonly carried out over SQL.

BI/AI interaction for modern sensorial-based product-lines — illustration

Common basic terminology:

  • Batch: a unique object which evolves along the product-line
  • Sample: a collection of all sensors’ outputs at a specific time. Each batch comprises multiple samples, i.e. multiple sensor readings
  • State: a unique phase along the batches’ life-cycle. Product-lines may maintain multiple states
  • Sensor: a physical device which monitors the batches’ status along the production. Modern product-lines typically maintain multiple sensors, each monitoring a different metric, at a different scale
  • Virtual Sensor: a split of a physical sensor over the different states. For example, if sensor0 samples data over state0 and state1, then the corresponding virtual sensors are sensor0_state0 and sensor0_state1 (see the sketch right after this list)
  • Global Time: a time axis representing the whole life-cycle, per batch
  • State Time: a time axis representing the per-state life-cycle, per batch
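To make the “virtual sensor” term a bit more concrete, here is a minimal pandas sketch. The table and column names (samples_df, batch_id, state, sensor, value) are hypothetical and only illustrate how a physical sensor is split per state:

import pandas as pd

# Hypothetical long-format samples table: one row per (batch, state, sensor) reading
samples_df = pd.DataFrame({
    'batch_id': ['b0', 'b0', 'b0', 'b0'],
    'state':    ['state0', 'state0', 'state1', 'state1'],
    'sensor':   ['sensor0', 'sensor1', 'sensor0', 'sensor1'],
    'value':    [1.2, 3.4, 1.3, 3.1],
})

# Each physical sensor is split per state into a "virtual sensor", e.g. sensor0_state0
samples_df['virtual_sensor'] = samples_df['sensor'] + '_' + samples_df['state']
virtual_df = samples_df.pivot_table(index='batch_id',
                                    columns='virtual_sensor',
                                    values='value')
print(virtual_df.columns.tolist())
# ['sensor0_state0', 'sensor0_state1', 'sensor1_state0', 'sensor1_state1']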

The next sections deep-dive into the AI application side of the above figure.

2. Exploratory Data Analysis (“mini-EDA”)

In data-science, the term EDA typically refers to an initial analysis of engineered features and their correlation with the dependent variable (and with one another). Here, the main goal of the “mini-EDA” is to get a feel for the data and gain some preliminary insights about its statistical spread and about the various variables (dependent and independent). Such insights may assist data-scientists in assessing the feasibility and the potential of an AI approach for the given problem. It shall be noted that the “mini-EDA” is performed on the raw sensorial data, before any features are extracted.

Listed below are several key-observations, followed by a corresponding figure

  • Samples per Batches (top-left)
  • Samples per Sensor (mid-left)
  • Samples per State (bottom-left)
  • Sensor vs. Global Time (top-right)
  • Sensor vs. State Time (mid-right)
  • Quality Assurance (QA) / Lab Results per Batch (bottom-right)
mini-EDA phase — key observations

The mini-EDA provides a quick glimpse into the data spread, and addresses the following questions:

  • How many samples do the batches contain (statistically speaking)?
  • How many samples do the sensors contain (statistically speaking)?
  • How many samples does each state contain?
  • How do the sensors change over time (for some arbitrary batch)?
  • How are the QA / Lab results spread over the various batches?

3. Features Engineering

The sensorial data includes multiple batches, each acting as a multi-dimensional time-series (each single sensor represents a single-dimension time-series). The Features-Engineering phase transforms the problem from the time domain into a features domain, while exploiting the temporal information through statistical metrics. That allows further processing with common AI methods, outside the time-series domain.

Features-Engineering (figure captured from the tsfresh python package page)

The AI tool-box contains several powerful methods which may be applied to the current problem. Those methods may be split into 2 main categories, namely classic methods and neural-network methods. The first category is also known as ‘shallow-learning’ methods, while the latter is also known as ‘deep-learning’ methods. When it comes to Features-Engineering, the classic methods require an explicit implementation, while the neural-network methods implicitly learn/engineer the features, so no explicit implementation is required. This section thereby refers to the classic methods. A very popular python package for features engineering is tsfresh. It automatically calculates a large number of time-series characteristics, the so-called features. Furthermore, the package contains methods to evaluate the explaining power and importance of such characteristics for regression or classification tasks.

The tsfresh package provides 4 levels of features engineering (a minimal extraction sketch follows the list):

  • ComprehensiveFCParameters — extract all features
  • EfficientFCParameters — skip features with high computational costs
  • MinimalFCParameters — only a small subset of features (quick setup)
  • TimeBasedFCParameters — only the features that require DateTimeIndex
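As a minimal sketch of the quickest setup, the snippet below extracts the MinimalFCParameters feature set. It assumes a long-format table, here named sensor_df, with a ‘batch_id’ id column, a ‘time’ sort column and one column per sensor reading; the names are assumptions for illustration:

import pandas as pd
from tsfresh import extract_features
from tsfresh.feature_extraction import MinimalFCParameters

# sensor_df: long-format samples table, one row per sample
# columns: 'batch_id' (series id), 'time' (sort key), plus one column per sensor value
features_df = extract_features(sensor_df,
                               column_id='batch_id',
                               column_sort='time',
                               default_fc_parameters=MinimalFCParameters())
# features_df ends up with one row per batch and one column per engineered feature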

Once features are extracted, it’s interesting to explore their correlation with the dependent variable, here the QA / Lab results. Common correlation metrics for that end are the Pearson and Spearman coefficients, which range between -1 and +1 and evaluate linear relationships and monotonic relationships, respectively.
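A minimal sketch for computing both coefficients per feature, assuming features_df holds the engineered features (N batches x M features) and qa_series holds one QA/Lab result per batch (both names are assumptions):

# features_df: N batches x M engineered features; qa_series: one QA/Lab result per batch
pearson_corr  = features_df.corrwith(qa_series, method='pearson')
spearman_corr = features_df.corrwith(qa_series, method='spearman')

# Rank features by the absolute strength of their (linear) relationship with the target
top_features = pearson_corr.abs().sort_values(ascending=False).head(10)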

The figure below shows the Pearson (top) and Spearman (bottom) coefficient values, per feature (x-axis), for a specific QA/Lab result.

Pearson and Spearman correlation metrics

The Pearson and Spearman coefficients are commonly arranged in a Correlation-Matrix setup and visualized as a heat-map. That allows a deeper analysis of not only the correlation between the features and the dependent variable, but also the correlation of the features with one another.

Correlation Matrix analysis (figure captured from plotCorrelation python package page)

The correlation analysis suggests whether a supervised-learning approach might succeed in predicting the dependent variable based on the available features. A significant correlation (many features with coefficients near ±1) suggests that such an approach could end up working well.

4. Cluster Analysis

At this point, each batch is represented in a high-dimensional features domain which reflects key statistical metrics of the raw sensorial data. The clustering goal is to label each batch into a corresponding bucket, in a way that makes sense both mathematically (e.g. minimizing a given error metric) and intuitively (e.g. each bucket would ideally admit some human interpretation).

Cluster Analysis is a machine-learning branch which resides in the Unsupervised-Learning framework. Unsupervised Learning is a type of machine learning that looks for previously undetected patterns in a data set with no pre-existing labels and with a minimum of human supervision. A common algorithm for Cluster Analysis is the K-Means algorithm.

The K-Means algorithm clusters data by trying to separate samples into n groups of equal variance, minimizing a criterion known as the inertia, or within-cluster sum-of-squares. This algorithm requires the number of clusters to be specified. It scales well to a large number of samples and has been used across a large range of application areas in many different fields. The k-means algorithm divides a set of N samples X into K disjoint clusters C, each described by the mean μj of the samples in the cluster. The means are commonly called the cluster “centroids”.

K-Means Inertia criterion, aka within-cluster sum-of-squares

The average complexity is given by O(k·n·T), where n is the number of samples and T is the number of iterations. In practice, the k-means algorithm is very fast (one of the fastest clustering algorithms available), but it may fall into local minima. That’s why it can be useful to restart it several times. For many practical applications, however, a local minimum is a good-enough solution.

Features Normalization, aka Standardization

Before applying the K-Means clustering algorithm, it’s essential to normalize the features, aka standardization (recall that each feature might live on a totally different scale). Two common normalization implementations are scipy.cluster.vq.whiten and sklearn.preprocessing.StandardScaler (python libraries). At the end of this phase, each batch is represented in a normalized high-dimensional features domain, and is ready for cluster analysis.

A common python implementation for K-Means is sklearn.cluster.KMeans, which follows either Lloyd’s or Elkan’s algorithm.

The overall coding might end up as follows:

import pandas as pd
from sklearn.cluster import KMeans
from scipy.cluster.vq import whiten

# Normalize (whiten) each feature column, so all features share a comparable scale
features_df_whiten = pd.DataFrame(whiten(features_df.values),
                                  index=features_df.index,
                                  columns=features_df.columns)

km = KMeans(n_clusters=clusters_num,
            init='random',
            n_init=10,
            max_iter=300,
            tol=1e-04,
            random_state=0,
            algorithm='lloyd')  # 'auto' in older scikit-learn releases

y_km = km.fit_predict(features_df_whiten)                 # per-batch cluster index (0..C-1)
cluster_centers_df = pd.DataFrame(km.cluster_centers_)    # C x M centroids table

where features_df holds the features data-frame (N batches x M features) and clusters_num is a parameter which defines the number of clusters for K-Means. The resulting data-frame, termed cluster_centers_df, has a size of C x M, where C equals clusters_num. In other words, the above code snippet ends up with C clusters, each represented by a vector of M features. An additional result is the y_km vector, which maps each of the given batches onto its corresponding cluster index, from 0 to C-1.

For visualization purposes, it’s essential to reduce the high-dimensional (M) features domain into fewer coordinates (2 or 3). A common strategy for achieving that goal is the Principal Component Analysis (PCA) algorithm.

PCA is an unsupervised learning technique that offers a number of benefits. For example, by reducing the dimensionality of the data, PCA enables better-generalized machine-learning models, and thereby helps deal with the “curse of dimensionality”. In addition, PCA can help us improve performance at a very low cost in model accuracy, since most algorithms’ performance depends on the dimension of the data. Other benefits of PCA include reduction of noise in the data, feature selection (to a certain extent), and the ability to produce independent, uncorrelated features of the data.

Here, PCA is applied for visualization purposes and allows an inspection of the K-Means clustering result. A common python implementation for PCA is sklearn.decomposition.PCA. It actually provides an additional important result, besides visualization, which is the clusters’ contribution. The following figure, for example, implies that 14 clusters might be a proper choice for the K-Means clusters_num parameter, since 14 clusters appear to cover about 90% of the explained variance.

Clusters Contribution, by PCA
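A minimal sketch for producing this kind of explained-variance view, assuming features_df_whiten from the earlier snippet:

import numpy as np
from sklearn.decomposition import PCA

# Fit PCA on the normalized features (N batches x M features)
pca = PCA().fit(features_df_whiten)

# Cumulative explained variance as a function of the number of retained components
cum_var = np.cumsum(pca.explained_variance_ratio_)
n_components_90 = int(np.argmax(cum_var >= 0.9)) + 1
print(f'{n_components_90} components cover ~90% of the explained variance')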

Finally, it’s possible to visualize the K-Means clustering results in the PCA domain, by selecting the 2 (for a 2D plot) or 3 (for a 3D plot) most significant coordinates.

K-Means Cluster-Analysis result in a 2D PCA domain

Here, each dot represents a single batch and each star represents a single cluster centroid.
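A minimal plotting sketch for this view, reusing features_df_whiten, km and y_km from the K-Means snippet above (marker sizes and colors are illustrative):

import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

# Project the batches and the K-Means centroids onto the 2 leading principal components
pca2 = PCA(n_components=2).fit(features_df_whiten)
batches_2d = pca2.transform(features_df_whiten)
centers_2d = pca2.transform(km.cluster_centers_)

plt.scatter(batches_2d[:, 0], batches_2d[:, 1], c=y_km, s=20)                # dots: batches
plt.scatter(centers_2d[:, 0], centers_2d[:, 1], marker='*', c='red', s=200)  # stars: centroids
plt.xlabel('PC1')
plt.ylabel('PC2')
plt.show()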

To sum up, we started with Features-Engineering, which transformed the problem from the raw sensorial domain (time-series) into a high-dimensional features domain. The features then went through a standardization phase and were passed to the K-Means algorithm for clustering. Finally, the resulting clusters were visualized in a 2D PCA domain.

As a final note, another common strategy in time-series analysis is termed Dynamic Time Warping (DTW), which can be applied within a Hierarchical-Clustering Analysis (HCA). A common python implementation for DTW/HCA is dtaidistance.clustering. This strategy will not be further discussed here in detail, and the curious reader may find a good starting point at this article.

Hierarchical-Clustering Analysis (HCA) using Dynamic Time Warping (DTW)

5. Anomaly Detection

Isolation Forest

“Isolation forest is an unsupervised learning algorithm for anomaly detection that works on the principle of isolating anomalies” (wikipedia). The core of the algorithm is to “isolate” anomalies by creating decision trees over random attributes. The random partitioning produces noticeably shorter paths for anomalies, since fewer instances (of anomalies) result in smaller partitions, and since distinguishable attribute values are more likely to be separated in early partitioning. Hence, when a forest of random trees collectively produces shorter path lengths for some particular points, those points are highly likely to be anomalies. The figure below demonstrates (in a 2D domain) that isolating anomalies typically requires fewer splits.

Isolation Forest concept — anomalies require fewer splits

A common python implementation for Isolation-Forest is sklearn.ensemble.IsolationForest, and it requires only a few lines of code:

import numpy as np
from sklearn.ensemble import IsolationForest

X = np.array(features_pca_df)

# note: the legacy behaviour='new' flag is no longer needed in recent scikit-learn
model = IsolationForest(contamination=0.01,   # expected outliers fraction
                        random_state=42)      # seed
model.fit(X)

y_pred = model.predict(X)   # +1 for inliers, -1 for outliers

The input to the above code-snippet is termed X, and holds the features in the PCA domain. In that sense, it’s a direct continuation of the previous Clustering section. That means the raw sensorial data (time-series) passes through the features-engineering, standardization and PCA stages before entering the Isolation-Forest stage. The resulting output is a vector, termed y_pred, which predicts whether each example should be considered an inlier (+1) or an outlier (-1). Once again, a PCA approach may be applied for a 2D (or 3D) visualization.

Anomaly-Detection with Isolation-Forest, visualized in a 2D PCA domain

Each dot in the above figure represents a different batch, and the color indicates the anomaly decision. It may be observed that there is a single outlier batch, which is ‘isolated’ at the bottom-right area. Next, it’s typically interesting to explore the anomaly reasoning, and a common technique is comparing the outlier batch with the mean values, per feature.

Anomaly-Detection with Isolation-Forest — Exploring anomalies reasoning

The left figure applies an RMSE metric, while the right figure applies a MAPE metric for the comparison. The red horizontal line (right figure) represents a parametric threshold. It can be observed that 5 features were responsible for the anomaly decision.
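A minimal sketch of such a MAPE-style comparison, assuming features_df (the engineered features per batch) is row-aligned with the y_pred vector from the Isolation-Forest snippet above; the 50% threshold is an arbitrary illustration of the parametric threshold mentioned here:

import numpy as np

# Split batches by the Isolation-Forest decision
inliers_df  = features_df[y_pred == 1]
outliers_df = features_df[y_pred == -1]

mean_vec = inliers_df.mean()
for batch_id, outlier_row in outliers_df.iterrows():
    # Per-feature deviation of the outlier batch from the inlier mean (MAPE-style, in %)
    mape = (outlier_row - mean_vec).abs() / mean_vec.abs().replace(0, np.nan) * 100
    suspects = mape[mape > 50].sort_values(ascending=False)   # 50% is a parametric threshold
    print(batch_id, suspects.head())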

LSTM AutoEncoder

This is the first encounter with deep-learning in this article, and also the first usage of the “virtual sensors” terminology that was introduced in the Problem Formulation section.

AutoEncoder Neural Networks try to learn a representation of their input, usually by learning an efficient encoding that uses fewer parameters / less memory. In a sense, the model learns the most important features of the data using as few parameters as possible.

The rationale for using this architecture for anomaly detection is that the model trains on “normal” data and establishes the resulting reconstruction error. Then, when the model encounters data that is outside the norm and attempts to reconstruct it, the reconstruction error increases, as the model was never trained to accurately recreate items from outside the norm.

One of the advantages of using LSTM cells is the ability to include multivariate features in the analysis. Here, these are the multiple sensor readings per time step; in other online anomaly-detection analyses, these could be any other per-time-step features.

The AutoEncoder neural network model is commonly created using Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN) cells, which are supported within the Keras/TensorFlow framework. This topic is well covered in Brent Larzalere's article.

The following steps are applied for each State:

  1. Concatenate all batches into a single timeline (“virtual sensors”)
  2. Cross-Validation — Train/Test dataset split, e.g. 80:20
  3. Normalization, e.g. min/max
  4. Reshape the normalized data into a format suitable for input into an LSTM network. LSTM cells expect a 3-dimensional tensor of the form [data samples, time steps, features]. Here, each sample input into the LSTM network represents one step in time and contains one feature per virtual sensor of the state
  5. Create an AutoEncoder neural network model, as a Python function using the Keras library (a minimal sketch is given right after this list). Compile it using Adam as the neural network optimizer and mean absolute error for the loss function
  6. Fit the model to the training data, e.g. for 100 epochs. It’s typically interesting to view the training losses for evaluating the model’s performance
  7. Explore the loss distribution, and set an anomaly threshold accordingly
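Step 5 might look roughly as follows. This is a minimal sketch, assuming X_train has already been reshaped to [samples, time steps, features] as in step 4; the layer sizes are illustrative and not the exact architecture used here:

from tensorflow import keras
from tensorflow.keras import layers

def build_lstm_autoencoder(time_steps, n_features):
    inputs = keras.Input(shape=(time_steps, n_features))
    # Encoder: compress the input sequence into a small latent vector
    encoded = layers.LSTM(16, activation='relu')(inputs)
    # Repeat the latent vector so the decoder can rebuild the full sequence
    repeated = layers.RepeatVector(time_steps)(encoded)
    # Decoder: reconstruct the original sequence from the latent representation
    decoded = layers.LSTM(16, activation='relu', return_sequences=True)(repeated)
    outputs = layers.TimeDistributed(layers.Dense(n_features))(decoded)
    model = keras.Model(inputs, outputs)
    model.compile(optimizer='adam', loss='mae')   # Adam + mean-absolute-error, as in step 5
    return model

# One time step per sample and one feature per virtual sensor of the state
model = build_lstm_autoencoder(time_steps=X_train.shape[1], n_features=X_train.shape[2])
history = model.fit(X_train, X_train, epochs=100, batch_size=32,
                    validation_split=0.05, verbose=0)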
Anomaly-Detection with LSTM AutoEncoder — Learning phase & Anomaly prediction

The above figure refers to some arbitrary state, termed ‘0108’. The learning curve plot (left) implies that the learning went well in terms of bias and variance, as both the train and validation curves ended up low enough and relatively tight. It does seem a good idea, though, to apply a lower epoch count (~80) for ending the learning earlier and avoiding overfitting. The loss distribution plot (middle) implies that a decent anomaly threshold may be set at ~0.3, since higher losses are relatively rare. Finally, the Loss-vs-Time plot (right) provides an interesting view of the loss of the concatenated batches, split into Train and Validation sets. The green horizontal line marks the anomaly threshold, and it may be observed that an anomaly is detected wherever the loss exceeds it.

The exact anomaly root-cause may be obtained by exploring the Sensor-vs-Time plots for the relevant state.

Anomaly-Detection with LSTM AutoEncoder — Exploring anomalies reasoning

The above figure refers to the example state (‘0108’) and comprises multiple per-sensor subplots. Each subplot compares the sampled values with the predicted values, and additionally highlights the anomaly timestamp with a red mark. This view thereby points to the specific anomaly cause.

6. Time Forecasting

A key observation, in the context of this topic, is that most time-series data can be described by three components:

  • Trend → a general systematic linear or (most often) nonlinear component that changes over time and does not repeat
  • Seasonality → a general systematic linear or (most often) nonlinear component that changes over time and does repeat
  • Noise → a non-systematic component within the data that is neither Trend nor Seasonality

The procedure of splitting a given time-series into its above components is known as Seasonal Decomposition, and is typically applied in an additive fashion.
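A minimal additive-decomposition sketch using statsmodels; sensor_series is assumed to be a single (virtual) sensor time-series, and the seasonal period (24) is an illustrative assumption:

from statsmodels.tsa.seasonal import seasonal_decompose

# Additive decomposition: series = trend + seasonal + residual
decomposition = seasonal_decompose(sensor_series, model='additive', period=24)

trend    = decomposition.trend      # slow, non-repeating component
seasonal = decomposition.seasonal   # repeating component
residual = decomposition.resid      # noise (what's left)
decomposition.plot()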

Seasonal Decomposition example

Time forecasting is a hot topic with many possible applications, such as stock-price forecasting, weather forecasting, business planning, resources allocation and many others. There are many types of time-forecasting techniques, as well described in Davide Burba's article, and hereby the focus is put on the ARIMA framework.

ARIMA, short for ‘Auto Regressive Integrated Moving Average’, is actually a class of models that ‘explains’ a given time series based on its own past values, that is, its own lags and the lagged forecast errors, so that the equation can be used to forecast future values.

An ARIMA model is characterized by 3 terms: d, p, q

  • d is the order of differencing, i.e. the number of differencing required to make the time series stationary. The ‘d’ parameter may be extracted from the ACF plot, as the minimal differencing order for entering the significance cone
  • p is the order of the AR term, i.e. the number of lags of Y to be used as predictors. The ‘p’ parameter may be extracted from the PACF plot, as the number of initial spikes, before entering the significance cone.
  • q is the order of the MA term, i.e. the number of lagged forecast errors that should go into the Model. The ‘q’ parameter may be extracted from the ACF plot, as the number of initial spikes, before entering the significance cone

An ARIMA model is one where the time series gets differenced d times to make it stationary, and the AR and MA terms are then combined.
So the equation becomes:

𝑌𝑡=𝛼+𝛽1𝑌𝑡−1+𝛽2𝑌𝑡−2+…+𝛽𝑝𝑌𝑡−𝑝+𝜖𝑡+𝜙1𝜖𝑡−1+𝜙2𝜖𝑡−2+…+𝜙𝑞𝜖𝑡−𝑞

SARIMAX is an extension of ARIMA that explicitly models the seasonal element in univariate data. There are four seasonal elements, on top of the ARIMA d, p, q parameters:

  • P: Seasonal autoregressive order.
  • D: Seasonal difference order.
  • Q: Seasonal moving average order.
  • m: The number of time steps for a single seasonal period.

Linear-regression models work best when the predictors are not correlated and are independent of each other. Hence, the first step in building an ARIMA model is to make the time series stationary. A stationary process has the property that the mean, variance and autocorrelation structure do not change over time. One of the most popular methods for checking stationarity is the Augmented Dickey-Fuller (ADF) test. The ADF test takes an input time-series and returns a score termed p-value. If the p-value is greater than the 0.05 significance level, then the input time-series is considered NOT stationary.
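A minimal ADF sketch with statsmodels, assuming sensor_series holds a single (virtual) sensor time-series:

from statsmodels.tsa.stattools import adfuller

# ADF test on a single (virtual) sensor time-series
adf_stat, p_value, used_lags, n_obs, critical_values, icbest = adfuller(sensor_series)

if p_value > 0.05:
    print('Series is likely NOT stationary -> differencing (d > 0) is required')
else:
    print('Series looks stationary')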

A common python implementation for ARIMA (and SARIMAX) is pmdarima.arima.auto_arima, which automatically discovers the optimal order for an ARIMA model. The procedure is repeated per sensor and per state. All batches are concatenated, normalized and fed into the auto_arima pipeline. The result is a time-forecasting model per sensor and per state, which may be applied to forecast future sensor/state behavior. That model can also be treated as an additional anomaly-detection engine, e.g. flagging an ‘Anomaly’ if actual samples significantly differ from the respective forecast.
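A minimal auto_arima sketch; series is assumed to be the concatenated, normalized virtual-sensor data for one (sensor, state) pair, and the seasonal period m and forecast horizon are illustrative assumptions:

import pmdarima as pm

# One model per (sensor, state) pair
model = pm.auto_arima(series,
                      seasonal=True, m=24,     # assumed seasonal period
                      stepwise=True,
                      suppress_warnings=True,
                      error_action='ignore')

# Forecast the next 100 samples, with a confidence interval (the "cone" in the figure below)
forecast, conf_int = model.predict(n_periods=100, return_conf_int=True)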

The figure below refers to a specific sensor, termed ‘s22’, where each column corresponds to a different state. The top and bottom rows provide zoom-out and zoom-in views, respectively, over the time axis. The grey cone represents the forecasting certainty, where a narrow cone means high certainty.

Time Forecasting — Example of a specific sensor, per each state

As a final note, another common tool for time-series forecasting is Facebook Prophet. It is an additive regression model with four main components:

  • A piecewise linear or logistic growth curve trend. Prophet automatically detects changes in trends by selecting change-points from the data.
  • A yearly seasonal component modelled using Fourier series.
  • A weekly seasonal component using dummy variables.
  • A user-provided list of important holidays.

However, this strategy will not be further discussed here in detail, and the curious reader may find a good starting point at this link.

7. Quality (QA) Prediction

XGBoost

XGBoost stands for “Extreme Gradient Boosting”, where the term “Gradient Boosting” originates from the paper Greedy Function Approximation: A Gradient Boosting Machine, by Friedman. XGBoost is used for supervised learning problems, where training data (with multiple features) is used for predicting a target variable.

XGBoost is actually a decision-tree ensemble model. The tree ensemble model consists of a set of Classification And Regression Trees (CART). The following figure acts as a simple example of a CART that classifies whether someone will like a hypothetical computer game X.

XGBoost CART example for whether someone will like a hypothetical computer game X

The family members are classified into different leaves, and assigned a score on the corresponding leaf. A CART is a bit different from a decision tree, whose leaves only contain decision values. In a CART, a real score is associated with each of the leaves, which gives us richer interpretations that go beyond classification. Usually, a single tree is not strong enough to be used in practice. What is actually used is the ensemble model, which sums the predictions of multiple trees together (see the right subplot in the above figure).

Back to the sensorial-based problem we’re focusing on, a real decision tree from the ensemble forest might end up as follows:

XGBoost — a realistic example for a sensorial decision tree

A common python implementation of XGBoost is the XGBoost Python Package, an open-source library that provides a high-performance implementation of gradient-boosted decision trees. An underlying C++ codebase combined with a Python interface sitting on top makes for an extremely powerful yet easy-to-use package. XGBoost comes with multiple hyper-parameters, and it’s usually a good practice to apply GridSearch for finding the optimal setup.

import xgboost as xgb
from sklearn.model_selection import GridSearchCV

model = xgb.XGBRegressor()

# note: 'reg:linear' and 'silent' are deprecated in recent xgboost releases
parameters = {'nthread': [4],
              'objective': ['reg:squarederror'],
              'learning_rate': [0.03, 0.05, 0.07],
              'max_depth': [5, 6, 7],
              'min_child_weight': [4],
              'subsample': [0.7],
              'colsample_bytree': [0.7],
              'n_estimators': [500]}

xgb_grid = GridSearchCV(model,
                        parameters,
                        cv=3,
                        n_jobs=4,
                        verbose=True)
xgb_grid.fit(X, y)

model_score = xgb_grid.best_score_
model_params = xgb_grid.best_params_

Each row in the X and y tables corresponds to a different batch (the index holds the batch ids). Each column in X corresponds to a different feature, and y has a single column which holds the QA/Lab results. Since there are typically multiple QA/Lab types, this process is repeated per type.

In order to build more robust models, it is common to apply a K-Fold Cross Validation, where all the entries in the original training dataset are used for both training and validation (each entry is used for validation just once). An additional important result is the optimal number of boosting rounds for the XGBoost training.
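A minimal sketch of such a cross-validated search for the boosting-round count, using xgboost's built-in xgb.cv; the booster parameters below merely mirror the grid above and are illustrative:

import xgboost as xgb

# Booster-level parameters, mirroring the grid-search winners above (values are illustrative)
params = {'objective': 'reg:squarederror',
          'learning_rate': 0.05,
          'max_depth': 6,
          'min_child_weight': 4,
          'subsample': 0.7,
          'colsample_bytree': 0.7}

dtrain = xgb.DMatrix(X, label=y)

cv_results = xgb.cv(params,
                    dtrain,
                    num_boost_round=500,
                    nfold=5,                    # K-Fold cross validation
                    metrics='mae',
                    early_stopping_rounds=20,
                    seed=42)

best_rounds = len(cv_results)   # optimal number of boosting rounds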

XGBoost applies K-Fold Cross Validation method

Next, the best model is trained over the training set. The trained model is then applied for predicting the test-set QA/Lab results. Finally, the model predictions are compared against the real, known results, for assessing the model performance. In that context, it is usually interesting to compare the model performance against the intrinsic variance of each QA/Lab result, in order to assess the error scale.

XGBoost result example — Actual vs. Predicted, for a specific QA/Lab type

An additional important output of the XGBoost model is the Features Importance. It enables examination of the importance of each feature column in the original dataset within the model. This result is retrieved by counting the number of times each feature is split on across all boosting rounds (trees) in the model.
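A minimal sketch for retrieving this split-count ('weight') importance from the fitted grid-search estimator above:

import xgboost as xgb
import matplotlib.pyplot as plt

best_model = xgb_grid.best_estimator_

# 'weight' counts how many times each feature is used to split, across all boosting rounds
xgb.plot_importance(best_model, importance_type='weight', max_num_features=20)
plt.show()

# The same information as a plain dict: {feature_name: split count}
scores = best_model.get_booster().get_score(importance_type='weight')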

XGBoost result example — Features Importance, for a specific QA/Lab type

LSTM-RNN (many-to-many)

Recurrent Neural Networks (RNN) have been proven to efficiently solve sequence problems. Particularly, Long Short Term Memory Network (LSTM), which is a variation of RNN, is currently being used in a variety of domains to solve sequence problems.

Many-to-One sequence problems get a sequence of data as input and have to predict a single output. Many-to-Many sequence problems get a sequence of data as input and have to predict a multi-dimensional output.

Many-to-One LSTM networks are commonly applied for sensor-based Activity Recognition (e.g. Chung, Seungeun, et al., 2019), for Sentiment Analysis (e.g. Wen, Shiping, et al., 2019), and for other tasks. Here, we would like to train an LSTM network to predict the QA/Lab Results scoring (output), based on multi-variate Sensors data (input).

Common sequence problem types

As part of the Virtual Sensors formulation, each sensor (Sx) is split per state (Sx_State), and padded with zeros according to the longest series. An additional table holds all QA/Lab results, per batch. Next, the X and y arrays (data and labels) are extracted out of the intersection between these 2 tables.

Virtual Sensors formulation

A cross-validation split is performed, e.g. 80/20, for splitting the data and labels into Train and Test sets. Each set then goes through normalization (aka standardization). Finally, each set gets reshaped to match the LSTM format: the data sets are reshaped into (Samples, TimeSteps, Features) dimensions, while the label sets are reshaped into (Samples, Features) dimensions.

Virtual Sensors before (left) and after (right) normalization

At this stage, the data and the corresponding labels are ready to be injected into the LSTM, so it’s time to compile the neural network. The exact architecture requires some exploration, although in practice even a relatively simple network, as follows, might yield decent results. Here, the longest series has 17626 samples, and there are 23 types of QA/Lab results.

RNN LSTM architecture example for QA/Lab results prediction
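A minimal Keras sketch of such a simple network; the layer sizes are illustrative, and the shapes are taken from the reshaped X and y arrays described above:

from tensorflow import keras
from tensorflow.keras import layers

n_timesteps = X.shape[1]   # longest zero-padded series, e.g. 17626 in the example above
n_features  = X.shape[2]   # number of virtual sensors
n_outputs   = y.shape[1]   # number of QA/Lab result types, e.g. 23 in the example above

model = keras.Sequential([
    keras.Input(shape=(n_timesteps, n_features)),
    layers.LSTM(64),                       # sequence encoder (size is illustrative)
    layers.Dense(32, activation='relu'),
    layers.Dense(n_outputs)                # one predicted score per QA/Lab type
])
model.compile(optimizer='adam', loss='mae')
model.summary()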

The LSTM model is trained for 1000 epochs in a mini-batch fashion, with a validation split of 5%:

history = model.fit(X,
                    y,
                    epochs=1000,
                    batch_size=512,
                    validation_split=0.05,
                    verbose=2).history

Finally, the model performance may be evaluated over the test-set, by comparing the actual labels with the predicted ones.
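A minimal evaluation sketch, assuming X_test and y_test are the held-out set after the same normalization and reshape as the training data:

import numpy as np

# Hold-out evaluation: compare predicted vs. actual QA/Lab scores on the test set
test_loss = model.evaluate(X_test, y_test, verbose=0)
y_pred = model.predict(X_test)

# Per QA/Lab-type mean-absolute-error, actual vs. predicted
mae_per_type = np.abs(y_pred - y_test).mean(axis=0)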

