Big Data Regression — Complete Pipeline — Part 1/2

Shahar Gino
11 min read · Dec 12, 2021


TL;DR

This two-part article provides a comprehensive view of a real big-data regression problem, including both code and detailed analysis. Part 1 (this article) mainly focuses on EDA and on Shallow-Learning modeling, while Part 2 mainly focuses on Deep-Learning modeling.

Agenda

  1. Problem and Dataset description
  2. Exploratory Data Analysis (EDA)
  3. Modeling with Shallow Techniques

1. Problem and Dataset description

The original problem was generalized in order to respect the client’s privacy. 500 different sensors sample a given scene at a 1Hz rate (one sample per second), and each row is associated with 2 numeric labels (think of them as yield, quality, etc.). The goal is to predict these 2 labels based on the sensor data.

Dataset description:

  • var0 ➔ TimeStamp
  • var1 … var501 ➔ Sensor samples (contain missing values)
  • result 502, result 503 ➔ Labels (to be predicted)
  • ~3 million rows, from 2021–09–01 to 2021–10–04 (~1 month)

Data is stored in a huge (~7GB) CSV file, and Dask is applied for loading it up in an efficient manner:

import dask.dataframe as dd

data_df = dd.read_csv(path.join(base_path, 'raw_data.csv'))
data_df = data_df.drop('Unnamed: 0', axis=1)
display(data_df.head())
Dataset overview — ~3 million rows, each comprising 1 timestamp, 500 samples and 2 labels

Note that reading and manipulating the data with Pandas is not possible on most machines, as the dataset won’t fit in memory.
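For reference, when Dask is not an option, a chunked Pandas read is a possible fallback. Below is a minimal sketch (the chunk size and the per-chunk aggregation are illustrative assumptions, not part of the original pipeline):

import pandas as pd
from os import path

# Hedged fallback sketch: stream the CSV in chunks so each piece fits in memory,
# and keep only a lightweight per-chunk aggregate (here, column means).
chunk_means = []
for chunk in pd.read_csv(path.join(base_path, 'raw_data.csv'), chunksize=100_000):
    chunk = chunk.drop(columns=['Unnamed: 0'], errors='ignore')
    chunk_means.append(chunk.mean(numeric_only=True))

# Rough overall per-column mean (unweighted across chunks)
overall_mean = pd.concat(chunk_means, axis=1).mean(axis=1)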

Following are all of the preliminary imports required for the code provided in this part:

!pip install sweetviz dask[dataframe] dask-ml dask_xgboost lazypredict==0.2.9

import pickle
import numpy as np
import pandas as pd
from os import path
import seaborn as sns
import sweetviz as sv
from glob import glob
import dask.dataframe as dd
import matplotlib.pyplot as plt
from sklearn.utils import shuffle
from dask.distributed import Client
from sklearn.metrics import r2_score
from sklearn.pipeline import Pipeline
from dask.diagnostics import ProgressBar
from dask_ml.xgboost import XGBRegressor
from dask_ml.decomposition import IncrementalPCA
from dask_ml.preprocessing import StandardScaler
from lazypredict.Supervised import LazyRegressor
from sklearn.model_selection import train_test_split
from dask_ml.model_selection import train_test_split as dd_train_test_split

2. Exploratory Data Analysis (EDA)

2.1. Data type

Explore each column’s data type by applying Dask’s describe method:

display(data_df.describe())
EDA — Data-type check

It may be observed that all columns are of type ‘float64’.
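Since ‘var0’ is a timestamp yet is stored as a float, it may be convenient to cast it to a proper datetime before indexing and plotting. A minimal sketch, assuming the raw values are Unix epoch seconds (the unit is an assumption; this step is optional and not applied in the rest of this article):

# Assumption: 'var0' holds Unix epoch seconds; adjust `unit` if the raw format differs
data_df['var0'] = dd.to_datetime(data_df['var0'], unit='s')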

2.2. Samples number

Explore the number of samples by computing the Dask dataframe’s shape:

with ProgressBar():
    print("Data contains %d samples" % data_df.shape[0].compute())

Running the above code reported 2,923,189 samples, which indeed matches roughly one month of sampling at a 1Hz rate (~34 days × 24 × 60 × 60 ≈ 2.94M).

2.3. Labels Inspection

Plot the 2 labels, hereby termed ‘result 502’ and ‘result 503’:

results_list = ['var0', "result 502", "result 503"]

with ProgressBar():
    results_df = data_df[results_list].compute().set_index('var0')

fig, axes = plt.subplots(2, 1, figsize=(20,12))
results2_df = results_df.iloc[:int(1e5),:]
results_df.plot(title='Labels (raw) - full', grid=True, ax=axes[0])
results2_df.plot(title='Labels (raw) - zoom', grid=True, ax=axes[1])
plt.show()

Running the above code over the data ends up with the following figures:

EDA — Labels inspection, zoom-out (top) and zoom-in (bottom, first 100K samples)

It may be observed that the 2 labels correlate, while ‘result 503’ appears to behave in a discontinuous manner, i.e. with periodic drops to zero. That observation implies that it might be worth checking whether the data (samples) can be divided into 2 groups, depending on the ‘result 503’ status ➔ see section 2.6.
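For reference, such a split can be expressed directly with Dask boolean masking; a minimal sketch (the zero threshold for “Inactive” is an assumption, and the actual group comparison is done with SweetViz in section 2.6):

# Hypothetical split by 'result 503' status (zero threshold is an assumption)
active_df = data_df[data_df['result 503'] > 0]
inactive_df = data_df[data_df['result 503'] == 0]

with ProgressBar():
    print('Active rows:   %d' % active_df.shape[0].compute())
    print('Inactive rows: %d' % inactive_df.shape[0].compute())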

2.4. Samples Inspection

Check the amount of missing values (None, NaN, NaT):

missing_values = data_df.isnull().sum()
percent_missing = ((missing_values / data_df.index.size) * 100)
with ProgressBar():
    percent_missing_s = percent_missing.compute()

print('Missing Values mean: %.2f%%' % percent_missing_s.mean())
title_str = 'Missing Values [%]'
percent_missing_s.plot(title=title_str, grid=True, figsize=(20,5))

Running the above code over the data ended up with the following figure:

Missing Values mean: 15.71%

EDA — Missing values among the samples, in percentage [%]

Following is a “zoom-in” for a few representative samples (code + figures), in order to better understand the missing-values behavior:

with ProgressBar():
    few_features_list = ['var100', 'var200', 'var300', 'var400', 'var500']
    few_features_df = data_df[['var0'] + few_features_list]
    few_features_df = few_features_df.compute().set_index('var0')

fig, axes = plt.subplots(5, 1, figsize=(20,25))
for k, features in enumerate(few_features_df.iteritems()):
    title_str = '%s (raw)' % features[0]
    features[1].plot(title=title_str, grid=True, ax=axes[k])
plt.show()
EDA — Few representative samples, for exploring the missing-values behavior

It may be observed that the missing values form large gap segments in the corresponding signals, which suggests that a fill-with-zero strategy should be considered. The rationale is that the missing data can be treated as ‘off time’, and hence a zero fill appears to be an appropriate choice.

Note that sometimes it makes sense to adopt a different strategy, such as filling with the previous value, the mean, the median, etc. On some occasions, it might make sense to drop features with missing values altogether. The following code may be useful when applying such a strategy (dropping columns with more than 50% missing values); a sketch of a few alternative fill strategies follows right after it.

columns_to_drop = percent_missing_s[percent_missing_s >= 50].index
data_cleaned_df = data_df.drop(list(columns_to_drop), axis=1)
print('Columns number reduced from %d to %d after clean-up' % \
      (len(data_df.columns), len(data_cleaned_df.columns)))
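For reference, here is a minimal sketch of the alternative imputation strategies mentioned above, expressed with Dask (which strategy fits best is data-dependent, and in practice it should be applied to the sensor columns only):

filled_zero_df = data_df.fillna(0)                                    # treat missing data as 'off time'
filled_ffill_df = data_df.ffill()                                     # carry the previous sample forward
filled_mean_df = data_df.fillna(data_df.mean().compute().to_dict())   # per-column mean imputation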

2.5. Correlation Analysis

Pearson’s correlation (“r”) coefficient measures the statistical relationship (aka association) between two continuous variables, based on covariance. It ranges between -1 and +1 and quantifies the direction and strength of the linear association between the two variables.
Evans (1996) suggested the following interpretation for the absolute value (a small NumPy sketch of the definition follows the scale):

  • 0.0 <= r < 0.2 ➔ “very weak”
  • 0.2 <= r < 0.4 ➔ “weak”
  • 0.4 <= r < 0.6 ➔ “moderate”
  • 0.6 <= r < 0.8 ➔ “strong”
  • 0.8 <= r < 1.0 ➔ “very strong”
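To make the definition above concrete, here is a minimal NumPy sketch of Pearson’s r for two vectors (equivalent to what Dask’s corr computes for every pair of columns):

import numpy as np

def pearson_r(x, y):
    # Pearson's r: covariance of x and y, normalized by their standard deviations
    x_c = x - x.mean()
    y_c = y - y.mean()
    return float((x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum()))

# Quick sanity check against NumPy's built-in implementation
rng = np.random.default_rng(0)
a, b = rng.normal(size=1000), rng.normal(size=1000)
assert np.isclose(pearson_r(a, b), np.corrcoef(a, b)[0, 1])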

Following is the Correlation-Matrix (code + figure), generated with Dask’s corr method and plotted with matplotlib’s matshow method:

with ProgressBar():
    corr_pearson_df = data_df.corr(method='pearson').compute()

matfig = plt.figure(figsize=(10,10))
plt.matshow(corr_pearson_df, fignum=matfig.number)
plt.show()
EDA — Correlation Matrix, a yellow shade indicates a relatively high correlation

It may be observed that there are a few yellow blobs in the correlation matrix. That implies an inter-correlation between the samples, and a potential for combining them into joint features and/or applying dimensionality-reduction techniques (e.g. PCA, t-SNE, etc.) to exploit the mutual information among the samples ➔ see section 3.1.

Next, it is interesting to focus on the last 2 rows/columns, which represent the correlation between the samples and the labels:

CORRELATION_THR = 0.4
correlation_dict = {}
fig, axes = plt.subplots(2, 1, figsize=(20,10))
for k, label_type in enumerate(('502', '503')):
    corr_s = corr_pearson_df.loc['result %s' % label_type]
    corr_s = corr_s.drop(['result 502', 'result 503'])
    print('\nSignificant Features for result %s (>%.1f):' % \
          (label_type, CORRELATION_THR))
    print(corr_s[corr_s > CORRELATION_THR].index.values)
    title_str = 'Result %s - Pearson Correlation' % label_type
    corr_s.plot(ax=axes[k], grid=True, title=title_str)
    correlation_dict[label_type] = corr_s
plt.show()
EDA — Correlation between the samples and the 2 labels

It may be observed that both labels have a “moderate” correlation and beyond (0.4+) with several samples, which is encouraging.

The following code calculates and stores the high-correlated features:

fig, axes = plt.subplots(2, 1, figsize=(20,15))
high_correlated_dict = {}
for k, label_type in enumerate(('502', '503')):
    corr_s = correlation_dict[label_type]
    columns_to_keep = ['var0', 'result 502', 'result 503'] + \
                      corr_s[corr_s > CORRELATION_THR].index.tolist()
    with ProgressBar():
        df = data_df[columns_to_keep].compute().set_index('var0')
    print('\nHigh Correlated features %s:' % str(df.shape))
    display(df.head())
    title_str = 'High Correlated Features (result %s)' % label_type
    plot_df = corr_pearson_df.loc[df.columns]['result %s' % label_type]
    plot_df = plot_df.drop(['result 502', 'result 503'])
    plot_df.plot(kind='bar', title=title_str, grid=True, ax=axes[k])
    high_correlated_dict[label_type] = df
plt.show()
EDA — High Correlated Features

The 1st label (‘result 502’) ended up with 8 high-correlated samples, while the 2nd label (‘result 503’) ended up with 34 high-correlated samples.

That is an important take-away, as it might be beneficial to stick with these high-correlated samples for the modeling stage, mainly for the shallow-learning phase, where feature extraction is typically done in a linear fashion, without inherent feature combining.

2.6. Active vs. Inactive

This section builds on the previous sections. Inspecting the 2 labels in section 2.3 suggests that the data (samples) can perhaps be divided into 2 groups, depending on the ‘result 503’ status, i.e. “Active” or “Inactive”. In addition, section 2.5 implies that only a small portion of the samples correlate with the labels, so it might be beneficial to focus on these features (computation-wise and memory-wise, mainly for big-data analysis).

The following code applies SweetViz’s compare_intra function to achieve that analysis. This function takes a boolean series as one of its arguments, as well as an explicit “name” tuple for naming the (true, false) resulting datasets. Note that internally, this creates 2 separate dataframes to represent each resulting group, hereby “Active” and “Inactive”.

for k, label_type in enumerate(('502', '503')):

    high_correlated_df = high_correlated_dict[label_type]
    html_log = 'report_%s.html' % label_type
    feature_config = sv.FeatureConfig(
        force_num=high_correlated_df.columns.tolist()
    )
    report = sv.compare_intra(high_correlated_df,
                              high_correlated_df['result 503'] == 0,
                              ['Stop', 'Active'],
                              'result %s' % label_type,
                              feature_config,
                              pairwise_analysis='auto')
    report.show_html(html_log)

Running the above code ends up with 2 HTML reports, one per label (“502” and “503”). Each report provides a deep view into each sample, as well as a powerful pairwise-analysis view (acting as a correlation matrix).

EDA — SweetViz compare_intra analysis for “Active” vs “InActive” (aka “Stop”) groups, per each label

The above figure is captured from the two HTML reports. The left side refers to the “result 502” report, while the right side refers to the “result 503” report. The top row provides a high-level statistical comparison between the 2 groups. The mid section focuses on one of the most high-correlated samples (“var109”), and the bottom section introduces pairwise correlation matrices, per group. Note that SweetViz’s compare_intra doesn’t perform well with a large number of features, hence it’s important to apply it on a selected subset, e.g. the most correlated ones.
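One hedged way to pick such a subset is to keep the top-N samples by absolute correlation with the label, reusing the correlation_dict computed in section 2.5 (N=20 below is an arbitrary assumption):

TOP_N = 20  # arbitrary choice, small enough for SweetViz to handle comfortably

for label_type in ('502', '503'):
    corr_s = correlation_dict[label_type]
    top_features = corr_s.abs().nlargest(TOP_N).index.tolist()
    print('Top-%d correlated features for result %s:' % (TOP_N, label_type))
    print(top_features)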

2.7. Dimensionality Reduction

Hereby, principal component analysis (PCA) is applied for capturing the maximal variance while keeping the least amount of components. Dask’s Incremental PCA is applied for dealing with the large amount of data under memory-bound conditions.

lr = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('pca', IncrementalPCA()),
])

dropped_list = ['var0', 'result 502', 'result 503']
data_filt_df = data_df.drop(dropped_list, axis=1).fillna(0)
X_arr = data_filt_df.to_dask_array(lengths=True).astype(float)

lr_fitted = lr.fit(X_arr)
pca = lr_fitted['pca']
pca_components_num = len(pca.explained_variance_ratio_)

print('PCA explained variance (%d components):' % pca_components_num)
print(pca.explained_variance_ratio_)
print(pca.singular_values_)

explained_variance = 0.9
cumsum_var = pca.explained_variance_ratio_.cumsum() > explained_variance
idx = cumsum_var.argmax()
print('Number of components needed for having at least %.2f is equal to %d' % (explained_variance, idx))

pd.Series(pca.explained_variance_ratio_.cumsum()).plot(
    title='PCA explained variance (n=%d)' % pca_components_num,
    grid=True,
    figsize=(20,5))
EDA - PCA Explained Variance

It may be observed that 121 components are required for explaining 90% of the variance. Currently, we’re running PCA over all of the data, for EDA purposes. Later on, PCA can be combined with modeling as follows (see the sketch after the list):

  • Fitting PCA on the training set only ➔ pca.fit(X_train)
  • Apply the mapping (transform) to both the training set and the test set ➔ X_train = pca.transform(X_train), X_test = pca.transform(X_test)
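A minimal sketch of that train/test-safe flow, assuming X_train and X_test are Dask arrays holding the split sensor samples (variable names are illustrative):

# Fit the scaler+PCA on the training portion only, then apply the same
# (already fitted) transform to both splits.
pca_pipe = Pipeline(steps=[
    ('scale', StandardScaler()),
    ('pca', IncrementalPCA(n_components=121)),  # ~90% explained variance, per the EDA above
])
pca_pipe.fit(X_train)
X_train_pca = pca_pipe.transform(X_train)
X_test_pca = pca_pipe.transform(X_test)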

3. Modeling with Shallow Techniques

3.1. Features Engineering

The EDA provided in section 2 treated the samples (‘var1’, …, ‘var501’) as features. At this point, when dealing with Features Engineering, the following schemes may be considered (one or multiple can be applied):

  1. Define the Features to be the Samples, as is, aka 1:1 mapping
  2. Define the Features to be the Samples after a dimensionality reduction, e.g. by applying PCA, t-SNE, correlation thresholding, etc.
  3. Define the Features by manipulating the samples, either linearly or non-linearly, aka kernel-based mapping.
  4. Consider adding features which combine samples together, exploiting the mutual information between the samples and the labels

It’s usually recommended to start with the lightest scheme, and then switch to heavier tools in the arsenal if necessary. Hence, we’ll be starting with schemes 1+2 from the above list, i.e. defining the features as the high-correlation samples (a short sketch of what scheme 4 might look like is given below, for reference).
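For reference, scheme 4 could be as simple as adding a few hand-crafted combined features on top of the raw samples; a hedged sketch (the specific pair and window below are illustrative assumptions, not taken from the actual analysis):

feat_df = data_df.fillna(0)
feat_df['var109_x_var110'] = feat_df['var109'] * feat_df['var110']  # pairwise interaction term
feat_df['var109_roll_mean'] = feat_df['var109'].rolling(60).mean()  # 1-minute rolling mean (1Hz data)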

3.2. Lazy Predict

Lazy Predict helps build many basic models with little coding, and helps understand which models work better without any parameter tuning. Unfortunately, Lazy Predict doesn’t perform well for big data (memory consumption is too high), but assuming that the data is distributed roughly similarly over time, it might be educational to apply it over a small portion of the data. The purpose here is basically to provide a rough sense of how the features “interact” with the labels, which might hint at how to proceed with the further analysis. Hence, shuffling the data during structuring is allowed here, even though it is typically less appropriate for time-series analysis; it works when the time-series characteristics are considered less dominant. In other cases, the shuffling may be skipped.

for label_type in ('502', '503'):
    # Only a small portion, due to memory limitations:
    df = high_correlated_dict[label_type].iloc[:int(1e4),:]
    # Fill missing values with zeros:
    df = df.fillna(0)
    # Structure the data (+shuffle):
    X, y = shuffle(df.drop(['result 502', 'result 503'], axis=1),
                   df['result %s' % label_type],
                   random_state=42)
    # Split data into train and test sets (80:20 ratio):
    cv_tuple = train_test_split(X, y, test_size=.2, random_state=0)
    X_train, X_test, y_train, y_test = cv_tuple
    # Create an instance of the estimator, and fit it to the data:
    reg = LazyRegressor(verbose=0,
                        ignore_warnings=False,
                        custom_metric=None,
                        predictions=True)
    lazy_models, lazy_predictions = reg.fit(X_train,
                                            X_test,
                                            y_train,
                                            y_test)
    # Results analysis:
    print('Lazy Models (label = result %s)' % label_type)
    display(lazy_models)
    if not lazy_models.empty:
        fig, axs = plt.subplots(1, 1, figsize=(20, 5), sharex=True)
        sns.set_theme(style="whitegrid")
        sns.barplot(x=lazy_models.index,
                    y="R-Squared",
                    data=lazy_models,
                    ax=axs)
        axs.set(ylim=(0, 1))
        axs.set(title='Lazy Models (label = result %s)' % label_type)
        plt.xticks(rotation=90)
        plt.show()

Running the above code ends up with the following results:

Modeling — Lazy Predict, over small portion of the big-data (+shuffling)

It may be observed that several models appear to behave well in terms of a high R-Squared score, which is encouraging.

3.3. XGBoost Regressor

The previous section (3.2) implies that shallow-learning schemes might perform well on the data. This section focuses on the popular and powerful XGBoost Regressor model. Hereby, no shuffling is applied, and the whole dataset is used, leveraging Dask’s Client, train_test_split and XGBRegressor machinery. Two models are generated, one per label.

client = Client(processes=True, threads_per_worker=1)
#client = Client('scheduler-address:8786')

for label_type in ('502', '503'):
    # Structure the data (high-correlated features only):
    df = high_correlated_dict[label_type]
    high_correlated_dd = dd.from_pandas(df, npartitions=3)
    X = high_correlated_dd.drop(['result 502', 'result 503'], axis=1)
    y = high_correlated_dd['result %s' % label_type]
    # Split data into train and test sets (80:20 ratio):
    cv_tpl = dd_train_test_split(X, y, test_size=.2, random_state=0)
    X_train, X_test, y_train, y_test = cv_tpl
    # Fit the model
    est = XGBRegressor()
    est.fit(X_train, y_train)
    # Results analysis:
    test_df = pd.DataFrame(y_test.compute())
    test_df['predicted'] = est.predict(X_test).compute()
    accuracy_r2 = r2_score(test_df['result %s' % label_type],
                           test_df['predicted'])
    print('accuracy=%.2f' % accuracy_r2)
    feature_import_df = pd.DataFrame(index=X_test.columns,
                                     data=est.feature_importances_)
    title_str = 'Result %s - Features Import. (XGboost)' % label_type
    feature_import_df.plot(title=title_str,
                           kind='bar',
                           legend=None,
                           figsize=(20,5))
    fig, axes = plt.subplots(2, 1, figsize=(20,12))
    title_str = 'Result %s - Act. vs. Predicted - full' % label_type
    test_df.plot(title=title_str, grid=True, ax=axes[0])
    title_str = 'Result %s - Act. vs. Predicted - first 10K samples' % label_type
    test_df_zoom = test_df.iloc[:int(1e4),:]
    test_df_zoom.plot(title=title_str, grid=True, ax=axes[1])
    plt.show()

Running the above code ends up with the following results:

Modeling — XGBoost Regressor

It may be observed that the modeling ended up pretty well, as the ‘result 502’ and ‘result 503’ models yield R-Squared scores of 0.89 and 0.99, respectively. The Feature-Importance bar plots at the top also make sense, as the important features match the high-correlated samples which were retrieved in section 2.5.

Summary

A real big-data regression problem was introduced, accompanied by both code and qualitative analysis. The first section dealt with loading the data in an efficient manner. The following section focused on various EDA aspects. Finally, the third section covered modeling details, including features-engineering considerations, and ended up with promising results in terms of R-Squared score.

What’s Next?

  1. Explore other dimensionality-reduction techniques, either instead of or on top of the current high-correlation thresholding scheme
  2. Explore other shallow-learning models, e.g. LightGBM, etc. (see the sketch after this list)
  3. Switch to Deep-Learning techniques (Part 2 of this article), e.g. LSTM/RNN, AutoEncoder, Transformers, etc.
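As a teaser for item 2, the following is a minimal LightGBM sketch over the same high-correlated feature set (hedged: it assumes lightgbm is installed via pip, keeps the default hyper-parameters, and fits in-memory since the reduced feature set is small):

import lightgbm as lgb

for label_type in ('502', '503'):
    df = high_correlated_dict[label_type].fillna(0)
    X = df.drop(['result 502', 'result 503'], axis=1)
    y = df['result %s' % label_type]
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.2, random_state=0)
    model = lgb.LGBMRegressor()  # default hyper-parameters, for a rough baseline only
    model.fit(X_train, y_train)
    print('result %s: R-Squared = %.2f' % (label_type, r2_score(y_test, model.predict(X_test))))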

Final Note

When dealing with big data, Pickle is your friend… It’s highly recommended to store intermediate (and final) objects in designated pickle files, and to repeat the heavy calculations only when those files are absent.
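A minimal sketch of that caching pattern, using the correlation matrix from section 2.5 as an example (the file name is illustrative):

corr_pkl = path.join(base_path, 'corr_pearson.pkl')
if path.exists(corr_pkl):
    with open(corr_pkl, 'rb') as f:
        corr_pearson_df = pickle.load(f)      # reuse the cached result
else:
    with ProgressBar():
        corr_pearson_df = data_df.corr(method='pearson').compute()
    with open(corr_pkl, 'wb') as f:
        pickle.dump(corr_pearson_df, f)       # cache for the next run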
