A complete Machine-Learning stack, before real data becomes available

Shahar Gino
Sep 12, 2021

Quick ramp-up of a full ML stack, based on a synthetic dataset

TL;DR

This article walks you, step by step, through building a synthetic dataset (features and labels), then running Exploratory Data Analysis (EDA) and Machine-Learning (ML) analysis on it. It can serve as a common template for an early ramp-up, before real data becomes available.

Agenda

  1. Dataset (synthetic) high-level definition - features and labels
  2. Dataset generation
  3. Dataset EDA (+correlation check)
  4. ML Analysis - models exploration

Step 1: Dataset Definition

The features are defined as random variables, while the labels are defined in a rule-based manner. It is convenient to maintain these high-level definitions in Google Sheets, which is reachable from Python and allows an easy yet powerful setup.

For example:

Synthetic dataset definition over Google-Sheets - an example

Here, 4 features are defined as random variables, in columns A-G:

  • Age ➔ Numeric, normally-distributed as N(35,10)
  • Gender ➔ Categorical, where the Male/Female ratio is 1.5:1, respectively
  • Children_num ➔ Numeric, uniformly-distributed as U(0,5)
  • Grade ➔ Numeric, a combination of 2 normal distributions, N(50,5) and N(85,5), in a ratio of 2:1, respectively
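As a rough illustration of what these four definitions mean in practice, the sketch below draws samples directly with NumPy. The helper name `sample_features`, the seed, and the choice to truncate values to whole numbers are assumptions made for this example only; the article's own generator appears in Step 2.

```python
import numpy as np

def sample_features(size, seed=0):
    """Draw the four example features (hypothetical helper, not the article's generator)."""
    rng = np.random.default_rng(seed)
    # Age ~ N(35, 10), truncated to whole years (assumption)
    age = rng.normal(35, 10, size).astype(int)
    # Gender: Male/Female at a 1.5:1 ratio, i.e. probabilities 0.6 and 0.4
    weights = np.array([1.5, 1.0])
    gender = rng.choice(["Male", "Female"], size=size, p=weights / weights.sum())
    # Children_num ~ U(0, 5) over whole numbers, upper bound included (assumption)
    children_num = rng.integers(0, 6, size)
    # Grade: mixture of N(50, 5) and N(85, 5) at a 2:1 ratio
    n_low = int(size * 2 / 3)
    grade = np.concatenate([
        rng.normal(50, 5, n_low),
        rng.normal(85, 5, size - n_low),
    ]).astype(int)
    return {"age": age, "gender": gender, "children_num": children_num, "grade": grade}

features = sample_features(9000)
print({name: values[:5] for name, values in features.items()})
```

With a few thousand samples, the share of "Male" values should sit close to 0.6 and the grade histogram should show two separate bumps around 50 and 85.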

The score label is defined in a rule-based fashion in columns H-M:

  • Age below 25 contributes a 5-star score, age between 25 and 35 contributes a 4-star score, etc. Age itself has a weight of 4 in the overall score label
  • A Gender of Male or Female contributes a score of 3 or 2, respectively. Gender itself has a weight of 1 in the overall score label
  • Children_num and Grade affect the score in a similar way, with weights of 1 and 5 in the overall score label

For example:

A sample of {age=35, gender=Male, children_num=2, grade=63} ends up with a label of 32, due to the following scoring contributions:

age=35 ➔ +4 (x4) ➔ +16

gender=Male ➔ +3 (x1) ➔ +3

children_num=2 ➔ +3 (x1) ➔ +3

grade=63 ➔ +2 (x5) ➔ +10
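The same arithmetic can be written as a short sketch: each feature contributes its star rating multiplied by its weight, and the label is the sum. The dictionary layout is only an illustration; the star ratings and weights are the ones listed in the example above.

```python
# Star ratings and weights taken from the worked example above
contributions = {
    "age":          {"stars": 4, "weight": 4},
    "gender":       {"stars": 3, "weight": 1},
    "children_num": {"stars": 3, "weight": 1},
    "grade":        {"stars": 2, "weight": 5},
}

# Label = sum over features of (stars x weight)
label = sum(c["stars"] * c["weight"] for c in contributions.values())
print(label)  # 32
```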

Step 2: Dataset Generation

The following Python code reads the dataset definitions (Step 1) into a Pandas DataFrame. Replace SPREADSHEET_ID and GID with the values taken from your own Google Sheet URL.

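import io
import requests
import pandas as pd
from IPython.display import display  # available by default inside a notebook

def read_dataset_definitions(url):
    """Download the sheet as CSV and return it as a DataFrame."""
    s = requests.get(url).content
    dataset_def_df = pd.read_csv(io.StringIO(s.decode('utf-8')))
    # Blank "Group" cells inherit the value above them
    dataset_def_df['Group'] = dataset_def_df['Group'].ffill()
    # Remaining blanks become 0, which the scoring code treats as "no rule"
    dataset_def_df = dataset_def_df.fillna(0)
    return dataset_def_df

SpreadsheetID = 'SPREADSHEET_ID'  # placeholder: the ID from your sheet URL
gid = 'GID'                       # placeholder: the gid from your sheet URL
url = f"https://docs.google.com/spreadsheets/d/{SpreadsheetID}/export?format=csv&gid={gid}"
dataset_def_df = read_dataset_definitions(url)
display(dataset_def_df)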

Running the above code, with the appropriate SPREADSHEET_ID and GID, produces the following printout:

Synthetic dataset definition, read from Google-Sheets as a Pandas DataFrame - an example
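To experiment without a Google Sheet, an in-memory CSV can stand in for the export. The column names below are the ones the code in this article reads (`Group`, `Attribute Name`, `Distribution`, `Type`, `Clip Negatives`, `Status`, `Weight`, `Score_1` to `Score_5`). Everything else is an assumption for illustration: the group name, the `Type` and `Clip Negatives` values, and the exact spelling of the rules. Only the scoring rules stated in Step 1 are filled in, and the real sheet may hold more columns and rows.

```python
import io
import pandas as pd

# Hypothetical stand-in for the sheet's CSV export.
# Blank Score_* cells are rules that Step 1 does not spell out.
csv_text = '''Group,Attribute Name,Distribution,Type,Clip Negatives,Status,Weight,Score_1,Score_2,Score_3,Score_4,Score_5
Personal,Age,"N(35,10)",int,Yes,Ready,4,,,,25-35,25-
,Gender,"C(Male:1.5,Female:1)",str,No,Ready,1,,(Female),(Male),,
'''

demo_def_df = pd.read_csv(io.StringIO(csv_text))
# Blank "Group" cells inherit the value above them, as in read_dataset_definitions
demo_def_df['Group'] = demo_def_df['Group'].ffill()
print(demo_def_df[['Group', 'Attribute Name', 'Distribution', 'Weight']])
```

The rule strings follow the forms that `calculate_score` recognizes: `25-35` for an inclusive range, `25-` for "at most 25", and `(Male)` for category membership.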

Next, the following code generates a synthetic dataset based on the definitions just read:

!pip install pandarallel

import re
import numpy as np
from pandarallel import pandarallel

def generate_data(attr_dist, attr_type, clip_neg, size):
    """Sample `size` values from the distribution described by `attr_dist`."""
    data = 0
    generic_search = lambda x: re.search(x, attr_dist, re.IGNORECASE)
    re_uniform_search = generic_search(r'U\((\d+)\s*,\s*(\d+)\)')
    re_choices_search = generic_search(r'C\((.*)\)')
    re_lookup_search = generic_search(r'L\((.*)\)')
    re_norm_search = generic_search(r'N\((\d+)\s*,\s*(\d+)\)')
    re_norm2_search = generic_search(r'N2\(\((\d+)\s*,\s*(\d+)\):(\d+)\s*,\s*\((\d+)\s*,\s*(\d+)\):(\d+)\)')
    # Numeric, normal:
    if re_norm_search:
        mu = float(re_norm_search.group(1))
        sigma = float(re_norm_search.group(2))
        data = np.random.randn(size) * sigma + mu
        data = data.astype(attr_type)
    # Numeric, normal dual:
    elif re_norm2_search:
        mu1 = float(re_norm2_search.group(1))
        sigma1 = float(re_norm2_search.group(2))
        weight1 = float(re_norm2_search.group(3))
        mu2 = float(re_norm2_search.group(4))
        sigma2 = float(re_norm2_search.group(5))
        weight2 = float(re_norm2_search.group(6))
        # Split the population between the two normals according to their weights
        size1 = int((weight1 / (weight1 + weight2)) * size)
        size2 = size - size1
        data1 = np.random.randn(size1) * sigma1 + mu1
        data2 = np.random.randn(size2) * sigma2 + mu2
        data = np.concatenate((data1, data2), axis=None).astype(attr_type)
    # Numeric, uniform:
    elif re_uniform_search:
        low = float(re_uniform_search.group(1))
        # +1 so that the upper bound is still reachable after integer truncation
        high = float(re_uniform_search.group(2)) + 1
        data = np.random.uniform(low, high, size)
        if attr_type == 'bool':
            data = data.astype(int).astype(attr_type)
        data = data.astype(attr_type)
    # Categorical:
    elif re_choices_search:
        choices_str = re_choices_search.group(1).replace(' ', '')
        choices_list = choices_str.split(',')
        choices_items_list = [x.split(':')[0] for x in choices_list]
        choices_items_weights = [float(x.split(':')[1]) for x in choices_list]
        choices_items_probs = np.array(choices_items_weights) / np.sum(choices_items_weights)
        data = np.random.choice(choices_items_list, size=size, p=choices_items_probs)
    # Clip negatives:
    if clip_neg == 'Yes':
        data = data.clip(min=0)
    return data

def calculate_score(row, debug=False):
    """Sum the weighted star ratings of all attributes in a row."""
    score = 0
    score_opts = 5
    for attr_name, attr_value in row.items():
        scoring_row = scoring_df.loc[attr_name]
        weight = scoring_row['Weight']
        if weight != 0:
            for k in range(1, score_opts + 1):
                score_cond = scoring_row['Score_%d' % k]
                if score_cond == 0:
                    continue
                re_range_search = re.search(r'(\d+)\s*-\s*(\d+)', score_cond, re.IGNORECASE)
                score_acc_conds = []
                # "a-b": value inside the inclusive range
                score_acc_conds.append(re_range_search and (float(re_range_search.group(1)) <= attr_value <= float(re_range_search.group(2))))
                # "a+": value of at least a
                score_acc_conds.append(score_cond.endswith('+') and attr_value >= float(score_cond[:-1]))
                # "a-": value of at most a
                score_acc_conds.append(score_cond.endswith('-') and attr_value <= float(score_cond[:-1]))
                # "(x, y)": value is one of the listed categories
                score_acc_conds.append(attr_value in score_cond.replace('(', '').replace(')', '').replace(' ', '').split(','))
                if any(score_acc_conds):
                    # k stars multiplied by the attribute weight, as described in Step 1
                    score += k * weight
                    if debug:
                        print('Score update: %s=%s (+%d) --> %d' % (attr_name, str(attr_value), k * weight, score))
                    break
    return score

POPULATION_SIZE = 10000
pandarallel.initialize()
dataset_df = pd.DataFrame()
# Generate one column per attribute that is marked as ready in the definitions
for index, row in dataset_def_df.iterrows():
    if row['Status'] == 'Ready':
        attr_name = row['Attribute Name']
        attr_dist = row['Distribution']
        attr_type = row['Type']
        clip_neg = row['Clip Negatives']
        attr_data = generate_data(attr_dist, attr_type, clip_neg, POPULATION_SIZE)
        dataset_df[attr_name] = attr_data

scoring_cond1 = dataset_def_df.columns.str.startswith('Attribute Name', na=False)
scoring_cond2 = dataset_def_df.columns.str.startswith('Weight', na=False)
scoring_cond3 = dataset_def_df.columns.str.startswith('Score_', na=False)
scoring_cond = scoring_cond1 | scoring_cond2 | scoring_cond3
scoring_df = dataset_def_df.loc[:, scoring_cond]
scoring_df = scoring_df.set_index('Attribute Name')
dataset_df['score'] = dataset_df.parallel_apply(calculate_score, axis=1)
print('\nSynthetic Dataset %s, head:' % str(dataset_df.shape))
display(dataset_df.head())

Running the above code produces the following output:

Synthetic dataset generation (example) - 10,000 samples with 4 features and 1 label
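The distribution strings form a small notation that `generate_data` decodes with regular expressions. The standalone sketch below shows how two such patterns capture their parameters. `parse_distribution` is a hypothetical helper written for this illustration, it covers only the normal and categorical forms, and the string `C(Male:1.5, Female:1)` is an assumed encoding of the 1.5:1 gender ratio.

```python
import re

def parse_distribution(spec):
    """Return (kind, params) for a distribution string (illustrative helper)."""
    normal = re.fullmatch(r'N\((\d+)\s*,\s*(\d+)\)', spec, re.IGNORECASE)
    if normal:
        return 'normal', {'mu': float(normal.group(1)), 'sigma': float(normal.group(2))}
    choices = re.fullmatch(r'C\((.*)\)', spec, re.IGNORECASE)
    if choices:
        # Each item looks like "name:weight"
        pairs = [item.split(':') for item in choices.group(1).replace(' ', '').split(',')]
        total = sum(float(weight) for _, weight in pairs)
        # Normalise the weights into probabilities
        return 'categorical', {name: float(weight) / total for name, weight in pairs}
    raise ValueError(f'Unsupported distribution: {spec}')

normal_kind, normal_params = parse_distribution('N(35,10)')
choice_kind, choice_params = parse_distribution('C(Male:1.5, Female:1)')
print(normal_kind, normal_params)
print(choice_kind, choice_params)
```

Keeping the notation in the sheet and the parsing in code means a new feature can be added by editing a cell, without touching the generator.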

Notes:

  1. Label calculation is applied in parallel to speed up computation, using the pandarallel package (installed through pip, see the first line of the code snippet above). A standard “apply” works as well.
  2. The POPULATION_SIZE variable sets the dataset size (number of samples) and can be set to any desired value
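For reference, the serial alternative has the same call shape as the parallel one. The toy DataFrame and the `toy_score` rule below are placeholders invented for this sketch; in the pipeline above, the row function would be `calculate_score`.

```python
import pandas as pd

toy_df = pd.DataFrame({'age': [22, 35], 'grade': [90, 63]})

def toy_score(row):
    # Placeholder rule for the demo only, not the article's scoring rule
    return int(row['age'] < 25) + int(row['grade'] >= 85)

# Serial equivalent of parallel_apply(..., axis=1): no extra dependency needed
toy_df['score'] = toy_df.apply(toy_score, axis=1)
print(toy_df)
```

For small populations the serial version is usually fast enough; the parallel one pays off as POPULATION_SIZE grows.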

Step 3: Dataset EDA

Once the dataset is available, an Exploratory Data Analysis (EDA) phase typically follows, to inspect the data and obtain important insights.

EDA can be done directly in “pure” Python or with dedicated packages such as Sweetviz; both options are shown below.

The following code performs EDA in “pure” Python:

import matplotlib.pyplot as plt

# Split the feature columns (everything except the label) by kind
all_cols = dataset_df.columns.to_list()
all_cols.remove('score')
numeric_cols = dataset_df.select_dtypes(include=['number', 'bool']).columns.to_list()
numeric_cols.remove('score')
categorical_cols = np.setdiff1d(all_cols, numeric_cols).tolist()
print('Numerical features: %d' % len(numeric_cols))
print('Categorical features: %d' % len(categorical_cols))
print('\nDataset (head):')
display(dataset_df.head())
print('\nScoring (head):')
display(scoring_df.head())
print('\nScoring example, for first record:')
print('-'*34)
score_example = calculate_score(dataset_df.iloc[0,:-1], True)
# Keep numeric features:
ml_df = dataset_df.loc[:, numeric_cols].copy()
# Add categorical features, encoded as integer codes:
ml_df[categorical_cols] = dataset_df[categorical_cols].apply(lambda col: pd.Categorical(col).codes)
# Add the label, which the correlation check and Step 4 read from ml_df:
ml_df['score'] = dataset_df['score']
f_hist = ml_df.hist(figsize=(10,12))

Running the code produces the following output:

EDA with “pure” python (Matplotlib) for the example synthetic dataset
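The EDA code converts categorical features into integer codes before plotting and correlating them. A minimal sketch of what `pd.Categorical(...).codes` returns, using a made-up four-row gender column:

```python
import pandas as pd

gender = pd.Series(['Male', 'Female', 'Male', 'Female'])  # made-up sample values
encoded = pd.Categorical(gender)

categories = list(encoded.categories)
codes = encoded.codes.tolist()
print(categories)  # ['Female', 'Male']
print(codes)       # [1, 0, 1, 0]
```

The codes follow the sorted order of the category names, so they are arbitrary labels rather than meaningful magnitudes. That is worth keeping in mind when reading correlations for categorical features.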

The EDA may then expand a bit further for analyzing correlation between the various features:

import seaborn as sns
sns.set()
corr_df = pd.DataFrame(ml_df.corrwith(ml_df['score']).sort_values(), columns=['Correlation'])
display(corr_df)
pearson_cor = ml_df.corr(method='pearson')
spearman_cor = ml_df.corr(method='spearman')
# Highlight relevant features
pearson_cor_targets = abs(pearson_cor['score'])
spearman_cor_targets = abs(spearman_cor['score'])
print('Relevant Features (Pearson):\n')
print(pearson_cor_targets[pearson_cor_targets > 0.2].sort_values(), '\n')
print('Relevant Features (Spearman):\n')
print(spearman_cor_targets[spearman_cor_targets > 0.2].sort_values())
# Plot
fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(15,30))
sns.heatmap(pearson_cor, ax=ax1, annot=False, cmap=plt.cm.Reds)
sns.heatmap(spearman_cor, ax=ax2, annot=False, cmap=plt.cm.Reds)
ax1.set_title('Pearson (linearity)')
ax2.set_title('Spearman (monotonic)')
plt.show()

Correlation Analysis - part of the “pure” EDA phase

In this specific example, we can observe a relatively high correlation (both by Pearson and Spearman) between the grade feature and the score label.
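The two measures answer slightly different questions: Pearson captures linear relationships, while Spearman captures any monotonic one. The sketch below illustrates the difference on made-up data, using NumPy only. The helper functions are written for this example, and the rank computation assumes there are no tied values.

```python
import numpy as np

def pearson(a, b):
    """Pearson correlation coefficient of two 1-D arrays."""
    return float(np.corrcoef(a, b)[0, 1])

def ranks(values):
    """Rank of each element, 0 for the smallest (valid when there are no ties)."""
    return np.argsort(np.argsort(values))

def spearman(a, b):
    """Spearman correlation: Pearson computed on the ranks."""
    return pearson(ranks(a), ranks(b))

x = np.linspace(0, 5, 50)
y = np.exp(x)  # strictly increasing, but clearly non-linear

pearson_xy = pearson(x, y)
spearman_xy = spearman(x, y)
print(round(pearson_xy, 3), round(spearman_xy, 3))
```

Spearman comes out at 1 because the ordering of `x` and `y` is identical, whereas Pearson is noticeably lower because the relationship is not a straight line. A gap between the two on real features is a hint of a non-linear dependency.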

The following code applies Sweetviz for EDA:

!pip install sweetviz

import sweetviz as sv
from google.colab import files  # only needed when running in Google Colab

my_report = sv.analyze(dataset_df)
my_report.show_html(filepath="example_report.html", open_browser=False)
files.download('example_report.html')  # Colab only: download the report locally

Running the code produces the following interactive HTML result:

EDA with Sweetviz for the example synthetic dataset

Step 4: ML Analysis

This step starts with the ml_df DataFrame, computed in the previous step:

print('ML dataframe %s:' % str(ml_df.shape))
display(ml_df.head())

ML DataFrame (example)

Next, we’ll split the dataset into Train and Test sets:

from sklearn.model_selection import train_test_split

X = ml_df.drop('score', axis=1)
y = ml_df['score']
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
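Since no `test_size` is passed, scikit-learn falls back to its default of holding out 25% of the rows for testing. A small sketch with stand-in arrays of the same size as the example population shows the resulting split; the arrays are placeholders, not the article's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X_demo = np.arange(10000).reshape(-1, 1)  # stand-in for 10,000 feature rows
y_demo = np.arange(10000)                 # stand-in labels

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, random_state=42)
print(len(X_tr), len(X_te))  # 7500 2500
```

Fixing `random_state` makes the split reproducible, which helps when comparing models across runs.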

Finally, we’ll iterate over various ML models, using the powerful LazyPredict package:

!pip install lazypredict

import lazypredict
from lazypredict.Supervised import LazyRegressor

reg = LazyRegressor(verbose=0, ignore_warnings=True)
models, predictions = reg.fit(X_train, X_test, y_train, y_test)
display(models)

ML models analysis, by applying the LazyPredict package

The above table describes the performance of the various ML models on the given dataset, and it is an excellent starting point for further, deeper analysis. The best model in this example appears to be the XGBRegressor, which achieves an R2 of 1.00 and an RMSE of 0.10.
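For readers less familiar with the two metrics quoted above, here is a minimal NumPy sketch of how R2 and RMSE are computed. The label and prediction values are made up for the demonstration and are unrelated to the table's results.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root of the mean squared error, in the units of the label."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r2(y_true, y_pred):
    """Share of the label's variance that the predictions explain."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1 - ss_res / ss_tot)

y_true = np.array([30.0, 32.0, 41.0, 25.0])  # made-up labels
y_pred = np.array([31.0, 32.0, 40.0, 25.0])  # made-up predictions

demo_rmse = rmse(y_true, y_pred)
demo_r2 = r2(y_true, y_pred)
print(round(demo_rmse, 3), round(demo_r2, 3))
```

An R2 close to 1 together with an RMSE that is small relative to the label's range is what one would expect here, since the synthetic label is a deterministic function of the features.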

Final note:

When a real dataset becomes available, Steps 3 and 4 of the above pipeline still hold; only the first 2 steps are dropped, since synthetic data is no longer needed.
