Sales Prediction

Lecture 6.3

Problem Statement

Build a model which predicts sales based on the money spent on different platforms for marketing.

Dataset - https://www.kaggle.com/datasets/ashydv/advertising-dataset

Performing Simple Linear Regression

Equation of linear regression
\(y = c + m_1x_1 + m_2x_2 + ... + m_nx_n\)

  • \(y\) is the response
  • \(c\) is the intercept
  • \(m_1\) is the coefficient for the first feature
  • \(m_n\) is the coefficient for the nth feature

In our case:

\(y = c + m_1 \times TV\)

The \(m\) values are called the model coefficients or model parameters.
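For a single feature, the coefficients that minimise the squared error have a closed form: \(m_1 = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sum_i (x_i - \bar{x})^2}\) and \(c = \bar{y} - m_1\bar{x}\). The sketch below illustrates this with NumPy on made-up numbers; the function name `fit_simple_linear_regression` and the toy arrays are assumptions for illustration and are not part of the lecture code.

```python
import numpy as np

def fit_simple_linear_regression(x, y):
    """Return (c, m1) minimising the squared error of y ~ c + m1 * x."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = x.mean(), y.mean()
    # Slope: co-variation of x and y divided by the variation of x.
    m1 = np.sum((x - x_mean) * (y - y_mean)) / np.sum((x - x_mean) ** 2)
    # Intercept: the fitted line passes through the point of means.
    c = y_mean - m1 * x_mean
    return c, m1

# Hypothetical toy data lying exactly on y = 7 + 0.05 * x (not the real dataset).
tv_spend = np.array([0.0, 100.0, 200.0, 300.0])
sales = 7.0 + 0.05 * tv_spend

c, m1 = fit_simple_linear_regression(tv_spend, sales)
print(c, m1)
```

On noise-free data the function should recover the intercept and slope used to generate it; on real data it should agree with a least-squares fit such as `LinearRegression` up to floating-point error.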

Linear regression is usually described as relying on 7 main assumptions:

  • Linear model (the response is a linear function of the predictors)
  • No multicollinearity in the data
  • Homoscedasticity of residuals (equal variances)
  • No autocorrelation in residuals
  • Number of observations greater than the number of predictors
  • Each observation is unique
  • Residuals are normally distributed

Please refer to: https://www.geeksforgeeks.org/assumptions-of-linear-regression/
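Some of these assumptions can be checked numerically. The sketch below shows two rough diagnostics on synthetic data: the Durbin-Watson statistic for autocorrelation in residuals (values near 2 suggest no first-order autocorrelation) and the largest pairwise correlation between predictors as a crude multicollinearity signal. The helper names and the random data are assumptions for illustration only; a full analysis would typically use a statistics package.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic; values near 2 suggest no first-order autocorrelation."""
    residuals = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

def max_abs_offdiag_correlation(features):
    """Largest absolute pairwise correlation between the feature columns."""
    corr = np.corrcoef(np.asarray(features, dtype=float), rowvar=False)
    off_diag = corr[~np.eye(corr.shape[0], dtype=bool)]
    return np.max(np.abs(off_diag))

rng = np.random.default_rng(0)

# Independent noise stands in for the residuals of a well-behaved model.
independent_noise = rng.normal(size=500)
dw = durbin_watson(independent_noise)

# Hypothetical predictors: the third column is an exact multiple of the first.
x1 = rng.normal(size=500)
x2 = rng.normal(size=500)
collinear = np.column_stack([x1, x2, 2.0 * x1])
r_max = max_abs_offdiag_correlation(collinear)

print(dw, r_max)
```

A pairwise correlation close to 1 flags redundant predictors, although it can miss collinearity that involves three or more columns at once.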


::: {.cell _uuid=‘d68008018678c65564ddda5994cb05129f3ca72b’ execution_count=1}

# Dataset Handling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns


#Model Training
from sklearn.linear_model import LinearRegression

#Evaluation
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import f_regression

np.set_printoptions(suppress=True)

:::

1. Reading the Data

::: {.cell _uuid=‘1365d38deb407ea9c0f4e93830c5f9d4d65ebd9d’ execution_count=2}

advertising = pd.read_csv("advertising.csv")  # read_csv already returns a DataFrame
advertising.head()
TV Radio Newspaper Sales
0 230.1 37.8 69.2 22.1
1 44.5 39.3 45.1 10.4
2 17.2 45.9 69.3 12.0
3 151.5 41.3 58.5 16.5
4 180.8 10.8 58.4 17.9

:::

::: {.cell _uuid=‘4f36948806d235d179b1a5c6b6c990a41afc6e4a’ scrolled=‘true’ execution_count=3}

advertising.shape
(200, 4)

:::

::: {.cell _uuid=‘9578033b7d507aa4d901b48de36931066cc00241’ execution_count=4}

advertising.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB

:::

::: {.cell _uuid=‘b817b9601c376627448453b03d79bf8f9dd02eac’ execution_count=5}

advertising.describe()
TV Radio Newspaper Sales
count 200.000000 200.000000 200.000000 200.000000
mean 147.042500 23.264000 30.554000 15.130500
std 85.854236 14.846809 21.778621 5.283892
min 0.700000 0.000000 0.300000 1.600000
25% 74.375000 9.975000 12.750000 11.000000
50% 149.750000 22.900000 25.750000 16.000000
75% 218.825000 36.525000 45.100000 19.050000
max 296.400000 49.600000 114.000000 27.000000

:::

2. Data Cleaning

::: {.cell _uuid=‘cf9580e58b78c0558d96f54272701b6d2d32a018’ execution_count=9}

advertising.isnull().sum()
TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64

:::

There are no NULL values in any column, so no imputation or row removal is needed.
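Missing values are only one kind of data-quality problem. As a minimal sketch, the function below also counts duplicated rows and negative numeric values (advertising spend and sales should not be negative). The function name `basic_quality_report` and the toy DataFrame are hypothetical and only serve to demonstrate the checks.

```python
import pandas as pd

def basic_quality_report(df):
    """Count missing cells, duplicated rows, and negative numeric values."""
    numeric = df.select_dtypes("number")
    return {
        "missing": int(df.isnull().sum().sum()),
        "duplicate_rows": int(df.duplicated().sum()),
        "negative_values": int((numeric < 0).sum().sum()),
    }

# Hypothetical toy frame with one missing cell, one repeated row, one negative value.
toy = pd.DataFrame({
    "TV": [230.1, 44.5, 44.5, -1.0],
    "Sales": [22.1, 10.4, 10.4, None],
})

report = basic_quality_report(toy)
print(report)
```

The same function could be called on `advertising` to confirm that all three counts are zero before modelling.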

3. Exploratory Data Analysis

import matplotlib.pyplot as plt

fig, axes = plt.subplots(nrows=2, ncols=2,figsize=(10,10))

advertising['TV'].plot.box(ax=axes[0,0])
advertising['Radio'].plot.box(ax=axes[0,1])
advertising['Newspaper'].plot.box(ax=axes[1,0])
advertising['Sales'].plot.box(ax=axes[1,1])
plt.show()

::: {.cell _uuid=‘2d6f716ebe182a58f9941c059256a09cc7f03703’ execution_count=11}

g = pd.plotting.scatter_matrix(advertising, figsize=(10,10))
plt.show()

:::

advertising.corr()
TV Radio Newspaper Sales
TV 1.000000 0.054809 0.056648 0.901208
Radio 0.054809 1.000000 0.354104 0.349631
Newspaper 0.056648 0.354104 1.000000 0.157960
Sales 0.901208 0.349631 0.157960 1.000000

# Plot the correlation matrix as a heatmap with labelled axes
numeric_cols = advertising.select_dtypes(['number']).columns
f = plt.figure(figsize=(5, 5))
plt.matshow(advertising.corr(), fignum=f.number)
plt.xticks(range(len(numeric_cols)), numeric_cols, fontsize=14, rotation=45)
plt.yticks(range(len(numeric_cols)), numeric_cols, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

f_stats, p_values = f_regression(advertising[['TV','Radio','Newspaper']].to_numpy(),advertising['Sales'].to_numpy())
f_stats, p_values
(array([856.17671282,  27.57467815,   5.0667947 ]),
 array([0.        , 0.00000039, 0.02548744]))

Model Building

Dataset Preparation

We first assign the feature variable (TV, in this case) to the variable X and the response variable (Sales) to the variable y.

::: {.cell _uuid=‘ae7285c79fd678fad0ee4fb18f8923daf024838b’ execution_count=15}

X = advertising['TV'].to_numpy()
y = advertising['Sales'].to_numpy()

:::

Train-Val-Test Split

We now split the data into training, validation, and test sets using train_test_split from sklearn.model_selection. A common choice is to keep 70% of the data for training; here the remaining 30% is split equally, giving 15% for validation and 15% for testing.
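The two train_test_split calls in the next cell produce a 70/15/15 split. To make the mechanics explicit, here is a rough NumPy-only sketch of the same idea: shuffle the row indices once, then cut them into three consecutive pieces. The function name and its parameters are assumptions for illustration, and the rows it selects will differ from those chosen by scikit-learn even with the same seed.

```python
import numpy as np

def train_val_test_split_indices(n_samples, train_frac=0.7, val_frac=0.15, seed=100):
    """Shuffle row indices and cut them into train / validation / test parts."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)
    n_train = int(round(train_frac * n_samples))
    n_val = int(round(val_frac * n_samples))
    train_idx = shuffled[:n_train]
    val_idx = shuffled[n_train:n_train + n_val]
    test_idx = shuffled[n_train + n_val:]  # whatever remains goes to the test set
    return train_idx, val_idx, test_idx

# 200 rows, matching the size of the advertising dataset.
train_idx, val_idx, test_idx = train_val_test_split_indices(200)
print(len(train_idx), len(val_idx), len(test_idx))
```

The validation set is used to compare modelling choices, while the test set is held back for a final, one-off estimate of performance.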

::: {.cell _uuid=‘997311202075aaa98631ef95c1a0d91cdbefa2af’ execution_count=16}

X_train, X_test_and_val, y_train, y_test_and_val = train_test_split(X, y, train_size = 0.7,random_state = 100)
X_val, X_test, y_val, y_test = train_test_split(X_test_and_val, y_test_and_val, test_size = 0.5, random_state = 100)

:::

print(len(X_train),len(X_val),len(X_test))
140 30 30

Building a Linear Model

model = LinearRegression()
X_train.reshape(-1, 1)
array([[213.4],
       [151.5],
       [205. ],
       [142.9],
       [134.3],
       [ 80.2],
       [239.8],
       [ 88.3],
       [ 19.4],
       [225.8],
       [136.2],
       [ 25.1],
       [ 38. ],
       [172.5],
       [109.8],
       [240.1],
       [232.1],
       [ 66.1],
       [218.4],
       [234.5],
       [ 23.8],
       [ 67.8],
       [296.4],
       [141.3],
       [175.1],
       [220.5],
       [ 76.4],
       [253.8],
       [191.1],
       [287.6],
       [100.4],
       [228. ],
       [125.7],
       [ 74.7],
       [ 57.5],
       [262.7],
       [262.9],
       [237.4],
       [227.2],
       [199.8],
       [228.3],
       [290.7],
       [276.9],
       [199.8],
       [239.3],
       [ 73.4],
       [284.3],
       [147.3],
       [224. ],
       [198.9],
       [276.7],
       [ 13.2],
       [ 11.7],
       [280.2],
       [ 39.5],
       [265.6],
       [ 27.5],
       [280.7],
       [ 78.2],
       [163.3],
       [213.5],
       [293.6],
       [ 18.7],
       [ 75.5],
       [166.8],
       [ 44.7],
       [109.8],
       [  8.7],
       [266.9],
       [206.9],
       [149.8],
       [ 19.6],
       [ 36.9],
       [199.1],
       [265.2],
       [165.6],
       [140.3],
       [230.1],
       [  5.4],
       [ 17.9],
       [237.4],
       [286. ],
       [ 93.9],
       [292.9],
       [ 25. ],
       [ 97.5],
       [ 26.8],
       [281.4],
       [ 69.2],
       [ 43.1],
       [255.4],
       [239.9],
       [209.6],
       [  7.3],
       [240.1],
       [102.7],
       [243.2],
       [137.9],
       [ 18.8],
       [ 17.2],
       [ 76.4],
       [139.5],
       [261.3],
       [ 66.9],
       [ 48.3],
       [177. ],
       [ 28.6],
       [180.8],
       [222.4],
       [193.7],
       [ 59.6],
       [131.7],
       [  8.4],
       [ 13.1],
       [  4.1],
       [  0.7],
       [ 76.3],
       [250.9],
       [273.7],
       [ 96.2],
       [210.8],
       [ 53.5],
       [ 90.4],
       [104.6],
       [283.6],
       [ 95.7],
       [204.1],
       [ 31.5],
       [182.6],
       [289.7],
       [156.6],
       [107.4],
       [ 43. ],
       [248.4],
       [116. ],
       [110.7],
       [187.9],
       [139.3],
       [ 62.3],
       [  8.6]])

::: {.cell _uuid=‘b80a766082e6c9c40c3f09499fec4cfc51f62763’ execution_count=25}

model.fit(X_train.reshape(-1, 1),y_train)
LinearRegression()

:::

::: {.cell _uuid=‘fd4287b550d2f05555ae3e18d6f497912424f8cf’ execution_count=27}

# Print the parameters, i.e. the intercept and the slope of the regression line fitted
b = model.intercept_ 
w1 = model.coef_[0]

print(str(w1)+"x"+"+"+str(b))
0.05454575291590793x+6.948683200001362

:::

prediction = model.predict(X_val.reshape(-1, 1))
prediction
array([13.52144643, 18.86693021, 13.1068987 , 17.1923756 , 19.94148154,
       10.71234015, 20.13239168, 11.05597839,  9.03233096, 17.60692332,
       13.66326538, 18.77420243, 15.11418241, 12.25053038, 18.44147334,
       17.72692398, 14.09963141, 11.70507285, 17.99419817, 14.32326899,
        9.67597085, 10.6796127 , 13.34144544, 10.79961336, 12.08689312,
       16.60328147, 17.48692266, 18.82329361, 17.03419291, 18.75238413])
mean_squared_error(y_val.reshape(-1, 1),prediction)
4.42099969589266

R-squared (R²)

R-squared (R², the coefficient of determination) measures the proportion of the variance in the dependent variable that is explained by the independent variable(s) in a regression model. In other words, it shows how well the model fits the data (the goodness of fit): a value of 1 means a perfect fit, while 0 means the model does no better than always predicting the mean.
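R² can be written as \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\), where \(SS_{res} = \sum_i (y_i - \hat{y}_i)^2\) is the residual sum of squares and \(SS_{tot} = \sum_i (y_i - \bar{y})^2\) is the total sum of squares. The sketch below computes it by hand on made-up values; the function name `r_squared` and the toy array are assumptions for illustration.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_true = np.asarray(y_true, dtype=float).ravel()
    y_pred = np.asarray(y_pred, dtype=float).ravel()
    ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
    return 1.0 - ss_res / ss_tot

# Hypothetical toy targets (not the real dataset).
y_true = np.array([3.0, 5.0, 7.0, 9.0])

perfect = r_squared(y_true, y_true)                       # predictions equal the targets
baseline = r_squared(y_true, np.full(4, y_true.mean()))   # always predict the mean
print(perfect, baseline)
```

Predictions that are worse than the mean baseline give a negative value, so R² on held-out data is not restricted to the range 0 to 1.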

r2_score(y_val.reshape(-1, 1),prediction)
0.7587703509647371

::: {.cell _uuid=‘6e0dc97a88b9fc1d4e975c2fe511e59bd0cd2b8a’ execution_count=31}

plt.scatter(X_train, y_train)
plt.plot(X_train, (w1 * X_train )+ b, 'r')
plt.show()

:::

Model Evaluation

::: {.cell _uuid=‘0b64c5e3173c685b0715a93f0a77c759e90b2dff’ execution_count=32}

predictions = model.predict(X_test.reshape(-1,1))

:::

::: {.cell _uuid=‘58863bc73dfa751e6bade66b3b71f80be51d9ca6’ execution_count=33}

mean_squared_error(y_test, predictions)
3.734113047761241

:::

Checking the R-squared on the test set

::: {.cell _uuid=‘6ce19fc28741a4d2b558a377f2fd39c81abdb72e’ execution_count=34}

r_squared = r2_score(y_test, predictions)
r_squared
0.8149944458734971

:::

Visualizing the fit on the test set

::: {.cell _uuid=‘eb08ac34d4e148e3221adfe126072f108adbfa24’ execution_count=35}

plt.scatter(X_test, y_test)
plt.plot(X_test, (w1 * X_test )+ b, 'r')
plt.show()

:::