Sales Prediction

Lecture 6.3

Problem Statement

Build a model which predicts sales based on the money spent on different platforms for marketing.

Dataset - https://www.kaggle.com/datasets/ashydv/advertising-dataset

Performing Simple Linear Regression

Equation of linear regression
\(y = c + m_1x_1 + m_2x_2 + ... + m_nx_n\)

\(y\) is the response
\(c\) is the intercept
\(m_1\) is the coefficient for the first feature
\(m_n\) is the coefficient for the nth feature

In our case:

\(y = c + m_1 \times TV\)

The \(m\) values are called the model coefficients or model parameters.
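
For a single feature, the least-squares values of \(c\) and \(m_1\) have a closed form: the slope is the sum of cross-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the two means. A minimal sketch on made-up toy data (the arrays and the names `x`, `y`, `m1`, `c` are illustrative and not taken from the advertising dataset):

```python
import numpy as np

# Toy data lying exactly on y = 1 + 2x (illustrative values only)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Slope: sum of cross-deviations divided by sum of squared x-deviations
m1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
# Intercept: the fitted line passes through the point of means
c = y.mean() - m1 * x.mean()

print(m1, c)
```

This should agree with what `LinearRegression` returns for one feature, since both minimise the same squared-error objective.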

Linear regression is commonly described as relying on seven assumptions:

Linear Model
No multicollinearity in the data
Homoscedasticity of Residuals or Equal Variances
No Autocorrelation in residuals
Number of observations Greater than the number of predictors
Each observation is unique
Residuals are normally distributed

For details, see https://www.geeksforgeeks.org/assumptions-of-linear-regression/

::: {.cell _uuid=‘d68008018678c65564ddda5994cb05129f3ca72b’ execution_count=1}

# Dataset Handling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns


#Model Training
from sklearn.linear_model import LinearRegression

#Evaluation
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import f_regression

np.set_printoptions(suppress=True)

:::

1. Reading the Data

::: {.cell _uuid=‘1365d38deb407ea9c0f4e93830c5f9d4d65ebd9d’ execution_count=2}

advertising = pd.read_csv("advertising.csv")
advertising.head()

	TV	Radio	Newspaper	Sales
0	230.1	37.8	69.2	22.1
1	44.5	39.3	45.1	10.4
2	17.2	45.9	69.3	12.0
3	151.5	41.3	58.5	16.5
4	180.8	10.8	58.4	17.9

:::

::: {.cell _uuid=‘4f36948806d235d179b1a5c6b6c990a41afc6e4a’ scrolled=‘true’ execution_count=3}

advertising.shape

(200, 4)

:::

::: {.cell _uuid=‘9578033b7d507aa4d901b48de36931066cc00241’ execution_count=4}

advertising.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB

:::

::: {.cell _uuid=‘b817b9601c376627448453b03d79bf8f9dd02eac’ execution_count=5}

advertising.describe()

	TV	Radio	Newspaper	Sales
count	200.000000	200.000000	200.000000	200.000000
mean	147.042500	23.264000	30.554000	15.130500
std	85.854236	14.846809	21.778621	5.283892
min	0.700000	0.000000	0.300000	1.600000
25%	74.375000	9.975000	12.750000	11.000000
50%	149.750000	22.900000	25.750000	16.000000
75%	218.825000	36.525000	45.100000	19.050000
max	296.400000	49.600000	114.000000	27.000000

:::

2. Data Cleaning

::: {.cell _uuid=‘cf9580e58b78c0558d96f54272701b6d2d32a018’ execution_count=9}

advertising.isnull().sum()

TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64

:::

There are no missing values in the dataset, so no imputation or row removal is needed.
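
A zero null count is only one cleanliness check. As a rough sketch of two further checks (duplicate rows and negative values), the snippet below rebuilds the five rows printed by `advertising.head()` so that it runs without the CSV. The names `sample`, `n_missing`, `n_duplicates`, and `n_negative` are illustrative; on the real data the same calls would be applied to `advertising`.

```python
import pandas as pd

# First five rows as shown by advertising.head() above
sample = pd.DataFrame({
    "TV": [230.1, 44.5, 17.2, 151.5, 180.8],
    "Radio": [37.8, 39.3, 45.9, 41.3, 10.8],
    "Newspaper": [69.2, 45.1, 69.3, 58.5, 58.4],
    "Sales": [22.1, 10.4, 12.0, 16.5, 17.9],
})

n_missing = int(sample.isnull().sum().sum())   # missing cells
n_duplicates = int(sample.duplicated().sum())  # fully repeated rows
n_negative = int((sample < 0).sum().sum())     # negative spend or sales

print(n_missing, n_duplicates, n_negative)
```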

3. Exploratory Data Analysis

# Box plots of each column to check for outliers (plt is already imported above)
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))

advertising['TV'].plot.box(ax=axes[0,0])
advertising['Radio'].plot.box(ax=axes[0,1])
advertising['Newspaper'].plot.box(ax=axes[1,0])
advertising['Sales'].plot.box(ax=axes[1,1])
plt.show()

::: {.cell _uuid=‘2d6f716ebe182a58f9941c059256a09cc7f03703’ execution_count=11}

g = pd.plotting.scatter_matrix(advertising, figsize=(10,10))
plt.show()

:::

advertising.corr()

	TV	Radio	Newspaper	Sales
TV	1.000000	0.054809	0.056648	0.901208
Radio	0.054809	1.000000	0.354104	0.349631
Newspaper	0.056648	0.354104	1.000000	0.157960
Sales	0.901208	0.349631	0.157960	1.000000
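
The correlation table also allows a quick multicollinearity check. One common diagnostic is the variance inflation factor (VIF), which for standardised predictors equals the corresponding diagonal element of the inverse of the predictor correlation matrix. The sketch below copies the three predictor-to-predictor correlations from the table above; `R` and `vif` are illustrative names. It only matters when several predictors are used together, which is not the case for the single-feature model in this lecture.

```python
import numpy as np

# Predictor-to-predictor correlations copied from advertising.corr() above
# (order: TV, Radio, Newspaper)
R = np.array([
    [1.000000, 0.054809, 0.056648],
    [0.054809, 1.000000, 0.354104],
    [0.056648, 0.354104, 1.000000],
])

# VIF of predictor j is the j-th diagonal element of the inverse correlation matrix
vif = np.diag(np.linalg.inv(R))

for name, value in zip(["TV", "Radio", "Newspaper"], vif):
    print(f"{name}: {value:.3f}")
```

Values close to 1 suggest little multicollinearity. A frequently quoted rule of thumb treats values above roughly 5 to 10 as a warning sign.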

corr = advertising.corr()
f = plt.figure(figsize=(5, 5))
plt.matshow(corr, fignum=f.number)
plt.xticks(range(len(corr.columns)), corr.columns, fontsize=14, rotation=45)
plt.yticks(range(len(corr.columns)), corr.columns, fontsize=14)
cb = plt.colorbar()
cb.ax.tick_params(labelsize=14)
plt.title('Correlation Matrix', fontsize=16);

# Univariate F-test of each feature against Sales (one F-statistic and p-value per feature)
f_stats, p_values = f_regression(advertising[['TV', 'Radio', 'Newspaper']].to_numpy(), advertising['Sales'].to_numpy())
f_stats, p_values

(array([856.17671282,  27.57467815,   5.0667947 ]),
 array([0.        , 0.00000039, 0.02548744]))
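
For a single predictor, the F-statistic is a function of the Pearson correlation \(r\) between that predictor and the target and of the sample size \(n\): \(F = \frac{r^2}{1 - r^2}(n - 2)\). The sketch below recomputes the three values from the correlation table, assuming the default centring behaviour of `f_regression`. Because the correlations are rounded to six decimals, the last digits may differ slightly. The name `f_manual` is illustrative.

```python
import numpy as np

n = 200  # number of observations, from advertising.shape
# Correlation of each feature with Sales, from advertising.corr() above
r = np.array([0.901208, 0.349631, 0.157960])  # TV, Radio, Newspaper

# Univariate F-statistic with 1 and n - 2 degrees of freedom
f_manual = r**2 / (1 - r**2) * (n - 2)
print(f_manual)
```

TV has by far the largest F-statistic and the smallest p-value, which is consistent with choosing it as the single feature for the model.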

4. Model Building

Dataset Preparation

We first assign the feature variable (TV in this case) to X and the response variable (Sales) to y.

::: {.cell _uuid=‘ae7285c79fd678fad0ee4fb18f8923daf024838b’ execution_count=15}

X = advertising['TV'].to_numpy()
y = advertising['Sales'].to_numpy()

:::

Train-Val-Test Split

We now split the data into training, validation, and test sets using train_test_split from sklearn.model_selection. Here 70% of the rows go to the training set, and the remaining 30% is split evenly into a validation set and a test set (15% each). The validation set is used to check the model during development, while the test set is held back for the final evaluation.

::: {.cell _uuid=‘997311202075aaa98631ef95c1a0d91cdbefa2af’ execution_count=16}

X_train, X_test_and_val, y_train, y_test_and_val = train_test_split(X, y, train_size = 0.7,random_state = 100)
X_val, X_test, y_val, y_test = train_test_split(X_test_and_val, y_test_and_val, test_size = 0.5, random_state = 100)

:::

print(len(X_train),len(X_val),len(X_test))

140 30 30
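
The two-step split can be checked in isolation. The sketch below uses placeholder arrays of the same length as the dataset (200 rows) instead of the real columns; `X_demo`, `y_demo`, and the other names are illustrative. It confirms the 140/30/30 sizes and that no row lands in more than one subset.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the same length as the dataset (200 rows)
X_demo = np.arange(200, dtype=float)
y_demo = 2.0 * X_demo

# Step 1: keep 70% for training, hold out 30%
X_tr, X_rest, y_tr, y_rest = train_test_split(
    X_demo, y_demo, train_size=0.7, random_state=100
)
# Step 2: split the held-out 30% in half: 15% validation, 15% test
X_va, X_te, y_va, y_te = train_test_split(
    X_rest, y_rest, test_size=0.5, random_state=100
)

sizes = (len(X_tr), len(X_va), len(X_te))
# Every original row should appear in exactly one subset
n_unique = len(set(X_tr) | set(X_va) | set(X_te))
print(sizes, n_unique)
```

Fixing `random_state` makes the split reproducible, so the metrics reported later can be regenerated exactly.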

Building a Linear Model

model = LinearRegression()

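# scikit-learn expects features as a 2-D array of shape (n_samples, n_features),
# so the 1-D TV array is reshaped into a single column
X_train.reshape(-1, 1)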

array([[213.4],
       [151.5],
       [205. ],
       [142.9],
       [134.3],
       [ 80.2],
       [239.8],
       [ 88.3],
       [ 19.4],
       [225.8],
       [136.2],
       [ 25.1],
       [ 38. ],
       [172.5],
       [109.8],
       [240.1],
       [232.1],
       [ 66.1],
       [218.4],
       [234.5],
       [ 23.8],
       [ 67.8],
       [296.4],
       [141.3],
       [175.1],
       [220.5],
       [ 76.4],
       [253.8],
       [191.1],
       [287.6],
       [100.4],
       [228. ],
       [125.7],
       [ 74.7],
       [ 57.5],
       [262.7],
       [262.9],
       [237.4],
       [227.2],
       [199.8],
       [228.3],
       [290.7],
       [276.9],
       [199.8],
       [239.3],
       [ 73.4],
       [284.3],
       [147.3],
       [224. ],
       [198.9],
       [276.7],
       [ 13.2],
       [ 11.7],
       [280.2],
       [ 39.5],
       [265.6],
       [ 27.5],
       [280.7],
       [ 78.2],
       [163.3],
       [213.5],
       [293.6],
       [ 18.7],
       [ 75.5],
       [166.8],
       [ 44.7],
       [109.8],
       [  8.7],
       [266.9],
       [206.9],
       [149.8],
       [ 19.6],
       [ 36.9],
       [199.1],
       [265.2],
       [165.6],
       [140.3],
       [230.1],
       [  5.4],
       [ 17.9],
       [237.4],
       [286. ],
       [ 93.9],
       [292.9],
       [ 25. ],
       [ 97.5],
       [ 26.8],
       [281.4],
       [ 69.2],
       [ 43.1],
       [255.4],
       [239.9],
       [209.6],
       [  7.3],
       [240.1],
       [102.7],
       [243.2],
       [137.9],
       [ 18.8],
       [ 17.2],
       [ 76.4],
       [139.5],
       [261.3],
       [ 66.9],
       [ 48.3],
       [177. ],
       [ 28.6],
       [180.8],
       [222.4],
       [193.7],
       [ 59.6],
       [131.7],
       [  8.4],
       [ 13.1],
       [  4.1],
       [  0.7],
       [ 76.3],
       [250.9],
       [273.7],
       [ 96.2],
       [210.8],
       [ 53.5],
       [ 90.4],
       [104.6],
       [283.6],
       [ 95.7],
       [204.1],
       [ 31.5],
       [182.6],
       [289.7],
       [156.6],
       [107.4],
       [ 43. ],
       [248.4],
       [116. ],
       [110.7],
       [187.9],
       [139.3],
       [ 62.3],
       [  8.6]])

::: {.cell _uuid=‘b80a766082e6c9c40c3f09499fec4cfc51f62763’ execution_count=25}

model.fit(X_train.reshape(-1, 1),y_train)

LinearRegression()

:::

::: {.cell _uuid=‘fd4287b550d2f05555ae3e18d6f497912424f8cf’ execution_count=27}

# Print the fitted parameters: the intercept b (c in the equation above)
# and the slope w1 (m_1 in the equation above)
b = model.intercept_ 
w1 = model.coef_[0]

print(str(w1)+"x"+"+"+str(b))

0.05454575291590793x+6.948683200001362

:::
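
The fitted line can be read as follows: each additional unit of TV spend is associated with roughly 0.055 additional units of Sales, on top of a baseline of about 6.95 at zero TV spend. A small sketch of a prediction by hand, using the parameters printed above; `tv_spend` is a hypothetical budget chosen only for illustration.

```python
# Fitted parameters printed above
w1 = 0.05454575291590793  # slope: change in Sales per unit of TV spend
b = 6.948683200001362     # intercept: predicted Sales at zero TV spend

tv_spend = 100.0  # hypothetical TV budget, in the dataset's units
predicted_sales = w1 * tv_spend + b
print(round(predicted_sales, 2))
```

This is the same computation that `model.predict` performs for each row.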

prediction = model.predict(X_val.reshape(-1, 1))
prediction

array([13.52144643, 18.86693021, 13.1068987 , 17.1923756 , 19.94148154,
       10.71234015, 20.13239168, 11.05597839,  9.03233096, 17.60692332,
       13.66326538, 18.77420243, 15.11418241, 12.25053038, 18.44147334,
       17.72692398, 14.09963141, 11.70507285, 17.99419817, 14.32326899,
        9.67597085, 10.6796127 , 13.34144544, 10.79961336, 12.08689312,
       16.60328147, 17.48692266, 18.82329361, 17.03419291, 18.75238413])

mean_squared_error(y_val, prediction)

4.42099969589266
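
Mean squared error is the average of the squared differences between the observed and predicted values, so it is expressed in squared units of Sales. Taking the square root (RMSE) brings it back to the units of Sales. The sketch below checks the definition on made-up numbers and then converts the validation MSE reported above; `y_true` and `y_pred` are illustrative arrays, not the validation data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative values only, not taken from the validation split
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 14.0, 22.0])

# MSE by definition, compared against scikit-learn's implementation
mse_manual = np.mean((y_true - y_pred) ** 2)
mse_sklearn = mean_squared_error(y_true, y_pred)

# RMSE of the validation MSE reported above, in the units of Sales
rmse_val = np.sqrt(4.42099969589266)
print(mse_manual, mse_sklearn, round(rmse_val, 2))
```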

R²

R-squared (R², the coefficient of determination) measures the proportion of variance in the dependent variable that is explained by the independent variable(s). In other words, it indicates how well the regression model fits the data (the goodness of fit).

r2_score(y_val, prediction)

0.7587703509647371
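
R² can be computed directly as \(R^2 = 1 - \frac{SS_{res}}{SS_{tot}}\), where \(SS_{res}\) is the sum of squared residuals and \(SS_{tot}\) is the sum of squared deviations of the observed values from their mean. A minimal sketch on made-up numbers, compared against `r2_score`; the arrays are illustrative and not the validation data.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative values only, not taken from the validation split
y_true = np.array([10.0, 12.0, 15.0, 20.0])
y_pred = np.array([11.0, 11.5, 14.0, 22.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation around the mean
r2_manual = 1 - ss_res / ss_tot

print(r2_manual, r2_score(y_true, y_pred))
```

On held-out data R² can be negative, which happens when the model predicts worse than simply using the mean of the observed values.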

::: {.cell _uuid=‘6e0dc97a88b9fc1d4e975c2fe511e59bd0cd2b8a’ execution_count=31}

plt.scatter(X_train, y_train)
plt.plot(X_train, (w1 * X_train )+ b, 'r')
plt.show()

:::

5. Model Evaluation

::: {.cell _uuid=‘0b64c5e3173c685b0715a93f0a77c759e90b2dff’ execution_count=32}

predictions = model.predict(X_test.reshape(-1,1))

:::

::: {.cell _uuid=‘58863bc73dfa751e6bade66b3b71f80be51d9ca6’ execution_count=33}

mean_squared_error(y_test, predictions)

3.734113047761241

:::

Checking the R-squared on the test set

::: {.cell _uuid=‘6ce19fc28741a4d2b558a377f2fd39c81abdb72e’ execution_count=34}

r_squared = r2_score(y_test, predictions)
r_squared

0.8149944458734971

:::

Visualizing the fit on the test set

::: {.cell _uuid=‘eb08ac34d4e148e3221adfe126072f108adbfa24’ execution_count=35}

plt.scatter(X_test, y_test)
plt.plot(X_test, (w1 * X_test )+ b, 'r')
plt.show()

:::
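
Several of the assumptions listed at the start concern the residuals, so a natural follow-up is to inspect them. The sketch below is one possible way to do that, run on synthetic data that only mimics the shape of the TV and Sales relationship; the generating values (7.0, 0.055, noise scale 2.0) and all variable names are assumptions made for illustration. On the real data the residuals would be `y_test - predictions`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for (TV, Sales); not the advertising data
rng = np.random.default_rng(0)
x = rng.uniform(0, 300, size=200)
y = 7.0 + 0.055 * x + rng.normal(0, 2.0, size=200)

fit = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - fit.predict(x.reshape(-1, 1))

# Durbin-Watson statistic: values near 2 suggest no first-order autocorrelation
dw = np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

# Rough homoscedasticity check: residual spread in the lower vs upper half of x
low = residuals[x < np.median(x)].std()
high = residuals[x >= np.median(x)].std()

print(round(dw, 2), round(low, 2), round(high, 2))
```

The Durbin-Watson statistic depends on row order, so it is most meaningful when observations are ordered in time. Similar spreads in the two halves are consistent with equal variances, although a residual-versus-fitted plot gives a fuller picture.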