Sales Prediction

Lecture 6.3

Problem Statement

Build a model which predicts sales based on the money spent on different platforms for marketing.

Dataset -

Performing Simple Linear Regression

Equation of linear regression
\(y = c + m_1x_1 + m_2x_2 + ... + m_nx_n\)

  • \(y\) is the response
  • \(c\) is the intercept
  • \(m_1\) is the coefficient for the first feature
  • \(m_n\) is the coefficient for the nth feature

In our case:

\(y = c + m_1 \times TV\)

The \(m\) values are called the model coefficients or model parameters.

Linear regression is commonly described as relying on seven main assumptions:

  • Linear Model
  • No multicollinearity in the data
  • Homoscedasticity of Residuals or Equal Variances
  • No Autocorrelation in residuals
  • Number of observations Greater than the number of predictors
  • Each observation is unique
  • Residuals are normally distributed

Please refer :
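Two of the assumptions listed above can be checked numerically. The following is a minimal sketch on synthetic data, not the lecture's own code: the helper names `variance_inflation_factors` and `durbin_watson` and the generated columns are assumptions made for illustration. A variance inflation factor well above roughly 5 to 10 is commonly read as a sign of multicollinearity, and a Durbin-Watson statistic near 2 suggests little first-order autocorrelation in the residuals.

```python
import numpy as np
from sklearn.linear_model import LinearRegression


def variance_inflation_factors(X):
    """VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j on the other columns."""
    vifs = []
    for j in range(X.shape[1]):
        others = np.delete(X, j, axis=1)
        r2 = LinearRegression().fit(others, X[:, j]).score(others, X[:, j])
        vifs.append(1.0 / (1.0 - r2))
    return np.array(vifs)


def durbin_watson(residuals):
    """Sum of squared successive differences divided by the sum of squared residuals."""
    diffs = np.diff(residuals)
    return np.sum(diffs ** 2) / np.sum(residuals ** 2)


# Synthetic predictors (hypothetical): x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(0)
n = 500
x1 = rng.normal(size=n)
x2 = x1 + 0.1 * rng.normal(size=n)
x3 = rng.normal(size=n)
X_demo = np.column_stack([x1, x2, x3])
y_demo = 2.0 + 1.5 * x1 + 0.5 * x3 + rng.normal(size=n)

vifs = variance_inflation_factors(X_demo)

# Residuals of a linear fit, used for the autocorrelation check
fitted = LinearRegression().fit(X_demo, y_demo)
residuals = y_demo - fitted.predict(X_demo)
dw = durbin_watson(residuals)

print("VIF per column:", vifs)
print("Durbin-Watson:", dw)
```

In this sketch the first two columns should show large VIF values because they carry nearly the same information, while the third stays close to 1.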


# Dataset Handling
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Data Visualisation
import matplotlib.pyplot as plt 
import seaborn as sns

#Model Training
from sklearn.linear_model import LinearRegression

from sklearn.metrics import mean_squared_error, r2_score
from sklearn.feature_selection import f_regression



1. Reading the Data

# pd.read_csv already returns a DataFrame, so no extra wrapping is needed
advertising = pd.read_csv("advertising.csv")
advertising.head()

      TV  Radio  Newspaper  Sales
0  230.1   37.8       69.2   22.1
1   44.5   39.3       45.1   10.4
2   17.2   45.9       69.3   12.0
3  151.5   41.3       58.5   16.5
4  180.8   10.8       58.4   17.9


advertising.shape
(200, 4)


advertising.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 200 entries, 0 to 199
Data columns (total 4 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   TV         200 non-null    float64
 1   Radio      200 non-null    float64
 2   Newspaper  200 non-null    float64
 3   Sales      200 non-null    float64
dtypes: float64(4)
memory usage: 6.4 KB


advertising.describe()

               TV       Radio   Newspaper       Sales
count  200.000000  200.000000  200.000000  200.000000
mean   147.042500   23.264000   30.554000   15.130500
std     85.854236   14.846809   21.778621    5.283892
min      0.700000    0.000000    0.300000    1.600000
25%     74.375000    9.975000   12.750000   11.000000
50%    149.750000   22.900000   25.750000   16.000000
75%    218.825000   36.525000   45.100000   19.050000
max    296.400000   49.600000  114.000000   27.000000


2. Data Cleaning

# Count missing values in each column
advertising.isnull().sum()
TV           0
Radio        0
Newspaper    0
Sales        0
dtype: int64


There are no null values in the dataset, so no imputation or row removal is needed.

3. Exploratory Data Analysis

# matplotlib.pyplot is already imported as plt at the top of the notebook
fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10, 10))



g = pd.plotting.scatter_matrix(advertising, figsize=(10,10))


advertising.corr()

                 TV     Radio  Newspaper     Sales
TV         1.000000  0.054809   0.056648  0.901208
Radio      0.054809  1.000000   0.354104  0.349631
Newspaper  0.056648  0.354104   1.000000  0.157960
Sales      0.901208  0.349631   0.157960  1.000000

# Plot the correlation matrix as a heatmap with the column names as tick labels
f = plt.figure(figsize=(5, 5))
plt.matshow(advertising.corr(), fignum=f.number)
numeric_cols = advertising.select_dtypes(['number']).columns
plt.xticks(range(len(numeric_cols)), numeric_cols, fontsize=14, rotation=45)
plt.yticks(range(len(numeric_cols)), numeric_cols, fontsize=14)
cb = plt.colorbar()
plt.title('Correlation Matrix', fontsize=16);

f_stats, p_values = f_regression(advertising[['TV','Radio','Newspaper']].to_numpy(),advertising['Sales'].to_numpy())
f_stats, p_values
(array([856.17671282,  27.57467815,   5.0667947 ]),
 array([0.        , 0.00000039, 0.02548744]))
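For a single predictor, the F-statistic reported by `f_regression` is tied directly to the Pearson correlation r between that predictor and the response: F = r² / (1 − r²) × (n − 2). With the TV and Sales correlation of about 0.9012 and n = 200, this gives roughly 856, in line with the first value above. The sketch below checks the relationship on synthetic data; the arrays `X_demo` and `y_demo` are hypothetical stand-ins, not the advertising dataset.

```python
import numpy as np
from sklearn.feature_selection import f_regression

# Synthetic data (hypothetical): three predictors with decreasing influence on the response
rng = np.random.default_rng(0)
n = 200
X_demo = rng.normal(size=(n, 3))
y_demo = 3.0 * X_demo[:, 0] + 0.5 * X_demo[:, 1] + rng.normal(size=n)

# Univariate F-statistics and p-values, one per column
f_stats_demo, p_values_demo = f_regression(X_demo, y_demo)

# The same F-statistics rebuilt from the Pearson correlation of each column with y
r = np.array([np.corrcoef(X_demo[:, j], y_demo)[0, 1] for j in range(X_demo.shape[1])])
f_manual = r ** 2 / (1.0 - r ** 2) * (n - 2)

print(f_stats_demo)
print(f_manual)
```

Because each predictor is scored on its own, a large F-statistic says that the feature is linearly related to the response in isolation; it says nothing about redundancy between features.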

Model Building

Dataset Preparation

We first assign the feature variable (TV in this case) to the variable X and the response variable (Sales) to the variable y.


X = advertising['TV'].to_numpy()
y = advertising['Sales'].to_numpy()


Train-Val-Test Split

We now split the data into training, validation, and test sets using train_test_split from sklearn.model_selection. A common choice is to keep 70% of the data for training. Here the remaining 30% is divided equally between the validation and test sets, giving a 70/15/15 split.


X_train, X_test_and_val, y_train, y_test_and_val = train_test_split(X, y, train_size = 0.7,random_state = 100)
X_val, X_test, y_val, y_test = train_test_split(X_test_and_val, y_test_and_val, test_size = 0.5, random_state = 100)


print(len(X_train), len(X_val), len(X_test))
140 30 30

Building a Linear Model

model = LinearRegression()
# scikit-learn expects a 2-D feature array, so reshape the 1-D TV values to shape (n_samples, 1)
model.fit(X_train.reshape(-1, 1), y_train)

# Extract and print the fitted parameters, i.e. the intercept and the slope of the regression line
b = model.intercept_
w1 = model.coef_[0]
print(b, w1)



# Predict Sales for the validation set
prediction = model.predict(X_val.reshape(-1, 1))
prediction
array([13.52144643, 18.86693021, 13.1068987 , 17.1923756 , 19.94148154,
       10.71234015, 20.13239168, 11.05597839,  9.03233096, 17.60692332,
       13.66326538, 18.77420243, 15.11418241, 12.25053038, 18.44147334,
       17.72692398, 14.09963141, 11.70507285, 17.99419817, 14.32326899,
        9.67597085, 10.6796127 , 13.34144544, 10.79961336, 12.08689312,
       16.60328147, 17.48692266, 18.82329361, 17.03419291, 18.75238413])
# Mean squared error on the validation set; y_val and prediction are both 1-D, so no reshape is needed
mean_squared_error(y_val, prediction)


R-squared (R², the coefficient of determination) measures the proportion of variance in the dependent variable that is explained by the independent variables. In other words, R-squared indicates how well the regression model fits the data (the goodness of fit). A value of 1 means a perfect fit, while a value of 0 means the model does no better than always predicting the mean.

r2_score(y_val, prediction)
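Both validation metrics used here can be written out by hand, which makes their meaning concrete: the mean squared error is the average squared residual, and R² is one minus the ratio of the residual sum of squares to the total sum of squares around the mean. The sketch below uses four made-up values (`y_true` and `y_pred` are hypothetical, not taken from the dataset) and compares the manual results with scikit-learn.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example (hypothetical values)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 10.0])

residuals = y_true - y_pred

# Mean squared error: average of the squared residuals
mse_manual = np.mean(residuals ** 2)

# R-squared: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_pred))
print(r2_manual, r2_score(y_true, y_pred))
```

Note that R² computed this way can become negative on held-out data when the model predicts worse than the mean of the targets.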


# Training data with the fitted regression line overlaid in red
plt.scatter(X_train, y_train)
plt.plot(X_train, w1 * X_train + b, 'r')


Model Evaluation


predictions = model.predict(X_test.reshape(-1,1))



mean_squared_error(y_test, predictions)


Checking the R-squared on the test set


r_squared = r2_score(y_test, predictions)
r_squared


Visualizing the fit on the test set


# Test data with the fitted regression line overlaid in red
plt.scatter(X_test, y_test)
plt.plot(X_test, w1 * X_test + b, 'r')
