Difference makes the DIFFERENCE
In this project we will work with a fake advertising data set that indicates whether or not a particular internet user clicked on an advertisement. We will build a model that predicts whether a user will click on an ad based on that user's features.
This data set contains the following features:
Import a few libraries you think you'll need (or just import them as you go along!)
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
from sklearn import model_selection
from sklearn import linear_model
from sklearn import metrics
Read in the advertising.csv file and set it to a data frame called ad_data.
ad_data = pd.read_csv('/content/advertising.csv')
Check the head of ad_data
ad_data.head()
Use info() and describe() on ad_data
ad_data.describe()
ad_data.info()
Let's use seaborn to explore the data!
Try recreating the plots shown below!
Create a histogram of the Age
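# set_style is a function; assigning a string to it would overwrite it and leave the style unchanged
sns.set_style("whitegrid")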
sns.histplot(x = 'Age', data = ad_data, bins =30)
ad_data['Age'].hist(bins = 30)
Create a jointplot showing Area Income versus Age.
sns.jointplot(x = 'Age', y= 'Area Income', data = ad_data, hue = "Clicked on Ad")
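Create a jointplot showing the kde distributions of Daily Time Spent on Site vs. Age.
# Plot the two columns named in the prompt; with hue, colors are controlled by palette rather than cmap
sns.jointplot(kind = 'kde', x = "Age", y = "Daily Time Spent on Site", data = ad_data, hue = "Clicked on Ad", palette = "Blues")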
Create a jointplot of 'Daily Time Spent on Site' vs. 'Daily Internet Usage'
# "Green" is not a valid colormap name; with hue, pass a palette such as "Greens" instead
sns.jointplot(x = "Daily Time Spent on Site", y = "Daily Internet Usage", hue = "Clicked on Ad", data = ad_data, palette = "Greens")
Finally, create a pairplot with the hue defined by the 'Clicked on Ad' column feature.
sns.pairplot(hue = "Clicked on Ad", data = ad_data)
Now it's time to do a train test split, and train our model!
You'll have the freedom here to choose columns that you want to train on!
Split the data into training set and testing set using train_test_split
sns.heatmap(ad_data.isnull())
ad_data.info()
X = ad_data.drop(['Clicked on Ad'], axis = 1)
y = ad_data['Clicked on Ad']
X.head(1)
# Drop the free-text, categorical, and timestamp columns, which the model cannot use without encoding
X.drop(['Ad Topic Line', 'City', 'Country', 'Timestamp'], inplace = True, axis = 1)
X.head(2)
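Instead of naming each non-numeric column, the same kind of feature matrix could be built by keeping only the numeric columns. The sketch below assumes a tiny made-up frame (demo, with invented values) in place of ad_data, so the column names and numbers are illustrative only.

```python
import pandas as pd

# Hypothetical stand-in for ad_data; the values are invented for illustration
demo = pd.DataFrame({
    "Age": [30, 45],
    "Area Income": [50000.0, 62000.0],
    "City": ["A", "B"],
    "Clicked on Ad": [0, 1],
})

# Remove the target, then keep only numeric columns as features
X_demo = demo.drop(columns=["Clicked on Ad"]).select_dtypes(include="number")
y_demo = demo["Clicked on Ad"]

print(list(X_demo.columns))
```

Note that this keeps every numeric column, so any numeric column you do not want as a feature would still need to be dropped by name.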
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=101)
Train and fit a logistic regression model on the training set.
logReg = LogisticRegression()
logReg.fit(X_train, y_train)
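The features here sit on very different scales (ages in the tens, incomes in the tens of thousands), and with unscaled inputs the default solver may emit a ConvergenceWarning. One common option is to standardize the features before fitting. The sketch below is an assumption about how that could look, not part of the original exercise, and it uses randomly generated data (age, income, y_demo) rather than the advertising file.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two features on very different scales
rng = np.random.default_rng(0)
age = rng.uniform(18, 60, size=200)
income = rng.uniform(15000, 80000, size=200)
X_demo = np.column_stack([age, income])
y_demo = (age > 40).astype(int)  # made-up target that depends only on age

# Standardize each feature, then fit logistic regression on the scaled values
scaled_model = make_pipeline(StandardScaler(), LogisticRegression())
scaled_model.fit(X_demo, y_demo)

demo_accuracy = scaled_model.score(X_demo, y_demo)
print(demo_accuracy)
```

With the real data, the same pipeline object could be fitted on X_train and y_train in place of logReg.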
Now predict values for the testing data.
predictions = logReg.predict(X_test)
Create a classification report for the model.
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
print(classification_report(y_test, predictions))
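confusion_matrix is imported above but never called. A minimal sketch of how it could complement the classification report follows; the two label lists are hypothetical stand-ins for y_test and predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical labels standing in for y_test and predictions
y_true_demo = [0, 0, 1, 1, 1, 0]
y_pred_demo = [0, 1, 1, 1, 0, 0]

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true_demo, y_pred_demo)

# For a binary problem, ravel() yields the counts in this order
tn, fp, fn, tp = cm.ravel()
print(cm)
print("TN:", tn, "FP:", fp, "FN:", fn, "TP:", tp)
```

On the real split, the call would be confusion_matrix(y_test, predictions).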