Introduction and Thank you

This is the first post of a series inspired by the wonderful course given by Kirill Eremenko and Hadelin de Ponteves on Udemy. I have finished their course and this is my take on their teachings, and I'd like to start by thanking those two great men! If you want a decent education in Data Science with Python and R, please do take their course. I am sure you'll come away with a solid understanding of what ML is without getting bogged down in statistics and in-depth maths.

Simple Linear Regression

In this regression model we believe there is a correlation between our two variables, but we don't know what it is. The usual example given in this scenario is the correlation between Years of Experience and Salary. We expect that a person with more experience will have a higher salary: as YoE increases, Salary also increases. This is our expectation! Yet the dataset might prove that there is no correlation between the two variables; for example, the firm we are researching might pay no attention to experience at all. If we can find a correlation, we can also estimate the salary of a new employee based on their years of experience.

The formula of Simple Linear Regression is:

y = b0 + b1*x1


Dependent Variable = Constant + Coefficient * Independent Variable
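To make the formula concrete, here is a minimal sketch with made-up numbers: the intercept of 30000 and the coefficient of 9000 per year are illustrative only, not values from the actual dataset.

```python
# Illustrative (made-up) parameters for y = b0 + b1*x1
b0 = 30000.0   # constant: the salary at zero years of experience
b1 = 9000.0    # coefficient: salary increase per year of experience

def predict_salary(years_of_experience):
    """Apply y = b0 + b1*x1 to a single observation."""
    return b0 + b1 * years_of_experience

print(predict_salary(5))  # -> 75000.0
```

A trained regressor does exactly this arithmetic, except that it learns b0 and b1 from the data instead of having them hand-picked.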



Training a Linear Regressor in Python is quite easy. All we need are a few libraries to import the data, train the machine "learner" to get a model, and plot the model alongside the real data. To do this we'll need:

  • matplotlib to plot the graphics
  • pandas to import the data, which is a comma-separated values (CSV) file in our case
  • sklearn to split the dataset and do the actual fitting

You can use the Anaconda distribution to get a proper setup with all of these libraries (and more!) preinstalled, or you can use

pip install pandas matplotlib scikit-learn scipy

to install the required libraries (note that the pip package is named scikit-learn, not sklearn). That said, Anaconda is the preferred (and easiest) method!

Dataset and Splitting

For this example I used Bhargav's dataset from Kaggle. You'll see that in the dataset-splitting section I used a 1/3 ratio. This is a rule-of-thumb value, but you can change it and play with it depending on the data.
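To show what the 1/3 ratio does, here is a minimal sketch on synthetic data (not the Kaggle file): with 30 observations, test_size=1/3 keeps 20 for training and 10 for testing.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(30).reshape(-1, 1)  # 30 fake observations, one feature
y = np.arange(30)                 # 30 fake target values

# test_size=1/3 keeps two thirds for training, one third for testing;
# random_state=0 makes the shuffle reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

print(len(X_train), len(X_test))  # -> 20 10
```

Because the split is shuffled, fixing random_state is what makes the same rows land in the same sets every run.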

Explanation of Terms: Fit and Predict


In Machine Learning the term fit means to try to find the correlation between two sets of values. So when we execute the statement regressor.fit(X_train, y_train), this basically says: find a Linear Regression correlation between those two datasets.
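One way to see what fit actually finds is to run it on perfectly linear synthetic data: the regressor should recover the slope and intercept we used to generate it. This is a sketch with made-up numbers, not the Kaggle dataset.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = 5.0 + 2.0 * X.ravel()          # generated from y = 5 + 2*x, no noise

regressor = LinearRegression()
regressor.fit(X, y)

# fit() has estimated b0 and b1 from the data:
print(round(regressor.intercept_, 3))  # -> 5.0  (the constant b0)
print(round(regressor.coef_[0], 3))    # -> 2.0  (the coefficient b1)
```

On real, noisy data the recovered coefficients won't match any "true" values exactly; fit finds the line that minimizes the squared errors.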


When we use y_pred = regressor.predict(X_test), the model will "guess" the y values based on the values of X_test. Then we compare these estimates with the real values, and this need for comparison is the main reason why we split the data into training and test sets. We "train" the machine with the training data and see how accurate it is on the test data.
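That comparison can also be made numeric rather than visual. As a sketch (on synthetic salary-like data, with mean absolute error picked as one simple accuracy measure, not something the original course prescribes):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(30, 1))                    # fake years of experience
y = 30000 + 9000 * X.ravel() + rng.normal(0, 500, 30)   # noisy fake salaries

X_train, X_test = X[:20], X[20:]   # simple 2/3 vs 1/3 split
y_train, y_test = y[:20], y[20:]

regressor = LinearRegression().fit(X_train, y_train)
y_pred = regressor.predict(X_test)

# Average gap between guessed and real test salaries; closer to zero is better.
print(mean_absolute_error(y_test, y_pred))
```

The key point is that the error is measured on X_test, data the regressor never saw during fit, which is exactly why the split exists.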

# Import the Libraries 
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset
dataset = pd.read_csv('Kaggle_Data.csv')
X = dataset.iloc[:, :-1].values  # independent variable: Years of Experience
y = dataset.iloc[:, 1].values    # dependent variable: Salary

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 1/3, random_state = 0)

# Fit SLR to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the Test Set Results
y_pred = regressor.predict(X_test)

# Visualize the Training Set Results
plt.scatter(X_train, y_train, color="red")
plt.plot(X_train, regressor.predict(X_train), color="blue")
plt.title("Salary vs. Experience (Training Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

# Visualize the Test Set Results
plt.scatter(X_test, y_test, color="red")
plt.plot(X_test, regressor.predict(X_test), color="blue")
plt.title("Salary vs. Experience (Test Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()