Introduction and Thank you
This is the first post of a series inspired by the wonderful course given by Kirill Eremenko and Hadelin de Ponteves on Udemy. I have finished their course, and this is my take on their teachings. I'd like to start by thanking those two great men! If you want a decent education in Data Science with Python and R, please do take their course. I am sure you'll come away with a solid understanding of what ML is without getting bogged down in statistics and in-depth maths.
Simple Linear Regression
In this regression model we believe there is a correlation between our two variables, but we don't know what it is. The usual example in this scenario is the correlation between Years of Experience and Salary. We expect that a person with more experience will have a higher salary: as YoE increases, Salary also increases. This is our expectation! Yet the dataset might show there is no correlation between the two variables; for example, the firm we are researching might pay no attention to experience at all. If we can find a correlation, we can also estimate the salary of a new employee based on their years of experience.
The formula of Simple Linear Regression is:
y = b0 + b1*x1
Dependent Variable = Constant + Coefficient * Independent Variable
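To make the formula concrete, the constant b0 and coefficient b1 can be computed directly with the standard least-squares formulas. A minimal sketch using NumPy and made-up experience/salary numbers (not the real dataset):

```python
import numpy as np

# Hypothetical years-of-experience and salary values, just for illustration
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([40_000.0, 45_000.0, 52_000.0, 55_000.0, 61_000.0])

# Least-squares estimates: b1 is the coefficient, b0 the constant
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()

print(b0, b1)  # the fitted line is y = b0 + b1 * x
```

This is exactly what the sklearn regressor below computes for us behind the scenes.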
Training a Linear Regressor in Python is quite easy. All we need are a few libraries to import the data, train the machine "learner" to get a model, and plot both the model and the real data. To do this we'll need:
matplotlib to plot the graphics
pandas to import the data, which is a comma-separated value file in our case
sklearn (scikit-learn) to split the dataset and do the actual fitting
You can use the anaconda distribution to have a proper setup with all of the libraries (and more!) coming in or you can use
pip install pandas matplotlib scikit-learn scipy
to install the required libraries. Anaconda is the preferred (and easiest) method, though. Caveat emptor!
Dataset and Splitting
For this example I used a dataset by Bhargav from Kaggle. You'll see that in the dataset-splitting section I used a 1/3 ratio. This is a rule-of-thumb ratio, but you can change it and play with it depending on the data.
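The 1/3 ratio means a third of the rows are held back for testing. A quick sketch with dummy data standing in for the Kaggle file, just to show what the split produces:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy stand-in for the Kaggle dataset, 30 rows
X = np.arange(30).reshape(-1, 1)
y = np.arange(30)

# Hold back 1/3 of the rows for testing; random_state makes it reproducible
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=1/3, random_state=0)

print(len(X_train), len(X_test))  # 20 training rows, 10 test rows
```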
Explanation of Terms: Fit and Predict
In Machine Learning the term fit means to try and find the correlation between two sets of values. So when we execute the statement regressor.fit(X_train, y_train), this basically says: find a Linear Regression correlation between those two datasets.
When we use y_pred = regressor.predict(X_test), the model will "guess" the y values based on the values of X_test. Then we compare these guesses with the real values, and this need for comparison is the main reason we split the data into training and test sets. We "train" the machine with the training data and see how accurate it is with the test data.
# Import the libraries
import matplotlib.pyplot as plt
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Import the dataset
dataset = pd.read_csv('Kaggle_Data.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, 1].values

# Split the dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=1/3, random_state=0)

# Fit SLR to the Training set
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predict the Test set results
y_pred = regressor.predict(X_test)

# Visualize the Training set results
plt.scatter(X_train, y_train, color="red")
plt.plot(X_train, regressor.predict(X_train), color="blue")
plt.title("Salary vs. Experience (Training Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()

# Visualize the Test set results
plt.scatter(X_test, y_test, color="red")
plt.plot(X_test, regressor.predict(X_test), color="blue")
plt.title("Salary vs. Experience (Test Set)")
plt.xlabel("Years of Experience")
plt.ylabel("Salary")
plt.show()
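Besides eyeballing the plots, the comparison between y_pred and the real test values can be quantified with a metric such as R². A minimal sketch on synthetic, perfectly linear data (not the Kaggle set), where the score comes out as a perfect 1.0:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score

# Tiny synthetic dataset standing in for the salary data (y = 32000 + 6000x)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([38_000.0, 44_000.0, 50_000.0, 56_000.0, 62_000.0, 68_000.0])

regressor = LinearRegression()
regressor.fit(X[:4], y[:4])          # "train" on the first four rows
y_pred = regressor.predict(X[4:])    # "guess" the last two salaries

# R^2 of 1.0 means the guesses match the held-out values exactly
score = r2_score(y[4:], y_pred)
print(score)
```

On real, noisy data the score will be below 1.0, and that gap is exactly what tells you how accurate the model is on unseen data.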