Simple Linear Regression using scikit-learn
Linear regression is a statistical model that describes the linear relationship between a dependent variable and one or more independent variables, and uses that relationship to make predictions.
Here we demonstrate a linear regression model using the scikit-learn library in Python.
Scikit-learn, imported as sklearn, is a Python library that provides efficient tools for machine learning and statistical modeling, including classification, regression, clustering, and dimensionality reduction. It features algorithms such as support vector machines, random forests, and k-nearest neighbors.
The dataset used for this model contains the Experience and Salary of Employees. The Salary is based on the Years of Experience of the employee. We are going to derive a linear relationship between the years of experience and the salary.
You can download the dataset here: Dataset
Implementation of the linear regression model using sklearn:
Step 1: Import all the required libraries for the model.
# importing the required libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
Step 2: Import the dataset.
# read the dataset using pandas
Data = pd.read_csv("Salary_Data.csv")
print("Data imported successfully")
The dataset is read with the read_csv function from the pandas library. If the file is loaded without an error, the script prints "Data imported successfully".
Data.head() # displays the top 5 rows of the data
- The head() method of a pandas DataFrame displays the first rows of the data. It takes the number of rows to show as an argument and displays 5 rows by default when no argument is passed.
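As a small illustration of how head() behaves with and without an argument, the sketch below uses a made-up DataFrame. The values and the name `sample` are hypothetical and are not taken from Salary_Data.csv.

```python
import pandas as pd

# Hypothetical sample data, only for illustrating head(); not the real dataset
sample = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
    "Salary": [40000, 45000, 52000, 58000, 66000, 71000, 80000],
})

first_five = sample.head()    # no argument: first 5 rows
first_three = sample.head(3)  # explicit argument: first 3 rows

print(first_five)
print(first_three)
```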

Data.info() # Provides information regarding the columns in the data

Step 3: Plotting the data in a graph
Data.plot(x='YearsExperience', y='Salary', style='o')
plt.title("Salary Prediction with experience")
plt.xlabel("Experience in years")
plt.ylabel("Salary")
- The plot function draws each row as a point, so we can see how the values are scattered.
# Separate the feature column (X) from the target column (Y)
X = Data.iloc[:, :-1].values
Y = Data.iloc[:, 1].values
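The two iloc expressions above return arrays of different shapes, which matters because scikit-learn expects the features as a 2-D array and accepts the target as a 1-D array. The sketch below shows this on a made-up two-column frame; the values and the names `sample`, `X_demo` and `Y_demo` are hypothetical.

```python
import pandas as pd

# Hypothetical frame with the same column names as the tutorial's dataset
sample = pd.DataFrame({
    "YearsExperience": [1.0, 2.0, 3.0, 4.0],
    "Salary": [40000, 45000, 52000, 58000],
})

X_demo = sample.iloc[:, :-1].values  # every column except the last -> 2-D array
Y_demo = sample.iloc[:, 1].values    # the second column only -> 1-D array

print(X_demo.shape)  # (4, 1)
print(Y_demo.shape)  # (4,)
```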
Step 4: Splitting the data into training and testing data
# Split the data for train and test
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
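- To split the data into training and testing sets, import the train_test_split function from sklearn.model_selection. With test_size=0.2, 20% of the rows are held out for testing, and random_state=0 makes the split reproducible.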
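To see what the split produces, here is a minimal sketch on placeholder arrays. The sample count of 30 and the generated values are assumptions chosen for illustration, not the tutorial's data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 30 placeholder samples (made-up values, for illustration only)
X_demo = np.arange(30, dtype=float).reshape(-1, 1)
Y_demo = 2.0 * X_demo.ravel() + 1.0

x_tr, x_te, y_tr, y_te = train_test_split(
    X_demo, Y_demo, test_size=0.2, random_state=0
)

# 20% of 30 rows go to the test set, the rest to the training set
print(len(x_tr), len(x_te))
```

Running the split again with the same random_state gives the same rows in each set, which keeps results comparable between runs.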
Step 5: Train the data
# Training the model
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(x_train, y_train)
print("Training successful")
- Train the model by calling the fit() method of sklearn's LinearRegression class on the training data. The script prints "Training successful" once fit() returns.
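One way to build intuition for what fit() learns is to train on noise-free synthetic data where the answer is known in advance. The sketch below assumes made-up points generated from y = 3x + 5; the names `x_demo`, `y_demo` and `demo_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data generated from y = 3x + 5 (made-up numbers)
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = 3.0 * x_demo.ravel() + 5.0

demo_model = LinearRegression()
demo_model.fit(x_demo, y_demo)

print(demo_model.coef_)       # slope, close to 3
print(demo_model.intercept_)  # intercept, close to 5
```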
Step 6: Plot Regression line
# plotting the regression line
line = regressor.coef_ * X + regressor.intercept_
plt.scatter(X, Y)
plt.plot(X, line)
plt.show()

# Intercept and coefficient of the line
print('Intercept of the model:', regressor.intercept_)
print('Coefficient of the line:', regressor.coef_)
- The linear regression model is represented by the equation, y = mx + c
- m = coefficient of the line
- c = intercept of the line
Therefore, the equation for this model is
- y = (9312.5)x + 26780.09
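The relationship y = mx + c can be checked directly: evaluating the line by hand with the fitted coefficient and intercept should match what predict() returns. The sketch below uses made-up, roughly linear points rather than the salary data, and the names `model`, `x_new`, `by_hand` and `by_model` are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up points that are roughly, but not exactly, linear
x_demo = np.array([[1.0], [2.0], [3.0], [4.0], [5.0]])
y_demo = np.array([8.1, 10.9, 14.2, 16.8, 20.1])

model = LinearRegression().fit(x_demo, y_demo)

m = model.coef_[0]    # coefficient (slope) of the line
c = model.intercept_  # intercept of the line

x_new = 6.0
by_hand = m * x_new + c                            # y = mx + c
by_model = model.predict(np.array([[x_new]]))[0]   # same value from the model

print(by_hand, by_model)
```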
print(x_test) #printing the test data
Test data:

Step 7: Predicting the result
y_pred = regressor.predict(x_test)
df = pd.DataFrame({'actual': y_test, 'predicted': y_pred})
df
- The predict() method is used to predict the result. Here we passed the test data as input.

year = float(input("Enter number of years : "))
year = np.array(year).reshape(-1, 1)
own_pred = regressor.predict(year)
print("Predicted Salary = {}".format(own_pred[0]))
- We can also predict the output by giving the input manually.
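The reshape(-1, 1) call in the snippet above is there because predict() expects a 2-D array with one row per sample and one column per feature. A short sketch of what the reshape does, using arbitrary example values:

```python
import numpy as np

# A single value becomes one row with one column
single = np.array(5.0).reshape(-1, 1)

# Several values become one row each, still with one column
several = np.array([1.5, 5.0, 10.0]).reshape(-1, 1)

print(single.shape)   # (1, 1)
print(several.shape)  # (3, 1)
```

Passing an array shaped like `several` to predict() would return one prediction per row, so several inputs can be scored in a single call.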

Step 8: Calculating Error
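- Mean absolute error (MAE) measures the average absolute difference between the predicted values and the true values:
$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n} \lvert y_i - \hat{y}_i \rvert$$
- $\hat{y}_i$ = predicted value
- $y_i$ = true value
- $n$ = total number of data points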
from sklearn import metrics
print('Mean Absolute Error:', metrics.mean_absolute_error(y_test, y_pred))
- The mean absolute error is calculated with the mean_absolute_error() function from sklearn.metrics, which takes the true values first and the predicted values second.
Mean Absolute Error: 2446.1723690465055
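To make the MAE definition concrete, the sketch below computes it by hand with NumPy and compares the result with sklearn's function. The four true and predicted values are made up for illustration and are unrelated to the salary data.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error

# Made-up true and predicted values
y_true_demo = np.array([100.0, 200.0, 300.0, 400.0])
y_pred_demo = np.array([110.0, 190.0, 330.0, 400.0])

# Average of the absolute differences: (10 + 10 + 30 + 0) / 4
mae_by_hand = np.mean(np.abs(y_true_demo - y_pred_demo))
mae_sklearn = mean_absolute_error(y_true_demo, y_pred_demo)

print(mae_by_hand, mae_sklearn)  # both 12.5
```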