Salary Prediction with ML - Linear Regression


Machine learning is a branch of artificial intelligence (AI) and computer science which focuses on the use of data and algorithms to imitate the way that humans learn, gradually improving its accuracy.

How Does Machine Learning Work?

Similar to how the human brain gains knowledge and understanding, machine learning relies on input, such as training data or knowledge graphs, to understand entities, domains and the connections between them. With entities defined, deep learning can begin.

Data is the Key: The algorithms that drive machine learning are critical to success. ML algorithms build a mathematical model based on sample data, known as “training data,” to make predictions or decisions without being explicitly programmed to do so. This can reveal trends in the data that businesses can use to improve decision making, optimize efficiency and capture actionable data at scale.

AI is the Goal: Machine Learning provides the foundation for AI systems that automate processes and solve data-based business problems autonomously. It enables companies to replace or augment certain human capabilities. Common machine learning applications you may find in the real world include chatbots, self-driving cars and speech recognition.

Machine Learning Methods

Machine learning classifiers fall into three primary categories:

  • Supervised learning, also known as supervised machine learning, is defined by its use of labeled datasets to train algorithms to classify data or predict outcomes accurately. As input data is fed into the model, it adjusts its weights until the model has been fitted appropriately. This occurs as part of the cross validation process to ensure that the model avoids overfitting or underfitting. Supervised learning helps organizations solve a variety of real-world problems at scale, such as classifying spam in a separate folder from your inbox. Some methods used in supervised learning include neural networks, naive Bayes, linear regression, logistic regression, random forest, support vector machine (SVM), and more.

  • Unsupervised learning, also known as unsupervised machine learning, uses machine learning algorithms to analyze and cluster unlabeled datasets. These algorithms discover hidden patterns or data groupings without the need for human intervention. Its ability to discover similarities and differences in information makes it the ideal solution for exploratory data analysis, cross-selling strategies, customer segmentation, and image and pattern recognition. It’s also used to reduce the number of features in a model through the process of dimensionality reduction; principal component analysis (PCA) and singular value decomposition (SVD) are two common approaches for this. Other algorithms used in unsupervised learning include neural networks, k-means clustering, probabilistic clustering methods, and more.

  • Semi-supervised learning offers a happy medium between supervised and unsupervised learning. During training, it uses a smaller labeled data set to guide classification and feature extraction from a larger, unlabeled data set. Semi-supervised learning can solve the problem of not having enough labeled data (or not being able to afford to label enough data) to train a supervised learning algorithm.
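
To make the difference between the first two categories concrete, here is a minimal sketch on made-up toy points (the data and variable names are illustrative assumptions, not part of the salary dataset). The supervised model is given the labels; the unsupervised one only sees the points.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

# Toy data: two well-separated groups of 2-D points (made up for illustration).
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [5.0, 5.0], [5.2, 4.9], [4.8, 5.1]])
labels = np.array([0, 0, 0, 1, 1, 1])  # known labels, used only by the supervised model

# Supervised: learn the mapping from points to the known labels.
classifier = LogisticRegression().fit(points, labels)
predicted_labels = classifier.predict([[1.1, 0.9], [5.1, 5.0]])

# Unsupervised: group the same points without ever seeing the labels.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(points)
cluster_ids = kmeans.labels_

print(predicted_labels)
print(cluster_ids)
```

Note that the cluster ids found by k-means are arbitrary: the two groups are separated, but which one is called 0 or 1 carries no meaning.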

Linear Regression

Linear regression is a basic and commonly used type of predictive analysis. The overall idea of regression is to examine two things:

  • Does a set of predictor variables do a good job in predicting an outcome (dependent) variable?

  • Which variables in particular are significant predictors of the outcome variable, and in what way do they (as indicated by the magnitude and sign of the beta estimates) impact the outcome variable? These regression estimates are used to explain the relationship between one dependent variable and one or more independent variables.

The simplest form of the regression equation, with one dependent and one independent variable, is defined by the formula:

y = c + b * x

  • y : estimated dependent variable score
  • c : constant
  • b : regression coefficient
  • x : score on the independent variable
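
As a small worked example of how c and b can be estimated for the one-variable case, the sketch below applies the standard least-squares formulas to made-up data that follows y = 2 + 3x exactly. The numbers are assumptions chosen so the result is easy to check by hand.

```python
import numpy as np

# Made-up data that follows y = 2 + 3x exactly.
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 + 3.0 * x

# Least-squares estimates for y = c + b * x:
# b is the covariance of x and y divided by the variance of x,
# c makes the line pass through the point (mean of x, mean of y).
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
c = y.mean() - b * x.mean()

print(f"b = {b:.2f}, c = {c:.2f}")  # b = 3.00, c = 2.00
```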

There are many names for a regression’s dependent variable. It may be called an outcome variable, criterion variable, endogenous variable, or regressand. The independent variables can be called exogenous variables, predictor variables, or regressors.

Three major uses for regression analysis are determining the strength of predictors, forecasting an effect, and trend forecasting.

Salary Prediction Model

First of all, we need to provide the dataset. The dataset can be an Excel file, a CSV file, and so on. You can use my example dataset here.

Import the pandas library to build the data frames.

import pandas as pd

Then load the dataset, as below:

dataset = pd.read_excel('salary_dataset.xlsx')

    knowledge  technical  logical  year_experience            salary
0          50         60       50                0   Rp 2,500,000.00
1          60         50       50                0   Rp 2,500,000.00
2          50         70       70                0   Rp 3,000,000.00
3          40         50       60                0   Rp 2,800,000.00
4          70         70       70              1.1   Rp 4,000,000.00
5          75         70       65              1.2   Rp 4,000,000.00
6          65         65       60              1.1   Rp 3,800,000.00
7          70         70       70              1.5   Rp 4,500,000.00
8          65        NaN       70                1   Rp 3,400,000.00
9          70         80       80                2   Rp 6,000,000.00
10         75         75       85              1.8   Rp 6,000,000.00
11         80         80       80                2   Rp 7,000,000.00
12         80         80       80              2.2   Rp 7,500,000.00
13         75         70       80              2.9   Rp 7,800,000.00
14         80         85       80                3   Rp 8,400,000.00
15         75         80       75              2.4   Rp 7,500,000.00
16         85         80       90              3.2   Rp 8,200,000.00
17         85         80       85              3.2   Rp 8,000,000.00
18         85         90       90              2.7   Rp 8,000,000.00
19         90         90       90              3.7  Rp 10,000,000.00
20        NaN        NaN      NaN                3   Rp 8,000,000.00
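
The salary column is displayed with an "Rp" prefix and thousands separators. If pandas reads that column as text rather than as numbers (an assumption; it depends on how the Excel cells are stored), it needs to be converted before a regression model can use it. A minimal sketch on a hypothetical two-row sample:

```python
import pandas as pd

# Hypothetical sample that mimics the salary column stored as text.
sample = pd.DataFrame({"salary": ["Rp 2,500,000.00", "Rp 10,000,000.00"]})

# Strip the currency prefix and thousands separators, then convert to float.
sample["salary"] = (
    sample["salary"]
    .str.replace("Rp", "", regex=False)
    .str.replace(",", "", regex=False)
    .str.strip()
    .astype(float)
)

print(sample["salary"].tolist())  # [2500000.0, 10000000.0]
```

If the column is already numeric and the "Rp" formatting only comes from the spreadsheet's display style, this step is unnecessary.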

Cleaning Dataset

Remove the rows that contain null (NaN) values from the data frame using dropna().

dataset = dataset.dropna()

    knowledge  technical  logical  year_experience            salary
0          50         60       50              0.0   Rp 2,500,000.00
1          60         50       50              0.0   Rp 2,500,000.00
2          50         70       70              0.0   Rp 3,000,000.00
3          40         50       60              0.0   Rp 2,800,000.00
4          70         70       70              1.1   Rp 4,000,000.00
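
To see what dropna() does, it can help to count the missing values first. The sketch below uses a small made-up frame (not the real dataset) with the same kind of gaps: one row missing a single score and one row missing several.

```python
import numpy as np
import pandas as pd

# Made-up frame with gaps similar to rows 8 and 20 of the dataset.
demo = pd.DataFrame({
    "knowledge": [65, 70, np.nan],
    "technical": [np.nan, 80, np.nan],
    "year_experience": [1.0, 2.0, 3.0],
})

print(demo.isna().sum())   # number of missing values per column
cleaned = demo.dropna()    # drops every row that has at least one NaN
print(len(demo), "->", len(cleaned))  # 3 -> 1
```

By default dropna() removes a row if any column is missing, which is why a row with only one NaN is dropped as well.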

Building Model

Import train_test_split from scikit-learn to split arrays or matrices into random train and test subsets. It is a quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), and application to the input data into a single call, so the data can be split (and optionally subsampled) in one line. Here the features x are every column except salary, and the target y is the salary column.

from sklearn.model_selection import train_test_split

x = dataset.drop('salary', axis=1)
y = dataset['salary']

    knowledge  technical  logical  year_experience
0          50         60       50              0.0
1          60         50       50              0.0
2          50         70       70              0.0
3          40         50       60              0.0
4          70         70       70              1.1

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
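
A quick way to check the split is to look at the resulting shapes. The sketch below uses 19 placeholder rows, which is the row count left in the dataset after the two rows containing NaN are dropped; the values themselves are made up.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 19 placeholder rows with 4 feature columns (values are arbitrary).
features = np.arange(19 * 4).reshape(19, 4)
target = np.arange(19)

f_train, f_test, t_train, t_test = train_test_split(
    features, target, test_size=0.2, random_state=42
)

print(f_train.shape, f_test.shape)  # (15, 4) (4, 4)
```

With test_size=0.2 the test set size is rounded up, so 19 rows give 4 test rows and 15 training rows. Fixing random_state makes the shuffle reproducible between runs.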


Import LinearRegression from scikit-learn, create a LinearRegression object, and fit it to the training data.

from sklearn.linear_model import LinearRegression

linear = LinearRegression()
linear.fit(x_train, y_train)
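
After fitting, the learned constant and regression coefficients are available as the intercept_ and coef_ attributes. The sketch below shows this on made-up, noise-free data generated from y = 1 + 2a + 3b, so the recovered values can be checked; it is not the salary model itself.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data generated from y = 1 + 2*a + 3*b (no noise).
features = np.array([[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]], dtype=float)
target = 1 + 2 * features[:, 0] + 3 * features[:, 1]

model = LinearRegression().fit(features, target)

print(model.intercept_)  # the constant c, close to 1
print(model.coef_)       # one coefficient b per feature, close to [2, 3]
```

On the salary model, the same attributes would give one coefficient per input column (knowledge, technical, logical, year_experience).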

Predict on the test data x_test by calling predict() and store the result in the variable y_pred. The result is the predicted salary for each row of the test data.

y_pred = linear.predict(x_test)

array([2122535.96880463, 3980638.07697809, 6537626.01871658, 1078550.87649938])


Machine learning model accuracy is the measurement used to determine which model is best at identifying relationships and patterns between variables in a dataset based on the input, or training data. The better a model can generalize to ‘unseen’ data, the better predictions and insights it can produce, which in turn deliver more business value.

linear.score(x_test, y_test)


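For LinearRegression, score() returns the coefficient of determination (R²) rather than a classification accuracy. Our model's R² on the test data is 0.814, meaning it explains about 81.4% of the variance in the test salaries.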
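
For reference, the value returned by score() for a regression model is the coefficient of determination, R² = 1 - SS_res / SS_tot. The sketch below computes it by hand, together with the mean absolute error, on made-up actual and predicted values (not the salary results).

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
actual = np.array([3.0, 5.0, 7.0, 9.0])
predicted = np.array([2.5, 5.5, 7.0, 9.0])

ss_res = np.sum((actual - predicted) ** 2)       # residual sum of squares
ss_tot = np.sum((actual - actual.mean()) ** 2)   # total sum of squares
r_squared = 1 - ss_res / ss_tot
mae = np.mean(np.abs(actual - predicted))        # mean absolute error

print(r_squared)  # 0.975
print(mae)        # 0.25
```

An error measure in the target's own units, such as the mean absolute error, is often easier to interpret for salaries than R² alone.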


Now let's apply the linear regression model to user input containing knowledge, technical, and logical scores and years of experience. Assume you are a fresh graduate with a knowledge score of 50, a technical score of 50, and a logical score of 60. In this case we use a dictionary and convert it into a DataFrame, as below.

data_dict = {
    'knowledge': 50,
    'technical': 50,
    'logical': 60,
    'year_experience': 0
}

input_data = pd.DataFrame([data_dict])

    knowledge  technical  logical  year_experience
0          50         50       60              0.0

Then predict with the fitted LinearRegression model, as below.

predicted_salary = linear.predict(input_data)[0]
# Convert decimal to integer
predicted_salary = int(predicted_salary)
print("IDR {:,.2f}".format(predicted_salary))

IDR 1,864,514.00

The predicted salary for this case is IDR 1,864,514.00.
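
If predictions are needed for several people, the steps can be wrapped in a small function. The sketch below is one possible way to do it: predict_salary and FEATURES are hypothetical names, and the tiny training set is made up only so the example runs on its own.

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

FEATURES = ["knowledge", "technical", "logical", "year_experience"]

# Tiny made-up training set, only so this sketch is self-contained.
train = pd.DataFrame({
    "knowledge": [50, 60, 70, 80, 90],
    "technical": [60, 50, 70, 80, 90],
    "logical": [50, 50, 70, 80, 90],
    "year_experience": [0.0, 0.0, 1.1, 2.0, 3.7],
    "salary": [2.5e6, 2.5e6, 4.0e6, 7.0e6, 1.0e7],
})
model = LinearRegression().fit(train[FEATURES], train["salary"])


def predict_salary(model, **scores):
    """Hypothetical helper: predict one salary from keyword scores."""
    # Reorder the columns to match the order used during training.
    row = pd.DataFrame([scores])[FEATURES]
    return float(model.predict(row)[0])


estimate = predict_salary(model, knowledge=50, technical=50, logical=60, year_experience=0)
print("IDR {:,.2f}".format(estimate))
```

Selecting the columns by FEATURES keeps the input in the same order as the training data, whatever order the caller passes the scores in.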


Linear regression helps us predict a dependent variable, which in this model is the salary. It estimates the response variable for a person from their knowledge, technical, and logical scores and their years of experience. You can download my Jupyter notebook, Predict Salary - Linear Regression.ipynb.