Understanding a Linear Regression Algorithm with Example.

The above scenarios are real-life use cases where we predict a numeric variable on the basis of one or more other numeric variables. These variables are usually plotted on an X-axis and a Y-axis, as in the image below.

For instance, the graph below shows the relationship between the time a customer spends on an app and the amount of money they spend.

Linear Regression Graph

Linear regression attempts to study a trend or pattern, and then based on this we can predict an output. For instance, in the image above, our linear regression has been achieved by taking the following steps

In mathematical language, this line of best fit can be expressed and derived through the equation y=mx+c.

Imagine you want to predict the Yearly Amount a customer spends in our business given the customer’s Time on App and length of membership (in years).

Time on App and length of membership are called independent variables, meaning their values aren’t affected by the other variables in the model.

Yearly Amount, however, is called the dependent variable, meaning its value changes depending on the independent variables, in this case Time on App and length of membership (in years).

Therefore, the equation y=mx+c in our context will be interpreted as:

Yearly Amount = m1 * Time on App + m2 * length of membership + c

c is also called the y-intercept (the point where the graph intersects the y-axis). In this case our y-intercept is at 350 dollars. This means the model predicts a spend of 350 dollars for a customer whose Time on App and length of membership are both zero.

m1 and m2 are coefficients. We will see what coefficients mean later in this article.
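To make the equation concrete, here is a minimal sketch that evaluates it for one customer. The intercept of 350 comes from the text above; the values of m1 and m2 are assumptions picked only for illustration and are not the fitted coefficients.

```python
# Intercept taken from the text; M1 and M2 are assumed values for illustration.
C = 350.0   # y-intercept in dollars
M1 = 40.0   # assumed dollars per unit of Time on App
M2 = 60.0   # assumed dollars per year of membership


def predict_yearly_amount(time_on_app: float, length_of_membership: float) -> float:
    """Evaluate Yearly Amount = m1 * Time on App + m2 * length of membership + c."""
    return M1 * time_on_app + M2 * length_of_membership + C


# A customer with 12 units of Time on App and 3 years of membership.
print(predict_yearly_amount(12, 3))  # 40*12 + 60*3 + 350 = 1010.0
```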

Imagine you’ve been hired by an e-commerce business as their data scientist. The company is trying to:

This store has in-store clothing advice sessions, represented by the feature Avg. Session Length.

Linear regression finds the relationship between continuous variables. This means that we only need to include numeric values and not categorical variables.

We now have only continuous variables in our dataframe, having dropped the Email, Address and Avatar columns.
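Dropping the categorical columns could look roughly like this in pandas. The tiny dataframe and the name `customers` are made up for illustration; only the column names come from the article.

```python
import pandas as pd

# Made-up rows standing in for the customer data, for illustration only.
customers = pd.DataFrame({
    "Email": ["a@example.com", "b@example.com"],
    "Address": ["1 Main St", "2 Side St"],
    "Avatar": ["Violet", "Green"],
    "Time on App": [12.6, 11.1],
    "Yearly Amount Spent": [587.9, 392.2],
})

# Drop the categorical columns named in the article, keeping the numeric ones.
numeric_customers = customers.drop(columns=["Email", "Address", "Avatar"])

print(list(numeric_customers.columns))
```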

Missing values in the data result in poor model performance.

Our dataset has no missing values. But if we had missing values, we could have dealt with them in one or a combination of these main ways:
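As a sketch of two widely used options (assumed here, and shown on a made-up dataframe), we could drop the rows that contain missing values or fill the gaps with the column median:

```python
import numpy as np
import pandas as pd

# Made-up data with one missing value, for illustration only.
df = pd.DataFrame({
    "Time on App": [12.0, np.nan, 11.0, 13.0],
    "Yearly Amount Spent": [580.0, 400.0, 450.0, 610.0],
})

# Count missing values per column.
missing_counts = df.isnull().sum()

# Option 1: drop every row that contains a missing value.
dropped = df.dropna()

# Option 2: fill missing values with each column's median.
filled = df.fillna(df.median())

print(missing_counts)
print(len(dropped), filled["Time on App"].tolist())
```

Dropping is simplest when only a few rows are affected; filling keeps every row but replaces the gap with an estimate.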

In our dataset the target variable is pretty easy to identify and work with. Our target variable is Yearly Amount Spent, which is the value we want to be able to predict given all or some of the independent variables in our dataframe.

There must be a linear relationship between our independent variables and the dependent variable for us to continue analyzing our dataset using regression analysis. This can be confirmed by using a scatter plot.

The dependent variable increases as our independent variables increase. This is good and we can proceed with our analysis.
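A scatter plot for one feature against the target could be drawn roughly as follows. The values and the output file name are made up for illustration, and matplotlib is an assumed choice of plotting library.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so the script runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up values, for illustration only.
df = pd.DataFrame({
    "Length of Membership": [1.0, 2.0, 3.0, 4.0, 5.0],
    "Yearly Amount Spent": [380.0, 450.0, 500.0, 570.0, 640.0],
})

fig, ax = plt.subplots()
ax.scatter(df["Length of Membership"], df["Yearly Amount Spent"])
ax.set_xlabel("Length of Membership")
ax.set_ylabel("Yearly Amount Spent")
fig.savefig("membership_vs_spend.png")  # hypothetical output file name
```

If the points fall roughly along a straight, upward-sloping band, the linear relationship holds for that feature.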

An outlier is a data point that is unusually small or large compared with the rest of the data. It can influence the model by inflating error rates. If there are outliers in the data, remove them, or replace them with the mean or median value.

For our features, we will remove customers who have some attribute above the 0.999 quantile, which can be interpreted as a highly abnormal data point.

From the count output, 4 customers with highly abnormal data points have been dropped.
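The quantile filter described above can be sketched like this. The data is randomly generated for illustration, so the number of rows dropped here will not match the article's count.

```python
import numpy as np
import pandas as pd

# Randomly generated stand-in data, for illustration only.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Time on App": rng.normal(12, 1, 1000),
    "Length of Membership": rng.normal(3.5, 1, 1000),
})

# 0.999 quantile of each column, used as an upper threshold.
upper = df.quantile(0.999)

# Keep only the rows where no feature exceeds its threshold.
mask = (df <= upper).all(axis=1)
trimmed = df[mask]

print(len(df) - len(trimmed), "rows dropped")
```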

For us to proceed with linear regression, our data points need to be spread symmetrically around the true mean value. We can check for the distribution by drawing distribution plots.

If the data is not normal, perform data transformation to reduce its skewness.

Negatively skewed data requires a power transformation or an exponential transformation. In contrast, positively skewed data requires a log transformation or square root transformation.
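As a small sketch of the positively skewed case, we can measure skewness before and after a log transformation. The values are made up, and `log1p` (which computes log(1 + x)) is an assumed choice that also copes with zeros.

```python
import numpy as np
import pandas as pd

# Made-up, positively skewed values (a long right tail), for illustration only.
spend = pd.Series([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 5.0, 8.0, 20.0, 60.0])

skew_before = spend.skew()

# Log transformation pulls in the long right tail.
log_spend = np.log1p(spend)
skew_after = log_spend.skew()

print(round(skew_before, 2), round(skew_after, 2))
```

A skewness closer to zero after the transformation indicates a more symmetric distribution.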

Correlation measures the strength and direction of the relationship between variables.

Length of Membership and Time on App are the most strongly correlated with the Yearly Amount Spent.

Correlation measures the relationship between two variables. When these two variables are so highly correlated that they explain each other (to the point that you can predict the one variable with the other), then we have Collinearity.

There must be little or no multicollinearity in the data.

We drop results whose correlation is 1.0, since these are variables correlated with themselves.

We get no output, which means none of our independent variables are highly correlated with each other, and that’s the scenario we want.
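A check of this kind could be sketched as follows. The feature values are made up, and the 0.8 cut-off for "highly correlated" is an assumption, since the article does not state its threshold. This sketch removes self-correlations by comparing variable names rather than by testing for a value of 1.0.

```python
import pandas as pd

# Made-up feature values, for illustration only.
features = pd.DataFrame({
    "Avg. Session Length": [33.0, 34.5, 32.1, 35.2, 33.8, 31.9],
    "Time on App": [12.1, 11.0, 13.4, 12.8, 10.9, 11.7],
    "Time on Website": [37.2, 36.1, 38.0, 35.5, 37.9, 36.6],
})

corr = features.corr()

# Flatten the matrix into (variable, variable) pairs and drop the diagonal,
# where each variable is correlated with itself.
pairs = corr.unstack()
first = pairs.index.get_level_values(0)
second = pairs.index.get_level_values(1)
pairs = pairs[first != second]

# Assumed cut-off for "highly correlated".
high = pairs[pairs.abs() > 0.8]
print(high)
```

An empty result means no pair of independent variables crosses the threshold.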

We train our model on only part of the data (train data) and reserve the rest (test data) to evaluate the quality of our model.
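One simple way to make such a split with pandas alone is sketched below. The data is made up, and the 70/30 ratio and the `random_state` value are assumptions for illustration.

```python
import pandas as pd

# Made-up data, for illustration only.
df = pd.DataFrame({
    "Time on App": [float(i) for i in range(10)],
    "Yearly Amount Spent": [400.0 + 20 * i for i in range(10)],
})

# Hold out 30% of the rows for testing; random_state makes the split reproducible.
test = df.sample(frac=0.3, random_state=42)
train = df.drop(test.index)

print(len(train), len(test))  # 7 3
```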

Feature scaling is done to standardize features that vary greatly in magnitude and units. Algorithms for which scaling is commonly applied include: kNN, kMeans clustering (Euclidean distance), linear regression, logistic regression, and SVM.

NB: Scaling is usually done for the x-axis variables and is not required for the target variable.
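As a sketch, assuming standardization (subtract the mean, divide by the standard deviation) is the scaling method, we would learn the statistics from the train data and reuse them on the test data. The values are made up for illustration.

```python
import pandas as pd

# Made-up feature values, for illustration only.
train_X = pd.DataFrame({
    "Time on App": [10.0, 12.0, 14.0],
    "Time on Website": [35.0, 37.0, 39.0],
})
test_X = pd.DataFrame({"Time on App": [12.0], "Time on Website": [36.0]})

# Learn the mean and standard deviation from the train data only.
mean = train_X.mean()
std = train_X.std(ddof=0)  # population standard deviation

# Apply the same statistics to both sets; the target variable is left unscaled.
train_scaled = (train_X - mean) / std
test_scaled = (test_X - mean) / std

print(train_scaled)
print(test_scaled)
```

Reusing the train statistics on the test set keeps information about the test data from leaking into the model.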

We will use the statsmodels library, which lets us explore our data, perform statistical tests and estimate statistical models.

After importing statsmodels, we will fit the model to our train data.

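const 501 is our y-intercept. This is the Yearly Amount Spent the model predicts when every independent variable in the fitted (scaled) data equals zero, about 501 dollars. It is a baseline prediction rather than a guaranteed minimum spend.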

x1 is our variable Avg. Session Length, and 24.5649 is its regression coefficient. This means that, on average, each additional unit of Avg. Session Length is associated with an increase of 24.5649 dollars in a customer’s Yearly Amount Spent, holding the other variables constant.

x2 is our variable Time on App, and 38.7219 is its regression coefficient. This means that, on average, each additional minute a customer spends on the app is associated with an increase of 38.7219 dollars in the customer’s Yearly Amount Spent.

x3 is our variable Length of Membership, and 58.3532 is its regression coefficient. This means that, on average, each additional year of membership is associated with an increase of 58.3532 dollars in the customer’s Yearly Amount Spent.

x4 is our variable Time on Website, and 0.312 is its regression coefficient. This means that, on average, each additional minute a customer spends on the website is associated with an increase of 0.312 dollars in the customer’s Yearly Amount Spent.

The Time spent on Website seems to have little influence on the Yearly Amount Spent (0.312 dollars).

The Time on App has greater influence in terms of customer spending.

What would you advise the company? Maybe the customer experience on the website is not good, and they could do some research and ask customers why they don’t like purchasing products on the website.

Length of Membership has the most influence on customer yearly spend.

A p-value reported as 0 means we reject the null hypothesis that the coefficient is zero, so the variable’s effect is statistically significant.

The smaller the p-value, the stronger the evidence that we should reject the null hypothesis.

R-Squared: This measures how much of the variation in the outcome can be explained by the variation in the independent variables. It is also known as the goodness of fit of a model.

Its value ranges from 0 to 1, where 0 indicates that the outcome cannot be predicted by any of the independent variables and 1 indicates that the outcome can be predicted without error from the independent variables.

Our R-Squared is 0.983 or 98.3%, which means that 98.3% of the variation in ‘Yearly Amount Spent’ can be explained by ‘Avg. Session Length’, ‘Time on App’, ‘Length of Membership’ and ‘Time on Website’.

However, just to point out, this does not mean that our model is 98.3% accurate. A low R-Squared value would indicate that our independent variables are not explaining much of the variation in our dependent variable.
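To see where the number comes from, R-Squared can be computed by hand as 1 minus the ratio of unexplained variation to total variation. The actual and predicted values below are made up for illustration.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([400.0, 450.0, 500.0, 550.0, 600.0])
y_pred = np.array([410.0, 440.0, 505.0, 545.0, 600.0])

ss_res = np.sum((y_true - y_pred) ** 2)         # unexplained variation: 250
ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # total variation: 25000
r_squared = 1 - ss_res / ss_tot

print(r_squared)  # about 0.99
```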

Adjusted R-Squared

R-Squared is a good measure of how well the model fits the dependent variable. However, it does not take the overfitting problem into consideration. If your regression model has many independent variables, the model may be too complicated: it may fit the training data very well but perform badly on testing data.

Adjusted R-Squared is introduced because it penalizes any additional independent variables added to our model and adjusts the metric to guard against overfitting.

Note: adjusted R-Squared is always lower than or equal to R-Squared.

Our adjusted R-Squared is also 0.983 or 98.3%, which means that, even after accounting for the number of variables in the model, 98.3% of the variation in ‘Yearly Amount Spent’ can be explained by ‘Avg. Session Length’, ‘Time on App’, ‘Length of Membership’ and ‘Time on Website’.
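The penalty can be seen in the standard formula, adjusted R-Squared = 1 - (1 - R-Squared) * (n - 1) / (n - p - 1), where n is the number of observations and p the number of independent variables. In the sketch below, the R-Squared of 0.983 and the four variables come from the article, while the sample size of 350 is an assumption for illustration.

```python
def adjusted_r_squared(r_squared: float, n_obs: int, n_features: int) -> float:
    """Adjusted R-Squared: 1 - (1 - R^2) * (n - 1) / (n - p - 1)."""
    return 1 - (1 - r_squared) * (n_obs - 1) / (n_obs - n_features - 1)


# R-Squared and feature count from the article; n_obs=350 is an assumed sample size.
adj = adjusted_r_squared(0.983, n_obs=350, n_features=4)
print(round(adj, 3))  # 0.983

# Same R-Squared with many more variables: the penalty becomes visible.
print(adjusted_r_squared(0.983, n_obs=350, n_features=100) < adj)  # True
```

With only four variables and a few hundred observations the penalty is tiny, which is why the two metrics agree to three decimal places.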

Question: Calculate and interpret the Mean Absolute Error, Mean Squared Error and Root Mean Squared Error.
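As a starting point, here is a sketch of how the three metrics are computed. The actual and predicted values are made up, so the results illustrate the calculation only and are not the answer for our model.

```python
import numpy as np

# Made-up actual and predicted values, for illustration only.
y_true = np.array([400.0, 450.0, 500.0, 550.0, 600.0])
y_pred = np.array([410.0, 440.0, 505.0, 545.0, 600.0])

errors = y_true - y_pred

mae = np.mean(np.abs(errors))  # Mean Absolute Error: 30 / 5 = 6.0
mse = np.mean(errors ** 2)     # Mean Squared Error: 250 / 5 = 50.0
rmse = np.sqrt(mse)            # Root Mean Squared Error: about 7.07

print(mae, mse, rmse)
```

MAE and RMSE are in the same units as the target (dollars here), and RMSE weights large errors more heavily than MAE does.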

We’ve concluded our linear regression article and seen how we can apply the model to a business problem and draw conclusions from a statistical perspective. We’ve also seen what we need to be on the lookout for when dealing with a linear regression model, such as correlation, outliers, distribution, etc.
