Thursday, June 16, 2016

SIMPLE LINEAR REGRESSION: PROS AND CONS

Simple linear regression is a statistical method that allows us to summarize and study the relationship between two continuous (quantitative) variables.

Some examples of statistical relationships might include:
  • Height and weight — as height increases, you'd expect weight to increase, but not perfectly.
  • Age and weight of a baby: as the baby grows older, you'd expect its weight to increase, but not perfectly.
  • Alcohol consumed and blood alcohol content — as alcohol consumption increases, you'd expect one's blood alcohol content to increase, but not perfectly.
  • Vital lung capacity and pack-years of smoking — as amount of smoking increases (as quantified by the number of pack-years of smoking), you'd expect lung function (as quantified by vital lung capacity) to decrease, but not perfectly.
  • Driving speed and gas mileage — as driving speed increases, you'd expect gas mileage to decrease, but not perfectly.
When is linear regression appropriate?
The sensible use of linear regression on a data set requires that four assumptions about that data set be true:
  1. The relationship between the variables is linear.
  2. The data is homoskedastic, meaning the variance of the residuals (the differences between the observed and predicted values) is more or less constant.
  3. The residuals are independent, meaning they are distributed randomly and are not influenced by the residuals of previous observations. If the residuals are not independent of each other, they are said to be autocorrelated.
  4. The residuals are normally distributed.
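Since the assumptions above are all stated in terms of residuals, it may help to see how residuals are obtained. The following is a minimal sketch, not a full diagnostic procedure: the age and weight values are made up for illustration, the helper names (`fit_line`, `residuals`, `durbin_watson`) are hypothetical, and the Durbin-Watson statistic is just one common way to look at independence, chosen here as an assumption rather than taken from the post.

```python
# Hypothetical data: age in months (x) and weight in kg (y). Values are made up.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weights = [4.3, 5.1, 5.8, 6.4, 6.9, 7.4, 7.7, 8.1, 8.3, 8.7]


def fit_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


def residuals(x, y, intercept, slope):
    """Observed value minus predicted value, one residual per observation."""
    return [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]


def durbin_watson(res):
    """Durbin-Watson statistic, always between 0 and 4.

    Values near 2 suggest little first-order autocorrelation; values toward 0
    suggest positive autocorrelation and values toward 4 negative.
    """
    num = sum((res[i] - res[i - 1]) ** 2 for i in range(1, len(res)))
    den = sum(r ** 2 for r in res)
    return num / den


intercept, slope = fit_line(ages, weights)
res = residuals(ages, weights, intercept, slope)

print("intercept:", round(intercept, 3), "slope:", round(slope, 3))
print("residuals:", [round(r, 3) for r in res])
print("Durbin-Watson:", round(durbin_watson(res), 3))
```

Plotting these residuals against the predicted values or against the order of observation is the usual next step for judging constant variance and independence by eye.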
The "Bad" linear regression model
The first step in determining whether a linear regression model is appropriate for a data set is to plot the data and evaluate it qualitatively. For two quantitative variables, the most appropriate graphical display is the scatter plot. The scatter plot allows investigation of the relationship between two variables: the independent variable is plotted along the horizontal axis and the dependent variable along the vertical axis, e.g. age on the horizontal axis and weight on the vertical axis for weight-for-age data. A scatter plot is frequently also referred to as a plot of Y versus X.
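To make the idea of plotting Y versus X concrete, here is a rough sketch that draws a crude text scatter plot using only the Python standard library. In practice you would normally use a plotting library; the function name `text_scatter`, its `width` and `height` parameters, and the age and weight values are all made up for illustration. The sketch assumes the x values are not all equal and the y values are not all equal.

```python
# Hypothetical data: age in months (x) and weight in kg (y). Values are made up.
ages = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
weights = [4.3, 5.1, 5.8, 6.4, 6.9, 7.4, 7.7, 8.1, 8.3, 8.7]


def text_scatter(x, y, width=40, height=12):
    """Return a crude text scatter plot of y versus x as a multi-line string.

    x runs along the horizontal axis (independent variable) and y along the
    vertical axis (dependent variable).
    """
    x_min, x_max = min(x), max(x)
    y_min, y_max = min(y), max(y)
    grid = [[" "] * width for _ in range(height)]
    for xi, yi in zip(x, y):
        # Scale each point to a column and a row of the character grid.
        col = round((xi - x_min) / (x_max - x_min) * (width - 1))
        row = round((yi - y_min) / (y_max - y_min) * (height - 1))
        grid[height - 1 - row][col] = "*"  # grid row 0 is the top of the plot
    body = "\n".join("|" + "".join(line) for line in grid)
    return body + "\n+" + "-" * width


plot = text_scatter(ages, weights)
print(plot)
```

Points rising from the bottom left to the top right indicate a positive direction, and how tightly they hug a straight line gives a first impression of shape and strength.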

A scatter plot of a baby's weight against its age shows that the relationship has the following characteristics:
  • Direction: Positive, i.e. as the baby gets older, the weight increases;
  • Shape: Roughly linear, i.e. the points appear to fall along a straight line;
  • Strength: Reasonably strong, i.e. the points follow a straight line, although with considerable scatter about it.
Clearly, the characterisation of the strength of the relationship is rather subjective, and a numerical estimate of the strength is preferable. Given that the relationship between age and weight appears linear, the strength of this linear relationship can be numerically summarized using the correlation, ρ.

Numerical summary of the data — Correlation 
After investigating the data visually, a numerical summary of the strength of the association between the two variables is often desired. This can be achieved with the population correlation coefficient, ρ, which measures the strength of the linear association between two variables, X and Y. Since X and Y are two quantitative variables, ρ is also known as the Pearson correlation coefficient or Pearson’s product-moment correlation coefficient.
ρ can take values between −1 and 1, and the interpretation of ρ is as follows:
  • A negative value indicates a decreasing relationship between X and Y, that is, as X increases, Y decreases.
  • A positive value indicates an increasing relationship between X and Y, that is, as X increases, so does Y.
  • A value of 0 indicates that there is no linear relationship between the two variables. This does not, however, imply that there is no relationship at all.
  • The correlation does not give an indication about the value of the slope of any linear relationship.
The type of relationship, and hence whether a correlation is an appropriate numerical summary, can only be assessed with a scatter plot. 
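The points above can be checked with a small sketch. It computes the sample correlation r, the usual estimate of ρ from data, as the sum of products of deviations divided by the square root of the product of the two sums of squared deviations. The function name `pearson_r` and the toy data sets are made up for illustration, and the function assumes neither variable is constant.

```python
import math


def pearson_r(x, y):
    """Sample Pearson correlation between two equal-length lists of numbers."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    syy = sum((yi - mean_y) ** 2 for yi in y)
    return sxy / math.sqrt(sxx * syy)


xs = [-3, -2, -1, 0, 1, 2, 3]

# Two perfectly linear relationships with very different slopes.
r_gentle = pearson_r(xs, [2 * v for v in xs])
r_steep = pearson_r(xs, [10 * v for v in xs])

# A perfectly linear decreasing relationship.
r_down = pearson_r(xs, [-v for v in xs])

# A perfect but non-linear relationship: y = x squared.
r_curve = pearson_r(xs, [v ** 2 for v in xs])

print(r_gentle, r_steep, r_down, r_curve)
```

Here the two increasing lines give the same correlation even though one is five times steeper, and the curved data give a correlation of zero despite Y being completely determined by X, which is why the scatter plot has to come first.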


In my next post I will talk about how to assess whether your model meets the 4 model assumptions:
  1. The errors E_i are statistically independent of each other;
  2. The E_i have constant variance, σ²_E, for all values of x_i;
  3. The E_i are normally distributed with mean 0;
  4. The means of the dependent variable Y fall on a straight line for all values of the independent variable X.
Keep it here
Can your model tell a story?

Erick Okello 
