## CSE • Associate Analytics

 UNIT - IV Correlation and Regression Analysis (NOS 9001)

Regression Analysis and Modeling – Introduction:

Regression analysis is a predictive modeling technique that investigates the relationship between a dependent (target) variable and one or more independent (predictor) variables. It is used for forecasting, time series modeling, and finding cause-and-effect relationships between variables. For example, the relationship between rash driving and the number of road accidents by a driver is best studied through regression.

Regression analysis is an important tool for modeling and analyzing data. Here, we fit a curve or line to the data points so that the distances of the data points from the curve or line are minimized.

Regression analysis estimates the relationship between two or more variables. For example:

Suppose we want to estimate the growth in sales of a company based on current economic conditions. Recent company data indicates that the growth in sales is around two and a half times the growth in the economy. Using this insight, we can predict future sales of the company based on current and past information.

There are multiple benefits of using regression analysis. They are as follows:

* It indicates the significant relationships between dependent variable and independent variable.

* It indicates the strength of impact of multiple independent variables on a dependent variable.

Regression analysis also allows us to compare the effects of variables measured on different scales, such as the effect of price changes and the number of promotional activities. These benefits help market researchers, data analysts and data scientists to evaluate candidate variables, eliminate the weak ones, and select the best set of variables for building predictive models.

There are various kinds of regression techniques available to make predictions. These techniques are mostly driven by three metrics.

1. Number of independent variables,

2. Type of dependent variables and

3. Shape of regression line

Linear Regression:

A simple linear regression model describes the relationship between two variables x and y and can be expressed by the following equation:

$$y = \alpha + \beta x + \epsilon$$

The numbers α and β are called parameters, and ϵ is the error term. If we choose the parameters α and β in the simple linear regression model so as to minimize the sum of squares of the error term ϵ, we obtain the so-called estimated simple regression equation:

$$\hat{y} = a + b x$$

Here a and b are the least-squares estimates of α and β. The estimated equation allows us to compute fitted values of y based on values of x.

In R we use the lm() function to do simple regression modeling.

Apply the simple linear regression model to the data set cars. The cars dataset has two variables (attributes), speed and dist, and 50 observations. The first six rows are shown below.

> head(cars)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16
6     9   10

> attach(cars)

By using the attach( ) function the database is attached to the R search path. This means that the database is searched by R when evaluating a variable, so objects in the database can be accessed by simply giving their names.

> speed

 [1] 4 4 7 7 8 9 10 10 10 11 11 12 12 12 12 13 13 13 13 14 14 14 14 15 15 15 16
[28] 16 17 17 17 18 18 18 18 19 19 19 20 20 20 20 20 22 23 24 24 24 24 25

> plot(cars)

> plot(dist,speed)

The plot() function gives a scatterplot whenever we give two numeric variables. The first variable listed will be plotted on the horizontal axis. Now apply the regression analysis on the dataset using lm( ) function.

> speed.lm=lm(speed ~ dist, data = cars)

The lm function fits a model that describes the variable speed by the variable dist, and we save the linear regression model in a new variable speed.lm. In the above call, the y variable (dependent variable) is speed and the x variable (independent variable) is dist.

Printing the model gives the intercept “C” and the slope “m” of the equation Y = mX + C:

> speed.lm

Call:
lm(formula = speed ~ dist, data = cars)

Coefficients:
(Intercept)       dist
8.2839          0.1656

> abline(speed.lm)

This function adds one or more straight lines through the current plot.

> plot(speed.lm)

The plot function displays four charts: Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Residuals vs. Leverage.

Residual: The difference between the actual, observed value and the predicted value (based on the regression equation).

Outlier: In linear regression, an outlier is an observation with large residual. In other words, it is an observation whose dependent-variable value is unusual given its value on the predictor variables. An outlier may indicate a sample peculiarity or may indicate a data entry error or other problem.

Leverage: An observation with an extreme value on a predictor variable is a point with high leverage. Leverage is a measure of how far an independent variable deviates from its mean. High leverage points can have a great amount of effect on the estimate of regression coefficients.

Influence: An observation is said to be influential if removing the observation substantially changes the estimate of the regression coefficients. Influence can be thought of as the product of leverage and outlierness.

Cook's distance (or Cook's D): A measure that combines the information of leverage and residual of the observation.
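The quantities defined above can be computed directly for the cars model. The following is a minimal sketch, assuming the base R functions residuals(), hatvalues() and cooks.distance(); the names fit, diagnostics and top5 are chosen here for illustration only.

```r
# Refit the model from the text so that this block is self-contained
fit <- lm(speed ~ dist, data = cars)

res  <- residuals(fit)        # observed speed minus fitted speed
lev  <- hatvalues(fit)        # leverage of each observation
cook <- cooks.distance(fit)   # combines residual size and leverage

# Collect the diagnostics in one table
diagnostics <- data.frame(dist = cars$dist,
                          residual = res,
                          leverage = lev,
                          cooks.d = cook)

# Show the five observations with the largest Cook's distance
top5 <- head(diagnostics[order(-diagnostics$cooks.d), ], 5)
print(top5)
```

Observations near the top of this table are the ones worth inspecting first for data entry errors or unusual values.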

Estimated simple regression equation:

We will now use the above simple linear regression model to estimate the speed when the distance covered is 80.

Extract the parameters of the estimated regression equation with the coefficients function.

> coeffs = coefficients(speed.lm)

> coeffs

(Intercept)        dist

8.2839056      0.1655676

We now compute the fitted speed using the estimated regression equation. The first element of coeffs is the intercept and the second is the slope.

> newdist = 80

> fitted.speed = coeffs[1] + coeffs[2]*newdist

> fitted.speed

(Intercept)

21.52931
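The same estimate can be obtained without handling the coefficients by hand. The sketch below assumes the base R function predict(); the names fit and new_obs are illustrative and not part of the original notes.

```r
# Refit the model from the text so that this block is self-contained
fit <- lm(speed ~ dist, data = cars)

# New data must use the same column name as the predictor in the formula
new_obs <- data.frame(dist = 80)

# Point estimate of speed at dist = 80
pred <- predict(fit, newdata = new_obs)
print(pred)

# The same estimate with a 95% confidence interval for the mean response
pred_ci <- predict(fit, newdata = new_obs, interval = "confidence")
print(pred_ci)
```

Using predict() avoids indexing mistakes and extends unchanged to models with several predictors.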

To create a summary of the fitted model:

> summary(speed.lm)

Call:
lm(formula = speed ~ dist, data = cars)

Residuals:
    Min      1Q  Median      3Q     Max
-7.5293 -2.1550  0.3615  2.4377  6.4179

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  8.28391    0.87438   9.474 1.44e-12 ***
dist         0.16557    0.01749   9.464 1.49e-12 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 3.156 on 48 degrees of freedom
Multiple R-squared: 0.6511, Adjusted R-squared: 0.6438
F-statistic: 89.57 on 1 and 48 DF, p-value: 1.49e-12
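Individual numbers from the summary can be extracted for further use. The following sketch assumes the standard components of a summary.lm object (r.squared, adj.r.squared, sigma); it also checks that, for a model with one predictor, R-squared equals the squared correlation between the two variables.

```r
# Refit the model from the text so that this block is self-contained
fit <- lm(speed ~ dist, data = cars)
s <- summary(fit)

r2     <- s$r.squared       # Multiple R-squared
adj_r2 <- s$adj.r.squared   # Adjusted R-squared
sigma  <- s$sigma           # Residual standard error

# With a single predictor, R-squared is the square of the Pearson correlation
r <- cor(cars$speed, cars$dist)

print(c(r2 = r2, adj_r2 = adj_r2, sigma = sigma, r_squared_from_cor = r^2))
```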

OLS Regression:

Ordinary least squares (OLS), or linear least squares, is a method for estimating the unknown parameters in a linear regression model. Its goal is to minimize the sum of squared differences between the observed responses in a dataset and the responses predicted by the linear approximation of the data.

OLS is applied in both simple linear and multiple regression, where the common assumptions are:

(1) The model is linear in the coefficients of the predictor with an additive random error term

(2) The random error terms are

* normally distributed with 0 mean and

* a variance that doesn't change as the values of the predictor covariates change.
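For the simple model with one predictor, the least-squares estimates have a closed form: the slope is the sum of cross-products of the centered x and y values divided by the sum of squares of the centered x values, and the intercept follows from the means. The sketch below applies these formulas to the cars data and compares the result with lm(); the variable names are illustrative.

```r
# Predictor and response as used in the text (speed explained by dist)
x <- cars$dist
y <- cars$speed

# Closed-form OLS estimates for a single predictor
b <- sum((x - mean(x)) * (y - mean(y))) / sum((x - mean(x))^2)  # slope
a <- mean(y) - b * mean(x)                                      # intercept

manual <- c(intercept = a, slope = b)
print(manual)

# The same estimates from lm() for comparison
from_lm <- coef(lm(speed ~ dist, data = cars))
print(from_lm)
```

Both lines of output should show the same two numbers, which is a quick check that lm() is doing ordinary least squares.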

Correlation:

Correlation is a statistical measure that indicates the extent to which two or more variables fluctuate together. It shows whether and how strongly pairs of variables are related, that is, it measures the association between variables. A correlation can be positive or negative, and ranges between +1 and -1.

For example, height and weight are related; taller people tend to be heavier than shorter people. A positive correlation indicates the extent to which those variables increase or decrease in parallel; a negative correlation indicates the extent to which one variable increases as the other decreases.

When the fluctuation of one variable reliably predicts a similar fluctuation in another variable, there’s often a tendency to think that means that the change in one causes the change in the other. However, correlation does not imply causation. There may be, for example, an unknown factor that influences both variables similarly.

An intelligent correlation analysis can lead to a greater understanding of your data.

Correlation in R:

We use the cor( ) function to produce correlations.

A simplified format is cor(x, use=, method= ), where:

* x : Matrix or data frame

* use : Specifies the handling of missing data. Options are all.obs (assumes no missing data; missing data will produce an error), complete.obs (listwise deletion), and pairwise.complete.obs (pairwise deletion)

* method : Specifies the type of correlation. Options are pearson, spearman or kendall

> cor(cars)

speed dist

speed 1.0000000 0.8068949

dist 0.8068949 1.0000000

> cor(cars, use="complete.obs", method="kendall")

speed dist

speed 1.0000000 0.6689901

dist 0.6689901 1.0000000

> cor(cars, use="complete.obs", method="pearson")

speed dist

speed 1.0000000 0.8068949

dist 0.8068949 1.0000000

Correlation Coefficient:

The correlation coefficient of two variables in a data sample is their covariance divided by the product of their individual standard deviations. It is a normalized measurement of how the two are linearly related.

Formally, the sample correlation coefficient is defined by the following formula, where $s_x$ and $s_y$ are the sample standard deviations, and $s_{xy}$ is the sample covariance:

$$r_{xy} = \frac{s_{xy}}{s_x s_y}$$

Similarly, the population correlation coefficient is defined as follows, where $\sigma_x$ and $\sigma_y$ are the population standard deviations, and $\sigma_{xy}$ is the population covariance:

$$\rho_{xy} = \frac{\sigma_{xy}}{\sigma_x \sigma_y}$$

If the correlation coefficient is close to 1, it indicates that the variables are positively linearly related and the scatter plot falls almost along a straight line with positive slope. If it is close to -1, it indicates that the variables are negatively linearly related and the scatter plot falls almost along a straight line with negative slope. If it is close to zero, it indicates a weak linear relationship between the variables.

* r : correlation coefficient

* +1 : Perfectly positive

* -1 : Perfectly negative

* 0 – 0.2 : No or very weak association

* 0.2 – 0.4 : Weak association

* 0.4 – 0.6 : Moderate association

* 0.6 – 0.8 : Strong association

* 0.8 – 1 : Very strong to perfect association
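The definition of the sample correlation coefficient can be checked numerically on the cars data. This is a small sketch using the base R functions cov(), sd() and cor(); the names r_manual and r_builtin are illustrative.

```r
x <- cars$speed
y <- cars$dist

# Covariance divided by the product of the standard deviations
r_manual <- cov(x, y) / (sd(x) * sd(y))

# Built-in Pearson correlation for comparison
r_builtin <- cor(x, y)

print(c(manual = r_manual, builtin = r_builtin))
```

On the scale listed above, the resulting value of about 0.81 falls in the "very strong" band.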

Covariance:

Covariance provides a measure of the strength of the correlation between two or more sets of random variates. Correlation is defined in terms of the variance of x, the variance of y, and the covariance of x and y (the way the two vary together; the way they co-vary) on the assumption that both variables are normally distributed.

Covariance in R:

We apply the cov function to compute the covariance of eruptions and waiting in the faithful dataset.

> duration = faithful$eruptions   # the eruption durations
> waiting = faithful$waiting      # the waiting period
> cov(duration, waiting)          # apply the cov function
[1] 13.978
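The covariance can also be computed from its definition, which makes clear why its size depends on the units of measurement. The sketch below assumes the usual sample covariance with divisor n - 1 (the convention used by cov()); the rescaling of durations from minutes to seconds is only an illustration.

```r
x <- faithful$eruptions   # eruption durations (minutes)
y <- faithful$waiting     # waiting times (minutes)
n <- length(x)

# Sample covariance from the definition, with divisor n - 1
cov_manual  <- sum((x - mean(x)) * (y - mean(y))) / (n - 1)
cov_builtin <- cov(x, y)
print(c(manual = cov_manual, builtin = cov_builtin))

# Changing units rescales the covariance but leaves the correlation unchanged
cov_seconds <- cov(x * 60, y)
cor_minutes <- cor(x, y)
cor_seconds <- cor(x * 60, y)
print(c(cov_seconds = cov_seconds, cor_minutes = cor_minutes, cor_seconds = cor_seconds))
```

This is why the correlation coefficient, not the raw covariance, is used to judge the strength of a relationship.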

ANOVA:

Analysis of Variance (ANOVA) is a commonly used statistical technique for investigating data by comparing the means of subsets of the data. The base case is the one-way ANOVA which is an extension of two-sample t test for independent groups covering situations where there are more than two groups being compared.

In one-way ANOVA the data is sub-divided into groups based on a single classification factor and the standard terminology used to describe the set of factor levels is treatment even though this might not always have meaning for the particular application. There is variation in the measurements taken on the individual components of the data set and ANOVA investigates whether this variation can be explained by the grouping introduced by the classification factor.

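To investigate these differences we fit the model using the lm function and look at the parameter estimates and standard errors for the treatment effects. The anova function then produces the analysis of variance table. Below it is applied to the regression model speed.lm fitted earlier, where dist is a numeric predictor rather than a grouping factor.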

> anova(speed.lm)

Analysis of Variance Table

Response: speed

          Df Sum Sq Mean Sq F value   Pr(>F)
dist       1 891.98  891.98  89.567 1.49e-12 ***
Residuals 48 478.02    9.96

---

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

This table confirms the result highlighted in the model summary: dist explains a significant share of the variation in speed (F = 89.567, p-value = 1.49e-12). The function confint is used to calculate confidence intervals on the model parameters, by default 95% confidence intervals.

> confint(speed.lm)

                2.5 %     97.5 %
(Intercept) 6.5258378 10.0419735
dist        0.1303926  0.2007426
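The cars model has a numeric predictor, so it does not show the group comparison that one-way ANOVA is designed for. As a sketch of the one-way case, the example below uses PlantGrowth, a dataset that ships with R (30 plant weights in three groups); this dataset and the name plant.lm are not part of the original notes.

```r
# PlantGrowth: weight of plants under a control and two treatments
# group is a factor with levels ctrl, trt1, trt2
plant.lm <- lm(weight ~ group, data = PlantGrowth)

# One-way ANOVA table: does the grouping explain the variation in weight?
tab <- anova(plant.lm)
print(tab)

# Group means, to see which groups differ
group_means <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
print(group_means)

# Confidence intervals for the treatment effects relative to the control group
print(confint(plant.lm))
```

A small p-value in the group row suggests that at least one group mean differs from the others; it does not say which one.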

Heteroscedasticity:

Heteroscedasticity (also spelled heteroskedasticity) refers to the circumstance in which the variability of a variable is unequal across the range of values of a second variable that predicts it.

A scatterplot of these variables will often create a cone-like shape, as the scatter (or variability) of the dependent variable (DV) widens or narrows as the value of the independent variable (IV) increases. The inverse of heteroscedasticity is homoscedasticity, which indicates that a DV's variability is equal across values of an IV.

Hetero (different or unequal) is the opposite of Homo (same or equal).

Detecting Heteroskedasticity:

There are two ways in general.

The first is the informal way which is done through graphs and therefore we call it the graphical method.

The second is through formal tests for heteroskedasticity, like the following ones:

1. The Breusch-Pagan LM Test

2. The Glejser LM Test

3. The Harvey-Godfrey LM Test

4. The Park LM Test

5. The Goldfeld-Quandt Test

6. White’s Test

Heteroscedasticity test in R:

bptest(p), where p is a fitted model, performs the Breusch-Pagan test to formally check for the presence of heteroscedasticity. To use bptest, you have to load the lmtest library.

> install.packages("lmtest")

> library(lmtest)

> bptest(speed.lm)

studentized Breusch-Pagan test

data: speed.lm

BP = 0.71522, df = 1, p-value = 0.3977

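In this output the p-value (0.3977) is well above 0.05, so the test gives no evidence of heteroscedasticity in speed.lm. If the test is positive (low p-value), you should see if any transformation of the dependent variable helps you eliminate heteroscedasticity.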
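To see what the test does, the studentized Breusch-Pagan statistic can be computed with base R only. The sketch below assumes the usual n times R-squared form of the studentized test: the squared residuals are regressed on the predictor, and the statistic is compared with a chi-squared distribution with one degree of freedom. The result should agree with the bptest output up to rounding, but that is an expectation, not something verified here.

```r
# Refit the model from the text so that this block is self-contained
fit <- lm(speed ~ dist, data = cars)

# Auxiliary regression: squared residuals on the original predictor
e2  <- residuals(fit)^2
aux <- lm(e2 ~ dist, data = cars)

# Test statistic: sample size times R-squared of the auxiliary regression
n <- nrow(cars)
bp_stat <- n * summary(aux)$r.squared

# p-value from the chi-squared distribution (df = number of predictors = 1)
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p.value = bp_pval))
```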

Autocorrelation:

Autocorrelation, also known as serial correlation or cross-autocorrelation, is the cross-correlation of a signal with itself at different points in time. Informally, it is the similarity between observations as a function of the time lag between them. It is a mathematical tool for finding repeating patterns, such as the presence of a periodic signal obscured by noise, or identifying the missing fundamental frequency in a signal implied by its harmonic frequencies. It is often used in signal processing for analyzing functions or series of values, such as time domain signals.

The function acf ( ) in R computes estimates of the autocovariance or autocorrelation function.

Tests for autocorrelation:

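The traditional test for the presence of first-order autocorrelation is the Durbin–Watson statistic or, if the explanatory variables include a lagged dependent variable, Durbin's h statistic. The Durbin–Watson statistic can, however, be linearly mapped to the Pearson correlation between values and their lags: it is approximately 2(1 - r), where r is the lag-1 correlation of the residuals, so values near 2 indicate no first-order autocorrelation.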

A more flexible test, covering autocorrelation of higher orders and applicable whether or not the regressors include lags of the dependent variable, is the Breusch–Godfrey test. This involves an auxiliary regression, wherein the residuals obtained from estimating the model of interest are regressed on (a) the original regressors and (b) k lags of the residuals, where k is the order of the test. The simplest version of the test statistic from this auxiliary regression is $TR^2$, where $T$ is the sample size and $R^2$ is the coefficient of determination. Under the null hypothesis of no autocorrelation, this statistic is asymptotically distributed as $\chi^2$ with k degrees of freedom.
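The Durbin–Watson statistic is simple enough to compute by hand from the residuals of a fitted model. The sketch below uses LakeHuron, a yearly time series that ships with R, because the cars data has no time order; the dataset and the names trend, dw and r1 are assumptions made for this illustration.

```r
# LakeHuron: annual lake level measurements, a time series built into R
level <- as.numeric(LakeHuron)
year  <- as.numeric(time(LakeHuron))

# Fit a linear trend and take the residuals
trend <- lm(level ~ year)
e <- residuals(trend)

# Durbin-Watson statistic: squared successive differences over squared residuals
dw <- sum(diff(e)^2) / sum(e^2)

# Lag-1 autocorrelation of the residuals from acf() (element 1 is lag 0)
r1 <- acf(e, plot = FALSE)$acf[2]

print(c(DW = dw, lag1 = r1, approx_DW = 2 * (1 - r1)))
```

A statistic well below 2 together with a positive lag-1 autocorrelation points to positive serial correlation in the residuals.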

Introduction to Multiple Regression:

Multiple regression is a flexible method of data analysis that may be appropriate whenever a quantitative variable (the dependent variable) is to be examined in relationship to any other factors (expressed as independent or predictor variables). Relationships may be nonlinear, independent variables may be quantitative or qualitative, and one can examine the effects of a single variable or multiple variables with or without the effects of other variables taken into account.

Many practical questions involve the relationship between a dependent variable of interest (call it Y) and a set of k independent variables or potential predictor variables (call them X1, X2, X3,..., Xk), where the scores on all variables are measured for N cases. For example, you might be interested in predicting performance on a job (Y) using information on years of experience (X1), performance in a training program (X2), and performance on an aptitude test (X3). A multiple regression equation for predicting Y can be expressed as follows:

$$Y' = A + B_1 X_1 + B_2 X_2 + \dots + B_k X_k$$

To apply the equation, each Xj score for an individual case is multiplied by the corresponding Bj value, the products are added together, and the constant A is added to the sum. The result is Y', the predicted Y value for the case.

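Multiple Regression in R:

The example below uses a data frame named datavar with 29 yearly observations. After the row number, the columns are a year index (1 to 29), ROLL (fall enrollment), UNEM (unemployment rate in percent), HGRAD (number of spring high school graduates) and INC (per capita income).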

1 1 5501 8.1 9552 1923

2 2 5945 7.0 9680 1961

3 3 6629 7.3 9731 1979

4 4 7556 7.5 11666 2030

5 5 8716 7.0 14675 2112

6 6 9369 6.4 15265 2192

7 7 9920 6.5 15484 2235

8 8 10167 6.4 15723 2351

9 9 11084 6.3 16501 2411

10 10 12504 7.7 16890 2475

11 11 13746 8.2 17203 2524

12 12 13656 7.5 17707 2674

13 13 13850 7.4 18108 2833

14 14 14145 8.2 18266 2863

15 15 14888 10.1 19308 2839

16 16 14991 9.2 18224 2898

17 17 14836 7.7 18997 3123

18 18 14478 5.7 19505 3195

19 19 14539 6.5 19800 3239

20 20 14395 7.5 19546 3129

21 21 14599 7.3 19117 3100

22 22 14969 9.2 18774 3008

23 23 15107 10.1 17813 2983

24 24 14831 7.5 17304 3069

25 25 15081 8.8 16756 3151

26 26 15127 9.1 16749 3127

27 27 15856 8.8 16925 3179

28 28 15938 7.8 17231 3207

29 29 16081 7.0 16816 3345

> #attach data variable

> attach(datavar)

> #two predictor model

> #create a linear model using lm(FORMULA, DATAVAR)

> #predict the fall enrollment (ROLL) using the unemployment rate (UNEM) and number of spring high school graduates (HGRAD)

> twoPredictorModel <- lm(ROLL ~ UNEM + HGRAD, datavar)

> #display model

> twoPredictorModel

Call:

lm(formula = ROLL ~ UNEM + HGRAD, data = datavar)

Coefficients:

(Intercept)       UNEM      HGRAD
 -8255.7511   698.2681     0.9423

> #what is the expected fall enrollment (ROLL) given this year's unemployment rate (UNEM) of 9% and spring high school graduating class (HGRAD) of 100,000

> -8255.8 + 698.2 * 9 + 0.9 * 100000

[1] 88028

> #the predicted fall enrollment, given a 9% unemployment rate and 100,000 student spring high school graduating class, is 88,028 students.
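The calculation above rounds the coefficients before using them, and with a value as large as 100,000 the rounding of the HGRAD coefficient matters. The sketch below repeats the arithmetic with the coefficients exactly as printed by the two predictor model; the vector names are illustrative. Note also that 100,000 graduates lies far outside the range of HGRAD in the data (9,552 to 19,800), so either figure is an extrapolation.

```r
# Coefficients as printed for the two predictor model
b <- c(intercept = -8255.7511, UNEM = 698.2681, HGRAD = 0.9423)

# Predictor values: 1 for the intercept, 9% unemployment, 100,000 graduates
x <- c(1, 9, 100000)

# Prediction with the printed coefficients
pred_full <- sum(b * x)

# Prediction with the rounded coefficients used in the text
pred_rounded <- -8255.8 + 698.2 * 9 + 0.9 * 100000

print(c(full = pred_full, rounded = pred_rounded, gap = pred_full - pred_rounded))
```

The gap of roughly 4,200 students comes almost entirely from rounding 0.9423 to 0.9, which is a reason to prefer predict() or the stored coefficients over retyped values.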

> #three predictor model

> #create a linear model using lm(FORMULA, DATAVAR)

> #predict the fall enrollment (ROLL) using the unemployment rate (UNEM), number of spring high school graduates (HGRAD), and per capita income (INC)

> threePredictorModel <- lm(ROLL ~ UNEM + HGRAD + INC, datavar)

> #display model

> threePredictorModel

Call:

lm(formula = ROLL ~ UNEM + HGRAD + INC, data = datavar)

Coefficients:

(Intercept)       UNEM      HGRAD        INC
 -9153.2545   450.1245     0.4065     4.2749

Multicollinearity:

In statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regressions may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.
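The effect described above can be seen in a small experiment. The sketch below uses mtcars, a dataset that ships with R, in which car weight (wt) and engine displacement (disp) are strongly correlated; the dataset and the model names m1 and m2 are assumptions made for this illustration and are not part of the enrollment example.

```r
# Correlation between the two predictors
r_wt_disp <- cor(mtcars$wt, mtcars$disp)

# Model with wt alone, then with the correlated predictor disp added
m1 <- lm(mpg ~ wt, data = mtcars)
m2 <- lm(mpg ~ wt + disp, data = mtcars)

# Coefficient of wt and its standard error in each model
b_alone  <- coef(m1)["wt"]
b_with   <- coef(m2)["wt"]
se_alone <- summary(m1)$coefficients["wt", "Std. Error"]
se_with  <- summary(m2)$coefficients["wt", "Std. Error"]

print(c(cor_wt_disp = r_wt_disp))
print(c(b_alone = unname(b_alone), b_with_disp = unname(b_with)))
print(c(se_alone = se_alone, se_with_disp = se_with))
```

The coefficient of wt shifts noticeably and its standard error grows once disp is added, even though the model as a whole still predicts mpg well.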