Exploring Correlation in Statistics

Correlation is a measure of the strength of a linear relationship between two numerical variables.

June 21, 2017—Silvani Vejar

Why Do We Care About Correlation?

Correlation is important in statistics because it lets us quantify the relationship between two quantities. For example, you have probably wondered whether there is a relationship between the number of calories you consume and your weight, the number of fast-food restaurants in a neighborhood and its obesity rate, or income and mortality rates. The strength of a linear relationship between quantities like these is measured using the linear correlation coefficient. When we say two variables are correlated, we mean that they tend to vary together in a systematic way. But be careful: correlation does not imply causation.

The correlation coefficient r measures the strength of a linear relationship between two paired numerical variables. It describes both the direction of the relationship (positive or negative) and its strength (how well the data fit a straight-line pattern).
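For reference, the usual sample (Pearson) correlation coefficient can be written as follows. The symbols are assumptions introduced here, not notation from the notebook: the paired observations are (x_i, y_i) for i = 1, ..., n, and \bar{x} and \bar{y} are the sample means.

```latex
r = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}
         {\sqrt{\sum_{i=1}^{n} (x_i - \bar{x})^2}\;\sqrt{\sum_{i=1}^{n} (y_i - \bar{y})^2}}
```

The numerator is positive when x and y tend to lie on the same side of their means and negative when they tend to lie on opposite sides, which is where the sign of the coefficient comes from.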

With the help of the Wolfram Language and data from the Wolfram Data Repository, we will walk through calculating and interpreting the correlation coefficient.

Correlation Coefficient Properties



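r only measures the strength of a linear relationship. There may be nonlinear relationships that r does not detect.

r is always between –1 and 1 inclusive; –1 means a perfect negative linear correlation and 1 means a perfect positive linear correlation.

r has the same sign as the slope of the regression (best-fit) line.

r does not change if the independent (x) and dependent (y) variables are interchanged.

r does not change if the scale on either variable is changed. You may multiply or divide all the x values or y values by a positive constant, or add or subtract a constant, without changing the value of r. Multiplying by a negative constant reverses the sign of r.

r itself does not have a Student's t-distribution, but when there is no linear correlation in the population (and the data are approximately bivariate normal), the test statistic computed from r follows a Student's t-distribution with n - 2 degrees of freedom.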
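Some of these properties can be checked numerically. The following is a minimal sketch on hypothetical sample data: the lists x and y are made up for illustration and do not come from the notebook.

```wolfram
(* Hypothetical sample data, chosen only for illustration *)
x = {1., 2., 3., 4., 5., 6.};
y = {2.1, 3.9, 6.2, 7.8, 10.1, 12.2};

r = Correlation[x, y];

(* Interchanging x and y leaves r unchanged *)
Chop[Correlation[y, x] - r] == 0

(* Adding a constant and multiplying by a positive constant leaves r unchanged *)
Chop[Correlation[3 x + 10, y] - r] == 0

(* Multiplying one variable by a negative constant reverses the sign of r *)
Chop[Correlation[-2 x, y] + r] == 0
```

Each comparison uses Chop so that tiny floating-point differences are treated as zero.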

Comparing and Interpreting Linear-Strength Relationships from Scatter Plots

In[]:=

data=BlockRandom[SeedRandom[1];Table[RandomVariate[BinormalDistribution[i],2000],{i,{-.99,-.75,-.5,-.25,0.,.25,.5,.75,.99}}]];

In[]:=

Grid[Partition[Table[ListPlot[i,PlotStyleDirective[PointSize[Tiny]],FrameTicksNone,FrameTrue,AxesNone,PlotLabelRow[{" : ",Correlation[i][[1,2]]}]],{i,data}],3]]

Out[]=


The data below shows the before and after weights of young women undergoing anorexia treatment. When you look at the data, what kinds of questions might you ask?

Obtain the data from the Wolfram Data Repository:

In[]:=

data=ResourceData["Sample Data: Anorexia Treatment"]

Out[]=

Treatment	EntryWeight	ExitWeight
Cont	80.7lb	80.2lb
Cont	89.4lb	80.1lb
Cont	91.8lb	86.4lb
Cont	74lb	86.3lb
Cont	78.1lb	76.1lb
Cont	88.3lb	78.1lb
Cont	87.3lb	75.1lb
Cont	75.1lb	86.7lb
Cont	80.6lb	73.5lb
Cont	78.4lb	84.6lb
Cont	77.6lb	77.4lb
Cont	88.7lb	79.5lb
Cont	81.3lb	89.6lb
Cont	78.1lb	81.4lb
Cont	70.5lb	81.8lb
Cont	77.3lb	77.3lb
Cont	85.2lb	84.2lb
Cont	86lb	75.4lb
Cont	84.1lb	79.5lb
Cont	79.7lb	73lb
showing 1–20 of 72

By simply looking at data in a table, it can be difficult to observe whether or not there is a linear relationship between the entry weights and exit weights of the young female patients. A scatter diagram can help us to visualize the distribution of the data.

In[]:=

ListPlot[data[All,{"EntryWeight","ExitWeight"}],PlotLabel->"Anorexia data on weight change for young female patients",AxesLabel->{"Entry Weight","Exit Weight"}]

Out[]=

By looking at the scatter plot, would you say there is a correlation between the entry and exit weights of the young female patients? The data is widely scattered, and a visual inspection suggests that there is no strong correlation between these two variables. But to be sure, let’s calculate the value of r.

Calculate r:

In[]:=

Correlation[Normal@data[All,"EntryWeight"],Normal@data[All,"ExitWeight"]]

Out[]=

0.332406

Checking for Linear Correlation

We found the value of the linear correlation coefficient, r = 0.332406, but what does it mean? A value this close to 0 indicates only a weak positive linear relationship between the entry and exit weights. To decide whether even a weak relationship is more than chance, we can use our knowledge of hypothesis tests and p-values. Note that the correlation between entry and exit weights does not by itself show whether the anorexia treatments are effective; that question concerns how much the patients' weights changed under each treatment.

In cases where we find a strong linear correlation, we can proceed to a hypothesis test and find the best-fit line.

To interpret r, we refer to the computed p-value. If it is less than or equal to the significance level, we conclude there is a linear correlation.
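The p-value for a test of zero correlation is usually obtained from a t statistic built from r and the sample size n. As a worked sketch, assuming n = 72 (the number of rows reported for the dataset) and the value r = 0.332406 computed above:

```latex
t = \frac{r\,\sqrt{n-2}}{\sqrt{1-r^{2}}}
  = \frac{0.332406\,\sqrt{70}}{\sqrt{1-0.332406^{2}}}
  \approx 2.95
```

Under the null hypothesis of no linear correlation, this statistic is compared with a Student's t-distribution with n - 2 = 70 degrees of freedom.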

Hypothesis Test Using the p-Value from a t-Test

Using the previous example, we can conduct a formal hypothesis test of the claim that there is linear correlation between the entry and exit weights.

To claim that there is linear correlation implies that the linear correlation coefficient is not zero:

H₀: ρ = 0: There is no linear correlation

H₁: ρ ≠ 0: There is a linear correlation
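A two-sided p-value for these hypotheses can be computed directly from r and the sample size. The following is a minimal sketch, assuming n = 72 (the number of rows reported for the dataset) and the standard t statistic for a correlation coefficient; the names r, n, t and pValue are illustrative and do not appear in the notebook.

```wolfram
(* Assumed inputs: r from the Correlation call above, n from the dataset size *)
r = 0.332406;
n = 72;

(* Standard t statistic for testing rho = 0 *)
t = r Sqrt[n - 2]/Sqrt[1 - r^2];

(* Two-sided p-value from a Student's t-distribution with n - 2 degrees of freedom *)
pValue = 2 (1 - CDF[StudentTDistribution[n - 2], Abs[t]]);

{t, pValue}
```

The resulting pValue is then compared with the chosen significance level, for example 0.05.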


Plot the data points and the regression line:
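The cell that defines model is not shown here. The following is a minimal sketch, assuming model was fitted with LinearModelFit to the (entry weight, exit weight) pairs and that data is the Dataset loaded earlier. The names entryWeights, exitWeights and pairs are introduced only for illustration.

```wolfram
(* Assumption: data is the Dataset returned by ResourceData above *)
(* Strip the units so that the weights are plain numbers (pounds) *)
entryWeights = QuantityMagnitude[Normal@data[All, "EntryWeight"]];
exitWeights = QuantityMagnitude[Normal@data[All, "ExitWeight"]];
pairs = Transpose[{entryWeights, exitWeights}];

(* Fit exit weight as a linear function of entry weight *)
model = LinearModelFit[pairs, x, x];

(* Overlay the fitted line on the scatter plot *)
Show[
 ListPlot[pairs, AxesLabel -> {"Entry Weight", "Exit Weight"}],
 Plot[model[x], {x, Min[entryWeights], Max[entryWeights]}, PlotStyle -> Red]
]
```

With model defined this way, properties such as model["ParameterConfidenceIntervalTable"] and model["RSquared"] can be evaluated as in the cells that follow.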

In[]:=

model["ParameterConfidenceIntervalTable"]

Out[]=

	Estimate	Standard Error	Confidence Interval
1	42.7006	14.4311	{13.9186,71.4826}
x	0.51538	0.174777	{0.166799,0.863962}

Coefficient of Determination r²: Explained Variation

In cases where we conclude there is a linear correlation between the two variables, we can find a linear equation that expresses y in terms of x (linear regression). The value of r² is the proportion of the variation in y that is explained by the linear relationship between the two variables.

In[]:=

model["RSquared"]

Out[]=

0.110494
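For a simple linear regression with a single predictor, this value equals the square of the correlation coefficient, which gives a quick consistency check against the r computed earlier:

```latex
r^{2} = (0.332406)^{2} \approx 0.1105
```

In other words, only about 11% of the variation in exit weight is explained by the linear relationship with entry weight.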

What does it mean to say there is a positive or negative linear relationship between two variables?

Practice Questions

Using the Data Repository, choose a set of data and create your own notebook to do the following:

Load data

Create a scatter plot

Based on the scatter plot, provide an estimation for the value of the correlation coefficient between the two variables and explain your answer.

Compute the value of the correlation coefficient.

FURTHER EXPLORATIONS

Multivariate correlation
Nonlinear regression

AUTHORSHIP INFORMATION

Silvani Vejar

06/21/17
