Question 1

What is correlation and how is it measured?

Accepted Answer

Correlation measures the strength and direction of the linear relationship between two variables. It ranges from -1 to +1, where +1 indicates a perfect positive linear relationship (as one variable increases, the other increases proportionally), -1 indicates a perfect negative linear relationship (as one increases, the other decreases proportionally), and 0 indicates no linear relationship, although a nonlinear relationship may still be present. The most common measure is the Pearson correlation coefficient (r), which quantifies linear relationships. The Spearman rank correlation coefficient (rho) measures monotonic relationships and is more robust to outliers. Correlation does not imply causation; two variables can be highly correlated because both depend on a common third variable, or by coincidence.
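As a rough illustration of how Pearson r is computed, the sketch below divides the sum of cross-deviations by the square root of the product of the summed squared deviations, which is the standard definition. The function name `pearson_r` and the study-hours data are made up for this example and are not part of the original answer.

```python
import math

def pearson_r(x, y):
    """Pearson r: co-deviation of x and y, scaled by their individual spreads."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    dx = [xi - mean_x for xi in x]   # deviations from the mean of x
    dy = [yi - mean_y for yi in y]   # deviations from the mean of y
    sxy = sum(a * b for a, b in zip(dx, dy))
    sxx = sum(a * a for a in dx)
    syy = sum(b * b for b in dy)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data, for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [52, 58, 61, 70, 74]

print(pearson_r(hours, scores))                      # strong positive, just below 1
print(pearson_r(hours, [2 * h + 1 for h in hours]))  # perfectly linear, increasing
print(pearson_r(hours, [-3 * h for h in hours]))     # perfectly linear, decreasing
```

The two perfectly linear cases land on +1 and -1 (up to floating-point rounding), matching the endpoints of the range described above.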

Question 2

What is the difference between Pearson and Spearman correlation?

Accepted Answer

Pearson correlation measures the strength of the linear relationship between two continuous variables. It uses the actual data values in its calculation, so it is sensitive to extreme outliers, and its significance tests assume that both variables are approximately normally distributed. Spearman rank correlation converts the data to ranks first, then computes Pearson correlation on the ranks. This makes Spearman robust to outliers, applicable to ordinal data, and able to detect monotonic (consistently increasing or decreasing) relationships that may not be linear. For example, an exponential relationship y = 2^x would have Pearson r less than 1 because the relationship is not linear, but Spearman rho would be exactly 1 because the relationship is perfectly monotonic. Use Pearson when the data are continuous, roughly normal, and you expect linearity. Use Spearman when the data are ordinal, contain outliers, or follow a monotonic but nonlinear pattern.
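The y = 2^x example can be checked with a small sketch. It assumes the usual convention of assigning tied values the average of their rank positions; the helper names `average_ranks` and `spearman_rho` are illustrative, not taken from any particular library.

```python
import math

def pearson_r(x, y):
    """Pearson r computed from deviations about the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def average_ranks(values):
    """Rank values from 1 to n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared_rank
        i = j + 1
    return ranks

def spearman_rho(x, y):
    """Spearman rho: Pearson r applied to the ranks of x and y."""
    return pearson_r(average_ranks(x), average_ranks(y))

x = list(range(1, 11))
y = [2 ** xi for xi in x]      # exponential, so monotonic but not linear

print(pearson_r(x, y))         # noticeably below 1
print(spearman_rho(x, y))      # 1, since the ranks of x and y agree exactly
```

Because ranking discards the spacing between values, any strictly increasing transformation of y leaves Spearman rho unchanged, while Pearson r changes with the shape of the curve.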

Question 3

What does statistical significance of correlation mean?

Accepted Answer

A significance test for correlation asks how likely a correlation at least as strong as the one observed would be if the true population correlation were zero. The test uses a t-statistic calculated as t = r * sqrt((n-2) / (1-r^2)), which follows a t-distribution with n-2 degrees of freedom under that null hypothesis. A small p-value (typically below 0.05) means the correlation is statistically significant: data like these would be unlikely if the population correlation were zero. The p-value is not the probability that the population correlation is zero. Significance also depends heavily on sample size: with large samples (n > 500), even tiny correlations like r = 0.10 become significant. Conversely, meaningful correlations may fail significance tests with small samples. Always report the correlation coefficient and the sample size alongside the p-value, not just whether the result is significant.
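To see the sample-size effect concretely, the sketch below evaluates the t-statistic from the formula above for a fixed r = 0.10 at several sample sizes. The function name `correlation_t_statistic` and the chosen sample sizes are assumptions made for this example. The cutoff of 1.96 is the large-sample approximation to the two-sided 5% critical value; exact t critical values are somewhat larger for small n, so treat the comparison as indicative only.

```python
import math

def correlation_t_statistic(r, n):
    """t = r * sqrt((n - 2) / (1 - r^2)), with n - 2 degrees of freedom."""
    if n < 3:
        raise ValueError("need at least 3 observations")
    if not -1 < r < 1:
        raise ValueError("r must lie strictly between -1 and 1")
    return r * math.sqrt((n - 2) / (1 - r * r))

APPROX_CRITICAL_T = 1.96  # assumption: large-sample, two-sided 5% cutoff

for n in (20, 100, 500, 2000):
    t = correlation_t_statistic(0.10, n)
    print(f"n={n:5d}  t={t:5.2f}  exceeds cutoff: {abs(t) > APPROX_CRITICAL_T}")
```

With r held at 0.10, the statistic stays below the cutoff at n = 20 and n = 100 and exceeds it at n = 500 and n = 2000, which is the pattern the answer describes.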

Question 4

What are the assumptions of Pearson correlation?

Accepted Answer

Pearson correlation relies on several assumptions for valid inference. Both variables should be continuous and measured on interval or ratio scales. The relationship should be approximately linear; Pearson correlation can underestimate the strength of curvilinear relationships. Both variables should be approximately normally distributed, especially for hypothesis testing with small samples. Observations should be independent of each other. There should be no influential outliers, as a single extreme point can dramatically inflate or deflate the correlation coefficient. Homoscedasticity (roughly equal variance of y across the range of x values) is assumed. When these assumptions are violated, consider Spearman rank correlation, Kendall's tau, or transforming the data before computing Pearson correlation.
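The outlier point is easy to demonstrate. In the sketch below, five made-up points have a Pearson correlation of zero, and appending a single extreme point at (20, 20) pushes r close to 1. The data are hypothetical and chosen only to make the effect obvious.

```python
import math

def pearson_r(x, y):
    """Pearson r computed from deviations about the means."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data with no linear trend.
x = [1, 2, 3, 4, 5]
y = [2, 4, 3, 4, 2]
r_without_outlier = pearson_r(x, y)

# The same data plus one extreme point far from the rest.
r_with_outlier = pearson_r(x + [20], y + [20])

print(r_without_outlier)   # zero: no linear relationship in the five points
print(r_with_outlier)      # close to 1, driven almost entirely by one point
```

A scatter plot of the data before computing r is usually the quickest way to spot a point like this.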

Scatter Plot Correlation Calculator

