The Chi-square test can be used to measure associations between purely categorical variables, while correlation coefficients measure relationships between ordinal, interval, and continuous variables.
## Types of Correlation Coefficients
Source: https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall
_Pearson's correlation coefficient_ is **parametric**, _Spearman's rank correlation coefficient_ and _Kendall's tau coefficient_ are **non-parametric**.
Pearson’s correlation measures a linear relationship between two variables, while Spearman’s and Kendall’s rank correlations measure a monotonic one (i.e. the paired data points may follow a curve, as long as it consistently increases or decreases).
### Pearson's Correlation Coefficient
- Each observation should have a pair of values.
- Each variable should be continuous.
- Neither variable should have outliers.
- Pearson's assumes _linearity_ and _homoscedasticity_.
### Spearman's Rank Correlation Coefficient
- Pairs of observations are independent.
- The two variables should be measured on an ordinal, interval or ratio scale.
- Spearman's assumes a _monotonic_ relationship between the two variables.
### Kendall's Tau Coefficient
- The same assumptions and requirements as for _Spearman's rank correlation coefficient_ apply.
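All three coefficients are available in `scipy.stats`; a minimal sketch on a small hypothetical paired sample:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# Paired observations (hypothetical sample data)
x = [2, 4, 6, 8, 10, 12]
y = [1, 3, 2, 5, 4, 6]

r, r_p = pearsonr(x, y)        # linear correlation + p-value
rho, rho_p = spearmanr(x, y)   # rank correlation + p-value
tau, tau_p = kendalltau(x, y)  # rank correlation + p-value

print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```

Each function returns the coefficient together with a p-value for the null hypothesis of no correlation.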
### Choosing the right Coefficient
#### Pearson correlation vs Spearman and Kendall correlation
- Non-parametric correlations are less powerful because they use less information in their calculations: _Pearson's correlation_ uses information about the mean and deviation from the mean, while non-parametric correlations use only the ordinal information and scores of pairs.
- For non-parametric correlations, the X and Y values can be continuous or ordinal, and approximately normal distributions for X and Y are not required. _Pearson's correlation_, by contrast, assumes that X and Y are continuous and normally distributed.
- Correlation coefficients only measure linear (_Pearson_) or monotonic (_Spearman_ and _Kendall_) relationships.
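The last point can be illustrated with a perfectly monotonic but non-linear relationship (hypothetical data): the rank correlations come out as exactly 1, while Pearson's r falls short of 1 because the relationship is not linear:

```python
from scipy.stats import pearsonr, spearmanr, kendalltau

# y = x**3 is strictly monotonic in x, but not linear
x = list(range(1, 11))
y = [v ** 3 for v in x]

r, _ = pearsonr(x, y)
rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)

print(f"Pearson r = {r:.3f}")       # < 1: relationship is not linear
print(f"Spearman rho = {rho:.3f}")  # 1.0: relationship is monotonic
print(f"Kendall tau = {tau:.3f}")   # 1.0: all pairs are concordant
```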
#### Spearman correlation vs Kendall correlation
- _Kendall correlation_ is more robust and more efficient than _Spearman correlation_, so _Kendall correlation_ is preferred when there are small samples or some outliers.
- _Kendall correlation_ has O(n^2) computational complexity compared to the O(n log n) of _Spearman correlation_, where n is the sample size.
- _Spearman’s rho_ usually is larger than _Kendall’s tau_.
- The interpretation of _Kendall’s tau_ in terms of the probabilities of observing the agreeable (concordant) and non-agreeable (discordant) pairs is very direct.
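The first and third points can be seen on a small, mostly-monotonic sample (hypothetical data): both coefficients pick up the trend, but rho comes out larger than tau:

```python
from scipy.stats import spearmanr, kendalltau

# Mostly increasing data with a few swapped neighbours
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [1, 3, 2, 4, 6, 5, 8, 7]

rho, _ = spearmanr(x, y)
tau, _ = kendalltau(x, y)

print(f"Spearman rho = {rho:.3f}")  # ~0.929
print(f"Kendall tau = {tau:.3f}")   # ~0.786
```

With three swapped neighbouring pairs out of 28 possible pairs, tau is (25 - 3) / 28 ≈ 0.786, which directly reflects the concordant/discordant interpretation above.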
## Testing Correlation Strength
### Chi-square test of independence for categorical variables
- Source: https://www.pythonfordatascience.org/chi-square-test-of-independence-python/
- Source: https://freedium.cfd/https://towardsdatascience.com/how-strongly-associated-are-your-variables-80493127b3a2
#### Creating a Contingency Table
Contingency tables can be created from a per-sample Pandas DataFrame over two categorical variables using [pd.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html):
```python
import pandas as pd
# Create a contingency table from two categorical variables
crosstab = pd.crosstab(df['A'], df['B'])
```
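For example, on a small hypothetical DataFrame with two categorical columns:

```python
import pandas as pd

# Hypothetical per-sample data: one row per observation
df = pd.DataFrame({
    "A": ["yes", "yes", "no", "no", "yes", "no"],
    "B": ["left", "right", "left", "right", "right", "left"],
})

# Rows are the levels of A, columns the levels of B,
# cells the observed co-occurrence counts
crosstab = pd.crosstab(df["A"], df["B"])
print(crosstab)
```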
#### Calculating Chi-square and Cramer's V
Cramer's V for a Chi-square test of independence can be calculated easily "by hand" as follows:
```python
import numpy as np
from scipy.stats import chi2_contingency

# Chi-square test of independence (without Yates' continuity correction)
chi2, p_value, deg_of_free, expected_freq = chi2_contingency(crosstab, correction=False)

# Calculate Cramer's V from the Chi-square statistic
N = crosstab.values.sum()  # total number of observations
k = min(crosstab.shape)    # smaller number of categories
V = np.sqrt(chi2 / (N * (k - 1)))
```
Cramer's V describes the strength of the association between the two categorical variables:
| Phi and Cramer's V | Interpretation |
| ------------------ | --------------- |
| >0.25 | Very strong |
| >0.15 | Strong |
| >0.10 | Moderate |
| >0.05 | Weak |
| >0 | No or very weak |
Ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107969/
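Putting the pieces together, a small helper (a sketch; the thresholds follow the table above, and the example table is hypothetical) can compute Cramer's V and its interpretation in one call:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

def cramers_v(crosstab: pd.DataFrame) -> float:
    """Cramer's V for a two-way contingency table."""
    chi2, p_value, dof, expected = chi2_contingency(crosstab, correction=False)
    n = crosstab.values.sum()
    k = min(crosstab.shape)
    return float(np.sqrt(chi2 / (n * (k - 1))))

def interpret(v: float) -> str:
    """Map V to the interpretation bands in the table above."""
    if v > 0.25:
        return "Very strong"
    if v > 0.15:
        return "Strong"
    if v > 0.10:
        return "Moderate"
    if v > 0.05:
        return "Weak"
    return "No or very weak"

# Hypothetical example: a strongly associated 2x2 table
table = pd.DataFrame([[30, 5], [4, 31]])
v = cramers_v(table)
print(f"V = {v:.3f} ({interpret(v)})")
```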