The Chi-square test can be used to measure associations between purely categorical variables, while correlation coefficients can be used to measure relationships between ordinal, interval, and continuous variables.

## Types of Correlation Coefficients

Source: https://datascience.stackexchange.com/questions/64260/pearson-vs-spearman-vs-kendall

_Pearson's correlation coefficient_ is **parametric**, while _Spearman's rank correlation coefficient_ and _Kendall's tau coefficient_ are **non-parametric**. Pearson's correlation measures a linear relationship between two variables, while Spearman's and Kendall's rank correlations measure a monotonic one (i.e. the paired datapoints can form a curved, but consistently increasing or decreasing, line).

### Pearson's Correlation Coefficient

- Each observation should have a pair of values.
- Each variable should be continuous.
- Neither variable should have outliers.
- Pearson's assumes _linearity_ and _homoscedasticity_.

### Spearman's Rank Correlation Coefficient

- Pairs of observations are independent.
- The two variables should be measured on an ordinal, interval, or ratio scale.
- Spearman's assumes a _monotonic_ relationship between the two variables.

### Kendall's Tau Coefficient

- The same assumptions and requirements as for _Spearman's rank correlation coefficient_ apply.

### Choosing the right Coefficient

#### Pearson correlation vs Spearman and Kendall correlation

- Non-parametric correlations are less powerful because they use less information in their calculations: _Pearson's correlation_ uses information about the mean and deviation from the mean, while non-parametric correlations use only the ordinal information and scores of pairs.
- For non-parametric correlation, the X and Y values can be continuous or ordinal, and approximately normal distributions for X and Y are not required. _Pearson's correlation_, by contrast, assumes that the distributions of X and Y are normal and continuous.
- Correlation coefficients only measure linear (_Pearson_) or monotonic (_Spearman_ and _Kendall_) relationships.

#### Spearman correlation vs Kendall correlation

- In the normal case, _Kendall correlation_ is more robust and efficient than _Spearman correlation_, so _Kendall correlation_ is preferred when there are small samples or some outliers.
- _Kendall correlation_ has O(n^2) computational complexity, compared with the O(n log n) of _Spearman correlation_, where n is the sample size.
- _Spearman's rho_ is usually larger than _Kendall's tau_.
- _Kendall's tau_ has a very direct interpretation in terms of the probabilities of observing agreeable (concordant) and non-agreeable (discordant) pairs.

## Testing Correlation Strength

### Chi-square test of independence for categorical variables

- Source: https://www.pythonfordatascience.org/chi-square-test-of-independence-python/
- Source: https://freedium.cfd/https://towardsdatascience.com/how-strongly-associated-are-your-variables-80493127b3a2

#### Creating a Contingency Table

Contingency tables can be created from a per-sample Pandas DataFrame over two categorical variables using [pd.crosstab](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.crosstab.html):

```python
import pandas as pd

# Create a contingency table from two categorical variables
crosstab = pd.crosstab(df['A'], df['B'])
```

#### Calculating Chi-square and Cramer's V

Cramer's V for a Chi-square test of independence can be calculated easily "by hand" as follows:

```python
import numpy as np
from scipy.stats import chi2_contingency

# Run the Chi-square test on the contingency table
result = chi2_contingency(crosstab, correction=False)
chi2, p_value, deg_of_free, expected_freq = result

N = crosstab.values.sum()  # total number of observations
k = min(crosstab.shape)    # number of categories of the smaller variable

# Calculate Cramer's V
V = np.sqrt(chi2 / (N * (k - 1)))
```

It describes the strength of the association between the two categorical variables:

| Phi and Cramer's V | Interpretation  |
| ------------------ | --------------- |
| >0.25              | Very strong     |
| >0.15              | Strong          |
| >0.10              | Moderate        |
| >0.05              | Weak            |
| >0                 | No or very weak |

Ref: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6107969/
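Putting the two snippets above together, here is a self-contained sketch of the full workflow. The variable names (`device`, `converted`) and the sampling probabilities are made up for illustration; the two columns are generated so that they are deliberately associated:

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

# Synthetic sample: two categorical variables with a built-in association
rng = np.random.default_rng(0)
device = rng.choice(["mobile", "desktop"], size=500)
# "converted" depends on "device", so the variables are associated
converted = np.where(
    device == "mobile",
    rng.choice(["yes", "no"], size=500, p=[0.7, 0.3]),
    rng.choice(["yes", "no"], size=500, p=[0.3, 0.7]),
)
df = pd.DataFrame({"device": device, "converted": converted})

# Contingency table, Chi-square test, and Cramer's V as above
crosstab = pd.crosstab(df["device"], df["converted"])
chi2, p_value, dof, expected = chi2_contingency(crosstab, correction=False)

N = crosstab.values.sum()
k = min(crosstab.shape)
V = np.sqrt(chi2 / (N * (k - 1)))

print(f"chi2 = {chi2:.2f}, p = {p_value:.4f}, Cramer's V = {V:.3f}")
```

With this strongly associated synthetic data, V lands well above the 0.25 threshold, i.e. "Very strong" in the interpretation table above.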
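Finally, to illustrate the linear-vs-monotonic distinction from the correlation-coefficient section above, a minimal sketch on synthetic data: `y = x**3` is perfectly monotonic but not linear, so the rank correlations are exactly 1 while Pearson's r is not:

```python
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau

# A perfectly monotonic but non-linear relationship: y = x^3
x = np.arange(1, 21, dtype=float)
y = x ** 3

r, _ = pearsonr(x, y)      # linear association: below 1
rho, _ = spearmanr(x, y)   # monotonic association: exactly 1
tau, _ = kendalltau(x, y)  # monotonic association: exactly 1

print(f"Pearson r:    {r:.3f}")
print(f"Spearman rho: {rho:.3f}")
print(f"Kendall tau:  {tau:.3f}")
```

Spearman and Kendall only look at the ordering of the pairs, which the cubic transformation preserves, so they report a perfect monotonic relationship; Pearson's r is penalized by the curvature.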