
### Partial Correlation

Partial correlation, or conditional correlation, is a way to test for correlation between two variables, X and Y, given a third, Z. This technique is often useful in causal research when trying to discover confounding relationships.

# Purpose¶

The purpose of this notebook is to understand how to compute the partial correlation between two variables, $X$ and $Y$, given a third, $Z$. In particular, these variables are assumed to be Gaussian (or, in general, jointly multivariate Gaussian).

Why is it important to estimate partial correlations? The primary reason is to detect confounding variables during causal analysis.

# Let's go¶

Let's start by simulating 3 data sets. Graphically, these data sets come from graphs represented by the following.

• serial: X -> Z -> Y
• diverging: X <- Z -> Y
• converging: X -> Z <- Y

In the serial graph, X causes Z and Z causes Y. In the diverging graph, Z causes both X and Y. In the converging graph, X and Y cause Z. Below, the serial, diverging, and converging data sets are named S, D, and C, correspondingly.

Note that in the serial graph, the data is sampled as follows, where $\epsilon \sim \mathcal{N}(0, 0.2^2)$ denotes an independent noise term (a fresh draw for each variable, matching the code below).

• $X \sim \mathcal{N}(0, 1)$
• $Z = 2 + 1.8 \times X + \epsilon$
• $Y = 5 + 2.7 \times Z + \epsilon$

In the diverging graph, the data is sampled as follows.

• $Z \sim \mathcal{N}(0, 1)$
• $X = 4.3 + 3.3 \times Z + \epsilon$
• $Y = 5.0 + 2.7 \times Z + \epsilon$

Lastly, in the converging graph, the data is sampled as follows.

• $X \sim \mathcal{N}(0, 1)$
• $Y \sim \mathcal{N}(5.5, 1)$
• $Z = 2.0 + 0.8 \times X + 1.2 \times Y + \epsilon$

Note that the order in which the variables are sampled follows the structure of the corresponding graph.
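Before running any code, the serial case already suggests what to expect. Substituting the equation for $Z$ into the equation for $Y$ (ignoring the small noise terms) shows that $Y$ is nearly a linear function of $X$, which is why the marginal correlation will be high.

```latex
\begin{aligned}
Y &= 5 + 2.7 Z \\
  &= 5 + 2.7\,(2 + 1.8 X) \\
  &= 10.4 + 4.86 X
\end{aligned}
```

Once $Z$ is held fixed, however, $Y$ no longer varies with $X$ at all, so the partial correlation should vanish.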

In [1]:
import numpy as np

np.random.seed(37)

def get_error(N=10000, mu=0.0, std=0.2):
    return np.random.normal(mu, std, N)

def to_matrix(X, Z, Y):
    return np.concatenate([
        X.reshape(-1, 1),
        Z.reshape(-1, 1),
        Y.reshape(-1, 1)], axis=1)

def get_serial(N=10000, e_mu=0.0, e_std=0.2):
    X = np.random.normal(0, 1, N) + get_error(N, e_mu, e_std)
    Z = 2 + 1.8 * X + get_error(N, e_mu, e_std)
    Y = 5 + 2.7 * Z + get_error(N, e_mu, e_std)
    return to_matrix(X, Z, Y)

def get_diverging(N=10000, e_mu=0.0, e_std=0.2):
    Z = np.random.normal(0, 1, N) + get_error(N, e_mu, e_std)
    X = 4.3 + 3.3 * Z + get_error(N, e_mu, e_std)
    Y = 5 + 2.7 * Z + get_error(N, e_mu, e_std)
    return to_matrix(X, Z, Y)

def get_converging(N=10000, e_mu=0.0, e_std=0.2):
    X = np.random.normal(0, 1, N) + get_error(N, e_mu, e_std)
    Y = np.random.normal(5.5, 1, N) + get_error(N, e_mu, e_std)
    Z = 2 + 0.8 * X + 1.2 * Y + get_error(N, e_mu, e_std)
    return to_matrix(X, Z, Y)

In [2]:
S = get_serial()
D = get_diverging()
C = get_converging()


# Compute the partial correlation¶

For the three datasets, S, D, and C, we want to compute the partial correlation between $X$ and $Y$ given $Z$. The way to do this is as follows.

• Regress $X$ on $Z$ and also $Y$ on $Z$
• $X = b_X + w_X * Z$
• $Y = b_Y + w_Y * Z$
• With the new weights $(b_X, w_X)$ and $(b_Y, w_Y)$, predict $X$ and $Y$.
• $\hat{X} = b_X + w_X * Z$
• $\hat{Y} = b_Y + w_Y * Z$
• Now compute the residuals between the true and predicted values.
• $R_X = X - \hat{X}$
• $R_Y = Y - \hat{Y}$
• Finally, compute the Pearson correlation between $R_X$ and $R_Y$.

The correlation between the residuals is the partial correlation and ranges from -1 to +1. More interesting is the test of significance. If $p > \alpha$, where typically $\alpha \in \{0.1, 0.05, 0.01\}$, then assume conditional independence. For example, if $\alpha = 0.01$ and $p = 0.2$, then since $p > \alpha$, $X$ is judged to be conditionally independent of $Y$ given $Z$.
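Equivalently, the partial correlation can be computed in closed form from the three pairwise Pearson correlations, without fitting any regressions: $\rho_{XY \cdot Z} = (\rho_{XY} - \rho_{XZ}\rho_{ZY}) / \sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{ZY}^2)}$. A minimal sketch of this shortcut (the function name is illustrative, and the data here is regenerated in the serial style rather than reusing `S`):

```python
import numpy as np

def partial_corr_formula(x, z, y):
    # Pairwise Pearson correlations from the 3 x 3 correlation matrix.
    r = np.corrcoef(np.vstack([x, z, y]))
    r_xy, r_xz, r_zy = r[0, 2], r[0, 1], r[1, 2]
    # Closed-form partial correlation of x and y given z.
    return (r_xy - r_xz * r_zy) / np.sqrt((1 - r_xz ** 2) * (1 - r_zy ** 2))

# Quick check on serial-style data: X -> Z -> Y, so the partial
# correlation of x and y given z should be near zero.
rng = np.random.default_rng(37)
x = rng.normal(0, 1, 10000)
z = 2 + 1.8 * x + rng.normal(0, 0.2, 10000)
y = 5 + 2.7 * z + rng.normal(0, 0.2, 10000)
print(partial_corr_formula(x, z, y))
```

For jointly Gaussian variables this closed form agrees with the regress-and-correlate-residuals procedure above.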

In [3]:
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from scipy import stats

def get_cond_indep_test(c_xy_z, N=10000, alpha=0.01):
    point = stats.norm.ppf(1 - (alpha / 2.0))
    z_transform = np.sqrt(N - 3) * np.abs(0.5 * np.log((1 + c_xy_z) / (1 - c_xy_z)))
    return z_transform, point, z_transform > point

def get_partial_corr(M):
    X = M[:, 0]
    Z = M[:, 1].reshape(-1, 1)
    Y = M[:, 2]

    mXZ = LinearRegression()
    mXZ.fit(Z, X)
    pXZ = mXZ.predict(Z)
    rXZ = X - pXZ

    mYZ = LinearRegression()
    mYZ.fit(Z, Y)
    pYZ = mYZ.predict(Z)
    rYZ = Y - pYZ

    c_xy, p_xy = pearsonr(X, Y)
    c_xy_z, p_xy_z = pearsonr(rXZ, rYZ)

    return c_xy, p_xy, c_xy_z, p_xy_z
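`get_cond_indep_test` compares the Fisher z statistic against a critical value; the same statistic also yields a two-sided p-value directly. A small sketch (`fisher_z_pvalue` is an illustrative name, not part of the code above):

```python
import numpy as np
from scipy import stats

def fisher_z_pvalue(r, N):
    # Fisher z-transform of the (partial) correlation r; N - 3 reflects
    # the loss of degrees of freedom from conditioning on one variable.
    z = np.sqrt(N - 3) * np.abs(0.5 * np.log((1 + r) / (1 - r)))
    # Two-sided p-value under the null of zero partial correlation.
    return 2 * stats.norm.sf(z)

print(fisher_z_pvalue(-0.0149, 10000))  # roughly 0.14, far above alpha = 0.01
```

For large $N$ this closely tracks the p-value that `pearsonr` reports on the residuals.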


# Partial correlation for serial graph data¶

For X -> Z -> Y, note that the marginal correlation is high (0.99) and significant (p < 0.01). However, the partial correlation between X and Y given Z vanishes to -0.01 (p > 0.01). Note that the conditional independence test fails to reject the null hypothesis of zero partial correlation.

In [4]:
c_xy, p_xy, c_xy_z, p_xy_z = get_partial_corr(S)
print('corr_xy={:.5f}, p_xy={:.5f}, corr_xy_z={:.5f}, p_xy_z={:.5f}'
.format(c_xy, p_xy, c_xy_z, p_xy_z))
print(get_cond_indep_test(c_xy_z))

corr_xy=0.99331, p_xy=0.00000, corr_xy_z=-0.01493, p_xy_z=0.13543
(1.4930316470699307, 2.5758293035489004, False)


# Partial correlation for diverging graph data¶

For X <- Z -> Y, note that the marginal correlation is high (0.99) and significant (p < 0.01). However, the partial correlation between X and Y given Z vanishes to 0.01 (p > 0.01). Note that the conditional independence test fails to reject the null hypothesis of zero partial correlation.

In [5]:
c_xy, p_xy, c_xy_z, p_xy_z = get_partial_corr(D)
print('corr_xy={:.5f}, p_xy={:.5f}, corr_xy_z={:.5f}, p_xy_z={:.5f}'
.format(c_xy, p_xy, c_xy_z, p_xy_z))
print(get_cond_indep_test(c_xy_z))

corr_xy=0.99575, p_xy=0.00000, corr_xy_z=0.01155, p_xy_z=0.24815
(1.1548311182263977, 2.5758293035489004, False)


# Partial correlation for converging graph data¶

For X -> Z <- Y, note that the marginal correlation is near zero (-0.00) and insignificant (p > 0.01). However, the partial correlation between X and Y given Z jumps to -0.96 and becomes significant (p < 0.01)! Note that the conditional independence test rejects the null hypothesis. This is the classic collider effect: conditioning on a common effect Z induces a dependency between its otherwise independent causes X and Y.

In [6]:
c_xy, p_xy, c_xy_z, p_xy_z = get_partial_corr(C)
print('corr_xy={:.5f}, p_xy={:.5f}, corr_xy_z={:.5f}, p_xy_z={:.5f}'
.format(c_xy, p_xy, c_xy_z, p_xy_z))
print(get_cond_indep_test(c_xy_z))

corr_xy=-0.00269, p_xy=0.78774, corr_xy_z=-0.95791, p_xy_z=0.00000
(191.96010513726856, 2.5758293035489004, True)
