Partial correlation, also called conditional correlation, measures the association between two variables, X and Y, while controlling for a third, Z. This technique is often useful in causal research when trying to detect confounding relationships.

The purpose of this notebook is to show how to compute the partial correlation between two variables, $X$ and $Y$, given a third, $Z$. In particular, these variables are assumed to be Gaussian (or, in general, jointly multivariate Gaussian).

Why is it important to estimate partial correlations? The primary reason is to detect confounding variables during causal analysis.

Let's start by simulating 3 data sets. Graphically, these data sets come from the following graphs.

- serial: X -> Z -> Y
- diverging: X <- Z -> Y
- converging: X -> Z <- Y

In the serial graph, X causes Z and Z causes Y. In the diverging graph, Z causes both X and Y. In the converging graph, X and Y both cause Z. Below, the serial, diverging, and converging data sets are named S, D, and C, respectively.

Note that in the serial graph, the data is sampled as follows, where each variable also receives additive Gaussian noise $\epsilon \sim \mathcal{N}(0, 0.2^2)$ (see the code below).

- $X \sim \mathcal{N}(0, 1)$
- $Z = 2 + 1.8 X + \epsilon$
- $Y = 5 + 2.7 Z + \epsilon$

In the diverging graph, the data is sampled as follows.

- $Z \sim \mathcal{N}(0, 1)$
- $X = 4.3 + 3.3 Z + \epsilon$
- $Y = 5.0 + 2.7 Z + \epsilon$

Lastly, in the converging graph, the data is sampled as follows.

- $X \sim \mathcal{N}(0, 1)$
- $Y \sim \mathcal{N}(5.5, 1)$
- $Z = 2.0 + 0.8 X + 1.2 Y + \epsilon$

Note that the order in which the variables are sampled follows the structure of the corresponding graph.

In [1]:

```
import numpy as np

np.random.seed(37)


def get_error(N=10000, mu=0.0, std=0.2):
    """Samples N points of Gaussian noise."""
    return np.random.normal(mu, std, N)


def to_matrix(X, Z, Y):
    """Stacks X, Z, Y as the columns of an N x 3 matrix."""
    return np.concatenate([
        X.reshape(-1, 1),
        Z.reshape(-1, 1),
        Y.reshape(-1, 1)], axis=1)


def get_serial(N=10000, e_mu=0.0, e_std=0.2):
    """Samples data from the serial graph X -> Z -> Y."""
    X = np.random.normal(0, 1, N) + get_error(N, e_mu, e_std)
    Z = 2 + 1.8 * X + get_error(N, e_mu, e_std)
    Y = 5 + 2.7 * Z + get_error(N, e_mu, e_std)
    return to_matrix(X, Z, Y)


def get_diverging(N=10000, e_mu=0.0, e_std=0.2):
    """Samples data from the diverging graph X <- Z -> Y."""
    Z = np.random.normal(0, 1, N) + get_error(N, e_mu, e_std)
    X = 4.3 + 3.3 * Z + get_error(N, e_mu, e_std)
    Y = 5 + 2.7 * Z + get_error(N, e_mu, e_std)
    return to_matrix(X, Z, Y)


def get_converging(N=10000, e_mu=0.0, e_std=0.2):
    """Samples data from the converging graph X -> Z <- Y."""
    X = np.random.normal(0, 1, N) + get_error(N, e_mu, e_std)
    Y = np.random.normal(5.5, 1, N) + get_error(N, e_mu, e_std)
    Z = 2 + 0.8 * X + 1.2 * Y + get_error(N, e_mu, e_std)
    return to_matrix(X, Z, Y)
```

In [2]:

```
S = get_serial()
D = get_diverging()
C = get_converging()
```

For the three data sets, S, D, and C, we want to compute the partial correlation between $X$ and $Y$ given $Z$. The procedure is as follows.

- Regress $X$ on $Z$ and, separately, $Y$ on $Z$.
- $X = b_X + w_X Z$
- $Y = b_Y + w_Y Z$

- With the fitted coefficients $(b_X, w_X)$ and $(b_Y, w_Y)$, predict $X$ and $Y$.
- $\hat{X} = b_X + w_X Z$
- $\hat{Y} = b_Y + w_Y Z$

- Now compute the residuals between the true and predicted values.
- $R_X = X - \hat{X}$
- $R_Y = Y - \hat{Y}$

- Finally, compute the Pearson correlation between $R_X$ and $R_Y$.

The correlation between the residuals is the partial correlation and ranges from -1 to +1. More interesting is the test of significance. If $p > \alpha$, where $\alpha$ is a chosen significance level (commonly 0.1, 0.05, or 0.01), then assume conditional independence. For example, if $\alpha = 0.01$ and $p = 0.2$, then $X$ is conditionally independent of $Y$ given $Z$. The code below also applies Fisher's $z$-transform to the partial correlation for a conditional independence test.
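For jointly Gaussian variables, the same partial correlation also has a closed form in terms of the pairwise Pearson correlations (a standard identity, stated here for reference):

$$\rho_{XY \cdot Z} = \frac{\rho_{XY} - \rho_{XZ}\,\rho_{YZ}}{\sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}}$$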

In [3]:

```
from sklearn.linear_model import LinearRegression
from scipy.stats import pearsonr
from scipy import stats


def get_cond_indep_test(c_xy_z, N=10000, alpha=0.01):
    """Fisher z-test of a partial correlation; True means reject independence."""
    point = stats.norm.ppf(1 - (alpha / 2.0))
    z_transform = np.sqrt(N - 3) * np.abs(0.5 * np.log((1 + c_xy_z) / (1 - c_xy_z)))
    return z_transform, point, z_transform > point


def get_partial_corr(M):
    """Computes corr(X, Y) and corr(X, Y | Z) for a matrix with columns X, Z, Y."""
    X = M[:, 0]
    Z = M[:, 1].reshape(-1, 1)
    Y = M[:, 2]

    # regress X on Z and keep the residuals
    mXZ = LinearRegression()
    mXZ.fit(Z, X)
    rXZ = X - mXZ.predict(Z)

    # regress Y on Z and keep the residuals
    mYZ = LinearRegression()
    mYZ.fit(Z, Y)
    rYZ = Y - mYZ.predict(Z)

    # marginal correlation and partial correlation (correlation of the residuals)
    c_xy, p_xy = pearsonr(X, Y)
    c_xy_z, p_xy_z = pearsonr(rXZ, rYZ)
    return c_xy, p_xy, c_xy_z, p_xy_z
```
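As a cross-check (a sketch, not a cell from the original notebook; `partial_corr_precision` is a hypothetical helper name), the same partial correlation can be read off the precision matrix, the inverse of the covariance matrix: the partial correlation between columns $i$ and $j$ given all remaining columns is $-\Omega_{ij} / \sqrt{\Omega_{ii}\,\Omega_{jj}}$.

```python
import numpy as np


def partial_corr_precision(M):
    """Partial correlations of each pair of columns given all the others,
    derived from the precision (inverse covariance) matrix."""
    omega = np.linalg.inv(np.cov(M, rowvar=False))
    d = np.sqrt(np.diag(omega))
    P = -omega / np.outer(d, d)
    np.fill_diagonal(P, 1.0)
    return P


# example: diverging graph X <- Z -> Y, simulated as in the notebook
np.random.seed(37)
N = 10000
Z = np.random.normal(0, 1, N)
X = 4.3 + 3.3 * Z + np.random.normal(0, 0.2, N)
Y = 5.0 + 2.7 * Z + np.random.normal(0, 0.2, N)
M = np.column_stack([X, Z, Y])

P = partial_corr_precision(M)
# P[0, 2] is corr(X, Y | Z), which should be near zero for a diverging graph
print(P[0, 2])
```

This should agree (up to sampling noise) with the residual-regression approach in `get_partial_corr` above.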

For X -> Z -> Y, note that the marginal correlation between X and Y is high (0.99) and significant (p < 0.01). However, given Z, the correlation vanishes to -0.01 (p > 0.01), and the conditional independence test fails to reject the null hypothesis of zero partial correlation.

In [4]:

```
c_xy, p_xy, c_xy_z, p_xy_z = get_partial_corr(S)
print('corr_xy={:.5f}, p_xy={:.5f}, corr_xy_z={:.5f}, p_xy_z={:.5f}'
      .format(c_xy, p_xy, c_xy_z, p_xy_z))
print(get_cond_indep_test(c_xy_z))
```

For X <- Z -> Y, note that the marginal correlation between X and Y is high (0.99) and significant (p < 0.01). However, given Z, the correlation vanishes to 0.01 (p > 0.01), and the conditional independence test fails to reject the null hypothesis of zero partial correlation.

In [5]:

```
c_xy, p_xy, c_xy_z, p_xy_z = get_partial_corr(D)
print('corr_xy={:.5f}, p_xy={:.5f}, corr_xy_z={:.5f}, p_xy_z={:.5f}'
      .format(c_xy, p_xy, c_xy_z, p_xy_z))
print(get_cond_indep_test(c_xy_z))
```

For X -> Z <- Y, note that the marginal correlation is low (-0.00) and insignificant (p > 0.01). However, given Z, the correlation between X and Y jumps to -0.96 and becomes significant (p < 0.01)! The conditional independence test rejects the null hypothesis: conditioning on a collider induces dependence between its causes.

In [6]:

```
c_xy, p_xy, c_xy_z, p_xy_z = get_partial_corr(C)
print('corr_xy={:.5f}, p_xy={:.5f}, corr_xy_z={:.5f}, p_xy_z={:.5f}'
      .format(c_xy, p_xy, c_xy_z, p_xy_z))
print(get_cond_indep_test(c_xy_z))
```
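As a final sanity check (a sketch with freshly simulated data, not a cell from the original notebook), the closed-form identity for jointly Gaussian variables, $\rho_{XY \cdot Z} = (\rho_{XY} - \rho_{XZ}\rho_{YZ}) / \sqrt{(1 - \rho_{XZ}^2)(1 - \rho_{YZ}^2)}$, reproduces the strong negative partial correlation on converging (collider) data:

```python
import numpy as np
from scipy.stats import pearsonr

# simplified converging graph X -> Z <- Y, as in the notebook
np.random.seed(37)
N = 10000
X = np.random.normal(0, 1, N)
Y = np.random.normal(5.5, 1, N)
Z = 2.0 + 0.8 * X + 1.2 * Y + np.random.normal(0, 0.2, N)

# pairwise Pearson correlations
r_xy = pearsonr(X, Y)[0]
r_xz = pearsonr(X, Z)[0]
r_yz = pearsonr(Y, Z)[0]

# closed-form partial correlation corr(X, Y | Z)
r_xy_z = (r_xy - r_xz * r_yz) / np.sqrt((1 - r_xz**2) * (1 - r_yz**2))
print(r_xy_z)  # strongly negative, matching the residual-based estimate
```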

