We now understand that Graph Surgery and Adjustment are equivalent. However, with Bayesian networks, we can go beyond the metaphor and—quite literally—perform graph surgery. In this section, we create a Bayesian network to represent the Simpson’s Paradox example and then perform graph surgery to estimate the causal effect.
We have already defined a causal graph earlier when we encoded our causal assumptions regarding this domain. We can reuse this causal understanding for building a causal Bayesian network in BayesiaLab.
As we illustrated in the context of the knowledge modeling exercise in Chapter 4, we manually create the nodes and draw the arcs on BayesiaLab’s Graph Panel. We choose to use long names for the nodes instead of X, Y, and Z. Letters were very convenient for formulas, but long names increase the readability of Bayesian networks. To further help with interpretation, we also associate images with each node and display them as Badges. Then, we use View > Layout > Genetic Grid Layout > Top-Down Repartition to obtain a layout that takes into account the direction of the arcs and defines layers accordingly.
The Genetic Grid Layout algorithms are particularly useful for causal networks. We can, therefore, define one of these two algorithms as the one associated with the shortcut P via Preferences > Window > Preferences > Automatic Layout > Layout Algorithm Associated with Shortcut.
The yellow warning symbols remind us that the probability tables associated with the nodes have yet to be defined. At this point, we could set the parameters based on our knowledge of all the probabilities in this domain. Instead, we utilize the available data and use BayesiaLab’s Parameter Estimation to establish the quantitative part of this network via Maximum Likelihood Estimation. We have been using Parameter Estimation extensively in this book, either implicitly or explicitly, for instance, in the context of structural learning and missing values estimation (see Parameter Estimation in Chapter 5).
Parameter Estimation
Previously, we acquired the data needed for Parameter Estimation via the Data Import Wizard. Now we will use the Associate Data Wizard for the same purpose. Whereas the Data Import Wizard generates new nodes from columns in a database, the Associate Data Wizard links columns of data with existing nodes. This way, we can “fill” our qualitative network with data and then perform Parameter Estimation to generate the quantitative part of the network. We now show the corresponding steps in detail.
We start the Associate Data Wizard from the main menu: Data > Associate Data Source > Text File
This prompts us to select the text file containing our observational data (Simpson.csv). Upon selecting the file, BayesiaLab brings up the first screen of the Associate Data Wizard.
Given that the Associate Data Wizard mirrors the Data Import Wizard in most of its options, we do not describe them again here. We merely show the screens for reference as we click Next to progress through the wizard.
The last step shows how the variables in the dataset will be associated with the nodes of the network. If the column names in the dataset perfectly match the existing node names, BayesiaLab automatically creates an association. However, this is not the case in our example. Therefore, we have to manually define the association by iteratively selecting each Dataset Variable and Network Node, and then clicking on the right arrow.
Upon clicking on the right arrow, BayesiaLab brings up a screen for defining the association between the values used in the dataset and the states of the node. Again, the state names of our nodes do not correspond exactly to the values used in the dataset. So we have to manually define the association by iteratively selecting each Dataset Value and Network State and then clicking on the right arrow.
Once this is done for all three variables, the Associate Data Wizard displays how the columns in the dataset are associated with the nodes of the network.
Upon clicking Finish, we are asked whether we want to view the Associate Report.
The Database icon in the lower right-hand corner of the main window indicates that our network now has a database associated with its structure. We now use this data to estimate the parameters of the network: Learning > Parameter Estimation.
Once the parameters are estimated, there are no longer any warning symbols tagged onto the nodes.
We now have a fully specified Bayesian network. By opening the Node Editor of Outcome, for instance, we see that the CPT is indeed filled with probabilities.
Upon clicking on the Occurrences tab, we can see the counts that were used by the Maximum Likelihood Estimation to derive these probabilities.
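For reference, the arithmetic behind Maximum Likelihood Estimation is straightforward: each CPT row consists of the corresponding occurrence counts, normalized. The following Python sketch is our own illustration, not a dump of BayesiaLab's table; the counts are those implied by the chapter's tables (e.g., 450 treated males, of whom 60% recovered).

```python
# (Gender, Treatment) -> (count recovered, count not recovered),
# as implied by the chapter's tables for the 1,200-patient population.
occurrences = {
    ("Male", "Yes"): (270, 180), ("Male", "No"): (105, 45),
    ("Female", "Yes"): (30, 120), ("Female", "No"): (135, 315),
}

# Maximum Likelihood Estimation: normalize each row of counts.
cpt = {parents: (r / (r + n), n / (r + n))
       for parents, (r, n) in occurrences.items()}

print(cpt[("Male", "Yes")])  # (0.6, 0.4)
```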
Recall that distinguishing between causal and non-causal paths is crucial for the application of the Adjustment Criterion. BayesiaLab can help us review the paths that are present in the graph. Given that we already understand the paths, showing the formal path analysis with BayesiaLab is merely for reference.
Once we define Outcome as Target Node and switch into the Validation Mode (F5), we can examine all possible paths to the Target Node in this network. We select Treatment and then select Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Then, BayesiaLab displays a pop-up window with the Influence Paths report. Selecting any of the listed paths shows the corresponding arcs in the Graph Panel. Causal paths are shown in blue; non-causal paths are pink.
It is easy to see that this automated path analysis can be particularly helpful with more complex networks. In any case, the result confirms our previous manual path analysis, which means that we need to adjust for Gender to block the non-causal path between Treatment and Outcome.
Before proceeding to the effect estimation, we bring up the Monitors of all three nodes and compare the probabilities reported by the network with the Aggregate and Gender-Specific Tables, which gave rise to the paradox.
For instance, the screenshot below shows the prior distributions (left) and the posterior distributions (right) given the observation Treatment = True.
As expected, the target variable Outcome changes upon setting this evidence. However, Gender changes as well, even though we know that the treatment cannot possibly change the gender of a patient. What we observe here is a manifestation of the non-causal path: Treatment ← Gender → Outcome. These probabilities are obviously perfectly correct from the observational point of view: in the observed population of 1,200 individuals, three times as many men as women took the treatment.
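This observational conditioning is easy to reproduce by hand. The sketch below is our own illustration, using the joint counts implied by the chapter's tables (1,200 patients: 600 treated, of whom 450 are male; 600 untreated, of whom 150 are male):

```python
# Observational inference by simple conditioning on the joint counts
# implied by the chapter's tables.
counts = {
    ("True", "Male"): 450, ("True", "Female"): 150,
    ("False", "Male"): 150, ("False", "Female"): 450,
}

def p_gender_given_treatment(gender, treatment):
    # P(Gender = g | Treatment = t): renormalize within the evidence.
    total = sum(v for (t, _), v in counts.items() if t == treatment)
    return counts[(treatment, gender)] / total

print(p_gender_given_treatment("Male", "True"))   # 0.75
print(p_gender_given_treatment("Male", "False"))  # 0.25
```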
For causal inference, however, we need a network that computes all probabilities under an intervention scenario. As we learned, Graph Surgery transforms the original causal network representing the pre-intervention distribution into a new, mutilated network that yields the post-intervention distribution.
In BayesiaLab, Graph Surgery is automated. After right-clicking the Monitor of the node Treatment, we select Intervention from the Contextual Menu.
The activation of the Intervention Mode for this node is highlighted by the blue background of the Treatment's Monitor and the arrow symbols (→) in the Treatment's badge.
By double-clicking a state of Treatment, we now set an Intervention rather than an Observation.
By intervening on Treatment, BayesiaLab applies Graph Surgery and removes the inbound arc into Treatment.
Recall the formula that computes the Average Causal Effect (ACE):

ACE = P(Outcome = Recovered | do(Treatment = True)) − P(Outcome = Recovered | do(Treatment = False))
We can take it directly as a set of instructions and compare the probability of Outcome=Recovered under do(Treatment=False) and do(Treatment=True). Note that the distribution of Gender remains the same pre- and post-intervention.
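This comparison is easy to carry out numerically. The sketch below is our own illustration, not BayesiaLab's computation; it combines the gender-specific recovery rates from the chapter's tables with the unchanged 50/50 Gender distribution:

```python
# P(Outcome = Recovered | do(Treatment = t)) computed on the mutilated network:
# with the arc Gender -> Treatment removed, Gender keeps its 50/50 marginal.
p_recovered = {("Male", True): 0.60, ("Male", False): 0.70,
               ("Female", True): 0.20, ("Female", False): 0.30}
p_gender = {"Male": 0.5, "Female": 0.5}

def p_do(treated):
    # Sum over Gender with its pre-intervention marginal distribution.
    return sum(p_gender[g] * p_recovered[(g, treated)] for g in p_gender)

ace = p_do(True) - p_do(False)
print(round(p_do(True), 3), round(p_do(False), 3), round(ace, 3))  # 0.4 0.5 -0.1
```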
Thus, we obtain an Average Causal Effect of −0.1, which agrees with what we previously computed with the Adjustment Formula.
Returning to the original version of the CDAG, without the hidden variable, we are now ready to proceed with the estimation. However, this CDAG is only a qualitative representation of our theory about the DGP. We now need to consider this graph as a model representing the joint probability distribution of our three variables P(X, Y, Z).
We do not yet need to determine what this probability function is; we simply need to consider this graph as a non-parametric probability function linking X, Y, and Z. This will help us understand what it means to adjust for Z to estimate the causal effect.
Let us summarize what we have so far: First, we have observational data from our domain. Second, we developed a theory about the DGP, i.e., the causal relationships in the domain. Both together will serve as the basis for estimating the causal effect. Before we do that, we should contemplate a very literal interpretation of these causal relationships.
If this causal graph is a correct representation of how the domain works, then every relationship between a pair of variables holds independently. Thus, the causal graph represents autonomous relationships between parent nodes and child nodes. It is as if each node were “listening for instructions” from its parents and only from its parents: the child node’s values are solely determined by the value of its parents, not of any other nodes in the system. Also, these relationships remain invariant regardless of any values that other nodes take on.
Let us now consider an outside intervention on X. Thus, rather than "listening" to its parent Z, X is now entirely determined by an external force and set to specific values, e.g., X=1 or X=0. This external intervention breaks the natural relationship between X and Z. Thus, Z no longer influences X. However, Z → Y and X → Y remain unaffected, and the original “natural” values of Z are not affected either.
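This "listening to parents" picture can be made concrete with a toy simulation. The mechanisms below are our own illustrative assumptions (the probabilities merely echo the chapter's tables); the point is that the `do_x` argument overrides only the mechanism of X:

```python
import random

def sample(do_x=None):
    """Draw one (x, y, z) triple; pass do_x=True/False to intervene on X."""
    z = random.random() < 0.5                        # Z has no parents
    if do_x is None:
        x = random.random() < (0.75 if z else 0.25)  # X listens to its parent Z
    else:
        x = do_x                                     # intervention severs Z -> X
    p_y = {(True, True): 0.60, (True, False): 0.70,  # Y still listens to Z and X,
           (False, True): 0.20, (False, False): 0.30}[(z, x)]
    y = random.random() < p_y
    return x, y, z
```

Under `sample(do_x=True)`, every draw has X set to True, while the mechanisms for Z and for Y given its parents remain exactly as in the observational regime.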
What is the significance of all this? The idea is that intervening on X is like trying out, or simulating, what would happen if treatment were to be applied universally to the entire population — or withheld universally. Isn’t this the causal effect we are interested in? In other words, computing the causal effect is like simulating outside interventions on the treatment variable X.
How does this help us? By simulating an intervention, we “mutilate” the graph. This new graph looks like we had severed the arc going into the treatment variable X. This operation is what Judea Pearl has rather colorfully named “graph surgery” or “graph mutilation.”
Applying Graph Surgery allows us to transform a causal graph that represents a joint probability distribution P of observational data, i.e., pre-intervention distribution, into a new mutilated graph that represents the joint probability distribution Pm of the same variables under a simulated intervention, i.e., post-intervention distribution.
P(Y = y | do(X = x)) = Pm(Y = y | X = x), where Pm denotes the probability distribution represented by the mutilated graph.
How can we translate the abstract concept of Graph Surgery into something that can compute actual numerical values? In fact, we can work directly with graphs — in the form of Bayesian networks — and use BayesiaLab to perform Graph Surgery and simulate interventions.
However, before we illustrate that in the next section of this chapter, we want to formally conclude the line of reasoning that connects the pre-intervention distribution P to the post-intervention distribution Pm and introduce the Adjustment Formula. We paraphrase Pearl, Glymour, and Jewell (2016), p. 56f. to develop this formula.
In our example, we can easily estimate the pre-intervention distribution P from the available data, but we need the post-intervention distribution Pm to calculate the causal effect. The key lies in recognizing that Pm shares two essential properties with P.
Furthermore, X and Z are marginally independent in the mutilated graph. This means that the conditional probability distribution of Z given X in the mutilated graph is the same as the marginal probability distribution of Z in the pre-intervention graph:

Pm(Z = z | X = x) = P(Z = z)
Since the Adjustment Criterion is satisfied in the mutilated graph, we have the following:

P(Y = y | do(X = x)) = Pm(Y = y | X = x)
By conditioning on Z and summing over all values z, we obtain:

Pm(Y = y | X = x) = Σz Pm(Y = y | X = x, Z = z) × Pm(Z = z | X = x)
Furthermore, X and Z are independent in the mutilated graph:

Pm(Z = z | X = x) = Pm(Z = z) = P(Z = z)
Using the two invariance equations above, we obtain what is known as the Adjustment Formula. It expresses the post-intervention distribution exclusively in terms of the pre-intervention distribution:

P(Y = y | do(X = x)) = Σz P(Y = y | X = x, Z = z) × P(Z = z)
The Adjustment Formula computes the association between X and Y for each value z (i.e., each stratum z ∈ Z) and then produces a weighted average. On this basis, we can now estimate the Average Causal Effect (ACE):

ACE = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0))
We know that by performing a randomized experiment, we obtain an unbiased estimate of the causal effect of the treatment. More specifically, through randomization, we randomly split the patient population into two sub-populations, force the first group to receive the treatment, and withhold the treatment from the second group. Through the random assignment of the treatment, we ensure that there is no association between Z and X. Also, all other properties remain unaffected by the randomization of treatment, including the distribution of Z, the relationship between Z and Y, and the relationship between X and Y.
As a result, graph surgery can be seen as a “randomization after the fact.” However, we need to realize that performing graph surgery can only achieve quasi-randomization with regard to observed and known confounders, in our case Z. A randomized experiment, however, can make treatment independent of all other confounders, observed, unobserved, and unknown. Thus, randomized experiments remain the gold standard for establishing causal effects.
All our efforts in estimating causal effects through adjustment or graph surgery are merely an attempt to mimic the properties of a randomized experiment. Unfortunately, we can never measure how close we are to achieving this goal. We can only be disciplined with our assumptions and make a causal claim based on that.
Simpson’s Paradox Resolved
Returning to Simpson’s Paradox, the Adjustment Formula,

P(Y = y | do(X = x)) = Σz P(Y = y | X = x, Z = z) × P(Z = z),

gives us the answer to our question of whether we need to look at the aggregate data table or the gender-specific data table for determining the true causal effect of treatment on the outcome: “Conditioning on Z and summing over all values z” means that we need to utilize the gender-specific table. More specifically, we need to compute the association between X and Y for each value of Z, i.e., each stratum of z ∈ Z, and then calculate the weighted average. This estimation method is also known as stratification.
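To make the stratification concrete, here is a small Python sketch (our own illustration, not BayesiaLab output) that contrasts the naive aggregate association with the stratified estimate:

```python
# Naive aggregate association vs. stratified (adjusted) estimate,
# using the rates from the Aggregate and Gender-Specific Tables.
naive = 0.50 - 0.40                      # aggregate: treated minus untreated
strata = {"Male": (0.60, 0.70),          # (treated, untreated) recovery rates
          "Female": (0.20, 0.30)}
p_gender = {"Male": 0.5, "Female": 0.5}  # P(Z = z)

ace = sum(p_gender[g] * (t - u) for g, (t, u) in strata.items())
print(round(naive, 2), round(ace, 2))    # 0.1 -0.1
```

Note how the weighting by P(Z = z) flips the sign of the estimated effect.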
Aggregate Table
Gender-Specific Table
The ACE turns out to be negative, i.e., it has the opposite sign of what we would have inferred by merely looking naively at the association between treatment and outcome. This illustrates that a bias in the estimation of an effect can be more than just a nuisance for the analyst. Bias can reverse the sign of the effect! In our example, the treatment under study would kill people instead of healing them. The good news is that we have a theory stipulating that gender is a confounder, and this variable is observed. If it were not recorded in our dataset (hidden variable), we would not be able to compute the causal effect of treatment. We can also imagine situations where we do not know that confounders exist and, therefore, do not measure them. This can lead to substantially wrong estimations of causal effects and lead to policies with catastrophic consequences.
Note that the mutilated graph achieves what was stipulated by the Adjustment Criterion: the non-causal path X ← Z → Y does not exist any longer, as required by the Adjustment Criterion, and, given the autonomy of the other arcs, the causal path X → Y remains unblocked.
This new graph can tell us what happens to Y when we intervene and set X to a specific value, i.e., do(X = x). Note the do-operator! With this mutilated graph, we can compute the quantity of interest, the Average Causal Effect (ACE):

ACE = P(Y = 1 | do(X = 1)) − P(Y = 1 | do(X = 0))
First, the marginal distribution of Z remains invariant under the intervention because the process of determining Z is unaffected by removing the arc Z → X. In our example, this means that the share of men and women must remain the same before and after the intervention:

Pm(Z = z) = P(Z = z)
Secondly, the conditional probability distribution of Y given X and Z remains invariant under the intervention because the process that determines how Y responds to X and Z stays the same, regardless of whether X changes naturally or through external intervention. We can state this formally as follows:

Pm(Y = y | X = x, Z = z) = P(Y = y | X = x, Z = z)
| Treatment | Patient Recovered: Yes | Patient Recovered: No |
|---|---|---|
| Yes | 50% | 50% |
| No | 40% | 60% |
| Gender | Treatment | Patient Recovered: Yes | Patient Recovered: No |
|---|---|---|---|
| Male | Yes | 60% | 40% |
| Male | No | 70% | 30% |
| Female | Yes | 20% | 80% |
| Female | No | 30% | 70% |
We now introduce Causal Effect Estimation by means of Likelihood Matching. Given the simplicity of the Simpson’s Paradox example, the need for yet another estimation method may not be immediately apparent. The advantages of Likelihood Matching will only become clear as we study a more complex domain, such as the marketing mix example of the next chapter. However, the current example makes it easy to explain Likelihood Matching.
In statistics, matching refers to the technique that makes confounder distributions of the treated and untreated sub-populations as similar as possible to each other. As such, applying matching to variables qualifies as adjustment, and we can use it with the objective of keeping causal paths open and blocking non-causal paths. In the Simpson’s Paradox example, matching is fairly simple as we only need to match a single binary variable, i.e., Gender. That will meet our requirement for adjustment and block the only non-causal path in our model.
As our terminology of “blocking paths by matching” may not be understood outside the world of graphical models and Bayesian networks, we can offer a more intuitive interpretation of matching, which our example can illustrate very well.
Because of the self-selection phenomenon in our population, the Gender distribution is a function of Treatment. In other words, of those who are treated, 75% turn out to be male. Among untreated patients, only 25% are male.
Given that we know that Gender has a causal effect on Outcome (men are more likely to recover than women, with or without treatment), and given this difference in gender composition, comparing the outcomes between the treated and non-treated patients is clearly not an apples-to-apples comparison.
We can propose a common-sense solution to this predicament. How about searching for subsets of patients within the treated and non-treated groups that have an identical gender mix, as illustrated below?
In statistical matching, this process typically involves the selection of units in such a way that comparable groups are created:
In practice, this can be more challenging as the observed units typically have more than just a single binary attribute. So, the idea of matching has to be extended to higher dimensions, and the observed units need to be matched on a range of attributes, including both continuous and discrete variables.
However, matching observations exactly with regard to all covariates is rarely feasible. For instance, patients are characterized by dozens or even hundreds of attributes and comorbidities. Finding two matching patients would be difficult enough, but finding populations with many matching pairs of patients would presumably be impossible.
So, how does randomization do it? Actually, randomization does not guarantee identical populations but rather ensures that the distributions of confounders are balanced between the populations under study. So, to pursue balanced confounders instead of pursuing perfect matches, numerous similarity measures have been proposed for matching.
The concept of Propensity Score Matching has become a particularly popular method (Rosenbaum and Rubin, 1983). Instead of matching individuals on their high-dimensional attributes, we would match observations by their probability of treatment, i.e., P(X=1|Z), which is known as the Propensity Score. Rosenbaum and Rubin have shown that matching on the propensity score achieves a balance of the covariate distributions.
However, matching on the Propensity Score requires the score itself to be estimated first. Conventional models only represent the outcome variable as a function of the treatment variable and the confounders, i.e., P(Y|X, Z). If we need to understand the relationship between the treatment and the confounders, i.e., P(X|Z), we have to estimate this separately. This usually means fitting a function, such as a regression, that models the propensity score, PS=P(X|Z). For binary treatment variables, logistic regression is a common choice for the functional form.
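In our toy example, with a single binary confounder, no regression is needed at all: the propensity score can be read off directly from the counts implied by the chapter's tables. A hypothetical sketch:

```python
# Propensity score P(X = treated | Z = gender), read off directly from the
# counts implied by the chapter's tables (600 males, 450 of them treated;
# 600 females, 150 of them treated). No logistic regression is needed here.
counts = {"Male": {"treated": 450, "total": 600},
          "Female": {"treated": 150, "total": 600}}

propensity = {g: c["treated"] / c["total"] for g, c in counts.items()}
print(propensity)  # {'Male': 0.75, 'Female': 0.25}
```

Matching on these scores amounts to matching on gender, which balances the confounder between the groups.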
With BayesiaLab's Likelihood Matching, we do not directly match the underlying observations. Rather, we match the distributions of the relevant nodes on the basis of the joint probability distribution represented by the Bayesian network. In our example, we need to ensure that the gender compositions of the untreated and treated groups are the same, i.e., a 50/50 gender mix. This theoretically ideal condition is shown in the Monitors below.
However, the actual distributions reveal the inequality of gender distributions for the untreated and the treated.
How can we overcome this? By simply using Probabilistic Evidence to set a 50/50 gender mix, i.e., the marginal distribution of Gender, upon setting evidence on Treatment. We can also right-click on the Monitor for Gender and select Fix Probabilities from the Contextual Menu. This will automatically use Probabilistic Evidence to set the current marginal distribution after each conditioning on Treatment, or any other node.
With Fix Probabilities applied, the distribution of Gender in its Monitor is highlighted in purple.
Other than colors, nothing appears to have changed. However, once we set the values Treatment=False and Treatment=True, we see that the distribution for Gender does not change. We can also observe that the corresponding posterior probabilities for Outcome are the same as those obtained with Graph Surgery.
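The effect of Fix Probabilities can be emulated outside BayesiaLab by reweighting. The following Python sketch is our own illustration, using the observed 75/25 gender mix among the treated and the recovery rates from the chapter's tables:

```python
# Emulating the effect of Fix Probabilities by reweighting: pin Gender back
# to its 50/50 marginal after conditioning on Treatment = True.
observed_mix = {"Male": 0.75, "Female": 0.25}  # P(Gender | Treatment = True)
target_mix = {"Male": 0.50, "Female": 0.50}    # fixed marginal of Gender
p_rec = {"Male": 0.60, "Female": 0.20}         # P(Recovered | True, gender)

weights = {g: target_mix[g] / observed_mix[g] for g in observed_mix}
p_rec_matched = sum(observed_mix[g] * weights[g] * p_rec[g] for g in observed_mix)
print(round(p_rec_matched, 3))  # 0.4, the same value Graph Surgery yields
```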
With that, we can once again calculate the Average Causal Effect:

ACE = P(Outcome = Recovered | do(Treatment = True)) − P(Outcome = Recovered | do(Treatment = False)) = 0.4 − 0.5 = −0.1
For now, Likelihood Matching applied to Simpson's Paradox may not seem like a breakthrough method. Conceptually and practically, it appears to be another form of adjustment. The fundamental advantages of Likelihood Matching will become clear in the context of the next chapter, Causality and Optimization.