We will now explore formal methodologies that can help us derive causal effects from observational data. These methodologies will ultimately allow us to answer the question raised by Simpson’s Paradox. The process of determining the size of a causal effect from observational data can be divided into two steps: identification and estimation.
Identification analysis is about determining whether or not a causal effect can be established from the observed data. This requires a formal causal model or at least partial knowledge of how the data was generated. In this chapter, all causal assumptions for identification are expressed explicitly in the form of a Directed Acyclic Graph (DAG) (Pearl 1995, 2009). They represent our complete causal understanding of the Data-Generating Process (DGP) for the system we are studying.
Where do we get such causal assumptions? We would like to say that advanced algorithms can generate causal assumptions from data. That is not the case, unfortunately. Causal assumptions do still require human expert knowledge or, more generally, theory. In practice, this means that we need to build (or draw) a causal graph of our domain. Then, we can examine this graph against formal criteria, which determine whether the effect is identifiable or not.
It is important to realize that the absence of causal assumptions cannot be compensated for by clever statistical techniques or by providing more data. So, recognizing that a causal effect is not identifiable brings the effect analysis to an abrupt halt.
But if the causal effect is identifiable, we can proceed to estimate the effect size. The same criteria that determine identifiability also tell us how to perform the effect estimation. With that, we can utilize the available observational data and estimate the causal effect. Depending on the complexity of the domain, the effect estimation can bring a new set of challenges. However, in the context of Simpson’s Paradox, the effect estimation will be very straightforward.
We need to understand some important properties before encoding our causal knowledge in a DAG. We learned in Chapter 2 that Bayesian networks use DAGs for the qualitative description of the Joint Probability Distribution.
In the causal context, however, the arcs in a DAG explicitly state causality instead of only representing direct probabilistic dependencies in a Bayesian network. We now designate a DAG with a causal semantic as a Causal DAG (CDAG) to highlight this distinction.
A DAG has three basic configurations in which nodes can be connected. Graphs of any size and complexity can be broken down into these basic graph structures. While these basic structures show direct dependencies/causes explicitly, there are more statements contained in them, albeit implicitly. In fact, we can read all marginal and conditional associations that exist between the nodes.
Why are we even interested in associations? Isn’t all this about understanding causal effects? It is essential to understand all associations in a system because, in nonexperimental data, all we can do is observe associations, some of which represent noncausal relationships. Our objective is to identify causal effects from associations.
This DAG represents an indirect connection from A to B via C.
A Directed Arc represents a potential causal effect. The arc direction indicates the assumed causal direction, i.e., “A → C ” means “A causes C .”
A Missing Arc encodes the definitive absence of a direct causal effect, i.e., no arc between A and B means no direct causal relationship exists between A and B and vice versa. As such, a missing arc represents an assumption.
Implication for Causality
A has a potential causal effect on B intermediated by C.
Implication for Association
Marginally (or unconditionally), A and B are dependent. This means that without knowing the exact value of C, learning about A informs us about B and vice versa, i.e., the path between the nodes is unblocked, and information can flow in both directions.
Conditionally on C, i.e., by setting Hard Evidence on (or observing) C, A and B become independent. In other words, by “hard” conditioning on C, we block the path from A to B and from B to A. Thus, A and B are conditionally independent, given C:
Hard Evidence means that there is no uncertainty regarding the value of the observation or evidence. If uncertainty remains regarding the value of C, the path will not be entirely blocked, and an association will remain between A and B.
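The blocking behavior of the indirect connection can be checked numerically. The following is a minimal sketch for the chain A → C → B with binary variables; all probability values are hypothetical and chosen only for illustration.

```python
from itertools import product

# Hypothetical parameters for the chain A -> C -> B (all variables binary).
p_a = {0: 0.5, 1: 0.5}            # P(A)
p_c1_given_a = {0: 0.1, 1: 0.8}   # P(C=1 | A=a)
p_b1_given_c = {0: 0.3, 1: 0.9}   # P(B=1 | C=c)

def bern(p_one, value):
    """Probability of `value` for a binary variable with P(value=1) = p_one."""
    return p_one if value == 1 else 1 - p_one

# Joint distribution implied by the chain: P(a, c, b) = P(a) P(c | a) P(b | c).
joint = {(a, c, b): p_a[a] * bern(p_c1_given_a[a], c) * bern(p_b1_given_c[c], b)
         for a, c, b in product((0, 1), repeat=3)}

def prob(a=None, c=None, b=None):
    """Sum the joint over all cells that agree with the specified values."""
    wanted = (a, c, b)
    return sum(p for cell, p in joint.items()
               if all(w is None or w == v for w, v in zip(wanted, cell)))

# Marginally, learning A changes our belief about B.
marginal_gap = prob(a=1, b=1) / prob(a=1) - prob(a=0, b=1) / prob(a=0)

# With hard evidence on C, learning A no longer changes our belief about B.
conditional_gap = (prob(a=1, c=1, b=1) / prob(a=1, c=1)
                   - prob(a=0, c=1, b=1) / prob(a=0, c=1))

print(marginal_gap, conditional_gap)
```

With these illustrative numbers, the marginal gap is clearly nonzero, while the gap given C vanishes up to floating-point error.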
The second configuration has C as the common parent of A and B.
Implication for Causality
C is the common cause of both A and B.
Implication for Association
In terms of association, this structure is absolutely equivalent to the Indirect Connection. Thus, A and B are marginally dependent but conditionally independent given C (by setting Hard Evidence on C):
The final structure has a common child C, with A and B being its parents. This structure is called a “V-Structure.” In this configuration, the common child C is also known as a “collider.”
Implication for Causality
A and B are the direct causes of C.
Implication for Association
Marginally (or unconditionally), A and B are independent, i.e., there is no information flow between A and B. Conditionally on C — with any kind of evidence — A and B become dependent. If we condition on the collider C, information can flow between A and B, i.e., conditioning on C opens the information flow between A and B:
Even introducing a minor change in the distribution of C, e.g., from no observation (“color unknown”) to a very vague observation (“it could be anything, but it is probably not purple”), opens the information flow.
For purposes of formal reasoning, this type of connection is of special significance. Conditioning on C facilitates intercausal reasoning, often referred to as the ability to “explain away” the other cause, given that the common effect is observed (see Inter-Causal Reasoning in Chapter 4).
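The “explaining away” effect of a collider can also be illustrated with a small calculation. The sketch below assumes a V-Structure A → C ← B with binary variables; the prior and conditional probabilities are hypothetical values picked for illustration.

```python
from itertools import product

# Hypothetical parameters for the collider A -> C <- B (all variables binary).
p_a = {0: 0.9, 1: 0.1}   # P(A)
p_b = {0: 0.9, 1: 0.1}   # P(B)
# P(C=1 | A=a, B=b): either cause alone makes the common effect likely.
p_c1 = {(0, 0): 0.01, (0, 1): 0.8, (1, 0): 0.8, (1, 1): 0.95}

joint = {}
for a, b, c in product((0, 1), repeat=3):
    p_c = p_c1[(a, b)] if c == 1 else 1 - p_c1[(a, b)]
    joint[(a, b, c)] = p_a[a] * p_b[b] * p_c

def prob(a=None, b=None, c=None):
    """Sum the joint over all cells that agree with the specified values."""
    wanted = (a, b, c)
    return sum(p for cell, p in joint.items()
               if all(w is None or w == v for w, v in zip(wanted, cell)))

p_a1 = prob(a=1)                                         # prior belief in A
p_a1_given_b1 = prob(a=1, b=1) / prob(b=1)               # unchanged: A and B are independent
p_a1_given_c1 = prob(a=1, c=1) / prob(c=1)               # observing the effect raises belief in A
p_a1_given_c1_b1 = prob(a=1, b=1, c=1) / prob(b=1, c=1)  # B "explains away" A

print(p_a1, p_a1_given_b1, p_a1_given_c1, p_a1_given_c1_b1)
```

Observing B alone leaves the belief in A untouched. Once C is observed, however, additionally learning B lowers the belief in A, which is the dependence opened by conditioning on the collider.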
To begin the encoding of our causal knowledge in the form of a CDAG, we draw three nodes, which represent X (Treatment), Y (Outcome), and Z (Gender). For now, we are only using the qualitative part of the network, i.e., we are not considering probabilities.
The absence of further nodes means that we assume that there are no additional variables in the Data-Generating Process (DGP), either observable or unobservable. Unfortunately, this is a very strong assumption that cannot be tested. We need to have a justification purely on theoretical grounds to make such an assumption.
In the next step, we must encode our causal assumptions regarding this domain. Given our background knowledge of this domain, we state that Z causes X and Y and that X causes Y.
This means that we believe that gender is a cause of taking the treatment and has a causal effect on the outcome, too. We also assume that the treatment has a potential causal effect on the outcome.
Having accepted these causal assumptions, we now wish to identify the causal effect of X on Y. The question is whether this is possible on the basis of this causal graph and the available observational data for these three variables. Before we can answer this question, we need to think about what this CDAG specifically implies. Recall the types of structures that can exist in a DAG (see Structures Within a DAG). As it turns out, we can find all three of the basic structures in this example:
Indirect Connection: Z causes Y via X
Common Parent: Z causes X and Y
Common Child: Z and X cause Y
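Association in an Indirect Connection (A → C → B): $A \not\perp B,\; A \perp B \mid C$
Association with a Common Parent (A ← C → B): $A \not\perp B,\; A \perp B \mid C$
Association with a Common Child (A → C ← B): $A \perp B,\; A \not\perp B \mid C$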
It is self-evident that causal arcs have implications in terms of causation. However, as we pointed out earlier in this chapter (see Structures Within a DAG), there are also implications regarding the association of variables. This will perhaps become clearer as we introduce the concepts of “causal path” and “noncausal path.”
In a DAG, a path is a sequence of nonintersecting, adjacent arcs, regardless of their direction.
A causal path can be any path from cause to effect, in which all arcs are directed away from the cause and pointed toward the effect.
A noncausal path can be any path between cause and effect in which at least one of the arcs is oriented from effect to cause.
Our example contains both types of paths:
This distinction between causal and noncausal paths is critically important for identification.
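The definitions of causal and noncausal paths translate directly into a short program. The following sketch encodes the arcs of our CDAG (Z → X, Z → Y, X → Y) and classifies every path between X and Y; the function names are illustrative and not part of any library.

```python
# Arcs of the CDAG from the text: Z -> X, Z -> Y, X -> Y.
arcs = {("Z", "X"), ("Z", "Y"), ("X", "Y")}

def neighbors(node):
    """Nodes adjacent to `node`, ignoring arc direction."""
    return {b for a, b in arcs if a == node} | {a for a, b in arcs if b == node}

def all_paths(start, goal, visited=()):
    """Enumerate paths that never revisit a node, regardless of arc direction."""
    visited = visited + (start,)
    if start == goal:
        yield visited
        return
    for nxt in sorted(neighbors(start)):
        if nxt not in visited:
            yield from all_paths(nxt, goal, visited)

def is_causal(path):
    """A path is causal if every arc points away from the cause, toward the effect."""
    return all((a, b) in arcs for a, b in zip(path, path[1:]))

paths = {p: ("causal" if is_causal(p) else "noncausal")
         for p in all_paths("X", "Y")}
print(paths)
```

For this graph, the sketch finds the causal path X → Y and the noncausal path X ← Z → Y.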
The Adjustment Criterion (Shpitser et al., 2010) is perhaps the most intuitive among several graphical identification criteria. The Adjustment Criterion states that a causal effect is identified if we can adjust for a set of variables such that:
What does “adjust for” mean in practice? “Adjusting for a variable” can stand for any of the following operations, which all introduce information on a variable:
Controlling
Conditioning
Stratifying
Matching
At this point, the adjustment technique is irrelevant. Rather, we only need to determine which variables, if any, need to be adjusted in order to block the noncausal paths while keeping the causal paths open. Revisiting both paths in our CDAG, we can now examine which ones are open or blocked:
In this example, the Adjustment Criterion can be met by blocking the noncausal path X ← Z → Y by means of adjusting for Z. In other words, adjusting for Z allows identifying the causal effect from X to Y. From now on, we will often refer to such variables Z as Confounders.
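Whether a given path is open or blocked under an adjustment set follows mechanically from the three basic structures. The sketch below applies these rules to the two paths of our CDAG. The helper names are illustrative, and the treatment of collider descendants follows the standard blocking rule rather than anything specific to this example.

```python
# Arcs of the CDAG from the text: Z -> X, Z -> Y, X -> Y.
arcs = {("Z", "X"), ("Z", "Y"), ("X", "Y")}

def descendants(node):
    """All nodes reachable from `node` by following arcs forward."""
    found, frontier = set(), [node]
    while frontier:
        current = frontier.pop()
        for a, b in arcs:
            if a == current and b not in found:
                found.add(b)
                frontier.append(b)
    return found

def is_blocked(path, adjusted):
    """Check one path against the blocking rules of the three basic structures."""
    adjusted = set(adjusted)
    for left, mid, right in zip(path, path[1:], path[2:]):
        is_collider = (left, mid) in arcs and (right, mid) in arcs
        if is_collider:
            # A collider blocks unless it, or one of its descendants, is adjusted for.
            opened = bool(({mid} | descendants(mid)) & adjusted)
            if not opened:
                return True
        elif mid in adjusted:
            # Indirect connections and common parents are blocked by adjusting
            # for the middle node.
            return True
    return False

noncausal_path = ("X", "Z", "Y")   # X <- Z -> Y
causal_path = ("X", "Y")           # X -> Y

print(is_blocked(noncausal_path, set()))   # open without adjustment
print(is_blocked(noncausal_path, {"Z"}))   # blocked after adjusting for Z
print(is_blocked(causal_path, {"Z"}))      # the direct arc stays open
```

Adjusting for Z blocks the noncausal path and leaves the causal path open, which is what the Adjustment Criterion requires.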
Readers may be familiar with the expression “controlling for confounders.” What is important to bear in mind is that not all covariates in a system are Confounders! Recall Judea Pearl’s warning about ignorability and the risk of treating every covariate as a Confounder (see Ignorability).
Thus far, we have assumed that our example has no unobserved (also called hidden or latent) variables. However, if we had reason to believe that there is another variable, U, which appears to be relevant on theoretical grounds but was not recorded in the dataset, identification could no longer be possible. Why? Let us assume U is a hidden common cause of X and Y. By adding this unobserved variable U, a new noncausal path appears between X and Y via U.
Given that U is hidden, there is no way to adjust for it, and, therefore, we have an open, noncausal path that cannot be blocked. Hence, the causal effect is no longer identifiable, and thus, it can no longer be estimated without bias.
This highlights how easily identification can be “ruined.” Once again, we can only justify the absence of unobserved variables on theoretical grounds.
Let us summarize what we have so far: First, we have observational data from our domain. Second, we developed a theory about the DGP, i.e., the causal relationships in the domain. Both together will serve as the basis for estimating the causal effect. Before we do that, we should contemplate a very literal interpretation of these causal relationships.
If this causal graph is a correct representation of how the domain works, then every relationship between a pair of variables holds independently. Thus, the causal graph represents autonomous relationships between parent nodes and child nodes. It is as if each node were “listening for instructions” from its parents and only from its parents: the child node’s values are solely determined by the value of its parents, not of any other nodes in the system. Also, these relationships remain invariant regardless of any values that other nodes take on.
Let us now consider an outside intervention on X. Thus, rather than "listening" to its parent Z, X is now entirely determined by an external force and set to specific values, e.g., X=1 or X=0. This external intervention breaks the natural relationship between X and Z. Thus, Z no longer influences X. However, Z → Y and X → Y remain unaffected, and the original “natural” values of Z are not affected either.
What is the significance of all this? The idea is that intervening on X is like trying out, or simulating, what would happen if treatment were to be applied universally to the entire population — or withheld universally. Isn’t this the causal effect we are interested in? In other words, computing the causal effect is like simulating outside interventions on the treatment variable X.
How does this help us? By simulating an intervention, we “mutilate” the graph. This new graph looks like we had severed the arc going into the treatment variable X. This operation is what Judea Pearl has rather colorfully named “graph surgery” or “graph mutilation.”
Applying Graph Surgery allows us to transform a causal graph that represents a joint probability distribution P of observational data, i.e., preintervention distribution, into a new mutilated graph that represents the joint probability distribution Pm of the same variables under a simulated intervention, i.e., postintervention distribution.
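On the level of the graph alone, Graph Surgery is a very small operation. The following sketch represents the CDAG by the parent set of each node and removes all arcs pointing into the intervened node; the function name `graph_surgery` is illustrative.

```python
def graph_surgery(parents, intervened):
    """Return a copy of the graph in which the intervened node has no parents."""
    mutilated = {node: set(pa) for node, pa in parents.items()}
    mutilated[intervened] = set()   # sever every arc going into the node
    return mutilated

# Parent sets of the CDAG from the text: Z -> X, Z -> Y, X -> Y.
cdag = {"Z": set(), "X": {"Z"}, "Y": {"X", "Z"}}

mutilated = graph_surgery(cdag, "X")
print(mutilated)
```

Only the parent set of X changes. The parents of Y, and thus the arcs Z → Y and X → Y, are left untouched, reflecting the autonomy of the remaining relationships.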
How can we translate the abstract concept of Graph Surgery into something that can compute actual numerical values? In fact, we can work directly with graphs — in the form of Bayesian networks — and use BayesiaLab to perform Graph Surgery and simulate interventions.
However, before we illustrate that in the next section of this chapter, we want to formally conclude the line of reasoning that connects the preintervention distribution P to the postintervention distribution Pm and introduce the Adjustment Formula. We paraphrase Pearl, Glymour, and Jewell (2016), p. 56f. to develop this formula.
In our example, we can easily estimate the preintervention distribution P from the available data, but we need the postintervention distribution Pm to calculate the causal effect. The key lies in recognizing that Pm shares two essential properties with P.
Furthermore, X and Z are marginally independent in the mutilated graph. This means that the conditional probability distribution of Z given X in the mutilated graph is the same as the marginal probability distribution of Z in the preintervention graph:
Since the Adjustment Criterion is satisfied in the mutilated graph, we have the following:
By conditioning on Z and summing over all values z, we obtain:
Furthermore, X and Z are independent in the mutilated graph:
Using the two invariance equations above, we obtain what is known as the Adjustment Formula. It expresses the postintervention distribution exclusively in terms of the preintervention distribution:
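Written out, the line of reasoning above takes roughly the following form, following the standard presentation in Pearl, Glymour, and Jewell (2016). Here, $P_m$ denotes the postintervention distribution, and lowercase letters denote specific values of the variables. The first equality reflects that intervening on X in the original graph corresponds to conditioning on X in the mutilated graph. The third uses the independence of X and Z in the mutilated graph, and the last uses the two invariance properties.

$$
\begin{aligned}
P(Y = y \mid do(X = x)) &= P_m(Y = y \mid X = x) \\
&= \sum_z P_m(Y = y \mid X = x, Z = z)\, P_m(Z = z \mid X = x) \\
&= \sum_z P_m(Y = y \mid X = x, Z = z)\, P_m(Z = z) \\
&= \sum_z P(Y = y \mid X = x, Z = z)\, P(Z = z)
\end{aligned}
$$

The Average Causal Effect is then the difference between two such quantities:

$$
ACE = P(Y = 1 \mid do(X = 1)) - P(Y = 1 \mid do(X = 0))
$$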
The Adjustment Formula computes the association between X and Y for each value z (i.e., within each stratum z ∈ Z) and then produces a weighted average. On this basis, we can now estimate the Average Causal Effect (ACE):
We know that by performing a randomized experiment, we obtain an unbiased estimate of the causal effect of the treatment. More specifically, through randomization, we randomly split the patient population into two subpopulations and forced the first group to receive treatment, and withheld the treatment from the second group. Through the random assignment of the treatment, we ensure that there is no association between Z and X. Also, all other properties remain unaffected by the randomization of treatment, including the distribution of Z, the relationship between Z and Y, and the relationship between X and Y.
As a result, graph surgery can be seen as a “randomization after the fact.” However, we need to realize that performing graph surgery can only achieve quasi-randomization with regard to observed and known confounders, in our case Z. A randomized experiment, however, can make treatment independent of all other confounders, observed, unobserved, and unknown. Thus, randomized experiments remain the gold standard for establishing causal effects.
All our efforts in estimating causal effects through adjustment or graph surgery are merely an attempt to mimic the properties of a randomized experiment. Unfortunately, we can never measure how close we are to achieving this goal. We can only be disciplined with our assumptions and make a causal claim based on that.
Simpson’s Paradox Resolved
Returning to Simpson’s Paradox, the Adjustment Formula
$P(Y = y \mid do(X = x)) = \sum_z P(Y = y \mid X = x, Z = z)\,P(Z = z)$
gives us the answer to our question of whether we need to look at the aggregate data table or the gender-specific data table for determining the true causal effect of treatment on the outcome: “Conditioning on Z and summing over all values z” means that we need to utilize the gender-specific table. More specifically, we need to compute the association between X and Y for each value of Z, i.e., each stratum z ∈ Z, and then calculate the weighted average. This estimation method is also known as stratification.
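Stratification is easy to carry out by hand for this example, and equally easy to sketch in code. The recovery rates below are taken from the gender-specific table, and the 50/50 gender split is the marginal distribution of Gender used in this chapter. The variable names are illustrative.

```python
# Recovery rates P(Y=1 | X, Z) from the gender-specific table.
recovery = {
    ("male", True): 0.60, ("male", False): 0.70,
    ("female", True): 0.20, ("female", False): 0.30,
}
# Marginal gender distribution P(Z): a 50/50 mix.
p_gender = {"male": 0.5, "female": 0.5}

def p_recovered_do(treated):
    """Weighted average of the stratum-specific recovery rates."""
    return sum(recovery[(g, treated)] * p_gender[g] for g in p_gender)

ace = p_recovered_do(True) - p_recovered_do(False)

# Naive difference read directly off the aggregate table (50% vs. 40%).
naive_difference = 0.50 - 0.40

print(round(ace, 3), round(naive_difference, 3))
```

The stratified estimate and the naive difference have the same magnitude but opposite signs, which is the sign reversal at the heart of the paradox.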
Aggregate Table
Gender-Specific Table
The ACE turns out to be negative, i.e., it has the opposite sign of what we would have inferred by merely looking naively at the association between treatment and outcome. This illustrates that a bias in the estimation of an effect can be more than just a nuisance for the analyst. Bias can reverse the sign of the effect! In our example, the treatment under study would kill people instead of healing them. The good news is that we have a theory stipulating that gender is a confounder, and this variable is observed. If it were not recorded in our dataset (hidden variable), we would not be able to compute the causal effect of treatment. We can also imagine situations where we do not know that confounders exist and, therefore, do not measure them. This can lead to substantially wrong estimations of causal effects and lead to policies with catastrophic consequences.
We now understand that Graph Surgery and Adjustment are equivalent. However, with Bayesian networks, we can go beyond the metaphor and, quite literally, perform graph surgery. In this section, we create a Bayesian network to represent the Simpson’s Paradox example and then perform graph surgery to estimate the causal effect.
We have already defined a causal graph earlier when we encoded our causal assumptions regarding this domain. We can reuse this causal understanding for building a causal Bayesian network in BayesiaLab.
The Genetic Grid Layout algorithms are particularly useful for causal networks. We can, therefore, define one of these two algorithms as the one associated with the shortcut P via Preferences > Window > Preferences > Automatic Layout > Layout Algorithm Associated with Shortcut.
Parameter Estimation
Previously, we acquired the data needed for Parameter Estimation via the Data Import Wizard. Now we will use the Associate Data Wizard for the same purpose. Whereas the Data Import Wizard generates new nodes from columns in a database, the Associate Data Wizard links columns of data with existing nodes. This way, we can “fill” our qualitative network with data and then perform Parameter Estimation to generate the quantitative part of the network. We now show the corresponding steps in detail.
We start the Associate Data Wizard from the main menu: Data > Associate Data Source > Text File
Given that the Associate Data Wizard mirrors the Data Import Wizard in most of its options, we do not describe them again here. We merely show the screens for reference as we click Next to progress through the wizard.
The last step shows how the variables in the dataset will be associated with the nodes of the network. If the column names in the dataset perfectly match the existing node names, BayesiaLab automatically creates an association. However, this is not the case in our example. Therefore, we have to manually define the association by iteratively selecting each Dataset Variable and Network Node, and then clicking on the right arrow.
Upon clicking on the right arrow, BayesiaLab brings up a screen for defining the association between the values used in the dataset and the states of the node. Again, the state names of our nodes do not correspond exactly to the values used in the dataset. So we have to manually define the association by iteratively selecting each Dataset Value and Network State and then clicking on the right arrow.
Once this is done for all three variables, the Associate Data Wizard displays how the columns in the dataset are associated with the nodes of the network.
Upon clicking Finish, we are prompted whether we want to view the Associate Report.
The Database icon in the lower right-hand corner of the main window indicates that our network now has a database associated with its structure. We now use this data to estimate the parameters of the network: Learning > Parameter Estimation.
Once the parameters are estimated, there are no longer any warning symbols tagged onto the nodes.
We now have a fully specified Bayesian network. By opening the Node Editor of Outcome, for instance, we see that the CPT is indeed filled with probabilities.
Upon clicking on the Occurrences tab, we can see the counts that were used by the Maximum Likelihood Estimation to derive these probabilities.
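For discrete nodes, Maximum Likelihood Estimation of a conditional probability table amounts to computing relative frequencies from the counts. The sketch below illustrates this for the Outcome node. Note that the counts are not quoted from the dataset. They are derived from the percentages in the tables under the assumption of 1,200 patients, of whom 600 are treated and 600 untreated, which is consistent with the 50/50 gender mix and the 3:1 ratio of men to women among the treated.

```python
# counts[(gender, treated)] = (recovered, not_recovered)
# Derived from the table percentages under the stated assumptions, not quoted data.
counts = {
    ("male", True): (270, 180), ("male", False): (105, 45),
    ("female", True): (30, 120), ("female", False): (135, 315),
}

def mle_cpt(counts):
    """Maximum likelihood estimate of P(recovered | gender, treatment).

    For a discrete node, this is simply the relative frequency within each
    combination of parent states.
    """
    return {parent_states: recovered / (recovered + not_recovered)
            for parent_states, (recovered, not_recovered) in counts.items()}

cpt_outcome = mle_cpt(counts)
total_patients = sum(r + n for r, n in counts.values())

print(total_patients)
for parent_states, p in cpt_outcome.items():
    print(parent_states, p)
```

The resulting probabilities reproduce the recovery rates of the gender-specific table.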
Recall that distinguishing between causal and noncausal paths is crucial for the application of the Adjustment Criterion. BayesiaLab can help us review the paths that are present in the graph. Given that we already understand the paths, showing the formal path analysis with BayesiaLab is merely for reference.
Once we define Outcome as Target Node and switch into the Validation Mode (F5), we can examine all possible paths to the Target Node in this network. We select Treatment and then select Main Menu > Analysis > Visual > Graph > Influence Paths to Target.
Then, BayesiaLab displays a popup window with the Influence Paths report. Selecting any of the listed paths shows the corresponding arcs in the Graph Panel. Causal paths are shown in blue; noncausal paths are pink.
It is easy to see that this automated path analysis can be particularly helpful with more complex networks. In any case, the result confirms our previous manual path analysis, which means that we need to adjust for Gender to block the noncausal path between Treatment and Outcome.
For instance, the screenshot below shows the prior distributions (left) and the posterior distributions (right) given the observation Treatment = True.
As expected, the target variable Outcome changes upon setting this evidence. However, Gender changes as well, even though we know that the treatment cannot possibly change the gender of a patient. What we observe here is a manifestation of the noncausal path: Treatment ← Gender → Outcome. These probabilities are obviously perfectly correct from the observational point of view: in the observed population of 1,200 individuals, three times as many men as women took the treatment.
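The effect of observing Treatment can be reproduced with Bayes' rule. In the sketch below, the recovery rates come from the gender-specific table. The values for P(Treatment | Gender) are not stated directly in the text; they are derived from the 50/50 gender mix and the fact that 75% of the treated are male.

```python
p_gender = {"male": 0.5, "female": 0.5}                    # P(Z)
p_treated_given_gender = {"male": 0.75, "female": 0.25}    # P(X=1 | Z), derived
recovery = {                                               # P(Y=1 | X, Z)
    ("male", True): 0.60, ("male", False): 0.70,
    ("female", True): 0.20, ("female", False): 0.30,
}

def observe_treatment(treated):
    """Posterior of Gender and of recovery after *observing* Treatment."""
    likelihood = {g: (p_treated_given_gender[g] if treated
                      else 1 - p_treated_given_gender[g])
                  for g in p_gender}
    evidence = sum(p_gender[g] * likelihood[g] for g in p_gender)
    gender_posterior = {g: p_gender[g] * likelihood[g] / evidence for g in p_gender}
    p_recovered = sum(recovery[(g, treated)] * gender_posterior[g] for g in p_gender)
    return gender_posterior, p_recovered

gender_if_treated, recovered_if_treated = observe_treatment(True)
gender_if_untreated, recovered_if_untreated = observe_treatment(False)

print(gender_if_treated, recovered_if_treated)
print(gender_if_untreated, recovered_if_untreated)
```

Observing the treatment shifts the gender distribution from 50/50 to 75/25, and the posterior recovery rates match the aggregate table.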
For causal inference, however, we need a network that computes all probabilities under an intervention scenario. As we learned, Graph Surgery transforms the original causal network representing the preintervention distribution into a new, mutilated network that yields the postintervention distribution.
In BayesiaLab, Graph Surgery is automated. After right-clicking the Monitor of the node Treatment, we select Intervention from the Contextual Menu.
The activation of the Intervention Mode for this node is highlighted by the blue background of Treatment's Monitor and the arrow symbols (→) in Treatment's badge.
By double-clicking a state of Treatment, we now set an Intervention and no longer an Observation.
By intervening on Treatment, BayesiaLab applies Graph Surgery and removes the inbound arc into Treatment.
Recall the formula that computes the Average Causal Effect (ACE):
We can take it directly as a set of instructions and compare the probability of Outcome=Recovered under do(Treatment=False) and do(Treatment=True). Note that the distribution of Gender remains the same pre- and postintervention.
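The same comparison can be sketched outside of BayesiaLab by constructing the joint distribution of the mutilated network explicitly. The code below is an illustration of the idea, not a description of BayesiaLab's internals: it drops the factor P(Treatment | Gender) and keeps the remaining factors unchanged. The recovery rates are those of the gender-specific table, and the gender distribution is the 50/50 mix.

```python
from itertools import product

p_gender = {"male": 0.5, "female": 0.5}    # P(Z), unaffected by the intervention
recovery = {                               # P(Y=1 | X, Z), unaffected as well
    ("male", True): 0.60, ("male", False): 0.70,
    ("female", True): 0.20, ("female", False): 0.30,
}

def intervene(treated):
    """Joint distribution of (gender, recovered) under do(Treatment=treated).

    The factor P(X | Z) is removed by the surgery; X is simply fixed.
    """
    joint = {}
    for gender, recovered in product(p_gender, (True, False)):
        p_y = recovery[(gender, treated)]
        joint[(gender, recovered)] = p_gender[gender] * (p_y if recovered else 1 - p_y)
    return joint

def p_recovered(joint):
    return sum(p for (_, recovered), p in joint.items() if recovered)

def p_male(joint):
    return sum(p for (gender, _), p in joint.items() if gender == "male")

joint_treated = intervene(True)
joint_untreated = intervene(False)
ace = p_recovered(joint_treated) - p_recovered(joint_untreated)

print(p_male(joint_treated), p_male(joint_untreated), round(ace, 3))
```

Under both interventions, the share of men stays at 50%, and the difference in recovery probabilities is the Average Causal Effect.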
Noncausal Path: X ← Z → Y
Causal Path: X → Y
All noncausal paths between treatment and outcome are “blocked” (noncausal relationships prevented).
All causal paths from treatment to outcome remain “open” (causal relationships preserved).
First, we look at the noncausal path in our CDAG: X ← Z → Y. This implies that there is an indirect association between X and Y via Z that has to be blocked by adjusting for Z.
Next is the causal path in our CDAG: X → Y. It consists of a single arc from X to Y, which is open by default and cannot be blocked.
Note that the mutilated graph achieves what was stipulated by the Adjustment Criterion: the noncausal path X ← Z → Y does not exist any longer, as required by the Adjustment Criterion, and, given the autonomy of the other arcs, the causal path X → Y remains unblocked.
This new graph can tell us what happens to Y when we intervene and set X to a specific value, i.e., $do(X = x)$. Note the do-operator! With this mutilated graph, we can compute the quantity of interest, the Average Causal Effect (ACE):
First, the marginal distribution $P(Z)$ remains invariant under the intervention because the process of determining Z is unaffected by removing the arc Z → X. In our example, this means that the share of men and women must remain the same before and after the intervention:
Secondly, the conditional probability distribution $P(Y \mid X, Z)$ remains invariant under the intervention because the process that determines how Y responds to X and Z stays the same, regardless of whether X changes naturally or through external intervention. We can state this formally as follows:
As illustrated earlier, we manually create the nodes and draw the arcs on BayesiaLab’s Graph Panel. We choose to use long names for the nodes instead of X, Y, and Z. Letters were very convenient for formulas, but long names increase the readability of Bayesian networks. To further help with interpretation, we also associate images with each node and display them as Badges. Then, we use View > Layout > Genetic Grid Layout > Top-Down Repartition to obtain a layout that takes into account the direction of the arcs and defines layers accordingly.
The yellow warning symbols remind us that the probability tables associated with the nodes have yet to be defined. At this point, we could set the parameters based on our knowledge of all the probabilities in this domain. Instead, we utilize the available data and use BayesiaLab’s Parameter Estimation to establish the quantitative part of this network via Maximum Likelihood Estimation. We have been using Parameter Estimation extensively in this book, either implicitly or explicitly, for instance, in the context of structural learning and missing values estimation.
This prompts us to select the text file containing our observational data. Upon selecting the file, BayesiaLab brings up the first screen of the Associate Data Wizard.
Before proceeding to the effect estimation, we bring up the Monitors of all three nodes and compare the probabilities reported by the network with the original tables, which gave rise to the paradox.
Thus, we obtain an Average Causal Effect of −0.1, which agrees with what we previously computed.
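| Treatment | Patient Recovered: Yes | Patient Recovered: No |
|---|---|---|
| Yes | 50% | 50% |
| No | 40% | 60% |

| Gender | Treatment | Patient Recovered: Yes | Patient Recovered: No |
|---|---|---|---|
| Male | Yes | 60% | 40% |
| Male | No | 70% | 30% |
| Female | Yes | 20% | 80% |
| Female | No | 30% | 70% |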
We now introduce Causal Effect Estimation by means of Likelihood Matching. Given the simplicity of Simpson’s Paradox example, the need for yet another estimation method may not be immediately apparent. The advantages of Likelihood Matching will only become clear as we study a more complex domain, such as the marketing mix example of the next chapter. However, the current example makes it easy to explain Likelihood Matching.
In statistics, matching refers to the technique that makes confounder distributions of the treated and untreated subpopulations as similar as possible to each other. As such, applying matching to variables qualifies as adjustment, and we can use it with the objective of keeping causal paths open and blocking noncausal paths. In the Simpson’s Paradox example, matching is fairly simple as we only need to match a single binary variable, i.e., Gender. That will meet our requirement for adjustment and block the only noncausal path in our model.
As our terminology of “blocking paths by matching” may not be understood outside the world of graphical models and Bayesian networks, we can offer a more intuitive interpretation of matching, which our example can illustrate very well.
Because of the self-selection phenomenon of our population, the Gender distribution is a function of Treatment. In other words, of those who are treated, 75% turn out to be male. Among untreated patients, only 25% are male.
Given that we know that Gender has a causal effect on Outcome (men are more likely to recover than women, with or without treatment), and given this difference in gender composition, comparing the outcomes between the treated and nontreated patients is clearly not an apples-to-apples comparison.
We can propose a commonsense solution to this predicament. How about searching for a subset of patients within treated and nontreated groups, which have an identical gender mix, as illustrated below?
In statistical matching, this process typically involves the selection of units in such a way that comparable groups are created:
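The common-sense idea of comparing subsets with an identical gender mix can be sketched numerically. The group sizes below are not quoted from the dataset. They are derived under the assumption of 600 treated and 600 untreated patients, with 75% and 25% men, respectively. The sketch works with expected recovery rates per stratum rather than with individual patient records.

```python
# Number of patients per treatment group and gender (derived, see assumptions above).
group_sizes = {
    True: {"male": 450, "female": 150},    # treated
    False: {"male": 150, "female": 450},   # untreated
}
# Recovery rates P(Y=1 | X, Z) from the gender-specific table.
recovery = {
    ("male", True): 0.60, ("male", False): 0.70,
    ("female", True): 0.20, ("female", False): 0.30,
}

def matched_recovery_rate(treated):
    """Expected recovery rate in a subset with equal numbers of men and women."""
    sizes = group_sizes[treated]
    n_per_gender = min(sizes.values())   # largest subset with a 50/50 gender mix
    expected_recovered = sum(n_per_gender * recovery[(g, treated)] for g in sizes)
    return expected_recovered / (n_per_gender * len(sizes))

matched_difference = matched_recovery_rate(True) - matched_recovery_rate(False)
print(round(matched_recovery_rate(True), 3),
      round(matched_recovery_rate(False), 3),
      round(matched_difference, 3))
```

In the matched subsets, both groups consist of 150 men and 150 women, so that differences in recovery can no longer be attributed to differences in gender composition.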
In practice, this can be more challenging as the observed units typically have more than just a single binary attribute. So, the idea of matching has to be extended to higher dimensions, and the observed units need to be matched on a range of attributes, including both continuous and discrete variables.
However, matching observations exactly with regard to all covariates is rarely feasible. For instance, patients are characterized by dozens or even hundreds of attributes and comorbidities. Finding two matching patients would be difficult enough, but finding populations with many matching pairs of patients would presumably be impossible.
So, how does randomization do it? Actually, randomization does not guarantee identical populations but rather ensures that the distributions of confounders are balanced between the populations under study. So, to pursue balanced confounders instead of pursuing perfect matches, numerous similarity measures have been proposed for matching.
The concept of Propensity Score Matching has become a particularly popular method (Rosenbaum and Rubin, 1983). Instead of matching individuals on their high-dimensional attributes, we would match observations by their probability of treatment, i.e., P(X=1 | Z), which is known as the Propensity Score. Rosenbaum and Rubin have shown that matching on the propensity score achieves a balance of the covariate distributions.
However, matching on the Propensity Score requires the score itself to be estimated first. Conventional models only represent the outcome variable as a function of the treatment variable and the confounders, i.e., P(Y | X, Z). If we need to understand the relationship between the treatment and the confounders, i.e., P(X | Z), we have to estimate this separately. This usually means fitting a function, such as a regression, that models the propensity score, PS = P(X | Z). For binary treatment variables, logistic regression is a common choice for the functional form.
With BayesiaLab's Likelihood Matching, we do not directly match the underlying observations. Rather we match the distributions of the relevant nodes on the basis of the joint probability distribution represented by the Bayesian network. In our example, we need to ensure that the gender compositions of untreated and treated groups are the same, i.e., a 50/50 gender mix. This theoretically ideal condition is shown in the Monitors below.
However, the actual distributions reveal the inequality of gender distributions for the untreated and the treated.
How can we overcome this? By simply using Probabilistic Evidence to set a 50/50 gender mix, i.e., the marginal distribution of Gender, upon setting evidence on Treatment. We can also right-click on the Monitor for Gender and select Fix Probabilities from the Contextual Menu. This will automatically use Probabilistic Evidence to set the current marginal distribution after each conditioning on Treatment, or any other node.
With Fix Probabilities applied, the distribution of Gender in its Monitor is highlighted in purple.
Other than colors, nothing appears to have changed. However, once we set the values Treatment=False and Treatment=True, we see that the distribution for Gender does not change. We can also observe that the corresponding posterior probabilities for Outcome are the same as those obtained with Graph Surgery.
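One way to picture what fixing the Gender distribution accomplishes is as a reweighting of each treatment group so that its gender mix matches the 50/50 marginal. The sketch below is only an illustration of that idea under the probabilities of our example. It does not describe how BayesiaLab computes Probabilistic Evidence internally, and the function name is hypothetical.

```python
# Target distribution of Gender: its marginal distribution, a 50/50 mix.
p_gender_target = {"male": 0.5, "female": 0.5}
# Observed gender mix within each treatment group, P(Z | X).
p_gender_given_treatment = {
    True: {"male": 0.75, "female": 0.25},
    False: {"male": 0.25, "female": 0.75},
}
# Recovery rates P(Y=1 | X, Z) from the gender-specific table.
recovery = {
    ("male", True): 0.60, ("male", False): 0.70,
    ("female", True): 0.20, ("female", False): 0.30,
}

def outcome_with_fixed_gender(treated):
    """Recovery probability after reweighting the group to the target gender mix."""
    observed = p_gender_given_treatment[treated]
    weights = {g: p_gender_target[g] / observed[g] for g in observed}
    return sum(recovery[(g, treated)] * observed[g] * weights[g] for g in observed)

ace = outcome_with_fixed_gender(True) - outcome_with_fixed_gender(False)
print(round(outcome_with_fixed_gender(True), 3),
      round(outcome_with_fixed_gender(False), 3),
      round(ace, 3))
```

With the gender mix held at 50/50 in both groups, the recovery probabilities coincide with those obtained under intervention.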
With that, we can once again calculate the Average Causal Effect:
$ACE = P\left( Y = 1 \mid do(X = 1) \right) - P\left( Y = 1 \mid do(X = 0) \right) = -0.1$
For now, Likelihood Matching applied to Simpson's Paradox may not seem like a breakthrough method. Conceptually and practically, it appears to be another form of adjustment. The fundamental advantages of Likelihood Matching will become clear in the context of the next chapter, Causality and Optimization.
Returning to the original version of the CDAG, without the hidden variable, we are now ready to proceed with the estimation. However, this CDAG is only a qualitative representation of our theory about the DGP. We now need to consider this graph as a model representing the joint probability distribution of our three variables P(X, Y, Z).
We do not yet need to determine what this probability function is; we simply need to consider this graph as a nonparametric probability function linking X, Y, and Z. This will help us understand what it means to adjust for Z to estimate the causal effect.