The way out of this predicament is to turn to modeling. Instead of working with an enormously large joint probability table, we will approximate the joint probability distribution of the domain with a model. But does this not take us back to the problem of being unable to machine-learn a causal model? No, because we do not need a causal model. Rather, we merely need a statistical model to approximate the statistical relationships of the variables in our domain. Of course, that task is easy. In earlier chapters, we have already introduced various machine learning algorithms in BayesiaLab that can generate Bayesian networks from data.
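To see why a factored model makes this approximation tractable, consider a rough size comparison. The variable count, state count, and parent limit below are illustrative assumptions, not properties of the ACME dataset:

```python
# Illustrative comparison (hypothetical sizes): a full joint probability
# table over 15 five-state variables versus a factored Bayesian network
# in which no node has more than two parents.
def joint_table_cells(n_vars: int, n_states: int) -> int:
    """Number of cells in the full joint probability table."""
    return n_states ** n_vars

def bn_parameter_bound(n_vars: int, n_states: int, max_parents: int) -> int:
    """Upper bound on CPT entries when each node has at most max_parents parents."""
    return n_vars * n_states ** (max_parents + 1)

full = joint_table_cells(15, 5)          # 5^15 = 30,517,578,125 cells
factored = bn_parameter_bound(15, 5, 2)  # 15 * 5^3 = 1,875 entries
print(f"full table: {full:,} cells, factored network: {factored:,} entries")
```

The factored representation shrinks the problem by roughly seven orders of magnitude in this stylized setting, which is what makes machine-learning the statistical model feasible.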
Given that we have a target variable, i.e., Sales, Supervised Learning will be the appropriate approach. First, we import the dataset ACME_Data.csv and discretize all variables with the R2GenOpt algorithm into five states. Our choice of this discretization algorithm follows the rationale that we presented in earlier chapters.
After importing the dataset, we use the Augmented Naive Bayes algorithm to learn a network with Sales as the Target Node. The result is shown below.
Here, we will omit a discussion of the network quality. Rather, we proceed with the given network and look at how to use it to understand our problem domain.
As is, this network has the familiar appearance of a predictive model. And we have kept emphasizing that machine learning can only produce predictive models, which, in turn, are only capable of performing observational inference. However, it is causal inference that we require for the purpose of marketing mix optimization. Clearly, a causal interpretation of the arcs in the above network would not make any sense. After all, Sales is the outcome, not the cause.
This does not matter because we had no intention of finding a causal model with machine learning. We had already given up on finding the true causal model. The model we obtained is only meant to compactly represent the Joint Probability Distribution (JPD) of all variables in this domain.
Our key to causal inference is that we have the JPD, as represented by the machine-learned Bayesian network, and we know, based on domain knowledge and the Disjunctive Cause Criterion, which variables are Confounders.
One issue remains open, though, and that is what mechanism to use for estimation. Recall matching as an estimation technique and, in particular, Likelihood Matching (see Intuition for Matching). Here, however, we are not dealing with just one covariate Z, but rather with 10, of which 8 are confounders.
Before proceeding to estimation in BayesiaLab, we need to formally declare which variables are Confounders. In BayesiaLab, all nodes are considered Confounders by default. Hence, we need to declare the inverse condition, i.e., we must tell BayesiaLab which nodes are not Confounders or “Non-Confounders.”
As per the rationale we laid out earlier, we need to mark the variables that are not pre-treatment, i.e., Co-Op Promotions, Competitive Incentives, Web Traffic, Showroom Traffic, and Test Drives, and assign them to the predefined Class Non_Confounder.
Select the nodes to be added to the Class Non_Confounders.
Right-click on one of them to bring up the Contextual Menu.
Select Properties > Classes > Add.
Check the radio button for Predefined Class and pick Non_Confounder from the drop-down menu.
As a result, we now have a clear distinction between Confounders and Non-Confounders and can perform an effect estimation on that basis.
Add Color to Non-Confounders
To highlight this distinction, we assign colors to each Class.
Right-click on the Graph Panel background to bring up the Contextual Menu.
Select Edit Classes.
In the Class Editor, highlight Non_Confounder in the list and click Associate Colors.
From the dialog box, select Associate Default Colors with Classes and click OK.
Now, all Non-Confounders are highlighted in red.
Under Estimation Challenges, we already pointed out that stratification is no longer feasible as an estimation technique, which leaves us with matching. However, this example exceeds the complexity of Simpson’s Paradox in a number of ways. In Simpson’s Paradox, we had one Confounder and one treatment variable, which had only one treatment level.
Now we have several Confounders and several treatments. And many of the treatment variables have multiple treatment levels. As a result, the task of matching is no longer straightforward. Now, we need to simultaneously balance all Confounders with regard to each treatment variable in such a way that each treatment remains at its marginal probability level. With that in place, we can simulate different treatment levels and observe the corresponding outcomes. What we then observe in the different outcomes is indeed the causal effect. It is easy to imagine that the computational effort for this process is substantial. It goes beyond the scope of this book to explain the details of BayesiaLab’s Likelihood Matching algorithm. For our purposes, we only need to know how to invoke the algorithm in BayesiaLab for causal effect estimation. Likelihood Matching is launched whenever we run Direct Effects in BayesiaLab.
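The details of BayesiaLab’s Likelihood Matching algorithm remain beyond our scope, but the quantity it targets can be sketched on a toy, single-Confounder example. All probabilities and outcome values below are made up; the point is only that balancing the Confounder at its marginal distribution while varying the treatment yields the adjustment-formula estimate:

```python
# Toy sketch of what Confounder balancing computes (hypothetical numbers):
# with one binary Confounder Z and one binary treatment X, the causal effect
# on outcome Y is the Z-marginal-weighted average of the stratum-specific
# outcomes -- the same quantity Likelihood Matching targets for many
# Confounders simultaneously.
p_z = {0: 0.4, 1: 0.6}                      # marginal P(Z), kept fixed
e_y = {(0, 0): 10.0, (0, 1): 14.0,          # E[Y | X=x, Z=z], made-up values
       (1, 0): 20.0, (1, 1): 26.0}

def expected_outcome_under_do(x: int) -> float:
    """E[Y | do(X=x)] via the adjustment formula: sum over z of P(z) * E[Y|x,z]."""
    return sum(p_z[z] * e_y[(x, z)] for z in p_z)

effect = expected_outcome_under_do(1) - expected_outcome_under_do(0)
```

With many multi-level treatments and eight or more Confounders, this weighted average can no longer be enumerated stratum by stratum, which is exactly why BayesiaLab performs the balancing algorithmically.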
While this chapter’s example is inspired by a real business, we are trying to minimize any resemblance to a particular company or industry. We shall refer to our fictional business as ACME Corp. All of its sales and marketing data are synthetically generated. This allows us to illustrate a variety of effects in a single composite example. With the few publicly available datasets of real marketing and sales data, we might not be able to observe the range of characteristics that we wish to present here. Also, for ease of interpretation, we have magnified some effects, resulting in a somewhat idealized consumer response to the marketing actions of ACME Corp. With artificial data, we also have the luxury of plentiful observations, although that is not necessarily unrealistic.
Our fictional business ACME utilizes a variety of marketing and advertising channels. Throughout this chapter, we will also refer to them as "marketing drivers" or just "drivers."
TV Advertising
Internet Advertising
Print Advertising
Direct Marketing
Incentives (i.e., price discounts)
These variables are all measured on proprietary scales. For instance, TV Advertising might be measured in GRP (Gross Rating Points), and Print might be recorded in column-inches. Incentives refer to price promotions and discounts measured in dollars. We consider all of these marketing instruments to be under ACME’s control, i.e., we can set them to any desired level within an overall budget constraint.
Furthermore, we have a target variable, Sales, measured in units sold daily, which we hope to improve by optimizing the mix of marketing instruments.
Beyond the variables that are under ACME’s control, there are four calendar variables:
Quarter
Weekday
Month
End-of-Month Indicator
Finally, we measure a number of variables that are beyond ACME's direct control but still have an influence on the business, including:
Co-Op Promotions (promotions sponsored by the vehicle manufacturer)
Competitive Incentives (i.e., price discounts on competitive products)
Web Traffic (organic traffic to ACME's website—not paid-for traffic)
Showroom Traffic (organic visits to ACME's facilities)
Test Drives (organic)
While numerous other variables certainly exist in this domain, we have no further data available. Later, we will need to formalize our assumptions in this regard.
For a comprehensive study of marketing mix modeling and optimization, there are numerous questions we should consider, such as:
Which form of advertising is the strongest driver of ACME's sales?
How do competitive incentives affect ACME's sales?
What is the optimum marketing mix overall, given different levels of marketing budget constraints?
Are there saturation effects of certain marketing channels?
Are there counterproductive promotions?
How can we attribute the observed sales volume to marketing initiatives?
What would be the baseline sales volume without any advertising?
Are there synergy effects that make some instruments more effective jointly than individually?
The single most important thing we need to recognize regarding the above questions is that they are all causal questions. This means we are not looking for a prediction of Sales based on the observation of marketing variables. Rather, we wish to simulate the manipulation of all marketing variables in such a way that we maximize Sales. Thus, we are performing an intervention in our domain, which requires causal inference. This is the reason why we could only introduce the marketing mix question after having established foundational causal concepts in Chapter 10. There, we explained how a causal graph, combined with certain criteria, can tell us precisely which variables we have to adjust to identify a causal effect. We merely had to provide a causal graph, i.e., encode our causal understanding of the domain. With three variables in Simpson’s Paradox, there were only 25 possible causal structures. A common-sense understanding of the domain allowed us to quickly identify the only reasonable graph in terms of causal directions. In that case, making assumptions about the full causal structure was straightforward. So, we may now feel well-equipped to answer more complex causal questions, such as those posed by the given marketing domain.
From 25 to 1,439,428,141,044,398,334,941,790,719,839,535,103 Graphs

Going from three variables in Simpson’s Paradox to 14 variables in this marketing example, it would be reasonable to expect a substantial increase in the number of possible causal networks. As it turns out, given a set of 14 variables, there are now 1.4×10³⁶ possible causal structures as opposed to 25. Clearly, we can no longer rely on our intuition to pick the correct one out of over one undecillion possible graphs. Furthermore, clever algorithms and fast computers cannot help us with this task either. As it stands, causal directions can generally not be discovered through machine learning.
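For readers who want to verify these counts, Robinson’s recurrence for the number of directed acyclic graphs on n labeled nodes can be coded in a few lines. This is a verification sketch; num_dags is our own helper name:

```python
from functools import lru_cache
from math import comb

# Robinson's recurrence for the number of DAGs on n labeled nodes
# (OEIS A003024), used here to reproduce the counts quoted in the text.
@lru_cache(maxsize=None)
def num_dags(n: int) -> int:
    if n == 0:
        return 1
    return sum((-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * num_dags(n - k)
               for k in range(1, n + 1))

print(num_dags(3))   # 25 possible structures in Simpson's Paradox
print(num_dags(14))  # 1,439,428,141,044,398,334,941,790,719,839,535,103
```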
However, without a causal graph, we cannot use the familiar criteria for confounder selection, such as the Adjustment Criterion. And, without the ability to select and control for confounders, we cannot employ the usual estimation methods. A commonly used fall-back position is to simply “control for all pretreatment covariates” (Rubin, 2009). However, the example in the previous chapter highlighted the risks of doing that. So, it appears that we have already reached a dead end with our example.
As it turns out, recent research has made significant progress and produced a new criterion for selecting confounders. VanderWeele and Shpitser (2010) have discovered that it is possible to select confounders without knowing the full causal graph:
We show that, irrespective of what the true causal structure is, and irrespective of whether there are important unobserved variables, if there exists some subset of the observed covariates that suffices to control for confounding, then the set obtained by applying our criterion will also constitute a set that suffices.
This is a profound insight, given that not knowing the causal structure between covariates is not at all unique to our example. We speculate that “causal ignorance” is the prevailing condition in most research projects. VanderWeele and Shpitser have shown that confounders can be found differently:
We propose that control be made for any [pre-treatment] covariate that is either a cause of treatment or of the outcome or both.
In other words, we now need to ask the common-sense question about each covariate: “Is it a cause of the treatment, the outcome, or both?” If the answer is yes, the variable in question is a Confounder, and we must control for it to estimate the causal effect. With this new approach to confounder selection, we do not need to know—or even consider—the relationships between the covariates.
There are a number of caveats, though. Once again, we have to assume that there are no unobserved confounders. Furthermore, we must also assume that there exists a set of variables Z that would meet one of the formal identification criteria. If this assumption holds, the proposed selection criterion will identify the set of variables Z. We must stress that such an assumption cannot be tested. It can only be justified on theoretical grounds. Nevertheless, it is a much weaker assumption than claiming to know the full causal structure.
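As a sketch of how the Disjunctive Cause Criterion operates, the following fragment applies the selection rule to a stylized subset of covariates. The pre-treatment and cause labels restate the domain judgments made in this chapter; they are our assumptions, not the output of any algorithm:

```python
# Sketch of the Disjunctive Cause Criterion: control for every pre-treatment
# covariate that is a cause of the treatment, of the outcome, or of both.
# The labels below encode domain judgments from the text (a stylized subset).
covariates = {
    # name: (is pre-treatment, is a cause of treatment and/or outcome)
    "Quarter":                (True,  True),
    "Weekday":                (True,  True),
    "Month":                  (True,  True),
    "End-of-Month Indicator": (True,  True),
    "Web Traffic":            (False, True),   # downstream of marketing, not pre-treatment
    "Showroom Traffic":       (False, True),
    "Test Drives":            (False, True),
}

def select_confounders(covs: dict) -> set:
    """Apply the disjunctive rule to the labeled covariates."""
    return {name for name, (pre_treatment, causes_t_or_y) in covs.items()
            if pre_treatment and causes_t_or_y}

confounders = select_confounders(covariates)
```

Note that no edge between covariates appears anywhere in this sketch; only the two labels per covariate are required, which is precisely the appeal of the criterion.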
In the context of this marketing example, we consider Sales as the outcome variable. However, unlike the single treatment X in Simpson’s Paradox, we now have many potential treatments, i.e., all the marketing variables. Not only do we have to identify Confounders with regard to one treatment/outcome relationship, but we need to do this for all treatment/outcome pairs. For instance, if we consider TV Advertising as the treatment, we need to check all covariates as to whether they are causes of TV Advertising, Sales, or both.
In this domain, it is fairly easy to judge that all of ACME’s advertising efforts (TV Advertising, Internet Advertising, Print Advertising, Direct Marketing) and Incentives should be seen as causes of Sales. We also reason that the calendar variables, Quarter, Weekday, Month, and End-of-Month Indicator are also causes of Sales. It is common knowledge, for instance, that Saturday is the main car shopping day in the U.S. Also, the fourth quarter marks the start of a new model year and concludes the calendar year, which makes it the peak selling season.
Five variables remain, i.e., Co-Op Promotions, Competitive Incentives, Web Traffic, Showroom Traffic, and Test Drives. Are they not also causes of Sales? Yes, but we argue that they are not pre-treatment variables. Rather, given our domain knowledge, we believe that these variables “respond” to ACME’s marketing and advertising efforts, meaning that they are "downstream" from the original causes. For example, the original cause Print Advertising drives Showroom Traffic, which subsequently leads to Sales.
Regarding Competitive Incentives being a Non-Confounder, we argue that the competition presumably wants to counteract ACME’s efforts. If ACME increases incentives as part of a campaign, competitors would presumably follow suit with their own incentives. However, one may object to this line of reasoning and instead suggest that Competitive Incentives come first and prompt ACME to react. From that viewpoint, Competitive Incentives would definitely be pre-treatment. However, if we treated Competitive Incentives as a Confounder, it would imply that the competition would “hold still” while ACME tries out different marketing spend levels during optimization. Clearly, this would be an unrealistic assumption. Instead, we believe it is reasonable to think that the competition will react as it always has historically.
With that, we have specified all explicit assumptions regarding this domain. Furthermore, we have also assumed implicitly that there are no unobserved Confounders, i.e., that no other hidden variables exist that influence our domain. Such a claim is almost as bold as assuming the complete causal structure of this domain. Also, there is no way to test for the existence of hidden Confounders. As outrageous as this may seem, this assumption is made in virtually all models based on observational data, regardless of the modeling technique. The only way we can justify this assumption is on theoretical grounds, i.e., we need to have domain knowledge that allows us to rule out unobserved Confounders.
Now that we have identified the Confounders, we would expect to be able to estimate the causal effects. Theoretically, the Adjustment Formula (i.e., stratification) could serve as our computation method. Why is this not immediately feasible?
The first objection is that our dataset consists of mostly continuous variables rather than discrete states, which was the case in the previous chapter. However, we can overcome this challenge by discretizing all continuous variables, which we have demonstrated repeatedly in this book.
With the discretized dataset of our domain, a new problem looms: assuming that all confounders now have 5 discrete states, we would need to calculate the weighted average of approximately 2 million strata to estimate the effect of one treatment. In other words, we would require the entire joint probability table representing the domain. Even if we could manage the computational task with a computer fast enough and with enough memory to store the joint probability table, we would not have anywhere near enough observations to estimate this table. While we have thousands of daily observations, our joint probability table would consist of millions of rows.
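The “approximately 2 million” figure can be reproduced under our reading of the setup, namely eight five-state Confounders plus one five-state treatment; the exact variable count is an assumption on our part, as is the observation count:

```python
# Back-of-the-envelope stratification count: eight five-state Confounders
# plus one five-state treatment yield 5^9 joint configurations to estimate.
# The observation count is a hypothetical stand-in for "thousands of daily
# observations."
n_states = 5
n_confounders = 8
strata = n_states ** (n_confounders + 1)   # 1,953,125 configurations
daily_observations = 3_000                 # hypothetical dataset size
print(f"{strata:,} strata vs. {daily_observations:,} observations")
```

With several hundred times more strata than observations, most strata would be empty, so the stratum-specific averages required by the Adjustment Formula simply cannot be estimated.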
Let us summarize where we stand. We started this chapter with the challenge that we could not encode one true causal graph. Thus, identification using traditional criteria was not possible. The new Disjunctive Cause Criterion by VanderWeele and Shpitser saved us from having to define a full causal graph. Fewer, simpler assumptions now suffice to select the Confounders. But now that we have the Confounders, our straightforward estimation techniques, e.g., stratification, no longer work. It seems that for every step forward, we must take another step back.
Marketing Mix Modeling and Optimization
Half the money I spend on advertising is wasted; the trouble is I don’t know which half.
Over the last century, various versions of this quote have been attributed to marketing pioneers such as John Wanamaker and Lord Leverhulme. Yet, a century later, in this day and age of big data and advanced analytics, the quote still rings true among marketing executives. The ideal composition of advertising and marketing efforts remains the industry’s Holy Grail. Certainly, many advertising agencies and market research firms promote proprietary methodologies in pursuit of the optimum allocation of marketing resources, and there have been decades of research in marketing science on this topic. Yet, despite all commercial and academic efforts, there is a remarkable lack of universally accepted methods for marketing mix modeling and optimization. As a result, the current practice remains “more art than science.”

We speculate that this lack of a well-established marketing mix methodology has little to do with the domain itself. Rather, it reflects the fact that marketing is yet another domain that frequently has to rely on non-experimental data for decision support. As such, marketing mix optimization is a rather prototypical problem that mirrors the challenges of many other fields.

What is perhaps unique to marketing is the large number of instruments, i.e., the wide range of advertising channels and promotions, that can be utilized as individual levers in reaching and convincing consumers. Moreover, many marketing instruments can be easily quantified in terms of cost. Hence, the marketing domain lends itself as a teaching example for this chapter.
The question of budget brings up another issue. Thus far, all variables are shown on their original, proprietary scale without any cost information. For instance, we have not yet defined how much “one unit” of TV Advertising costs in dollars. Prior to version 6 of BayesiaLab, the Cost property was available to specify the unit cost for each variable. At the time, one could have specified that 1 GRP costs $1,000. However, real-world applications are not as straightforward as having a fixed price per unit. As is the case with most business transactions, volume discounts may apply that need to be considered when optimizing media spend.
With BayesiaLab 6, we introduced the concept of Function Nodes. They facilitate the computation of scalar values based on the distribution and values of the states in nodes. This is best illustrated in the context of our example. We will now use a Function Node to “translate” the original units of TV Advertising into dollar values.
A Function Node calculates values ad hoc. As such, a Function Node does not exist in the original dataset.
In Modeling Mode, activate the Function Node Creation Mode by clicking the corresponding icon on the menu bar. Then, position the node on the Graph Panel. By default, the first Function Node to be introduced has the name F1.
Go into Arc Creation Mode and draw an arc from TV Advertising to F1.
For random nodes, the warning symbol means that the Conditional Probability Table has not been estimated yet. In the context of Function Nodes, it means that an equation has yet to be defined that will determine the value of the Function Node.
Open the Function Node F1 by double-clicking on it, which brings up the Node Editor.
Note the TV Advertising node listed in the center of the three panels at the bottom of the window. This is where the parent nodes of a Function Node are shown. In our case, TV Advertising is currently the only parent node. This means that TV Advertising is the only variable that can be included in the Equation Tab in the top panel of the window.
Whereas a Function Node, such as F1, represents scalar values, "normal" nodes, such as TV Advertising, always represent distributions of states.
This is where the functions in the bottom left panel come into play, in particular the Inference Functions. We can use them to extract a scalar statistic from TV Advertising, which F1 will then represent as a scalar value.
Here, as a first step, we want F1 to represent the mean value of the cost of TV Advertising:
Double-click on Inference Functions and then double-click on MeanValue(v). This adds the inference function to the Equation Tab. By default, a placeholder variable v is highlighted in the equation.
You can single-click on TV Advertising to display the domain range of this variable for your information.
Double-click on TV Advertising to add ?TV Advertising? to the Equation Tab. ?TV Advertising? automatically assumes the position of v if that placeholder was still highlighted. The final syntax will appear as ?F1?=MeanValue(?TV Advertising?).
Click Validate to check the syntax and have BayesiaLab compute the value of F1.
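Behind the scenes, MeanValue simply computes the expectation of a discretized node, i.e., the probability-weighted sum of its state values. A minimal sketch, with made-up state values and probabilities standing in for the discretized TV Advertising node:

```python
# Sketch of what the MeanValue inference function computes: the expectation
# of a discretized node. The five bin means (in GRP) and their probabilities
# are hypothetical, not taken from the ACME dataset.
state_values = [50.0, 150.0, 250.0, 350.0, 450.0]   # hypothetical bin means
state_probs  = [0.10, 0.25, 0.30, 0.25, 0.10]       # hypothetical P(TV Advertising)

def mean_value(values, probs):
    """Probability-weighted mean of a discretized node's state values."""
    assert abs(sum(probs) - 1.0) < 1e-9
    return sum(v * p for v, p in zip(values, probs))

f1 = mean_value(state_values, state_probs)   # scalar value represented by F1
```

This is why F1 can hold a single scalar even though its parent node always represents a full distribution of states.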
Where are we going with this? We will now add further parents to F1 so that we can calculate the cumulative cost of all Confounders. We simply draw arcs from all the Confounders to F1.
Select the Arc Creation Mode and draw arcs from all Confounders to F1.
Hold L to remain in the Arc Creation Mode so you can keep adding arcs without having to go back to the toolbar.
Recall that we did not draw any arcs from Weekday, Month, Quarter, and End-of-Month Indicator to F1. The reason is that these calendar-related variables are beyond the control of ACME. After all, ACME cannot buy more Saturdays or pay for a longer fourth quarter. Also, we do not draw arcs from the Non-Confounders to F1, as ACME cannot directly control them through its budget either.
The Function Node F1 can now serve as a "summary node" for all Confounders. If all Confounders were recorded on the same scale and had the same cost of $1/unit, you could sum up the mean values of all the Confounders:
Double-click F1 to open the Node Editor.
Enter this expression into the Equation Tab: MeanValue(?TV Advertising?)+MeanValue(?Print Advertising?)+MeanValue(?Internet Advertising?)+MeanValue(?Incentives?)+MeanValue(?Direct Marketing?)
Instead of typing the entire syntax, you can also add the inference function MeanValue() and all listed Confounders by double-clicking on the respective items in the lower panels of the Node Editor.
Needless to say, assuming a cost of $1/unit for each type of advertising is entirely unrealistic. And the purpose of having a Function Node is that we can enter any arbitrary cost function for each advertising channel. A typical example would be entering a quantity discount in the form of an if/then statement:
?F1?=IF(MeanValue(?Print Advertising?)>=10, 0.9*MeanValue(?Print Advertising?),MeanValue(?Print Advertising?))
This statement would discount the cost of Print Advertising by 10% once 10 units are reached. This way, you can define even complex pricing and discount structures. Why is this so important? The optimization algorithm that BayesiaLab employs can take advantage of any such additional nonlinearities.
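The if/then statement above can be mirrored in ordinary code. This sketch assumes the $1/unit base cost used throughout this chapter; the function name and default arguments are our own:

```python
# Sketch of the quantity-discount logic from the IF statement above: once
# mean Print Advertising reaches 10 units, every unit costs 10% less.
# The threshold and discount mirror the example; the $1/unit base is assumed.
def print_advertising_cost(mean_units: float,
                           threshold: float = 10.0,
                           discount: float = 0.10,
                           unit_cost: float = 1.0) -> float:
    if mean_units >= threshold:
        return (1.0 - discount) * unit_cost * mean_units
    return unit_cost * mean_units

print(print_advertising_cost(5.0))   # below threshold, full price
print(print_advertising_cost(12.0))  # discounted: 0.9 * 12
```

Note the discontinuity at the threshold: spending exactly 10 units costs less than spending 9.5. It is precisely this kind of nonlinearity that the optimization algorithm can exploit.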
However, given that our model is based on synthetic data that already features plenty of nonlinearities, it serves no educational purpose to add further artifacts to our problem domain. Hence, we stick with a fictional cost of $1/unit for each advertising channel.
A Function Node is a highly flexible element in BayesiaLab and can play many different roles. Here, we justified its use by our need to calculate the cumulative cost of all advertising efforts.
However, not only do we need to know the total cost, but we also need to constrain it in the subsequent optimization. If cost were no object, we could simply read the optima from the Target Mean Analysis plot by taking the x-levels that correspond to the maximum y-levels for each driver. Alas, budget constraints do apply in the real world.
In BayesiaLab, we can formalize the "budget" role of F1 by adding it to the pre-defined Class Resource.
Right-click on F1.
From the Contextual Menu, select Properties > Classes > Add.
Then, check Predefined Class and select Resource from the drop-down menu.
Click Yes to conclude the step.
Note that we could add additional Function Nodes to this Class. This way, the Class Resource can represent the sum of multiple Function Nodes.
We previously pointed out that Quarter, Weekday, Month, and End-of-Month Indicator do not have any monetary cost. The positive effect of Weekday=Saturday on Sales is a "free" benefit to ACME. And the absence of arcs going into F1 already prevents these nodes from being included in the monetary cost summary in F1.
However, as we prepare for optimization, we must also encode formally that these variables cannot be modified or influenced by anyone. In other words, ACME cannot manipulate them. Thus, Quarter, Weekday, Month, and End-of-Month Indicator require a special designation so that BayesiaLab includes them in the Likelihood Matching of the Confounders but excludes them from active manipulation during optimization.
Unfortunately, the BayesiaLab jargon will now become a bit convoluted. The “do-not-manipulate” assignment of variables is done via their Cost attribute. This Cost is not to be confused with the monetary cost in terms of dollars, which is computed by F1. In general, the Cost attribute of a variable quantifies the effort required to observe a variable (see the Diagnosis example in Chapter 6). A special case is Cost=0, which makes a variable Not Observable in BayesiaLab. In the context of the calendar variables in our example, this nomenclature is counterintuitive as we would think that the dates in the calendar are certainly observable.
However, it goes beyond the scope of this chapter to present the rationale of this terminology. For the purposes of this example, we set Cost=0 to exclude the calendar variables from being manipulated during the optimization.
Select the calendar variables and then right-click on any one of them.
From the Contextual Menu, select Properties > Cost
Uncheck the box Cost or enter 0 in the Edit Cost window.
Click OK.
The Not Observable status of the variables is now reflected in the light purple color of the calendar nodes.
We now return to our marketing example for good and utilize Likelihood Matching to estimate the Direct Effect of each driver variable on the Target Node Sales.
From within the Validation Mode, select Main Menu > Analysis > Report > Target > Direct Effects on Target.
This prompts BayesiaLab to estimate the Direct Effects of each driver variable with regard to the Target Node while performing Likelihood Matching on all Confounders.
The resulting table resembles the typical output we would obtain from a linear regression analysis with parameter estimates for each covariate. As such, we may be tempted to interpret the Direct Effect as the slope of a response curve. Indeed, BayesiaLab computes the Direct Effect as the derivative of the response curve around the mean of the values of each driver. If each response curve were linear, the Direct Effect would indeed be a meaningful value for characterizing the entire curve. The question is, does this assumption of linearity hold? In Simpson’s Paradox, it certainly did. Due to the binary nature of all variables, the example was inherently linear. Hence, computing a single coefficient for the Direct Effect was adequate for describing the causal effect.
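Interpreting the Direct Effect as a local slope can be illustrated with a finite difference around the driver’s mean. The saturating response function below is entirely hypothetical and stands in for whatever curve BayesiaLab estimates:

```python
# Sketch of reading a "Direct Effect" off a response curve as a local slope:
# a central finite difference around the driver's mean value. The quadratic
# response function is made up purely for illustration.
def response(x: float) -> float:
    """Hypothetical, saturating sales response to one driver."""
    return 100.0 + 8.0 * x - 0.5 * x ** 2

def direct_effect_at(x: float, h: float = 1e-4) -> float:
    """Slope of the response curve, approximated around x."""
    return (response(x + h) - response(x - h)) / (2.0 * h)

slope_at_mean = direct_effect_at(4.0)   # analytic derivative 8 - x gives 4 here
```

The sketch also shows why a single coefficient can mislead: at x = 8 the same curve has slope zero, so the effect reported "around the mean" says little about other spend levels on a nonlinear curve.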
In this marketing mix example, however, we can make no such assumption. Rather than speculating about the nature of the relationships, we let BayesiaLab estimate the response curves, whatever their shapes might be:
Select Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects.
Then, from the options, choose:
Target: Mean
Variables: Mean
Use Hard Evidence
Click Display Sensitivity Chart, which generates a plot of Sales as a function of each driver variable.
Note that the nodes in the Class Non_Confounder are not included here.
Also, all drivers are represented with their original scales, so Weekday (Weekday ∈ {1,...,7}), Quarter (Quarter ∈ {1,...,4}), and End-of-Month Indicator (End-of-Month Indicator ∈ {0,1}) are all squeezed into the leftmost portion of the plot. Later, we will "decompress" the plot by normalizing the drivers' value ranges so they all appear on a common 0–100 scale.
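The "decompression" mentioned here is simply a min-max rescaling of each driver's values onto a common scale. A minimal sketch, with hypothetical variable ranges:

```python
# Min-max rescaling of each driver's x-values onto a common 0-100 scale,
# so that curves recorded on very different scales become comparable.
# The raw ranges below are hypothetical.

def normalize(values):
    lo, hi = min(values), max(values)
    return [100.0 * (v - lo) / (hi - lo) for v in values]

weekday = [1, 2, 3, 4, 5, 6, 7]
tv_spend = [0.0, 12500.0, 25000.0, 37500.0, 50000.0]  # hypothetical dollars

print(normalize(tv_spend))   # -> [0.0, 25.0, 50.0, 75.0, 100.0]
print(normalize(weekday))    # endpoints map to 0.0 and 100.0
```

After rescaling, a movement from the left to the right edge of the plot means "from the lowest to the highest observed level" for every driver alike.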
For now, however, we want to focus on a single driver:
Remove all curves by unchecking the All Curves checkbox.
Select only Incentives, which leaves one curve.
The x-values of the points on the curve correspond to the mean values of the discretized states of Incentives. Given that we discretized Incentives into 5 bins, we have 5 discrete x-values. The y-values are the expected values of the Target Node Sales at each corresponding x-value of Incentives.
It is important to understand that while the node Incentives varies in value, all Confounders are balanced through Likelihood Matching in such a way that Incentives is independent of all the Confounders. With that, we can consider setting each value of Incentives as a deliberate intervention, and the changes to outcome variable Sales are the causal effect of changing Incentives. Thus, the curve we see is a causal response curve.
Given the importance of Target Mean Analysis, we now simulate the curve plotting process step by step. We show what is happening in BayesiaLab "behind the scenes" as the curve is plotted using Direct Effects.
Select the Monitors of all Confounders.
Apply Fix Probabilities to all Confounders.
This "fixed" status is indicated by purple bars in the Monitors of the Confounders.
Note that you must not fix the probabilities of the Non-Confounders. Their Monitor bars have to remain blue. Incentives and Sales, of course, must remain unfixed as well. The former you will manipulate, and the latter's response you want to observe.
Set Incentives to each possible state, from the lowest to the highest. Likelihood Matching maintains the distributions of all the Confounders while Sales and the Non-Confounders can respond to the intervention.
To further illustrate this important process, we have extracted the Monitors for Incentives and Sales from the Monitor Panel above and lined them up side-by-side:
For instance, given the state Incentives<=24.343, which has a mean value of 16.070, Sales has a mean value of 285.063 (see leftmost panel). So, the mean values of Incentives and Sales are the x and y coordinates of the first point on the response curve below. The remaining points on the curve are formed in the same way.
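Holding the Confounders' distributions fixed while varying the driver amounts to the standard adjustment (standardization) formula, E[Sales | do(x)] = Σz P(z)·E[Sales | x, z]. The toy sketch below forms one point of a causal response curve this way, with a single discrete confounder; all numbers are made up and are not the ACME values shown above. Likelihood Matching achieves this balancing inside BayesiaLab; this is only the formula it approximates.

```python
# Toy illustration of how one point of a causal response curve is formed:
# vary the driver x while keeping the confounder distribution P(z) fixed,
# i.e., E[Sales | do(x)] = sum_z P(z) * E[Sales | x, z].
# All numbers are hypothetical.

p_z = {"low_season": 0.4, "high_season": 0.6}   # marginal of the confounder

# Hypothetical E[Sales | x, z] at the mean of each discretized driver state.
e_sales = {
    (16.07, "low_season"): 250.0, (16.07, "high_season"): 308.4,
    (30.00, "low_season"): 270.0, (30.00, "high_season"): 320.0,
}

def do_effect(x):
    """Confounder-weighted expectation of Sales under intervention do(x)."""
    return sum(p_z[z] * e_sales[(x, z)] for z in p_z)

# First point of the response curve: (x, E[Sales | do(x)])
print((16.07, round(do_effect(16.07), 2)))   # -> (16.07, 285.04)
```

Repeating this for every state of the driver yields the full set of curve points, exactly as the Monitors demonstrate step by step above.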
Note that this step-by-step approach was only meant to show what BayesiaLab is performing in the background whenever you invoke the Target Mean Analysis plot.
Now that we have explained how the Target Mean Analysis plot is generated, you can let BayesiaLab perform it again automatically for all drivers, similar to what we did in its first run:
Main Menu > Analysis > Visual > Target > Target's Posterior > Curves > Direct Effects.
In that first run, however, it was difficult to interpret and compare the curves, as the marketing variables were all recorded on different scales.
In this run, select Normalize in the dialog box, which brings all x-values on a common 0–100 scale.
Once the plot appears, deselect the calendar-related variables, i.e., Weekday, Quarter, Month, and End-of-Month Indicator.
We leave them out for now as they are of lesser interest to us—we can't modify the calendar after all. Later in our analysis, we will assign a special status to them to formally exclude these variables from being optimized.
This provides an informative picture. We can now characterize the response of Sales to the drivers that ACME has under control. More specifically, we observe the exclusive Direct Effect of each driver on Sales without confounding effects through the other variables.
What is perhaps most striking in this plot is that many of the curves appear non-linear. Clearly, any assumption of linearity would not have held. The Direct Effects on Target Report, which we used earlier to estimate the slopes of these curves, entirely obscured the dynamics we can observe now.
Furthermore, we can derive several important insights from this plot. For instance, the response curve for TV Advertising rises quickly around its middle values, peaks, and then declines. Direct Marketing looks like an upside-down U, suggesting that there is a “sweet spot” in terms of marketing exposure. The curve for Print Advertising looks S-shaped, while the variable Incentives appears to be exponentially linked to Sales.
The “wild mix” of response curve patterns highlights the inherent difficulty of marketing mix optimization. While the curves themselves may be individually meaningful to a marketing expert, it is far from obvious how much should be allocated to each marketing channel within the constraints of an overall marketing budget.
By invoking Direct Effect, BayesiaLab will automatically perform Likelihood Matching on all Confounders and estimate the causal effect.
Click X and then select Main Menu > Analysis > Report > Target Analysis > Direct Effects on Target.
We immediately obtain a report that shows a Direct Effect of −0.1. This value is identical to the Average Treatment Effect we computed in the previous chapter. As expected, adjustment by stratification, Graph Surgery, and Likelihood Matching provides the same effect estimate.
For comparison, we now estimate the Total Effect:
Select Main Menu > Analysis > Report > Target Analysis > Total Effects on Target.
The resulting report window now shows the Total Effect, which amounts to +0.1.
This result matches the naive estimator, i.e., the effect we observe when considering the whole population, which is clearly not the causal effect. So, why would we need to estimate the Total Effect at all? It would be the only correct estimator for performing observational inference, i.e., prediction. If we were merely observing treated versus not treated patients instead of performing an intervention, the Total Effect provides the expected change of the outcome variable.
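The distinction can be replayed on a toy Simpson's Paradox dataset. The counts below are hypothetical, constructed so that the naive (total) estimate comes out at +0.1 while the confounder-adjusted (direct) estimate comes out at −0.1, matching the signs and magnitudes discussed above.

```python
# Toy Simpson's Paradox. Per stratum z of the confounder:
#   (treated_n, treated_recovered, untreated_n, untreated_recovered)
# Counts are hypothetical, chosen so the naive estimate is +0.1
# and the stratification-adjusted estimate is -0.1.
data = {
    "z0": (80, 64, 40, 36),   # recovery: 0.8 treated vs. 0.9 untreated
    "z1": (40, 8, 80, 24),    # recovery: 0.2 treated vs. 0.3 untreated
}

def naive_effect(d):
    """Total Effect analogue: compare pooled treated vs. untreated."""
    tn = sum(v[0] for v in d.values()); tr = sum(v[1] for v in d.values())
    un = sum(v[2] for v in d.values()); ur = sum(v[3] for v in d.values())
    return tr / tn - ur / un

def adjusted_effect(d):
    """Direct Effect analogue: stratum-wise difference, weighted by P(z)."""
    total = sum(v[0] + v[2] for v in d.values())
    return sum((v[0] + v[2]) / total * (v[1] / v[0] - v[3] / v[2])
               for v in d.values())

print(round(naive_effect(data), 3))     # -> 0.1
print(round(adjusted_effect(data), 3))  # -> -0.1
```

The treatment lowers recovery by 0.1 within every stratum, yet the pooled comparison shows a gain of 0.1, because the treated population is concentrated in the high-recovery stratum.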
Sales
TV Advertising
Internet Advertising
Print Advertising
Direct Marketing
Incentives (i.e., price discounts)
Quarter
Weekday
Month
End-of-Month Indicator
Co-Op Promotions
Competitive Incentives
Web Traffic
Showroom Traffic
Test Drives
As we proceed, we furthermore assume that there are no unobserved Confounders. Such an assumption can only be justified on theoretical grounds. Given that this example represents a fictional domain, there is no purpose in debating the validity of the assumption.
Now we are ready to set the parameters of the Target Optimization.
Select Main Menu > Analysis > Target Optimization > Genetic.
Set Profile Search Criterion to Mean.
Set Criterion Optimization to Maximization.
Check Take Into Account the Resources.
This means that for each to-be-tested scenario, BayesiaLab computes the value of the Class Resource, i.e., the value of the Function Node F1. Furthermore, you need to specify a range of acceptable values.
Check Target Resources and set this value to the available budget. By default, the value is set to the current value of F1. In our case, it is the sum of the means of the marginal distributions of the Confounders.
Do not check Take Into Account the Joint Probability, as it does not apply to this example.
It would be applicable to models based on disaggregated data, e.g., with cross-sectional data representing the behavior of individuals.
Some optimization tasks may require a trade-off between constraints and target achievement. The Weighting option allows us to prioritize specific optimization criteria. Set Resources to 10 to ensure that the search algorithm stays as close as possible to the specified Target Resources.
Search Method depends on the problem domain. We discussed a similar set of options in the context of Target Dynamic Profile.
Set Numerical Evidence Proportional to: Mean.
Set Distribution Estimation Method to Binary.
Here, the optimization algorithm needs to modify the mean of each variable by setting Binary evidence (see Chapter 7, Binary Evidence). This is important because we are trying to determine which specific values—not distributions—achieve the maximum level of Sales. From a theoretical viewpoint, we could certainly search for the optimal distribution, but that is presumably not practical. Marketing budgets get approved and allocated on the basis of single-point dollar values, not distributions.
Clicking the Edit Variations button brings up the Variation Editor.
Set Intermediate Points to 10.
This setting refers to the number of points to test for each variable. To manage a potentially large computing load, it is good practice to start with a smaller number of Intermediate Points, e.g., 10, and later perform a search with a finer grid.
Check Direct Effects. As we have emphasized throughout this chapter, we are looking for the causal effects of the driver variables. Otherwise, BayesiaLab would perform an optimization based on the Total Effect and produce meaningless results for our purposes.
Output determines how BayesiaLab stores the solutions found by Target Optimization. In our case, we want to save all new scenarios and overwrite any existing scenarios. Also, as the optimization can carry on for a long time, it is important to know that it can be stopped anytime. In that case, all solutions computed up to that point are saved as Evidence Scenarios.
Set Return the n Best Solutions to 10
In terms of the Genetic Settings, we recommend keeping the defaults. Computer scientists will be familiar with these algorithm-related options. For the type of problem at hand, however, little can be gained by changing the defaults.
By definition, genetic search algorithms continue mutating scenarios endlessly in order to find better solutions. Such an algorithm will never come to a natural conclusion. Hence, the Genetic Stop Settings are a practical way to stop the algorithm when no improvement has been observed after a certain number of iterations.
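A genetic search with a resource constraint and a no-improvement stop rule can be sketched in a few lines. The response function, budget, penalty weight, and all parameters below are hypothetical stand-ins for the machine-learned model, the Target Resources, and the Resources weighting; this is a conceptual sketch, not BayesiaLab's algorithm.

```python
import random

random.seed(0)

# Hypothetical stand-in for the model's causal response: Sales as a
# function of four driver levels (each 0..100). The real surface comes
# from the Bayesian network via Direct Effects; this one is made up.
def sales(x):
    return sum(xi - 0.008 * xi * xi for xi in x)

BUDGET, WEIGHT = 180.0, 10.0   # stand-ins for Target Resources / weighting

def fitness(x):
    # Penalize deviation from the budget so the search stays close to it.
    return sales(x) - WEIGHT * abs(sum(x) - BUDGET)

def mutate(x):
    return [min(100.0, max(0.0, xi + random.gauss(0.0, 5.0))) for xi in x]

pop = [[random.uniform(0, 100) for _ in range(4)] for _ in range(30)]
best, stale = max(pop, key=fitness), 0
while stale < 200:                      # stop after 200 fruitless rounds
    pop = sorted(pop, key=fitness, reverse=True)[:10]        # selection
    pop += [mutate(random.choice(pop)) for _ in range(20)]   # mutation
    cand = max(pop, key=fitness)
    if fitness(cand) > fitness(best):
        best, stale = cand, 0
    else:
        stale += 1

print(round(sum(best), 1))   # total spend of the best solution found
```

As in BayesiaLab, the search has no natural endpoint: it halts only because we declare a streak of non-improving generations "good enough," and the best solution found so far is reported.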
Clicking OK starts the optimization, and we can monitor the activity in the status line at the bottom of the screen.
Once the Genetic Stop Criterion is met or when you manually terminate the optimization, BayesiaLab delivers the Optimization Report for the Target Node Sales.
Note that there is no progress bar, as no predetermined endpoint exists for a genetic optimization algorithm. We can, however, monitor the score to see the development of solutions.
Initial States: Value/Mean refers to the mean value of the marginal distribution of the Target Node Sales.
Resources represents the corresponding value of F1.
Search Methods repeats the options we chose. While this may appear redundant, it is critically important to understand the precise conditions under which the solution was found.
For the same purpose, the report lists the Not Fixed Nodes, which are the Non-Confounders that we had defined, meaning that they were not included in Likelihood Matching.
We find the main part of the report in the Synthesis table. Here, we see the Initial State, which lists the mean values of the marginal distributions of the drivers. Best Solution shows the optimal levels of each driver. The values in parentheses indicate the deviation from the Initial State. Thus, we can easily see by what amount the marketing variables need to be changed. For example, the report recommends decreasing the Direct Marketing spend by 5.736 units.
In the Best Solutions table at the bottom of the report, BayesiaLab lists the top scenarios identified, along with the corresponding achievement in terms of the Target Node Sales. This tells us, for instance, that Sales will increase to 310.917 if the marketing variables are set to the specified levels. At the same time, the Resources would amount to 171.364, which is very close to the specified constraint.
If we suspect that we might have missed out on a potentially better solution, we could re-run the optimization using more Intermediate Points or set a different Stop Criterion. The potential extent of the search is ultimately a function of the available computing resources. However, given the size of the search space and the non-exhaustive nature of genetic search algorithms, we can never be certain to have found the optimum.
It’s Causal!
At this point, we need not shy away from causal language. All our causal assumptions had been explicitly specified earlier, and on that basis, we performed a causal optimization. If the assumptions can be justified, the effect estimation is causal and does not need to be circumscribed with cautious language. Given our input and applying hard-and-fast identification criteria, we are not in a gray area in terms of the causal interpretation of the results. If the results are to be challenged regarding their causal validity, we need to go straight back to the assumptions. The causal claims stand and fall with the assumptions we make, not with the estimation techniques.
Evidence Scenarios
While we may immediately gravitate toward the best solution listed first in the solutions table, we can examine all proposed solutions, which are stored as Evidence Scenarios:
Check Include Not Observable to see the complete solution, including the values of the Not-Observable nodes.
Select the Evidence Set with Index 0, which sets all nodes to the values specified under Best Solution in the Optimization Report.
This simulation confirms what we obtained in the Optimization Report, i.e., we can increase Sales by almost 15 units (per day) with the same budget. As a by-product of retrieving this Evidence Set, we can observe the corresponding values of the Non-Confounders. Similarly, we can evaluate the remaining scenarios for their plausibility. A seemingly inferior scenario could perhaps be more practical to implement.
For reference, all Monitors corresponding to Evidence Set 0 are shown below:
This chapter presented a comprehensive workflow for optimizing the marketing mix of an organization. The key to this approach was using a machine-learned Bayesian network model in combination with causal assumptions in the form of confounder selection according to the Disjunctive Cause Criterion. BayesiaLab's Likelihood Matching algorithm facilitated the causal effect estimation and the subsequent optimization of marketing drivers.
Why are we using the term "Direct Effect" instead of "causal effect," which is obviously what we are looking for? It helps to recall the Simpson's Paradox example from the previous chapter. Through path analysis, we were able to distinguish between causal paths (shown in blue) and non-causal paths (shown in pink).
We argued that the direct causal path, i.e., the arc from the treatment to the outcome, represents the causal effect. By adjusting for the confounder and thus blocking the non-causal path, we were able to isolate the "direct effect" of the treatment on the outcome.
However, had we not adjusted for the confounder, both the causal and the non-causal path would have remained open, and we would have obtained the "total effect." This is indeed the nomenclature we follow in BayesiaLab.
To emphasize the distinction between Direct Effect and Total Effect, we look one more time at Simpson's Paradox.
We already discussed the Variation Editor in the context of the Target Dynamic Profile. Here, we can further constrain individual variables in addition to the overall constraint determined by the specified Target Resources. A typical reason could be that some marketing expenditures are locked into long-term contracts and cannot be changed.
Additionally, we need to take note of any warning symbols appearing in the bottom right corner of the main window. As Likelihood Matching is performed repeatedly throughout the optimization, there is the possibility of the Likelihood Matching algorithm—not the Target Optimization algorithm—being unable to converge. A yellow warning triangle icon indicates this particular condition. Further details can be retrieved from the Console.
Right-click on the Evidence Scenario icon.