Chapter 8: Probabilistic Structural Equation Models for Key Driver Analysis

Structural Equation Modeling (SEM) is a statistical technique for estimating and testing causal relationships by combining observed data with qualitative causal assumptions. The foundations of SEM were established through the work of Sewall Wright (1921), Trygve Haavelmo (1943), and Herbert Simon (1953), and the framework was formally unified and extended by Judea Pearl (2000). SEMs support both confirmatory and exploratory modeling, making them suitable for validating existing theories as well as generating new ones.

In BayesiaLab, Probabilistic Structural Equation Models (PSEMs) serve a conceptually similar purpose to traditional SEMs but are constructed on the foundation of Bayesian networks rather than systems of equations. While SEMs typically require a high level of statistical expertise and involve numerous manual steps, PSEMs are designed to be more accessible—particularly to subject matter experts without advanced statistical training. Moreover, the PSEM modeling process in BayesiaLab is significantly faster and more efficient, often reducing modeling time by several orders of magnitude.

Once validated, a PSEM can be used like any other Bayesian network within BayesiaLab. This enables users to apply a full suite of analytical, simulation, and optimization tools, thereby maximizing the utility of the causal knowledge embedded in the model.

Example: Consumer Survey

This chapter introduces a prototypical application of Probabilistic Structural Equation Modeling (PSEM) focused on key driver analysis and product optimization using consumer survey data. The analysis explores how consumers perceive various product attributes and how these perceptions influence their purchase intent.

To address the uncertainty inherent in survey data, the study incorporates "latent" variables—underlying constructs not directly captured in the survey responses. These are derived from patterns among the "manifest" variables, which are the directly observed survey items. Incorporating latent variables enables the construction of models that are more robust and interpretable than those relying solely on manifest data.
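
As a rough illustration of how a latent variable can be derived from patterns among manifest variables, the sketch below clusters synthetic ratings and treats the cluster label as the induced latent state. Everything in it is an assumption made for illustration: the data is simulated, the three manifest items are hypothetical, and plain k-means stands in for BayesiaLab's own clustering, which is not reproduced here.

```python
import random

def kmeans(rows, k, iters=50):
    """Plain k-means on lists of numbers; returns one cluster label per row."""
    # Deterministic start: pick k rows spread across the data sorted by total rating.
    ordered = sorted(rows, key=sum)
    centers = [list(ordered[(2 * i + 1) * len(rows) // (2 * k)]) for i in range(k)]
    labels = None
    for _ in range(iters):
        # Assign every row to its nearest center (squared Euclidean distance).
        new_labels = [
            min(range(k), key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centers[c])))
            for row in rows
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        # Move each center to the mean of the rows assigned to it.
        for c in range(k):
            members = [row for row, lab in zip(rows, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

# Synthetic data (assumption): a hidden 3-state factor drives three
# hypothetical manifest items rated on a 1-10 scale.
rng = random.Random(0)
hidden = [i % 3 for i in range(300)]
manifest = [[(1, 5, 9)[state] + rng.randint(0, 1) for _ in range(3)] for state in hidden]

# The cluster label plays the role of the induced latent variable.
latent = kmeans(manifest, k=3)
agreement = sum(l == h for l, h in zip(latent, hidden)) / len(hidden)
print(f"agreement with the simulated hidden factor: {agreement:.2f}")
```

The simulated states are deliberately well separated, so the recovered labels line up with the hidden factor. With real survey data the clusters are fuzzier, and cluster labels are arbitrary, so they need to be interpreted from the profiles of the manifest variables within each cluster.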

The overarching goal is to enhance the interpretability of survey results for researchers and increase their practical relevance for decision-makers. Ultimately, the PSEM aims to guide the prioritization of marketing and product strategies to maximize consumer purchase intent.

Dataset

This study is based on a monadic consumer survey about perfumes conducted by a market research agency in France. In a monadic survey, each respondent evaluates only one product, here a single perfume. The example uses responses from 1,320 women who, taken together, evaluated 11 fragrances (representative of the French market) on a wide range of attributes:

- 27 ratings on fragrance-related attributes, such as Sweet, Flowery, and Feminine, measured on a 1–10 scale.
- 12 ratings on imagery about someone who wears the respective fragrance, e.g., Sexy and Modern, measured on a 1–10 scale.
- 1 variable for Intensity, measured on a 1–5 scale. Intensity is listed separately because it is known a priori to behave nonlinearly, with a “just-about-right” level.
- 1 variable for Purchase Intent, measured on a 1–6 scale.
- 1 nominal variable, Product, for product identification.

Workflow Overview

Three properties characterize PSEMs:

- All relationships in a PSEM are probabilistic, hence the name. Traditional SEMs, by contrast, use deterministic relationships plus error terms.
- PSEMs are nonparametric, which facilitates the representation of nonlinear relationships as well as relationships between categorical variables.
- The structure of a PSEM is partially or fully machine-learned from data.

A PSEM is a hierarchical Bayesian network that can be generated through a series of machine-learning and analysis tasks:

- Unsupervised Learning, to discover the strongest relationships between the manifest variables.
- Variable Clustering, based on the learned Bayesian network, to identify groups of variables that are strongly connected.
- Multiple Clustering, which treats the strong intra-cluster connections identified in the Variable Clustering step as the result of a “hidden common cause.” For each cluster of variables, Data Clustering is applied to the variables within that cluster only, inducing a latent variable that represents the hidden cause.
- Unsupervised Learning, to find the interrelations between the newly created latent variables and their relationships with the Target Node.
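
To make the first two tasks more concrete, here is a minimal sketch written outside BayesiaLab. It rests on several assumptions: the data is synthetic, the variable names (`attr_1`, `image_1`, and so on) are hypothetical, pairwise mutual information stands in for learning a full Bayesian network, and grouping by a fixed threshold stands in for BayesiaLab's Variable Clustering. It shows the idea of finding strongly connected groups of manifest variables and is not the software's actual algorithm.

```python
import math
import random
from collections import Counter

def mutual_information(x, y):
    """Mutual information, in bits, between two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(
        (c / n) * math.log2(c * n / (px[a] * py[b]))
        for (a, b), c in pxy.items()
    )

def cluster_variables(data, threshold):
    """Link every pair of variables whose mutual information exceeds
    `threshold`, then return the connected components as clusters."""
    names = list(data)
    parent = {v: v for v in names}

    def find(v):
        # Follow parent links up to the representative of v's group.
        while parent[v] != v:
            v = parent[v]
        return v

    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if mutual_information(data[a], data[b]) > threshold:
                parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for v in names:
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())

# Synthetic data (assumption): two independent hidden factors, each driving
# two hypothetical manifest ratings on a 1-10 scale.
rng = random.Random(1)
n = 600
factor_a = [i % 3 for i in range(n)]
factor_b = [(i // 3) % 3 for i in range(n)]

def rate(factor):
    return [(1, 5, 9)[state] + rng.randint(0, 1) for state in factor]

data = {
    "attr_1": rate(factor_a),
    "attr_2": rate(factor_a),
    "image_1": rate(factor_b),
    "image_2": rate(factor_b),
}

# The 0.5-bit threshold is an arbitrary choice for this illustration.
clusters = cluster_variables(data, threshold=0.5)
print(clusters)
```

Each resulting cluster would then be the input to the Multiple Clustering task, where a latent variable is induced from the variables in that cluster only. Mutual information is used here because, like the PSEM itself, it makes no assumption of linearity and works directly on categorical or discretized ratings.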

Workflow Details

Unsupervised Learning Data Import