1 of 10

Missing Values Processing in BayesiaLab

Generating the Test Dataset

To begin this exercise, we use BayesiaLab to produce the test data that we will later use for evaluating the Missing Values Processing methods.

We can directly generate data according to the joint probability distribution encoded by the Reference Network: Main Menu > Data > Generate Data.

Next, we must specify whether to generate this data internally or externally. For now, we generate the data internally, which means that we associate data points with all nodes. This includes missing values and Filtered Values according to the reference network.

For the Number of Examples (i.e., cases or records), we set 10,000.

Now that this data exists inside BayesiaLab, we need to export it, so we can truly start “from scratch” with the test dataset. Also, regarding realism, we only want to make the observable variables available rather than all. We first select the nodes X1_obs through X5_obs and then select Main Menu > Data > Save Data from the main menu.

Next, we confirm that we only want to save the Selected Nodes, i.e., the observable variables.

Upon specifying a file name and saving the file, the export task is complete.

A quick look at the CSV file confirms that the newly generated data contain missing and Filtered Values, as indicated with question marks (?) and asterisks (*), respectively.

Now that we have produced a test dataset with all types of missingness, we forget our reference model to start “from scratch.” We approach this dataset as if this were the first time we see it, without any assumptions and without any background knowledge. This provides us with a suitable test case for BayesiaLab’s range of missing values processing methods.

Importing the Test Dataset into BayesiaLab

We show the first two steps of the Data Import Wizard only for reference, as their options have already been discussed in previous chapters.

Our test dataset consisting of 10,000 records was saved as a CSV file, so we start the import process via Main Menu > Data > Open Data Source > Text File.

Step 1: Data Structure Definition

Note the missing values in columns X1_obs, X2_obs, and X4_obs in the Data Panel. Column X5_obs features Filtered Values, which are marked with an asterisk (*).

Step 2: Definition of Variable Types

The next step of the Data Import Wizard requires no further input, but we can review the statistics provided in the Information Panel: we have 5,547 missing values (=11.09% of all cells in the Data panel) and 1,364 Filtered Values (=2.73%).

Step 3: Data Selection, Filtering, and Missing Values Processing

The next screen brings us to the core task of selecting the Missing Values Processing method. In the screenshot, the default option Structural EM is pre-selected, but we will explore all options systematically from the top. The default method can be specified under Main Menu > Window > Preferences > Data > Import & Associate > Missing & Filtered Values.

We explain and evaluate each Missing Values Processing method separately. Please select the topic below or open it in the navigation bar.

Filter (Listwise/Casewise Deletion)
Replace By (Mean/Modal Imputation)
Infer — Static Imputation
Infer — Dynamic Imputation
Infer — Structural EM
Infer — Entropy-Based Imputations
Approximate Dynamic Imputation

Filter (Listwise/Casewise Deletion)

BayesiaLab’s Filter method is generally known as “listwise deletion” or “casewise deletion” in the field of statistics. It is the first option listed in Step 3 of the Data Import Wizard. It represents the simplest approach to dealing with missing values, and it is presumably the most commonly used one, too. This method deletes any record that contains a missing value in the specified variables.

The Filter method is not to be confused with Filtered Values.

The screenshot below shows Filter applied to X1_obs only. Given this selection, the Number of Rows, i.e., the number of cases or records in the dataset, drops from the original 10,000 to 8,950. Note that Filter can be applied variable by variable. Thus, it is possible to apply Filter to a subset of variables only and use other methods for the remaining variables.

Before we can evaluate the effect of the Filter, we need to complete the Data Import Wizard. However, given the number of times we have already presented the entire import process, we omit a detailed presentation of these steps. Instead, we fast forward to review the Monitors of the processed variables in BayesiaLab.

In the Graph Panel, the absence of the question mark icon on X1obs signals that it no longer contains any missing values.

The Monitors now show the processed distributions. However, for a formal review of the processing effects, we must compare the distributions of the newly processed variables with their unprocessed counterparts.

In the overview below, we compare the original distributions (left column), followed by the distributions corresponding to the 10,000 generated samples (center column), and the distributions produced by the application of Missing Values Processing (right column). This is the format we will employ to evaluate all missing values processing methods.

Recalling the section on MCAR data, we know that applying Filter to an MCAR variable should not affect its distribution. Indeed, for X1_obs (top right) versus X1 (top left), the difference between the distributions is insignificant, and it is only due to the sample size. Sampling an infinite-size dataset would lead to the exact same distribution.

Now we turn to test the application of Filter to all variables with missing values, i.e., X1_obs, X2_obs, and X4_obs.

Even before evaluating the resulting distributions, we see in the Information Panel that over half of the rows of data are being deleted due to applying Filter. It is easy to see that in a dataset with more variables, this could quickly reduce the number of remaining records—potentially down to zero. In a dataset in which not a single record is completely observed, Filter is obviously not applicable at all.

The following illustration presents the final distributions (right column), which are all substantially biased compared to the originals (left column). Whereas filtering alone on X1_obs, an MCAR variable, was at least “safe” for X1_obs by itself, filtering on X1_obs, X2_obs, and X4_obs, adversely affects all variables, including X1_obs and even X3_obs, which does not contain any missing values.

As a result, we must strongly advise against using this method within BayesiaLab or in any statistical analysis unless it is certain that all to-be-deleted incomplete observations correspond to missing values that have been generated completely at random (MCAR). Another exception would be if the to-be-deleted observations only represented a very small fraction of all observations. Unfortunately, these caveats are rarely observed, and the Filter method, i.e., listwise or casewise deletion, remains one of the most commonly used methods of dealing with missing values (Peugh and Enders, 2004).

Replace By (Mean/Modal Imputation)

As opposed to deletion-type methods, such as Filter (Listwise/Casewise Deletion), we now consider the “opposite” approach, i.e., filling in the missing values with imputed values. Here, imputing means replacing the non-observed values with estimates in order to facilitate the analysis of the whole dataset.

In BayesiaLab, this function is available via the Replace By option. We can specify to impute any arbitrary value, e.g., based on expert knowledge or an automatically generated value. For a Continuous variable, BayesiaLab offers a default replacement of the missing values with the mean value of the variable. For a Discrete variable, the default is the modal value, i.e., the most frequently observed value of the variable. In our example, X1_obs has a mean value of 0.40878022. This is the value to be imputed for all missing values for X1_obs.

Note that Replace By can be applied variable by variable. Thus, it is possible to apply Replace By to a subset of variables only and use other methods for the remaining variables.

For the purposes of our example, we use Replace By for X1_obs, X2_obs, and X4_obs. As soon as this is specified, the number of the remaining missing values is updated in the Information Panel. By using the selected method, no missing values remain.

In the same way, we studied the performance of Filter, we now review the results of the Replace By method. Whereas this imputation method is optimal at the individual/observation level (it is the rational decision for minimizing the prediction error), it is not optimal at the population/dataset level. The right column in the following screenshot shows that imputing all missing values with the same value has a strong impact on the shape of the distributions. Even though the mean values of the processed variables (right column) remain unchanged compared to the observed values (center column), the standard deviation is much reduced.

Similar to our verdict on Filter (Listwise/Casewise Deletion), Replace By cannot be recommended either for general use. However, its application could be justified if expert knowledge were available for setting a specific replacement value or if the number of affected records were negligible compared to the overall size of the dataset.

Infer — Static Imputation

Static Imputation resembles the Replace By (Mean/Modal Imputation) method but differs in three important aspects:

The buttons under Infer are available whenever a variable with missing values is selected in the Data Panel.

While Replace By (Mean/Modal Imputation) is deterministic, Static Imputation performs random draws from the marginal distributions of the observed data and saves these randomly drawn values as “placeholder values.”
The imputation is only performed internally, and BayesiaLab still “remembers” exactly which observations are missing.
Whereas Replace By (Mean/Modal Imputation) can be applied to individual variables, any of the options under Infer apply to all variables with missing values, with the exception of those that have already been processed by Filter (Listwise/Casewise Deletion) or Replace By (Mean/Modal Imputation).

Although this probabilistic imputation method is not optimal at the observation/individual level (it is not the rational decision for minimizing the prediction error), it is optimal at the dataset/population level.

As illustrated below, drawing the imputed values from the current distribution keeps the distributions of variables pre and post-processing the same. As a result, Static Imputation returns distributions that match the ones produced by Filter (Listwise/Casewise Deletion) but without deleting any observations. As no records are discarded, Static Imputation does not introduce any additional biases. However, the distributions of X2 (MAR) and X4 (MNAR) remain strongly biased.

Infer — Dynamic Imputation

Dynamic Imputation is the first of a range of methods that take advantage of the structural learning algorithms available in BayesiaLab.

Like Infer — Static Imputation, Dynamic Imputation is probabilistic; imputed values are drawn from distributions. However, unlike Infer — Static Imputation, Dynamic Imputation does not only perform imputation once but rather whenever the current model is modified, i.e., after each arc addition, deletion, and reversal during structural learning. This way, Dynamic Imputation always uses the latest network structure for updating the distributions from which the imputed values are drawn.

Upon completion of the data import, the resulting unconnected network initially has exactly the same distributions as the ones we would have obtained with Static Imputation. In both cases, imputation is only based on marginal distributions. With Dynamic Imputation, however, the imputation quality gradually improves during learning as the structure becomes more representative of the data-generating process. For example, a correct estimation of the MAR variables is possible once the network contains the relationships that explain the missingness mechanisms.

Dynamic Imputation might also improve the estimation of MNAR variables if structural learning finds relationships with proxies of hidden variables that are part of the missingness mechanisms.

The question marks associated with X1_obs, X2_obs, and X4_obs confirm that the missingness is still present, even though the observations have been internally imputed.

On the basis of this unconnected network, we can perform structural learning. We select Main Menu > Learning > Unsupervised Structural Learning > Taboo.

While the network only takes a few moments to learn, we notice that it is somewhat slower compared to what we would have observed using a non-dynamic missing values processing method, e.g., Filter (Listwise/Casewise Deletion), Replace By (Mean/Modal Imputation), or Infer — Static Imputation. For our small example, the additional computation time requirement is immaterial. However, the computational cost increases with the number of variables in the network, the number of missing values, and, most importantly, the complexity of the network. As a result, Dynamic Imputation can slow down the learning process significantly.

The following screenshot reports the performance of the Dynamic Imputation. The distributions show a substantial improvement compared to all the other methods we have discussed so far. As expected, X2_obs is now correctly estimated, and it even improves the distribution estimation of the difficult-to-estimate MNAR variable X4_obs. More specifically, there is now much less of an underestimation of the mean value.

Infer — Structural EM

Structural Expectation Maximization (or Structural EM for short) is the next available option under Infer. This method is very similar to Dynamic Imputation, but instead of imputing values after each structural modification of the model, the set of observations is supplemented with one weighted observation per combination of the states of the jointly unobserved variables. Each weight equals the posterior joint probability of the corresponding state combination.

Upon completion of the data import process, we perform structural learning again, analogously to what we did in the context of Dynamic Imputation. As it turns out, the discovered structure is equivalent to the one previously learned. Hence, we can immediately proceed to evaluate the performance.

The distributions produced by Structural EM are quite similar to those obtained with Dynamic Imputation. At least, in theory, Structural EM should perform slightly better. However, the computational cost can be even higher than that of Dynamic Imputation because the computational cost of Structural EM also depends on the number of state combinations of the jointly unobserved variables.

Infer — Entropy-Based Imputations

Under Infer, we have two additional options, namely Entropy-Based Static Imputation and Entropy-Based Dynamic Imputation. As their names imply, they are based on Static Imputation and Dynamic Imputation.

Whereas the standard (non-entropy-based) approaches randomly choose the sequence in which missing values are imputed within a row of data, the entropy-based methods select the order based on the conditional uncertainty associated with the unobserved variable. More specifically, missing values are imputed first for those variables that meet the following conditions:

Variables that have a fully-observed/imputed Markov Blanket;
Variables that have the lowest conditional entropy, given the observations and imputed values.

The advantages of the entropy-based methods are (a) the speed improvement over their corresponding standard methods and (b) their improved ability to handle datasets with large proportions of missing values.

Approximate Dynamic Imputation

As stated earlier, any substantial improvement in the performance of missing values processing comes at a high computational cost. Thus, we recommend an alternative workflow for networks with a large number of nodes and many missing values. The proposed approach combines the efficiency of Static Imputation with the imputation quality of Dynamic Imputation.

Static Imputation is efficient for learning because it does not impose any additional computational cost on the learning algorithm. With Static Imputation, missing values are imputed in memory, which makes the imputed dataset equivalent to a fully observed dataset.

Even though, by default, Static Imputation runs only once at the time of data import, it can be triggered to run again at any time by selecting Main Menu > Learning > Parameter Estimation. Whenever Parameter Estimation is run, BayesiaLab computes the probability distributions on the basis of the current model. The missing values are then imputed by drawing from these distributions. If we now alternate structural learning and Static Imputation repeatedly, we can approximate the behavior of the Dynamic Imputation method. The speed advantage comes from the fact that values are now only imputed (on demand) at the completion of each full learning cycle instead of being imputed at every single step of the structural learning algorithm.

Usage

As a best-practice recommendation, we propose the following sequence of steps:

In Step 3 of the Data Import Wizard, we choose Static Imputation (standard or entropy-based). This produces an initial imputation with the fully unconnected network, in which all the variables are independent.
We run the Maximum Weight Spanning Tree algorithm to learn the first network structure.
Upon completion, we prompt another Static Imputation by running Parameter Estimation. Given the tree structure of the network, pairwise variable relationships provide the distributions used by the Static Imputation process.
Given the now-improved imputation quality, we start another structural learning algorithm, such as EQ, which may produce a more complex network.
The latest, more complex network then serves as the basis for yet another Static Imputation. We repeat steps 4 and 5 until we see the network converge toward a stable structure.
With a stable network structure in place, we change the imputation method from Static Imputation to Structural EM via Main Menu > Learning > Missing Values Processing > Structural EM.

While this Approximate Dynamic Imputation workflow requires more input and supervision by the researcher, for learning large networks, it can save a substantial amount of time compared to using the all-automatic Dynamic Imputation or Structural EM. Here, “substantial” can mean the difference between minutes and days of learning time.

Filter (Listwise/Casewise Deletion)

The Filter method is not to be confused with Filtered Values.

In the Graph Panel, the absence of the question mark icon on X1obs signals that it no longer contains any missing values.

Now we turn to test the application of Filter to all variables with missing values, i.e., X1_obs, X2_obs, and X4_obs.