In molecular epidemiology studies, biospecimen data are collected, often with the purpose of evaluating the synergistic role of a biomarker and another feature on an outcome. Typically, biomarker data are collected on only a proportion of subjects eligible for study, leading to a missing data problem. Missing data methods, however, are not customarily incorporated into analyses. Instead, complete-case (CC) analyses are performed, which can result in biased and inefficient estimates.
Through simulations, we characterized the performance of CC methods when interaction effects are estimated. We also investigated whether standard multiple imputation (MI) could improve estimation over CC methods when the data are not missing at random (NMAR) and auxiliary information may or may not exist.
CC analyses were shown to result in considerable bias and efficiency loss. While MI reduced bias and increased efficiency over CC methods under specific conditions, it too resulted in biased estimates depending on the strength of the auxiliary data available and the nature of the missingness. In particular, CC performed better than MI when extreme values of the covariate were more likely to be missing, while MI outperformed CC when missingness of the covariate related to both the covariate and outcome. MI always improved performance when strong auxiliary data were available. In a real study, MI estimates of interaction effects were attenuated relative to those from a CC approach.
Our findings suggest the importance of incorporating missing data methods into the analysis. If the data are MAR, standard MI is a reasonable method. Auxiliary variables may make this assumption more reasonable even if the data are NMAR. Under NMAR we emphasize caution when using standard MI and recommend it over CC only when strong auxiliary data are available. MI, with the missing data mechanism specified, is an alternative when the data are NMAR. In all cases, it is recommended to take advantage of MI's ability to account for the uncertainty of these assumptions.
Recent advances in technology to measure biomarkers have given rise to a growing number of studies in molecular epidemiology. Consequently, many epidemiology studies now collect data from biospecimens for the purpose of studying the role of biomarkers in disease. Often these investigations assess synergistic effects between the biomarker and another feature on an outcome. A recent assessment of molecular epidemiology studies revealed that 30% of such studies evaluate a gene-environment interaction . Availability of biospecimens such as blood or tissue samples, however, is generally limited to a subset of the subjects in the study, posing a missing data problem. Despite this, appropriate missing data methods are not typically employed. In a 1995 study, Greenland and Finkle  attributed the underuse of missing data methods in epidemiology studies to their inaccessibility and complexity. Although missing data methods are more readily available at present, a 2008 study by Klebanoff and Cole  found that less than 2% of papers published in epidemiology journals make use of more accessible missing data methods such as multiple imputation (MI). Instead, a complete-case (CC) analysis, which excludes subjects missing data on at least one variable considered in the analysis, continues to be the most widely applied method [1-4]. Desai et al. recently assessed the handling of missing data specifically in molecular epidemiology studies and found that while the majority of studies had missing data (65%) and/or excluded subjects with missing data from study entry (45%), 88% of these utilized a CC analysis .
The reasons why the biospecimen data are missing matter. These may relate to observed features in the data set and/or to the unobserved values of the biomarkers themselves. The statistical validity of CC methods (i.e., providing unbiased estimates and confidence intervals that achieve nominal coverage), however, relies on the assumption that the data are missing completely at random (MCAR); i.e., that missingness is unrelated to observed or unobserved data, yielding a study sample that is representative of the larger cohort [5,6]. See Rubin for a more complete discussion of statistical validity . If missingness is related only to observed variables (e.g., age), the data are considered missing at random (MAR). If, however, the reason for missing data is related to the unobserved values (e.g., even after conditioning on age, those with higher values of the biomarker are more likely to be missing biomarker data), the data are not missing at random (NMAR). CC analyses conducted on data that are not MCAR can lead to biased and inefficient estimates.
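To make these three mechanisms concrete, the following illustrative Python sketch (not part of the original study; all probabilities and effect sizes are invented for illustration) generates a biomarker that correlates with age and deletes it under each mechanism:

```python
import random

random.seed(1)

def simulate(mechanism, n=10000):
    """Simulate (age, biomarker, missing) triples under one mechanism.
    All probabilities and coefficients here are illustrative only."""
    rows = []
    for _ in range(n):
        age = random.gauss(50, 10)
        biomarker = random.gauss(0.05 * age, 1.0)    # biomarker correlates with age
        if mechanism == "MCAR":
            p = 0.30                                 # unrelated to any data
        elif mechanism == "MAR":
            p = 0.50 if age > 55 else 0.20           # depends on observed age only
        else:                                        # NMAR
            p = 0.50 if biomarker > 2.5 else 0.20    # depends on the unobserved value
        rows.append((age, biomarker, random.random() < p))
    return rows

means = {}
for mech in ("MCAR", "MAR", "NMAR"):
    rows = simulate(mech)
    observed = [b for _, b, miss in rows if not miss]
    everyone = [b for _, b, _ in rows]
    means[mech] = (sum(observed) / len(observed), sum(everyone) / len(everyone))
```

Under MCAR, the observed-case mean tracks the full-data mean, so CC analysis remains representative. Under NMAR, the observed mean is systematically too low because large values are preferentially missing, and no amount of conditioning on observed variables can correct this; under MAR, the unconditional observed mean is also shifted, but conditioning on age recovers it.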
The data are limited in what they can reveal about missingness. Violation of the MCAR assumption can easily be investigated through simple comparisons of features between those with and without missing data. Without making unverifiable assumptions, however, it is impossible to distinguish between NMAR and MAR patterns, since the nature of missingness cannot be examined for data that do not exist. Thus, one must rely on assumptions based on biological, clinical and epidemiological understandings.
Theoretically sound methods for analyzing data under the MAR or NMAR conditions have been developed. For the former, these include likelihood-based methods and standard MI , where MI is particularly simple to implement and readily available. For the latter, analogous methods (likelihood-based and MI-based) are available. These, however, are less accessible and more complex to implement: unlike the MAR case, the NMAR case requires one to model the missing data distribution. Valid likelihood-based methods for NMAR data include EM approaches to obtaining maximum likelihood estimates and similar estimation strategies that exploit auxiliary data (defined as additional data that can be used to improve model performance given the missingness) [7-10]. While software has been developed for some cases under NMAR conditions, it has not been incorporated into mainstream statistical packages, so access to specialized software presents a barrier to using these methods. MI, with the missing data distribution specified (such as through pattern-mixture models), is another alternative when the data are NMAR .
In molecular epidemiology studies there is often good reason to suspect the data are NMAR. For example, suppose those who consume more vegetables are more willing to provide blood samples and as a result, folate levels are measured less frequently on those with low levels. Furthermore, because it is not straightforward to distinguish between NMAR and MAR conditions, analysts may incorrectly assume the data are MAR. Such studies may also make auxiliary data available. A useful auxiliary variable is one that either correlates with the variable for which the data are missing, or one that correlates with missingness, or both. An example of the former may be the number of vegetable servings consumed, if it correlates with folate levels. An example of an auxiliary variable that correlates with missingness may be disease severity, if those with more advanced disease are more motivated to provide biospecimens. Finally, many molecular epidemiology studies focus on interaction effects, such as gene-environment effects. The bias and efficiency loss that results specifically from estimating these effects using a CC approach has yet to be characterized.
The goals of this paper are to characterize the performance of CC analysis when interaction effects are assessed, and to discuss MI methods as a possible practical solution. We specifically investigate the consequences of applying MI, which in its standard form relies on the MAR assumption, and assess the extent to which auxiliary data can improve estimation performance when a covariate is NMAR and interest lies in estimating an interaction effect involving that covariate. We examine situations in which the covariate, and therefore the modifying variable of interest, is missing, and we evaluate the impact of the strength of the auxiliary information under three conditions of missingness: 1) large values of the covariate are more likely to be missing; 2) extreme values of the covariate are more likely to be missing; and 3) missingness depends on both the covariate and the outcome. We compare MI to the commonly applied CC approach on simulated and real data from a molecular epidemiology study.
Multiple Imputation (MI)
MI is a simulation-based method for handling missing data. There are three main steps involved in conducting an MI-based analysis. The first step consists of imputing plausible values for missing data from a specified distribution. To incorporate the uncertainty of the imputed values, this is done m times to create m complete data sets, where m typically varies between 3 and 10. The data are analyzed separately for each of the m data sets in step 2, with the estimates appropriately combined to yield one summary result in step 3. The theoretical underpinnings of the method are described in Little and Rubin .
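The combining rules in step 3 (Rubin's rules) average the m point estimates and add the between-imputation variance to the average within-imputation variance. As an illustrative sketch (the estimates and variances below are hypothetical, not from the paper):

```python
import statistics

def rubin_combine(estimates, variances):
    """Combine m point estimates and their squared SEs via Rubin's rules."""
    m = len(estimates)
    q_bar = statistics.fmean(estimates)        # combined point estimate
    w = statistics.fmean(variances)            # within-imputation variance
    b = statistics.variance(estimates)         # between-imputation variance
    t = w + (1 + 1 / m) * b                    # total variance
    return q_bar, t ** 0.5                     # estimate and its combined SE

# hypothetical log-odds-ratio estimates from m = 5 imputed data sets
est = [0.42, 0.45, 0.39, 0.47, 0.41]
var = [0.06, 0.05, 0.06, 0.05, 0.06]
q, se = rubin_combine(est, var)
```

The total variance exceeds the average within-imputation variance, which is how MI propagates the uncertainty about the imputed values into the final standard error.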
There are several approaches to specifying an appropriate distribution from which to draw the missing values required in the imputation step. In general, the strategies fall into one of two classes: the joint modeling approach or the fully conditional specification approach . The joint modeling approach relies on specifying a joint density for the data to derive the posterior predictive distribution of the missing values . The fully conditional specification approach, on the other hand, bypasses this step and imputes data on a variable-by-variable basis from a specified conditional density. For more details on the comparison of these approaches, see Van Buuren . These methods are available in easily accessible software. SAS, for example, implements MI based on the joint modeling approach via the PROC MI and PROC MIANALYZE procedures. In Appendix A we provide example code that uses the fully conditional approach as implemented in the ICE and MICOMBINE commands developed by Patrick Royston for STATA [13-15]. Other software implementing MI can be found in Horton and Kleinman's comprehensive review .
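The core of one fully conditional imputation step can be sketched in a few lines of Python (illustrative only; a proper implementation, as in ICE, would also draw the regression parameters from their posterior, and with several incomplete variables would cycle over them). Here a single continuous variable x2 is imputed from its conditional model given x1:

```python
import random

random.seed(2)

def regress(x, y):
    """Least-squares intercept, slope, and residual SD for y on x."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = my - b1 * mx
    s = (sum((yi - b0 - b1 * xi) ** 2
             for xi, yi in zip(x, y)) / (len(x) - 2)) ** 0.5
    return b0, b1, s

n = 1000
x1 = [random.gauss(0, 1) for _ in range(n)]
x2 = [v + random.gauss(0, 0.5) for v in x1]        # x2 correlates with x1
miss = [random.random() < 0.3 for _ in range(n)]   # x2 missing for ~30%

# fit the conditional model on the observed cases...
b0, b1, s = regress([a for a, m in zip(x1, miss) if not m],
                    [b for b, m in zip(x2, miss) if not m])
# ...then DRAW (not just predict) each missing value, so the imputations
# carry the right amount of residual variability
x2_imp = [b0 + b1 * a + random.gauss(0, s) if m else b
          for a, b, m in zip(x1, x2, miss)]
```

Drawing from the fitted conditional distribution, rather than filling in the predicted mean, is what keeps the imputed data from understating the variance of the incomplete variable.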
MI for Interaction Effects
Estimating interaction effects with MI is slightly more complicated than estimating main effects . This has to do with the assumptions under which the data are imputed. More specifically, MI methods that rely on parametric assumptions such as a multivariate normal distribution may produce reasonable results for the estimation of linear relationships, but not for higher-order relationships. There are several approaches to imputing interaction terms. The two main approaches are to 1) impute the variables involved in the interaction first and then generate the product term for inclusion in the analytic model or 2) generate the product term prior to imputation and then impute this term like one would any other variable. These methods and others are discussed in detail by von Hippel . We show example STATA code in Appendix A that implements both methods.
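The two approaches can be sketched as follows (illustrative Python, not the paper's STATA code; the hot-deck draw within levels of X2 is a deliberately simple stand-in for a full imputation model):

```python
import random

random.seed(3)
n = 4000
x2 = [1 if random.random() < 0.5 else 0 for _ in range(n)]
x1 = [1 if random.random() < (0.4 if z else 0.2) else 0 for z in x2]
miss = [random.random() < 0.25 for _ in range(n)]   # x1 MCAR here, for simplicity

def hotdeck(values, groups, miss):
    """Replace each missing value with a random observed donor from the
    same group (a stand-in for a full imputation model)."""
    donors = {g: [v for v, gg, m in zip(values, groups, miss) if not m and gg == g]
              for g in set(groups)}
    return [random.choice(donors[g]) if m else v
            for v, g, m in zip(values, groups, miss)]

# Approach 1: impute x1 first, then derive the product term
x1_imp = hotdeck(x1, x2, miss)
prod1 = [a * z for a, z in zip(x1_imp, x2)]

# Approach 2: form the product first, then impute it like any other variable
prod_obs = [a * z for a, z in zip(x1, x2)]
prod2 = hotdeck(prod_obs, x2, miss)
```

Approach 1 guarantees that the imputed product equals the product of the imputed components; approach 2 preserves the product term's own distribution but can leave it inconsistent with the imputed components. Von Hippel's discussion covers the tradeoffs between these choices.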
Design of Simulation Studies
We assessed the performance of CC and MI methods for estimating interaction effects in two sets of simulation studies. In the first set, we evaluated their performance for estimating an interaction effect between two predictors (X1, which in some cases is continuous and in others is dichotomous, and a dichotomous predictor X2) on a dichotomous outcome (Y). One of the predictors (X1), and therefore the interaction term, is NMAR for a proportion of the subjects. An auxiliary variable (Z), generated as a linear function of X1 and random noise, is also available. The second set of simulations was performed to enhance our understanding of the impact of the nature of missingness using a more realistic set of auxiliary variables. To that end, we made use of a real molecular epidemiology study taken from the Long Island Breast Cancer Study Project (LIBCSP) [19,20], a population-based case-control study of breast cancer, where, for our purposes, the interest was in an interaction effect between alcohol consumption and presence of a mutant allele for being a fast metabolizer of alcohol, as determined by the ADH3 genotype, on breast cancer risk. Our simulations involved one variable with missing data (either genotype, which is dichotomous, or alcohol exposure, which is either dichotomous or ordinal), a covariate of interest (either genotype or exposure) and a dichotomous outcome (case-control status).
Table 1 describes the thirteen scenarios examined. In the first set of simulations (Scenarios A-I) we evaluated the impact of 1) the nature of the missingness and 2) the relationship between X1 and the auxiliary variable, Z, on estimation performance. To evaluate the impact of the nature of missingness, three conditions were considered. Under condition 1, the log odds of the probability of missing X1 is a linear function of X1. Under condition 2, extreme values of X1 are more likely to be missing; that is, the log odds of the probability of missing X1 is a quadratic function of X1. Finally, under condition 3 the log odds of missingness is a linear function of X1 given Y. If Y represented case-control status, for example, this would allow cases and controls to differ with respect to missingness. In our study, cases are more likely to be missing large values of X1, and controls are more likely to be missing small values of X1. To assess the effect of the strength of the relationship between X1 and the auxiliary variable on the results, nonexistent, moderate and strong relationships were considered. Moderate strength was defined as a correlation between X1 and Z of 0.57 for X1 continuous, and as an increase of 1 unit in Z for X1 = 1 versus 0 for X1 dichotomous. For X1 continuous, a strong auxiliary variable was one that had a correlation of 0.97 with X1. For X1 binary, a strong relationship was defined as a 3 or 4 unit increase in the auxiliary variable when X1 was 1 versus 0. Each scenario is based on 1000 iterations with a sample size of 1000.
Table 1. Description of Scenarios Used in Two Sets of Simulation Studies.
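The three missingness conditions can be encoded as logistic models for the probability that X1 is missing. The sketch below is illustrative Python (the intercept and slope are invented, not the paper's simulation parameters) and demonstrates the condition 3 pattern, in which cases lose large values of X1 and controls lose small values:

```python
import math, random

random.seed(4)

def p_miss(x1, y, condition, a=-2.0, b=1.5):
    """Missingness probability for x1 under the three conditions.
    The intercept a and slope b are illustrative values only."""
    if condition == 1:
        logit = a + b * x1                  # linear in x1: larger values missing more
    elif condition == 2:
        logit = a + b * x1 ** 2             # quadratic in x1: extreme values missing more
    else:
        logit = a + b * x1 * (2 * y - 1)    # depends on x1 and y: the sign flips
                                            # between cases (y=1) and controls (y=0)
    return 1.0 / (1.0 + math.exp(-logit))

n = 20000
x1 = [random.gauss(0, 1) for _ in range(n)]
y = [1 if random.random() < 0.5 else 0 for _ in range(n)]
miss3 = [random.random() < p_miss(v, yy, 3) for v, yy in zip(x1, y)]

obs_cases = [v for v, yy, m in zip(x1, y, miss3) if yy == 1 and not m]
obs_ctrls = [v for v, yy, m in zip(x1, y, miss3) if yy == 0 and not m]
obs_case_mean = sum(obs_cases) / len(obs_cases)   # pulled below the true mean of 0
obs_ctrl_mean = sum(obs_ctrls) / len(obs_ctrls)   # pulled above the true mean of 0
```

Because the observed distribution of X1 is distorted in opposite directions for cases and controls under condition 3, a CC analysis distorts the apparent X1-Y association, which is why CC performs worst under this condition.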
Scenarios J-M comprise the second set of simulations and are also described in Table 1. These scenarios are informed by the real molecular epidemiology data set in two aspects that relate to the potential utility of the set of auxiliary variables: 1) the strength of the observed variables in describing missingness and 2) the strength of the observed variables in describing the variable with missing data. To determine these relationships a "fully observed" data set was created from the larger cohort. To create such a data set, subjects missing at least one of the variables of interest (case-control status, an indicator for having the genotype for fast metabolism, alcohol exposure, their interaction, and potential confounders of this relationship) were excluded, yielding a fully observed data set. The actual covariate values, and as a consequence, their inter-relationships were maintained in the simulations to enable comparison of methods under a more realistic and complicated set of auxiliary variables. New outcomes representing case-control status were generated as a function of genotype, alcohol exposure and their interaction.
Missingness was induced for genotype (Scenarios J-K) as well as for alcohol exposure (Scenarios L-M). This allowed assessment of methods under a realistic set of auxiliary variables under condition 2 (Scenario M: where extreme values of alcohol exposure are more likely to be missing), which only applies to variables with more than 2 levels and allowed comparison of methods under different sets of realistic auxiliary variables for condition 1. More specifically, there were fewer variables that correlated with genotype than with alcohol exposure with varying magnitudes of strength. We therefore compared methods across two sets of available auxiliary data for condition 1 (Scenarios J and L). Missingness was induced under three conditions analogous to the ones described above. In the first, the log odds of the probability of missing the variable (either genotype or exposure) is a linear function of the variable, as well as the following observed variables: the other covariate of interest (either exposure or genotype), ever having had a mammogram, education, race, having breast fed for at least 6 months, ever having been exposed to oral contraceptives, ever having been exposed to hormone therapy and smoking status. The second condition (Scenario M) specifies the log odds of the probability of missing alcohol exposure to be a non-linear function of exposure so that missingness is more likely for extreme exposures (none or high). Additionally, it is also a linear function of the observed variables mentioned above including genotype. Finally, the third condition specifies different missing data mechanisms conditional on genotype, exposure and case-control status (Scenario K). 
It specifies that exposed controls are 7.4 times more likely to be missing genotype if they have the fast metabolizing genotype, whereas exposed cases are 7.4 times less likely to be missing genotype if they have the fast metabolizing genotype and, unexposed subjects with the fast metabolizing genotype are 2.7 times more likely to be missing data on genotype. The log odds of the probability of missing genotype is also linearly related to the other observed variables considered above. The relationship between missingness and the observed variables is informed from the fully observed data set. Each scenario is based on 1000 iterations with a sample size of 2058, the size of the fully observed data set.
Three different models are presented: 1) the full model, which is the model fit on the complete data; 2) the CC model; and 3) the MI-based model, where the missing values of X1 were imputed as a function of X2, Y and the auxiliary variable Z for the first set of simulations, and as a function of the other covariate of interest (either exposure or genotype), ever having had a mammogram, education, race, having breast fed for at least 6 months, ever having been exposed to oral contraceptives, ever having been exposed to hormone therapy and smoking status for the second set. To produce optimal results, we set m = 10 rather than the more typical m = 5, although we found negligible differences between the two. For each scenario, the data were analyzed using a logistic regression model with Y (case-control status) as the outcome and X1 (genotype), X2 (alcohol exposure) and their interaction as predictors. Average point estimates, average model-based standard errors (SE), average biases, mean squared errors (MSE), mean squared errors relative to CC (RelMSE), and percentage coverage of 95% confidence intervals were calculated for X1 given X2 = 0, X1 given X2 = 1, and their interaction. The comparison of MI to CC using the RelMSE statistic is critical, as a comparison of MI to the method currently used in practice (CC) is more relevant than its comparison to an optimal method. As an ideal reference, however, the full model is presented.
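These performance summaries are straightforward to compute across simulation iterations. A minimal sketch (the estimates, SEs and true value below are hypothetical, purely to show the calculation):

```python
import statistics

def performance(estimates, ses, truth, z=1.96):
    """Average bias, MSE, and 95% CI coverage across simulation iterations."""
    bias = statistics.fmean(e - truth for e in estimates)
    mse = statistics.fmean((e - truth) ** 2 for e in estimates)
    cover = statistics.fmean(
        1.0 if (e - z * s) <= truth <= (e + z * s) else 0.0
        for e, s in zip(estimates, ses))
    return bias, mse, cover

# hypothetical estimates of an interaction log-OR whose true value is 1.5
cc = performance([1.7, 1.9, 1.6, 2.0], [0.40, 0.50, 0.40, 0.45], truth=1.5)
mi = performance([1.45, 1.55, 1.5, 1.4], [0.30, 0.30, 0.35, 0.30], truth=1.5)
rel_mse = mi[1] / cc[1]   # RelMSE: MI's MSE relative to CC's (< 1 favors MI)
```

A RelMSE below 1 indicates that MI's combination of bias and variance beats CC's, which is the comparison emphasized throughout the results.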
Table 2 describes the observed relationships between the potential auxiliary variables and the variable with missing data in the real data set. Candidate auxiliary variables for genotype (Scenarios J and K) include three variables that are highly correlated with genotype in the fully observed data set: alcohol exposure, race, and having breastfed for at least 6 months. For every increase in level of average alcohol exposure, a subject is 10% less likely to have the fast-metabolizing gene. Similarly, those who have breastfed are roughly 30% less likely to have the fast-metabolizing gene than those who have not. On the other hand, African American subjects have a 4-fold increased risk of having the fast-metabolizing gene relative to Caucasians. There is a larger number of auxiliary variables for alcohol exposure (defined as alcohol consumer versus non-consumer), relevant for Scenarios L and M. These include race, where African Americans are less likely than Caucasians to be consumers, while higher education, use of oral contraceptives, hormone therapy and smoking exposure are all associated with a higher probability of alcohol exposure. Having the fast-metabolizing gene, however, corresponds to a 30% decreased risk of alcohol exposure.
Table 2. Description of Relationship Between Variable with Missing Data (Genotype or Exposure) and Set of Auxiliary Variables.
Example Data Set
As an illustration of these methods, we compared a previously published CC analysis of the gene-environment interaction of interest to one based on MI methods using data from the LIBCSP . This particular analysis was undertaken to address a possible interaction effect between alcohol consumption and ADH3 genotype on breast cancer risk. Details of the overall study design are provided in prior publications [19,20]. In-person interviews were completed for 1,508 cases (82.1% of eligible cases) and 1,556 controls (62.8% of eligible controls). Seventy-three percent of both cases and controls who completed an interview donated a blood sample. As the CC approach adjusted for potential confounders, it further excluded those who were missing at least one variable and resulted in a data set of 1,008 cases and 1,055 controls. As previously published , subjects were more likely to donate blood if they were white, non-smokers, ever consumed alcohol, ever used hormone replacement therapy, breast-fed for six months or more, or ever had a mammogram.
The impact of the strength of the auxiliary variable for the three conditions generated under the first set of simulations, where approximately 20% of the data are missing, is shown in Table 3. For each condition, the first set of simulations examines three scenarios corresponding to non-informative, moderate, and strong auxiliary variables. These studies are supplemented by the second set of simulations shown in Table 4, which compares the methods under a more realistic and complicated set of auxiliary variables. Like the first set, roughly 20% of the data are missing. For completeness, estimates are presented for X1 given each level of X2 (denoted going forward as X1|X2) and their interaction. Performance, however, is based solely on estimation of X1|X2 = 0 and the interaction as their sum yields the estimate of X1|X2 = 1.
Table 3. Impact of Auxiliary Relationship Under Conditions 1, 2, and 3.
Table 4. Impact of Nature of Missingness With Auxiliary Relationship Based on Data from LIBSCP.
Characterization of Performance of CC Analysis
In the first set of simulations (Table 3) under condition 1 (Scenarios A-C: where X1 is 12.2 times more likely to be missing for X1 = 1 than for X1 = 0), CC overestimated the interaction effect and gave average estimates of 1.7-2.0 when the true parameter was 1.5. In addition, CC analyses yielded large standard error estimates (ranging from 4.9 to 12.7), likely due to small cell counts when X1 is binary, missing for about 20% of subjects, and the proportion with X1 = 1 is small (0.2). In the second set of simulations, however, under an analogous condition 1 (Scenarios J and L: where genotype is 12.2 times more likely to be missing for those with the genotype of interest and missingness additionally relates to a set of observed variables), CC analysis performed better with respect to bias and efficiency. Although it still suffered from a loss in efficiency, the loss was much less dramatic than in the first set. Differences in how the data were generated between the sets contribute to the varying performance of CC under these similar conditions. In the first set of simulations the sample size (N = 1000) is half that of the second set (N = 2058). Additionally, the proportion of those with the genotype (or with X1 = 1) is 0.20 in the first set compared to 0.4 in the second set, where the latter proportion is determined by the real data set. Finally, in the second set, missingness also relates to additional observed variables. The performance of CC was comparable however between the two sets of simulations under conditions 2 and 3. Under condition 2, where extreme values of the variable are more likely to be missing, CC yielded approximately unbiased estimates on average for both sets of simulations but suffered a loss in efficiency; having the complete data set (full model) resulted in MSEs that were reduced by 50% for estimating X1|X2 = 0 and by 15-20% for the interaction effect. 
Under condition 3, CC performed the worst, underestimating the effect of X1|X2 = 0 in the first set of simulations (Scenarios G-I) yielding MSEs that were 14-15 times that of the full model and overstating the interaction effect in both sets of simulations, with MSEs that were approximately 1.5 and 20 times larger than that of the full model in the first and second set of simulations, respectively.
Performance of MI Relative to CC
Under all conditions shown in Table 3, when auxiliary information is strong (Scenarios C, F, and I), MI performed well and outperformed CC. Under condition 1, even when there is no auxiliary information, MI outperformed CC for both parameters, where the RelMSE statistic for the interaction effect showed a 93% improvement in estimation. Much of this improvement, however, is due to the inflated standard errors that CC provided because of the small cell counts that occur in the presence of missingness when X1 is binary and the proportion with X1 = 1 is small. Under condition 1 in the second set (Scenarios J and L), however, CC was superior to MI. In both sets of simulations, MI underestimated the interaction effect, and in the second it provided poor coverage probabilities for the interaction effect, resulting in MSEs that were 3 and 2 times that of CC for Scenarios J and L, respectively. Under condition 2, MI needed a strong auxiliary variable to compete with CC. CC did not yield biased results, but suffered a loss in efficiency. With moderate to weak auxiliary information, MI underestimated the interaction effect and overestimated the effect of X1|X2 = 0. Overall it performed worse than CC, with an MSE that was 1.3 times greater than that of CC for X1|X2 = 0 in both sets of simulations, and an MSE for the interaction that was 1.1 and 4.4 times greater in the first and second sets of simulations, respectively. The second set of simulations suggested that even under a realistic set of auxiliary variables, MI did not improve estimation over CC under this condition, where MI underestimated interaction effects. In both sets of simulations, MI improved performance over CC under condition 3. When there is no auxiliary data, MI and CC both underestimated X1|X2 = 0, but while CC overestimated the interaction effect, MI underestimated it. MI had a superior MSE statistic for both X1|X2 = 0 and the interaction. These findings were supported in the second set of simulations (Scenario K), where MI outperformed CC: CC overstated the interaction effect, whereas MI performed well, with good coverage probabilities and an MSE for the interaction effect that was 93% improved over that of CC.
Illustration on Real Data Set
The results from the analysis of the LIBCSP evaluating an interaction effect between alcohol consumption and the ADH3 genotype on breast cancer risk are presented in Table 5. Adjusted odds ratios (ORs) from the previously published CC analysis, an MI-based analysis, and the percentage change in the beta coefficients (log-ORs) for the interaction effect are shown. Both analyses involved fitting a logistic regression model adjusting for potential confounders (age at diagnosis; education; race; caloric intake; smoking status; and BMI). To impute values for genotype, these confounders as well as any variables related to missingness were used. These possible auxiliary variables included: having ever breastfed; having ever used hormone replacement therapy; having ever used oral contraceptives; ever having a mammogram; income level; and having benign breast disease. Terry et al. previously reported a two-fold association (OR = 2.3, 95% CI 1.3-4.0) for moderate alcohol consumption (15-30 g/day) among fast metabolizers using a CC approach. MI resulted in a 39% reduction in the coefficient (OR = 1.7, 95% CI 1.0-2.8) . MI yielded parameter estimates that were smaller and closer to the null than those obtained by CC, where the percentage change in the beta coefficients ranged from 17% to > 100%, and the median percentage change was 31%.
Studies of molecular epidemiology often involve collecting data on biomarkers. Issues with missing data arise when data are not fully observed for all subjects included in a study. The most common approach to analyzing these data is CC analysis [1-4], which has the advantage of computational ease but can result in estimates that are biased and inefficient. Using missing data methods in analyses, therefore, needs to become more customary. Standard MI is simple to implement and accessible, but it is not recommended when the data are suspected to be NMAR. For example, while Taylor and colleagues promote using MI to reduce non-response bias in epidemiologic studies, they recommend doing so only when the MAR assumption is likely to hold . In molecular epidemiology studies, however, one may suspect that the data are NMAR, or one may incorrectly assume the data are MAR. In addition, many molecular epidemiology studies evaluate interaction effects, for which the bias and inefficiency of CC estimates have not been fully characterized. Our goal, therefore, was to characterize the performance of CC analysis and to investigate standard MI, a method that is nearly as easy to implement and as accessible as CC, specifically in the context of assessing interaction effects when one of the predictors is NMAR.
Characterization of Performance of CC Analysis
Biased and inefficient estimates from the CC approach were observed in our simulation studies, indicating a strong need for missing data methods. The extent to which they were observed, however, varied by the nature of the missingness. When missingness is a function of the covariate only and not of the outcome (as in conditions 1 and 2), the performance of CC methods suffered largely in terms of efficiency loss. This loss in efficiency increased as the percentage missing increased from 20% to 40% (results shown in Table 6: Appendix B). Specifically, for condition 1, the MSE from the CC analysis was 12 times that of the full model when only 20% are missing and 125 times that of the full model when 40% are missing. For condition 2, the MSE of CC relative to the full model increased from 1.15 (when 20% are missing) to 1.78 (when 40% are missing). When missingness is a function of both the covariate and the outcome (as in condition 3), the bias from CC was the most dramatic: CC underestimated the effect of X1|X2 = 0 and overestimated the interaction effect.
Comparison of MI and CC Approaches in the Simulation Studies
Differences between MI and CC analyses varied by both the nature of the missingness and the strength of auxiliary information. Under all conditions, when auxiliary information is strong, MI outperformed CC. Interestingly, however, under certain conditions MI gave results that were more misleading than CC. Specifically, when large values of the covariate are more likely to be missing under a realistic set of auxiliary variables, MI was more biased than CC and underestimated interaction effects. Also, when extreme values are more likely to be missing and auxiliary information is not strong, MI was more misleading than CC. MI was superior to CC, however, when missingness relates to both the covariate and the outcome even under moderate and realistic sets of auxiliary variables, where CC overestimated interaction effects.
Application of MI and CC Approaches to LIBCSP
In the LIBCSP example, the data were not MCAR, as blood donation was related to a variety of observed factors. We suspected that the data may be NMAR because having the genotype for fast metabolism was related to alcohol intake, and alcohol intake was associated with providing a blood sample. If, after conditioning on alcohol intake, data on metabolism status were more likely to be missing for fast metabolizers (i.e., if other unmeasured features related to metabolism correlated with missingness), the data would be NMAR. While inference was similar between the two analytic approaches (the overall interaction effect was not statistically significant by either method) and consistent with previous findings [22,23], MI estimates were attenuated toward the null relative to CC. If missing genotype were related to observed features only, we would place more faith in the MI-based results, as they rely on a more flexible assumption about the missingness than the CC method, which relies on an assumption shown to be violated in this case. Based on the findings of our simulation study, however, if genotype is also more likely to be missing for fast metabolizers (condition 1), we may be more inclined to believe our CC results. If, instead, missing genotype were additionally related to both genotype values and case-control status (condition 3), we may be more inclined to believe the MI results. When CC and MI results are discrepant, it makes sense to present both analyses, with the analysis corresponding to the most plausible set of assumptions serving as the primary analysis. Furthermore, a more comprehensive sensitivity analysis that presents results under varying assumptions about the missingness can be performed using MI and is discussed in greater detail below.
MI, the MAR Assumption, and Its Relationship to Auxiliary Variables
The intuition behind why MI performed well when auxiliary data were strong has to do with the MAR assumption. Assuming the data are MAR is equivalent to assuming that the information needed to impute the missing values can be found in the observed data. This is a more reasonable assumption when the data include auxiliary information that is strongly related to the unobserved data. Thus, even if one were to suspect the data are NMAR, the presence of strong auxiliary information may allow one to proceed with methods that assume MAR.
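This intuition can be made concrete with a hypothetical Python sketch: an auxiliary variable z built to have a correlation of about 0.97 with the covariate (mirroring the "strong" scenario in our simulations), NMAR missingness imposed on the covariate, and a single regression imputation standing in for MI (which would add random draws and repeat the step m times). All data and coefficients are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 5_000

# Covariate x1 and a strong auxiliary variable z (corr ~0.97).
x1 = rng.normal(size=n)
z = 0.97 * x1 + np.sqrt(1 - 0.97**2) * rng.normal(size=n)

# NMAR missingness: large values of x1 are more likely to be missing.
miss = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.4 + x1)))).astype(bool)

# Single regression imputation of x1 from z, fit on the observed cases
# (a simplification of MI).
slope, intercept = np.polyfit(z[~miss], x1[~miss], 1)
x1_imp = x1.copy()
x1_imp[miss] = intercept + slope * z[miss]

# Because z carries nearly all of the information in x1, the imputed
# values track the truth closely even though the missingness is NMAR.
print(np.corrcoef(x1_imp[miss], x1[miss])[0, 1])
```

With a weak auxiliary variable the same code would return imputations that are essentially noise, which is the situation in which MI proved misleading in our simulations.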
Our simulation study is limited in that it does not provide a precise definition of the strength of relationship necessary for one to assume MAR. In our simulations, for X1 continuous, a strong auxiliary variable was one with a correlation of 0.97 with X1. For X1 binary, a strong relationship was defined as a 3- or 4-unit increase in the auxiliary variable when X1 was 1 versus 0. Although extreme and perhaps not likely outside of longitudinal studies, we felt it was important to study the extremes (no association and strong association) in addition to a moderate association (in our study, a correlation of 0.57 for X1 continuous and an increase of 1 unit in Z for X1 = 1 versus 0 for X1 dichotomous). To enhance our understanding, we supplemented the first set of simulations with another set based on the real molecular epidemiology data set, maintaining the actual relationships between the variable with missing data and the other observed data. Table 2 shows the relationships, in the fully observed data, between the variable with missing data and the observed variables. Not surprisingly, these relationships were altered in the observed data sets, with some variables not appearing to correlate as they did in the complete data, demonstrating the challenge of assessing good auxiliary variables from an observed data set. We therefore recommend coupling observations from the data with a priori knowledge when making assumptions about auxiliary relationships. Below we give specific guidelines to use in practice.
For detailed guidelines, we refer the reader to a recent study on the handling of missing data in molecular epidemiology studies, where Desai and colleagues provide greater detail on the steps involved in a data analysis in the presence of missing data. Below we emphasize some key points.
Making Plausible Assumptions
The first step in an analysis in the presence of missing data is to consider plausible assumptions about the missing data mechanism, that is, the reasons why the data may be missing. Because the data themselves can only indicate whether the MCAR assumption should be ruled out, this will largely involve drawing on strong a priori knowledge. If there is no evidence that the MCAR assumption is violated (through a simple comparison of features between those with and without the missing variable(s)), and if, based on clinical understanding, there is no reason to suspect the data are NMAR, MCAR may be assumed. As mentioned above, assuming the data are MAR is equivalent to assuming that what needs to be learned about the missing data can be gleaned from the observed data. Thus, even if the data are NMAR, the presence of a strong auxiliary variable may make the MAR assumption reasonable. Similarly, even if MAR is more plausible than NMAR, one needs to consider which auxiliary variables allow this assumption to hold reasonably well.
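The simple comparison mentioned above can be sketched as follows; the study data, the choice of age as the comparison feature, and the missingness model are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 2_000

# Hypothetical study: age is fully observed; the biomarker is missing
# more often for older subjects, so the data are not MCAR.
age = rng.normal(60, 10, size=n)
p_miss = 1.0 / (1.0 + np.exp(-(-2.0 + 0.05 * (age - 60))))
miss = rng.binomial(1, p_miss).astype(bool)

# MCAR check: compare observed features between subjects with and
# without the missing variable, here via a standardized difference.
diff = age[miss].mean() - age[~miss].mean()
pooled_sd = np.sqrt((age[miss].var(ddof=1) + age[~miss].var(ddof=1)) / 2)
print(f"standardized difference in age: {diff / pooled_sd:.2f}")
```

A large standardized difference (or a formal two-sample test) argues against MCAR; a null comparison, however, can never rule out NMAR, which is why a priori knowledge remains essential.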
Choosing Appropriate Analytic Tools
A CC analysis should be performed in all cases. Additional tools, however, should be used and may serve as the primary analysis, depending on which assumptions are most plausible. If the proportion missing is small (say, less than 10%), a CC analysis should suffice. If the proportion missing is larger and there is evidence the data are not MCAR, one should incorporate a missing data method in addition to performing a CC analysis. If MAR is plausible, or NMAR is suspected in the presence of strong auxiliary variables, standard MI is a reasonable choice. Otherwise, as we saw in the simulation study, standard MI can be misleading and, under certain conditions, worse than CC methods. MI with the missing data mechanism explicitly specified, such as through pattern-mixture or selection models, is an appropriate choice in these situations and is discussed in greater detail in other sources. Such an approach requires making explicit assumptions about the nature of the missingness.
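A minimal sketch of the pattern-mixture idea is the delta adjustment: impute under a MAR-style model, then shift the imputed values by a user-specified amount reflecting the assumed departure from MAR. The sketch below uses made-up data and a simple resampling draw standing in for a full MI procedure.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 4_000

# Hypothetical continuous biomarker, NMAR: high values go missing.
x = rng.normal(size=n)
miss = rng.binomial(1, 1.0 / (1.0 + np.exp(-(-1.5 + x)))).astype(bool)

# A crude MAR-style fill-in: draw missing values from the observed ones
# (standing in for a full MI procedure).
x_filled = x.copy()
x_filled[miss] = rng.choice(x[~miss], size=miss.sum())

# Delta adjustment: shift the filled-in values by increasing amounts,
# reflecting stronger assumed departures from MAR, and track the estimate.
means = []
for delta in (0.0, 0.25, 0.5):
    x_adj = x_filled.copy()
    x_adj[miss] += delta
    means.append(x_adj.mean())
print(means)
```

Reporting the estimate across a range of deltas makes the NMAR assumption explicit and shows the reader how sensitive the conclusion is to it.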
Making Use of Auxiliary Variables
As mentioned above, the choice of which auxiliary variables to include plays an important role in the estimation performance of MI and in the plausibility of the MAR assumption. In simulation studies described by Collins et al., where this very issue was assessed for the MAR case, being more inclusive, even when doubtful of the usefulness of some auxiliary variables, resulted in increased efficiency and reduced bias.
Performing a Sensitivity Analysis
A nice feature of MI is its ability to incorporate the uncertainty of assumptions into the results, where the assumptions may involve the missing data mechanism (NMAR versus MAR) as well as which auxiliary variables to include. One should perform a sensitivity analysis that presents results using different subsets of auxiliary variables in the MI analysis or, when MI is used after explicitly modeling the missing data mechanism, results under various assumptions about that mechanism, including multiple models under NMAR. This will give a sense of the robustness of the results. The CC analysis should be included among these. If results differ across analyses, the investigator must decide which sets of assumptions are most plausible. Additionally, one can average over the resulting findings to provide a single summary result that accounts for the uncertainty in the missing data mechanism and/or the choice of auxiliary variables.
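Whichever assumptions are entertained, each MI analysis combines its per-imputation results with Rubin's rules (the step micombine performs in the Appendix code). A minimal sketch, with made-up estimates and variances:

```python
import numpy as np

# Hypothetical per-imputation results for one coefficient across m = 5
# imputed data sets (estimates and their squared standard errors).
est = np.array([0.52, 0.48, 0.55, 0.50, 0.47])
var = np.array([0.010, 0.011, 0.009, 0.010, 0.012])

m = len(est)
qbar = est.mean()                    # pooled point estimate
ubar = var.mean()                    # within-imputation variance
b = est.var(ddof=1)                  # between-imputation variance
total_var = ubar + (1 + 1 / m) * b   # Rubin's total variance

print(f"pooled estimate = {qbar:.3f}, total SE = {np.sqrt(total_var):.3f}")
```

The between-imputation component is what carries the uncertainty due to the missing data into the final standard error, which is precisely the feature the sensitivity analyses above exploit.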
In summary, molecular epidemiology studies face a particularly challenging missing data problem in that the majority of these studies will be missing data on the key variable of interest, the biomarker. While it may seem sensible to study only those with a measured biomarker, we argue for the importance of including those who would be eligible for study despite the missing biomarker. At the very least, we urge comparison of features between those with and without missing data, and we strongly encourage the incorporation of missing data methods into the analysis when warranted. More specifically, if these comparisons indicate the data are not MCAR, and MAR seems plausible, we highly recommend standard MI. Even when the data are MCAR, one can gain efficiency from MI. If it is likely that the data are NMAR and strong auxiliary information is available, standard MI may still be a reasonable estimation-enhancing tool. Otherwise, MI that models the missing data mechanism is a possibility. In all cases, a useful feature of MI is that it allows the uncertainty of the missing data mechanism to be incorporated into the results.
The authors declare that they have no competing interests.
MBT conceived of the idea and helped to draft the manuscript. MD designed and carried out the simulations, interpreted the results, and helped to draft the manuscript. DE conducted the literature review and helped to draft the manuscript. MG provided insightful comments, suggestions, and edits, and led the parent study of the original molecular epidemiology study. All authors read and gave their final approval of the manuscript.
APPENDIX A: STATA Code for Implementing MI
/* case is a binary indicator for case/control status, x1 and x2 are binary variables, where x1 is missing data on 20% of subjects and is NMAR and z is a continuous auxiliary variable */
/*Read in data set where data were generated under condition 1*/
insheet using "~/scen1.csv", clear
/* Method 1 for Imputing Interaction Effects: Generate interaction term first and then impute */
/*Create Interaction term*/
gen theint = x1*x2
/*Use ICE to fit imputation model and create 10 imputed data sets*/
ice case x1 x2 theint z, saving(simimpute.dta) m(10) replace
/*Read in data set containing all 10 imputed data sets*/
use simimpute.dta, clear
/*Use MICOMBINE to fit the desired scientific model and combine results across 10 data sets*/
micombine logit case x1 x2 theint
/* Method 2 for Imputing Interaction Effects: Impute first, then derive the interaction term (passive imputation) */
/*Re-read the original data set so Method 2 starts from the unimputed data*/
insheet using "~/scen1.csv", clear
/*Create Interaction term*/
gen theint = x1*x2
/*Use ICE to fit imputation model and create 10 imputed data sets*/
/* The passive option recomputes theint from the imputed x1 and x2 */
ice case x1 x2 theint z, saving(simimpute.dta) m(10) passive(theint:x1*x2) replace
/*Read in data set containing all 10 imputed data sets*/
use simimpute.dta, clear
/*Use MICOMBINE to fit the desired model and combine results across 10 data sets*/
micombine logit case x1 x2 theint
MD is the Director of the Quantitative Sciences Unit and Clinical Associate Professor in the Department of Medicine at Stanford University. DE is Research Assistant Professor in the Departments of Biostatistics and Medicine at the University of North Carolina at Chapel Hill. MG is Professor of Epidemiology in the School of Public Health at the University of North Carolina at Chapel Hill. MBT is Associate Professor of Epidemiology in the School of Public Health at Columbia University.
COBRA Preprint Series 2010.
American Journal of Epidemiology 1995, 142:1255-1264.
Journal of the American Statistical Association 1996, 91:473-489.
Biometrika 2001, 88:551-564.
Applied Statistics 2001, 50:361-373.
Gammon MD, Neugut AI, Santella RM, Teitelbaum SL, Britton JA, Terry MB, Eng SM, Wolff MS, Stellman SD, Kabat GC, Levin B, Bradlow HL, Hatch M, Beyea J, Camann D, Trent M, Senie RT, Garbowski G, Maffeo C, Montalvan P, Berkowitz GS, Kemeny M, Citron M, Schnabel F, Schuss A, Hajdu S, Vinceguerra V, Collman GW, Obrams GI: The Long Island Breast Cancer Study Project: Description of a multi-institutional collaboration to identify environmental risk factors for breast cancer.
Smith-Warner SA, Spiegelman D, Yaun SS, van den Brandt PA, Folsom AR, Goldbohm RA, Graham S, Holmberg L, Howe GR, Marshall JR, Miller AB, Potter JD, Speizer FE, Willett WC, Wolk A, Hunter DJ: Alcohol and breast cancer in women: a pooled analysis of cohort studies.