Chapter 8 Design and Sampling

Let us say that the ministry of health was pleased with the quality and results of the evaluation of the Health Insurance Subsidy Program (HISP). However, before scaling up the program, the ministry decides to pilot an expanded version of the program, which they call HISP+. The original HISP pays for part of the cost of health insurance for poor rural households, covering costs of primary care and drugs, but it does not cover hospitalization. The minister of health wonders whether an expanded HISP+ that also covers hospitalization would further lower out-of-pocket health expenditures of poor households. The ministry asks you to design an impact evaluation to assess whether HISP+ would decrease health expenditures for poor rural households.

In this case, choosing an impact evaluation design is not a challenge for you: HISP+ has limited resources and cannot be implemented universally immediately. As a result, you have concluded that randomized assignment would be the most viable and robust impact evaluation method. The minister of health understands how well the randomized assignment method can work and is supportive.

To finalize the design of the impact evaluation, you have hired a statistician who will help you establish how big a sample is needed. Before they start working, the statistician asks you for some key inputs. They use a checklist of five questions.

1. Will the HISP+ program generate clusters? At this point, you are not totally sure. You believe that it might be possible to randomize the expanded benefit package at the household level among all poor rural households that already benefit from HISP. However, you are aware that the minister of health may prefer to assign the expanded program at the village level, and that would create clusters. The statistician suggests conducting power calculations for a benchmark case without clusters, and then considering how results would change with clusters.

2. What is the outcome indicator? You explain that the government is interested in a well-defined indicator: out-of-pocket health expenditures of poor households. The statistician looks for the most up-to-date source to obtain benchmark values for this indicator and suggests using the follow-up survey from the HISP evaluation. They note that among households that received HISP, the per capita yearly out-of-pocket health expenditures have averaged US$7.84.

3. What is the minimum level of impact that would justify the investment in the intervention? In other words, what decrease in out-of-pocket health expenditures below the average of US$7.84 would make this intervention worthwhile? The statistician stresses that this is not only a technical consideration, but truly a policy question; that is why a policy maker like you must set the minimum effect that the evaluation should be able to detect. You remember that based on ex ante economic analysis, the HISP+ program would be considered effective if it reduced household out-of-pocket health expenditures by US$2. Still, you know that for the purpose of the evaluation, it may be better to be conservative in determining the minimum detectable impact, since any smaller impact is unlikely to be captured. To understand how the required sample size varies based on the minimum detectable effect, you suggest that the statistician perform calculations for a minimum reduction of out-of-pocket health expenditures of US$1, US$2, and US$3.

4. What is the variance of the outcome indicator in the population of interest? The statistician goes back to the data set of treated HISP households, pointing out that the standard deviation of out-of-pocket health expenditures is US$8.

5. What would be a reasonable level of power for the evaluation being conducted? The statistician adds that power calculations are usually conducted for a power between 0.8 and 0.9. They recommend 0.9 but offer to perform robustness checks later for a less conservative level of 0.8.

We can calculate all the summary statistics we need as follows:

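library(tidyverse)    # %>%, filter(), summarise(), tibble(), map2()
library(pwr)          # pwr.t.test()
library(knitr)        # kable()
library(kableExtra)   # kable_styling()
library(fishmethods)  # clus.rho()
library(clusterPower) # crtpwr.2mean()

# mean and standard deviation of each outcome in treated localities (round 1)
sumstats <- df_elig %>%
  filter(round == 1 & treatment_locality == 1) %>%
  summarise(mean_health = mean(health_expenditures),
            sd_health = sd(health_expenditures),
            mean_hospital = mean(hospital),
            sd_hospital = sd(hospital))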

Equipped with all this information, the statistician undertakes the power calculations.

power_calc_health <- tibble(d_health_expenditures = rep(-1:-3,2),
                            power = rep(c(0.8, 0.9), each = 3)) %>%
  mutate(n_required = map2(d_health_expenditures, power,
                           ~ {pwr.t.test(d = .x / sumstats$sd_health,
                                         sig.level = 0.05, power = .y)$n}) %>%
           unlist() %>%
           ceiling())  
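As a rough cross-check on the `pwr.t.test()` calls above, the same scenarios can be approximated in base R. This is only a sketch: it assumes a standard deviation of exactly US$8 (the rounded figure from the checklist, not `sumstats$sd_health`), and the helper `n_per_group_approx()` is a hypothetical name introduced here. It compares the textbook normal approximation with base R's `power.t.test()`, so the results may differ by a unit or two from the tables produced with the actual data.

```r
# Normal-approximation sample size per group for a two-sided, two-sample
# comparison of means. Function and argument names are illustrative.
n_per_group_approx <- function(mde, sd, power = 0.9, alpha = 0.05) {
  z_alpha <- qnorm(1 - alpha / 2)  # critical value for the two-sided test
  z_power <- qnorm(power)          # quantile matching the desired power
  ceiling(2 * (z_alpha + z_power)^2 * sd^2 / mde^2)
}

# Assumed inputs: effects of US$1 to US$3, SD rounded to US$8
scenarios <- expand.grid(mde = 1:3, power = c(0.8, 0.9))
scenarios$n_approx <- mapply(n_per_group_approx,
                             mde = scenarios$mde,
                             power = scenarios$power,
                             MoreArgs = list(sd = 8))

# t-test-based answer from base R, for comparison
scenarios$n_exact <- mapply(
  function(mde, power) {
    ceiling(power.t.test(delta = mde, sd = 8, sig.level = 0.05,
                         power = power)$n)
  },
  scenarios$mde, scenarios$power
)

print(scenarios)
```

The t-based figure is never smaller than the normal approximation, because the t-test accounts for the standard deviation being estimated from the sample.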

As agreed, they first present the more conservative case of a power of 0.9.

power_calc_health %>%
  filter(power == 0.9) %>%
  mutate(mde = gsub("-", "$", d_health_expenditures)) %>%
  select(mde, power, n_required) %>%
  kable(align = "c", 
        col.names = c("Minimum Detectable Effect", "Power", 
                      "Sample Required per Group"),
        caption = "Evaluating HISP+: Sample Size Required to Detect Various Minimum Detectable Effects, Power = 0.9") %>%
  kable_styling(full_width = TRUE)
Table 8.1: Evaluating HISP+: Sample Size Required to Detect Various Minimum Detectable Effects, Power = 0.9
Minimum Detectable Effect | Power | Sample Required per Group
$1 | 0.9 | 1345
$2 | 0.9 | 337
$3 | 0.9 | 151

The statistician concludes that to detect a US$2 decrease in out-of-pocket health expenditures with a power of 0.9, the sample needs to contain at least 674 units (337 treated units and 337 comparison units, with no clustering). They note that if you were satisfied to detect a US$3 decrease in out-of-pocket health expenditures, a smaller sample of at least 302 units (151 units in each group) would be sufficient. By contrast, a much larger sample of at least 2,690 units (1,345 in each group) would be needed to detect a US$1 decrease in out-of-pocket health expenditures.

The statistician then produces another table for a power level of 0.8.

power_calc_health %>%
  filter(power == 0.8) %>%
  mutate(mde = gsub("-", "$", d_health_expenditures)) %>%
  select(mde, power, n_required) %>%
  kable(align = "c", 
        col.names = c("Minimum Detectable Effect", "Power", 
                      "Sample Required per Group"),
        caption = "Evaluating HISP+: Sample Size Required to Detect Various Minimum Detectable Effects, Power = 0.8") %>%
  kable_styling(full_width = TRUE)
Table 8.2: Evaluating HISP+: Sample Size Required to Detect Various Minimum Detectable Effects, Power = 0.8
Minimum Detectable Effect | Power | Sample Required per Group
$1 | 0.8 | 1005
$2 | 0.8 | 252
$3 | 0.8 | 113

The table shows that the required sample sizes are smaller for a power of 0.8 than for a power of 0.9. To detect a US$2 reduction in household out-of-pocket health expenditures, a total sample of at least 504 units (252 in each group) would be sufficient. To detect a US$3 reduction, at least 226 units are needed. However, to detect a US$1 reduction, at least 2,010 units would be needed in the sample. The statistician stresses that the following results are typical of power calculations:

  • The higher (more conservative) the level of power, the larger the required sample size.
  • The smaller the impact to be detected, the larger the required sample size.
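Both regularities can be read off the usual normal approximation to the required sample size per group. The sketch below is an approximation, not the exact t-based computation behind the tables, and the symbols are notation introduced here: \delta is the minimum detectable effect, \sigma the standard deviation of the outcome, \alpha the significance level, and 1-\beta the power.

```latex
\begin{align*}
n_{\text{per group}} &\approx \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}\,\sigma^{2}}{\delta^{2}} \\[4pt]
\text{Halving the effect:}\quad
\frac{n(\delta/2)}{n(\delta)} &= 4
  \qquad \bigl(\text{tables: } 1345/337 \approx 3.99,\; 1005/252 \approx 3.99\bigr) \\[4pt]
\text{Power } 0.8 \to 0.9,\ \alpha = 0.05:\quad
\frac{(1.960 + 1.282)^{2}}{(1.960 + 0.842)^{2}} &\approx \frac{10.51}{7.85} \approx 1.34
  \qquad \bigl(\text{tables: } 337/252 \approx 1.34\bigr)
\end{align*}
```

In other words, moving from a power of 0.8 to 0.9 costs roughly a third more observations, while halving the effect to be detected quadruples the sample.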

The statistician asks whether you would like to conduct power calculations for other outcomes of interest. You suggest also considering the sample size required to detect whether HISP+ affects the hospitalization rate. In the sample of treated HISP villages, a household member visits the hospital in a given year in 5 percent of households; this provides a benchmark rate.

# changes of 1, 2, and 3 percentage points, standardized by the outcome's SD
power_calc_hospital <- tibble(d_hospital = rep(c(-0.01, -0.02, -0.03), 2),
                              power = rep(c(0.8, 0.9), each = 3)) %>%
  mutate(n_required = map2(d_hospital, power,
                           ~ {pwr.t.test(d = .x / sumstats$sd_hospital,
                                         sig.level = 0.05, power = .y)$n}) %>%
           unlist() %>%
           ceiling())
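The calculation above treats the 0/1 hospitalization indicator like any other outcome, so its standard deviation is determined by the hospitalization rate itself. The base R sketch below illustrates this under the assumption that the rate is exactly 5 percent; the actual `sumstats$sd_hospital` is estimated from the data, so these figures will not match the chapter's table exactly. The variable names are illustrative.

```r
# Assumed benchmark: 5 percent of households report a hospital visit
p_baseline <- 0.05

# Standard deviation of a 0/1 indicator with mean p
sd_binary <- sqrt(p_baseline * (1 - p_baseline))

# Changes of 1, 2, and 3 percentage points, at a power of 0.8
mde_pp <- c(0.01, 0.02, 0.03)
n_hospital <- sapply(mde_pp, function(delta) {
  ceiling(power.t.test(delta = delta, sd = sd_binary,
                       sig.level = 0.05, power = 0.8)$n)
})

data.frame(mde_pp = mde_pp * 100, n_per_group = n_hospital)
```

Because a 1 percentage point change is small relative to a standard deviation of about 0.22, the standardized effect is tiny, which is why the required samples are so large.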

The statistician produces a new table, which shows that relatively large samples would be needed to detect changes in the hospitalization rate of 1, 2, or 3 percentage points from the baseline rate of 5 percent.

power_calc_hospital %>%
  filter(power == 0.8) %>%
  mutate(mde = gsub("-", "", d_hospital * 100)) %>%
  select(mde, power, n_required) %>%
  kable(align = "c", 
        col.names = c("Minimum Detectable Effect (%)", "Power", 
                      "Sample Required per Group"),
        caption = "Evaluating HISP+: Sample Size Required to Detect Various Minimum Detectable Effects (Increase in Hospitalization Rate)") %>%
  kable_styling(full_width = TRUE)
Table 8.3: Evaluating HISP+: Sample Size Required to Detect Various Minimum Detectable Effects (Increase in Hospitalization Rate)
Minimum Detectable Effect (%) | Power | Sample Required per Group
1 | 0.8 | 7257
2 | 0.8 | 1815
3 | 0.8 | 808

The table shows that sample size requirements are larger for this outcome (the hospitalization rate) than for out-of-pocket health expenditures. The statistician concludes that if you are interested in detecting impacts on both outcomes, you should use the larger sample sizes implied by the power calculations performed on the hospitalization rates. If sample sizes from the power calculations performed for out-of-pocket health expenditures are used, the statistician suggests letting the minister of health know that the evaluation will not have sufficient power to detect policy-relevant effects on hospitalization rates.

Which sample size would you recommend to estimate the impact of HISP+ on out-of-pocket health expenditures?

This answer will depend on policy priorities and available budgets. Under randomized assignment at the individual level, a total sample size of 2,690 units (1,345 in each group) would be needed to detect a US$1 decrease in out-of-pocket health expenditures with a power of 0.9. A total sample size of 674 (337 treated and 337 comparison units) would detect a change as small as US$2 in health expenditures at a power of 0.9. This would cut the required sample and related data collection costs substantially. At the same time, it would still allow detecting the impacts that would make the program effective based on the ex ante economic analysis. Such a sample may therefore be a good compromise if budgets are limited.

Would that sample size be sufficient to detect changes in the hospitalization rate?

That sample would not be sufficient to detect even a 3 percentage point change in the hospitalization rate. Even at the lower power of 0.8, Table 8.3 shows that 1,616 units (808 in each group) are required, and a power of 0.9 would require more still.

8.1 Power Calculations with Clusters

After your first discussion with the statistician about power calculations for HISP+, you decided to talk briefly to the minister of health about the implications of randomly assigning the expanded HISP+ benefits among all individuals in the population who receive the basic HISP plan. The consultation revealed that such a procedure would not be politically feasible: in that context, it would be hard to explain why one person would receive the expanded benefits, while her neighbor would not.

Instead of randomization at the individual level, you therefore suggest randomly selecting a number of HISP villages to pilot HISP+. All villagers in the selected villages would then become eligible. This procedure will create clusters and thus require new power calculations. You now want to determine how large a sample is required to evaluate the impact of HISP+ when it is randomly assigned by cluster.

You consult with your statistician again. They reassure you: only a little more work is needed. On their checklist, only one question is left unanswered. They need to know how variable the outcome indicator is within clusters. Luckily, this is also a question they can answer using the HISP data. They find that the within-village correlation of out-of-pocket health expenditures is equal to 0.04.

# intraclass correlation, for cluster calculations
df_elig_t1r1 <- df_elig %>%
  filter(round == 1 & treatment_locality == 1) %>%
  select(health_expenditures, locality_identifier)

icc_est <- clus.rho(df_elig_t1r1$health_expenditures, 
                df_elig_t1r1$locality_identifier, 
                type = 3)$icc
icc_est
##                value
## ANOVA rho 0.04061629
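For intuition, the within-village (intraclass) correlation can be written as the share of total outcome variance that lies between villages. The notation below is introduced here for illustration: \sigma_b^2 is the variance of village means around the overall mean, and \sigma_w^2 is the variance of households around their own village mean. The ANOVA estimate reported above is one of several ways to estimate this quantity.

```latex
\[
\rho \;=\; \frac{\sigma_b^{2}}{\sigma_b^{2} + \sigma_w^{2}},
\qquad
\hat{\rho} \approx 0.04
\;\Longrightarrow\;
\text{about 4 percent of the variance in expenditures is between villages.}
\]
```

A value of zero would mean households in the same village are no more alike than households drawn at random, in which case clustering would carry no penalty in sample size.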

They also ask whether an upper limit has been placed on the number of villages in which it would be feasible to implement the new pilot. Since the program now has 100 HISP villages, you explain that you could have, at most, 50 treatment villages and 50 comparison villages for HISP+. With that information, the statistician produces the power calculations shown in the table below for a power of 0.8.

# calculating all at once
power_clstr_calc_health <- tibble(d_health_expenditures = -1:-3) %>%
  mutate(n_required = map(d_health_expenditures,
                           ~ {crtpwr.2mean(d = .x, m = 50,
                                           alpha = 0.05, power = 0.8,
                                           cv = 0, icc = icc_est,
                                           varw = sumstats$sd_health^2)}) %>%
           unlist() %>%
           ceiling()) 
power_clstr_calc_health %>%
  mutate(mde = gsub("-", "$", d_health_expenditures),
         total_clusters = 50 * 2,
         total_sample = total_clusters * n_required) %>%
  select(mde, total_clusters, n_required, total_sample) %>%
  kable(col.names = c("Minimum Detectable Effect", "Number of Clusters", 
                      "Units per Cluster", "Total Observations"),
        align = "c") %>%
  kable_styling(full_width = TRUE)
Minimum Detectable Effect | Number of Clusters | Units per Cluster | Total Observations
$1 | 100 | 117 | 11700
$2 | 100 | 7 | 700
$3 | 100 | 3 | 300

The statistician concludes that to detect a US$2 decrease in out-of-pocket health expenditures, the sample must include at least 700 units: that is, 7 units per cluster in 100 clusters (50 clusters in the treatment group and 50 clusters in the comparison group). They note that this number is higher than in the sample under randomized assignment at the household level, which required only a total of 504 units (252 in the treatment group and 252 in the comparison group). To detect a US$3 decrease in out-of-pocket health expenditures, the sample would need to include at least 300 units, or 3 units in each of 100 clusters (50 clusters in the treatment group and 50 clusters in the comparison group).
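The gap between the clustered and household-level requirements can be approximated with the standard design effect, 1 + (n - 1) * icc, where n is the number of units per cluster. The base R sketch below is a simplified illustration, not a reproduction of `crtpwr.2mean()`: it assumes equal cluster sizes and a rounded correlation of 0.04, and the helper `units_per_cluster()` is a hypothetical name. Because it ignores any small-sample adjustment a dedicated routine may apply, its figures will tend to be somewhat lower than those in the chapter's tables, especially with few clusters.

```r
# Units per cluster needed so that m clusters per arm deliver the same
# precision as n_ind independent units per arm, using the design effect
# 1 + (n - 1) * icc. Function and argument names are illustrative.
units_per_cluster <- function(n_ind, m, icc) {
  if (m <= n_ind * icc) {
    return(NA_real_)  # too few clusters: no cluster size reaches the target
  }
  ceiling(n_ind * (1 - icc) / (m - n_ind * icc))
}

# 252 units per arm is the household-level requirement for a US$2 effect
# at a power of 0.8; 0.04 is the rounded within-village correlation.
clusters_per_arm <- c(15, 30, 45, 50, 60)
n_cluster <- sapply(clusters_per_arm, units_per_cluster,
                    n_ind = 252, icc = 0.04)

data.frame(clusters_per_arm,
           units_per_cluster = n_cluster,
           total_sample = 2 * clusters_per_arm * n_cluster)
```

The `NA` branch reflects a hard limit: once the number of clusters per arm falls to `n_ind * icc` or below, adding more households per village can no longer compensate.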

The statistician then shows you how the total number of observations required in the sample varies with the total number of clusters. They decide to repeat the calculations for a minimum detectable effect of US$2 and a power of 0.8. The size of the total sample required to estimate such an effect increases strongly when the number of clusters diminishes. With 120 clusters, a sample of 600 observations would be needed. If only 30 clusters were available, the total sample would need to contain 1,920 observations. By contrast, if 90 clusters were available, only 720 observations would be needed.

# calculating all at once
power_clstr_n_calc_health <- tibble(n_clusters = c(15, 29, 40.5, 45, 60)) %>%
  mutate(n_required = map(n_clusters,
                           ~ {crtpwr.2mean(m = .x, d = 2,
                                           alpha = 0.05, power = 0.8,
                                           cv = 0, icc = icc_est,
                                           varw = sumstats$sd_health^2)}) %>%
           unlist() %>%
           ceiling())
power_clstr_n_calc_health %>%
  mutate(mde = "$2",
         # multiply clusters by 2, as power calculator works with number of
         # clusters per group (not total number of clusters)
         total_clusters = 2 * n_clusters, 
         total_sample = total_clusters * n_required) %>%
  select(mde, total_clusters, n_required, total_sample) %>%
  kable(col.names = c("Minimum Detectable Effect", "Number of Clusters", 
                      "Units per Cluster", "Total Observations"),
        align = "c") %>%
  kable_styling(full_width = TRUE)
Minimum Detectable Effect | Number of Clusters | Units per Cluster | Total Observations
$2 | 30 | 64 | 1920
$2 | 58 | 14 | 812
$2 | 81 | 9 | 729
$2 | 90 | 8 | 720
$2 | 120 | 5 | 600

Which total sample size would you recommend to estimate the impact of HISP+ on out-of-pocket health expenditures?

This answer will depend on policy priorities and available budgets. A total sample size of 600 households (with 120 villages and 5 households per village) would be appropriate for the evaluation, as this sample size would detect a change of $2 with a power of 0.8.

In how many villages would you advise the minister of health to roll out HISP+?

Power is maximized when the number of treatment and control observations is the same. If a total sample of 90 villages is needed, rolling out HISP+ to 45 villages would maximize power. The other 45 villages would be comparison villages.
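The reasoning behind the equal split can be sketched with the variance of a difference in means. This is a simplified illustration that assumes the same outcome variance \sigma^2 in both groups, equal cluster sizes, and equal costs per village; n_T and n_C (the number of treatment and comparison units, here villages) and N are notation introduced here.

```latex
\begin{align*}
\operatorname{Var}\bigl(\bar{y}_T - \bar{y}_C\bigr)
  &= \sigma^{2}\left(\frac{1}{n_T} + \frac{1}{n_C}\right),
  \qquad n_T + n_C = N \\[4pt]
\text{minimized at } n_T = n_C = \tfrac{N}{2}:\quad
  \operatorname{Var}_{\min} &= \frac{4\sigma^{2}}{N} \\[4pt]
N = 90:\quad
  \frac{1}{45} + \frac{1}{45} \approx 0.0444
  \quad &\text{versus} \quad
  \frac{1}{30} + \frac{1}{60} = 0.05
  \quad (\text{12.5 percent larger variance})
\end{align*}
```

An unequal split, such as 30 treatment and 60 comparison villages, would therefore need more villages in total to reach the same power.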