---
title: "Pressure testing functions"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Pressure testing functions}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```

This vignette shows how to use functions added from the pressure testing report.

```{r setup}
library(vimcheck)
library(dplyr)
library(tidyr)
```

## Example data

Example impact data are taken from [_vimpact_](https://github.com/vimc/vimpact) and included in the package as `eg_impact`.
This dataset holds projections for four countries, four diseases, and three modelling groups; combinations are shown below.

```{r impact_data}
eg_impact

# check combinations
distinct(eg_impact, country_name, disease, modelling_group)
```

Data on WHO regions is provided as `who_subregions` to enable comparing countries with their regions.

```{r who_regions}
who_subregions
```

## Filtering impact data

Impact data can be filtered, or flagged for filtering, in different ways.

### Filtering on touchstone

Data can be filtered on touchstone using `filter_recent_ts()`; rows with `scenario_type` matching "default" are retained.

Some useful default touchstone values are `DEF_TOUCHSTONE_NEW` (`r DEF_TOUCHSTONE_NEW`), `DEF_TOUCHSTONE_OLD` (`r DEF_TOUCHSTONE_OLD`), and `DEF_TOUCHSTONE_OLD_OLD` (`r DEF_TOUCHSTONE_OLD_OLD`)

```{r filter_ts}
# assign dummy touchstones and scenario type for demo
df <- eg_impact
df$touchstone <- "202401"

test_scenario_types <- rep(
  c("default", "dummy"),
  each = nrow(df) / 2
)
df$scenario_type <- test_scenario_types

# use a package default touchstone
DEF_TOUCHSTONE_NEW

# touchstone filtering is applied to all non-default scenario rows
filter_recent_ts(df, DEF_TOUCHSTONE_NEW)
```

### Filtering on diseases

Data can be filtered to exclude a fixed set of diseases, if the touchstone is older than a threshold value, using `filter_excluded_diseases_ts()`.

The excluded diseases are stored as the package constant `EXCLUDED_DISEASES` (`r cli::cli_text("{.str {EXCLUDED_DISEASES}}")`).

```{r filter_diseases}
# make a copy and add dummy disease values
df_copy <- df
df_copy$disease <- rep(
  EXCLUDED_DISEASES, each = nrow(df_copy) / length(EXCLUDED_DISEASES)
)

# pass dummy touchstone to filter out all rows
filter_excluded_diseases_ts(df_copy, "202501")
```

### Flagging duplicates

Duplicated rows in the data can be identifier by adding a flag variable column `n_key`, using `flag_duplicates()`.

Duplicates are identified across columns specified by the argument `key_cols`, which defaults to `r cli::cli_text("{.str {COLNAMES_KEY_PRESSURE_TEST}}")`.

```{r flag_dups}
# view n_keys column
flag_duplicates(eg_impact) %>%
  select(
    modelling_group, country, disease, burden_outcome,
    vaccine, activity_type, year, n_key
  )
```

### Filtering invalid trajectories

The function `filter_invalid_trajectories()` can be used to remove rows where outcome values are missing in one of two paired datasets (each ideally from a different touchstone).

This function should be applied to data which has the impact outcomes as columns (i.e., in wide format), with duplicates removed.

The outcome to check is specified by the argument `"outcome"`.

```{r filter_traj}
# create some dummy data from exampled data
prev_df <- flag_duplicates(eg_impact)
prev_df <- dplyr::filter(prev_df, n_key == 1)
prev_df <- tidyr::pivot_wider(
  prev_df,
  id_cols = {{ COLNAMES_KEY_PRESSURE_TEST }},
  names_from = "burden_outcome",
  values_from = "impact"
)

# will be replaced by proportion of GAVI support in future
prev_df$support_type <- "other"
rows <- nrow(prev_df)
prev_df$coverage <- 0.5

prev_df$deaths_averted <- withr::with_seed(
  1, sample(c(5e2, NA_real_), rows, TRUE)
)
prev_df$dalys_averted <- withr::with_seed(
  1, sample(c(5e5, NA_real_), rows, TRUE)
)

# assign dummy values
curr_df <- prev_df
curr_df$deaths_averted <- 1e3
curr_df$dalys_averted <- 1e6

# View data with invalid trajectories removed
filter_invalid_trajectories(prev_df, curr_df, "deaths_averted")
```

## Identifying differences between datasets

This section provides a general demonstration of tools that help to identify differences between two paired datasets.

### Generating differences

The function `generate_diffs()` uses the [_diffdf_ package](https://cran.r-project.org/package=diffdf) to identify differences between two data frames.
The output is a list of tibbles with the added column `VARIABLE` for the column examined for differences, with the baseline and comparator as `BASE` and `COMPARE`.

```{r gen_diffs}
# use prev_df from section above
prev_df$deaths_averted <- withr::with_seed(
  1, rnorm(rows, 1e3, 100)
)
prev_df$dalys_averted <- withr::with_seed(
  1, rnorm(rows, 1e6, 100)
)

# assign dummy values
curr_df <- prev_df
curr_df$deaths_averted <- prev_df$deaths_averted * 2
curr_df$dalys_averted <- prev_df$dalys_averted * 2

interest_cols <- c("deaths_averted", "dalys_averted")
difflist <- generate_diffs(
  prev_df,
  curr_df,
  interest_cols
)

# all rows are different - view the output types
names(difflist)

difflist
```

### Generate national IQRs

The function `generate_national_iqr()` generates the inter-quartile range of the impact outcome for a dataset.

```{r gen_iqr}
# assign dummy values to check functionality
df <- prev_df
df$deaths_averted <- withr::with_seed(
  1, rnorm(rows, 1e3, 100)
)

iqr_df <- gen_national_iqr(df, value_cols = "deaths_averted")

iqr_df
```

### Flag large differences

The function `flag_large_diffs()` can be used with the outputs of `generate_diffs()` and `gen_national_iqr()` to find rows where the impact estimate is outside the range expected, given the IQR.

```{r flag_diffs}
# assign some dummy values that will trigger flagging
difflist2 <- difflist
difflist2$deaths_averted$COMPARE <- 1e9 # typical values for BASE are ~1000

# all rows are flagged as having differences > IQR
flag_large_diffs(difflist2, iqr_df, "deaths_averted")
```

### Generate a combined dataset

`gen_combined_df()` can be used to generate a combined dataset across two different touchstones.

```{r gen_combined}
# regenerate data
prev_df <- flag_duplicates(eg_impact)
prev_df <- dplyr::filter(prev_df, n_key == 1)
prev_df <- tidyr::pivot_wider(
  prev_df,
  id_cols = {{ COLNAMES_KEY_PRESSURE_TEST }},
  names_from = "burden_outcome",
  values_from = "impact"
)
prev_df$support_type <- "other" # unsure what values this can take
prev_df$coverage <- 0.5
prev_df$fvps <- 1e6
prev_df$target_population <- 2e6
prev_df$touchstone <- "202010"

# assign dummy values
curr_df <- prev_df
curr_df$deaths_averted <- 1e6
curr_df$dalys_averted <- 1e9
curr_df$touchstone <- "202110"

gen_combined_df(prev_df, curr_df)
```

## Comparing national values to regional values

`compare_natl_subreg()` allows comparing national impact rates with regional rates, where regions are the WHO regions.

There is no example for this functionality as yet.

## Plotting functions

This section covers plotting functions.

First we prepare some dummy data for plotting.

```{r plot_prep}
# preparatory code with dummy data
prev_df <- flag_duplicates(eg_impact)
prev_df <- dplyr::filter(prev_df, n_key == 1)
prev_df <- tidyr::pivot_wider(
  prev_df,
  id_cols = {{ COLNAMES_KEY_PRESSURE_TEST }},
  names_from = "burden_outcome",
  values_from = "impact"
)
prev_df$support_type <- "other" # unsure what values this can take
prev_df$coverage <- 0.5
prev_df$fvps <- 1e6
prev_df$target_population <- 2e6

prev_df$deaths_averted <- withr::with_seed(
  1,
  rnorm(nrow(prev_df), 100, 0.1)
)
prev_df$dalys_averted <- prev_df$deaths_averted * 100
prev_df$touchstone <- "202010"

# assign dummy values
curr_df <- prev_df
curr_df$deaths_averted <- withr::with_seed(
  1,
  rnorm(nrow(prev_df), 300, 0.1)
)
curr_df$dalys_averted <- curr_df$deaths_averted * 100
curr_df$touchstone <- "202110"

interest_cols <- c("deaths_averted", "dalys_averted")
changes <- generate_diffs(
  prev_df,
  curr_df,
  interest_cols
)

# national IQR - inset dummy values for tests
national_iqr <- gen_national_iqr(prev_df)
national_iqr$national_iqr_deaths_averted <- seq_len(nrow(national_iqr))
```

### Plotting significant differences

Find and flag large diffs using `flag_large_diffs()` and visualise the output using `plot_sig_diff()`.

```{r plot_sig_diff}
flag_large_diffs(changes, national_iqr) |>
  plot_sig_diff()
```

### Plotting modelling group variation

Visualise variation in impact by modelling group using `plot_modelling_group_variation()`.

Data should be prepared using `prep_plot_mod_grp_varn()` first.

```{r plot_mod_grp_var}
prev_df_copy <- dplyr::select(curr_df, vaccine, modelling_group) %>%
  dplyr::distinct() %>%
  dplyr::group_by(vaccine) %>%
  dplyr::mutate(mod_num = dplyr::row_number())

prep_plot_mod_grp_varn(curr_df, prev_df_copy) |>
  plot_modelling_group_variation()
```

### Plotting GAVI vaccination

Use `plot_vaccine_gavi()` on data that has been prepared using `prep_plot_vax_gavi()`.

```{r plot_gavi}
prep_plot_vax_gavi(curr_df, prev_df, "deaths_averted") |>
  plot_vaccine_gavi()
```

### Plot cumulative values

Use `plot_cumul()` on data prepared using `prep_plot_cumul()` and `gen_combined_df()`.

```{r plot_cumul}
gen_combined_df(prev_df, curr_df) |>
  prep_plot_cumul("deaths_averted", "Measles") |>
  plot_cumul()
```