AutoWMM: Automating the WMM on Trees

library(AutoWMM)

Introduction

In population size estimation, methods based on back-calculation (multiplier methods) are a popular approach to estimating the size of a target population that is partially hidden and not directly enumerable from existing data. The idea is to combine a subpopulation with a known count and an estimate of the proportion of the target population belonging to that subgroup to estimate the size of the target population. This basic method does not extend directly to the case where many subgroup counts are available, in which case evidence could be synthesized to provide a more accurate estimate of the target population size. If subgroups are mutually exclusive, a tree-like structure can be created with the target population at the root node, and children (or grandchildren) of the root representing the subgroups with known counts. To combine the available evidence, a weighted sum of the estimates from each back-calculated path can be formed; this package automates that calculation using the weighted multiplier method (WMM) (Flynn and Gustafson 2024b). Variance-minimizing weights are used to provide an estimate of the root population size on any admissible tree-structured data, and the package also offers options for visual rendering of trees for reporting of results.
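
As a minimal sketch of the basic back-calculation step, the arithmetic can be written in a few lines of base R. The numbers are illustrative (they match one leaf of the example tree used later in this vignette), and the variable names are assumptions made for this sketch rather than part of the package.

```r
## Basic multiplier method: back-calculate the target population size
## from a subgroup with a known count and an estimated proportion.
## Variable names are illustrative, not part of AutoWMM.
subgroup_count <- 500   # known size of the subgroup
p_hat <- 34 / 70        # survey estimate of the share of the target population in the subgroup

target_hat <- subgroup_count / p_hat
round(target_hat, 1)
#> [1] 1029.4
```

With several such subgroups, each one yields its own back-calculated estimate, which is what motivates combining them with weights.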

The implementation behind these functions is described at length elsewhere (Flynn and Gustafson 2024a), and is based on previously developed methodology (Flynn and Gustafson 2024b). A more extensive application to real-world data can be found in (Flynn, Gustafson, and Irvine 2024).

makeTree()

The makeTree function creates a tree object from any admissible dataframe; for a dataframe to be admissible, it must contain specific column labels. These columns include the following:

  • ‘from’ (string, node label)

  • ‘to’ (string, node label)

  • ‘Estimate’ (+ integer)

  • ‘Total’ (+ integer)

  • ‘Count’ (+ integer)

The root node is assumed to represent the target population for which estimation is sought. The makeTree function accepts a data.frame object with these columns and creates a data.tree object (from the data.tree package) from it; it also checks the dataframe columns and structure to ensure the data.tree object can be used for root node population size estimation.
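
To make the idea of an admissible dataframe concrete, the following base R sketch checks for the required columns and for a single root. It is a hypothetical illustration only: check_admissible() is not an AutoWMM function, and the checks actually performed inside makeTree may differ.

```r
## Hypothetical admissibility check (not the package's internal code).
check_admissible <- function(df) {
  required <- c("from", "to", "Estimate", "Total", "Count")
  missing_cols <- setdiff(required, names(df))
  if (length(missing_cols) > 0) {
    stop("missing required column(s): ", paste(missing_cols, collapse = ", "))
  }
  ## A tree has exactly one root: a label that appears in 'from' but never in 'to'.
  roots <- setdiff(unique(df$from), unique(df$to))
  if (length(roots) != 1) {
    stop("expected exactly one root node, found ", length(roots))
  }
  ## Every non-root node has exactly one parent, so 'to' labels must be unique.
  if (anyDuplicated(df$to) > 0) {
    stop("each node may appear only once in the 'to' column")
  }
  invisible(roots)
}

edges <- data.frame(from = c("Z", "Z", "A", "A"),
                    to = c("A", "B", "C", "D"),
                    Estimate = c(4, 34, 9, 1),
                    Total = c(11, 70, 10, 10),
                    Count = c(NA, 500, NA, 50),
                    stringsAsFactors = FALSE)
check_admissible(edges)   # returns the root label "Z" invisibly
```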

Further details on each column are as follows:

  • ‘from’ (node label) and ‘to’ (node label), which encode the tree structure. Together, from and to describe the edge for that row of data.

  • ‘Estimate’ (+ integer) and ‘Total’ (+ integer), which are used as parameters in a Beta(Estimate+1, Total-Estimate+1) distribution at each branch or, if ‘Total’ is equal for all branches in a sibling group, in a Dirichlet distribution with parameter Estimate+1 for each child of the sibling group. Estimate and Total are assumed to come from a survey of size Total (a sample of the population at node from) in which Estimate of the sampled individuals were observed to move to the node given by to. These columns are used for branch probabilities only.

  • ‘Count’ (for terminal nodes with marginal counts). The Count column is NA for rows where the to node is not a leaf, and also for all leaves without a marginal count.
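
The two sampling distributions named above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's sampler: the Dirichlet draws are built by normalising independent Gamma variates (a standard construction), and the object names are invented for this sketch.

```r
set.seed(1)
n_draws <- 1000

## Single branch: Beta(Estimate + 1, Total - Estimate + 1)
estimate <- 4
total <- 11
p_beta <- rbeta(n_draws, estimate + 1, total - estimate + 1)

## Sibling group with equal Total: Dirichlet with parameters Estimate + 1,
## drawn here by normalising independent Gamma variates.
sibling_estimates <- c(C = 9, D = 1)
alpha <- sibling_estimates + 1
gamma_draws <- sapply(alpha, function(a) rgamma(n_draws, shape = a))
p_dirichlet <- gamma_draws / rowSums(gamma_draws)

range(rowSums(p_dirichlet))   # each draw of sibling probabilities sums to 1
colMeans(p_dirichlet)         # close to alpha / sum(alpha)
```

The practical difference is that Dirichlet draws for a sibling group sum to one on every run, whereas independent Beta draws for different branches need not.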

A Population (logical) column is not required, but can be added if Estimate and Total come from population numbers rather than samples. A Description column (string) can also be included if particular descriptions are desired on the visual diagram made by the drawTree function.

An example of this package in use can be seen in the following:

## create admissible dataset
treeData <- data.frame("from" = c("Z", "Z", "A", "A"),
                        "to" = c("A", "B", "C", "D"),
                        "Estimate" = c(4, 34, 9, 1),
                        "Total" = c(11, 70, 10, 10),
                        "Count" = c(NA, 500, NA, 50),
                        "Population" = c(FALSE, FALSE, FALSE, FALSE),
                        "Description" = c("First child of the root", "Second child of the root",
                        "First grandchild", "Second grandchild"))

## make tree object using makeTree
tree <- makeTree(treeData)
tree
#>   levelName
#> 1 Z        
#> 2  ¦--A    
#> 3  ¦   ¦--C
#> 4  ¦   °--D
#> 5  °--B
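
Because the from and to columns encode every edge, the root-to-leaf paths can also be recovered from the dataframe alone. The base R sketch below traces the path for each leaf with a non-missing Count; path_to_root() is a hypothetical helper written for this illustration and is not part of AutoWMM or data.tree.

```r
## Sketch: recover root-to-leaf paths from the edge list alone (base R only).
edges <- data.frame(from = c("Z", "Z", "A", "A"),
                    to = c("A", "B", "C", "D"),
                    Count = c(NA, 500, NA, 50),
                    stringsAsFactors = FALSE)

parent_of <- setNames(edges$from, edges$to)   # child label -> parent label

path_to_root <- function(node) {
  path <- node
  while (node %in% names(parent_of)) {        # the root has no parent entry
    node <- parent_of[[node]]
    path <- c(node, path)
  }
  paste(path, collapse = "/")
}

leaves_with_count <- edges$to[!is.na(edges$Count)]
unname(sapply(leaves_with_count, path_to_root))
#> [1] "Z/B"   "Z/A/D"
```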

drawTree()

The drawTree function creates a descriptive diagram of a tree object created using makeTree, which allows the user to visualize the tree before it is used for WMM estimation. It prints descriptions or node labels on nodes, and probabilities based on previous surveys on branches. Specifically, node descriptions are shown within their respective nodes if provided by the Description column of the dataframe used in makeTree, and branch probabilities, calculated as the ratio of the data columns Estimate over Total, are shown along tree branches.

The function has three arguments, the first being the makeTree tree object the user would like to render. The user may also specify an argument probs, which takes the values TRUE/FALSE and determines whether probabilities will be displayed along tree branches. Lastly, the argument desc takes the values TRUE/FALSE and determines whether descriptions will be displayed in each node; for desc=TRUE, a Description column must be included in the data.frame passed to makeTree when the tree is constructed.

A use case, using the above tree created by makeTree, is as follows:

## draw tree pre-estimation, with descriptions on nodes (default), and suppressing probabilities on branching
drawTree(tree, probs = FALSE)
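
## The branch probabilities that drawTree displays when probs = TRUE are described above as the ratio Estimate / Total.
## The base R sketch below computes that ratio for the example edges; the rounding to two decimals is an assumption made for display here and may not match the diagram's formatting.

```r
## Sketch of the branch labels: the ratio Estimate / Total for each edge.
edges <- data.frame(from = c("Z", "Z", "A", "A"),
                    to = c("A", "B", "C", "D"),
                    Estimate = c(4, 34, 9, 1),
                    Total = c(11, 70, 10, 10),
                    stringsAsFactors = FALSE)
edges$prob <- round(edges$Estimate / edges$Total, 2)
edges[, c("from", "to", "prob")]
```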

wmmTree()

This function is the central functionality of the AutoWMM package, and performs WMM estimation on the tree created with makeTree. It computes an estimate of the size of a target population represented by the root node of tree-structured data. The wmmTree function accepts a tree object made using the makeTree function, and generates an estimate of the root node (the target population) by combining multiple back-calculated path estimates. The weighted estimate uses variance-minimizing weights to combine back-calculated estimates of the root obtained via the multiplier method. It incorporates data from all leaves with both 1) a known marginal leaf count at the terminal point of the root-to-leaf path, and 2) available branch probability estimates for each segment along the root-to-leaf path. These paths are deemed “informative” (Flynn and Gustafson 2024b).

The wmmTree function uses the following arguments:

  • tree: A tree object created using makeTree function.

  • sample_length: The desired number of estimates of the target (root) node.

  • method: The method of back-calculation. Current version only supports the default multiplier method to produce back-calculated estimates using an internal function, method = "mmEstimate", unless the tree is a special case using single-source sibling data only. In the latter case, the method uses closed forms to generate path specific means and variances and WMM estimate (see single.source argument below).

  • int.type: The type of confidence interval desired. The default, and recommended, interval is the central 95% interval using quantiles (int.type = "quants"). Setting this argument to "var" generates a 95% confidence interval based on sample variance, while the "cox" option generates the Cox interval for log-normal data.

  • single.source: This logical argument should be set to TRUE only when all sibling groups which have branches that are used in back-calculated paths are fully informed by a single source of data. In this special case, sampling can be bypassed, and closed forms are used to generate the WMM root estimate and its uncertainty; at this time, this method provides an approximation only, as paths are assumed independent.
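
To give a feel for how the three int.type options differ, the base R sketch below computes three intervals from a vector of simulated values standing in for sampled root estimates. Everything here is an assumption made for illustration: the simulated data, the mean plus or minus 1.96 standard deviations form used for the variance-based interval, and the commonly cited form of the Cox interval for a log-normal mean. The package's internal calculations may differ.

```r
set.seed(42)
## Stand-in for sampled root estimates (log-normal purely for illustration).
root_samples <- rlnorm(2000, meanlog = log(1000), sdlog = 0.25)

## Central 95% interval from sample quantiles (the idea behind "quants").
ci_quants <- quantile(root_samples, probs = c(0.025, 0.975))

## Interval from the sample variance: mean +/- z * sd (an assumed form for "var").
ci_var <- mean(root_samples) + c(-1, 1) * qnorm(0.975) * sd(root_samples)

## Cox interval for the mean of log-normal data, in its commonly cited form.
y <- log(root_samples)
n <- length(y)
s2 <- var(y)
cox_centre <- mean(y) + s2 / 2
cox_half <- qnorm(0.975) * sqrt(s2 / n + s2^2 / (2 * (n - 1)))
ci_cox <- exp(cox_centre + c(-1, 1) * cox_half)

ci_quants
ci_var
ci_cox
```

Note that, in this sketch, the quantile and variance-based intervals describe the spread of the sampled values, while the Cox form is an interval for the log-normal mean and is therefore much narrower for large samples.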

Back-calculation from each leaf proceeds as follows. Surveys or literature estimates are assumed to inform the Estimate and Total columns of the dataframe. The branches are then sampled sample_length times (i.e., the number of runs) using a variety of methods, depending on the sources of data used to inform sibling branches. For example, a branch informed by a survey of sample size Total, in which Estimate individuals were observed to move from the parent to the child node, can be assigned a Beta(Estimate+1, Total-Estimate+1) distribution. For more complex configurations of source knowledge, importance sampling and rejection schemes are also employed to ensure consistency among sibling branch groups and root estimates for a given run. For each leaf with non-empty Count, we back-calculate by multiplying the count by the sampled inverse probabilities of each branch along the root-to-leaf path.
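
The back-calculation step for a single path can be sketched in base R as below, using the counts and survey numbers from the first example tree. This is a simplified illustration, not the package's algorithm: every branch is drawn from an independent Beta distribution, and the importance sampling and rejection steps that keep sibling groups consistent are left out.

```r
set.seed(123)
sample_length <- 1000

## Path Z -> A -> D from the first example: Count = 50 at leaf D.
## Independent Beta draws per branch are a simplification.
p_A <- rbeta(sample_length, 4 + 1, 11 - 4 + 1)   # branch Z -> A
p_D <- rbeta(sample_length, 1 + 1, 10 - 1 + 1)   # branch A -> D
root_from_D <- 50 / (p_A * p_D)

## Path Z -> B: Count = 500 at leaf B.
p_B <- rbeta(sample_length, 34 + 1, 70 - 34 + 1)
root_from_B <- 500 / p_B

summary(root_from_D)
summary(root_from_B)
```

The longer path through a small survey (A to D) produces far more variable root estimates than the short path through B, which is why the two paths should not be weighted equally.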

The function ultimately generates sample_length weighted estimates of the root target population. Using the functionality of the data.tree package on the makeTree object, the following are also available:

  • confidence intervals of estimates of the root provided by leaves with marginal counts, as node$uncertainty

  • probability samples, as node$probability_samples

  • root estimates from terminal nodes with non-empty Count, as node$targetEst_samples

The output of the function is a list of four entries: the WMM root node estimate, its corresponding uncertainty, the weights associated with each path that provides a root node estimate (an informative path), and a vector of the estimates given on each run. Specifically, the outputs are as follows:

  • The returned list has four entries. In the first position is the WMM root node size estimate given by the synthesis across all informative paths. The second entry is the uncertainty associated with the root node size estimate. The third entry is a vector of weights which sum to one, with length equal to the number of informative paths, giving the weight associated with each of those paths. The fourth and final entry is a vector of length sample_length containing the estimates of the target (root) population size given by the WMM for each run.

  • The second entry can alternatively be accessed using the Get function through data.tree to access the node attribute uncertainty; it gives the 95% confidence intervals for estimates given by leaves with marginal counts, as well as the root node.

  • The third entry can alternatively be accessed using the Get function through data.tree to access the node attribute probability_samples; it returns vectors of probabilities sampled for all branches leading into each node.

  • The fourth entry can alternatively be accessed using the Get function through data.tree to access the node attribute targetEst_samples; it returns vectors of root estimates calculated for all paths with marginal counts on the respective root-to-leaf path.

The calculation of estimates relies on the internal functions confInts, ko.weights, meanlogEstimates, mmEstimate, root.confInt, rootEsts, logEstimates, and sampleBeta.

The function can be demonstrated using the tree created in the makeTree section above:

## perform root node estimation
## small sample_length was chosen for efficiency across machines
Zhats <- wmmTree(tree, sample_length = 3)
#> using variance-weighted mean with multiplier method sampled path estimates

The user can then print the estimates of the root node generated by each iteration, the weights of each branch, the final estimate of the root node population size calculated using the WMM, and the final rounded estimate of the root:

# print the estimates of the root node generated by the iterations
Zhats$estimates 
#>        [,1]
#> D    891.09
#> D.1 1503.63
#> D.2  807.97
# prints the weights of each branch
Zhats$weights 
#>              D         B
#> [1,] 0.1559304 0.8440696
# prints the final estimate of the root node by WMM
Zhats$root 
#>      Z 
#> 1026.8
# prints the final rounded estimate of the root with conf. int.
Zhats$uncertainty 
#>          Z
#> lower  812
#> upper 1465
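
## One standard way to obtain variance-minimizing weights that sum to one is sketched below in base R.
## It is offered as an illustration of the idea only: the path estimates are simulated, the object
## names are invented, and whether the package's internal weighting matches this construction
## exactly is not confirmed here.

```r
set.seed(7)
## Simulated stand-ins for two paths' sampled root estimates (rows = runs).
path_estimates <- cbind(D = rlnorm(500, log(1000), 0.45),
                        B = rlnorm(500, log(1000), 0.12))

## Minimum-variance weights subject to summing to one:
## w = S^{-1} 1 / (1' S^{-1} 1), with S the sample covariance of the path estimates.
S <- cov(path_estimates)
ones <- rep(1, ncol(path_estimates))
w <- solve(S, ones)
w <- w / sum(w)

w                                        # the less variable path (B) gets most of the weight
sum(w)                                   # weights sum to one
weighted_runs <- path_estimates %*% w    # one weighted root estimate per run
mean(weighted_runs)
```

## Because the only constraint in this construction is that the weights sum to one, an individual
## weight can be negative when path estimates are correlated.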

The user may also use the data.tree functionality with the makeTree object to obtain the average root estimate with a 95% confidence interval, the samples generated from each path which provided the root estimate, as well as the sampled probabilities for each branch over the iterations:

## show the average root estimate with 95% confidence interval, as well as
## average estimates with confidence intervals for each node with a marginal
## count
tree$Get('uncertainty')
#>          Z  A  C         D        B
#> lower  812 NA NA  658.3958  764.426
#> upper 1465 NA NA 2085.4851 1421.119

## show the samples generated from each path which provides root estimates
tree$Get('targetEst_samples')
#> $Z
#> numeric(0)
#> 
#> $A
#> numeric(0)
#> 
#> $C
#> numeric(0)
#> 
#> $D
#>         D         D         D 
#> 2104.1151 1761.2526  625.1666 
#> 
#> $B
#> [1]  760.3020 1460.3421  847.1723

## show the probabilities sampled at each branch leading into the given node
tree$Get('probability_samples')
#>       Z         A         C          D         B
#> [1,] NA 0.3423666 0.9305921 0.06940794 0.6576334
#> [2,] NA 0.6576145 0.9568305 0.04316950 0.3423855
#> [3,] NA 0.4098013 0.8048355 0.19516454 0.5901987

A second example using a slightly larger tree can be seen as follows:

## create 2nd admissible dataset
## this example handles many branch sampling cases, including all siblings informed from different surveys, same survey, and mixed case, as well as some siblings not informed and the rest from different surveys, same survey, and mixed case.
treeData2 <- data.frame("from" = c("Z", "Z", "Z",
                                    "A", "A",
                                    "B", "B", "B",
                                    "C", "C", "C",
                                    "H", "H", "H",
                                    "K", "K", "K"),
                        "to" = c("A", "B", "C",
                                  "D", "E",
                                  "F", "G", "H",
                                  "I", "J", "K",
                                  "L", "M", "N",
                                  "O", "P", "Q"),
                        "Estimate" = c(24, 34, 12,
                                      9, 1,
                                      NA, 19, 1,
                                      NA, 2, 1,
                                      20, 10, 12,
                                      5, 3, NA),
                        "Total" = c(70, 70, 70,
                                    10, 11,
                                    NA, 30, 8,
                                    NA, 12, 12,
                                    40, 40, 40,
                                    10, 10, NA),
                        "Count" = c(NA, NA, NA,
                                    50, NA,
                                    NA, 15, NA,
                                    NA, 10, NA,
                                    NA, NA, 20,
                                    5, 2, NA))

## make tree object using makeTree
tree2 <- makeTree(treeData2)
#> WARNING: No 'Population' column exists. We assume all
#>           values are sample estimates

## perform root node estimation
Zhats <- wmmTree(tree2, sample_length = 3)
#> using variance-weighted mean with multiplier method sampled path estimates
Zhats$estimates # print the estimates of the root node generated by the 3 iterations
#>       [,1]
#> A   712.32
#> A.1 338.30
#> A.2 771.11
Zhats$weights # prints the weights of each branch
#>                D         G           N         J         O         P
#> [1,] 0.004186207 0.1495208 -0.01443815 0.1646713 0.2533162 0.4427437
Zhats$root # prints the final estimate of the root node by WMM
#>      Z 
#> 570.64
Zhats$uncertainty # prints the final rounded estimate of the root with conf. int.
#>         Z
#> lower 351
#> upper 768

## show the average root estimate with 95% confidence interval, as well as average estimates with confidence intervals for each node with a marginal count
tree2$Get('uncertainty')
#>         Z  A        D  E  B  F        G  H  L  M        N  C  I         J  K
#> lower 351 NA 230.0965 NA NA NA 34.37295 NA NA NA 598.1148 NA NA  275.1833 NA
#> upper 768 NA 324.7290 NA NA NA 51.63774 NA NA NA 623.3015 NA NA 2787.2710 NA
#>              O         P  Q
#> lower 1084.351  389.3079 NA
#> upper 2286.701 1586.0396 NA

## show the samples generated from each path which provides root estimates
tree2$Get('targetEst_samples')
#> $Z
#> numeric(0)
#> 
#> $A
#> numeric(0)
#> 
#> $D
#>        A        A        A 
#> 227.2910 290.4938 326.6387 
#> 
#> $E
#> numeric(0)
#> 
#> $B
#> numeric(0)
#> 
#> $F
#> numeric(0)
#> 
#> $G
#>        G        G        G 
#> 45.05961 33.88667 52.00941 
#> 
#> $H
#> numeric(0)
#> 
#> $L
#> numeric(0)
#> 
#> $M
#> numeric(0)
#> 
#> $N
#>        H        H        H 
#> 597.8559 624.3857 603.0553 
#> 
#> $C
#> numeric(0)
#> 
#> $I
#> numeric(0)
#> 
#> $J
#>         J         J         J 
#> 3103.1605  362.4960  271.2209 
#> 
#> $K
#> numeric(0)
#> 
#> $O
#>        O        O        O 
#> 2307.993 1052.302 1917.485 
#> 
#> $P
#>         P         P         P 
#>  536.9465  382.7753 1679.0786 
#> 
#> $Q
#> numeric(0)

## show the probabilities sampled at each branch leading into the given node
tree2$Get('probability_samples')
#>       Z         A         D         E         B  F         G         H
#> [1,] NA 0.3546774 0.6202320 0.3797680 0.4519486 NA 0.7365713 0.1932825
#> [2,] NA 0.2506694 0.6866444 0.3133556 0.6144495 NA 0.7204041 0.1205096
#> [3,] NA 0.3178260 0.4816294 0.5183706 0.5219796 NA 0.5525299 0.1622218
#>              L         M         N         C  I          J          K         O
#> [1,] 0.4282136 0.1888276 0.3829588 0.1933740 NA 0.01666471 0.03415615 0.3279960
#> [2,] 0.3635435 0.2038739 0.4325826 0.1348811 NA 0.20452459 0.08641670 0.4076437
#> [3,] 0.4545039 0.1538352 0.3916609 0.1601944 NA 0.23015981 0.02674769 0.6085613
#>              P  Q
#> [1,] 0.5639388 NA
#> [2,] 0.4482675 NA
#> [3,] 0.2779875 NA

countTree()

After WMM estimation has been performed on a makeTree object, the countTree function renders a diagram of the tree, much like drawTree, but showing the root estimate generated using wmmTree in the root node, and the marginal leaf counts (data column Count) which contributed to the weighted estimate within the corresponding leaf nodes. The means of the sampled branch probabilities generated by wmmTree are also displayed along each branch.

Functionality can be demonstrated using the trees defined above:

## visualize the tree post-estimation, with final weighted root estimate (rounded) displayed in the root node and marginal counts displayed in their respective leaves.  
## means of sampled probability appear on branches, so note that sum of sibling branches may not equal 1.
countTree(tree)
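
## The note about sibling branch means not summing to 1 can be checked by hand. The sketch below
## averages the sampled probabilities for siblings G and H (children of B), with values copied from
## the tree2 probability_samples output shown earlier; their sibling F has no Estimate or Total,
## so its column is NA and is left out. The object names are illustrative.

```r
## Column means of sampled branch probabilities for two informed siblings.
probs <- cbind(G = c(0.7365713, 0.7204041, 0.5525299),
               H = c(0.1932825, 0.1205096, 0.1622218))
branch_means <- colMeans(probs)
round(branch_means, 3)
sum(branch_means)   # about 0.83, so these two branch labels alone do not sum to 1
```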

estTree()

The estTree function is for use after wmmTree has been applied to a makeTree tree object. The function allows the user to visualize the tree with the root size estimate given by wmmTree displayed in the root node, and the root estimate given by each path which contributed to the weighted estimate displayed in the corresponding leaf node. It also displays, on each branch, the average of the probability samples generated by wmmTree.

Functionality can again be demonstrated using the trees defined above:

## visualize the tree post-estimation, with final weighted root estimate (rounded) displayed in the root node and path-specific estimates in their respective leaves.
## The means of sampled probability appear on branches, so note that sum of sibling branches may not equal 1
estTree(tree)

References

Flynn, M. J., and P. Gustafson. 2024a. “AutoWMM and JAGStree - R Packages for Population Estimation on Relational Tree-Structured Data.” [Manuscript in Preparation] Department of Statistics, University of British Columbia.
———. 2024b. “Leveraging Relational Evidence: Population Size Estimation on Tree-Structured Data with the Weighted Multiplier Method.” [Manuscript in Preparation] Department of Statistics, University of British Columbia.
Flynn, M. J., P. Gustafson, and M. A. Irvine. 2024. “Estimating the Number of Opioid Overdoses in British Columbia Using Relational Evidence with Tree Structure.” [Manuscript in Preparation] Department of Statistics, University of British Columbia.