A Hierarchical Framework for Correcting Under-Reporting in Count Data

[Stoner et al.]

April 24, 2024


What is Under-Reporting?

  • Consider count data with true counts \(y_i\)
    • If \(y_i\) is perfectly observed, the counts can be modeled by an appropriate conditional distribution \(p(y_i | \boldsymbol{\theta})\)
    • Usually either Poisson or Negative Binomial
  • In some cases the true counts are not observed, but instead \(z_i\) are observed such that \(z_i \le y_i\)
  • Using the conditional distribution above on the observed counts \(z_i\) can bias the results of the model if under-reporting is not accounted for
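To make the bias concrete, here is a minimal simulation sketch. It assumes Poisson true counts and independent reporting of each case with a fixed probability; the values `lam=10.0` and `pi=0.7` and the helper names are illustrative choices, not taken from the paper.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def sample_binomial(n, p, rng):
    """Draw one Binomial(n, p) variate as a sum of n Bernoulli(p) trials."""
    return sum(1 for _ in range(n) if rng.random() < p)

def simulate(n_obs, lam, pi, seed=0):
    """Simulate true counts y and under-reported counts z with z_i <= y_i."""
    rng = random.Random(seed)
    y = [sample_poisson(lam, rng) for _ in range(n_obs)]
    z = [sample_binomial(y_i, pi, rng) for y_i in y]
    return y, z

# Illustrative values (assumptions): true mean 10, 70% of cases reported.
y, z = simulate(n_obs=5000, lam=10.0, pi=0.7)
mean_y = sum(y) / len(y)  # estimates the true mean
mean_z = sum(z) / len(z)  # naive estimate from observed counts, biased downward
print(f"mean of true counts: {mean_y:.2f}, mean of observed counts: {mean_z:.2f}")
```

Fitting a Poisson model directly to `z` would recover a mean close to `pi * lam` rather than `lam`.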


  • Previous research suggests that TB cases in Brazil are under-reported, for several reasons
    • Underdeveloped rural regions
    • Immature disease tracking infrastructure
    • High cost of testing
    • etc.
  • The paper builds a model in which the severity of under-reporting is estimated and, where possible, informed by available covariates related to the under-reporting mechanism [1]

Observed data

  • Sub-regional TB case counts
  • Annual counts for 2012-2014
  • The TB detection rate in Brazil was estimated at 91%, 84%, and 87% for 2012, 2013, and 2014, respectively [1]
  • Spatial clustering

Previous methods

  1. Using validation data
    • Use a more accurate data set to build a model for the error-prone data
  2. Censored likelihood methods
    • Use a standard likelihood with an indicator variable for full reporting
    • Requires information to determine which counts are under-reported
    • A threshold can be used if no information about which counts are under-reported is available

Censored likelihood

The censored likelihood is written as

\[ p(\boldsymbol{y} | \boldsymbol{z, \theta}) = \prod_{I_i=1} p(y_i|\boldsymbol{\theta}) \prod_{I_i = 0} p(y_i \ge z_i | \boldsymbol{\theta}) \]

where \(I_i\) indicates whether count \(i\) is fully reported.

  • \(I_i = 1\) when \(z_i = y_i\) (fully reported)
  • \(I_i = 0\) when the count may be under-reported, so \(z_i\) is only a lower bound for \(y_i\)
  • Accounting for under-reporting in the model design makes inference on \(\boldsymbol{\theta}\) more robust
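As a rough illustration of the censored likelihood, the sketch below evaluates it for a Poisson model with a single common mean. The helper names, the toy counts, and the grid search are assumptions made for this example, not the paper's implementation.

```python
import math

def poisson_logpmf(k, lam):
    """Log of the Poisson(lam) probability mass at k."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def poisson_tail(z, lam):
    """P(Y >= z) for Y ~ Poisson(lam), computed as 1 - P(Y <= z - 1)."""
    cdf_below = sum(math.exp(poisson_logpmf(k, lam)) for k in range(z))
    return max(1.0 - cdf_below, 1e-300)  # guard against log(0)

def censored_loglik(z, fully_reported, lam):
    """Censored log-likelihood: pmf terms where I_i = 1, tail terms where I_i = 0."""
    total = 0.0
    for z_i, full in zip(z, fully_reported):
        if full:
            total += poisson_logpmf(z_i, lam)
        else:
            total += math.log(poisson_tail(z_i, lam))
    return total

# Toy data (hypothetical): the second and third counts are flagged as under-reported.
z = [4, 7, 2, 9]
indicator = [True, False, False, True]

# Crude grid search for the maximizing Poisson mean.
lams = [0.1 * k for k in range(1, 301)]
best_full = max(lams, key=lambda l: censored_loglik(z, [True] * len(z), l))
best_censored = max(lams, key=lambda l: censored_loglik(z, indicator, l))
print(f"estimate treating all counts as exact: {best_full:.1f}")
print(f"estimate with censoring: {best_censored:.1f}")
```

Treating the flagged counts as lower bounds pulls the estimated mean upward relative to treating every count as exact.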


The Poisson-Logistic model

  • Instead of a binary indicator, use a reporting probability on the continuous range [0,1]
    • This can be interpreted as the proportion of the true count that is reported
  • The hierarchical model consists of:
    • A binomial component for the observed count \(z_i\)
    • A latent Poisson model for the true count \(y_i\)
  • The Poisson-Logistic model has been used across many fields: economics, criminology, natural hazards, epidemiology [1], etc.

Poisson-Logistic model

The Poisson-Logistic (Pogit) model is given by
\[\begin{align*}
z_i | y_i &\sim \text{Binomial}(\pi_i, y_i) \\
\log \left(\frac{\pi_i}{1 - \pi_i}\right) &= \beta_0 + \sum_{j=1}^{J} \beta_j w_i^{(j)} \\
y_i &\sim \text{Poisson}(\lambda_i) \\
\log(\lambda_i) &= \alpha_0 + \sum_{k=1}^{K}\alpha_k x_i^{(k)}
\end{align*}\]

  • The vectors \(\boldsymbol{\alpha} = (\alpha_0, \ldots, \alpha_K)\) and \(\boldsymbol{\beta} = (\beta_0, \ldots, \beta_J)\) are parameters to be estimated
  • With mean-centered \(\mathbf{X}\) and \(\mathbf{W}\), \(\alpha_0\) is interpreted as the mean of \(y_i\) on the log scale and \(\beta_0\) as the reporting rate \(\pi_i\) on the logit scale, both with the covariates at their means
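
A short worked derivation may help show what the observed counts alone can identify. Summing the latent \(y_i\) out of the Pogit model (a standard Poisson thinning argument, written out here as an aside rather than taken from the slides) gives
\[\begin{align*}
p(z_i) &= \sum_{y_i = z_i}^{\infty} \binom{y_i}{z_i} \pi_i^{z_i} (1 - \pi_i)^{y_i - z_i} \, \frac{\lambda_i^{y_i} e^{-\lambda_i}}{y_i!} \\
&= \frac{(\pi_i \lambda_i)^{z_i} e^{-\lambda_i}}{z_i!} \sum_{m=0}^{\infty} \frac{\left\{(1 - \pi_i)\lambda_i\right\}^{m}}{m!} \\
&= \frac{(\pi_i \lambda_i)^{z_i} e^{-\pi_i \lambda_i}}{z_i!},
\end{align*}\]
so that marginally \(z_i \sim \text{Poisson}(\pi_i \lambda_i)\). The data inform only the product \(\pi_i \lambda_i\): a low reporting rate with a high true mean fits as well as the reverse. Separating the two requires extra information, such as an informative prior on \(\beta_0\) or fully reported counts.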

Extended model

Let \(z_{t, s}\) be the observed (under-reported) counts, \(y_{t, s}\) be the true unknown counts, \(\pi_s\) be the reporting probability, and \(\lambda_{t, s}\) be the Poisson mean.

The hierarchical model from the paper is
\[\begin{align*}
z_{t, s} | y_{t, s}, \gamma_{t, s} \sim \text{Binomial}(\pi_s, &y_{t, s}) \\
&\downarrow \\
&y_{t, s} | \phi_s, \theta_s \sim \text{Poisson}(\lambda_{t, s}),
\end{align*}\]
where \(\pi_s\) and \(\lambda_{t, s}\) are defined as
\[\begin{align*}
\log\left(\frac{\pi_s}{1 - \pi_s}\right) &= \beta_0 + g(u_s) + \gamma_{t, s} \\
\log(\lambda_{t, s}) &= \log(P_{t, s}) + a_0 + f_1(x_s^{(1)}) + f_2(x_s^{(2)}) \\
&\quad + f_3(x_s^{(3)}) + f_4(x_s^{(4)}) + \phi_s + \theta_s.
\end{align*}\]


Simulation Results

  • Data were simulated with spatial correlation and under-reporting
  • A sensitivity analysis found that, even with no completely observed counts, the model quantifies uncertainty robustly
    • Provided the practitioner specifies an informative prior for \(\beta_0\)

Model results for TB counts

  • The predicted reporting probabilities (\(\pi_s\))
  • There appears to be substantial clustering
    • High reporting in the south-central part of Brazil
  • The predicted residual spatial variability in the TB incidence rate (\(\phi_s + \theta_s\))
  • There appears to be substantial clustering
    • Negative values in the center of Brazil
    • Positive values in the northwest

Estimated true values

Here the observed counts are shown along with the 5%, 50%, and 95% quantiles for the predicted unreported cases.
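
One way such quantiles could be computed from posterior output is sketched below. It uses the thinning property that, given \(\pi\) and \(\lambda\), the unreported count \(y - z\) follows a Poisson distribution with mean \((1 - \pi)\lambda\). The posterior draws here are made-up placeholders for a single region, and the paper's actual computation may differ.

```python
import math
import random

def sample_poisson(lam, rng):
    """Draw one Poisson(lam) variate with Knuth's method (adequate for small lam)."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k

def quantile(sorted_values, q):
    """Empirical quantile by linear interpolation between order statistics."""
    pos = q * (len(sorted_values) - 1)
    lo, hi = math.floor(pos), math.ceil(pos)
    frac = pos - lo
    return sorted_values[lo] * (1 - frac) + sorted_values[hi] * frac

def predict_unreported(draws, rng):
    """One predictive draw of the unreported count per posterior draw (pi, lam), sorted."""
    return sorted(sample_poisson((1 - pi) * lam, rng) for pi, lam in draws)

rng = random.Random(1)
# Hypothetical posterior draws for one region (assumption, not real model output):
# reporting probability near 0.7, Poisson mean near 50.
draws = [(rng.betavariate(70, 30), rng.gammavariate(100, 0.5)) for _ in range(4000)]

unreported = predict_unreported(draws, rng)
q05, q50, q95 = (quantile(unreported, q) for q in (0.05, 0.50, 0.95))
print(f"unreported cases: 5% = {q05:.0f}, 50% = {q50:.0f}, 95% = {q95:.0f}")
```

Adding the observed count to each predictive draw gives the corresponding quantiles for the true count.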


  • This is a flexible model for dealing with under-reported count data
    • Relies on informative priors
    • Does not require validation data
  • Provides predictions at a micro-regional level
    • Can guide best use of resources to improve reporting

Thank you!


[1] O. Stoner, T. Economou, and G. Drummond Marques da Silva, “A Hierarchical Framework for Correcting Under-Reporting in Count Data,” Journal of the American Statistical Association, vol. 114, no. 528, pp. 1481–1492, Oct. 2019, doi: 10.1080/01621459.2019.1573732.