Improving the Analysis of Object (or Cell) Counts with Lots of Zeros | by Daniel Manrique-Castano | Apr, 2024

0


A zero-inflated model effectively captures the nuances of datasets characterized by a preponderance of zeros. It operates by distinguishing between two distinct processes: 1) Determining whether the result is zero, and 2) predicting the values for non-zero results. This dual approach is particularly apt for asking questions like, “Are there any cells present, and if so, how many?”

For handling datasets with an abundance of zeros, we employ models such as hurdle_poisson() and Zero_inflated_poisson, both designed for scenarios where standard count models like the Poisson or negative binomial models prove inadequate (3).Loosely speaking, a key difference between hurdle_poisson() and Zero_inflated_poisson is that the latter incorporates an additional probability component specifically for zeros, enhancing their ability to handle datasets where zeros are not merely common but significant. We’ll see the impact these features have in our modeling strategy using brms.

Fitting a hurdle_poisson model

Let’s start by using the hurdle_poisson() distribution in our modeling scheme:

Hurdle_Fit1 <- brm(Cells ~ Hemisphere, 
data = Svz_data,
family = hurdle_poisson(),
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/2024-04-19_CountsZeroInflated/Hurdle_Fit1.rds",
file_refit = "never")

# Add loo for model comparison
Hurdle_Fit1 <-
add_criterion(Hurdle_Fit1, c("loo", "waic", "bayes_R2"))

Let’s see the results using the standard summary function.

summary(Hurdle_Fit1)

Given this family distribution, the estimates are shown in the log scale (mu = log). In practical terms, this means that the number of cells in the contralateral subventricular zone (SVZ) can be expressed as exp(1.11) = 3.03. Similarly, the ipsilateral hemisphere is estimated to have exp(1.07) = 2.91 times the number of cells. These results align well with our expectations and offer a coherent interpretation of the cell distribution between the two hemispheres.

Additionally, the hu parameter within the “Family Specific Parameters” sheds light on the likelihood of observing zero cell counts. It indicates a 38% probability of zero occurrences. This probability highlights the need for a zero-inflated model approach and justifies its use in our analysis.

To better visualize the implications of these findings, we can leverage the conditional_effects function. This tool in the brms package allows us to plot the estimated effects of different predictors on the response variable, providing a clear graphical representation of how the predictors influence the expected cell counts.

Hurdle_CE <- 
conditional_effects(Hurdle_Fit1)

Hurdle_CE <- plot(Hurdle_CE,
plot = FALSE)[[1]]

Hurdle_Com <- Hurdle_CE +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")

Hurdle_CE_hu <-
conditional_effects(Hurdle_Fit1, dpar = "hu")

Hurdle_CE_hu <- plot(Hurdle_CE_hu,
plot = FALSE)[[1]]

Hurdle_hu <- Hurdle_CE_hu +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")

Hurdle_Com | Hurdle_hu

Figure 5: Conditional effects for the hurdle fit

These plots draw a more logical picture than our first model. The graph on the left shows the two parts of the model (“mu” and “hu”). Also, if this model is suitable, we should see more aligned predictions when using pp_check:

pp_check(Hurdle_Fit1, ndraws = 100) +
labs(title = "Hurdle regression") +
theme_classic()
Figure 6: Posterior predictive checks hurdle model

As expected, our model predictions have a lower boundary at 0.

Modeling the dispersion of the data

Observing the data presented in the right graph of Figure 5 reveals a discrepancy between our empirical findings and our theoretical understanding of the subject. Based on established knowledge, we expect a higher probability of non-zero cell counts in the subventricular zone (SVZ) of the ipsilateral hemisphere, especially following an injury. This is because the ipsilateral SVZ typically becomes a hub of cellular activity, with significant cell proliferation post-injury. Our data, indicating prevalent non-zero counts in this region, supports this biological expectation.

However, the current model predictions do not fully align with these insights. This divergence underscores the importance of incorporating scientific understanding into our statistical modeling. Relying solely on standard tests without contextual adaptation can lead to misleading conclusions.

To address this, we can refine our model by specifically adjusting the hu parameter, which represents the probability of zero occurrences. This allows us to more accurately reflect the expected biological activity in the ipsilateral hemisphere’s SVZ. We build then a second hurdle model:

Hurdle_Mdl2 <- bf(Cells ~ Hemisphere, 
hu ~ Hemisphere)

Hurdle_Fit2 <- brm(
formula = Hurdle_Mdl2,
data = Svz_data,
family = hurdle_poisson(),
# seed for reproducibility purposes
seed = 8807,
control = list(adapt_delta = 0.99),
# this is to save the model in my laptop
file = "Models/2024-04-19_CountsZeroInflated/Hurdle_Fit2.rds",
file_refit = "never")

# Add loo for model comparison
Hurdle_Fit2 <-
add_criterion(Hurdle_Fit2, c("loo", "waic", "bayes_R2"))

Let’s see first if the results graph aligns with our hypothesis:

Hurdle_CE <- 
conditional_effects(Hurdle_Fit2)

Hurdle_CE <- plot(Hurdle_CE,
plot = FALSE)[[1]]

Hurdle_Com <- Hurdle_CE +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")

Hurdle_CE_hu <-
conditional_effects(Hurdle_Fit2, dpar = "hu")

Hurdle_CE_hu <- plot(Hurdle_CE_hu,
plot = FALSE)[[1]]

Hurdle_hu <- Hurdle_CE_hu +
Plot_theme +
theme(legend.position = "bottom", legend.direction = "horizontal")

Hurdle_Com | Hurdle_hu

Figure 7: Conditional effects for the hurdle fit 2

This revised modeling approach seems to be a substantial improvement. By specifically accounting for the higher probability of zero counts (~75%) in the contralateral hemisphere, the model now aligns more closely with both the observed data and our scientific knowledge. This adjustment not only reflects the expected lower cell activity in this region but also enhances the precision of our estimates. With these changes, the model now offers a more nuanced interpretation of cellular dynamics post-injury. Let’s see the summary and the TRANSFORMATION FOR THE hu parameters (do not look at the others) to visualize them in a probability scale using the logit2prob function we created at the beginning.

logit2prob(fixef(Hurdle_Fit2))

Although the estimates for the number of cells are similar, the hu parameters (in the logit scale) tells us that the probability for seeing zeros in the contralateral hemisphere is:

Conversely:

Depicts a drastic reduction to about 0.23% probability of observing zero cell counts in the injured (ipsilateral) hemisphere. This is a remarkable change in our estimates.

Now, let’s explore if a zero_inflated_poisson() distribution family changes these insights.

Leave a Reply

Your email address will not be published. Required fields are marked *