How to Simulate the Number of Losses on a Portfolio?

Simulations, Frequency, Trend

Published: November 15, 2023

In a previous post we simulated a portfolio of insurance policies. Today, we simulate its loss count.

We will need the portfolio data frame policy_df that we generated back then. The code below regenerates that data with the actuarialRecipes package:

# devtools::install_github("dreanod/actuarialRecipes")
library(actuarialRecipes)

simulated_years <- 2010:2015
initial_policy_count <- 12842
portfolio_growth_rate <- .5 / 100
initial_avg_premium <- 87
rate_changes <- tibble::tibble(
  effective_date = lubridate::dmy(c(
    "04/01/2011", "07/01/2012", "10/01/2013",
    "07/01/2014", "10/01/2015", "01/01/2016"
  )),
  rate_change = c(-5.0, 10.0, 5.0, -2.0, 5.0, 5.0) / 100,
)
premium_trend <- 2 / 100
policy_length <- 6
n_expo_per_policy <- 1

policy_df <- simulate_portfolio(
  sim_years = simulated_years,
  initial_policy_count = initial_policy_count,
  ptf_growth = portfolio_growth_rate,
  n_expo_per_policy = n_expo_per_policy,
  policy_length = policy_length,
  initial_avg_premium = initial_avg_premium,
  premium_trend = premium_trend,
  rate_change_data = rate_changes
)
 policy_df
# A tibble: 78,022 × 5
   policy_id      inception_date expiration_date n_expo premium
   <chr>          <date>         <date>           <dbl>   <dbl>
 1 policy_2010_1  2010-01-01     2010-06-30           1    87  
 2 policy_2010_2  2010-01-01     2010-06-30           1    87.0
 3 policy_2010_3  2010-01-01     2010-06-30           1    87.0
 4 policy_2010_4  2010-01-01     2010-06-30           1    87.0
 5 policy_2010_5  2010-01-01     2010-06-30           1    87.0
 6 policy_2010_6  2010-01-01     2010-06-30           1    87.0
 7 policy_2010_7  2010-01-01     2010-06-30           1    87.0
 8 policy_2010_8  2010-01-01     2010-06-30           1    87.0
 9 policy_2010_9  2010-01-01     2010-06-30           1    87.0
10 policy_2010_10 2010-01-01     2010-06-30           1    87.0
# … with 78,012 more rows

We also need to load these packages from the tidyverse:

library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)

We may be interested either in bulk simulations of the total number of losses over a period of time, or in how many losses each policy of the portfolio generates. We will look at both situations in this post.

Choice of Frequency Distribution

I will assume that the loss counts of the policies are independent and Poisson distributed. This implicitly assumes that the timing of losses is independent¹. In other words, the fact that a policy had a loss at a given time does not change the distribution of future losses (for this policy or for the rest of the portfolio). This intuitively makes sense and is mostly in line with real data².

I will assume that the annual loss frequency for exposures written on January 1, 2010 is 5.87%. This means that one exposure written on that date will incur on average 0.0587 losses in a year. We further assume that the frequency decreases at an annual rate of 1%.

initial_freq <- 0.0587
freq_trend <- -1 / 100
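
As a quick illustration of how the trend applies, an exposure written about two and a half years after January 1, 2010 (around mid-2012) would have an annual frequency of roughly:

initial_freq * (1 + freq_trend)^2.5   # approximately 0.0572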

Bulk Simulation

In a bulk simulation, we simulate the number of losses for a cohort of exposures over a period of time. In this case, we will look at the number of losses generated by the policies written in each year.

Calculation of the Annual Loss Frequencies

The total claim count for a group of policies is Poisson distributed, with a frequency equal to the sum of the frequencies of the individual policy exposures. This is because the claim count of each exposure is Poisson and independent of the others. For simplicity, we assume that the frequency is the same for all exposures written in the same year. As a result, the expected claim count of any given policy year is proportional to the number of written exposures.
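
As a quick sanity check, we can verify this additivity property by simulation with some made-up numbers: the total of many independent Poisson counts has a mean and a variance that are both equal to the sum of the individual rates, as expected for a Poisson variable.

# Sketch with hypothetical numbers: the sum of independent Poisson counts
# behaves like a single Poisson with the summed rate
set.seed(42)
n_expo <- 1000                # hypothetical number of exposures
lambda <- 0.05                # hypothetical per-exposure annual frequency
totals <- replicate(5000, sum(rpois(n_expo, lambda)))
mean(totals)                  # close to n_expo * lambda = 50
var(totals)                   # also close to 50, as for a Poisson(50) variable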

We therefore calculate the number of written exposures per year:

freq_df <- policy_df |>
  group_by(policy_year = year(inception_date)) |>
  summarize(n_written_expo = sum(n_expo))
display_table(freq_df)
policy_year n_written_expo
2010 12842
2011 12906
2012 12971
2013 13036
2014 13101
2015 13166

Number of Exposures Written per Year

Now, we can calculate the frequency for exposures written in each annual cohort. The frequency needs to be trended from January 1, 2010 to the average inception date of each cohort:

initial_freq_date <- ymd("2010-01-01")

freq_df <- policy_df |>
  group_by(PY = year(inception_date)) |>
  summarize(n_expo = sum(n_expo),
            mean_incept_dt = mean(inception_date),
            trending_period = (initial_freq_date %--% mean_incept_dt) / years(1),
            trending_factor = (1 + freq_trend)^trending_period,
            freq_per_expo = initial_freq * trending_factor,
            total_freq = freq_per_expo * n_expo)
display_table(freq_df)
PY n_expo mean_incept_dt trending_period trending_factor freq_per_expo total_freq
2010 12842 2010-07-02 0.4999611 0.9949878 0.0584058 750.0471
2011 12906 2011-07-02 1.4999613 0.9850379 0.0578217 746.2472
2012 12971 2012-07-01 2.4999615 0.9751876 0.0572435 742.5056
2013 13036 2013-07-02 3.4999616 0.9654357 0.0566711 738.7641
2014 13101 2014-07-02 4.4999618 0.9557813 0.0561044 735.0233
2015 13166 2015-07-02 5.4999620 0.9462235 0.0555433 731.2834

Calculation of the Annual Aggregate Frequency of Loss

Calculation of the Expected Claim Count over the Policy Period

The frequency we calculated is on an annual basis. We therefore have to divide it by two to obtain the expected claim count for our semi-annual policies:

freq_df <- freq_df |>
  mutate(exp_claim_count = total_freq / 2)
freq_df |>
  select(PY, total_freq, exp_claim_count) |>
  display_table()
PY total_freq exp_claim_count
2010 750.0471 375.0235
2011 746.2472 373.1236
2012 742.5056 371.2528
2013 738.7641 369.3821
2014 735.0233 367.5116
2015 731.2834 365.6417

Expected Claim Count per Policy Year

Simulating the Number of Losses

We can now simulate the number of losses for each year:

set.seed(100)
freq_df <- freq_df |> 
  mutate(loss_count = rpois(n(), lambda = exp_claim_count))
Setting the seed with set.seed ensures that we can reproduce the same random outputs every time we run the code.
freq_df |> 
  select(PY, exp_claim_count, loss_count) |> 
  display_table()
PY exp_claim_count loss_count
2010 375.0235 365
2011 373.1236 342
2012 371.2528 388
2013 369.3821 371
2014 367.5116 373
2015 365.6417 354

Simulated Loss Count per Policy Year

Overview of the Simulated Data

We can compare the expected frequency to the simulated total loss count:

freq_df |> 
  rename(`Expected Frequency` = exp_claim_count,
         `Simulated Loss Count` = loss_count) |>
  pivot_longer(c("Expected Frequency", "Simulated Loss Count")) |>
  ggplot(aes(PY, value, color = name)) +
  geom_line() + geom_point() +
  ylab("Loss Count") + xlab("Policy Year") +
  labs(color = "")

Comparing Expected Claim Frequency to Simulated Loss Count

Our simulated loss count is close to the expected frequency. However, the simulated counts fluctuate around the expected frequency and we cannot discern the downward trend.

This is due to the natural variability of the Poisson distribution. We can show this by comparing the simulated loss count to the 90% confidence interval of the distribution:

freq_df |>
  mutate(lower_bound = qpois(.05, exp_claim_count),
         upper_bound = qpois(.95, exp_claim_count)) |>
  select(PY, exp_claim_count, lower_bound, upper_bound, loss_count) |>
  display_table()
Here, qpois is the quantile function of the Poisson distribution.
PY exp_claim_count lower_bound upper_bound loss_count
2010 375.0235 343 407 365
2011 373.1236 342 405 342
2012 371.2528 340 403 388
2013 369.3821 338 401 371
2014 367.5116 336 399 373
2015 365.6417 334 397 354

Comparing Simulated Loss Count to Confidence Intervals

We can see from the table that all simulated loss counts fall within the 90% confidence interval, which helps us validate our results.
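
A rough order-of-magnitude comparison, using the 2010 and 2011 figures from the table, also explains why the downward trend is invisible over only six simulated years: the Poisson noise is about ten times larger than the annual decrease in the expected count.

sqrt(375.02)      # Poisson standard deviation: about 19 losses
375.02 - 373.12   # year-over-year decrease in the expected count: about 1.9 losses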

Policy-per-Policy Simulation

The previous simulation only tells us the total number of losses generated by each cohort. However, it does not tell us which policy or exposure generated which loss. This is limiting for some applications. For example, we may want the size of losses to vary depending on exposure characteristics. In this case, we need to simulate a loss count per policy. We show how to do this in the rest of this post.

Calculate the Average Frequency per Policy

First, we need to calculate the expected frequency of loss for each policy. We do this by trending the initial frequency from January 1, 2010 to the inception date of each policy:

policy_df <- policy_df |>
  mutate(trend_period = (initial_freq_date %--% inception_date) / years(1),
         trend_factor = (1 + freq_trend)^trend_period,
         frequency = initial_freq * n_expo * trend_factor)
policy_df |>
  group_by(policy_year = year(inception_date)) |>
  filter(row_number() == 1) |>
  ungroup() |>
  select(policy_id, inception_date, trend_period, trend_factor, frequency) |>
  display_table()
policy_id inception_date trend_period trend_factor frequency
policy_2010_1 2010-01-01 0 1.000000 0.0587000
policy_2011_1 2011-01-01 1 0.990000 0.0581130
policy_2012_1 2012-01-01 2 0.980100 0.0575319
policy_2013_1 2013-01-01 3 0.970299 0.0569566
policy_2014_1 2014-01-01 4 0.960596 0.0563870
policy_2015_1 2015-01-01 5 0.950990 0.0558231

Frequency Calculation for the First Policy of Each Year

We can check from this table that we do have a -1% change in frequency between policies that are written one year apart.
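
We can confirm this with a quick side calculation, taking the 2010 and 2011 values from the table:

# Ratio of the frequencies of the first policies written one year apart
0.0581130 / 0.0587000   # equals 0.99, i.e. a -1% change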

Calculate the Expected Claim Count per Policy

We need to calculate each policy's expected total claim count by restating the annual frequency for the actual policy duration:

policy_df <- policy_df |>
  mutate(policy_duration = (inception_date %--% (expiration_date + days(1))) / years(1),
         exp_claim_count = policy_duration * frequency)
policy_df |>
  group_by(policy_year = year(inception_date)) |>
  filter(row_number() == 1) |>
  ungroup() |>
  select(policy_id, inception_date, frequency, exp_claim_count) |>
  display_table()
policy_id inception_date frequency exp_claim_count
policy_2010_1 2010-01-01 0.0587000 0.0291088
policy_2011_1 2011-01-01 0.0581130 0.0288177
policy_2012_1 2012-01-01 0.0575319 0.0286087
policy_2013_1 2013-01-01 0.0569566 0.0282442
policy_2014_1 2014-01-01 0.0563870 0.0279618
policy_2015_1 2015-01-01 0.0558231 0.0276821

Expected Claim Count for the First Policy of Each Year

Simulate a Poisson Claim Count for Each Policy

With the expected per-policy frequency, we can simulate the number of actual losses, assuming independence and a Poisson distribution:

set.seed(100)
policy_df <- policy_df |>
  mutate(n_claims = rpois(n(), exp_claim_count))
rpois simulates a Poisson random variable for each row of the policy_df data frame, with the mean given by the corresponding exp_claim_count value.
policy_df |>
  select(policy_id, inception_date, exp_claim_count, n_claims) |>
  group_by(policy_year = year(inception_date)) |>
  filter(row_number() == 1) |>
  ungroup() |>
  display_table()
policy_id inception_date exp_claim_count n_claims policy_year
policy_2010_1 2010-01-01 0.0291088 0 2010
policy_2011_1 2011-01-01 0.0288177 0 2011
policy_2012_1 2012-01-01 0.0286087 0 2012
policy_2013_1 2013-01-01 0.0282442 0 2013
policy_2014_1 2014-01-01 0.0279618 0 2014
policy_2015_1 2015-01-01 0.0276821 0 2015

Simulated Claim Count for the First Policy of Each Year

As the expected claim count per policy is around 0.03, we expect most policies to have no claims.
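
To make this concrete, for a policy with an expected claim count of 0.029 the Poisson probability of observing zero claims is:

dpois(0, lambda = 0.029)   # equals exp(-0.029), about 0.971

so roughly 97% of the policies should have no claim at all.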

Check Results and Compare

As a check, we compare these results to the bulk simulations above. We therefore aggregate the policy table to obtain annual expected frequencies and simulated claim counts:

summary_df <- policy_df |>
  group_by(PY = year(inception_date)) |>
  summarize(
    total_expected_loss_count = sum(exp_claim_count),
    total_simulated_loss_count = sum(n_claims),
  )

display_table(summary_df)
PY total_expected_loss_count total_simulated_loss_count
2010 373.9791 378
2011 371.9272 351
2012 370.3819 366
2013 368.3533 380
2014 366.4882 362
2015 364.4694 354

Aggregated Per-Policy Expected and Simulated Loss Counts per Policy Year

We plot this data against the bulk results to ease the comparison:

summary_df_per_policy <- summary_df |>
  rename(`Per-Policy Expected` = total_expected_loss_count,
         `Per-Policy Simulated` = total_simulated_loss_count) |>
  pivot_longer(c("Per-Policy Expected", "Per-Policy Simulated"))

summary_df_bulk <- freq_df |> 
  rename(`Bulk Expected` = exp_claim_count,
         `Bulk Simulated` = loss_count) |>
  pivot_longer(c("Bulk Expected", "Bulk Simulated"))

summary_df <- bind_rows(summary_df_bulk, summary_df_per_policy)

summary_df |>
  ggplot(aes(PY, value, color = name)) +
  geom_line() + geom_point() +
  ylab("Loss Count") + xlab("Policy Year") +
  labs(color = "", subtitle = "*Bulk and per-policy expected are overlapping")

Comparing Bulk Annual to Per-Policy Frequency Simulations

The bulk and the per-policy expected loss frequencies are very close, but not exactly the same. This is because the policy durations are not exactly half a year, as we assumed in the bulk simulation. In fact, a policy's duration can be slightly shorter or longer depending on its inception date, and on average it is slightly shorter across the portfolio. This results in a lower expected claim count for the per-policy simulation than for the bulk simulation.
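
We can check this directly, reusing the policy_duration column computed earlier:

# Average policy duration in years, versus the flat 0.5 years
# assumed in the bulk simulation
mean(policy_df$policy_duration)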

On the other hand, the simulated loss counts are different due to the natural volatility of the Poisson distribution. We can however verify that the per-policy simulated loss count is within the 90% confidence interval we derived for the bulk simulation.

The Function That Does it All

The simulate_loss_count function from the actuarialRecipes package reproduces the per-policy simulation:

# devtools::install_github("dreanod/actuarialRecipes")
library(actuarialRecipes)

initial_freq <- 0.0587
initial_freq_date <- lubridate::ymd("2010-01-01")
freq_trend <- -1 / 100

set.seed(100)
loss_count <- simulate_loss_count(
  portfolio = policy_df,
  initial_freq = initial_freq,
  initial_freq_date = initial_freq_date,
  freq_trend = freq_trend
)

The returned value loss_count is a vector with the number of losses for each policy in policy_df.
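
For example, here is a short usage sketch that attaches the simulated counts to the portfolio (relying on the vector being aligned with the rows of policy_df, as described above) and aggregates them per policy year:

policy_df |>
  mutate(sim_loss_count = loss_count) |>
  group_by(policy_year = year(inception_date)) |>
  summarize(total_losses = sum(sim_loss_count))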

Conclusion

In summary, we have seen two approaches to simulating loss counts for a portfolio of policies: a bulk approach that simulates claim counts for annual cohorts of written policies, and a per-policy approach that simulates the claim count of each individual policy. The latter approach is the more flexible, as it gives us the possibility to simulate a dataset of individual losses, with claim amounts and occurrence dates. We will explore this topic in a future post.

Footnotes

  1. This is linked to the memory-free property of the exponential distribution. But this is a topic for another post.

  2. Well, in many situations the loss distribution is slightly “over-dispersed”, and one can adjust the Poisson distribution or use another one to reflect this. This is again a topic for another post. In most cases, however, the Poisson distribution is just fine as a first approximation.