How to Simulate the Number of Losses on a Portfolio?

Simulations, Frequency, Trend

Published: November 15, 2023

In a previous post we simulated a portfolio of insurance policies. Today, we simulate its loss count.

We will need the portfolio data frame policy_df that we generated back then. The code below regenerates that data with the actuarialRecipes package:

# devtools::install_github("dreanod/actuarialRecipes")
library(actuarialRecipes)

simulated_years <- 2010:2015
initial_policy_count <- 12842
portfolio_growth_rate <- .5 / 100
initial_avg_premium <- 87
rate_changes <- tibble::tibble(
  effective_date = lubridate::dmy(c(
    "04/01/2011", "07/01/2012", "10/01/2013",
    "07/01/2014", "10/01/2015", "01/01/2016"
  )),
  rate_change = c(-5.0, 10.0, 5.0, -2.0, 5.0, 5.0) / 100,
)
premium_trend <- 2 / 100
policy_length <- 6
n_expo_per_policy <- 1

policy_df <- simulate_portfolio(
  sim_years = simulated_years,
  initial_policy_count = initial_policy_count,
  ptf_growth = portfolio_growth_rate,
  n_expo_per_policy = n_expo_per_policy,
  policy_length = policy_length,
  initial_avg_premium = initial_avg_premium,
  premium_trend = premium_trend,
  rate_change_data = rate_changes
)
 policy_df
# A tibble: 78,022 × 5
   policy_id      inception_date expiration_date n_expo premium
   <chr>          <date>         <date>           <dbl>   <dbl>
 1 policy_2010_1  2010-01-01     2010-06-30           1    87  
 2 policy_2010_2  2010-01-01     2010-06-30           1    87.0
 3 policy_2010_3  2010-01-01     2010-06-30           1    87.0
 4 policy_2010_4  2010-01-01     2010-06-30           1    87.0
 5 policy_2010_5  2010-01-01     2010-06-30           1    87.0
 6 policy_2010_6  2010-01-01     2010-06-30           1    87.0
 7 policy_2010_7  2010-01-01     2010-06-30           1    87.0
 8 policy_2010_8  2010-01-01     2010-06-30           1    87.0
 9 policy_2010_9  2010-01-01     2010-06-30           1    87.0
10 policy_2010_10 2010-01-01     2010-06-30           1    87.0
# … with 78,012 more rows

We also need to load these packages from the tidyverse:

library(dplyr)
library(lubridate)
library(ggplot2)
library(tidyr)

We may be interested either in bulk simulations of the total number of losses over a period of time, or in how many losses each policy of the portfolio generates. We will look at both situations in this post.

Choice of Frequency Distribution

I will assume that the loss counts of the policies are independent and Poisson distributed. This implicitly assumes that the timing of losses is independent¹. In other words, the fact that a policy had a loss at a given time does not change the distribution of future losses (for this policy or for the rest of the portfolio). This intuitively makes sense and is mostly in line with real data².

I will assume that the annual loss frequency for exposures written on January 1, 2010 is 5.87%. This means that one exposure written on that date will incur on average 0.0587 losses in a year. We further assume that the frequency decreases at an annual rate of 1%.

initial_freq <- 0.0587
freq_trend <- -1 / 100
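
As a quick illustration of how the trend applies, an exposure written about two and a half years after January 1, 2010 (around mid-2012) would have an annual frequency of roughly:

initial_freq * (1 + freq_trend)^2.5   # approximately 0.0572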

Bulk Simulation

In a bulk simulation, we simulate the number of losses for a cohort of exposures over a period of time. In this case, we will look at the number of losses generated by the policies written in each year.

Calculation of the Annual Loss Frequencies

The total claim count for a group of policies is Poisson distributed, with a frequency equal to the sum of the frequencies of the individual policy exposures. This is because the claim count of each exposure is Poisson and independent of the others. For simplicity, we assume that the frequency is the same for all exposures written in the same year. As a result, the expected claim count of any given policy year is proportional to the number of written exposures.
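
As a quick sanity check, we can verify this additivity property by simulation with some made-up numbers: the total of many independent Poisson counts has a mean and a variance that are both equal to the sum of the individual rates, as expected for a Poisson variable.

# Sketch with hypothetical numbers: the sum of independent Poisson counts
# behaves like a single Poisson with the summed rate
set.seed(42)
n_expo <- 1000                # hypothetical number of exposures
lambda <- 0.05                # hypothetical per-exposure annual frequency
totals <- replicate(5000, sum(rpois(n_expo, lambda)))
mean(totals)                  # close to n_expo * lambda = 50
var(totals)                   # also close to 50, as for a Poisson(50) variable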

We therefore calculate the number of written exposures per year:

freq_df <- policy_df |>
  group_by(policy_year = year(inception_date)) |>
  summarize(n_written_expo = sum(n_expo))
display_table(freq_df)
policy_year n_written_expo
2010 12842
2011 12906
2012 12971
2013 13036
2014 13101
2015 13166

Number of Exposures Written per Year

Now, we can calculate the frequency for exposures written in each annual cohort. The frequency needs to be trended from January 1, 2010 to the average inception date of each cohort:

initial_freq_date <- ymd("2010-01-01")

freq_df <- policy_df |>
  group_by(PY = year(inception_date)) |>
  summarize(n_expo = sum(n_expo),
            mean_incept_dt = mean(inception_date),
            trending_period = (initial_freq_date %--% mean_incept_dt) / years(1),
            trending_factor = (1 + freq_trend)^trending_period,
            freq_per_expo = initial_freq * trending_factor,
            total_freq = freq_per_expo * n_expo)
display_table(freq_df)
PY n_expo mean_incept_dt trending_period trending_factor freq_per_expo total_freq
2010 12842 2010-07-02 0.4999611 0.9949878 0.0584058 750.0471
2011 12906 2011-07-02 1.4999613 0.9850379 0.0578217 746.2472
2012 12971 2012-07-01 2.4999615 0.9751876 0.0572435 742.5056
2013 13036 2013-07-02 3.4999616 0.9654357 0.0566711 738.7641
2014 13101 2014-07-02 4.4999618 0.9557813 0.0561044 735.0233
2015 13166 2015-07-02 5.4999620 0.9462235 0.0555433 731.2834

Calculation of the Annual Aggregate Frequency of Loss

Calculation of the Expected Claim Count over the Policy Period

The frequency we calculated is on an annual basis. We therefore have to divide it by two to obtain the expected claim count for our semi-annual policies:

freq_df <- freq_df |>
  mutate(exp_claim_count = total_freq / 2)
freq_df |>
  select(PY, total_freq, exp_claim_count) |>
  display_table()
PY total_freq exp_claim_count
2010 750.0471 375.0235
2011 746.2472 373.1236
2012 742.5056 371.2528
2013 738.7641 369.3821
2014 735.0233 367.5116
2015 731.2834 365.6417

Expected Claim Count per Policy Year

Simulating the Number of Losses

We can now simulate the number of losses for each year:

set.seed(100)
freq_df <- freq_df |> 
  mutate(loss_count = rpois(n(), lambda = exp_claim_count))
Setting the seed with set.seed ensures that we can reproduce the same random outputs every time we run the code.
freq_df |> 
  select(PY, exp_claim_count, loss_count) |> 
  display_table()
PY exp_claim_count loss_count
2010 375.0235 365
2011 373.1236 342
2012 371.2528 388
2013 369.3821 371
2014 367.5116 373
2015 365.6417 354

Simulated Loss Count per Policy Year

Overview of the Simulated Data

We can compare the expected frequency to the simulated total loss count:

freq_df |> 
  rename(`Expected Frequency` = exp_claim_count,
         `Simulated Loss Count` = loss_count) |>
  pivot_longer(c("Expected Frequency", "Simulated Loss Count")) |>
  ggplot(aes(PY, value, color = name)) +
  geom_line() + geom_point() +
  ylab("Loss Count") + xlab("Policy Year") +
  labs(color = "")

Comparing Expected Claim Frequency to Simulated Loss Count

Our simulated loss count is close to the expected frequency. However, the simulated counts fluctuate around the expected frequency and we cannot discern the downward trend.

This is due to the natural variability of the Poisson distribution. We can show this by comparing the simulated loss count to the 90% confidence interval of the distribution:

freq_df |>
  mutate(lower_bound = qpois(.05, exp_claim_count),
         upper_bound = qpois(.95, exp_claim_count)) |>
  select(PY, exp_claim_count, lower_bound, upper_bound, loss_count) |>
  display_table()
Here, qpois is the quantile function of the Poisson distribution.
PY exp_claim_count lower_bound upper_bound loss_count
2010 375.0235 343 407 365
2011 373.1236 342 405 342
2012 371.2528 340 403 388
2013 369.3821 338 401 371
2014 367.5116 336 399 373
2015 365.6417 334 397 354

Comparing Simulated Loss Count to Confidence Intervals

We can see from the table that all simulated loss counts fall within the 90% confidence interval, which helps us validate our results.
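
A rough order-of-magnitude comparison, using the 2010 and 2011 figures from the table, also explains why the downward trend is invisible over only six simulated years: the Poisson noise is about ten times larger than the annual decrease in the expected count.

sqrt(375.02)      # Poisson standard deviation: about 19 losses
375.02 - 373.12   # year-over-year decrease in the expected count: about 1.9 losses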

Policy-per-Policy Simulation

The previous simulation only tells us the total number of losses generated by each cohort. However, it does not tell us which policy or exposure generated which loss. This is limiting for some applications. For example, we may want the size of losses to vary depending on exposure characteristics. In this case, we need to simulate a loss count per policy. We show how to do this in the rest of this post.

Calculate the Average Frequency per Policy

First, we need to calculate the expected frequency of loss for each policy. We do this by trending the initial frequency from January 1, 2010 to the inception date of each policy:

policy_df <- policy_df |>
  mutate(trend_period = (initial_freq_date %--% inception_date) / years(1),
         trend_factor = (1 + freq_trend)^trend_period,
         frequency = initial_freq * n_expo * trend_factor)
policy_df |>
  group_by(policy_year = year(inception_date)) |>
  filter(row_number() == 1) |>
  ungroup() |>
  select(policy_id, inception_date, trend_period, trend_factor, frequency) |>
  display_table()
policy_id inception_date trend_period trend_factor frequency
policy_2010_1 2010-01-01 0 1.000000 0.0587000
policy_2011_1 2011-01-01 1 0.990000 0.0581130
policy_2012_1 2012-01-01 2 0.980100 0.0575319
policy_2013_1 2013-01-01 3 0.970299 0.0569566
policy_2014_1 2014-01-01 4 0.960596 0.0563870
policy_2015_1 2015-01-01 5 0.950990 0.0558231

Frequency Calculation for the First Policy of Each Year

We can check from this table that we do have a -1% change in frequency between policies that are written one year apart.
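
We can confirm this with a quick side calculation, taking the 2010 and 2011 values from the table:

# Ratio of the frequencies of the first policies written one year apart
0.0581130 / 0.0587000   # equals 0.99, i.e. a -1% change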

Calculate the Expected Claim Count per Policy

We need to calculate each policy's expected total claim count by restating the annual frequency for the actual policy duration:

policy_df <- policy_df |>
  mutate(policy_duration = (inception_date %--% (expiration_date + days(1))) / years(1),
         exp_claim_count = policy_duration * frequency)
policy_df |>
  group_by(policy_year = year(inception_date)) |>
  filter(row_number() == 1) |>
  ungroup() |>
  select(policy_id, inception_date, frequency, exp_claim_count) |>
  display_table()
policy_id inception_date frequency exp_claim_count
policy_2010_1 2010-01-01 0.0587000 0.0291088
policy_2011_1 2011-01-01 0.0581130 0.0288177
policy_2012_1 2012-01-01 0.0575319 0.0286087
policy_2013_1 2013-01-01 0.0569566 0.0282442
policy_2014_1 2014-01-01 0.0563870 0.0279618
policy_2015_1 2015-01-01 0.0558231 0.0276821

Expected Claim Count for the First Policy of Each Year

Simulate a Poisson Claim Count for Each Policy

With the expected per-policy frequency, we can simulate the number of actual losses, assuming independence and a Poisson distribution:

set.seed(100)
policy_df <- policy_df |>
  mutate(n_claims = rpois(n(), exp_claim_count))
rpois simulates a Poisson random variable for each row of the policy_df data frame, with the mean given by the corresponding exp_claim_count value.
policy_df |>
  select(policy_id, inception_date, exp_claim_count, n_claims) |>
  group_by(policy_year = year(inception_date)) |>
  filter(row_number() == 1) |>
  ungroup() |>
  display_table()
policy_id inception_date exp_claim_count n_claims policy_year
policy_2010_1 2010-01-01 0.0291088 0 2010
policy_2011_1 2011-01-01 0.0288177 0 2011
policy_2012_1 2012-01-01 0.0286087 0 2012
policy_2013_1 2013-01-01 0.0282442 0 2013
policy_2014_1 2014-01-01 0.0279618 0 2014
policy_2015_1 2015-01-01 0.0276821 0 2015

Simulated Claim Count for the First Policy of Each Year

As the expected claim count per policy is around 0.03, we expect most policies to have no claims.
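
To make this concrete, for a policy with an expected claim count of 0.029 the Poisson probability of observing zero claims is:

dpois(0, lambda = 0.029)   # equals exp(-0.029), about 0.971

so roughly 97% of the policies should have no claim at all.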

Check Results and Compare

As a check, we compare these results to the bulk simulations above. We therefore aggregate the policy table to obtain annual expected frequencies and simulated claim counts:

summary_df <- policy_df |>
  group_by(PY = year(inception_date)) |>
  summarize(
    total_expected_loss_count = sum(exp_claim_count),
    total_simulated_loss_count = sum(n_claims),
  )

display_table(summary_df)
PY total_expected_loss_count total_simulated_loss_count
2010 373.9791 378
2011 371.9272 351
2012 370.3819 366
2013 368.3533 380
2014 366.4882 362
2015 364.4694 354

Aggregated Per-Policy Expected and Simulated Loss Counts per Policy Year

We plot this data against the bulk results to ease the comparison:

summary_df_per_policy <- summary_df |>
  rename(`Per-Policy Expected` = total_expected_loss_count,
         `Per-Policy Simulated` = total_simulated_loss_count) |>
  pivot_longer(c("Per-Policy Expected", "Per-Policy Simulated"))

summary_df_bulk <- freq_df |> 
  rename(`Bulk Expected` = exp_claim_count,
         `Bulk Simulated` = loss_count) |>
  pivot_longer(c("Bulk Expected", "Bulk Simulated"))

summary_df <- bind_rows(summary_df_bulk, summary_df_per_policy)

summary_df |>
  ggplot(aes(PY, value, color = name)) +
  geom_line() + geom_point() +
  ylab("Loss Count") + xlab("Policy Year") +
  labs(color = "", subtitle = "*Bulk and per-policy expected are overlapping")

Comparing Bulk Annual to Per-Policy Frequency Simulations

The bulk and the per-policy expected loss frequencies are very close, but not exactly the same. This is because the policy durations are not exactly half a year, as we assumed in the bulk simulation. In fact, a policy's duration can be slightly shorter or longer depending on its inception date, and on average it is slightly shorter across the portfolio. This results in a lower expected claim count for the per-policy simulation than for the bulk simulation.
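
We can check this directly, reusing the policy_duration column computed earlier:

# Average policy duration in years, versus the flat 0.5 years
# assumed in the bulk simulation
mean(policy_df$policy_duration)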

On the other hand, the simulated loss counts are different due to the natural volatility of the Poisson distribution. We can however verify that the per-policy simulated loss count is within the 90% confidence interval we derived for the bulk simulation.

The Function That Does it All

The simulate_loss_count function from the actuarialRecipes package reproduces the per-policy simulation:

# devtools::install_github("dreanod/actuarialRecipes")
library(actuarialRecipes)

initial_freq <- 0.0587
initial_freq_date <- lubridate::ymd("2010-01-01")
freq_trend <- -1 / 100

set.seed(100)
loss_count <- simulate_loss_count(
  portfolio = policy_df,
  initial_freq = initial_freq,
  initial_freq_date = initial_freq_date,
  freq_trend = freq_trend
)

The returned value loss_count is a vector with the number of losses for each policy in policy_df.
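
For example, here is a short usage sketch that attaches the simulated counts to the portfolio (relying on the vector being aligned with the rows of policy_df, as described above) and aggregates them per policy year:

policy_df |>
  mutate(sim_loss_count = loss_count) |>
  group_by(policy_year = year(inception_date)) |>
  summarize(total_losses = sum(sim_loss_count))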

Conclusion

In summary, we have seen two approaches to simulating loss counts for a portfolio of policies: a bulk approach that simulates claim counts for annual cohorts of written policies, and a per-policy approach that simulates the claim count of each individual policy. The latter approach is the more flexible, as it gives us the possibility to simulate a dataset of individual losses, with claim amounts and occurrence dates. We will explore this topic in a future post.

Footnotes

  1. This is linked to the memory-free property of the exponential distribution. But this is a topic for another post.

  2. Well, in many situations the loss distribution is slightly “over-dispersed”, and one can adjust the Poisson distribution or use another one to reflect this. This is again a topic for another post. In most cases, however, the Poisson distribution is just fine as a first approximation.