Insurance (Linear programming and discrete optimization)?

Welcome to adulthood. After three years of marriage (my friends definitely empathize), things got real and it was time to do some grown-up things around our life plan. After working on Wall Street and at several FinTech startups spanning personal finance, financial planning, mortgage, and property and casualty insurance, embarrassingly I had never considered life insurance. My wife and I realized that we both probably should have gotten some earlier in our lives, when it was much cheaper, but better late than never, right?

There are numerous views on how much life insurance coverage is necessary. A quick Google search will surface various methods, ranging from the extremely simple (e.g., a simple multiple based on age bracket) to more complex cash flow modeling around projected income and/or expenses. For your sake, I'll keep this post focused on the fun data/math problems I solved.

Full disclaimer here again: I’m not a life insurance expert, the below was just a fun problem to explore some methods/packages I was previously unfamiliar with.

I scraped multiple policy quotes and transcribed them into a spreadsheet, and that's when I noticed something interesting. See below for a glimpse of the dataframe.

policy_id insurer term coverage yearly_premium
0 1 Insurer 1 10 $100,000 $98.00
1 2 Insurer 1 10 $250,000 $143.00
2 3 Insurer 2 10 $250,000 $228.00
3 4 Insurer 1 10 $500,000 $205.00
4 5 Insurer 2 10 $500,000 $370.00

I started to wonder whether these quotes could be viewed as a potential portfolio with optimal terms, coverage amounts, and, most importantly, costs. After a deeper look at a few plots, I realized these three parameters have some interesting correlations, which led me to suspect there might be an optimal portfolio of coverage.

# imports used throughout this post
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=(10, 8))
plt.scatter(x = df_matrix['term'], 
            y = df_matrix['coverage'], 
            s = df_matrix['yearly_premium'],
            alpha=0.3, 
            edgecolors='w',)

plt.xlabel('Term')
plt.ylabel('Coverage')
plt.title('Term - Coverage - Yearly Premium', y=1.03)

[Figure: scatter of term vs. coverage, marker size scaled by yearly premium]

The above plot is pretty, but perhaps not very intuitive. When we try a more explicit 3-dimensional plot, it gets more interesting: you can visualize some real linear relationships!

from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection (needed on older matplotlib)

fig = plt.figure(figsize=(14, 6))
ax = fig.add_subplot(111, projection='3d')

xs = df_matrix['term']
ys = df_matrix['coverage']
zs = df_matrix['yearly_premium']
ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w')

ax.set_xlabel('Term',labelpad=10)
ax.set_ylabel('Coverage',labelpad=15)
ax.set_zlabel('Premium')

plt.show()

[Figure: 3D scatter of term, coverage, and yearly premium]

You can see the linearity is particularly pronounced along coverage amounts and premium, which is intuitive (equally important, it's a good sanity check on the data). Let's create a new metric, premium per unit of coverage ($1,000), that helps us understand whether something is "cheap" or "expensive" across all of these parameters. You can probably see where we're headed, but let's flesh this out a bit more with some intuition before we get to the fun programming.

# Let's create a new metric
df_matrix['premium_per_thousand']=(df_matrix.yearly_premium/df_matrix.coverage)*1000
fig = plt.figure(figsize=(14, 6))
ax = fig.add_subplot(111, projection='3d')

xs = df_matrix['term']
ys = df_matrix['coverage']
zs = df_matrix['premium_per_thousand']
ax.scatter(xs, ys, zs, s=50, alpha=0.6, edgecolors='w')

ax.set_xlabel('Term',labelpad=10)
ax.set_ylabel('Coverage',labelpad=15)
ax.set_zlabel('Premium per $1k of Coverage')

plt.show()

[Figure: 3D scatter of term, coverage, and premium per $1k of coverage]

The above plot isn't really that helpful, so let's try another view, since these relationships seem to vary a great deal with the term.

sns.set(style="whitegrid")
ax = sns.barplot(x=df_matrix.term.astype('str'), y=df_matrix.premium_per_thousand)

[Figure: bar plot of premium per $1k of coverage by term]

Ok, here we go: premium per $1k of coverage appears to be more expensive at both shorter and longer terms. I speculate that shorter terms carry some fixed transactional costs, while longer terms are more expensive simply because the longer the term, the higher the likelihood of an event occurring during the term period.

df_matrix.head(2)

After taking a glimpse at the dataframe output below, we can clearly see the premium_per_thousand metric. If we know exactly the term we want and have a coverage amount range (and assume all insurers to be equal), this metric lets us select the most attractively priced policy (see the quick snippet after the table). It's simple and elegant, without a lot of complicated data science and math.

policy_id insurer term coverage yearly_premium monthly_premium premium_per_thousand
0 1 Insurer 1 10 100000.0 98.0 8.166667 0.980
1 2 Insurer 1 10 250000.0 143.0 11.916667 0.572
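
For example, suppose (hypothetically) we wanted a 10-year term and at least $500k of coverage; with the new metric, picking the cheapest policy per dollar of coverage is a one-liner. A minimal sketch, using the df_matrix columns shown above:

```python
# Hypothetical filter: 10-year term and at least $500k of coverage,
# then take the policy with the lowest premium per $1k of coverage.
candidates = df_matrix[(df_matrix.term == 10) & (df_matrix.coverage >= 500_000)]
print(candidates.nsmallest(1, 'premium_per_thousand'))
```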

Full disclaimer here again: I’m not a life insurance expert, the below was just a fun problem to explore some methods/packages I was unfamiliar with. This isn’t a recommendation on selecting life insurance.

What if we were flexible about the amount of coverage we needed and the term profile? And what if there were hypothetical instances where we could construct multi-policy portfolios with varying coverage amounts and terms over time, but with weighted averages similar to individual policies at a more attractive cost?

Hypothetical scenario: I'm interested in Policy A, which has a 20-year term and, for $1 million of coverage, costs $2,000 per year in premium.

Policy B has a 10-year term and, for $500k of coverage, costs $300 per year. Policy C has a 30-year term and, for $500k of coverage, costs $1,300 per year.

If I purchase policies B & C:

- Coverage: $500k + $500k = $1 million (for the first 10 years, of course, then sliding to $500k)
- Coverage-weighted average term: ($500k × 10 + $500k × 30) / ($500k + $500k) = 20 years
- Premium: $300 + $1,300 = $1,600 per year

Let’s say that I really only wanted $1 million of coverage for the first 10 years, but wanted to lock in coverage over 30 years.
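
To make that arithmetic concrete, here's a quick back-of-the-envelope check in Python (the numbers are just the hypothetical ones from the scenario above):

```python
# Back-of-the-envelope check of the hypothetical B & C portfolio:
# each policy is (coverage, term_in_years, yearly_premium)
policies = [(500_000, 10, 300),    # Policy B
            (500_000, 30, 1_300)]  # Policy C

total_coverage = sum(cov for cov, term, prem in policies)                     # 1,000,000 (first 10 years)
avg_term = sum(cov * term for cov, term, prem in policies) / total_coverage   # 20.0 years
total_premium = sum(prem for cov, term, prem in policies)                     # 1,600 per year vs. 2,000 for Policy A

print(total_coverage, avg_term, total_premium)
```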

The above might be a fun way to explore/refresh linear programming/optimization in Python. Linear programming is one of the simplest (and often overlooked) ways to perform optimization and is used across many business lines. In business operations it often helps evaluate the trade-offs between cost and efficiency (throughput) under very complicated constraints and dependencies. I personally used this approach to optimize call center operations, where varying constraints on types of headcount, hours of operation, etc. had real implications for the quality of the customer experience.

Let's take a look at our problem. The goal, or the metric we're trying to optimize, is the price (premium); this is often referred to as the objective function. The next step is to explicitly state the constraints; for example, we might require a minimum average term of 30 years and a minimum of $1.5 million of total coverage.

Next, I found the Python PuLP package, which has some pretty good documentation.

Here we go:

df_matrix['cov_x_term']=(df_matrix.coverage * df_matrix.term)
# Create a list of policy ids plus lookup dicts for term, coverage, premium, and coverage x term
insurance_items = list(df_matrix['policy_id'])
term = dict(zip(insurance_items,df_matrix['term']))
coverage = dict(zip(insurance_items,df_matrix['coverage']))
yearly_premium = dict(zip(insurance_items,df_matrix['yearly_premium']))
cov_x_term = dict(zip(insurance_items,df_matrix['cov_x_term']))
df_matrix.head(15)
policy_id insurer term coverage yearly_premium monthly_premium premium_per_thousand cov_x_term
0 1 Insurer 1 10 100000.0 98.0 8.166667 0.980000 1000000.0
1 2 Insurer 1 10 250000.0 143.0 11.916667 0.572000 2500000.0
2 3 Insurer 2 10 250000.0 228.0 19.000000 0.912000 2500000.0
3 4 Insurer 1 10 500000.0 205.0 17.083333 0.410000 5000000.0
4 5 Insurer 2 10 500000.0 370.0 30.833333 0.740000 5000000.0
5 6 Insurer 1 10 750000.0 277.0 23.083333 0.369333 7500000.0
6 7 Insurer 3 10 750000.0 488.0 40.666667 0.650667 7500000.0
7 8 Insurer 1 15 100000.0 100.0 8.333333 1.000000 1500000.0
8 9 Insurer 1 15 250000.0 130.0 10.833333 0.520000 3750000.0
9 10 Insurer 1 15 500000.0 250.0 20.833333 0.500000 7500000.0
10 11 Insurer 3 15 500000.0 378.0 31.500000 0.756000 7500000.0
11 12 Insurer 1 15 750000.0 345.0 28.750000 0.460000 11250000.0
12 13 Insurer 3 15 750000.0 521.0 43.416667 0.694667 11250000.0
13 14 Insurer 1 15 1750000.0 666.0 55.500000 0.380571 26250000.0
14 15 Insurer 1 20 100000.0 105.0 8.750000 1.050000 2000000.0


Now let's set up a framework to solve this thing with linear programming. In this case it will actually be integer programming, since we can't buy fractional weights (e.g., 0.3 of an insurance policy); we can only buy whole policies.

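Written out, the problem we're about to hand PuLP looks roughly like this (my own sketch of the formulation, where x_i is the number of units of policy i purchased, p_i its yearly premium, c_i its coverage, and t_i its term):

$$
\begin{aligned}
\min_{x}\quad & \sum_i p_i\, x_i && \text{(total yearly premium)}\\
\text{s.t.}\quad & \sum_i c_i\, x_i \ \ge\ \text{min\_tot\_cov} && \text{(minimum total coverage)}\\
& \sum_i c_i\, t_i\, x_i \ \ge\ \text{min\_avg\_term} \cdot \sum_i c_i\, x_i && \text{(minimum average term, linearized)}\\
& x_i \in \{0, 1, 2, \dots\} && \text{(whole policies only)}
\end{aligned}
$$

The second constraint is just the ratio constraint "coverage-weighted average term ≥ min_avg_term" multiplied through by total coverage so it stays linear; that's exactly what the cov_x_term column is for.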

from pulp import *
min_avg_term = 30
min_tot_cov = 1500000
prob = LpProblem("Insurance Problem",LpMinimize)
insurance_vars = LpVariable.dicts("Life",insurance_items,0,cat='Integer')
#Objective is the lowest yearly premium
prob += lpSum([yearly_premium[i]*insurance_vars[i] for i in insurance_items])

#Minimum Total Coverage
prob += lpSum([coverage[f] * insurance_vars[f] for f in insurance_items]) >= min_tot_cov, "MinimumTotalCoverage"

#Minimum Average Term
# The ratio constraint sum(coverage*term*x) / sum(coverage*x) >= min_avg_term is
# linearized by multiplying both sides by the total coverage.
prob += lpSum([cov_x_term[f] * insurance_vars[f] for f in insurance_items]) >= min_avg_term*(lpSum([coverage[f] * insurance_vars[f] for f in insurance_items])), "MinimumAvgDuration"

prob.solve()

print("Status:", LpStatus[prob.status])
Status: Optimal
#Now prob.variables() gives us the decision variables and their optimal weights; for example:
print(prob.variables()[3].varValue)
0.0

#Let's grab all the weights and put them into a dataframe so we can easily take a look
weights_dict = {}
for v in prob.variables():
    weights_dict.update({v.name[5:]:v.varValue})
weights_df = pd.DataFrame(weights_dict,index=[0]).transpose()
weights_df.reset_index(inplace=True)
weights_df.rename(columns={'index':'policy_id',0:'weights'},inplace=True)
#formatting these fields so we can compare them to the original matrix
weights_df.policy_id = weights_df.policy_id.apply(lambda l: float(l))
weights_df.weights = weights_df.weights.apply(lambda l: float(l))
weights_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 49 entries, 0 to 48
Data columns (total 2 columns):
policy_id    49 non-null float64
weights      49 non-null float64
dtypes: float64(2)
memory usage: 864.0 bytes
weights_df[weights_df.weights>0.0]
policy_id weights
39 45.0 1.0

Only one policy ends up with a weight of 1 for a minimum average term of 30 years and $1.5 million of coverage. For these constraints, a single policy appears to be optimally priced.

df_scenario = df_matrix.merge(weights_df,how='left',on='policy_id')
df_scenario.tail()
policy_id insurer term coverage yearly_premium monthly_premium premium_per_thousand cov_x_term weights
44 45 Insurer 1 30 1500000.0 941.0 78.416667 0.627333 45000000.0 1.0
45 46 Insurer 2 30 1500000.0 955.0 79.583333 0.636667 45000000.0 0.0
46 47 Insurer 1 30 1750000.0 1087.0 90.583333 0.621143 52500000.0 0.0
47 48 Insurer 2 30 1750000.0 1100.0 91.666667 0.628571 52500000.0 0.0
48 49 Insurer 2 30 2000000.0 1234.0 102.833333 0.617000 60000000.0 0.0
df_scenario[df_scenario.weights>0.0]
policy_id insurer term coverage yearly_premium monthly_premium premium_per_thousand cov_x_term weights
44 45 Insurer 1 30 1500000.0 941.0 78.416667 0.627333 45000000.0 1.0
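
Before moving on to multiple scenarios, note that PuLP also exposes the optimized objective value directly, so we can sanity-check the portfolio's total yearly premium. A quick sketch; with a single selected policy it should simply match that policy's $941 premium:

```python
from pulp import value

# Total yearly premium of the optimal portfolio for this scenario;
# with one selected policy this should equal its yearly_premium (941.0).
print("Optimal total yearly premium:", value(prob.objective))
```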

Let's look at a few different scenarios of coverage and term. We'd expect the common policy terms of 10, 20, and 30 years to already be competitively priced on their own, so target terms in between are where the integer programming technique may bear some fruit.

# coverage
min_cov_array = [750000,1000000,1250000,1500000,1750000,2000000]
min_term_array = [15,20,23,25,27]
num_scenarios = len(min_cov_array)*len(min_term_array)

weights_dict = {} 
pricing_dict = {}

df_scenario = pd.DataFrame()
df_final_scenario_policy_weights = pd.DataFrame() 
df_final_summary = pd.DataFrame()

#min_avg_term = 25.0
for min_avg_term in min_term_array:
    # Could break the inner loop out into a sub-function
    for min_tot_cov in min_cov_array:
        # keep the weights from each run, keyed by the min_avg_term / min_tot_cov scenario
        
        prob = LpProblem("Insurance Problem",LpMinimize)
        insurance_vars = LpVariable.dicts("Life",insurance_items,0,cat='Integer')

        prob += lpSum([yearly_premium[i]*insurance_vars[i] for i in insurance_items])
        prob += lpSum([coverage[f] * insurance_vars[f] for f in insurance_items]) >= min_tot_cov, "MinimumTotalCoverage"
        prob += lpSum([cov_x_term[f] * insurance_vars[f] for f in insurance_items]) >= min_avg_term*(lpSum([coverage[f] * insurance_vars[f] for f in insurance_items])), "MinimumAvgDuration"
        prob.solve()

        weights_dict[f'{min_avg_term}_{min_tot_cov:}']={}
        for v in prob.variables():
            weights_dict[f'{min_avg_term}_{min_tot_cov:}'].update({v.name[5:]:v.varValue})            

    #write all this into dataframe
    weights_df = pd.DataFrame(weights_dict)
    weights_df.reset_index(inplace=True)
    weights_df.rename(columns={'index':'policy_id'},inplace=True)
    
    weights_df['policy_id'] = weights_df['policy_id'].astype('int64')

    df_scenario = df_matrix.merge(weights_df,how='left',on='policy_id')
    
num_scenarios: 30
weights_df.shape
(49, 31)
df_scenario.head(5)
policy_id insurer term coverage yearly_premium monthly_premium premium_per_thousand cov_x_term 15_750000 15_1000000 ... 25_1250000 25_1500000 25_1750000 25_2000000 27_750000 27_1000000 27_1250000 27_1500000 27_1750000 27_2000000
0 1 Insurer 1 10 100000.0 98.0 8.166667 0.980 1000000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 2 Insurer 1 10 250000.0 143.0 11.916667 0.572 2500000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 3 Insurer 2 10 250000.0 228.0 19.000000 0.912 2500000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 4 Insurer 1 10 500000.0 205.0 17.083333 0.410 5000000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 5 Insurer 2 10 500000.0 370.0 30.833333 0.740 5000000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 38 columns

# for a more convenient view let's only keep the policies that have weights > 0
df_scenario.shape
(49, 38)
## for illustrative purposes I'm copying this over to a separate dataframe, with large datasets this is not recommended. 
df_weighted = df_scenario[(df_scenario.iloc[:,-num_scenarios:]!=0).any(axis=1)]
df_weighted = df_weighted.copy()
df_weighted.shape
(19, 38)
# can see your composition of policies for each optimized scenario
df_weighted.head(5)
policy_id insurer term coverage yearly_premium monthly_premium premium_per_thousand cov_x_term 15_750000 15_1000000 ... 25_1250000 25_1500000 25_1750000 25_2000000 27_750000 27_1000000 27_1250000 27_1500000 27_1750000 27_2000000
8 9 Insurer 1 15 250000.0 130.0 10.833333 0.520000 3750000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0 0.0 0.0
17 18 Insurer 1 20 500000.0 215.0 17.916667 0.430000 10000000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 1.0
19 20 Insurer 1 20 750000.0 292.0 24.333333 0.389333 15000000.0 1.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
21 22 Insurer 1 20 1000000.0 365.0 30.416667 0.365000 20000000.0 0.0 1.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
23 24 Insurer 1 20 1250000.0 442.0 36.833333 0.353600 25000000.0 0.0 0.0 ... 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0

5 rows × 38 columns

Below, the scenarios with a policy count greater than 1 indicate that a combination of policies yields a better price than any single policy!

#Quick view of our policies shows that this may be working.
df_weighted.iloc[:,-num_scenarios:].astype(bool).sum()
15_750000     1
15_1000000    1
15_1250000    1
15_1500000    1
15_1750000    1
15_2000000    1
20_750000     1
20_1000000    1
20_1250000    1
20_1500000    1
20_1750000    1
20_2000000    1
23_750000     1
23_1000000    1
23_1250000    2
23_1500000    2
23_1750000    2
23_2000000    2
25_750000     1
25_1000000    1
25_1250000    1
25_1500000    1
25_1750000    1
25_2000000    1
27_750000     1
27_1000000    1
27_1250000    2
27_1500000    2
27_1750000    2
27_2000000    2
dtype: int64
aggregate = {} 
for i in df_weighted.iloc[:,-num_scenarios:].columns:
    aggregate[i]={
                'policy_count':df_weighted[i].astype(bool).sum(),
                'yearly_premium':np.dot(df_weighted.yearly_premium,df_weighted[i]),
                'wa_term':np.dot(df_weighted['cov_x_term'],df_weighted[i])/(np.dot(df_weighted.coverage,df_weighted[i])),
                'total_coverage':np.dot(df_weighted.coverage,df_weighted[i])}
# assemble portfolio-level stats (policy count, premium, weighted-average term, total coverage) per scenario
df_portfolio_stats = pd.DataFrame(aggregate).transpose()
df_portfolio_stats.index.rename('scenario',inplace=True)
df_portfolio_stats.reset_index(inplace=True)
df_portfolio_stats['min_avg_term']=df_portfolio_stats.scenario.str.split("_",expand=True)[0]
df_portfolio_stats['min_tot_cov']=df_portfolio_stats.scenario.str.split("_",expand=True)[1]
df_portfolio_stats = df_portfolio_stats[['scenario', 'min_avg_term','min_tot_cov','policy_count', 'yearly_premium', 'wa_term','total_coverage']]
df_portfolio_stats['premium_per_thousand']=(df_portfolio_stats.yearly_premium/df_portfolio_stats.total_coverage)*1000

Now we have everything in a format that lets us compare back to our original price matrix; personally, I was intrigued by the 23_1500000, 23_1750000, and 23_2000000 scenarios.
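
If you'd rather rank the scenarios than eyeball the table below, a quick sort on the portfolio-level metric is a handy convenience view of the same dataframe:

```python
# Rank the scenario portfolios by premium per $1k of coverage, cheapest first
print(df_portfolio_stats.sort_values('premium_per_thousand').head(10))
```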

Enjoy!

df_portfolio_stats
scenario min_avg_term min_tot_cov policy_count yearly_premium wa_term total_coverage premium_per_thousand
0 15_750000 15 750000 1.0 292.0 20.000000 750000.0 0.389333
1 15_1000000 15 1000000 1.0 365.0 20.000000 1000000.0 0.365000
2 15_1250000 15 1250000 1.0 442.0 20.000000 1250000.0 0.353600
3 15_1500000 15 1500000 1.0 518.0 20.000000 1500000.0 0.345333
4 15_1750000 15 1750000 1.0 594.0 20.000000 1750000.0 0.339429
5 15_2000000 15 2000000 1.0 671.0 20.000000 2000000.0 0.335500
6 20_750000 20 750000 1.0 292.0 20.000000 750000.0 0.389333
7 20_1000000 20 1000000 1.0 365.0 20.000000 1000000.0 0.365000
8 20_1250000 20 1250000 1.0 442.0 20.000000 1250000.0 0.353600
9 20_1500000 20 1500000 1.0 518.0 20.000000 1500000.0 0.345333
10 20_1750000 20 1750000 1.0 594.0 20.000000 1750000.0 0.339429
11 20_2000000 20 2000000 1.0 671.0 20.000000 2000000.0 0.335500
12 23_750000 23 750000 1.0 417.0 25.000000 750000.0 0.556000
13 23_1000000 23 1000000 1.0 520.0 25.000000 1000000.0 0.520000
14 23_1250000 23 1250000 2.0 632.0 23.000000 1250000.0 0.505600
15 23_1500000 23 1500000 2.0 731.0 23.333333 1500000.0 0.487333
16 23_1750000 23 1750000 2.0 850.0 23.571429 1750000.0 0.485714
17 23_2000000 23 2000000 2.0 927.0 23.125000 2000000.0 0.463500
18 25_750000 25 750000 1.0 417.0 25.000000 750000.0 0.556000
19 25_1000000 25 1000000 1.0 520.0 25.000000 1000000.0 0.520000
20 25_1250000 25 1250000 1.0 635.0 25.000000 1250000.0 0.508000
21 25_1500000 25 1500000 1.0 750.0 25.000000 1500000.0 0.500000
22 25_1750000 25 1750000 1.0 865.0 25.000000 1750000.0 0.494286
23 25_2000000 25 2000000 1.0 981.0 25.000000 2000000.0 0.490500
24 27_750000 27 750000 1.0 520.0 30.000000 750000.0 0.693333
25 27_1000000 27 1000000 1.0 647.0 30.000000 1000000.0 0.647000
26 27_1250000 27 1250000 2.0 777.0 27.000000 1250000.0 0.621600
27 27_1500000 27 1500000 2.0 924.0 27.500000 1500000.0 0.616000
28 27_1750000 27 1750000 2.0 1009.0 27.142857 1750000.0 0.576571
29 27_2000000 27 2000000 2.0 1156.0 27.500000 2000000.0 0.578000



Full Code Here

Written on June 4, 2019