1 Introduction – R Workspace and Data Exploration


In this section, you learn to:

  • Install and activate the R packages needed for this lab;
  • Describe the insurance data used throughout the lab;
  • Perform an Exploratory Data Analysis (EDA) of the claim data.

1.1 Install R packages

Install and activate the R packages needed for the specific computations of this lab, such as fitdistrplus and actuar. Some additional generic R packages are also needed, such as knitr and data.table. Recall that this R lab is written in the Bookdown setting, so installing R packages should be done directly from RStudio. If you run the R code in the usual fashion, use e.g. install.packages("knitr") to install the R package knitr. Activating R packages can be done individually, e.g. library(knitr), or for multiple R packages in one go, e.g. libraries(c("fitdistrplus", "actuar", "knitr", "data.table")), but the latter requires a prior installation of the R package easypackages.

#install.packages("knitr") ## example of individual R package installation
#library(knitr)            ## example of individual R package activation
#suppressWarnings(library(easypackages))
#library(easypackages)
#libraries(c("fitdistrplus", "actuar", "knitr", "data.table", "MASS", "survival")) ## example of multiple R package activation in one go
suppressWarnings(library(actuar, warn.conflicts = FALSE))  ## warning messages from activating "actuar" package are ignored
suppressWarnings(library(MASS))
suppressWarnings(library(survival))
suppressWarnings(library(fitdistrplus))
suppressWarnings(library(grDevices))
suppressWarnings(library(knitr))
suppressWarnings(library(data.table))

1.2 Data Description

The purpose of this section is to describe the data to a general audience. Our insurance data are given as `Building insurance claim size' and consist of a sample of size \(1,502\) of \[(x_{i1}, x_{i2}, x_{i3}, y_{i1}, y_{i2}) \quad \text{with}\; 1\le i \le 1,502,\] where \((x_{i1}, x_{i2}, x_{i3})\) are the covariates sampled from \((X_1,X_2,X_3)\) that help to predict the target variables \((Y_1,Y_2)\) via the observed sample \((y_{i1}, y_{i2})\). Recall that \((Y_1,Y_2)\) represents the building and content claims that are recorded as the Building Insurance Claim Size and Content Insurance Claim Size variables. Moreover, \((X_1,X_2,X_3)\) provides redundant information, since \(X_1\) is the Recorded time variable that contains the information given by \(X_2\) and \(X_3\), which record the claim occurrence time (year and month, respectively). That is,

the feature values \(\textbf{x}\) and target values \(\textbf{y}\) include information about the following:

  1. Recorded time (\(X_1\)) – date variable;
  2. Building Insurance Claim Size (\(Y_1\)) – continuous variable that records the building insurance claim;
  3. Content Insurance Claim Size (\(Y_2\)) – continuous variable that records the content insurance claim;
  4. Year (\(X_2\)) – categorical variable with 11 possible values: from 1980 to 1990;
  5. Month (\(X_3\)) – categorical variable with 12 possible values: from 1 representing January to 12 representing December.

The data file is named Bivariate_Fire_Insurance_Clean_Data.csv and a small snapshot of the data is given in Table 1.1. A full Exploratory Data Analysis is provided in Section 1.3, which explains the sample in greater detail.

Our insurance data are given as `Building insurance claim size' and have the following structure:

Bivariate_Fire_Insurance_Clean_Data <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
kable(head(Bivariate_Fire_Insurance_Clean_Data[,1:5]), caption="Fire Insurance Data Snapshot",align = "rrrrr", format.args = list(big.mark = ","))
Table 1.1: Fire Insurance Data Snapshot
Recorded.time Building.Insurance.Claim.Size Content.Insurance.Claim.Size Year Month
03/01/1980 1,098,096.6 585,651.5 1,980 1
04/01/1980 1,756,954.6 336,749.6 1,980 1
07/01/1980 1,244,509.5 3,367,496.0 1,980 1
10/01/1980 4,452,039.5 4,273,234.0 1,980 1
10/01/1980 2,494,875.5 3,543,192.0 1,980 1
16/01/1980 775,689.6 993,117.1 1,980 1
Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
#kable(head(Bivariate_Fire_Insurance_Clean_Data_toy), caption="Fire Insurance Data Snapshoot",align = "rrrrr", format.args = list(big.mark = ","))
summary(Bivariate_Fire_Insurance_Clean_Data_toy[,2:5])
       Y1                 Y2                  X2             X3        
 Min.   :   23191   Min.   :      825   Min.   :1980   Min.   : 1.000  
 1st Qu.:  878753   1st Qu.:   268458   1st Qu.:1983   1st Qu.: 4.000  
 Median : 1224291   Median :   482160   Median :1986   Median : 7.000  
 Mean   : 1871507   Mean   :  1629436   Mean   :1986   Mean   : 6.525  
 3rd Qu.: 1928640   3rd Qu.:  1188155   3rd Qu.:1988   3rd Qu.: 9.000  
 Max.   :95168375   Max.   :132013200   Max.   :1990   Max.   :12.000  

Note that this R lab aims to provide a prediction model only for the Building claims (\(Y_1\)), since the Content claims (\(Y_2\)) could be modelled in the same fashion. Keep in mind that \(Y_1\) and \(Y_2\) are dependent random variables.
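As a quick, informal check of this dependence (a minimal sketch; the correlation measures below are not part of the original analysis), one can compute the overall Pearson and Kendall correlations between the two claim types:

## overall association between Building (Y1) and Content (Y2) claims
cor(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$Y2)                     ## Pearson correlation
cor(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$Y2, method = "kendall") ## rank-based (Kendall) correlation

The yearly and monthly breakdowns of this association are computed next.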

suppressWarnings(library(corrplot))
par(mfrow=c(1,2))
### yearly correlation plot 
coryear <- c()
for(i in 1980:1990){coryear <- c(coryear, cor(Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),c("Y1","Y2")])[1,2])}
names(coryear) <-  1980:1990
plot(1980:1990,coryear, main="Yearly Correlation Plot", ylab="Corr", xlab="year", type="p", col="red")
### monthly correlation plot 
cormonth <- c()
nmonth <-  c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for(i in 1:12){cormonth <- c(cormonth, cor(Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),c("Y1","Y2")])[1,2])}
names(cormonth) <- nmonth
plot(1:12,cormonth, main="Monthly Correlation Plot",ylab="Corr", xlab="month", type="p", col="red",xaxt="n") 
axis(side=1,at=1:12,labels=nmonth)

Figure 1.1: Correlation for \((Y_1,Y_2)\) for various years (left) and months (right)

It is quite difficult to explain the yearly correlations from Figure 1.1, but the monthly correlations are quite intuitive. The Building and Content claims are highly correlated during the hot weather – as can be seen for July and August – when concomitant large claims are more likely to occur.

1.3 Exploratory Data Analysis (EDA)

The purpose of this section is to explore the trends in the data through descriptive tools, which could help with selecting the most appropriate data analysis; the corresponding statistical evidence is provided in Section 3.

Recall that the strength of association between the Building and Content claims has been examined in Section 1.2, namely in Figure 1.1. Since our real data analysis in Section 3 will focus on the Building claims, less detailed qualitative data explorations are made for the Content claims in Section 1.3. We mainly look at how homogeneous the claims are over time, though statistical evidence would be required to confirm the findings of this section.

First, we look at two timeplots, namely Figures 1.2 and 1.3, which show the evolution over time of the Building and Content claims, respectively.

Bivariate_Fire_Insurance_Clean_Data_toy$X1 <- as.Date(Bivariate_Fire_Insurance_Clean_Data_toy$X1, "%d/%m/%Y")
library(ggplot2)
library(dplyr)
library(plotly)
library(hrbrthemes)
pp1 <- Bivariate_Fire_Insurance_Clean_Data_toy %>%
    ggplot( aes(x=X1, y=Y1)) +
    geom_area(fill="#00ffff", alpha=0.5) +
    geom_line(color="#69b3a2") +
    ylab("Building Insurance Claim Size (Y1)") +
    xlab("Recorded time (X1)")+
    theme_ipsum()
pp1 <- ggplotly(pp1)
pp1

Figure 1.2: Building Insurance Claim Size Timeplot

#Bivariate_Fire_Insurance_Clean_Data_toy$X1 <- as.Date(Bivariate_Fire_Insurance_Clean_Data_toy$X1, "%d/%m/%Y")
library(ggplot2)
library(dplyr)
library(plotly)
library(hrbrthemes)
pp1 <- Bivariate_Fire_Insurance_Clean_Data_toy %>%
    ggplot( aes(x=X1, y=Y2)) +
    geom_area(fill="#84ff00", alpha=0.5) +
    geom_line(color="#69b3a2") +
    ylab("Content Insurance Claim Size (Y2)") +
    xlab("Recorded time (X1)")+
    theme_ipsum()
pp1 <- ggplotly(pp1)
pp1

Figure 1.3: Content Insurance Claim Size Timeplot

Figures 1.2 and 1.3 give an overall understanding of the claims, and we observe some outliers, i.e. very large claims compared to the common claims. We should be mindful of such claims when fitting the claim amount distribution, as such atypical claims may affect the robustness of our estimation.
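If needed, the extreme claims flagged in the timeplots can be listed directly; the short sketch below (an illustration only, not part of the original code) extracts the three largest Building and Content claims together with their recorded dates:

## the three largest Building and Content claims, with their recorded dates
head(Bivariate_Fire_Insurance_Clean_Data_toy[order(-Bivariate_Fire_Insurance_Clean_Data_toy$Y1), c("X1", "Y1")], 3)
head(Bivariate_Fire_Insurance_Clean_Data_toy[order(-Bivariate_Fire_Insurance_Clean_Data_toy$Y2), c("X1", "Y2")], 3)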

The next step is to look at some classical histograms to understand whether the claims show unusual patterns in some years and/or months. Figures 1.4 and 1.5 show the yearly histograms for the Building and Content claims, respectively.

par(mfrow=c(3,4))
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 hist(xxx$Y1,main="", xlab=i, border="red", col="green")}
hist(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, main="", xlab="All Years", border="red", col="green")

Figure 1.4: Building Insurance Claim Size (Y1) – Yearly Histograms

par(mfrow=c(3,4))
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 hist(xxx$Y2,main="", xlab=i, border="red", col="green")}
hist(Bivariate_Fire_Insurance_Clean_Data_toy$Y2, main="", xlab="All Years", border="red", col="green")

Figure 1.5: Content Insurance Claim Size (Y2) – Yearly Histograms

Figures 1.6 and 1.7 show the monthly histograms for the Building and Content claims, respectively.

par(mfrow=c(3,4))
nmonth <-  c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 hist(xxx$Y1,main="", xlab=nmonth[i], border="red", col="green")}

Figure 1.6: Building Insurance Claim Size (Y1) – Monthly Histograms

par(mfrow=c(3,4))
nmonth <-  c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 hist(xxx$Y2,main="", xlab=nmonth[i], border="red", col="green")}

Figure 1.7: Content Insurance Claim Size (Y2) – Monthly Histograms

It is quite clear that interpreting numerous histograms – displayed as Figures 1.4 to 1.7 – is not convenient, and thus, we now report summary statistics in radar plots that help gather and compare the yearly and monthly risks with ease.

The following radar plots are provided for the Building Insurance Claim Size (Y1):

  1. Figure 1.8 that displays the yearly mean and standard deviation;
  2. Figure 1.9 that displays the yearly quantiles;
  3. Figure 1.10 that displays the monthly mean and standard deviation;
  4. Figure 1.11 that displays the monthly quantiles.

Note that the claims are divided by 1,000,000, i.e. all measures of risk – mean, standard deviation and quantiles – are scaled down by a factor of 1,000,000.

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanyear <- numeric(11)
sdyear <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 meanyear[i-1979] <- mean(xxx$Y1)/1000000
 sdyear[i-1979] <- sd(xxx$Y1)/1000000}
library(plotly)
fig1eY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanyear, meanyear[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), #the labels are forced to be characters, so the setting from below does not work
               # theta = c("1980", "1981", "1982", "1983", "1984", "1985", "1986'", "1987", "1988", "1989", "1990", "1980"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY1 <- fig1eY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdyear, sdyear[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY1

Figure 1.8: Building Insurance Claim Size (Y1) – Yearly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50year <- numeric(11)
q75year <- numeric(11)
q95year <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 q50year[i-1979] <- quantile(xxx$Y1,0.50)/1000000
 q75year[i-1979] <- quantile(xxx$Y1,0.75)/1000000
 q95year[i-1979] <- quantile(xxx$Y1,0.95)/1000000}
library(plotly)
fig1qY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50year, q50year[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), 
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75year, q75year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95year, q95year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY1

Figure 1.9: Building Insurance Claim Size (Y1) – Yearly Summary Statistics (quantiles)

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanmonth <- numeric(12)
sdmonth <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 meanmonth[i] <- mean(xxx$Y1)/1000000
 sdmonth[i] <- sd(xxx$Y1)/1000000}
library(plotly)
fig1eY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanmonth, meanmonth[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY1 <- fig1eY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdmonth, sdmonth[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY1

Figure 1.10: Building Insurance Claim Size (Y1) – Monthly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50month <- numeric(12)
q75month <- numeric(12)
q95month <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 q50month[i] <- quantile(xxx$Y1,0.50)/1000000
 q75month[i] <- quantile(xxx$Y1,0.75)/1000000
 q95month[i] <- quantile(xxx$Y1,0.95)/1000000}
library(plotly)
fig1qY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50month, q50month[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75month, q75month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95month, q95month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY1

Figure 1.11: Building Insurance Claim Size (Y1) – Monthly Summary Statistics (quantiles)

Building claims show an interesting trend. While Figure 1.8 shows large variation – a coefficient of variation larger than 1 – on a yearly basis, the monthly claims from Figure 1.10 show the same behaviour only for July, August and October. This peculiar behaviour could be explained by a lack of homogeneity on the annual basis, i.e. the iid (independent and identically distributed) assumption may not hold for the individual claims (sampled annually). This is only our initial intuition, though statistical testing could confirm or refute this behaviour. At the same time, the timeplot in Figure 1.2 shows three extreme claims; the most extreme claim appears in 1980, and the second (third) most extreme claim is recorded in 1985 (1988), which is in accordance with Figure 1.8. Quantiles could better explain the tail behaviour, and thus, the \(50\%\), \(75\%\) and \(95\%\) quantiles are plotted in Figures 1.9 and 1.11. The same outliers – explained before – tend to explain the large difference between the \(75\%\) and \(95\%\) quantiles, especially in the year 1985 and the month of October.
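The coefficient-of-variation statement above can be verified directly from the data; here is a minimal sketch (not part of the original code) computing the yearly and monthly sd/mean ratios of the Building claims:

## coefficient of variation (sd/mean) of the Building claims, by year and by month
cv <- function(x) sd(x)/mean(x)
round(tapply(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$X2, cv), 2) ## yearly CVs
round(tapply(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$X3, cv), 2) ## monthly CVs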

Content claims show an even more extreme behaviour – see Figures 1.12 to 1.15 – which is in accordance with the multiple unusually large claims observed in the timeplot from Figure 1.3. The last visualisation tools of this section are now given for the Content Insurance Claim Size (Y2), i.e. the following radar plots:

  1. Figure 1.12 that displays the yearly mean and standard deviation;
  2. Figure 1.13 that displays the yearly quantiles;
  3. Figure 1.14 that displays the monthly mean and standard deviation;
  4. Figure 1.15 that displays the monthly quantiles.

Note that the claims are divided by 1,000,000, i.e. all measures of risk – mean, standard deviation and quantiles – are scaled down by a factor of 1,000,000.

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanyear <- numeric(11)
sdyear <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 meanyear[i-1979] <- mean(xxx$Y2)/1000000
 sdyear[i-1979] <- sd(xxx$Y2)/1000000}
library(plotly)
fig1eY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanyear, meanyear[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), #the labels are forced to be characters, so the setting from below does not work
               # theta = c("1980", "1981", "1982", "1983", "1984", "1985", "1986'", "1987", "1988", "1989", "1990", "1980"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY2 <- fig1eY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdyear, sdyear[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY2

Figure 1.12: Content Insurance Claim Size (Y2) – Yearly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50year <- numeric(11)
q75year <- numeric(11)
q95year <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 q50year[i-1979] <- quantile(xxx$Y2,0.50)/1000000
 q75year[i-1979] <- quantile(xxx$Y2,0.75)/1000000
 q95year[i-1979] <- quantile(xxx$Y2,0.95)/1000000}
library(plotly)
fig1qY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50year, q50year[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), 
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75year, q75year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95year, q95year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY2

Figure 1.13: Content Insurance Claim Size (Y2) – Yearly Summary Statistics (quantiles)

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanmonth <- numeric(12)
sdmonth <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 meanmonth[i] <- mean(xxx$Y2)/1000000
 sdmonth[i] <- sd(xxx$Y2)/1000000}
library(plotly)
fig1eY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanmonth, meanmonth[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY2 <- fig1eY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdmonth, sdmonth[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY2

Figure 1.14: Content Insurance Claim Size (Y2) – Monthly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50month <- numeric(12)
q75month <- numeric(12)
q95month <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 q50month[i] <- quantile(xxx$Y2,0.50)/1000000
 q75month[i] <- quantile(xxx$Y2,0.75)/1000000
 q95month[i] <- quantile(xxx$Y2,0.95)/1000000}
library(plotly)
fig1qY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50month, q50month[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75month, q75month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95month, q95month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY2

Figure 1.15: Content Insurance Claim Size (Y2) – Monthly Summary Statistics (quantiles)

This ends our EDA, which is a preamble to the real data analysis provided in Section 3. We next provide a detailed simulation study in Section 2 that relies on the same loss data estimation procedures as the real data analysis. The main aim of the simulation study is to point out possible shortcomings of the statistical models used to fit our insurance data.

2 Simulation Study


In this section, you learn to:

  • Simulate data from parametric (loss) distributions;
  • Fit parametric distributions via Maximum Likelihood Estimation (MLE);
  • Fit parametric distributions via the Method of Moments (MOM).

2.1 Simulate Data

This section provides elementary simulations from univariate (loss) distributions. We first simulate samples of size \(n=500\) – though samples of size \(n=\{25, 50, 100, 250, 500\}\) will be later generated – from the following three parametric distributions:

  1. LogNormal with parameters \(\mu=10\) and \(\sigma=1\), known as Sample 1 – note that all LogNormal distributions are moderately heavy tailed distributions;
  2. Weibull with parameters \(a=0.5\) and \(b=1\), known as Sample 2 – note that this particular Weibull is a moderately heavy tailed distribution since \(a<1\);
  3. Weibull with parameters \(a=2\) and \(b=1\), known as Sample 3 – note that this particular Weibull is a very light tailed distribution since \(a>1\).

Note that our LogNormal distribution has the usual parametrisation, i.e. its expected value is \(\exp\left(\mu+\frac{\sigma^2}{2}\right)\). Further, our Weibull parametrisation, which is in terms of \(a,b>0\), has the following cumulative distribution function:

\[ F_{Weibull}(x;a,b)=\Pr(X\le x)=1-\exp\left\{-\left(\frac{x}{b}\right)^a\right\},\quad\text{for all}\;x\ge0,\;\text{i.e.}\;\mathbb{E}[X]=b\cdot\Gamma\left(1+\frac{1}{a}\right), \] where \(\Gamma(\cdot)\) is the usual Gamma function.
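As a quick numerical sanity check of this parametrisation (a sketch only; the values a, b and x below are illustrative), one can verify that R's pweibull agrees with the cumulative distribution function above and that a large simulated sample has mean close to \(b\cdot\Gamma(1+1/a)\):

## check of the Weibull parametrisation (shape = a, scale = b) used throughout
a <- 0.5; b <- 1; x <- 2
pweibull(x, shape = a, scale = b)           ## R's CDF at x
1 - exp(-(x/b)^a)                           ## the formula above -- should match
mean(rweibull(10^6, shape = a, scale = b))  ## should be close to b*gamma(1 + 1/a) = 2
b*gamma(1 + 1/a)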

The summary statistics for the three samples are given below:

set.seed(123); #set your seed so you could replicate the same sample every time you run the R script
n=500;  
#Simulate Sample 1: logNormal($\mu=10, \sigma=1$) moderate heavy tail
Sample1 <- rlnorm(n, meanlog=10, sdlog=1)  
summary(Sample1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1539   12399   22488   36912   43705  563003
#Simulate Sample 2: Weibull($a=0.5, b=1$) moderate heavy tail
Sample2 <- rweibull(n, shape=0.5 , scale=1) 
#Note that the scale parameter b=1 
summary(Sample2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.09372  0.48676  2.01320  1.98672 45.73743
#Simulate Sample 3: Weibull($a=2, b=1$) very light tail
Sample3 <- rweibull(n, shape=2 , scale=1)
summary(Sample3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06029 0.53292 0.83407 0.88164 1.16460 2.31943

Let us plot the histograms for three samples, which are given in Figure 2.1 below.

par(mfrow=c(2,2))
hist(Sample1, prob=TRUE, border="red", col="green");
hist(log(Sample1), prob=TRUE, border="red", col="green");
hist(Sample2, prob=TRUE, border="red", col="green");
hist(Sample3, prob=TRUE, border="red", col="green");

Figure 2.1: Histograms for Sample 1 and log of Sample 1 (top), and Samples 2 and 3 (bottom)

As expected, the Sample 1 and Sample 2 histograms in Figure 2.1 are right skewed due to their heavy-tailedness, while Sample 3 shows a very different pattern. At first glance, Sample 2 seems to come from a less heavy tailed distribution than Sample 1, since one may say the Sample 1 histogram fades away after the threshold \(3\times10^5\), while the Sample 2 histogram fades away after the threshold \(30\); this argument is faulty since the two samples come from distributions with very different scales and moments, e.g. the theoretical expected values of Sample 1 and Sample 2 are \(\exp(10.5)=36,315.50\) and \(1\cdot\Gamma\left(1+\frac{1}{0.5}\right)=2\), respectively. In fact, one may argue that the theoretical distribution of Weibull(\(a=0.5\),\(b=1\)) exhibits a heavier tail than LogNormal(\(\mu=10\),\(\sigma=1\)).
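One informal way to see this claim – a minimal sketch comparing quantiles scaled by the respective means, not a formal tail comparison – is the following:

## mean-scaled quantiles: Weibull(a=0.5, b=1) versus LogNormal(mu=10, sigma=1)
p <- c(0.95, 0.99, 0.999)
qweibull(p, shape = 0.5, scale = 1)/gamma(1 + 1/0.5) ## Weibull quantiles divided by its mean (= 2)
qlnorm(p, meanlog = 10, sdlog = 1)/exp(10 + 1/2)     ## LogNormal quantiles divided by its mean

At these probability levels the mean-scaled Weibull quantiles are larger, which is consistent with the statement above; a formal comparison would require analysing the two survival functions.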

Finally, the log-transformation of the Sample 1 data leads to a symmetric histogram due to the fact that \(\log(X)\sim N\big(\mu,\sigma^2\big)\) whenever \(X\sim LogN\big(\mu,\sigma^2\big)\).
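A quick graphical check of this statement (a sketch; the plot is not part of the original set of figures) is a normal Q-Q plot of the log-transformed sample:

## normal Q-Q plot of log(Sample1): points close to the reference line support log-normality
qqnorm(log(Sample1)); qqline(log(Sample1), col = "red")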

2.2 Maximum Likelihood Estimation (MLE)


We are now ready to parametrically fit our sample data from Section 2.1, and two methods are chosen:

  • Maximum Likelihood Estimation (MLE) – see Section 2.2 – where the parameters are chosen so that the likelihood of observing the given sample is maximised;
  • Method of Moments (MOM) – see Section 2.3 – where the sample and theoretical moments are matched, i.e. set equal.

MLE and MOM estimation are possible via the mledist and mmedist functions that are available in the fitdistrplus R package. There are numerous numerical issues with both the MLE and MOM implementations in R, especially when fitting distributions with more than one parameter. Why?

MLE requires solving non-linear optimisation problems, and R (and Python) rely on basic optimisation tools that are designed for general use; bespoke software like MATLAB would be more computationally stable. This issue is more acute when the dimension of the optimisation problem, i.e. the number of parameters, gets larger (two or more parameters).

MOM usually requires solving a non-linear system of equations, which is computationally feasible even in R (and Python) – as an application of the Fixed Point Theorem, though other methods are available – when a one-parameter distribution is fitted, but MOM is computationally more challenging (not only in R and Python) when the number of parameters gets larger (two or more parameters).


Eight parametric families are used to fit our three samples. The standard R names are used to indicate which parametric family we refer to, together with the R package(s) in which it is implemented:

  1. lnorm refers to LogNormal distribution – see actuar and stats R packages;
  2. exp refers to Exponential distribution – see actuar and stats R packages;
  3. gamma refers to Gamma distribution – see actuar and stats R packages;
  4. weibull refers to Weibull distribution – see actuar and stats R packages;
  5. llogis refers to LogLogistic distribution – see actuar R package;
  6. invweibull refers to Inverse Weibull distribution – see actuar R package;
  7. pareto refers to Pareto distribution – see actuar R package;
  8. trgamma refers to transformed Gamma distribution – see actuar R package.

Recall that pareto is defined on \((0, \infty)\), which is not true for Pareto Type 1, and thus, we prefer using pareto so that all eight parametric families are defined on the same range.
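To make the parametrisation explicit – a minimal sketch, assuming actuar's shape/scale parametrisation – one can check that ppareto matches the cumulative distribution function \(1-\left(\lambda/(\lambda+x)\right)^{\alpha}\) (spelled out again in Section 2.4), which is indeed defined on the whole of \((0,\infty)\); the shape and scale values below are illustrative only:

## actuar's pareto: F(x) = 1 - (scale/(scale + x))^shape, supported on (0, Inf)
x <- c(0, 100, 1000)
ppareto(x, shape = 2, scale = 1000) ## actuar's CDF
1 - (1000/(1000 + x))^2             ## same values from the formula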

MLE estimation requires a `good' starting point for its numerical algorithm, and thus, one should be careful when performing MLE estimation for distributions with two or more parameters; local optima or even saddle points could be obtained, far from the desired global optimum point/points (if this/these exist), since the log-likelihood function is not always concave. MOM faces similar numerical shortcomings when performing MOM estimation for distributions with two or more parameters. Numerical issues are signified as NA in the R output.
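A common way to obtain such a `good' starting point – a sketch under the assumption that rough moment-matching values land close enough to the optimum; the object names below are only illustrative – is to pass moment-based guesses to mledist via its start argument:

## moment-based starting values for the gamma MLE fit: shape = mean^2/var, rate = mean/var
m <- mean(Sample1); v <- var(Sample1)
start.gamma <- list(shape = m^2/v, rate = m/v)
MLE.gamma.start <- mledist(Sample1, "gamma", lower = c(0, 0), start = start.gamma)
MLE.gamma.start$estimate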

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming lnorm family) as follows:

MLE.lnorm=mledist(Sample1, "lnorm")
MLE.lnorm$estimate
##    meanlog      sdlog 
## 10.0345904  0.9717961

Sample 1 (simulated from LogNormal) is fitted via MLE assuming exp family as follows:

MLE.exp=mledist(Sample1/1000, "exp") 
#Large valued data (not large sized samples) require to scale down the data by a factor 
#we chose here to scale down by a factor of 1,000 since the expected value is around \exp(10.5)=36,315.50
#we should scale back the MLE estimate, and since we actually fit the rate, we divide the MLE estimate by 1,000
#recall that expected value is equal to 1/rate
MLE.exp$estimate/1000
##         rate 
## 2.709161e-05

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming gamma family) as follows:

MLE.gamma=mledist(Sample1, "gamma",lower=c(0, 0), start=list(shape =1, rate=1))
#A `good' starting point is needed here
MLE.gamma$estimate  #rate=1/scale
##        shape         rate 
## 1.177195e+00 3.188515e-05

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming weibull family) as follows:

MLE.weibull=mledist(Sample1, "weibull")
MLE.weibull$estimate
##        shape        scale 
##     1.016727 37225.585178

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming llogis family) as follows:

MLE.llogis=mledist(Sample1, "llogis")
MLE.llogis$estimate
##        shape        scale 
##     1.802915 22563.980389

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming invweibull family) as follows:

MLE.invweibull=mledist(Sample1, "invweibull")
MLE.invweibull$estimate
##        shape        scale 
##    0.4334861 1036.7107921

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming pareto family) as follows:

MLE.pareto=mledist(Sample1, "pareto",lower=c(0, 0), start=list(shape =1, scale=1))
#A `good' starting point is needed here
MLE.pareto$estimate
##        shape        scale 
## 7.484903e+00 2.382607e+05

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming trgamma family) as follows:

MLE.trgamma=mledist(Sample1, "trgamma")
MLE.trgamma$estimate
##      shape1      shape2        rate 
## 6.922385543 0.391949647 0.004997469

The densities of the MLE-fitted distributions are plotted in Figure 2.2; there is not much to say about these plots at this stage.

par(mfrow=c(2,4))
curve(dlnorm(x,MLE.lnorm$estimate[1],MLE.lnorm$estimate[2]),from=0,to=max(Sample1),ylab="Density",  col=1, main="lnorm", xlab = "")
curve(dexp(x,MLE.exp$estimate/1000),from=0,to=max(Sample1),ylab="Density",  col=2, main="exp", xlab = "")
curve(dgamma(x,MLE.gamma$estimate[1],MLE.gamma$estimate[2]),from=0,to=max(Sample1),ylab="Density",  col=3, main="gamma", xlab = "")
curve(dweibull(x,MLE.weibull$estimate[1],MLE.weibull$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=4,main="weibull", xlab = "")
curve(dllogis(x,MLE.llogis$estimate[1],MLE.llogis$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=5,main="llogis", xlab = "")
curve(dinvweibull(x,MLE.invweibull$estimate[1],MLE.invweibull$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=6,main="invweibull", xlab = "")
curve(dpareto(x,MLE.pareto$estimate[1],MLE.pareto$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=7,main="pareto", xlab = "")
curve(dtrgamma(x,MLE.trgamma$estimate[1],MLE.trgamma$estimate[2],MLE.trgamma$estimate[3]),xlim=c(0,max(Sample1)),ylab="Density",col=8, main="trgamma", xlab = "")

Figure 2.2: MLE Density Plots

2.3 Method of Moments

MOM usually requires solving a non-linear system of equations, which is often computationally infeasible when MOM is applied to parametric families with two or more parameters – as also mentioned at the beginning of Section 2.2. The Sample 1 data are fitted via MOM through the mmedist function, and you will see that MOM estimation fails in some cases, as we anticipated.

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming lnorm family) as follows:

MME.lnorm=mmedist(Sample1, "lnorm")
MME.lnorm$estimate
##   meanlog     sdlog 
## 10.042359  0.973578

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming exp family) as follows:

MME.exp=mmedist(Sample1, "exp")
MME.exp$estimate
##         rate 
## 2.709161e-05

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming gamma family) as follows:

MME.gamma=mmedist(Sample1, "gamma")
MME.gamma$estimate
##        shape         rate 
## 6.328446e-01 1.714478e-05

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming weibull family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.weibull=mmedist(Sample1, "weibull", order=c(1, 2), memp=memp)
## checkstartfix is carried out
MME.weibull$estimate
##        shape        scale 
## 9.504584e-01 3.991438e+04

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming llogis family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.llogis=mmedist(Sample1, "llogis", order=c(1, 2), memp=memp)
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
MME.llogis$estimate
## [1] NA NA

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming invweibull family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.invweibull=mmedist(Sample1, "invweibull", order=c(1, 2), memp=memp)
## checkstartfix is carried out
MME.invweibull$estimate
##     shape     scale 
##  7379.188 59279.946

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming pareto family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.pareto=mmedist(Sample1, "pareto", order=c(1, 2), memp=memp)
## checkstartfix is carried out
MME.pareto$estimate
##        shape        scale 
## 5.447284e+00 1.641572e+05

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming trgamma family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.trgamma=mmedist(Sample1, "trgamma", order=c(1, 2, 3), memp=memp)
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
MME.trgamma$estimate
## [1] NA NA NA

There are no obvious concerns with the MOM estimates when fitting the lnorm, exp and gamma families. There is no doubt that MOM fails when fitting the llogis and trgamma families, and there is no immediate remedy. MOM estimation when fitting the weibull, invweibull and pareto families produces some caution messages, and thus, we check whether the first two empirical and theoretical moments are matched, as they are supposed to be.

xxx <- c(mean(Sample1),mean(Sample1^2))
xxxW<-c(mweibull(1, shape=MME.weibull$estimate[1], scale = MME.weibull$estimate[2]),mweibull(2, shape=MME.weibull$estimate[1], scale = MME.weibull$estimate[2]))
xxxIW<-c(minvweibull(1, shape=MME.invweibull$estimate[1], scale = MME.invweibull$estimate[2]),minvweibull(2, shape=MME.invweibull$estimate[1], scale = MME.invweibull$estimate[2]))
xxxP<-c(mpareto(1, shape=MME.pareto$estimate[1], scale = MME.pareto$estimate[2]),mpareto(2, shape=MME.pareto$estimate[1], scale = MME.pareto$estimate[2]))
cat("First two (raw) moments of Samples 1 -- empirical estimates")
## First two (raw) moments of Samples 1 -- empirical estimates
xxx
## [1] 3.691179e+04 3.515427e+09
cat("First two (raw) moments of Samples 1 -- theoretical MOM weibull")
## First two (raw) moments of Samples 1 -- theoretical MOM weibull
xxxW
## [1] 4.083912e+04 3.515497e+09
cat("First two (raw) moments of Samples 1 -- theoretical MOM invweibull")
## First two (raw) moments of Samples 1 -- theoretical MOM invweibull
xxxIW
## [1] 5.928458e+04 3.514662e+09
cat("First two (raw) moments of Samples 1 -- theoretical MOM pareto")
## First two (raw) moments of Samples 1 -- theoretical MOM pareto
xxxP
## [1] 3.691179e+04 3.515427e+09
cat("The diferences between the empirical and theoretical MOM for weibull, invweibull and pareto")
## The diferences between the empirical and theoretical MOM for weibull, invweibull and pareto
xxx-xxxW
## [1]  -3927.331 -70056.150
xxx-xxxIW
## [1] -22372.79 764842.04
xxx-xxxP
## [1] 3.953755e-07 0.000000e+00

This suggests that the weibull and invweibull MOM estimates are far off, while the pareto MOM estimation is fine as it is.

One could help R to better estimate the weibull MOM estimates from first principles. That is, we have to solve the following system of equations: \[\mathbb{E}[X]=b\cdot\Gamma\left(1+\frac{1}{a}\right)=m_1\quad\text{and}\quad\mathbb{E}\big[X^2\big]=b^2\cdot\Gamma\left(1+\frac{2}{a}\right)=m_2,\] where \(m_1\) and \(m_2\) are the first and second empirical moment estimates. That is, we need to find the MOM estimate of the shape parameter \(\hat{a}_{MOM}>0\) such that \[\frac{m_2}{(m_1)^2}\left(\Gamma\left(1+\frac{1}{a}\right)\right)^2-\Gamma\left(1+\frac{2}{a}\right)=0,\] which is a root problem that could be solved in R with the help of the uniroot function. Clearly, the MOM estimate of the scale parameter is given by \[\hat{b}_{MOM}=\frac{m_1}{\Gamma\left(1+\frac{1}{\hat{a}_{MOM}}\right)}.\]

Now, let us solve the root equation defined earlier, and we get that

xxx <- c(mean(Sample1),mean(Sample1^2))
fW <- function(x) {(xxx[2]/xxx[1]^2)*(gamma(1+(1/x)))^2-gamma(1+(2/x))}
asol<-uniroot(fW, c(0.1,1))$root
bsol<-xxx[1]/gamma(1+(1/asol))
c(asol,bsol)
## [1] 8.020448e-01 3.263813e+04
xxxW<-c(mweibull(1, shape=asol, scale = bsol),mweibull(2, shape=asol, scale = bsol))
cat("The diferences between the empirical and theoretical MOM for weibull")
## The diferences between the empirical and theoretical MOM for weibull
xxx - xxxW
## [1]     0.00 16711.37
cat("Weibull MOM estimates -- (a=shape, b=scale)")
## Weibull MOM estimates -- (a=shape, b=scale)
c(asol,bsol)
## [1] 8.020448e-01 3.263813e+04

As you can see, the default tolerance in the uniroot function does not lead to reasonable weibull MOM estimates. Therefore, we reduce the tolerance to \(tol=10^{-15}\) and narrow down the search interval to \((0.7,0.9)\) from the previous search interval of \((0.1,1)\), and we get that

#fW <- function(x) {(xxx[2]/xxx[1]^2)*(gamma(1+(1/x)))^2-gamma(1+(2/x))}
asol1<-uniroot(fW, c(0.7,0.9), tol=10^(-15))$root
bsol1<-xxx[1]/gamma(1+(1/asol1))
xxxW1<-c(mweibull(1, shape=asol1, scale = bsol1),mweibull(2, shape=asol1, scale = bsol1))
cat("The diferences between the empirical and theoretical MOM for weibull")
## The diferences between the empirical and theoretical MOM for weibull
xxx -xxxW1
## [1]  0.000000e+00 -4.768372e-07
cat("Weibull MOM estimates -- (a=shape, b=scale)")
## Weibull MOM estimates -- (a=shape, b=scale)
c(asol1,bsol1)
## [1] 8.020419e-01 3.263804e+04

The above estimates are reasonable since the empirical and theoretical moments agree up to an error of order \(10^{-7}\), even though there are small differences between the MOM outputs \(\big(\hat{a}_{MOM},\hat{b}_{MOM}\big)\) obtained with the default and the reduced tolerance.

It is a good exercise for you to try finding a remedy for the invweibull MOM estimation, which is similar to the weibull remedy from above.

As a final note, you should be aware that some ad-hoc MOM estimations are based on dispersion measures as well. For example, if one fits the gamma family, which has two parameters, one may try matching the first moment and the variance, instead of matching the first and second moments. This is not a big issue, though the variance could be estimated via the unbiased or the asymptotically unbiased estimator – each having its pros and cons depending on the sample size – and the two choices lead to different sets of estimates; recall that even a small difference in parameter estimates could lead to massive model error, as we noticed in the previous weibull MOM estimation.
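To illustrate this point – a minimal sketch, not part of the fitdistrplus workflow – the gamma parameters matched to the sample mean and variance are \(shape=m_1^2/v\) and \(rate=m_1/v\), and the two variance estimators give (slightly) different answers:

## ad-hoc gamma MOM via mean/variance matching, with the two variance estimators
m <- mean(Sample1)
v.unbiased <- var(Sample1)                                       ## denominator n - 1
v.asymp    <- var(Sample1)*(length(Sample1) - 1)/length(Sample1) ## denominator n
c(shape = m^2/v.unbiased, rate = m/v.unbiased)
c(shape = m^2/v.asymp,    rate = m/v.asymp)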

2.4 LogNormal Sample – Further Analysis


In this section, you learn to:

  • Compare competitive fitting models;
  • Use bootstrapping to robustify your estimation;
  • Understand the effect of sample size over the objectives from above.

So far, Sample 1 has been drawn from a LogNormal distribution – a moderately heavy tailed distribution – with parameters \(\mu=10\) and \(\sigma=1\). The size of Sample 1 has been \(n=500\); we now take five samples of sizes \(n=\{25, 50, 100, 250, 500\}\) as follows:

set.seed(1234) #Generate Sample1 and save in a list
n=c(25, 50, 100, 250, 500)
lnorm.meanlog=10
lnorm.sdlog=1
Sample1.list=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
Sample1=rlnorm(n[i], meanlog=lnorm.meanlog, sdlog=lnorm.sdlog)
Sample1.list[[i]]=Sample1
}

We then perform a parametric estimation – via MLE and MOM – for all five samples when the underlying distribution is one of the eight parametric families considered before.

Note that our MLE estimates from this section are as given by the R function mledist without checking whether these estimates are points of global maxima, which is beyond the scope of this session. Section 2.2 has detailed what one should do to enhance our confidence in our MLE estimates.

Note also that the MOM estimation is as given by the R function mmedist, which we have already seen can be far off for some parametric choices; specifically, when \(n=500\), the weibull and invweibull MOM estimations are not acceptable, though the pareto estimates turned out to be fine. We will not repeat the same exercise as in Section 2.3 to rectify these shortcomings, but one should keep these possible drawbacks in mind.

Here is the MLE and MOM estimation for all samples and parametric assumptions.

#MLE and MME for 8 different distributions and saved in each matrix
M1.MLE.lnorm=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("meanlog","sdlog")))
M2.MLE.exp=matrix(0,length(n),1,dimnames=list(c("n25","n50","n100","n250","n500"),c("rate")))
M3.MLE.gamma=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","rate")))
M4.MLE.weibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M5.MLE.llogis=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M6.MLE.invweibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M7.MLE.pareto=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M8.MLE.trgamma=matrix(0,length(n),3,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape1","shape2","rate")))
M1.MME.lnorm=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("meanlog","sdlog")))
M2.MME.exp=matrix(0,length(n),1,dimnames=list(c("n25","n50","n100","n250","n500"),c("rate")))
M3.MME.gamma=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","rate")))
M4.MME.weibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M5.MME.llogis=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M6.MME.invweibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M7.MME.pareto=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M8.MME.trgamma=matrix(0,length(n),3,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape1","shape2","rate")))
for (i in 1:length(n)) {
#lnorm
M1.MLE.lnorm[i,]=mledist(Sample1.list[[i]], "lnorm")$estimate
M1.MME.lnorm[i,]=mmedist(Sample1.list[[i]], "lnorm")$estimate
#exp
MLE.exp=mledist(Sample1.list[[i]]/1000, "exp")
#if (is.na(MLE.exp$estimate)==is.na(NA)) {MLE.exp$estimate=1/mean(Sample1.list[[i]])}
M2.MLE.exp[i,]=MLE.exp$estimate/1000
M2.MME.exp[i,]=mmedist(Sample1.list[[i]], "exp")$estimate
#gamma
MLE.gamma=mledist(Sample1.list[[i]], "gamma",lower=c(0, 0), start=list(shape =1, rate=1))  #when the meanlog is too large, become NA
M3.MLE.gamma[i,]=MLE.gamma$estimate
#if (is.na(MLE.gamma$estimate[1])==is.na(NA)) {
#  MLE.gamma$estimate=gammaMLE(Sample1.list[[i]])
#  MLE.gamma=c(MLE.gamma$estimate[1],1/MLE.gamma$estimate[2])
#  M3.MLE.gamma[i,]=MLE.gamma}
M3.MME.gamma[i,]=mmedist(Sample1.list[[i]], "gamma")$estimate
#weibull
M4.MLE.weibull[i,]=mledist(Sample1.list[[i]], "weibull")$estimate
memp  <-  function(x, order) mean(x^order)
MME.weibull=mmedist(Sample1.list[[i]], "weibull", order=c(1, 2), memp=memp)
M4.MME.weibull[i,]=MME.weibull$estimate
#llogis
M5.MLE.llogis[i,]=mledist(Sample1.list[[i]], "llogis")$estimate
memp  <-  function(x, order) mean(x^order)
MME.llogis=mmedist(Sample1.list[[i]], "llogis", order=c(1, 2), memp=memp)
M5.MME.llogis[i,]=MME.llogis$estimate
#invweibull  
M6.MLE.invweibull[i,]=mledist(Sample1.list[[i]], "invweibull")$estimate
memp  <-  function(x, order) mean(x^order)
MME.invweibull=mmedist(Sample1.list[[i]], "invweibull", order=c(1, 2), memp=memp)
M6.MME.invweibull[i,]=MME.invweibull$estimate
#pareto
M7.MLE.pareto[i,]=mledist(Sample1.list[[i]], "pareto")$estimate
memp  <-  function(x, order) mean(x^order)
MME.pareto=mmedist(Sample1.list[[i]], "pareto", order=c(1, 2), memp=memp)
M7.MME.pareto[i,]=MME.pareto$estimate
#trgamma
M8.MLE.trgamma[i,]=mledist(Sample1.list[[i]], "trgamma")$estimate
memp  <-  function(x, order) mean(x^order)
MME.trgamma=mmedist(Sample1.list[[i]], "trgamma", order=c(1, 2, 3), memp=memp)
M8.MME.trgamma[i,]=MME.trgamma$estimate
}
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming lnorm family) as follows:

#print table
kable(cbind(M1.MLE.lnorm, M1.MME.lnorm), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.1: MLE estimates (left) and MOM estimates (right)
meanlog sdlog meanlog sdlog
n25 9.758218 0.9048491 9.643222 1.1298040
n50 9.773453 1.0790497 9.872765 1.0785899
n100 10.052040 0.8846443 10.091675 0.8532116
n250 10.035170 1.0340745 10.039688 1.0131995
n500 9.954018 0.9791639 9.952332 0.9837266

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming exp family) as follows:

#print table
kable(cbind(M2.MLE.exp, M2.MME.exp), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.2: MLE estimates (left) and MOM estimates (right)
rate rate
n25 3.43e-05 3.43e-05
n50 2.88e-05 2.88e-05
n100 2.88e-05 2.88e-05
n250 2.61e-05 2.61e-05
n500 2.94e-05 2.94e-05

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming gamma family) as follows:

#print table
kable(cbind(M3.MLE.gamma, M3.MME.gamma), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.3: MLE estimates (left) and MOM estimates (right)
shape rate shape rate
n25 1.0918842 3.74e-05 0.3870090 1.33e-05
n50 0.8632702 2.49e-05 0.4544097 1.31e-05
n100 1.3814064 3.98e-05 0.9338181 2.69e-05
n250 1.1027760 2.88e-05 0.5581965 1.46e-05
n500 1.1761067 3.45e-05 0.6127707 1.80e-05

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming weibull family) as follows:

#print table
kable(cbind(M4.MLE.weibull, M4.MME.weibull), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.4: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 0.9437471 29,543.90 0.7860005 29,567.97
n50 0.8482235 31,171.88 0.7734185 32,456.36
n100 1.1193522 36,437.53 1.1269249 39,020.15
n250 0.9823712 37,948.02 0.9101338 41,143.93
n500 1.0165764 34,341.51 0.9437451 36,941.55

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming llogis family) as follows:

#print table
kable(cbind(M5.MLE.llogis, M5.MME.llogis), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.5: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 0.1879159 17,280.25 126,073,923 2,071,502
n50 1.6328594 15,686.76 NA NA
n100 0.1950570 23,373.39 NA NA
n250 1.7470879 23,175.25 NA NA
n500 1.7940411 20,968.28 NA NA

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming invweibull family) as follows:

#print table
kable(cbind(M6.MLE.invweibull, M6.MME.invweibull), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.6: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 0.5934289 1,714.3955 6,404.055 55,248.86
n50 0.5975857 1,616.6989 7,926.824 62,073.97
n100 0.6110728 2,850.8647 6,417.635 49,987.98
n250 0.3123593 365.3364 8,418.296 63,971.32
n500 0.4255051 902.3304 6,379.283 55,269.19

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming pareto family) as follows:

#print table
kable(cbind(M7.MLE.pareto, M7.MME.pareto), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.7: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 3.930348 82,776.96 3.262691 66,039.26
n50 2.395978 49,999.88 3.665754 92,496.85
n100 30.655653 1,030,842.25 30.219733 1,015,106.90
n250 6.199372 197,403.65 4.526900 135,049.08
n500 7.722460 227,712.70 5.164899 141,900.73

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming trgamma family) as follows:

#print table
kable(cbind(M8.MLE.trgamma, M8.MME.trgamma), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.8: MLE estimates (left) and MOM estimates (right)
shape1 shape2 rate shape1 shape2 rate
n25 9.115813 0.3457418 0.0288253 NA NA NA
n50 7.066169 0.3303238 0.0168226 NA NA NA
n100 7.186013 0.4164912 0.0040745 NA NA NA
n250 6.720832 0.3822265 0.0052258 NA NA NA
n500 8.476387 0.3523861 0.0171844 NA NA NA

There is a lot to absorb from these estimates, and we will not analyse every single set of estimates, but we note the following:

  1. MLE and MOM estimates are identical when fitting exp, as expected, since both equal \(1/m_1\), where \(m_1\) is the sample mean estimator (see the quick check after this list);
  2. There are no obvious concerns with almost all MLE estimates, though the pareto estimates for \(n=100\) are very different from those for all other sample sizes;
  3. There are no obvious concerns with the MOM estimates when fitting the lnorm, exp and gamma families;
  4. As in Section 2.3, there is no doubt that MOM fails when fitting the llogis and trgamma families;
  5. As in Section 2.3, the MOM estimates when fitting the weibull, invweibull and pareto families should be checked and bespoke remedies might be needed; the MOM (like the MLE) estimates for the pareto family when \(n=100\) are very different from those for all other sample sizes.
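A quick check of the first point – a sketch using the samples generated above – confirms that the fitted exp rate is simply the reciprocal of the sample mean for every sample size:

## exponential MLE/MOM rate equals 1/(sample mean) for each sample size
sapply(Sample1.list, function(s) 1/mean(s)) ## compare with the rate columns of Table 2.2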

Let us check the MOM estimates for pareto family when \(n=100\), though the MLE counterpart has the same issue.

xxx <- c(mean(Sample1.list[[3]]),mean(Sample1.list[[3]]^2))
xxxP<-c(mpareto(1, shape=M7.MME.pareto[3,1], scale = M7.MME.pareto[3,2]),mpareto(2, shape=M7.MME.pareto[3,1], scale = M7.MME.pareto[3,2])) #first and second theoretical (raw) moments of the fitted pareto
cat("First two (raw) moments of Sample 1 with n=100 -- empirical estimates")
## First two (raw) moments of Sample 1 with n=100 -- empirical estimates
xxx
## [1] 3.474046e+04 2.499335e+09
cat("First two (raw) moments of Sample 1 with n=100 -- theoretical moments of the pareto fitted via MOM")
## First two (raw) moments of Sample 1 with n=100 -- theoretical moments of the pareto fitted via MOM
xxxP
cat("The differences between the empirical and theoretical moments for pareto")
## The differences between the empirical and theoretical moments for pareto
xxx-xxxP

As a bespoke remedy, we can also derive the pareto MOM estimates from first principles rather than relying on the numerical optimiser. That is, we have to solve the following system of equations: \[\mathbb{E}[X]=\frac{\lambda}{\alpha-1}=m_1\quad\text{and}\quad\mathbb{E}\big[X^2\big]=\frac{2\lambda^2}{(\alpha-1)(\alpha-2)}=m_2,\] where \(m_1\) and \(m_2\) are the first and second empirical moment estimates. Recall that the shape parameter should satisfy \(\alpha>2\) so that the first two moments are finite, and \(\lambda>0\) must always hold, since our pareto parametrisation has the following cumulative distribution function: \[ F_{Pareto}(x;\alpha,\lambda)=\Pr(X\le x)=1-\left(\frac{\lambda}{\lambda+x}\right)^\alpha\quad\text{for all}\;x\ge0\quad\text{with}\;\alpha,\lambda>0. \] Our system of equations requires that \(\lambda=m_1(\alpha-1)\), which in turn gives \[ m_2=2m_1^2\cdot\frac{\alpha-1}{\alpha-2}\implies \hat{\alpha}_{MOM}=2+\left(\frac{m_2}{2m_1^2}-1\right)^{-1}\quad\text{and}\quad \hat{\lambda}_{MOM}=m_1(\hat{\alpha}_{MOM}-1). \] Recall that \(\hat{\alpha}_{MOM}>2\) and \(\hat{\lambda}_{MOM}>0\) must hold, which is true as long as \(\frac{m_2}{2m_1^2}-1>0\); for this particular sample \(\frac{m_2}{2m_1^2}-1=0.035436>0\), so the MOM estimates are feasible. We can now check whether we got what we need, not only what we hoped for:

xxx <- c(mean(Sample1.list[[3]]),mean(Sample1.list[[3]]^2))
alpha1MOM <- 2+(xxx[2]/(2*xxx[1]^2)-1)^(-1)
lambda1MOM <- (alpha1MOM-1)*xxx[1]
cat("Pareto MOM estimates of Samples 1 with n=100 -- (alpha1=shape, lambda1=scale)")
## Pareto MOM estimates of Samples 1 with n=100 -- (alpha1=shape, lambda1=scale)
c(alpha1MOM,lambda1MOM)
## [1] 3.021973e+01 1.015107e+06
xxxP1<-c(mpareto(1, shape=alpha1MOM, scale = lambda1MOM),mpareto(2, shape=alpha1MOM, scale = lambda1MOM))
cat("First two (raw) moments of Samples 1 with n=100 -- empirical estimates")
## First two (raw) moments of Samples 1 with n=100 -- empirical estimates
xxx
## [1] 3.474046e+04 2.499335e+09
cat("First two (raw) moments of Samples 1 with n=100 -- theoretical MOM pareto")
## First two (raw) moments of Samples 1 with n=100 -- theoretical MOM pareto
xxxP1
## [1] 3.474046e+04 2.499335e+09
cat("The diferences between the empirical and theoretical MOM for pareto")
## The diferences between the empirical and theoretical MOM for pareto
xxx-xxxP1
## [1] 1.964509e-10 4.959106e-05

We notice that the empirical and theoretical moments are extremely close to each other, with errors of at most order \(10^{-5}\). These first-principles MOM estimates for pareto will not be used further, so that the R code does not become more cumbersome than it already is, but keep in mind that some sets of MOM estimates are not very accurate, particularly those obtained when fitting the weibull, invweibull and pareto families.
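
For reference, the closed-form estimator derived above can be wrapped into a small helper function; this is a minimal sketch of our own (the name pareto.mom is not used elsewhere in this lab).

pareto.mom <- function(x) {
  #closed-form Pareto (Lomax) MOM estimates from the first two empirical raw moments;
  #returns NA when m2/(2*m1^2) <= 1, i.e. when the moment equations have no feasible solution
  m1 <- mean(x); m2 <- mean(x^2)
  ratio <- m2/(2*m1^2) - 1
  if (ratio <= 0) return(c(shape = NA, scale = NA))
  alpha <- 2 + 1/ratio          #shape estimate, always > 2 when feasible
  lambda <- m1*(alpha - 1)      #scale estimate
  c(shape = alpha, scale = lambda)
}
pareto.mom(Sample1.list[[3]])   #reproduces alpha1MOM and lambda1MOM computed above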

Q-Q plots and P-P plots are intuitive graphical tools for comparing two given distributions. Q-Q plots tend to be more useful, but for completeness, both types of plot are provided as follows:

  1. Compare the MLE and MOM fitted distributions (by assuming lnorm) with the empirical distribution – see the Q-Q plots from Figure 2.3;
  2. Compare the MLE and MOM fitted distributions (by assuming lnorm) with the empirical distribution – see the P-P plots from Figure 2.4;
  3. Compare the MLE and MOM fitted distributions (by assuming exp) with the empirical distribution – see the Q-Q plots from Figure 2.5;
  4. Compare the MLE and MOM fitted distributions (by assuming exp) with the empirical distribution – see the P-P plots from Figure 2.6.

Note that the five samples (of sizes \(n=\{25, 50, 100, 250, 500\}\)) are drawn/simulated from LogNormal, i.e. the true distribution is LogNormal, but these samples are fitted via MLE/MOM by assuming the lnorm and exp families. Clearly, the other six parametric assumptions could be checked in the same fashion.

par(mfrow = c(2, 5))
mle.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=25)")
mle.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=50)")
mle.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=100)")
mle.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=250)")
mle.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=500)")
mme.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=25)")
mme.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=50)")
mme.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=100)")
mme.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=250)")
mme.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=500)")

Figure 2.3: Q-Q plots (by assuming lnorm) for MLE (top) and MOM (bottom)
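
As an aside, the ten nearly identical calls used to build Figure 2.3 could be generated with a short loop; a minimal sketch of our own that produces the same panels:

sizes <- c(25, 50, 100, 250, 500)
par(mfrow = c(2, 5))
for (m in c("mle", "mme")) {   #first pass fills the top row (MLE), second pass the bottom row (MOM)
  for (i in seq_along(sizes)) {
    fit <- fitdist(Sample1.list[[i]], "lnorm", method = m)
    qqcomp(list(fit), legendtext = paste0("lnorm (", m, ", n=", sizes[i], ")"))
  }
}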

par(mfrow = c(2, 5))
mle.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=25)")
mle.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=50)")
mle.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=100)")
mle.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=250)")
mle.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=500)")
mme.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=25)")
mme.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=50)")
mme.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=100)")
mme.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=250)")
mme.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=500)")

Figure 2.4: P-P plots (by assuming lnorm) for MLE (top) and MOM (bottom)

#Here, in order to use fitdist for exp, we re-scale the data sample (only for this purpose)
set.seed(1234) #Re-generate Sample1 and save the re-scaled samples in a list
n=c(25, 50, 100, 250, 500)
lnorm.meanlog=10
lnorm.sdlog=1
Sample1.list.rescaled=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
Sample1=rlnorm(n[i], meanlog=lnorm.meanlog, sdlog=lnorm.sdlog)
Sample1.list.rescaled[[i]]=Sample1/1000
}
par(mfrow = c(2, 5))
mle.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=25)")
mle.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=50)")
mle.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=100)")
mle.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=250)")
mle.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=500)")
mme.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=25)")
mme.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=50)")
mme.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=100)")
mme.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=250)")
mme.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=500)")

Figure 2.5: Q-Q plots (by assuming exp) for MLE (top) and MOM (bottom)

par(mfrow = c(2, 5))
mle.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=25)")
mle.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=50)")
mle.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=100)")
mle.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=250)")
mle.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=500)")
mme.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=25)")
mme.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=50)")
mme.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=100)")
mme.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=250)")
mme.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=500)")

Figure 2.6: P-P plots (by assuming exp) for MLE (top) and MOM (bottom)

Quantiles play an important role in fitting the best possible distribution. In addition, the real data analysis from Section 3 requires estimating quantiles at levels \(\{50\%, 75\%, 95\%\}\). Therefore, we next estimate the same quantiles so that we understand the estimation error in a controlled setting where the data are simulated.

The following quantile-related computations are tabulated: i) `True’ quantile values and ii) MLE and MOM quantile estimates for the various sample sizes.

Quantile.true=qlnorm(c(0.5,0.75,0.95), meanlog=lnorm.meanlog, sdlog=lnorm.sdlog, lower.tail = TRUE, log.p = FALSE)
Quantile.true=matrix(Quantile.true,1,3,dimnames=list(c("Quantile value"),c("(50%)","(75%)","(95%)")))
kable(Quantile.true, digits = 2, format.args = list(big.mark = ","), caption="Quantile values of the true LogNormal distribution at levels 50%, 75% and 95%")
Table 2.9: Quantile values of the true LogNormal distribution at levels 50%, 75% and 95%
(50%) (75%) (95%)
Quantile value 22,026.47 43,238.64 114,102.6
suppressWarnings(library(stats))
Table1.1=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","weibull","llogis","invwei","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n25 <- fitdist(Sample1.list[[1]], "lnorm", method="mle")
Quantile.lnorm.mle.n25=quantile(mle.fln.n25, probs = c(0.5, 0.75, 0.95))
mme.fln.n25 <- fitdist(Sample1.list[[1]], "lnorm", method="mme")
Quantile.lnorm.mme.n25=quantile(mme.fln.n25, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n25 <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mle") #Use re-scaled value
Quantile.exp.mle.n25=quantile(mle.fe.n25, probs = c(0.5, 0.75, 0.95))
mme.fe.n25 <- fitdist(Sample1.list[[1]], "exp", method="mme")
Quantile.exp.mme.n25=quantile(mme.fe.n25, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n25 <- fitdist(Sample1.list[[1]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n25=quantile(mle.fg.n25, probs = c(0.5, 0.75, 0.95))
mme.fg.n25 <- fitdist(Sample1.list[[1]], "gamma", method="mme")
Quantile.gamma.mme.n25=quantile(mme.fg.n25, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n25 <- fitdist(Sample1.list[[1]], "weibull", method="mle")
Quantile.weibull.mle.n25=quantile(mle.fw.n25, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n25 <- fitdist(Sample1.list[[1]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n25=quantile(mme.fw.n25, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n25 <- fitdist(Sample1.list[[1]], "llogis", method="mle")
Quantile.llogis.mle.n25=quantile(mle.fl.n25, probs = c(0.5, 0.75, 0.95))
#mme.fl.n25 <- fitdist(Sample1.list[[1]], "llogis", method="mme", order=c(1, 2), memp=memp)
#Quantile.llogis.mme.n25=quantile(mme.fl.n25, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n25 <- fitdist(Sample1.list[[1]], "invweibull", method="mle")
Quantile.invweibull.mle.n25=quantile(mle.fi.n25, probs = c(0.5, 0.75, 0.95))
mme.fi.n25 <- fitdist(Sample1.list[[1]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n25=quantile(mme.fi.n25, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n25 <- fitdist(Sample1.list[[1]], "pareto", method="mle")
Quantile.pareto.mle.n25=quantile(mle.fp.n25, probs = c(0.5, 0.75, 0.95))
mme.fp.n25 <- fitdist(Sample1.list[[1]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n25=quantile(mme.fp.n25, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n25 <- fitdist(Sample1.list.rescaled[[1]], "trgamma", method="mle",optim.method="SANN") #Use re-scaled
Quantile.trgamma.mle.n25=quantile(mle.ft.n25, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n25 <- fitdist(Sample1.list[[1]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n25=quantile(mme.ft.n25, probs = c(0.5, 0.75, 0.95))
#Table
Table1.1[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n25$quantiles)-Quantile.true)
Table1.1[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n25$quantiles)-Quantile.true)
Table1.1[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n25$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.1[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n25$quantiles)-Quantile.true)
Table1.1[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n25$quantiles)-Quantile.true)
Table1.1[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n25$quantiles)-Quantile.true)
Table1.1[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n25$quantiles)-Quantile.true)
Table1.1[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n25$quantiles)-Quantile.true)
Table1.1[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n25$quantiles)-Quantile.true)
#Table1.1[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n25$quantiles)-Quantile.true)
Table1.1[5,c(1,3,5)]=c(NA,NA,NA)
Table1.1[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n25$quantiles)-Quantile.true)
#Table1.1[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n25$quantiles)-Quantile.true)
Table1.1[6,c(1,3,5)]=c(NA,NA,NA)
Table1.1[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n25$quantiles)-Quantile.true)
Table1.1[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n25$quantiles)-Quantile.true)
Table1.1[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n25$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.1[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n25$quantiles)-Quantile.true)
#print table
kable(Table1.1, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=25")
Table 2.10: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=25
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 6,609.535 4,730.688 10,205.622 11,397.029 15,230.399 3.748657e+04
exp 1,796.157 1,796.158 2,778.020 2,778.021 26,668.697 2.666870e+04
gamma 11,845.148 1,114.471 6,734.219 2,821.699 8,550.559 2.933522e+04
weibull 3,477.938 1,990.717 1,563.530 1,476.883 5,312.393 1.961538e+04
llogis NA 4,746.214 NA 5,934,882.525 NA 1.102743e+11
invwei NA 18,847.107 NA 29,245.701 NA 1.416476e+05
Pareto 6,394.976 6,061.714 8,275.686 8,230.097 14,731.784 1.948882e+04
trgamma NA 3,392.109 NA 8,640.248 NA 3.673115e+04
Table1.2=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","Weibull","llogis","invWeibull","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n50 <- fitdist(Sample1.list[[2]], "lnorm", method="mle")
Quantile.lnorm.mle.n50=quantile(mle.fln.n50, probs = c(0.5, 0.75, 0.95))
mme.fln.n50 <- fitdist(Sample1.list[[2]], "lnorm", method="mme")
Quantile.lnorm.mme.n50=quantile(mme.fln.n50, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n50 <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mle") #Use re-scaled
Quantile.exp.mle.n50=quantile(mle.fe.n50, probs = c(0.5, 0.75, 0.95))
mme.fe.n50 <- fitdist(Sample1.list[[2]], "exp", method="mme")
Quantile.exp.mme.n50=quantile(mme.fe.n50, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n50 <- fitdist(Sample1.list[[2]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n50=quantile(mle.fg.n50, probs = c(0.5, 0.75, 0.95))
mme.fg.n50 <- fitdist(Sample1.list[[2]], "gamma", method="mme")
Quantile.gamma.mme.n50=quantile(mme.fg.n50, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n50 <- fitdist(Sample1.list[[2]], "weibull", method="mle")
Quantile.weibull.mle.n50=quantile(mle.fw.n50, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n50 <- fitdist(Sample1.list[[2]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n50=quantile(mme.fw.n50, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n50 <- fitdist(Sample1.list[[2]], "llogis", method="mle")
Quantile.llogis.mle.n50=quantile(mle.fl.n50, probs = c(0.5, 0.75, 0.95))
mme.fl.n50 <- fitdist(Sample1.list[[2]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n50=quantile(mme.fl.n50, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n50 <- fitdist(Sample1.list[[2]], "invweibull", method="mle")
Quantile.invweibull.mle.n50=quantile(mle.fi.n50, probs = c(0.5, 0.75, 0.95))
mme.fi.n50 <- fitdist(Sample1.list[[2]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n50=quantile(mme.fi.n50, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n50 <- fitdist(Sample1.list[[2]], "pareto", method="mle")
Quantile.pareto.mle.n50=quantile(mle.fp.n50, probs = c(0.5, 0.75, 0.95))
mme.fp.n50 <- fitdist(Sample1.list[[2]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n50=quantile(mme.fp.n50, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n50 <- fitdist(Sample1.list.rescaled[[2]], "trgamma", method="mle",optim.method="SANN") #Use re-scaled
Quantile.trgamma.mle.n50=quantile(mle.ft.n50, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n50 <- fitdist(Sample1.list[[2]], "trgamma", method="mme",optim.method="SANN", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n50=quantile(mme.ft.n50, probs = c(0.5, 0.75, 0.95))
#Table
Table1.2[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n50$quantiles)-Quantile.true)
Table1.2[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n50$quantiles)-Quantile.true)
Table1.2[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n50$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.2[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n50$quantiles)-Quantile.true)
Table1.2[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n50$quantiles)-Quantile.true)
Table1.2[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n50$quantiles)-Quantile.true)
Table1.2[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n50$quantiles)-Quantile.true)
Table1.2[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n50$quantiles)-Quantile.true)
Table1.2[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n50$quantiles)-Quantile.true)
#Table1.2[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n50$quantiles)-Quantile.true)
Table1.2[5,c(1,3,5)]=c(NA,NA,NA)
Table1.2[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n50$quantiles)-Quantile.true)
#Table1.2[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n50$quantiles)-Quantile.true)
Table1.2[6,c(1,3,5)]=c(NA,NA,NA)
Table1.2[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n50$quantiles)-Quantile.true)
Table1.2[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n50$quantiles)-Quantile.true)
Table1.2[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n50$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.2[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n50$quantiles)-Quantile.true)
#print table
kable(Table1.2, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=50")
Table 2.11: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=50
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 2,631.579 4,465.168 3,093.259 6,877.322 232.280 10,498.602
exp 2,024.488 2,024.488 4,863.270 4,863.270 10,156.143 10,156.143
gamma 7,596.424 541.577 1,872.793 4,812.937 23,784.094 4,591.929
Weibull 1,819.864 1,791.306 6,273.690 2,575.692 19,988.347 463.243
llogis NA 6,339.707 NA 12,496.572 NA 18,895.216
invWeibull NA 18,847.107 NA 29,245.701 NA 141,647.586
Pareto 2,773.589 5,252.152 725.469 4,062.445 2,831.833 10,469.517
trgamma NA 3,351.292 NA 3,175.781 NA 7,042.718
Table1.3=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","Weibull","llogis","invWeibull","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n100 <- fitdist(Sample1.list[[3]], "lnorm", method="mle")
Quantile.lnorm.mle.n100=quantile(mle.fln.n100, probs = c(0.5, 0.75, 0.95))
mme.fln.n100 <- fitdist(Sample1.list[[3]], "lnorm", method="mme")
Quantile.lnorm.mme.n100=quantile(mme.fln.n100, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n100 <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mle")
Quantile.exp.mle.n100=quantile(mle.fe.n100, probs = c(0.5, 0.75, 0.95))
mme.fe.n100 <- fitdist(Sample1.list[[3]], "exp", method="mme")
Quantile.exp.mme.n100=quantile(mme.fe.n100, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n100 <- fitdist(Sample1.list[[3]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n100=quantile(mle.fg.n100, probs = c(0.5, 0.75, 0.95))
mme.fg.n100 <- fitdist(Sample1.list[[3]], "gamma", method="mme")
Quantile.gamma.mme.n100=quantile(mme.fg.n100, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n100 <- fitdist(Sample1.list[[3]], "weibull", method="mle")
Quantile.weibull.mle.n100=quantile(mle.fw.n100, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n100 <- fitdist(Sample1.list[[3]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n100=quantile(mme.fw.n100, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n100 <- fitdist(Sample1.list[[3]], "llogis", method="mle")
Quantile.llogis.mle.n100=quantile(mle.fl.n100, probs = c(0.5, 0.75, 0.95))
mme.fl.n100 <- fitdist(Sample1.list[[3]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n100=quantile(mme.fl.n100, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n100 <- fitdist(Sample1.list[[3]], "invweibull", method="mle")
Quantile.invweibull.mle.n100=quantile(mle.fi.n100, probs = c(0.5, 0.75, 0.95))
mme.fi.n100 <- fitdist(Sample1.list[[3]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n100=quantile(mme.fi.n100, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n100 <- fitdist(Sample1.list[[3]], "pareto", method="mle")
Quantile.pareto.mle.n100=quantile(mle.fp.n100, probs = c(0.5, 0.75, 0.95))
mme.fp.n100 <- fitdist(Sample1.list[[3]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n100=quantile(mme.fp.n100, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n100 <- fitdist(Sample1.list.rescaled[[3]], "trgamma", method="mle",optim.method="SANN")
Quantile.trgamma.mle.n100=quantile(mle.ft.n100, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n100 <- fitdist(Sample1.list[[3]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n100=quantile(mme.ft.n100, probs = c(0.5, 0.75, 0.95))
#Table
Table1.3[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n100$quantiles)-Quantile.true)
Table1.3[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n100$quantiles)-Quantile.true)
Table1.3[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n100$quantiles)*1000-Quantile.true)
Table1.3[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n100$quantiles)-Quantile.true)
Table1.3[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n100$quantiles)-Quantile.true)
Table1.3[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n100$quantiles)-Quantile.true)
Table1.3[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n100$quantiles)-Quantile.true)
Table1.3[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n100$quantiles)-Quantile.true)
Table1.3[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n100$quantiles)-Quantile.true)
#Table1.3[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n100$quantiles)-Quantile.true)
Table1.3[5,c(1,3,5)]=c(NA,NA,NA)
Table1.3[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n100$quantiles)-Quantile.true)
#Table1.3[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n100$quantiles)-Quantile.true)
Table1.3[6,c(1,3,5)]=c(NA,NA,NA)
Table1.3[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n100$quantiles)-Quantile.true)
Table1.3[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n100$quantiles)-Quantile.true)
Table1.3[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n100$quantiles)*1000-Quantile.true)
Table1.3[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n100$quantiles)-Quantile.true)
#print table
kable(Table1.3, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=100")
Table 2.12: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=100
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 2,114.735 1,176.609 315.869 1,099.864 15,871.023 1.467848e+04
exp 2,053.784 2,053.784 4,921.863 4,921.863 10,029.525 1.002952e+04
gamma 1,383.222 4,788.592 4,923.187 4,528.691 7,471.584 2.104357e+04
Weibull 6,160.097 4,236.670 8,900.913 5,545.530 10,797.083 1.699704e+04
llogis NA 1,346.923 NA 6,484,788.997 NA 8.404552e+10
invWeibull NA 18,847.107 NA 29,245.701 NA 1.416476e+05
Pareto 1,526.024 1,547.151 4,412.807 4,447.684 8,316.650 8.280262e+03
trgamma NA 2,812.880 NA 1,555.266 NA 1.756407e+04
Table1.4=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","weibull","llogis","invwei","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n250 <- fitdist(Sample1.list[[4]], "lnorm", method="mle")
Quantile.lnorm.mle.n250=quantile(mle.fln.n250, probs = c(0.5, 0.75, 0.95))
mme.fln.n250 <- fitdist(Sample1.list[[4]], "lnorm", method="mme")
Quantile.lnorm.mme.n250=quantile(mme.fln.n250, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n250 <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mle")
Quantile.exp.mle.n250=quantile(mle.fe.n250, probs = c(0.5, 0.75, 0.95))
mme.fe.n250 <- fitdist(Sample1.list[[4]], "exp", method="mme")
Quantile.exp.mme.n250=quantile(mme.fe.n250, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n250 <- fitdist(Sample1.list[[4]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n250=quantile(mle.fg.n250, probs = c(0.5, 0.75, 0.95))
mme.fg.n250 <- fitdist(Sample1.list[[4]], "gamma", method="mme")
Quantile.gamma.mme.n250=quantile(mme.fg.n250, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n250 <- fitdist(Sample1.list[[4]], "weibull", method="mle")
Quantile.weibull.mle.n250=quantile(mle.fw.n250, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n250 <- fitdist(Sample1.list[[4]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n250=quantile(mme.fw.n250, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n250 <- fitdist(Sample1.list[[4]], "llogis", method="mle")
Quantile.llogis.mle.n250=quantile(mle.fl.n250, probs = c(0.5, 0.75, 0.95))
mme.fl.n250 <- fitdist(Sample1.list[[4]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n250=quantile(mme.fl.n250, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n250 <- fitdist(Sample1.list[[4]], "invweibull", method="mle")
Quantile.invweibull.mle.n250=quantile(mle.fi.n250, probs = c(0.5, 0.75, 0.95))
mme.fi.n250 <- fitdist(Sample1.list[[4]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n250=quantile(mme.fi.n250, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n250 <- fitdist(Sample1.list[[4]], "pareto", method="mle")
Quantile.pareto.mle.n250=quantile(mle.fp.n250, probs = c(0.5, 0.75, 0.95))
mme.fp.n250 <- fitdist(Sample1.list[[4]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n250=quantile(mme.fp.n250, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n250 <- fitdist(Sample1.list.rescaled[[4]], "trgamma", method="mle",optim.method="SANN")
Quantile.trgamma.mle.n250=quantile(mle.ft.n250, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n250 <- fitdist(Sample1.list[[4]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n250=quantile(mme.ft.n250, probs = c(0.5, 0.75, 0.95))
#Table
Table1.4[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n250$quantiles)-Quantile.true)
Table1.4[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n250$quantiles)-Quantile.true)
Table1.4[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n250$quantiles)*1000-Quantile.true)
Table1.4[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n250$quantiles)-Quantile.true)
Table1.4[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n250$quantiles)-Quantile.true)
Table1.4[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n250$quantiles)-Quantile.true)
Table1.4[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n250$quantiles)-Quantile.true)
Table1.4[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n250$quantiles)-Quantile.true)
Table1.4[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n250$quantiles)-Quantile.true)
#Table1.4[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n250$quantiles)-Quantile.true)
Table1.4[5,c(1,3,5)]=c(NA,NA,NA)
Table1.4[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n250$quantiles)-Quantile.true)
#Table1.4[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n250$quantiles)-Quantile.true)
Table1.4[6,c(1,3,5)]=c(NA,NA,NA)
Table1.4[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n250$quantiles)-Quantile.true)
Table1.4[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n250$quantiles)-Quantile.true)
Table1.4[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n250$quantiles)*1000-Quantile.true)
Table1.4[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n250$quantiles)-Quantile.true)
#print table
kable(Table1.4, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=250")
Table 2.13: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=250
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 891.759 788.462 2,152.874 2,589.015 7,225.328 10,897.672
exp 4,514.943 4,514.943 9,844.182 9,844.181 607.428 607.427
gamma 2,936.292 5,506.932 8,254.973 9,777.674 27,308.961 3,275.622
weibull 5,478.712 4,104.661 15,668.511 9,677.747 23,256.793 1,839.958
llogis NA 1,148.783 NA 224.350 NA 10,911.390
invwei NA 18,847.107 NA 29,245.701 NA 141,647.586
Pareto 319.000 1,326.292 5,149.616 6,229.499 12,601.131 8,545.292
trgamma NA 2,392.174 NA 4,916.701 NA 1,890.435
Table1.5=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","Weibull","llogis","invWeibull","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n500 <- fitdist(Sample1.list[[5]], "lnorm", method="mle")
Quantile.lnorm.mle.n500=quantile(mle.fln.n500, probs = c(0.5, 0.75, 0.95))
mme.fln.n500 <- fitdist(Sample1.list[[5]], "lnorm", method="mme")
Quantile.lnorm.mme.n500=quantile(mme.fln.n500, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n500 <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mle")
Quantile.exp.mle.n500=quantile(mle.fe.n500, probs = c(0.5, 0.75, 0.95))
mme.fe.n500 <- fitdist(Sample1.list[[5]], "exp", method="mme")
Quantile.exp.mme.n500=quantile(mme.fe.n500, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n500 <- fitdist(Sample1.list[[5]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n500=quantile(mle.fg.n500, probs = c(0.5, 0.75, 0.95))
mme.fg.n500 <- fitdist(Sample1.list[[5]], "gamma", method="mme")
Quantile.gamma.mme.n500=quantile(mme.fg.n500, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n500 <- fitdist(Sample1.list[[5]], "weibull", method="mle")
Quantile.weibull.mle.n500=quantile(mle.fw.n500, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n500 <- fitdist(Sample1.list[[5]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n500=quantile(mme.fw.n500, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n500 <- fitdist(Sample1.list[[5]], "llogis", method="mle")
Quantile.llogis.mle.n500=quantile(mle.fl.n500, probs = c(0.5, 0.75, 0.95))
mme.fl.n500 <- fitdist(Sample1.list[[5]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n500=quantile(mme.fl.n500, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n500 <- fitdist(Sample1.list[[5]], "invweibull", method="mle")
Quantile.invweibull.mle.n500=quantile(mle.fi.n500, probs = c(0.5, 0.75, 0.95))
mme.fi.n500 <- fitdist(Sample1.list[[5]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n500=quantile(mme.fi.n500, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n500 <- fitdist(Sample1.list[[5]], "pareto", method="mle")
Quantile.pareto.mle.n500=quantile(mle.fp.n500, probs = c(0.5, 0.75, 0.95))
mme.fp.n500 <- fitdist(Sample1.list[[5]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n500=quantile(mme.fp.n500, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n500 <- fitdist(Sample1.list.rescaled[[5]], "trgamma", method="mle",optim.method="SANN")
Quantile.trgamma.mle.n500=quantile(mle.ft.n500, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n500 <- fitdist(Sample1.list[[5]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n500=quantile(mme.ft.n500, probs = c(0.5, 0.75, 0.95))
#Table
Table1.5[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n500$quantiles)-Quantile.true)
Table1.5[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n500$quantiles)-Quantile.true)
Table1.5[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n500$quantiles)*1000-Quantile.true)
Table1.5[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n500$quantiles)-Quantile.true)
Table1.5[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n500$quantiles)-Quantile.true)
Table1.5[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n500$quantiles)-Quantile.true)
Table1.5[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n500$quantiles)-Quantile.true)
Table1.5[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n500$quantiles)-Quantile.true)
Table1.5[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n500$quantiles)-Quantile.true)
#Table1.5[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n500$quantiles)-Quantile.true)
Table1.5[5,c(1,3,5)]=c(NA,NA,NA)
Table1.5[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n500$quantiles)-Quantile.true)
#Table1.5[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n500$quantiles)-Quantile.true)
Table1.5[6,c(1,3,5)]=c(NA,NA,NA)
Table1.5[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n500$quantiles)-Quantile.true)
Table1.5[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n500$quantiles)-Quantile.true)
Table1.5[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n500$quantiles)*1000-Quantile.true)
Table1.5[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n500$quantiles)-Quantile.true)
#print table
kable(Table1.5, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=500")
Table 2.14: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=500
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 1,025.327 989.887 2,462.778 2,519.477 8,184.871 8,799.418
exp 1,589.495 1,589.495 3,993.285 3,993.284 12,036.150 12,036.151
gamma 3,830.746 3,012.984 3,063.139 3,854.470 7,568.605 17,698.725
Weibull 3,026.117 1,919.941 8,980.098 4,115.913 4,044.073 13,048.882
llogis NA 1,058.191 NA 4,556.497 NA 5,874.291
invWeibull NA 18,847.107 NA 29,245.701 NA 141,647.586
Pareto 1,645.910 642.249 449.642 1,537.962 2,558.748 6,185.344
trgamma NA 362.050 NA 351.373 NA 14,517.625
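
The five blocks above repeat the same computations for each sample size; for reference, here is a small helper function of our own (quantile.error, not used elsewhere in this lab) that returns the absolute quantile estimation errors for one family, one sample and one estimation method:

quantile.error <- function(x, family, method, probs = c(0.5, 0.75, 0.95), ...) {
  fit <- fitdist(x, family, method = method, ...)                        #fit the chosen family
  q.fit <- as.numeric(unlist(quantile(fit, probs = probs)$quantiles))    #fitted quantiles
  abs(q.fit - as.numeric(Quantile.true))                                 #absolute errors vs the true lnorm quantiles
}
quantile.error(Sample1.list[[3]], "lnorm", "mle")  #MLE entries of the lnorm row in Table 2.12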

The Akaike Information Criterion (AIC) is a useful tool for model selection that applies only to MLE and not to MOM. The lowest AIC value indicates the best MLE-based model choice, which is also the convention used in R. From now on, the best distribution chosen through AIC is denoted as Dist00.

Since the MLE for the exp and trgamma families required re-scaling the data, these two distributions are excluded from the AIC-based comparison, though something more clever could have been done.
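
Although the code that assembles Table.AIC is not echoed here, each entry is simply the AIC of the corresponding ML fit, which fitdist returns directly; a minimal sketch of our own for two entries of the \(n=100\) column:

fit.lnorm.n100  <- fitdist(Sample1.list[[3]], "lnorm",  method = "mle")
fit.llogis.n100 <- fitdist(Sample1.list[[3]], "llogis", method = "mle")
c(lnorm = fit.lnorm.n100$aic, llogis = fit.llogis.n100$aic)  #the family with the smallest AIC becomes Dist00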

#print table
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption="AIC summary for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500")
Table 2.15: AIC summary for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500
(MLE,n=25) (MLE,n=50) (MLE,n=100) (MLE,n=250) (MLE,n=500)
lnorm 557.858 1,130.847 2,273.682 5,747.808 11,355.90
exp NA NA NA NA NA
gamma 567.952 1,148.695 2,289.276 5,779.020 11,432.33
Weibull 567.856 1,146.482 2,292.847 5,780.328 11,439.93
llogis 645.173 1,131.705 2,620.043 5,742.489 11,362.76
invWeibull 601.066 1,204.042 2,433.059 6,394.136 12,438.07
Pareto 564.503 1,140.397 2,295.020 5,768.745 11,424.99
trgamma NA NA NA NA NA
#print table
kable(Table.AIC.min, caption="Best (or Dist00) distribution chosen through AIC for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500")
Table 2.16: Best (or Dist00) distribution chosen through AIC for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500
(MLE,n=25) (MLE,n=50) (MLE,n=100) (MLE,n=250) (MLE,n=500)
Min (AIC) lnorm lnorm lnorm llogis lnorm

It looks like AIC makes the right choice, except for the LogNormal sample of size \(n=250\), where llogis only marginally surpasses the `true’ choice, namely lnorm.

Point estimates – such as our previous quantile estimates – are sensitive to sampling error, and thus bootstrapping is often used to reduce the uncertainty of point estimation. That is, we draw \(n_b=25\) bootstrap samples with replacement from the corresponding sample and fit only the parametric family chosen before as the best AIC distribution, which means that we are:

  1. Sampling with replacement \(n_b=25\) samples of size \(n=25\) from our initial LogNormal sample of size \(n=25\) and fitting all \(n_b=25\) samples via lnorm MLE;
  2. Sampling with replacement \(n_b=25\) samples of size \(n=50\) from our initial LogNormal sample of size \(n=50\) and fitting all \(n_b=25\) samples via lnorm MLE;
  3. Sampling with replacement \(n_b=25\) samples of size \(n=100\) from our initial LogNormal sample of size \(n=100\) and fitting all \(n_b=25\) samples via lnorm MLE;
  4. Sampling with replacement \(n_b=25\) samples of size \(n=250\) from our initial LogNormal sample of size \(n=250\) and fitting all \(n_b=25\) samples via llogis MLE;
  5. Sampling with replacement \(n_b=25\) samples of size \(n=500\) from our initial LogNormal sample of size \(n=500\) and fitting all \(n_b=25\) samples via lnorm MLE.

Recall that AIC identified lnorm as the best choice for our initial samples of sizes \(n\in\{25,50,100,500\}\), while llogis was the best choice for our initial sample of size \(n=250\).

The bootstrap sampling for steps 1 to 5 defined above is now given.

#fix the seed inside the loop -- e.g. set.seed(567*b) -- so the bootstrap sampling generates the same output every time the code is run
#Sample1: the case of n=25, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n25=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[1]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n25[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n25=colMeans(MLE.lnorm.boot.n25)
#Sample1: the case of n=50, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n50=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[2]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n50[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n50=colMeans(MLE.lnorm.boot.n50)
#Sample1: the case of n=100, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n100=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[3]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n100[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n100=colMeans(MLE.lnorm.boot.n100)
#Sample1: the case of n=250, **llogis** based on min(AIC), samples with replacement=25
MLE.llogis.boot.n250=matrix(0,25,2,dimnames=list(c(1:25),c("shape","scale")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[4]], replace = TRUE)
MLE.llogis=mledist(Sample1.boot, "llogis")
MLE.llogis.boot.n250[b,]=MLE.llogis$estimate
}
MLE.llogis.boot.mean.n250=colMeans(MLE.llogis.boot.n250)
#Sample1: the case of n=500, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n500=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[5]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n500[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n500=colMeans(MLE.lnorm.boot.n500)
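
As an aside, the fitdistrplus package can automate this nonparametric bootstrap via bootdist(); a minimal sketch of our own, shown for the \(n=100\) sample (the resamples differ from those above because bootdist draws its own resamples):

boot.fln.n100 <- bootdist(fitdist(Sample1.list[[3]], "lnorm", method = "mle"),
                          bootmethod = "nonparam", niter = 25)  #25 nonparametric bootstrap resamples
colMeans(boot.fln.n100$estim)  #comparable in spirit to MLE.lnorm.boot.mean.n100 above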

The sets of \(n_b=25\) MLE estimates are now averaged to obtain the so-called bootstrap MLE estimates, which are tabulated below.

kable(MLE.lnorm.boot.mean.n25, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=25)")
Table 2.17: Bootstrap MLE estimates assuming lnorm (n=25)
x
meanlog 9.727
sdlog 0.842
kable(MLE.lnorm.boot.mean.n50, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=50)")
Table 2.18: Bootstrap MLE estimates assuming lnorm (n=50)
x
meanlog 9.777
sdlog 1.076
kable(MLE.lnorm.boot.mean.n100, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=100)")
Table 2.19: Bootstrap MLE estimates assuming lnorm (n=100)
x
meanlog 10.104
sdlog 0.883
kable(MLE.llogis.boot.mean.n250, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming llogis (n=250)")
Table 2.20: Bootstrap MLE estimates assuming llogis (n=250)
x
shape 1.569
scale 23,511.935
kable(MLE.lnorm.boot.mean.n500, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=500)")
Table 2.21: Bootstrap MLE estimates assuming lnorm (n=500)
x
meanlog 9.959
sdlog 0.974

We now compute the bootstrap quantile estimates for the parametric model agreed upon as the ‘best’ option. Recall that Dist00 is lnorm for all \(n=\{25, 50, 100, 500\}\) and llogis for \(n=250\), while the true distribution is lnorm with \((\mu=10,\sigma=1)\).

We report the following:

\[\begin{eqnarray*} &&(1)\;\text{Compute} \;q\big(xx; \hat{\theta}_{MLE}(n,i), Dist00\big) \quad \text{for all}\; i=1,\ldots, 25\quad\text{and}\; n=\{25, 50, 100, 250, 500\},\\ &&(2)\;\text{Compute} \;\bar{\bar{q}}_{boot} (xx;n) =\frac{1}{25} \sum_{i=1}^{25} q\big(xx; \hat{\theta}_{MLE}(n,i), Dist00\big),\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}, \end{eqnarray*}\] where (1) are the quantile estimates for the (individual) bootstrap samples, (2) are the quantile bootstrap estimates, and (3) evaluates the estimation errors of the estimator in (2). All computations are reported for the three quantile levels \(xx=\{50\%, 75\%, 95\%\}\).

The background computations – for (1), (2) and (3) – are now given.

#(best n=25) lnorm
Quantile.best.mle.n25.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n25.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n25[i,1], sdlog=MLE.lnorm.boot.n25[i,2])
}
Quantile.best.mle.n25.barbar=colMeans(Quantile.best.mle.n25.each)
#(best n=50) lnorm
Quantile.best.mle.n50.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n50.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n50[i,1], sdlog=MLE.lnorm.boot.n50[i,2])
}
Quantile.best.mle.n50.barbar=colMeans(Quantile.best.mle.n50.each)
#(best n=100) lnorm
Quantile.best.mle.n100.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n100.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n100[i,1], sdlog=MLE.lnorm.boot.n100[i,2])
}
Quantile.best.mle.n100.barbar=colMeans(Quantile.best.mle.n100.each)
#(best n=250) llogis
Quantile.best.mle.n250.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n250.each[i,]=qllogis(c(0.5,0.75,0.95), shape=MLE.llogis.boot.n250[i,1], scale=MLE.llogis.boot.n250[i,2])
}
Quantile.best.mle.n250.barbar=colMeans(Quantile.best.mle.n250.each)
#(best n=500) lnorm
Quantile.best.mle.n500.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n500.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n500[i,1], sdlog=MLE.lnorm.boot.n500[i,2])
}
Quantile.best.mle.n500.barbar=colMeans(Quantile.best.mle.n500.each)
# Table2a
Table2a=matrix(0,5,3,dimnames=list(c("n=25 (lnorm)","n=50 (lnorm)","n=100 (lnorm)","n=250 (llogis)","n=500 (lnorm)"),c("boot(50%)","boot(75%)","boot(95%)")))
Table2a[1,]=abs(Quantile.best.mle.n25.barbar-Quantile.true) #lnorm
Table2a[2,]=abs(Quantile.best.mle.n50.barbar-Quantile.true) #lnorm
Table2a[3,]=abs(Quantile.best.mle.n100.barbar-Quantile.true)#lnorm
Table2a[4,]=abs(Quantile.best.mle.n250.barbar-Quantile.true)#llogis
Table2a[5,]=abs(Quantile.best.mle.n500.barbar-Quantile.true)#lnorm

The resulting estimation errors – see (3) – are summarised below.

kable(Table2a, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500")
Table 2.22: Quantile bootstrap estimation errors for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500
boot(50%) boot(75%) boot(95%)
n=25 (lnorm) 5,118.359 13,267.308 4.505281e+04
n=50 (lnorm) 4,140.955 5,900.422 5.205544e+03
n=100 (lnorm) 2,521.460 1,348.219 8.685972e+03
n=250 (llogis) 1,485.469 529,456.344 4.167706e+09
n=500 (lnorm) 863.649 2,414.141 8.968366e+03

The \(n=250\) sample leads to enormous quantile bootstrap estimation errors at levels \(xx=\{75\%, 95\%\}\), and we need to understand the source of this anomaly. Recall that MLE estimation in R relies on numerical optimisation solvers that may not identify a global maximum (if one exists); we ignore this possible issue once again and look into the quantile estimates for the (individual) bootstrap samples – see (1) – when \(n=250\).

kable(Quantile.best.mle.n250.each[order(Quantile.best.mle.n250.each[,3]),], digits = 3, format.args = list(big.mark = ","), caption="Quantile estimates for (individual) bootstrap samples coming from the LogNormal sample of size n=250")
Table 2.23: Quantile estimates for (individual) bootstrap samples coming from the LogNormal sample of size n=250
q_boot(50%) q_boot(75%) q_boot(95%)
25,210.59 43,962.93 1.119045e+05
22,750.66 41,324.46 1.126478e+05
24,990.67 44,010.39 1.138929e+05
23,310.87 42,398.51 1.158347e+05
23,419.48 42,528.22 1.158784e+05
24,020.06 43,434.30 1.175072e+05
23,370.40 42,751.53 1.179313e+05
22,292.95 41,610.82 1.187381e+05
21,818.53 41,420.38 1.216037e+05
22,837.61 42,674.95 1.220027e+05
22,618.70 42,687.85 1.240936e+05
22,803.86 43,248.68 1.267631e+05
22,314.19 42,897.05 1.286263e+05
21,452.17 41,930.58 1.292864e+05
23,463.28 44,368.43 1.294032e+05
21,035.65 41,706.86 1.317143e+05
24,879.30 46,867.56 1.358218e+05
24,408.74 46,340.64 1.360636e+05
23,751.00 45,733.55 1.375070e+05
23,153.26 45,209.63 1.391594e+05
24,539.01 47,225.02 1.418608e+05
28,568.35 54,309.21 1.598137e+05
23,500.18 2,856,906.46 9.092886e+09
23,597.05 4,567,632.67 3.176145e+10
23,691.79 5,924,193.83 6.333839e+10

Note that the peculiarly large quantile bootstrap estimates for \(n=250\) (llogis) are due to a few – three in this case – extraordinarily large quantile estimates for (individual) bootstrap samples amongst the \(n_b=25\) estimates. If we exclude these three bootstrap samples, then we can re-compute (2) based on the remaining 22 individual bootstrap samples, and we get the following:

#Re-compute (2) after removing the three bootstrap samples with the largest quantile estimates
Reorder <- Quantile.best.mle.n250.each[order(Quantile.best.mle.n250.each[,3]),]
Quantile.best.mle.n250.barbar.no.outlier <- colMeans(Reorder[-c(23,24,25),])
No.outlier <- matrix(0,2,3)
No.outlier[1,] <- abs(Quantile.best.mle.n250.barbar.no.outlier-Quantile.true)
No.outlier[2,] <- abs(Quantile.best.mle.n250.barbar-Quantile.true)#llogis; as computed before
colnames(No.outlier) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)")
rownames(No.outlier) <- c("n=250 (llogis) No outliers (22 bootstrap samples)","n=250 (llogis) With outliers (25 bootstrap samples)")
kable(No.outlier, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors for LogNormal samples of size n=250 without outliers")
Table 2.24: Quantile bootstrap estimation errors for LogNormal samples of size n=250 without outliers
q_boot(50%) q_boot(75%) q_boot(95%)
n=250 (llogis) No outliers (22 bootstrap samples) 1,473.960 790.525 1.262712e+04
n=250 (llogis) With outliers (25 bootstrap samples) 1,485.469 529,456.344 4.167706e+09

The estimation errors have been reduced by eliminating the outliers induced by bootstrap sampling error, and the bootstrap estimates are more reasonable. The summary of the quantile bootstrap errors is given below:

Table2aF <- matrix(0,6,3,dimnames=list(c("n=25 (lnorm)","n=50 (lnorm)","n=100 (lnorm)", "n=250 (llogis) With outliers (25 bootstrap samples)", "n=250 (llogis) No outliers (22 bootstrap samples)", "n=500 (lnorm)"),c("boot(50%)","boot(75%)","boot(95%)")))
Table2aF[1,] <- abs(Quantile.best.mle.n25.barbar-Quantile.true) #lnorm
Table2aF[2,] <- abs(Quantile.best.mle.n50.barbar-Quantile.true) #lnorm
Table2aF[3,] <- abs(Quantile.best.mle.n100.barbar-Quantile.true)#lnorm
Table2aF[4,] <- abs(Quantile.best.mle.n250.barbar-Quantile.true)#llogis
Table2aF[5,] <- No.outlier[1,]#llogis
Table2aF[6,] <- abs(Quantile.best.mle.n500.barbar-Quantile.true)#lnorm
kable(Table2aF, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (3) for all LogNormal samples with and without outliers")
Table 2.25: Quantile bootstrap estimation errors based on (3) for all LogNormal samples with and without outliers
boot(50%) boot(75%) boot(95%)
n=25 (lnorm) 5,118.359 13,267.308 4.505281e+04
n=50 (lnorm) 4,140.955 5,900.422 5.205544e+03
n=100 (lnorm) 2,521.460 1,348.219 8.685972e+03
n=250 (llogis) With outliers (25 bootstrap samples) 1,485.469 529,456.344 4.167706e+09
n=250 (llogis) No outliers (22 bootstrap samples) 1,473.960 790.525 1.262712e+04
n=500 (lnorm) 863.649 2,414.141 8.968366e+03

So far, we have reported the bootstrap errors – see (3). Two additional sets of estimates and their associated estimation errors are now provided. Specifically, we:

\[\begin{eqnarray*} &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(4)\;\text{Report} \;\big|\bar{q}^{MLE}_{boot} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(5)\;\text{Report} \;\big|\bar{q}^{MLE}_{sample} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ \end{eqnarray*}\] where \(\bar{\bar{q}}_{boot} (xx;n)\) is the quantile bootstrap estimator \[\bar{\bar{q}}_{boot} (xx;n) =\frac{1}{25} \sum_{i=1}^{25} q\big(xx; \hat{\theta}_{MLE}(n,i), Dist00\big),\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\] \(\bar{q}^{MLE}_{boot} (xx;n)\) is the quantile of Dist00 evaluated at the bootstrap-averaged parameter estimates \[\bar{q}^{MLE}_{boot} (xx;n) = q\bigg(xx; \widehat{\Theta}^{boot}_{MLE}(n), Dist00\bigg),\quad\text{where}\;\widehat{\Theta}^{boot}_{MLE}(n)=\frac{1}{25}\sum^{25}_{i=1} \hat{\Theta}^{boot}_{MLE}(n,i),\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\] and \(\bar{q}^{MLE}_{sample} (xx;n)\) is the quantile of Dist00 evaluated at the parameters estimated from the original sample, i.e. without bootstrapping, \[\bar{q}^{MLE}_{sample} (xx;n) = q\bigg(xx; \widehat{\Theta}^{sample}_{MLE}(n), Dist00\bigg),\quad\text{where}\;\widehat{\Theta}^{sample}_{MLE}(n)\;\text{is the MLE computed from the original sample of size}\;n,\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}.\] Note that the \(\bar{q}^{MLE}_{sample} (xx;n)\) estimation errors were already computed in Tables 2.10–2.14, before the bootstrap computations were introduced. All computations are reported for the three quantile levels \(xx=\{50\%, 75\%, 95\%\}\).

Table2b <- matrix(0,5,6,dimnames=list(c("n=25","n=50","n=100","n=250","n=500"),c("boot(50%)","boot(75%)","boot(95%)","table1(50%)","table1(75%)","table1(95%)")))
#Quantile.true
#################n=25, best model: lnorm
Quantile.best.mle.n25 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n25[1], sdlog = MLE.lnorm.boot.mean.n25[2]) 
#Quantile.best.mle.n25
Table2b[1,c(1,2,3)] <- abs(Quantile.best.mle.n25-Quantile.true)
#################n=50, best model: lnorm
Quantile.best.mle.n50 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n50[1], sdlog = MLE.lnorm.boot.mean.n50[2]) 
#Quantile.best.mle.n50
Table2b[2,c(1,2,3)] <- abs(Quantile.best.mle.n50-Quantile.true)
#################n=100, best model: lnorm
Quantile.best.mle.n100 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n100[1], sdlog = MLE.lnorm.boot.mean.n100[2])
#Quantile.best.mle.n100
Table2b[3,c(1,2,3)] <- abs(Quantile.best.mle.n100-Quantile.true)
#################n=250, best model: llogis 
Quantile.best.mle.n250 <- qllogis(c(0.5,0.75,0.95), shape=MLE.llogis.boot.mean.n250[1], scale =MLE.llogis.boot.mean.n250[2])
#Quantile.best.mle.n250
Table2b[4,c(1,2,3)] <- abs(Quantile.best.mle.n250-Quantile.true)
#################n=500, best model: lnorm
Quantile.best.mle.n500 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n500[1], sdlog = MLE.lnorm.boot.mean.n500[2])
#Quantile.best.mle.n500
Table2b[5,c(1,2,3)] <- abs(Quantile.best.mle.n500-Quantile.true)
#From Table 1 before
Table2b[1,c(4,5,6)] <- Table1.1[1,c(2,4,6)] #lnorm, i.e first row
Table2b[2,c(4,5,6)] <- Table1.2[1,c(2,4,6)] #lnorm, i.e first row
Table2b[3,c(4,5,6)] <- Table1.3[1,c(2,4,6)] #lnorm, i.e first row
Table2b[4,c(4,5,6)] <- Table1.4[5,c(2,4,6)] #llogis, i.e fifth row
Table2b[5,c(4,5,6)] <- Table1.5[1,c(2,4,6)] #lnorm, i.e first row 
colnames(Table2b) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)", "q_sample(50%)","q_sample(75%)","q_sample(95%)")
rownames(Table2b) <- c("n=25 (lnorm)", "n=50 (lnorm)", "n=100 (lnorm)", "n=250 (llogis)", "n=500 (lnorm)")
kable(Table2b, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (4) and (5) for all LogNormal samples with outliers")
Table 2.26: Quantile bootstrap estimation errors based on (4) and (5) for all LogNormal samples with outliers

|                | q_boot(50%) | q_boot(75%) | q_boot(95%) | q_sample(50%) | q_sample(75%) | q_sample(95%) |
|----------------|------------:|------------:|------------:|--------------:|--------------:|--------------:|
| n=25 (lnorm)   |   5,263.810 |  13,660.436 |  47,146.892 |     4,730.688 |    11,397.029 |    37,486.570 |
| n=50 (lnorm)   |   4,398.817 |   6,809.925 |  10,592.784 |     4,465.168 |     6,877.322 |    10,498.602 |
| n=100 (lnorm)  |   2,422.207 |   1,108.441 |   9,650.478 |     1,176.609 |     1,099.864 |    14,678.478 |
| n=250 (llogis) |   1,485.469 |   4,115.184 |  39,440.623 |     1,148.783 |       224.350 |    10,911.390 |
| n=500 (lnorm)  |     880.347 |   2,446.697 |   9,127.131 |       989.887 |     2,519.477 |     8,799.418 |

The estimation errors show a reasonable pattern: the larger the sample, the smaller the error. Note that outliers are not removed here, and in particular the sample of size \(n=250\) shows erratic behaviour. This is very likely due to the sensitivity of the R optimisation routine used in the MLE computations, which amplified the effect of outliers in individual bootstrap samples. Therefore, one should first check the optimality of the MLE estimates computed by R, and then detect possible outliers.
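One simple way to flag such outliers before averaging is an IQR screen applied to the bootstrap quantile estimates. The sketch below is illustrative only; it assumes the \(25\times 3\) matrix q.boot.all of bootstrap quantile estimates from the sketch above, and the 1.5 IQR multiplier is an arbitrary choice.

## Illustrative IQR-based outlier screen: bootstrap samples flagged in any quantile
## column are dropped before the remaining quantile estimates are averaged.
flag.outlier <- function(x) {
  qs <- quantile(x, c(0.25, 0.75))
  iqr <- qs[2] - qs[1]
  (x < qs[1] - 1.5 * iqr) | (x > qs[2] + 1.5 * iqr)
}
keep <- !apply(apply(q.boot.all, 2, flag.outlier), 1, any)
q.barbar.clean <- colMeans(q.boot.all[keep, , drop = FALSE])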

2.5 Weibull Sample with Moderately Heavy Tail

We now redo the computations from Sections 2.2, 2.3 and 2.4 for Sample 2, which is drawn from the Weibull distribution with parameters \(a=0.5\) and \(b=1\). Note that this particular Weibull distribution is moderately heavy tailed since \(a<1\). We only report the final R outputs, since the intermediate steps are very similar to those of the Sample 1 computations.

Note that set.seed(1234) is used as the seed for generating Sample 2, but no seed is set for the bootstrap computations.

set.seed(1234) # Generating Sample 2 and saving it in a list
n=c(25, 50, 100, 250, 500)
weibull.shape=0.5
weibull.scale=1
Sample2.list=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
  Sample2=rweibull(n[i], shape=weibull.shape, scale=weibull.scale)  # Weibull(0.5, 1) sample of size n[i]
  Sample2.list[[i]]=Sample2
}
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption=("AIC summary for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.27: AIC summary for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|            | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|------------|-----------:|-----------:|------------:|------------:|------------:|
| lnorm      |     86.957 |    153.419 |     254.780 |     517.470 |   1,107.682 |
| exp        |     96.413 |    185.928 |     302.191 |     804.243 |   1,748.775 |
| gamma      |     87.293 |    148.174 |     209.015 |     489.321 |   1,087.917 |
| Weibull    |     85.648 |    145.817 |     216.315 |     469.088 |   1,042.992 |
| llogis     |     87.223 |    150.473 |     243.450 |     499.986 |   1,098.118 |
| invWeibull |     94.412 |    180.450 |     314.424 |     646.758 |   1,315.670 |
| Pareto     |     87.122 |    154.147 |     269.159 |     544.106 |   1,187.851 |
| trgamma    |     87.257 |    147.829 |     210.463 |     471.080 |   1,044.309 |
kable(Table.AIC.min, digits = 3, format.args = list(big.mark = ","), caption=("Best (or DISToo) distribution chosen through AIC for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.28: Best (or DISToo) distribution chosen through AIC for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|           | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|-----------|------------|------------|-------------|-------------|-------------|
| Min (AIC) | Weibull    | Weibull    | gamma       | Weibull     | Weibull     |

It looks like AIC makes the right choice, except for the Weibull sample of size \(n=100\), where gamma marginally outperforms the true choice, i.e. weibull.

We now perform the usual bootstrap sampling.

The sets of \(n_b=25\) MLE estimates are now averaged to obtain the so-called bootstrap MLE estimates, which are tabulated below.
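As an illustration, a bootstrap MLE estimate such as MLE.weibull.boot.mean.n25 could be obtained along the following lines; the resampling loop below is only a sketch (the object names and structure are assumptions), and no seed is set, in line with the remark above.

nb <- 25                                            # number of bootstrap samples
boot.mle <- t(replicate(nb, {
  xb <- sample(Sample2.list$n25, replace = TRUE)    # bootstrap resample of Sample 2, n = 25
  fitdist(xb, "weibull", method = "mle")$estimate   # Weibull MLE (shape, scale)
}))
MLE.weibull.boot.mean.n25 <- colMeans(boot.mle)     # average of the 25 bootstrap MLEs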

kable(MLE.weibull.boot.mean.n25, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=25)")
Table 2.29: Bootstrap MLE estimates assuming weibull (n=25)
x
shape 0.616
scale 1.715
kable(MLE.weibull.boot.mean.n50, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=50)")
Table 2.30: Bootstrap MLE estimates assuming weibull (n=50)
x
shape 0.539
scale 1.345
kable(MLE.gamma.boot.mean.n100, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming gamma (n=100)")
Table 2.31: Bootstrap MLE estimates assuming gamma (n=100)
x
shape 0.39
rate 0.24
kable(MLE.weibull.boot.mean.n250, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=250)")
Table 2.32: Bootstrap MLE estimates assuming weibull (n=250)
x
shape 0.481
scale 0.826
kable(MLE.weibull.boot.mean.n500, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=500)")
Table 2.33: Bootstrap MLE estimates assuming weibull (n=500)
x
shape 0.479
scale 0.928

As before, we report three sets of estimation errors as follows:

\[\begin{eqnarray*} &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(4)\;\text{Report} \;\big|\bar{q}^{MLE}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(5)\;\text{Report} \;\big|\bar{q}^{MLE}_{sample} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}. \end{eqnarray*}\]

The summary of the quantile bootstrap errors – without any outlier detection – is given below:

kable(Table2a, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (3) for all moderately heavy tailed Weibull samples")
Table 2.34: Quantile bootstrap estimation errors based on (3) for all moderately heavy tailed Weibull samples

|                 | boot(50%) | boot(75%) | boot(95%) |
|-----------------|----------:|----------:|----------:|
| n=25 (weibull)  |     0.450 |     1.055 |     2.223 |
| n=50 (weibull)  |     0.203 |     0.548 |     1.552 |
| n=100 (gamma)   |     0.096 |     0.140 |     2.028 |
| n=250 (weibull) |     0.094 |     0.294 |     0.863 |
| n=500 (weibull) |     0.049 |     0.086 |     0.230 |
colnames(Table2b) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)", "q_sample(50%)","q_sample(75%)","q_sample(95%)")
rownames(Table2b) <- c("n=25 (weibull)", "n=50 (weibull)", "n=100 (gamma)", "n=250 (weibull)", "n=500 (weibull)")
kable(Table2b, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (4) and (5) for all moderately heavy tailed Weibull samples")
Table 2.35: Quantile bootstrap estimation errors based on (4) and (5) for all moderately heavy tailed Weibull samples

|                 | q_boot(50%) | q_boot(75%) | q_boot(95%) | q_sample(50%) | q_sample(75%) | q_sample(95%) |
|-----------------|------------:|------------:|------------:|--------------:|--------------:|--------------:|
| n=25 (weibull)  |       0.465 |       0.992 |       1.204 |         0.430 |         0.849 |         0.576 |
| n=50 (weibull)  |       0.201 |       0.544 |       1.327 |         0.198 |         0.509 |         1.058 |
| n=100 (gamma)   |       0.093 |       0.118 |       2.160 |         0.088 |         0.136 |         2.014 |
| n=250 (weibull) |       0.095 |       0.294 |       0.905 |         0.090 |         0.281 |         0.885 |
| n=500 (weibull) |       0.049 |       0.087 |       0.203 |         0.052 |         0.078 |         0.358 |

The estimation errors show a reasonable pattern: the larger the sample, the smaller the error. Note that outliers are not removed in this section. You should also note that the true quantile values are \(q(50\%)=0.480453\), \(q(75\%)=1.921812\) and \(q(95\%)=8.974412\), so the estimation errors are very small and roughly proportional to the true values.
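These true quantiles can be checked directly in R:

qweibull(c(0.5, 0.75, 0.95), shape = 0.5, scale = 1)
## 0.480453 1.921812 8.974412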

2.6 Weibull Sample with Very Light Tail

We now redo the computations from Sections 2.2, 2.3 and 2.4 for Sample 3, which is drawn from the Weibull distribution with parameters \(a=2\) and \(b=1\). Note that this particular Weibull distribution is very light tailed since \(a>1\). We only report the final R outputs, since the intermediate steps are very similar to those of the Sample 1 computations.

Note that set.seed(1234) is used as the seed for generating Sample 3, but no seed is set for the bootstrap computations.

set.seed(1234) # Generating Sample 3 and saving it in a list
n=c(25, 50, 100, 250, 500)
weibull.shape=2
weibull.scale=1
Sample3.list=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
  Sample3=rweibull(n[i], shape=weibull.shape, scale=weibull.scale)  # Weibull(2, 1) sample of size n[i]
  Sample3.list[[i]]=Sample3
}
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption=("AIC summary for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.36: AIC summary for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|            | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|------------|-----------:|-----------:|------------:|------------:|------------:|
| lnorm      |     32.430 |     71.909 |     161.961 |     339.342 |     677.589 |
| exp        |     52.137 |     96.946 |     178.449 |     419.694 |     862.573 |
| gamma      |     30.941 |     65.430 |     135.475 |     299.051 |     620.399 |
| Weibull    |     31.122 |     64.306 |     123.496 |     290.960 |     612.898 |
| llogis     |     32.697 |     68.962 |     150.631 |     321.859 |     668.024 |
| invWeibull |     39.886 |     98.940 |     221.605 |     468.631 |     885.576 |
| Pareto     |     87.122 |    154.147 |     269.159 |     544.106 |   1,187.851 |
| trgamma    |     32.752 |     67.400 |     117.892 |     292.960 |     618.699 |
kable(Table.AIC.min, digits = 3, format.args = list(big.mark = ","), caption=("Best (or DISToo) distribution chosen through AIC for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.37: Best (or DISToo) distribution chosen through AIC for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|           | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|-----------|------------|------------|-------------|-------------|-------------|
| Min (AIC) | gamma      | Weibull    | trgamma     | Weibull     | Weibull     |

It looks like AIC makes the right choice, except for the Weibull samples of size \(n=25\) and \(n=100\), where gamma and trgamma marginally outperform the `true' choice, namely weibull. Since MLE estimation for trgamma is not very stable (there are three parameters to estimate), we will further assume that the `best' AIC choice for the Weibull sample of size \(n=100\) is weibull, which is the second-best AIC-based choice and is not far from the `best' choice.
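To illustrate the remark about trgamma, fitting the three-parameter transformed gamma with fitdist() typically requires explicit starting values, and the optimisation can be sensitive to them. The sketch below (with arbitrary starting values; all object names are assumptions) indicates how the AIC comparison for the \(n=100\) sample could be reproduced.

x <- Sample3.list$n100
fit.trg <- fitdist(x, "trgamma", method = "mle",
                   start = list(shape1 = 1, shape2 = 1, rate = 1))  # arbitrary starting values
fit.wei <- fitdist(x, "weibull", method = "mle")
c(trgamma = fit.trg$aic, weibull = fit.wei$aic)                     # compare the two AIC values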

We now perform the usual bootstrap sampling.

The sets of \(n_b=25\) MLE estimates are now averaged to obtain the so-called bootstrap MLE estimates, which are tabulated below.

kable(MLE.gamma.boot.mean.n25, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming gamma (n=25)")
Table 2.38: Bootstrap MLE estimates assuming gamma (n=25)
x
shape 5.028
rate 4.946
kable(MLE.weibull.boot.mean.n50, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=50)")
Table 2.39: Bootstrap MLE estimates assuming weibull (n=50)
x
shape 2.245
scale 1.063
kable(MLE.weibull.boot.mean.n100, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=100)")
Table 2.40: Bootstrap MLE estimates assuming weibull (n=100)
x
shape 2.104
scale 0.996
kable(MLE.weibull.boot.mean.n250, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=250)")
Table 2.41: Bootstrap MLE estimates assuming weibull (n=250)
x
shape 1.927
scale 0.937
kable(MLE.weibull.boot.mean.n500, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=500)")
Table 2.42: Bootstrap MLE estimates assuming weibull (n=500)
x
shape 1.898
scale 0.978

As before, we report three sets of estimation errors as follows:

\[\begin{eqnarray*} &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(4)\;\text{Report} \;\big|\bar{q}^{MLE}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(5)\;\text{Report} \;\big|\bar{q}^{MLE}_{sample} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}. \end{eqnarray*}\]

The summary of the quantile bootstrap errors – without any outlier detection – is given below:

kable(Table2a, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (3) for all very light tailed Weibull samples")
Table 2.43: Quantile bootstrap estimation errors based on (3) for all very light tailed Weibull samples

|                 | boot(50%) | boot(75%) | boot(95%) |
|-----------------|----------:|----------:|----------:|
| n=25 (gamma)    |     0.116 |     0.105 |     0.157 |
| n=50 (weibull)  |     0.069 |     0.054 |     0.013 |
| n=100 (weibull) |     0.004 |     0.013 |     0.047 |
| n=250 (weibull) |     0.058 |     0.067 |     0.073 |
| n=500 (weibull) |     0.027 |     0.016 |     0.013 |
colnames(Table2) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)", "q_sample(50%)","q_sample(75%)","q_sample(95%)")
rownames(Table2) <- c("n=25 (gamma)", "n=50 (weibull)", "n=100 (weibull)", "n=250 (weibull)", "n=500 (weibull)")
kable(Table2, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (4) and (5) for all very light tailed Weibull samples")
Table 2.44: Quantile bootstrap estimation errors based on (4) and (5) for all very light tailed Weibull samples

|                 | q_boot(50%) | q_boot(75%) | q_boot(95%) | q_sample(50%) | q_sample(75%) | q_sample(95%) |
|-----------------|------------:|------------:|------------:|--------------:|--------------:|--------------:|
| n=25 (gamma)    |       0.118 |       0.098 |       0.128 |         0.105 |         0.079 |         0.096 |
| n=50 (weibull)  |       0.070 |       0.052 |       0.002 |         0.075 |         0.071 |         0.049 |
| n=100 (weibull) |       0.004 |       0.014 |       0.052 |         0.003 |         0.009 |         0.034 |
| n=250 (weibull) |       0.058 |       0.067 |       0.075 |         0.042 |         0.046 |         0.044 |
| n=500 (weibull) |       0.027 |       0.016 |       0.012 |         0.023 |         0.012 |         0.017 |

The estimation errors show a reasonable pattern: the larger the sample, the smaller the error. Note that outliers are not removed in this section. You should also note that the true quantile values are \(q(50\%)=0.8325546\), \(q(75\%)=1.1774100\) and \(q(95\%)=1.7308184\), so the estimation errors are very small and roughly proportional to the true values.

3 Real Data Analysis


In this section, you learn to:


The dataset analysed in this section is the one described in Section 1.2, for which a detailed exploratory data analysis is provided in Section 1.3. We particularly focus on Building Insurance Claim Size for claims recorded in 1990.

We first fit the `best’ MLE distribution from the eight possible choices, and report

\[ \hat{q}^{MLE}_{50\%},\quad\hat{q}^{MLE}_{75\%}\quad\text{and}\quad \hat{q}^{MLE}_{95\%}, \]

i.e. the theoretical quantiles for the distribution (these are three point estimates).

Note that the claims are quite large; therefore, we scale all claims down by a factor of \(100,000\), and the final outputs are later scaled back up by the same factor.

#View(Bivariate_Fire_Insurance_Clean_Data)
Sample.RealData.raw=Bivariate_Fire_Insurance_Clean_Data[[2]][1336:1502]
#head(Bivariate_Fire_Insurance_Clean_Data)
#head(Sample.RealData.raw)
#tail(Sample.RealData.raw)
#In order to use exp and trgamma fitdist(), we scale the data by 100,000
Sample.RealData=Sample.RealData.raw/100000
summary(Sample.RealData)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.8251   8.1848  10.9736  15.4773  16.5017 116.9554 
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption=("AIC summary for Building Insurance Claim Size recorded in 1990"))
Table 3.1: AIC summary for Building Insurance Claim Size recorded in 1990

|            | AIC (MLE,Sample.RealData) |
|------------|--------------------------:|
| lnorm      |                 1,203.430 |
| exp        |                 1,250.951 |
| gamma      |                 1,222.947 |
| Weibull    |                 1,238.022 |
| llogis     |                 1,186.759 |
| invWeibull |                 1,260.543 |
| Pareto     |                 1,252.947 |
| trgamma    |                 1,208.347 |
kable(Table.AIC.min, digits = 3, format.args = list(big.mark = ","), caption=("Best (or DISToo) distribution chosen through AIC for Building Insurance Claim Size recorded in 1990"))
Table 3.2: Best (or DISToo) distribution chosen through AIC for Building Insurance Claim Size recorded in 1990

|           | AIC(MLE,Best dist) |
|-----------|--------------------|
| Min (AIC) | llogis             |

Thus, llogis is the best AIC-based parametric choice for fitting the Building claims that occurred in 1990. The point estimates are provided in Table 3.3, which gives the non-parametric and `best' AIC-based parametric quantile point estimates.

colnames(TableB.1) <- c("q(50%)","q(75%)","q(95%)")
rownames(TableB.1) <- c("Non-parametric", "AIC Parametric - llogis")
kable(TableB.1, format.args = list(big.mark = ","), caption=("Quantile point estimates for Building Insurance Claim Size recorded in 1990"))
Table 3.3: Quantile point estimates for Building Insurance Claim Size recorded in 1990

|                         |    q(50%) |    q(75%) |    q(95%) |
|-------------------------|----------:|----------:|----------:|
| Non-parametric          | 1,097,360 | 1,650,165 | 3,770,627 |
| AIC Parametric - llogis | 1,143,608 | 1,767,316 | 3,672,177 |

It is interesting to note that the non-parametric high quantile point estimate – at level \(95\%\) – is larger than its parametric estimate. We now use bootstrapping to understand the variability of our point estimates.
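As a point of reference, a minimal sketch of how the two rows of Table 3.3 could be produced is given below; the fit object fit.llogis and the starting values are assumptions, and the factor \(100,000\) undoes the earlier scaling.

p <- c(0.5, 0.75, 0.95)
q.nonpar <- quantile(Sample.RealData.raw, p)        # non-parametric (empirical) quantiles
fit.llogis <- fitdist(Sample.RealData, "llogis", method = "mle",
                      start = list(shape = 1, scale = median(Sample.RealData)))
q.param <- 100000 * qllogis(p, shape = fit.llogis$estimate[1],
                            scale = fit.llogis$estimate[2])  # llogis quantiles, scaled back up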

Our first setting is to bootstrap 100 times, i.e. we generate 100 samples with replacement from the original sample of size 167: \[ \tilde{x}_1,\ldots,\tilde{x}_{100}\quad \text{with}\quad dim(\tilde{x}_i)=167 \quad\text{for all}\;1\le i \le 100. \] We then estimate the parameter(s) \(\hat{\Theta}_{MLE,i}\) of Distoo, i.e. llogis in this case, from the bootstrap sample \(\tilde{x}_{i}\) for all \(1\le i \le 100\). Further, we compute \[ \hat{q}^{MLE,i}_{xx} \quad\text{for all}\quad 1\le i \le 100\quad\text{and}\quad xx\in \{50\%, 75\%, 95\%\}, \] where \(\hat{q}^{MLE,i}_{xx}\) is the \(xx\%\) quantile of Distoo with \(\Theta=\hat{\Theta}_{MLE,i}\). Furthermore, we report

\[\begin{eqnarray*} &&(i)\;\text{Bootstrap Estimate} \quad\hat{q}^{MLE,boot}_{xx}=\frac{1}{100}\sum^{100}_{i=1} \hat{q}^{MLE,i}_{xx},\\ &&(ii)\;\text{Bootstrap Variance} \quad\widehat{Var}\left(\hat{q}^{MLE,boot}_{xx}\right)=\frac{1}{100}\sum^{100}_{i=1}\left(\hat{q}^{MLE,i}_{xx}-\hat{q}^{MLE,boot}_{xx}\right)^2,\\ &&(iii)\;\text{MSE Estimate} \quad\widehat{MSE}\left(\hat{q}^{MLE,boot}_{xx}\right)=\frac{1}{100}\sum^{100}_{i=1}\left(\hat{q}^{MLE,i}_{xx}-\hat{q}^{MLE}_{xx}\right)^2, \end{eqnarray*}\] for all \(xx\in \{50\%, 75\%, 95\%\}\).

The bootstrap samples are generated without fixing the sampling seed, as should be the case for a randomised experiment; a sketch of these computations is given below.
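A minimal sketch of (i) to (iii) for the 100-bootstrap setting, reusing the hypothetical objects fit.llogis and q.param from the previous sketch, where q.param plays the role of \(\hat{q}^{MLE}_{xx}\):

B <- 100
p <- c(0.5, 0.75, 0.95)
q.boot <- t(replicate(B, {
  xb <- sample(Sample.RealData, replace = TRUE)     # bootstrap resample of size 167
  est <- fitdist(xb, "llogis", method = "mle",
                 start = list(shape = 1, scale = median(xb)))$estimate
  100000 * qllogis(p, shape = est[1], scale = est[2])   # quantiles scaled back up
}))
boot.est <- colMeans(q.boot)                        # (i)  bootstrap estimate
boot.var <- colMeans(sweep(q.boot, 2, boot.est)^2)  # (ii) bootstrap variance
mse.est <- colMeans(sweep(q.boot, 2, q.param)^2)    # (iii) MSE against the sample MLE quantiles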

We also bootstrap our claim data 150 times and perform the same computations. All results are reported in Tables 3.4 and 3.5.

colnames(TableB.2a) <- c("q(50%)","q(75%)","q(95%)")
rownames(TableB.2a) <- c("Bootstrap Estimate", "Bootstrap Variance" , "MSE Estimate", "Bootstrap Variance / Bootstrap Estimate", "MSE Estimate / Bootstrap Estimate")
kable(TableB.2a, format.args = list(big.mark = ","), caption=("Summary of bootstrap estimation -- 100 bootstrap samples"))
Table 3.4: Summary of bootstrap estimation – 100 bootstrap samples

|                                         |       q(50%) |       q(75%) |       q(95%) |
|-----------------------------------------|-------------:|-------------:|-------------:|
| Bootstrap Estimate                      | 1.150928e+06 | 1.779085e+06 | 3.709717e+06 |
| Bootstrap Variance                      | 2.744746e+09 | 1.250897e+10 | 1.882482e+11 |
| MSE Estimate                            | 2.770871e+09 | 1.252238e+10 | 1.877750e+11 |
| Bootstrap Variance / Bootstrap Estimate | 2.384812e+03 | 7.031123e+03 | 5.074461e+04 |
| MSE Estimate / Bootstrap Estimate       | 2.407510e+03 | 7.038661e+03 | 5.061705e+04 |
colnames(TableB.2b) <- c("q(50%)","q(75%)","q(95%)")
rownames(TableB.2b) <- c("Bootstrap Estimate", "Bootstrap Variance" , "MSE Estimate", "Bootstrap Variance / Bootstrap Estimate", "MSE Estimate / Bootstrap Estimate")
kable(TableB.2b, format.args = list(big.mark = ","), caption=("Summary of bootstrap estimation -- 150 bootstrap samples"))
Table 3.5: Summary of bootstrap estimation – 150 bootstrap samples

|                                         |       q(50%) |       q(75%) |       q(95%) |
|-----------------------------------------|-------------:|-------------:|-------------:|
| Bootstrap Estimate                      | 1.147084e+06 | 1.769867e+06 | 3.677304e+06 |
| Bootstrap Variance                      | 3.309178e+09 | 1.125144e+10 | 1.468273e+11 |
| MSE Estimate                            | 3.299195e+09 | 1.118293e+10 | 1.458747e+11 |
| Bootstrap Variance / Bootstrap Estimate | 2.392803e+03 | 7.067744e+03 | 5.119189e+04 |
| MSE Estimate / Bootstrap Estimate       | 2.415578e+03 | 7.075322e+03 | 5.106321e+04 |