1 Introduction – R Workspace and Data Exploration


In this section, you learn to:

  • Install and activate the R packages needed for this lab;
  • Describe the insurance data used throughout the lab;
  • Perform an Exploratory Data Analysis (EDA) of the claim data.

1.1 Install R packages

Install and activate the R packages needed for the specific computations of this lab, such as fitdistrplus and actuar. Some additional generic R packages are also needed, such as knitr and data.table. Recall that this R lab is written in the Bookdown setting, so installing R packages should be done directly from RStudio. If you run the R code in the usual fashion, use e.g. install.packages("knitr") to install the R package knitr. Activating R packages can be done individually, e.g. library(knitr), or for multiple R packages in one go, e.g. libraries(c("fitdistrplus", "actuar", "knitr", "data.table")), but the latter requires a prior installation of the R package easypackages.

#install.packages("knitr") ## example of individual R package installation
#library(knitr)            ## example of individual R package activation
#suppressWarnings(library(easypackages))
#library(easypackages)
#libraries(c("fitdistrplus", "actuar", "knitr", "data.table", "MASS", "survival")) ## example of multiple R package activation in one go
suppressWarnings(library(actuar, warn.conflicts = FALSE))  ## warning messages from activating "actuar" package are ignored
suppressWarnings(library(MASS))
suppressWarnings(library(survival))
suppressWarnings(library(fitdistrplus))
suppressWarnings(library(grDevices))
suppressWarnings(library(knitr))
suppressWarnings(library(data.table))

1.2 Data Description

The purpose of this section is to describe the data to a general audience. Our insurance data are given as `Building insurance claim size' and consist of a sample of size \(1,502\) of \[(x_{i1}, x_{i2}, x_{i3}, y_{i1}, y_{i2}) \quad \text{with}\; 1\le i \le 1,502,\] where \((x_{i1}, x_{i2}, x_{i3})\) are the covariates sampled from \((X_1,X_2,X_3)\) that help to predict the target variables \((Y_1,Y_2)\) via the observed sample \((y_{i1}, y_{i2})\). Recall that \((Y_1,Y_2)\) represents the building and content claims that are recorded as the Building Insurance Claim Size and Content Insurance Claim Size variables. Moreover, \((X_1,X_2,X_3)\) provides redundant information, since \(X_1\) is the Recorded time variable that contains the information given by \(X_2\) and \(X_3\), which record the claim occurrence time (year and month, respectively). That is,

the feature values \(\textbf{x}\) and target values \(\textbf{y}\) include information about the following:

  1. Recorded time (\(X_1\)) – date variable;
  2. Building Insurance Claim Size (\(Y_1\)) – continuous variable that records the building insurance claim;
  3. Content Insurance Claim Size (\(Y_2\)) – continuous variable that records the content insurance claim;
  4. Year (\(X_2\)) – categorical variable with 11 possible values: from 1980 to 1990;
  5. Month (\(X_3\)) – categorical variable with 12 possible values: from 1 representing January to 12 representing December.

The data file is named Bivariate_Fire_Insurance_Clean_Data.csv and a small snapshot of the data is given in Table 1.1. A full Exploratory Data Analysis is provided in Section 1.3, which explains the sample in greater detail.

Our insurance data are given as `Building insurance claim size' and have the following structure:

Bivariate_Fire_Insurance_Clean_Data <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
kable(head(Bivariate_Fire_Insurance_Clean_Data[,1:5]), caption="Fire Insurance Data Snapshot",align = "rrrrr", format.args = list(big.mark = ","))
Table 1.1: Fire Insurance Data Snapshot
Recorded.time Building.Insurance.Claim.Size Content.Insurance.Claim.Size Year Month
03/01/1980 1,098,096.6 585,651.5 1,980 1
04/01/1980 1,756,954.6 336,749.6 1,980 1
07/01/1980 1,244,509.5 3,367,496.0 1,980 1
10/01/1980 4,452,039.5 4,273,234.0 1,980 1
10/01/1980 2,494,875.5 3,543,192.0 1,980 1
16/01/1980 775,689.6 993,117.1 1,980 1
Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
#kable(head(Bivariate_Fire_Insurance_Clean_Data_toy), caption="Fire Insurance Data Snapshoot",align = "rrrrr", format.args = list(big.mark = ","))
summary(Bivariate_Fire_Insurance_Clean_Data_toy[,2:5])
       Y1                 Y2                  X2             X3        
 Min.   :   23191   Min.   :      825   Min.   :1980   Min.   : 1.000  
 1st Qu.:  878753   1st Qu.:   268458   1st Qu.:1983   1st Qu.: 4.000  
 Median : 1224291   Median :   482160   Median :1986   Median : 7.000  
 Mean   : 1871507   Mean   :  1629436   Mean   :1986   Mean   : 6.525  
 3rd Qu.: 1928640   3rd Qu.:  1188155   3rd Qu.:1988   3rd Qu.: 9.000  
 Max.   :95168375   Max.   :132013200   Max.   :1990   Max.   :12.000  

Note that this R lab aims to provide a prediction model only for the Building claims (\(Y_1\)), since the Content claims (\(Y_2\)) could be modelled in the same fashion. Keep in mind that \(Y_1\) and \(Y_2\) are dependent random variables.
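As a quick, informal check of this dependence (a minimal sketch; the correlation measures below are not part of the original analysis), one can compute the overall Pearson and Kendall correlations between the two claim types:

## overall association between Building (Y1) and Content (Y2) claims
cor(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$Y2)                     ## Pearson correlation
cor(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$Y2, method = "kendall") ## rank-based (Kendall) correlation

The yearly and monthly breakdowns of this association are computed next.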

suppressWarnings(library(corrplot))
par(mfrow=c(1,2))
### yearly correlation plot 
coryear <- c()
for(i in 1980:1990){coryear <- c(coryear, cor(Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),c("Y1","Y2")])[1,2])}
names(coryear) <-  1980:1990
plot(1980:1990,coryear, main="Yearly Correlation Plot", ylab="Corr", xlab="year", type="p", col="red")
### monthly correlation plot 
cormonth <- c()
nmonth <-  c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for(i in 1:12){cormonth <- c(cormonth, cor(Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),c("Y1","Y2")])[1,2])}
names(cormonth) <- nmonth
plot(1:12,cormonth, main="Monthly Correlation Plot",ylab="Corr", xlab="month", type="p", col="red",xaxt="n") 
axis(side=1,at=1:12,labels=nmonth)

Figure 1.1: Correlation for \((Y_1,Y_2)\) for various years (left) and months (right)

It is quite difficult to explain the yearly correlations from Figure 1.1, but the monthly correlations are quite intuitive. The Building and Content claims are highly correlated during the hot weather – as can be seen for July and August – when concomitant large claims are more likely to occur.

1.3 Exploratory Data Analysis (EDA)

The purpose of this section is to explore the trends in the data through descriptive tools, which could help with selecting the most appropriate data analysis; the corresponding statistical evidence is provided in Section 3.

Recall that the strength of association between the Building and Content claims has been examined in Section 1.2, namely in Figure 1.1. Since our real data analysis in Section 3 will focus on the Building claims, less detailed qualitative data explorations are made for the Content claims in Section 1.3. We mainly look at how homogeneous the claims are over time, though statistical evidence would be required to confirm the findings of this section.

First, we look at two timeplots, namely Figures 1.2 and 1.3, which show the evolution over time of the Building and Content claims, respectively.

Bivariate_Fire_Insurance_Clean_Data_toy$X1 <- as.Date(Bivariate_Fire_Insurance_Clean_Data_toy$X1, "%d/%m/%Y")
library(ggplot2)
library(dplyr)
library(plotly)
library(hrbrthemes)
pp1 <- Bivariate_Fire_Insurance_Clean_Data_toy %>%
    ggplot( aes(x=X1, y=Y1)) +
    geom_area(fill="#00ffff", alpha=0.5) +
    geom_line(color="#69b3a2") +
    ylab("Building Insurance Claim Size (Y1)") +
    xlab("Recorded time (X1)")+
    theme_ipsum()
pp1 <- ggplotly(pp1)
pp1

Figure 1.2: Building Insurance Claim Size Timeplot

#Bivariate_Fire_Insurance_Clean_Data_toy$X1 <- as.Date(Bivariate_Fire_Insurance_Clean_Data_toy$X1, "%d/%m/%Y")
library(ggplot2)
library(dplyr)
library(plotly)
library(hrbrthemes)
pp1 <- Bivariate_Fire_Insurance_Clean_Data_toy %>%
    ggplot( aes(x=X1, y=Y2)) +
    geom_area(fill="#84ff00", alpha=0.5) +
    geom_line(color="#69b3a2") +
    ylab("Content Insurance Claim Size (Y2)") +
    xlab("Recorded time (X1)")+
    theme_ipsum()
pp1 <- ggplotly(pp1)
pp1

Figure 1.3: Content Insurance Claim Size Timeplot

Figures 1.2 and 1.3 give an overall understanding of the claims, and we observe some outliers, i.e. very large claims compared to the common claims. We should be mindful of such claims when fitting the claim amount distribution, as such atypical claims may affect the robustness of our estimation.
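If needed, the extreme claims flagged in the timeplots can be listed directly; the short sketch below (an illustration only, not part of the original code) extracts the three largest Building and Content claims together with their recorded dates:

## the three largest Building and Content claims, with their recorded dates
head(Bivariate_Fire_Insurance_Clean_Data_toy[order(-Bivariate_Fire_Insurance_Clean_Data_toy$Y1), c("X1", "Y1")], 3)
head(Bivariate_Fire_Insurance_Clean_Data_toy[order(-Bivariate_Fire_Insurance_Clean_Data_toy$Y2), c("X1", "Y2")], 3)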

The next step is to look at some classical histograms to understand whether the claims show unusual patterns in some years and/or months. Figures 1.4 and 1.5 show the yearly histograms for the Building and Content claims, respectively.

par(mfrow=c(3,4))
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 hist(xxx$Y1,main="", xlab=i, border="red", col="green")}
hist(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, main="", xlab="All Years", border="red", col="green")

Figure 1.4: Building Insurance Claim Size (Y1) – Yearly Histograms

par(mfrow=c(3,4))
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 hist(xxx$Y2,main="", xlab=i, border="red", col="green")}
hist(Bivariate_Fire_Insurance_Clean_Data_toy$Y2, main="", xlab="All Years", border="red", col="green")

Figure 1.5: Content Insurance Claim Size (Y2) – Yearly Histograms

Figures 1.6 and 1.7 show the monthly histograms for the Building and Content claims, respectively.

par(mfrow=c(3,4))
nmonth <-  c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 hist(xxx$Y1,main="", xlab=nmonth[i], border="red", col="green")}

Figure 1.6: Building Insurance Claim Size (Y1) – Monthly Histograms

par(mfrow=c(3,4))
nmonth <-  c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec")
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 hist(xxx$Y2,main="", xlab=nmonth[i], border="red", col="green")}

Figure 1.7: Content Insurance Claim Size (Y2) – Monthly Histograms

It is quite clear that interpreting numerous histograms – displayed as Figures 1.4 to 1.7 – is not convenient, and thus, we now report summary statistics in radar plots that help gather and compare the yearly and monthly risks with ease.

The following radar plots are provided for the Building Insurance Claim Size (Y1):

  1. Figure 1.8 that displays the yearly mean and standard deviation;
  2. Figure 1.9 that displays the yearly quantiles;
  3. Figure 1.10 that displays the monthly mean and standard deviation;
  4. Figure 1.11 that displays the monthly quantiles.

Note that the claims are divided by 1,000,000, i.e. all measures of risk – mean, standard deviation and quantiles – are scaled down by a factor of 1,000,000.

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanyear <- numeric(11)
sdyear <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 meanyear[i-1979] <- mean(xxx$Y1)/1000000
 sdyear[i-1979] <- sd(xxx$Y1)/1000000}
library(plotly)
fig1eY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanyear, meanyear[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), #the labels are forced to be characters, so the setting from below does not work
               # theta = c("1980", "1981", "1982", "1983", "1984", "1985", "1986'", "1987", "1988", "1989", "1990", "1980"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY1 <- fig1eY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdyear, sdyear[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY1

Figure 1.8: Building Insurance Claim Size (Y1) – Yearly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50year <- numeric(11)
q75year <- numeric(11)
q95year <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 q50year[i-1979] <- quantile(xxx$Y1,0.50)/1000000
 q75year[i-1979] <- quantile(xxx$Y1,0.75)/1000000
 q95year[i-1979] <- quantile(xxx$Y1,0.95)/1000000}
library(plotly)
fig1qY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50year, q50year[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), 
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75year, q75year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95year, q95year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY1

Figure 1.9: Building Insurance Claim Size (Y1) – Yearly Summary Statistics (quantiles)

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanmonth <- numeric(12)
sdmonth <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 meanmonth[i] <- mean(xxx$Y1)/1000000
 sdmonth[i] <- sd(xxx$Y1)/1000000}
library(plotly)
fig1eY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanmonth, meanmonth[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY1 <- fig1eY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdmonth, sdmonth[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY1

Figure 1.10: Building Insurance Claim Size (Y1) – Monthly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50month <- numeric(12)
q75month <- numeric(12)
q95month <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 q50month[i] <- quantile(xxx$Y1,0.50)/1000000
 q75month[i] <- quantile(xxx$Y1,0.75)/1000000
 q95month[i] <- quantile(xxx$Y1,0.95)/1000000}
library(plotly)
fig1qY1 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50month, q50month[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75month, q75month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY1 <- fig1qY1 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95month, q95month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY1

Figure 1.11: Building Insurance Claim Size (Y1) – Monthly Summary Statistics (quantiles)

Building claims show an interesting trend. While Figure 1.8 shows large variation – a coefficient of variation larger than 1 – on a yearly basis, the monthly claims from Figure 1.10 show the same behaviour only for July, August and October. This peculiar behaviour could be explained by a lack of homogeneity on the annual basis, i.e. the iid (independent and identically distributed) assumption may not hold for the individual claims (sampled annually). This is only our initial intuition, though statistical testing could confirm or refute this behaviour. At the same time, the timeplot in Figure 1.2 shows three extreme claims; the most extreme claim appears in 1980, and the second (third) most extreme claim is recorded in 1985 (1988), which is in accordance with Figure 1.8. Quantiles could better explain the tail behaviour, and thus, the \(50\%\), \(75\%\) and \(95\%\) quantiles are plotted in Figures 1.9 and 1.11. The same outliers – explained before – tend to explain the large difference between the \(75\%\) and \(95\%\) quantiles, especially in the year 1985 and the month of October.
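The coefficient-of-variation statement above can be verified directly from the data; here is a minimal sketch (not part of the original code) computing the yearly and monthly sd/mean ratios of the Building claims:

## coefficient of variation (sd/mean) of the Building claims, by year and by month
cv <- function(x) sd(x)/mean(x)
round(tapply(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$X2, cv), 2) ## yearly CVs
round(tapply(Bivariate_Fire_Insurance_Clean_Data_toy$Y1, Bivariate_Fire_Insurance_Clean_Data_toy$X3, cv), 2) ## monthly CVs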

Content claims show an even more extreme behaviour – see Figures 1.12 to 1.15 – which is in accordance with the multiple unusually large claims observed in the timeplot from Figure 1.3. The last visualisation tools of this section are now given for the Content Insurance Claim Size (Y2), i.e. the following radar plots:

  1. Figure 1.12 that displays the yearly mean and standard deviation;
  2. Figure 1.13 that displays the yearly quantiles;
  3. Figure 1.14 that displays the monthly mean and standard deviation;
  4. Figure 1.15 that displays the monthly quantiles.

Note that the claims are divided by 1,000,000, i.e. all measures of risk – mean, standard deviation and quantiles – are scaled down by a factor of 1,000,000.

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanyear <- numeric(11)
sdyear <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 meanyear[i-1979] <- mean(xxx$Y2)/1000000
 sdyear[i-1979] <- sd(xxx$Y2)/1000000}
library(plotly)
fig1eY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanyear, meanyear[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), #the labels are forced to be characters, so the setting from below does not work
               # theta = c("1980", "1981", "1982", "1983", "1984", "1985", "1986'", "1987", "1988", "1989", "1990", "1980"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY2 <- fig1eY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdyear, sdyear[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY2

Figure 1.12: Content Insurance Claim Size (Y2) – Yearly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50year <- numeric(11)
q75year <- numeric(11)
q95year <- numeric(11)
for(i in 1980:1990){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X2==i),]
 q50year[i-1979] <- quantile(xxx$Y2,0.50)/1000000
 q75year[i-1979] <- quantile(xxx$Y2,0.75)/1000000
 q95year[i-1979] <- quantile(xxx$Y2,0.95)/1000000}
library(plotly)
fig1qY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50year, q50year[1]),
    theta=paste0("year ",as.character(c(1980:1990,1980))), 
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75year, q75year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95year, q95year[1]),
  theta=paste0("year ",as.character(c(1980:1990,1980))),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY2

Figure 1.13: Content Insurance Claim Size (Y2) – Yearly Summary Statistics (quantiles)

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
meanmonth <- numeric(12)
sdmonth <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 meanmonth[i] <- mean(xxx$Y2)/1000000
 sdmonth[i] <- sd(xxx$Y2)/1000000}
library(plotly)
fig1eY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(meanmonth, meanmonth[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'Average Claim/1,000,000'
  ) 
fig1eY2 <- fig1eY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(sdmonth, sdmonth[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dotdash"),
  name = 'Standard Deviation Claim/1,000,000'
  ) 
fig1eY2

Figure 1.14: Content Insurance Claim Size (Y2) – Monthly Summary Statistics

#Bivariate_Fire_Insurance_Clean_Data_toy <- read.table("Bivariate_Fire_Insurance_Clean_Data.csv",header=TRUE,as.is=TRUE,sep=",")
#colnames(Bivariate_Fire_Insurance_Clean_Data_toy) <- c("X1", "Y1", "Y2","X2", "X3")
q50month <- numeric(12)
q75month <- numeric(12)
q95month <- numeric(12)
for(i in 1:12){
 xxx <- Bivariate_Fire_Insurance_Clean_Data_toy[which(Bivariate_Fire_Insurance_Clean_Data_toy$X3==i),]
 q50month[i] <- quantile(xxx$Y2,0.50)/1000000
 q75month[i] <- quantile(xxx$Y2,0.75)/1000000
 q95month[i] <- quantile(xxx$Y2,0.95)/1000000}
library(plotly)
fig1qY2 <- plot_ly(
    type = 'scatterpolar',
    r = c(q50month, q50month[1]),
    theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
    #fill = 'toself',
    linetype = I("dot"),
    name = 'q(50%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q75month, q75month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dashdot"),
  name = 'q(75%)/1,000,000'
  ) 
fig1qY2 <- fig1qY2 %>% add_trace(
  type = 'scatterpolar',
  r = c(q95month, q95month[1]),
  theta = c("Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec", "Jan"),
  #fill = 'toself',
  linetype = I("dash"),
  name = 'q(95%)/1,000,000'
  ) 
fig1qY2

Figure 1.15: Content Insurance Claim Size (Y2) – Monthly Summary Statistics (quantiles)

This ends our EDA, which is a preamble to the real data analysis provided in Section 3. We next provide a detailed simulation study in Section 2 that relies on the same loss data estimation procedures as the real data analysis. The main aim of the simulation study is to point out possible shortcomings of the statistical models used to fit our insurance data.

2 Simulation Study


In this section, you learn to:

  • Simulate data from parametric (loss) distributions;
  • Fit parametric distributions via Maximum Likelihood Estimation (MLE);
  • Fit parametric distributions via the Method of Moments (MOM).

2.1 Simulate Data

This section provides elementary simulations from univariate (loss) distributions. We first simulate samples of size \(n=500\) – though samples of size \(n=\{25, 50, 100, 250, 500\}\) will be later generated – from the following three parametric distributions:

  1. LogNormal with parameters \(\mu=10\) and \(\sigma=1\), known as Sample 1 – note that all LogNormal distributions are moderately heavy tailed distributions;
  2. Weibull with parameters \(a=0.5\) and \(b=1\), known as Sample 2 – note that this particular Weibull is a moderately heavy tailed distribution since \(a<1\);
  3. Weibull with parameters \(a=2\) and \(b=1\), known as Sample 3 – note that this particular Weibull is a very light tailed distribution since \(a>1\).

Note that our LogNormal distribution has the usual parametrisation, i.e. its expected value is \(\exp\left(\mu+\frac{\sigma^2}{2}\right)\). Further, our Weibull parametrisation, which is in terms of \(a,b>0\), has the following cumulative distribution function:

\[ F_{Weibull}(x;a,b)=\Pr(X\le x)=1-\exp\left\{-\left(\frac{x}{b}\right)^a\right\},\quad\text{for all}\;x\ge0,\;\text{i.e.}\;\mathbb{E}[X]=b\cdot\Gamma\left(1+\frac{1}{a}\right), \] where \(\Gamma(\cdot)\) is the usual Gamma function.
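As a quick numerical sanity check of this parametrisation (a sketch only; the values a, b and x below are illustrative), one can verify that R's pweibull agrees with the cumulative distribution function above and that a large simulated sample has mean close to \(b\cdot\Gamma(1+1/a)\):

## check of the Weibull parametrisation (shape = a, scale = b) used throughout
a <- 0.5; b <- 1; x <- 2
pweibull(x, shape = a, scale = b)           ## R's CDF at x
1 - exp(-(x/b)^a)                           ## the formula above -- should match
mean(rweibull(10^6, shape = a, scale = b))  ## should be close to b*gamma(1 + 1/a) = 2
b*gamma(1 + 1/a)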

The summary statistics for the three samples are given below:

set.seed(123); #set your seed so you could replicate the same sample every time you run the R script
n=500;  
#Simulate Sample 1: logNormal($\mu=10, \sigma=1$) moderate heavy tail
Sample1 <- rlnorm(n, meanlog=10, sdlog=1)  
summary(Sample1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1539   12399   22488   36912   43705  563003
#Simulate Sample 2: Weibull($a=0.5, b=1$) moderate heavy tail
Sample2 <- rweibull(n, shape=0.5 , scale=1) 
#Note that the scale parameter b=1 
summary(Sample2)
##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
##  0.00000  0.09372  0.48676  2.01320  1.98672 45.73743
#Simulate Sample 3: Weibull($a=2, b=1$) very light tail
Sample3 <- rweibull(n, shape=2 , scale=1)
summary(Sample3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.06029 0.53292 0.83407 0.88164 1.16460 2.31943

Let us plot the histograms for three samples, which are given in Figure 2.1 below.

par(mfrow=c(2,2))
hist(Sample1, prob=TRUE, border="red", col="green");
hist(log(Sample1), prob=TRUE, border="red", col="green");
hist(Sample2, prob=TRUE, border="red", col="green");
hist(Sample3, prob=TRUE, border="red", col="green");

Figure 2.1: Histograms for Sample 1 and log of Sample 1 (top), and Samples 2 and 3 (bottom)

As expected, the Sample 1 and Sample 2 histograms in Figure 2.1 are right skewed due to their heavy-tailedness, while Sample 3 shows a very different pattern. At first glance, Sample 2 seems to come from a less heavy tailed distribution than Sample 1, since one may say the Sample 1 histogram fades away after the threshold \(3\times10^5\), while the Sample 2 histogram fades away after the threshold \(30\); this argument is faulty since the two samples come from distributions with very different scales and moments, e.g. the theoretical expected values of Sample 1 and Sample 2 are \(\exp(10.5)=36,315.50\) and \(1\cdot\Gamma\left(1+\frac{1}{0.5}\right)=2\), respectively. In fact, one may argue that the theoretical distribution of Weibull(\(a=0.5\),\(b=1\)) exhibits a heavier tail than LogNormal(\(\mu=10\),\(\sigma=1\)).
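One informal way to see this claim – a minimal sketch comparing quantiles scaled by the respective means, not a formal tail comparison – is the following:

## mean-scaled quantiles: Weibull(a=0.5, b=1) versus LogNormal(mu=10, sigma=1)
p <- c(0.95, 0.99, 0.999)
qweibull(p, shape = 0.5, scale = 1)/gamma(1 + 1/0.5) ## Weibull quantiles divided by its mean (= 2)
qlnorm(p, meanlog = 10, sdlog = 1)/exp(10 + 1/2)     ## LogNormal quantiles divided by its mean

At these probability levels the mean-scaled Weibull quantiles are larger, which is consistent with the statement above; a formal comparison would require analysing the two survival functions.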

Finally, the log-transformation of the Sample 1 data leads to a symmetric histogram due to the fact that \(\log(X)\sim N\big(\mu,\sigma^2\big)\) whenever \(X\sim LogN\big(\mu,\sigma^2\big)\).
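A quick graphical check of this statement (a sketch; the plot is not part of the original set of figures) is a normal Q-Q plot of the log-transformed sample:

## normal Q-Q plot of log(Sample1): points close to the reference line support log-normality
qqnorm(log(Sample1)); qqline(log(Sample1), col = "red")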

2.2 Maximum Likelihood Estimation (MLE)


We are now ready to parametrically fit our sample data from Section 2.1, and two methods are chosen:

  • Maximum Likelihood Estimation (MLE) – see Section 2.2 – where the parameters are chosen so that the likelihood of observing the given sample is maximised;
  • Method of Moments (MOM) – see Section 2.3 – where the sample and theoretical moments are matched, i.e. set equal.

MLE and MOM estimation are possible via the mledist and mmedist functions that are available in the fitdistrplus R package. There are numerous numerical issues with both the MLE and MOM implementations in R, especially when fitting distributions with more than one parameter. Why?

MLE requires solving non-linear optimisation problems, and R (and Python) rely on basic optimisation tools that are designed for general use; bespoke software like MATLAB would be more computationally stable. This issue is more acute when the dimension of the optimisation problem, i.e. the number of parameters, gets larger (two or more parameters).

MOM usually requires solving a non-linear system of equations, which is computationally feasible even in R (and Python) – as an application of the Fixed Point Theorem, though other methods are available – when a one-parameter distribution is fitted, but MOM is computationally more challenging (not only in R and Python) when the number of parameters gets larger (two or more parameters).


Eight parametric families are used to fit our three samples. The standard R names are used to indicate which parametric family we refer to, together with the R package(s) in which it is implemented:

  1. lnorm refers to LogNormal distribution – see actuar and stats R packages;
  2. exp refers to Exponential distribution – see actuar and stats R packages;
  3. gamma refers to Gamma distribution – see actuar and stats R packages;
  4. weibull refers to Weibull distribution – see actuar and stats R packages;
  5. llogis refers to LogLogistic distribution – see actuar R package;
  6. invweibull refers to Inverse Weibull distribution – see actuar R package;
  7. pareto refers to Pareto distribution – see actuar R package;
  8. trgamma refers to transformed Gamma distribution – see actuar R package.

Recall that pareto is defined on \((0, \infty)\), which is not true for Pareto Type 1, and thus, we prefer using pareto so that all eight parametric families are defined on the same range.
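To make the parametrisation explicit – a minimal sketch, assuming actuar's shape/scale parametrisation – one can check that ppareto matches the cumulative distribution function \(1-\left(\lambda/(\lambda+x)\right)^{\alpha}\) (spelled out again in Section 2.4), which is indeed defined on the whole of \((0,\infty)\); the shape and scale values below are illustrative only:

## actuar's pareto: F(x) = 1 - (scale/(scale + x))^shape, supported on (0, Inf)
x <- c(0, 100, 1000)
ppareto(x, shape = 2, scale = 1000) ## actuar's CDF
1 - (1000/(1000 + x))^2             ## same values from the formula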

MLE estimation requires a `good' starting point for its numerical algorithm, and thus, one should be careful when performing MLE estimation for distributions with two or more parameters; local optima or even saddle points could be obtained, far from the desired global optimum point/points (if this/these exist), since the log-likelihood function is not always concave. MOM faces similar numerical shortcomings when performing MOM estimation for distributions with two or more parameters. Numerical issues are signified as NA in the R output.
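A common way to obtain such a `good' starting point – a sketch under the assumption that rough moment-matching values land close enough to the optimum; the object names below are only illustrative – is to pass moment-based guesses to mledist via its start argument:

## moment-based starting values for the gamma MLE fit: shape = mean^2/var, rate = mean/var
m <- mean(Sample1); v <- var(Sample1)
start.gamma <- list(shape = m^2/v, rate = m/v)
MLE.gamma.start <- mledist(Sample1, "gamma", lower = c(0, 0), start = start.gamma)
MLE.gamma.start$estimate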

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming lnorm family) as follows:

MLE.lnorm=mledist(Sample1, "lnorm")
MLE.lnorm$estimate
##    meanlog      sdlog 
## 10.0345904  0.9717961

Sample 1 (simulated from LogNormal) is fitted via MLE assuming exp family as follows:

MLE.exp=mledist(Sample1/1000, "exp") 
#Large valued data (not large sized samples) require to scale down the data by a factor 
#we chose here to scale down by a factor of 1,000 since the expected value is around \exp(10.5)=36,315.50
#we should scale back the MLE estimate, and since we actually fit the rate, we divide the MLE estimate by 1,000
#recall that expected value is equal to 1/rate
MLE.exp$estimate/1000
##         rate 
## 2.709161e-05

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming gamma family) as follows:

MLE.gamma=mledist(Sample1, "gamma",lower=c(0, 0), start=list(shape =1, rate=1))
#A `good' starting point is needed here
MLE.gamma$estimate  #rate=1/scale
##        shape         rate 
## 1.177195e+00 3.188515e-05

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming weibull family) as follows:

MLE.weibull=mledist(Sample1, "weibull")
MLE.weibull$estimate
##        shape        scale 
##     1.016727 37225.585178

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming llogis family) as follows:

MLE.llogis=mledist(Sample1, "llogis")
MLE.llogis$estimate
##        shape        scale 
##     1.802915 22563.980389

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming invweibull family) as follows:

MLE.invweibull=mledist(Sample1, "invweibull")
MLE.invweibull$estimate
##        shape        scale 
##    0.4334861 1036.7107921

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming pareto family) as follows:

MLE.pareto=mledist(Sample1, "pareto",lower=c(0, 0), start=list(shape =1, scale=1))
#A `good' starting point is needed here
MLE.pareto$estimate
##        shape        scale 
## 7.484903e+00 2.382607e+05

Sample 1 (simulated from LogNormal) is fitted via MLE (assuming trgamma family) as follows:

MLE.trgamma=mledist(Sample1, "trgamma")
MLE.trgamma$estimate
##      shape1      shape2        rate 
## 6.922385543 0.391949647 0.004997469

The densities of the MLE-fitted distributions are plotted in Figure 2.2; there is not much to say about these plots at this stage.

par(mfrow=c(2,4))
curve(dlnorm(x,MLE.lnorm$estimate[1],MLE.lnorm$estimate[2]),from=0,to=max(Sample1),ylab="Density",  col=1, main="lnorm", xlab = "")
curve(dexp(x,MLE.exp$estimate/1000),from=0,to=max(Sample1),ylab="Density",  col=2, main="exp", xlab = "")
curve(dgamma(x,MLE.gamma$estimate[1],MLE.gamma$estimate[2]),from=0,to=max(Sample1),ylab="Density",  col=3, main="gamma", xlab = "")
curve(dweibull(x,MLE.weibull$estimate[1],MLE.weibull$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=4,main="weibull", xlab = "")
curve(dllogis(x,MLE.llogis$estimate[1],MLE.llogis$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=5,main="llogis", xlab = "")
curve(dinvweibull(x,MLE.invweibull$estimate[1],MLE.invweibull$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=6,main="invweibull", xlab = "")
curve(dpareto(x,MLE.pareto$estimate[1],MLE.pareto$estimate[2]),from=0, to = max(Sample1), ylab="Density",col=7,main="pareto", xlab = "")
curve(dtrgamma(x,MLE.trgamma$estimate[1],MLE.trgamma$estimate[2],MLE.trgamma$estimate[3]),xlim=c(0,max(Sample1)),ylab="Density",col=8, main="trgamma", xlab = "")

Figure 2.2: MLE Density Plots

2.3 Method of Moments

MOM usually requires solving a non-linear system of equations, which is often computationally infeasible when MOM is applied to parametric families with two or more parameters – as also mentioned at the beginning of Section 2.2. The Sample 1 data are fitted via MOM through the mmedist function, and you will see that MOM estimation fails in some cases, as we anticipated.

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming lnorm family) as follows:

MME.lnorm=mmedist(Sample1, "lnorm")
MME.lnorm$estimate
##   meanlog     sdlog 
## 10.042359  0.973578

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming exp family) as follows:

MME.exp=mmedist(Sample1, "exp")
MME.exp$estimate
##         rate 
## 2.709161e-05

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming gamma family) as follows:

MME.gamma=mmedist(Sample1, "gamma")
MME.gamma$estimate
##        shape         rate 
## 6.328446e-01 1.714478e-05

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming weibull family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.weibull=mmedist(Sample1, "weibull", order=c(1, 2), memp=memp)
## checkstartfix is carried out
MME.weibull$estimate
##        shape        scale 
## 9.504584e-01 3.991438e+04

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming llogis family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.llogis=mmedist(Sample1, "llogis", order=c(1, 2), memp=memp)
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
MME.llogis$estimate
## [1] NA NA

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming invweibull family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.invweibull=mmedist(Sample1, "invweibull", order=c(1, 2), memp=memp)
## checkstartfix is carried out
MME.invweibull$estimate
##     shape     scale 
##  7379.188 59279.946

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming pareto family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.pareto=mmedist(Sample1, "pareto", order=c(1, 2), memp=memp)
## checkstartfix is carried out
MME.pareto$estimate
##        shape        scale 
## 5.447284e+00 1.641572e+05

Sample 1 (simulated from LogNormal) is fitted via MOM (assuming trgamma family) as follows:

memp  <-  function(x, order) mean(x^order)
MME.trgamma=mmedist(Sample1, "trgamma", order=c(1, 2, 3), memp=memp)
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
MME.trgamma$estimate
## [1] NA NA NA

There are no obvious concerns with the MOM estimates when fitting the lnorm, exp and gamma families. There is no doubt that MOM fails when fitting the llogis and trgamma families, and there is no immediate remedy. MOM estimation when fitting the weibull, invweibull and pareto families produces some caution messages, and thus, we check whether the first two empirical and theoretical moments are matched, as they are supposed to be.

xxx <- c(mean(Sample1),mean(Sample1^2))
xxxW<-c(mweibull(1, shape=MME.weibull$estimate[1], scale = MME.weibull$estimate[2]),mweibull(2, shape=MME.weibull$estimate[1], scale = MME.weibull$estimate[2]))
xxxIW<-c(minvweibull(1, shape=MME.invweibull$estimate[1], scale = MME.invweibull$estimate[2]),minvweibull(2, shape=MME.invweibull$estimate[1], scale = MME.invweibull$estimate[2]))
xxxP<-c(mpareto(1, shape=MME.pareto$estimate[1], scale = MME.pareto$estimate[2]),mpareto(2, shape=MME.pareto$estimate[1], scale = MME.pareto$estimate[2]))
cat("First two (raw) moments of Samples 1 -- empirical estimates")
## First two (raw) moments of Samples 1 -- empirical estimates
xxx
## [1] 3.691179e+04 3.515427e+09
cat("First two (raw) moments of Samples 1 -- theoretical MOM weibull")
## First two (raw) moments of Samples 1 -- theoretical MOM weibull
xxxW
## [1] 4.083912e+04 3.515497e+09
cat("First two (raw) moments of Samples 1 -- theoretical MOM invweibull")
## First two (raw) moments of Samples 1 -- theoretical MOM invweibull
xxxIW
## [1] 5.928458e+04 3.514662e+09
cat("First two (raw) moments of Samples 1 -- theoretical MOM pareto")
## First two (raw) moments of Samples 1 -- theoretical MOM pareto
xxxP
## [1] 3.691179e+04 3.515427e+09
cat("The diferences between the empirical and theoretical MOM for weibull, invweibull and pareto")
## The diferences between the empirical and theoretical MOM for weibull, invweibull and pareto
xxx-xxxW
## [1]  -3927.331 -70056.150
xxx-xxxIW
## [1] -22372.79 764842.04
xxx-xxxP
## [1] 3.953755e-07 0.000000e+00

This suggests that the weibull and invweibull MOM estimates are far off, while the pareto MOM estimation is fine as it is.

One could help R to better estimate the weibull MOM estimates from first principles. That is, we have to solve the following system of equations: \[\mathbb{E}[X]=b\cdot\Gamma\left(1+\frac{1}{a}\right)=m_1\quad\text{and}\quad\mathbb{E}\big[X^2\big]=b^2\cdot\Gamma\left(1+\frac{2}{a}\right)=m_2,\] where \(m_1\) and \(m_2\) are the first and second empirical moment estimates. That is, we need to find the MOM estimate of the shape parameter \(\hat{a}_{MOM}>0\) such that \[\frac{m_2}{(m_1)^2}\left(\Gamma\left(1+\frac{1}{a}\right)\right)^2-\Gamma\left(1+\frac{2}{a}\right)=0,\] which is a root problem that could be solved in R with the help of the uniroot function. Clearly, the MOM estimate of the scale parameter is given by \[\hat{b}_{MOM}=\frac{m_1}{\Gamma\left(1+\frac{1}{\hat{a}_{MOM}}\right)}.\]

Now, let us solve the root equation defined earlier, and we get that

xxx <- c(mean(Sample1),mean(Sample1^2))
fW <- function(x) {(xxx[2]/xxx[1]^2)*(gamma(1+(1/x)))^2-gamma(1+(2/x))}
asol<-uniroot(fW, c(0.1,1))$root
bsol<-xxx[1]/gamma(1+(1/asol))
c(asol,bsol)
## [1] 8.020448e-01 3.263813e+04
xxxW<-c(mweibull(1, shape=asol, scale = bsol),mweibull(2, shape=asol, scale = bsol))
cat("The diferences between the empirical and theoretical MOM for weibull")
## The diferences between the empirical and theoretical MOM for weibull
xxx - xxxW
## [1]     0.00 16711.37
cat("Weibull MOM estimates -- (a=shape, b=scale)")
## Weibull MOM estimates -- (a=shape, b=scale)
c(asol,bsol)
## [1] 8.020448e-01 3.263813e+04

As you can see, the default tolerance in the uniroot function does not lead to reasonable weibull MOM estimates. Therefore, we reduce the tolerance to \(tol=10^{-15}\) and narrow down the search interval to \((0.7,0.9)\) from the previous search interval of \((0.1,1)\), and we get that

#fW <- function(x) {(xxx[2]/xxx[1]^2)*(gamma(1+(1/x)))^2-gamma(1+(2/x))}
asol1<-uniroot(fW, c(0.7,0.9), tol=10^(-15))$root
bsol1<-xxx[1]/gamma(1+(1/asol1))
xxxW1<-c(mweibull(1, shape=asol1, scale = bsol1),mweibull(2, shape=asol1, scale = bsol1))
cat("The diferences between the empirical and theoretical MOM for weibull")
## The diferences between the empirical and theoretical MOM for weibull
xxx -xxxW1
## [1]  0.000000e+00 -4.768372e-07
cat("Weibull MOM estimates -- (a=shape, b=scale)")
## Weibull MOM estimates -- (a=shape, b=scale)
c(asol1,bsol1)
## [1] 8.020419e-01 3.263804e+04

The above estimates are reasonable since the empirical and theoretical moments agree up to an error of order \(10^{-7}\), even though there are small differences between the MOM outputs \(\big(\hat{a}_{MOM},\hat{b}_{MOM}\big)\) obtained with the default and the reduced tolerance.

It is a good exercise for you to try finding a remedy for the invweibull MOM estimation, which is similar to the weibull remedy from above.

As a final note, you should be aware that some ad-hoc MOM estimations are based on dispersion measures as well. For example, if one fits the gamma family, which has two parameters, one may try matching the first moment and the variance, instead of matching the first and second moments. This is not a big issue, though the variance could be estimated via the unbiased or the asymptotically unbiased estimator – each having its pros and cons depending on the sample size – and the two choices lead to different sets of estimates; recall that even a small difference in parameter estimates could lead to massive model error, as we noticed in the previous weibull MOM estimation.
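To illustrate this point – a minimal sketch, not part of the fitdistrplus workflow – the gamma parameters matched to the sample mean and variance are \(shape=m_1^2/v\) and \(rate=m_1/v\), and the two variance estimators give (slightly) different answers:

## ad-hoc gamma MOM via mean/variance matching, with the two variance estimators
m <- mean(Sample1)
v.unbiased <- var(Sample1)                                       ## denominator n - 1
v.asymp    <- var(Sample1)*(length(Sample1) - 1)/length(Sample1) ## denominator n
c(shape = m^2/v.unbiased, rate = m/v.unbiased)
c(shape = m^2/v.asymp,    rate = m/v.asymp)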

2.4 LogNormal Sample – Further Analysis


In this section, you learn to:

  • Compare competitive fitting models;
  • Use bootstrapping to robustify your estimation;
  • Understand the effect of sample size over the objectives from above.

So far, Sample 1 has been drawn from a LogNormal distribution – a moderately heavy tailed distribution – with parameters \(\mu=10\) and \(\sigma=1\). The size of Sample 1 has been \(n=500\); we now take five samples of sizes \(n=\{25, 50, 100, 250, 500\}\) as follows:

set.seed(1234) #Generate Sample1 and save in a list
n=c(25, 50, 100, 250, 500)
lnorm.meanlog=10
lnorm.sdlog=1
Sample1.list=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
Sample1=rlnorm(n[i], meanlog=lnorm.meanlog, sdlog=lnorm.sdlog)
Sample1.list[[i]]=Sample1
}

We then perform a parametric estimation – via MLE and MOM – for all five samples when the underlying distribution is one of the eight parametric families considered before.

Note that our MLE estimates from this section are as given by the R function mledist without checking whether these estimates are points of global maxima, which is beyond the scope of this session. Section 2.2 has detailed what one should do to enhance our confidence in our MLE estimates.

Note also that the MOM estimation is as given by the R function mmedist, which we have already seen can be far off for some parametric choices; specifically, when \(n=500\), the weibull and invweibull MOM estimations are not acceptable, though the pareto estimates turned out to be fine. We will not repeat the same exercise as in Section 2.3 to rectify these shortcomings, but one should keep these possible drawbacks in mind.

Here is the MLE and MOM estimation for all samples and parametric assumptions.

#MLE and MME for 8 different distributions and saved in each matrix
M1.MLE.lnorm=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("meanlog","sdlog")))
M2.MLE.exp=matrix(0,length(n),1,dimnames=list(c("n25","n50","n100","n250","n500"),c("rate")))
M3.MLE.gamma=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","rate")))
M4.MLE.weibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M5.MLE.llogis=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M6.MLE.invweibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M7.MLE.pareto=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M8.MLE.trgamma=matrix(0,length(n),3,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape1","shape2","rate")))
M1.MME.lnorm=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("meanlog","sdlog")))
M2.MME.exp=matrix(0,length(n),1,dimnames=list(c("n25","n50","n100","n250","n500"),c("rate")))
M3.MME.gamma=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","rate")))
M4.MME.weibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M5.MME.llogis=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M6.MME.invweibull=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M7.MME.pareto=matrix(0,length(n),2,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape","scale")))
M8.MME.trgamma=matrix(0,length(n),3,dimnames=list(c("n25","n50","n100","n250","n500"),c("shape1","shape2","rate")))
for (i in 1:length(n)) {
#lnorm
M1.MLE.lnorm[i,]=mledist(Sample1.list[[i]], "lnorm")$estimate
M1.MME.lnorm[i,]=mmedist(Sample1.list[[i]], "lnorm")$estimate
#exp
MLE.exp=mledist(Sample1.list[[i]]/1000, "exp")
#if (is.na(MLE.exp$estimate)==is.na(NA)) {MLE.exp$estimate=1/mean(Sample1.list[[i]])}
M2.MLE.exp[i,]=MLE.exp$estimate/1000
M2.MME.exp[i,]=mmedist(Sample1.list[[i]], "exp")$estimate
#gamma
MLE.gamma=mledist(Sample1.list[[i]], "gamma",lower=c(0, 0), start=list(shape =1, rate=1))  #when the meanlog is too large, become NA
M3.MLE.gamma[i,]=MLE.gamma$estimate
#if (is.na(MLE.gamma$estimate[1])==is.na(NA)) {
#  MLE.gamma$estimate=gammaMLE(Sample1.list[[i]])
#  MLE.gamma=c(MLE.gamma$estimate[1],1/MLE.gamma$estimate[2])
#  M3.MLE.gamma[i,]=MLE.gamma}
M3.MME.gamma[i,]=mmedist(Sample1.list[[i]], "gamma")$estimate
#weibull
M4.MLE.weibull[i,]=mledist(Sample1.list[[i]], "weibull")$estimate
memp  <-  function(x, order) mean(x^order)
MME.weibull=mmedist(Sample1.list[[i]], "weibull", order=c(1, 2), memp=memp)
M4.MME.weibull[i,]=MME.weibull$estimate
#llogis
M5.MLE.llogis[i,]=mledist(Sample1.list[[i]], "llogis")$estimate
memp  <-  function(x, order) mean(x^order)
MME.llogis=mmedist(Sample1.list[[i]], "llogis", order=c(1, 2), memp=memp)
M5.MME.llogis[i,]=MME.llogis$estimate
#invweibull  
M6.MLE.invweibull[i,]=mledist(Sample1.list[[i]], "invweibull")$estimate
memp  <-  function(x, order) mean(x^order)
MME.invweibull=mmedist(Sample1.list[[i]], "invweibull", order=c(1, 2), memp=memp)
M6.MME.invweibull[i,]=MME.invweibull$estimate
#pareto
M7.MLE.pareto[i,]=mledist(Sample1.list[[i]], "pareto")$estimate
memp  <-  function(x, order) mean(x^order)
MME.pareto=mmedist(Sample1.list[[i]], "pareto", order=c(1, 2), memp=memp)
M7.MME.pareto[i,]=MME.pareto$estimate
#trgamma
M8.MLE.trgamma[i,]=mledist(Sample1.list[[i]], "trgamma")$estimate
memp  <-  function(x, order) mean(x^order)
MME.trgamma=mmedist(Sample1.list[[i]], "trgamma", order=c(1, 2, 3), memp=memp)
M8.MME.trgamma[i,]=MME.trgamma$estimate
}
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
## checkstartfix is carried out
## checkstartfix is carried out
## checkstartfix is carried out
## <simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming lnorm family) as follows:

#print table
kable(cbind(M1.MLE.lnorm, M1.MME.lnorm), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.1: MLE estimates (left) and MOM estimates (right)
meanlog sdlog meanlog sdlog
n25 9.758218 0.9048491 9.643222 1.1298040
n50 9.773453 1.0790497 9.872765 1.0785899
n100 10.052040 0.8846443 10.091675 0.8532116
n250 10.035170 1.0340745 10.039688 1.0131995
n500 9.954018 0.9791639 9.952332 0.9837266

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming exp family) as follows:

#print table
kable(cbind(M2.MLE.exp, M2.MME.exp), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.2: MLE estimates (left) and MOM estimates (right)
rate rate
n25 3.43e-05 3.43e-05
n50 2.88e-05 2.88e-05
n100 2.88e-05 2.88e-05
n250 2.61e-05 2.61e-05
n500 2.94e-05 2.94e-05

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming gamma family) as follows:

#print table
kable(cbind(M3.MLE.gamma, M3.MME.gamma), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.3: MLE estimates (left) and MOM estimates (right)
shape rate shape rate
n25 1.0918842 3.74e-05 0.3870090 1.33e-05
n50 0.8632702 2.49e-05 0.4544097 1.31e-05
n100 1.3814064 3.98e-05 0.9338181 2.69e-05
n250 1.1027760 2.88e-05 0.5581965 1.46e-05
n500 1.1761067 3.45e-05 0.6127707 1.80e-05

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming weibull family) as follows:

#print table
kable(cbind(M4.MLE.weibull, M4.MME.weibull), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.4: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 0.9437471 29,543.90 0.7860005 29,567.97
n50 0.8482235 31,171.88 0.7734185 32,456.36
n100 1.1193522 36,437.53 1.1269249 39,020.15
n250 0.9823712 37,948.02 0.9101338 41,143.93
n500 1.0165764 34,341.51 0.9437451 36,941.55

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming llogis family) as follows:

#print table
kable(cbind(M5.MLE.llogis, M5.MME.llogis), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.5: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 0.1879159 17,280.25 126,073,923 2,071,502
n50 1.6328594 15,686.76 NA NA
n100 0.1950570 23,373.39 NA NA
n250 1.7470879 23,175.25 NA NA
n500 1.7940411 20,968.28 NA NA

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming invweibull family) as follows:

#print table
kable(cbind(M6.MLE.invweibull, M6.MME.invweibull), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.6: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 0.5934289 1,714.3955 6,404.055 55,248.86
n50 0.5975857 1,616.6989 7,926.824 62,073.97
n100 0.6110728 2,850.8647 6,417.635 49,987.98
n250 0.3123593 365.3364 8,418.296 63,971.32
n500 0.4255051 902.3304 6,379.283 55,269.19

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming pareto family) as follows:

#print table
kable(cbind(M7.MLE.pareto, M7.MME.pareto), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.7: MLE estimates (left) and MOM estimates (right)
shape scale shape scale
n25 3.930348 82,776.96 3.262691 66,039.26
n50 2.395978 49,999.88 3.665754 92,496.85
n100 30.655653 1,030,842.25 30.219733 1,015,106.90
n250 6.199372 197,403.65 4.526900 135,049.08
n500 7.722460 227,712.70 5.164899 141,900.73

The Sample 1 data (simulated from LogNormal) of sizes \(n=\{25, 50, 100, 250, 500\}\) are fitted via MLE and MOM (assuming trgamma family) as follows:

#print table
kable(cbind(M8.MLE.trgamma, M8.MME.trgamma), digits = 7, format.args = list(big.mark = ","), caption = "MLE estimates (left) and MOM estimates (right)")
Table 2.8: MLE estimates (left) and MOM estimates (right)
shape1 shape2 rate shape1 shape2 rate
n25 9.115813 0.3457418 0.0288253 NA NA NA
n50 7.066169 0.3303238 0.0168226 NA NA NA
n100 7.186013 0.4164912 0.0040745 NA NA NA
n250 6.720832 0.3822265 0.0052258 NA NA NA
n500 8.476387 0.3523861 0.0171844 NA NA NA

There is a lot to absorb from these estimates, and we will not analyse every single set of estimates, but we note the following:

  1. MLE and MOM estimates are identical when fitting exp, as expected, since both equal \(1/m_1\), where \(m_1\) is the sample mean estimator (see the quick check after this list);
  2. There are no obvious concerns with almost all MLE estimates, though the pareto estimates for \(n=100\) are very different from those for all other sample sizes;
  3. There are no obvious concerns with the MOM estimates when fitting the lnorm, exp and gamma families;
  4. As in Section 2.3, there is no doubt that MOM fails when fitting the llogis and trgamma families;
  5. As in Section 2.3, the MOM estimates when fitting the weibull, invweibull and pareto families should be checked and bespoke remedies might be needed; the MOM (like the MLE) estimates for the pareto family when \(n=100\) are very different from those for all other sample sizes.
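A quick check of the first point – a sketch using the samples generated above – confirms that the fitted exp rate is simply the reciprocal of the sample mean for every sample size:

## exponential MLE/MOM rate equals 1/(sample mean) for each sample size
sapply(Sample1.list, function(s) 1/mean(s)) ## compare with the rate columns of Table 2.2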

Let us check the MOM estimates for pareto family when \(n=100\), though the MLE counterpart has the same issue.

xxx <- c(mean(Sample1.list[[3]]),mean(Sample1.list[[3]]^2))
xxxP<-c(mpareto(1, shape=M7.MME.pareto[3,1], scale = M7.MME.pareto[3,2]),mpareto(2, shape=M7.MME.pareto[3,1], scale = M7.MME.pareto[3,2])) #first and second theoretical (raw) moments of the fitted pareto
cat("First two (raw) moments of Sample 1 with n=100 -- empirical estimates")
## First two (raw) moments of Sample 1 with n=100 -- empirical estimates
xxx
## [1] 3.474046e+04 2.499335e+09
cat("First two (raw) moments of Sample 1 with n=100 -- theoretical moments of the pareto fitted via MOM")
## First two (raw) moments of Sample 1 with n=100 -- theoretical moments of the pareto fitted via MOM
xxxP
cat("The differences between the empirical and theoretical moments for pareto")
## The differences between the empirical and theoretical moments for pareto
xxx-xxxP

As a bespoke remedy, we can also derive the pareto MOM estimates from first principles rather than relying on the numerical optimiser. That is, we have to solve the following system of equations: \[\mathbb{E}[X]=\frac{\lambda}{\alpha-1}=m_1\quad\text{and}\quad\mathbb{E}\big[X^2\big]=\frac{2\lambda^2}{(\alpha-1)(\alpha-2)}=m_2,\] where \(m_1\) and \(m_2\) are the first and second empirical moment estimates. Recall that the shape parameter should satisfy \(\alpha>2\) so that the first two moments are finite, and \(\lambda>0\) must always hold, since our pareto parametrisation has the following cumulative distribution function: \[ F_{Pareto}(x;\alpha,\lambda)=\Pr(X\le x)=1-\left(\frac{\lambda}{\lambda+x}\right)^\alpha\quad\text{for all}\;x\ge0\quad\text{with}\;\alpha,\lambda>0. \] Our system of equations requires that \(\lambda=m_1(\alpha-1)\), which in turn gives \[ m_2=2m_1^2\cdot\frac{\alpha-1}{\alpha-2}\implies \hat{\alpha}_{MOM}=2+\left(\frac{m_2}{2m_1^2}-1\right)^{-1}\quad\text{and}\quad \hat{\lambda}_{MOM}=m_1(\hat{\alpha}_{MOM}-1). \] Recall that \(\hat{\alpha}_{MOM}>2\) and \(\hat{\lambda}_{MOM}>0\) must hold, which is true as long as \(\frac{m_2}{2m_1^2}-1>0\); for this particular sample \(\frac{m_2}{2m_1^2}-1=0.035436>0\), so the MOM estimates are feasible. We can now check whether we got what we need, not only what we hoped for:

xxx <- c(mean(Sample1.list[[3]]),mean(Sample1.list[[3]]^2))
alpha1MOM <- 2+(xxx[2]/(2*xxx[1]^2)-1)^(-1)
lambda1MOM <- (alpha1MOM-1)*xxx[1]
cat("Pareto MOM estimates of Samples 1 with n=100 -- (alpha1=shape, lambda1=scale)")
## Pareto MOM estimates of Samples 1 with n=100 -- (alpha1=shape, lambda1=scale)
c(alpha1MOM,lambda1MOM)
## [1] 3.021973e+01 1.015107e+06
xxxP1<-c(mpareto(1, shape=alpha1MOM, scale = lambda1MOM),mpareto(2, shape=alpha1MOM, scale = lambda1MOM))
cat("First two (raw) moments of Samples 1 with n=100 -- empirical estimates")
## First two (raw) moments of Samples 1 with n=100 -- empirical estimates
xxx
## [1] 3.474046e+04 2.499335e+09
cat("First two (raw) moments of Samples 1 with n=100 -- theoretical MOM pareto")
## First two (raw) moments of Samples 1 with n=100 -- theoretical MOM pareto
xxxP1
## [1] 3.474046e+04 2.499335e+09
cat("The diferences between the empirical and theoretical MOM for pareto")
## The diferences between the empirical and theoretical MOM for pareto
xxx-xxxP1
## [1] 1.964509e-10 4.959106e-05

We notice that the empirical and theoretical moments are extremely close to each other, with errors of at most order \(10^{-5}\). These first-principles MOM estimates for pareto will not be used further, so that the R code does not become more cumbersome than it already is, but keep in mind that some sets of MOM estimates are not very accurate, particularly those obtained when fitting the weibull, invweibull and pareto families.
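
For reference, the closed-form estimator derived above can be wrapped into a small helper function; this is a minimal sketch of our own (the name pareto.mom is not used elsewhere in this lab).

pareto.mom <- function(x) {
  #closed-form Pareto (Lomax) MOM estimates from the first two empirical raw moments;
  #returns NA when m2/(2*m1^2) <= 1, i.e. when the moment equations have no feasible solution
  m1 <- mean(x); m2 <- mean(x^2)
  ratio <- m2/(2*m1^2) - 1
  if (ratio <= 0) return(c(shape = NA, scale = NA))
  alpha <- 2 + 1/ratio          #shape estimate, always > 2 when feasible
  lambda <- m1*(alpha - 1)      #scale estimate
  c(shape = alpha, scale = lambda)
}
pareto.mom(Sample1.list[[3]])   #reproduces alpha1MOM and lambda1MOM computed above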

Q-Q plots and P-P plots are intuitive graphical tools for comparing two given distributions. Q-Q plots tend to be more useful, but for completeness, both types of plot are provided as follows:

  1. Compare the MLE and MOM fitted distributions (by assuming lnorm) with the empirical distribution – see the Q-Q plots from Figure 2.3;
  2. Compare the MLE and MOM fitted distributions (by assuming lnorm) with the empirical distribution – see the P-P plots from Figure 2.4;
  3. Compare the MLE and MOM fitted distributions (by assuming exp) with the empirical distribution – see the Q-Q plots from Figure 2.5;
  4. Compare the MLE and MOM fitted distributions (by assuming exp) with the empirical distribution – see the P-P plots from Figure 2.6.

Note that the five samples (of sizes \(n=\{25, 50, 100, 250, 500\}\)) are drawn/simulated from LogNormal, i.e. the true distribution is LogNormal, but these samples are fitted via MLE/MOM by assuming the lnorm and exp families. Clearly, the other six parametric assumptions could be checked in the same fashion.

par(mfrow = c(2, 5))
mle.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=25)")
mle.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=50)")
mle.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=100)")
mle.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=250)")
mle.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mle")
qqcomp(list(mle.fln), legendtext = "lnorm (mle, n=500)")
mme.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=25)")
mme.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=50)")
mme.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=100)")
mme.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=250)")
mme.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mme")
qqcomp(list(mme.fln), legendtext = "lnorm (mme, n=500)")

Figure 2.3: Q-Q plots (by assuming lnorm) for MLE (top) and MOM (bottom)
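
As an aside, the ten nearly identical calls used to build Figure 2.3 could be generated with a short loop; a minimal sketch of our own that produces the same panels:

sizes <- c(25, 50, 100, 250, 500)
par(mfrow = c(2, 5))
for (m in c("mle", "mme")) {   #first pass fills the top row (MLE), second pass the bottom row (MOM)
  for (i in seq_along(sizes)) {
    fit <- fitdist(Sample1.list[[i]], "lnorm", method = m)
    qqcomp(list(fit), legendtext = paste0("lnorm (", m, ", n=", sizes[i], ")"))
  }
}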

par(mfrow = c(2, 5))
mle.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=25)")
mle.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=50)")
mle.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=100)")
mle.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=250)")
mle.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mle")
ppcomp(list(mle.fln), legendtext = "lnorm (mle, n=500)")
mme.fln <- fitdist(Sample1.list[[1]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=25)")
mme.fln <- fitdist(Sample1.list[[2]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=50)")
mme.fln <- fitdist(Sample1.list[[3]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=100)")
mme.fln <- fitdist(Sample1.list[[4]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=250)")
mme.fln <- fitdist(Sample1.list[[5]], "lnorm", method="mme")
ppcomp(list(mme.fln), legendtext = "lnorm (mme, n=500)")

Figure 2.4: P-P plots (by assuming lnorm) for MLE (top) and MOM (bottom)

#Here, in order to use fitdist for exp, we re-scale the data sample (only for this purpose)
set.seed(1234) #Re-generate Sample1 and save the re-scaled samples in a list
n=c(25, 50, 100, 250, 500)
lnorm.meanlog=10
lnorm.sdlog=1
Sample1.list.rescaled=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
Sample1=rlnorm(n[i], meanlog=lnorm.meanlog, sdlog=lnorm.sdlog)
Sample1.list.rescaled[[i]]=Sample1/1000
}
par(mfrow = c(2, 5))
mle.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=25)")
mle.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=50)")
mle.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=100)")
mle.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=250)")
mle.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mle")
qqcomp(list(mle.fe), legendtext = "exp (mle, n=500)")
mme.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=25)")
mme.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=50)")
mme.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=100)")
mme.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=250)")
mme.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mme")
qqcomp(list(mme.fe), legendtext = "exp (mme, n=500)")

Figure 2.5: Q-Q plots (by assuming exp) for MLE (top) and MOM (bottom)

par(mfrow = c(2, 5))
mle.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=25)")
mle.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=50)")
mle.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=100)")
mle.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=250)")
mle.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mle")
ppcomp(list(mle.fe), legendtext = "exp (mle, n=500)")
mme.fe <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=25)")
mme.fe <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=50)")
mme.fe <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=100)")
mme.fe <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=250)")
mme.fe <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mme")
ppcomp(list(mme.fe), legendtext = "exp (mme, n=500)")

Figure 2.6: P-P plots (by assuming exp) for MLE (top) and MOM (bottom)

Quantiles play an important role in fitting the best possible distribution. In addition, the real data analysis from Section 3 requires estimating quantiles at levels \(\{50\%, 75\%, 95\%\}\). Therefore, we next estimate the same quantiles so that we understand the estimation error in a controlled setting where the data are simulated.

The following quantile-related computations are tabulated: i) `True’ quantile values and ii) MLE and MOM quantile estimates for the various sample sizes.

Quantile.true=qlnorm(c(0.5,0.75,0.95), meanlog=lnorm.meanlog, sdlog=lnorm.sdlog, lower.tail = TRUE, log.p = FALSE)
Quantile.true=matrix(Quantile.true,1,3,dimnames=list(c("Quantile value"),c("(50%)","(75%)","(95%)")))
kable(Quantile.true, digits = 2, format.args = list(big.mark = ","), caption="Quantile values of the true LogNormal distribution at levels 50%, 75% and 95%")
Table 2.9: Quantile values of the true LogNormal distribution at levels 50%, 75% and 95%
(50%) (75%) (95%)
Quantile value 22,026.47 43,238.64 114,102.6
suppressWarnings(library(stats))
Table1.1=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","weibull","llogis","invwei","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n25 <- fitdist(Sample1.list[[1]], "lnorm", method="mle")
Quantile.lnorm.mle.n25=quantile(mle.fln.n25, probs = c(0.5, 0.75, 0.95))
mme.fln.n25 <- fitdist(Sample1.list[[1]], "lnorm", method="mme")
Quantile.lnorm.mme.n25=quantile(mme.fln.n25, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n25 <- fitdist(Sample1.list.rescaled[[1]], "exp", method="mle") #Use re-scaled value
Quantile.exp.mle.n25=quantile(mle.fe.n25, probs = c(0.5, 0.75, 0.95))
mme.fe.n25 <- fitdist(Sample1.list[[1]], "exp", method="mme")
Quantile.exp.mme.n25=quantile(mme.fe.n25, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n25 <- fitdist(Sample1.list[[1]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n25=quantile(mle.fg.n25, probs = c(0.5, 0.75, 0.95))
mme.fg.n25 <- fitdist(Sample1.list[[1]], "gamma", method="mme")
Quantile.gamma.mme.n25=quantile(mme.fg.n25, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n25 <- fitdist(Sample1.list[[1]], "weibull", method="mle")
Quantile.weibull.mle.n25=quantile(mle.fw.n25, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n25 <- fitdist(Sample1.list[[1]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n25=quantile(mme.fw.n25, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n25 <- fitdist(Sample1.list[[1]], "llogis", method="mle")
Quantile.llogis.mle.n25=quantile(mle.fl.n25, probs = c(0.5, 0.75, 0.95))
#mme.fl.n25 <- fitdist(Sample1.list[[1]], "llogis", method="mme", order=c(1, 2), memp=memp)
#Quantile.llogis.mme.n25=quantile(mme.fl.n25, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n25 <- fitdist(Sample1.list[[1]], "invweibull", method="mle")
Quantile.invweibull.mle.n25=quantile(mle.fi.n25, probs = c(0.5, 0.75, 0.95))
mme.fi.n25 <- fitdist(Sample1.list[[1]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n25=quantile(mme.fi.n25, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n25 <- fitdist(Sample1.list[[1]], "pareto", method="mle")
Quantile.pareto.mle.n25=quantile(mle.fp.n25, probs = c(0.5, 0.75, 0.95))
mme.fp.n25 <- fitdist(Sample1.list[[1]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n25=quantile(mme.fp.n25, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n25 <- fitdist(Sample1.list.rescaled[[1]], "trgamma", method="mle",optim.method="SANN") #Use re-scaled
Quantile.trgamma.mle.n25=quantile(mle.ft.n25, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n25 <- fitdist(Sample1.list[[1]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n25=quantile(mme.ft.n25, probs = c(0.5, 0.75, 0.95))
#Table
Table1.1[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n25$quantiles)-Quantile.true)
Table1.1[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n25$quantiles)-Quantile.true)
Table1.1[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n25$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.1[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n25$quantiles)-Quantile.true)
Table1.1[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n25$quantiles)-Quantile.true)
Table1.1[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n25$quantiles)-Quantile.true)
Table1.1[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n25$quantiles)-Quantile.true)
Table1.1[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n25$quantiles)-Quantile.true)
Table1.1[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n25$quantiles)-Quantile.true)
#Table1.1[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n25$quantiles)-Quantile.true)
Table1.1[5,c(1,3,5)]=c(NA,NA,NA)
Table1.1[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n25$quantiles)-Quantile.true)
#Table1.1[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n25$quantiles)-Quantile.true)
Table1.1[6,c(1,3,5)]=c(NA,NA,NA)
Table1.1[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n25$quantiles)-Quantile.true)
Table1.1[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n25$quantiles)-Quantile.true)
Table1.1[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n25$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.1[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n25$quantiles)-Quantile.true)
#print table
kable(Table1.1, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=25")
Table 2.10: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=25
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 6,609.535 4,730.688 10,205.622 11,397.029 15,230.399 3.748657e+04
exp 1,796.157 1,796.158 2,778.020 2,778.021 26,668.697 2.666870e+04
gamma 11,845.148 1,114.471 6,734.219 2,821.699 8,550.559 2.933522e+04
weibull 3,477.938 1,990.717 1,563.530 1,476.883 5,312.393 1.961538e+04
llogis NA 4,746.214 NA 5,934,882.525 NA 1.102743e+11
invwei NA 18,847.107 NA 29,245.701 NA 1.416476e+05
Pareto 6,394.976 6,061.714 8,275.686 8,230.097 14,731.784 1.948882e+04
trgamma NA 3,392.109 NA 8,640.248 NA 3.673115e+04
Table1.2=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","Weibull","llogis","invWeibull","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n50 <- fitdist(Sample1.list[[2]], "lnorm", method="mle")
Quantile.lnorm.mle.n50=quantile(mle.fln.n50, probs = c(0.5, 0.75, 0.95))
mme.fln.n50 <- fitdist(Sample1.list[[2]], "lnorm", method="mme")
Quantile.lnorm.mme.n50=quantile(mme.fln.n50, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n50 <- fitdist(Sample1.list.rescaled[[2]], "exp", method="mle") #Use re-scaled
Quantile.exp.mle.n50=quantile(mle.fe.n50, probs = c(0.5, 0.75, 0.95))
mme.fe.n50 <- fitdist(Sample1.list[[2]], "exp", method="mme")
Quantile.exp.mme.n50=quantile(mme.fe.n50, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n50 <- fitdist(Sample1.list[[2]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n50=quantile(mle.fg.n50, probs = c(0.5, 0.75, 0.95))
mme.fg.n50 <- fitdist(Sample1.list[[2]], "gamma", method="mme")
Quantile.gamma.mme.n50=quantile(mme.fg.n50, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n50 <- fitdist(Sample1.list[[2]], "weibull", method="mle")
Quantile.weibull.mle.n50=quantile(mle.fw.n50, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n50 <- fitdist(Sample1.list[[2]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n50=quantile(mme.fw.n50, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n50 <- fitdist(Sample1.list[[2]], "llogis", method="mle")
Quantile.llogis.mle.n50=quantile(mle.fl.n50, probs = c(0.5, 0.75, 0.95))
mme.fl.n50 <- fitdist(Sample1.list[[2]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n50=quantile(mme.fl.n50, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n50 <- fitdist(Sample1.list[[2]], "invweibull", method="mle")
Quantile.invweibull.mle.n50=quantile(mle.fi.n50, probs = c(0.5, 0.75, 0.95))
mme.fi.n50 <- fitdist(Sample1.list[[2]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n50=quantile(mme.fi.n50, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n50 <- fitdist(Sample1.list[[2]], "pareto", method="mle")
Quantile.pareto.mle.n50=quantile(mle.fp.n50, probs = c(0.5, 0.75, 0.95))
mme.fp.n50 <- fitdist(Sample1.list[[2]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n50=quantile(mme.fp.n50, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n50 <- fitdist(Sample1.list.rescaled[[2]], "trgamma", method="mle",optim.method="SANN") #Use re-scaled
Quantile.trgamma.mle.n50=quantile(mle.ft.n50, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n50 <- fitdist(Sample1.list[[2]], "trgamma", method="mme",optim.method="SANN", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n50=quantile(mme.ft.n50, probs = c(0.5, 0.75, 0.95))
#Table
Table1.2[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n50$quantiles)-Quantile.true)
Table1.2[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n50$quantiles)-Quantile.true)
Table1.2[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n50$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.2[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n50$quantiles)-Quantile.true)
Table1.2[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n50$quantiles)-Quantile.true)
Table1.2[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n50$quantiles)-Quantile.true)
Table1.2[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n50$quantiles)-Quantile.true)
Table1.2[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n50$quantiles)-Quantile.true)
Table1.2[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n50$quantiles)-Quantile.true)
#Table1.2[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n50$quantiles)-Quantile.true)
Table1.2[5,c(1,3,5)]=c(NA,NA,NA)
Table1.2[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n50$quantiles)-Quantile.true)
#Table1.2[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n50$quantiles)-Quantile.true)
Table1.2[6,c(1,3,5)]=c(NA,NA,NA)
Table1.2[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n50$quantiles)-Quantile.true)
Table1.2[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n50$quantiles)-Quantile.true)
Table1.2[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n50$quantiles)*1000-Quantile.true) #Use re-scaled
Table1.2[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n50$quantiles)-Quantile.true)
#print table
kable(Table1.2, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=50")
Table 2.11: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=50
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 2,631.579 4,465.168 3,093.259 6,877.322 232.280 10,498.602
exp 2,024.488 2,024.488 4,863.270 4,863.270 10,156.143 10,156.143
gamma 7,596.424 541.577 1,872.793 4,812.937 23,784.094 4,591.929
Weibull 1,819.864 1,791.306 6,273.690 2,575.692 19,988.347 463.243
llogis NA 6,339.707 NA 12,496.572 NA 18,895.216
invWeibull NA 18,847.107 NA 29,245.701 NA 141,647.586
Pareto 2,773.589 5,252.152 725.469 4,062.445 2,831.833 10,469.517
trgamma NA 3,351.292 NA 3,175.781 NA 7,042.718
Table1.3=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","Weibull","llogis","invWeibull","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n100 <- fitdist(Sample1.list[[3]], "lnorm", method="mle")
Quantile.lnorm.mle.n100=quantile(mle.fln.n100, probs = c(0.5, 0.75, 0.95))
mme.fln.n100 <- fitdist(Sample1.list[[3]], "lnorm", method="mme")
Quantile.lnorm.mme.n100=quantile(mme.fln.n100, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n100 <- fitdist(Sample1.list.rescaled[[3]], "exp", method="mle")
Quantile.exp.mle.n100=quantile(mle.fe.n100, probs = c(0.5, 0.75, 0.95))
mme.fe.n100 <- fitdist(Sample1.list[[3]], "exp", method="mme")
Quantile.exp.mme.n100=quantile(mme.fe.n100, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n100 <- fitdist(Sample1.list[[3]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n100=quantile(mle.fg.n100, probs = c(0.5, 0.75, 0.95))
mme.fg.n100 <- fitdist(Sample1.list[[3]], "gamma", method="mme")
Quantile.gamma.mme.n100=quantile(mme.fg.n100, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n100 <- fitdist(Sample1.list[[3]], "weibull", method="mle")
Quantile.weibull.mle.n100=quantile(mle.fw.n100, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n100 <- fitdist(Sample1.list[[3]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n100=quantile(mme.fw.n100, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n100 <- fitdist(Sample1.list[[3]], "llogis", method="mle")
Quantile.llogis.mle.n100=quantile(mle.fl.n100, probs = c(0.5, 0.75, 0.95))
mme.fl.n100 <- fitdist(Sample1.list[[3]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n100=quantile(mme.fl.n100, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n100 <- fitdist(Sample1.list[[3]], "invweibull", method="mle")
Quantile.invweibull.mle.n100=quantile(mle.fi.n100, probs = c(0.5, 0.75, 0.95))
mme.fi.n100 <- fitdist(Sample1.list[[3]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n100=quantile(mme.fi.n100, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n100 <- fitdist(Sample1.list[[3]], "pareto", method="mle")
Quantile.pareto.mle.n100=quantile(mle.fp.n100, probs = c(0.5, 0.75, 0.95))
mme.fp.n100 <- fitdist(Sample1.list[[3]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n100=quantile(mme.fp.n100, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n100 <- fitdist(Sample1.list.rescaled[[3]], "trgamma", method="mle",optim.method="SANN")
Quantile.trgamma.mle.n100=quantile(mle.ft.n100, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n100 <- fitdist(Sample1.list[[3]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n100=quantile(mme.ft.n100, probs = c(0.5, 0.75, 0.95))
#Table
Table1.3[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n100$quantiles)-Quantile.true)
Table1.3[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n100$quantiles)-Quantile.true)
Table1.3[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n100$quantiles)*1000-Quantile.true)
Table1.3[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n100$quantiles)-Quantile.true)
Table1.3[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n100$quantiles)-Quantile.true)
Table1.3[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n100$quantiles)-Quantile.true)
Table1.3[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n100$quantiles)-Quantile.true)
Table1.3[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n100$quantiles)-Quantile.true)
Table1.3[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n100$quantiles)-Quantile.true)
#Table1.3[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n100$quantiles)-Quantile.true)
Table1.3[5,c(1,3,5)]=c(NA,NA,NA)
Table1.3[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n100$quantiles)-Quantile.true)
#Table1.3[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n100$quantiles)-Quantile.true)
Table1.3[6,c(1,3,5)]=c(NA,NA,NA)
Table1.3[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n100$quantiles)-Quantile.true)
Table1.3[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n100$quantiles)-Quantile.true)
Table1.3[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n100$quantiles)*1000-Quantile.true)
Table1.3[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n100$quantiles)-Quantile.true)
#print table
kable(Table1.3, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=100")
Table 2.12: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=100
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 2,114.735 1,176.609 315.869 1,099.864 15,871.023 1.467848e+04
exp 2,053.784 2,053.784 4,921.863 4,921.863 10,029.525 1.002952e+04
gamma 1,383.222 4,788.592 4,923.187 4,528.691 7,471.584 2.104357e+04
Weibull 6,160.097 4,236.670 8,900.913 5,545.530 10,797.083 1.699704e+04
llogis NA 1,346.923 NA 6,484,788.997 NA 8.404552e+10
invWeibull NA 18,847.107 NA 29,245.701 NA 1.416476e+05
Pareto 1,526.024 1,547.151 4,412.807 4,447.684 8,316.650 8.280262e+03
trgamma NA 2,812.880 NA 1,555.266 NA 1.756407e+04
Table1.4=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","weibull","llogis","invwei","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n250 <- fitdist(Sample1.list[[4]], "lnorm", method="mle")
Quantile.lnorm.mle.n250=quantile(mle.fln.n250, probs = c(0.5, 0.75, 0.95))
mme.fln.n250 <- fitdist(Sample1.list[[4]], "lnorm", method="mme")
Quantile.lnorm.mme.n250=quantile(mme.fln.n250, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n250 <- fitdist(Sample1.list.rescaled[[4]], "exp", method="mle")
Quantile.exp.mle.n250=quantile(mle.fe.n250, probs = c(0.5, 0.75, 0.95))
mme.fe.n250 <- fitdist(Sample1.list[[4]], "exp", method="mme")
Quantile.exp.mme.n250=quantile(mme.fe.n250, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n250 <- fitdist(Sample1.list[[4]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n250=quantile(mle.fg.n250, probs = c(0.5, 0.75, 0.95))
mme.fg.n250 <- fitdist(Sample1.list[[4]], "gamma", method="mme")
Quantile.gamma.mme.n250=quantile(mme.fg.n250, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n250 <- fitdist(Sample1.list[[4]], "weibull", method="mle")
Quantile.weibull.mle.n250=quantile(mle.fw.n250, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n250 <- fitdist(Sample1.list[[4]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n250=quantile(mme.fw.n250, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n250 <- fitdist(Sample1.list[[4]], "llogis", method="mle")
Quantile.llogis.mle.n250=quantile(mle.fl.n250, probs = c(0.5, 0.75, 0.95))
mme.fl.n250 <- fitdist(Sample1.list[[4]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n250=quantile(mme.fl.n250, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n250 <- fitdist(Sample1.list[[4]], "invweibull", method="mle")
Quantile.invweibull.mle.n250=quantile(mle.fi.n250, probs = c(0.5, 0.75, 0.95))
mme.fi.n250 <- fitdist(Sample1.list[[4]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n250=quantile(mme.fi.n250, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n250 <- fitdist(Sample1.list[[4]], "pareto", method="mle")
Quantile.pareto.mle.n250=quantile(mle.fp.n250, probs = c(0.5, 0.75, 0.95))
mme.fp.n250 <- fitdist(Sample1.list[[4]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n250=quantile(mme.fp.n250, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n250 <- fitdist(Sample1.list.rescaled[[4]], "trgamma", method="mle",optim.method="SANN")
Quantile.trgamma.mle.n250=quantile(mle.ft.n250, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n250 <- fitdist(Sample1.list[[4]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n250=quantile(mme.ft.n250, probs = c(0.5, 0.75, 0.95))
#Table
Table1.4[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n250$quantiles)-Quantile.true)
Table1.4[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n250$quantiles)-Quantile.true)
Table1.4[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n250$quantiles)*1000-Quantile.true)
Table1.4[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n250$quantiles)-Quantile.true)
Table1.4[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n250$quantiles)-Quantile.true)
Table1.4[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n250$quantiles)-Quantile.true)
Table1.4[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n250$quantiles)-Quantile.true)
Table1.4[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n250$quantiles)-Quantile.true)
Table1.4[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n250$quantiles)-Quantile.true)
#Table1.4[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n250$quantiles)-Quantile.true)
Table1.4[5,c(1,3,5)]=c(NA,NA,NA)
Table1.4[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n250$quantiles)-Quantile.true)
#Table1.4[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n250$quantiles)-Quantile.true)
Table1.4[6,c(1,3,5)]=c(NA,NA,NA)
Table1.4[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n250$quantiles)-Quantile.true)
Table1.4[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n250$quantiles)-Quantile.true)
Table1.4[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n250$quantiles)*1000-Quantile.true)
Table1.4[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n250$quantiles)-Quantile.true)
#print table
kable(Table1.4, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=250")
Table 2.13: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=250
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 891.759 788.462 2,152.874 2,589.015 7,225.328 10,897.672
exp 4,514.943 4,514.943 9,844.182 9,844.181 607.428 607.427
gamma 2,936.292 5,506.932 8,254.973 9,777.674 27,308.961 3,275.622
weibull 5,478.712 4,104.661 15,668.511 9,677.747 23,256.793 1,839.958
llogis NA 1,148.783 NA 224.350 NA 10,911.390
invwei NA 18,847.107 NA 29,245.701 NA 141,647.586
Pareto 319.000 1,326.292 5,149.616 6,229.499 12,601.131 8,545.292
trgamma NA 2,392.174 NA 4,916.701 NA 1,890.435
Table1.5=matrix(0,8,6,dimnames=list(c("lnorm","exp","gamma","Weibull","llogis","invWeibull","Pareto","trgamma"),c("MOM(50%)","MLE(50%)","MOM(75%)","MLE(75%)","MOM(95%)","MLE(95%)")))
#lnorm
mle.fln.n500 <- fitdist(Sample1.list[[5]], "lnorm", method="mle")
Quantile.lnorm.mle.n500=quantile(mle.fln.n500, probs = c(0.5, 0.75, 0.95))
mme.fln.n500 <- fitdist(Sample1.list[[5]], "lnorm", method="mme")
Quantile.lnorm.mme.n500=quantile(mme.fln.n500, probs = c(0.5, 0.75, 0.95))
#exp
mle.fe.n500 <- fitdist(Sample1.list.rescaled[[5]], "exp", method="mle")
Quantile.exp.mle.n500=quantile(mle.fe.n500, probs = c(0.5, 0.75, 0.95))
mme.fe.n500 <- fitdist(Sample1.list[[5]], "exp", method="mme")
Quantile.exp.mme.n500=quantile(mme.fe.n500, probs = c(0.5, 0.75, 0.95))
#gamma
mle.fg.n500 <- fitdist(Sample1.list[[5]], "gamma", method="mle",lower=c(0, 0), start=list(shape =1, rate=1))
Quantile.gamma.mle.n500=quantile(mle.fg.n500, probs = c(0.5, 0.75, 0.95))
mme.fg.n500 <- fitdist(Sample1.list[[5]], "gamma", method="mme")
Quantile.gamma.mme.n500=quantile(mme.fg.n500, probs = c(0.5, 0.75, 0.95))
#weibull
mle.fw.n500 <- fitdist(Sample1.list[[5]], "weibull", method="mle")
Quantile.weibull.mle.n500=quantile(mle.fw.n500, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.fw.n500 <- fitdist(Sample1.list[[5]], "weibull", method="mme", order=c(1, 2), memp=memp)
Quantile.weibull.mme.n500=quantile(mme.fw.n500, probs = c(0.5, 0.75, 0.95))
#llogis
mle.fl.n500 <- fitdist(Sample1.list[[5]], "llogis", method="mle")
Quantile.llogis.mle.n500=quantile(mle.fl.n500, probs = c(0.5, 0.75, 0.95))
mme.fl.n500 <- fitdist(Sample1.list[[5]], "llogis", method="mme", optim.method="Nelder-Mead", order=c(1, 2), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): function cannot be evaluated at initial parameters>
Quantile.llogis.mme.n500=quantile(mme.fl.n500, probs = c(0.5, 0.75, 0.95))
#invweibull
mle.fi.n500 <- fitdist(Sample1.list[[5]], "invweibull", method="mle")
Quantile.invweibull.mle.n500=quantile(mle.fi.n500, probs = c(0.5, 0.75, 0.95))
mme.fi.n500 <- fitdist(Sample1.list[[5]], "invweibull", method="mme", order=c(1, 2), memp=memp)
Quantile.invweibull.mme.n500=quantile(mme.fi.n500, probs = c(0.5, 0.75, 0.95))
#pareto
mle.fp.n500 <- fitdist(Sample1.list[[5]], "pareto", method="mle")
Quantile.pareto.mle.n500=quantile(mle.fp.n500, probs = c(0.5, 0.75, 0.95))
mme.fp.n500 <- fitdist(Sample1.list[[5]], "pareto", method="mme", order=c(1, 2), memp=memp)
Quantile.pareto.mme.n500=quantile(mme.fp.n500, probs = c(0.5, 0.75, 0.95))
#trgamma
mle.ft.n500 <- fitdist(Sample1.list.rescaled[[5]], "trgamma", method="mle",optim.method="SANN")
Quantile.trgamma.mle.n500=quantile(mle.ft.n500, probs = c(0.5, 0.75, 0.95))
memp  <-  function(x, order) mean(x^order)
mme.ft.n500 <- fitdist(Sample1.list[[5]], "trgamma", method="mme", order=c(1, 2, 3), memp=memp)
<simpleError in optim(par = vstart, fn = fnobj, fix.arg = fix.arg, obs = data,     gr = gradient, mdistnam = mdistname, memp = memp, hessian = TRUE,     method = opt.meth, lower = lower, upper = upper, ...): non-finite finite-difference value [3]>
Quantile.trgamma.mme.n500=quantile(mme.ft.n500, probs = c(0.5, 0.75, 0.95))
#Table
Table1.5[1,c(2,4,6)]=abs(as.numeric(Quantile.lnorm.mle.n500$quantiles)-Quantile.true)
Table1.5[1,c(1,3,5)]=abs(as.numeric(Quantile.lnorm.mme.n500$quantiles)-Quantile.true)
Table1.5[2,c(2,4,6)]=abs(as.numeric(Quantile.exp.mle.n500$quantiles)*1000-Quantile.true)
Table1.5[2,c(1,3,5)]=abs(as.numeric(Quantile.exp.mme.n500$quantiles)-Quantile.true)
Table1.5[3,c(2,4,6)]=abs(as.numeric(Quantile.gamma.mle.n500$quantiles)-Quantile.true)
Table1.5[3,c(1,3,5)]=abs(as.numeric(Quantile.gamma.mme.n500$quantiles)-Quantile.true)
Table1.5[4,c(2,4,6)]=abs(as.numeric(Quantile.weibull.mle.n500$quantiles)-Quantile.true)
Table1.5[4,c(1,3,5)]=abs(as.numeric(Quantile.weibull.mme.n500$quantiles)-Quantile.true)
Table1.5[5,c(2,4,6)]=abs(as.numeric(Quantile.llogis.mle.n500$quantiles)-Quantile.true)
#Table1.5[5,c(1,3,5)]=abs(as.numeric(Quantile.llogis.mme.n500$quantiles)-Quantile.true)
Table1.5[5,c(1,3,5)]=c(NA,NA,NA)
Table1.5[6,c(2,4,6)]=abs(as.numeric(Quantile.invweibull.mle.n500$quantiles)-Quantile.true)
#Table1.5[6,c(1,3,5)]=abs(as.numeric(Quantile.invweibull.mme.n500$quantiles)-Quantile.true)
Table1.5[6,c(1,3,5)]=c(NA,NA,NA)
Table1.5[7,c(2,4,6)]=abs(as.numeric(Quantile.pareto.mle.n500$quantiles)-Quantile.true)
Table1.5[7,c(1,3,5)]=abs(as.numeric(Quantile.pareto.mme.n500$quantiles)-Quantile.true)
Table1.5[8,c(2,4,6)]=abs(as.numeric(Quantile.trgamma.mle.n500$quantiles)*1000-Quantile.true)
Table1.5[8,c(1,3,5)]=abs(as.numeric(Quantile.trgamma.mme.n500$quantiles)-Quantile.true)
#print table
kable(Table1.5, digits = 3, format.args = list(big.mark = ","), caption="MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=500")
Table 2.14: MLE and MOM quantile estimation errors at levels 50%, 75% and 95% for the LogNormal sample of size n=500
MOM(50%) MLE(50%) MOM(75%) MLE(75%) MOM(95%) MLE(95%)
lnorm 1,025.327 989.887 2,462.778 2,519.477 8,184.871 8,799.418
exp 1,589.495 1,589.495 3,993.285 3,993.284 12,036.150 12,036.151
gamma 3,830.746 3,012.984 3,063.139 3,854.470 7,568.605 17,698.725
Weibull 3,026.117 1,919.941 8,980.098 4,115.913 4,044.073 13,048.882
llogis NA 1,058.191 NA 4,556.497 NA 5,874.291
invWeibull NA 18,847.107 NA 29,245.701 NA 141,647.586
Pareto 1,645.910 642.249 449.642 1,537.962 2,558.748 6,185.344
trgamma NA 362.050 NA 351.373 NA 14,517.625
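
The five blocks above repeat the same computations for each sample size; for reference, here is a small helper function of our own (quantile.error, not used elsewhere in this lab) that returns the absolute quantile estimation errors for one family, one sample and one estimation method:

quantile.error <- function(x, family, method, probs = c(0.5, 0.75, 0.95), ...) {
  fit <- fitdist(x, family, method = method, ...)                        #fit the chosen family
  q.fit <- as.numeric(unlist(quantile(fit, probs = probs)$quantiles))    #fitted quantiles
  abs(q.fit - as.numeric(Quantile.true))                                 #absolute errors vs the true lnorm quantiles
}
quantile.error(Sample1.list[[3]], "lnorm", "mle")  #MLE entries of the lnorm row in Table 2.12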

The Akaike Information Criterion (AIC) is a useful tool for model selection that applies only to MLE and not to MOM. The lowest AIC value indicates the best MLE-based model choice, which is also the convention used in R. From now on, the best distribution chosen through AIC is denoted as Dist00.

Since the MLE for the exp and trgamma families required re-scaling the data, these two distributions are excluded from the AIC-based comparison, though something more clever could have been done.
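
Although the code that assembles Table.AIC is not echoed here, each entry is simply the AIC of the corresponding ML fit, which fitdist returns directly; a minimal sketch of our own for two entries of the \(n=100\) column:

fit.lnorm.n100  <- fitdist(Sample1.list[[3]], "lnorm",  method = "mle")
fit.llogis.n100 <- fitdist(Sample1.list[[3]], "llogis", method = "mle")
c(lnorm = fit.lnorm.n100$aic, llogis = fit.llogis.n100$aic)  #the family with the smallest AIC becomes Dist00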

#print table
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption="AIC summary for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500")
Table 2.15: AIC summary for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500
(MLE,n=25) (MLE,n=50) (MLE,n=100) (MLE,n=250) (MLE,n=500)
lnorm 557.858 1,130.847 2,273.682 5,747.808 11,355.90
exp NA NA NA NA NA
gamma 567.952 1,148.695 2,289.276 5,779.020 11,432.33
Weibull 567.856 1,146.482 2,292.847 5,780.328 11,439.93
llogis 645.173 1,131.705 2,620.043 5,742.489 11,362.76
invWeibull 601.066 1,204.042 2,433.059 6,394.136 12,438.07
Pareto 564.503 1,140.397 2,295.020 5,768.745 11,424.99
trgamma NA NA NA NA NA
#print table
kable(Table.AIC.min, caption="Best (or Dist00) distribution chosen through AIC for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500")
Table 2.16: Best (or Dist00) distribution chosen through AIC for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500
(MLE,n=25) (MLE,n=50) (MLE,n=100) (MLE,n=250) (MLE,n=500)
Min (AIC) lnorm lnorm lnorm llogis lnorm

It looks like AIC makes the right choice, except for the LogNormal sample of size \(n=250\), where llogis only marginally surpasses the `true’ choice, namely lnorm.

Point estimates – such as our previous quantile estimates – are sensitive to sampling error, and thus bootstrapping is often used to reduce the uncertainty of point estimation. That is, we draw \(n_b=25\) bootstrap samples with replacement from the corresponding sample and fit only the parametric family chosen before as the best AIC distribution, which means that we are:

  1. Sampling with replacement \(n_b=25\) samples of size \(n=25\) from our initial LogNormal sample of size \(n=25\) and fitting all \(n_b=25\) samples via lnorm MLE;
  2. Sampling with replacement \(n_b=25\) samples of size \(n=50\) from our initial LogNormal sample of size \(n=50\) and fitting all \(n_b=25\) samples via lnorm MLE;
  3. Sampling with replacement \(n_b=25\) samples of size \(n=100\) from our initial LogNormal sample of size \(n=100\) and fitting all \(n_b=25\) samples via lnorm MLE;
  4. Sampling with replacement \(n_b=25\) samples of size \(n=250\) from our initial LogNormal sample of size \(n=250\) and fitting all \(n_b=25\) samples via llogis MLE;
  5. Sampling with replacement \(n_b=25\) samples of size \(n=500\) from our initial LogNormal sample of size \(n=500\) and fitting all \(n_b=25\) samples via lnorm MLE.

Recall that AIC identified lnorm as the best choice for our initial samples of sizes \(n\in\{25,50,100,500\}\), while llogis was the best choice for our initial sample of size \(n=250\).

The bootstrap sampling for steps 1 to 5 defined above is now given.

#fix the seed inside the loop -- e.g. set.seed(567*b) -- so the bootstrap sampling generates the same output every time the code is run
#Sample1: the case of n=25, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n25=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[1]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n25[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n25=colMeans(MLE.lnorm.boot.n25)
#Sample1: the case of n=50, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n50=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[2]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n50[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n50=colMeans(MLE.lnorm.boot.n50)
#Sample1: the case of n=100, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n100=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[3]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n100[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n100=colMeans(MLE.lnorm.boot.n100)
#Sample1: the case of n=250, **llogis** based on min(AIC), samples with replacement=25
MLE.llogis.boot.n250=matrix(0,25,2,dimnames=list(c(1:25),c("shape","scale")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[4]], replace = TRUE)
MLE.llogis=mledist(Sample1.boot, "llogis")
MLE.llogis.boot.n250[b,]=MLE.llogis$estimate
}
MLE.llogis.boot.mean.n250=colMeans(MLE.llogis.boot.n250)
#Sample1: the case of n=500, **lnorm** based on min(AIC), samples with replacement=25
MLE.lnorm.boot.n500=matrix(0,25,2,dimnames=list(c(1:25),c("meanlog","sdlog")))
for (b in 1:25) {
set.seed(567*b)#!!!set the seed inside the loop!!!
Sample1.boot=sample(Sample1.list[[5]], replace = TRUE)
MLE.lnorm=mledist(Sample1.boot, "lnorm")
MLE.lnorm.boot.n500[b,]=MLE.lnorm$estimate
}
MLE.lnorm.boot.mean.n500=colMeans(MLE.lnorm.boot.n500)
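
As an aside, the fitdistrplus package can automate this nonparametric bootstrap via bootdist(); a minimal sketch of our own, shown for the \(n=100\) sample (the resamples differ from those above because bootdist draws its own resamples):

boot.fln.n100 <- bootdist(fitdist(Sample1.list[[3]], "lnorm", method = "mle"),
                          bootmethod = "nonparam", niter = 25)  #25 nonparametric bootstrap resamples
colMeans(boot.fln.n100$estim)  #comparable in spirit to MLE.lnorm.boot.mean.n100 above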

The sets of \(n_b=25\) MLE estimates are now averaged to obtain the so-called bootstrap MLE estimates, which are tabulated below.

kable(MLE.lnorm.boot.mean.n25, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=25)")
Table 2.17: Bootstrap MLE estimates assuming lnorm (n=25)
x
meanlog 9.727
sdlog 0.842
kable(MLE.lnorm.boot.mean.n50, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=50)")
Table 2.18: Bootstrap MLE estimates assuming lnorm (n=50)
x
meanlog 9.777
sdlog 1.076
kable(MLE.lnorm.boot.mean.n100, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=100)")
Table 2.19: Bootstrap MLE estimates assuming lnorm (n=100)
x
meanlog 10.104
sdlog 0.883
kable(MLE.llogis.boot.mean.n250, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming llogis (n=250)")
Table 2.20: Bootstrap MLE estimates assuming llogis (n=250)
x
shape 1.569
scale 23,511.935
kable(MLE.lnorm.boot.mean.n500, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming lnorm (n=500)")
Table 2.21: Bootstrap MLE estimates assuming lnorm (n=500)
x
meanlog 9.959
sdlog 0.974

We now compute the bootstrap quantile estimates for the parametric model agreed upon as the ‘best’ option. Recall that Dist00 is lnorm for all \(n=\{25, 50, 100, 500\}\) and llogis for \(n=250\), while the true distribution is lnorm with \((\mu=10,\sigma=1)\).

We report the following:

\[\begin{eqnarray*} &&(1)\;\text{Compute} \;q\big(xx; \hat{\theta}_{MLE}(n,i), Dist00\big) \quad \text{for all}\; i=1,\ldots, 25\quad\text{and}\; n=\{25, 50, 100, 250, 500\},\\ &&(2)\;\text{Compute} \;\bar{\bar{q}}_{boot} (xx;n) =\frac{1}{25} \sum_{i=1}^{25} q\big(xx; \hat{\theta}_{MLE}(n,i), Dist00\big),\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}, \end{eqnarray*}\] where (1) are the quantile estimates for the (individual) bootstrap samples, (2) are the quantile bootstrap estimates, and (3) evaluates the estimation errors of the estimator in (2). All computations are reported for the three quantile levels \(xx=\{50\%, 75\%, 95\%\}\).

The background computations – for (1), (2) and (3) – are now given.

#(best n=25) lnorm
Quantile.best.mle.n25.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n25.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n25[i,1], sdlog=MLE.lnorm.boot.n25[i,2])
}
Quantile.best.mle.n25.barbar=colMeans(Quantile.best.mle.n25.each)
#(best n=50) lnorm
Quantile.best.mle.n50.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n50.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n50[i,1], sdlog=MLE.lnorm.boot.n50[i,2])
}
Quantile.best.mle.n50.barbar=colMeans(Quantile.best.mle.n50.each)
#(best n=100) lnorm
Quantile.best.mle.n100.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n100.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n100[i,1], sdlog=MLE.lnorm.boot.n100[i,2])
}
Quantile.best.mle.n100.barbar=colMeans(Quantile.best.mle.n100.each)
#(best n=250) llogis
Quantile.best.mle.n250.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n250.each[i,]=qllogis(c(0.5,0.75,0.95), shape=MLE.llogis.boot.n250[i,1], scale=MLE.llogis.boot.n250[i,2])
}
Quantile.best.mle.n250.barbar=colMeans(Quantile.best.mle.n250.each)
#(best n=500) lnorm
Quantile.best.mle.n500.each=matrix(0,25,3,dimnames=list(NULL,c("q_boot(50%)","q_boot(75%)","q_boot(95%)")))
for (i in 1:25) {
Quantile.best.mle.n500.each[i,]=qlnorm(c(0.5,0.75,0.95), meanlog=MLE.lnorm.boot.n500[i,1], sdlog=MLE.lnorm.boot.n500[i,2])
}
Quantile.best.mle.n500.barbar=colMeans(Quantile.best.mle.n500.each)
# Table2a
Table2a=matrix(0,5,3,dimnames=list(c("n=25 (lnorm)","n=50 (lnorm)","n=100 (lnorm)","n=250 (llogis)","n=500 (lnorm)"),c("boot(50%)","boot(75%)","boot(95%)")))
Table2a[1,]=abs(Quantile.best.mle.n25.barbar-Quantile.true) #lnorm
Table2a[2,]=abs(Quantile.best.mle.n50.barbar-Quantile.true) #lnorm
Table2a[3,]=abs(Quantile.best.mle.n100.barbar-Quantile.true)#lnorm
Table2a[4,]=abs(Quantile.best.mle.n250.barbar-Quantile.true)#llogis
Table2a[5,]=abs(Quantile.best.mle.n500.barbar-Quantile.true)#lnorm

The resulting estimation errors – see (3) – are summarised below.

kable(Table2a, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500")
Table 2.22: Quantile bootstrap estimation errors for LogNormal samples of size n=25, n=50, n=100, n=250 and n=500
boot(50%) boot(75%) boot(95%)
n=25 (lnorm) 5,118.359 13,267.308 4.505281e+04
n=50 (lnorm) 4,140.955 5,900.422 5.205544e+03
n=100 (lnorm) 2,521.460 1,348.219 8.685972e+03
n=250 (llogis) 1,485.469 529,456.344 4.167706e+09
n=500 (lnorm) 863.649 2,414.141 8.968366e+03

The \(n=250\) sample leads to enormous quantile bootstrap estimation errors at levels \(xx=\{75\%, 95\%\}\), and we need to understand the source of this anomaly. Recall that MLE estimation in R relies on numerical optimisation solvers that may not identify a global maximum (if one exists); we ignore this possible issue once again and look into the quantile estimates for the (individual) bootstrap samples – see (1) – when \(n=250\).

kable(Quantile.best.mle.n250.each[order(Quantile.best.mle.n250.each[,3]),], digits = 3, format.args = list(big.mark = ","), caption="Quantile estimates for (individual) bootstrap samples coming from the LogNormal sample of size n=250")
Table 2.23: Quantile estimates for (individual) bootstrap samples coming from the LogNormal sample of size n=250
q_boot(50%) q_boot(75%) q_boot(95%)
25,210.59 43,962.93 1.119045e+05
22,750.66 41,324.46 1.126478e+05
24,990.67 44,010.39 1.138929e+05
23,310.87 42,398.51 1.158347e+05
23,419.48 42,528.22 1.158784e+05
24,020.06 43,434.30 1.175072e+05
23,370.40 42,751.53 1.179313e+05
22,292.95 41,610.82 1.187381e+05
21,818.53 41,420.38 1.216037e+05
22,837.61 42,674.95 1.220027e+05
22,618.70 42,687.85 1.240936e+05
22,803.86 43,248.68 1.267631e+05
22,314.19 42,897.05 1.286263e+05
21,452.17 41,930.58 1.292864e+05
23,463.28 44,368.43 1.294032e+05
21,035.65 41,706.86 1.317143e+05
24,879.30 46,867.56 1.358218e+05
24,408.74 46,340.64 1.360636e+05
23,751.00 45,733.55 1.375070e+05
23,153.26 45,209.63 1.391594e+05
24,539.01 47,225.02 1.418608e+05
28,568.35 54,309.21 1.598137e+05
23,500.18 2,856,906.46 9.092886e+09
23,597.05 4,567,632.67 3.176145e+10
23,691.79 5,924,193.83 6.333839e+10

Note that the peculiarly large quantile bootstrap estimates for \(n=250\) (llogis) are due to a few – three in this case – extraordinarily large quantile estimates for (individual) bootstrap samples amongst the \(n_b=25\) estimates. If we exclude these three bootstrap samples, then we can re-compute (2) based on the remaining 22 individual bootstrap samples, and we get the following:

#Re-compute (2) after removing the three bootstrap samples with the largest quantile estimates
Reorder <- Quantile.best.mle.n250.each[order(Quantile.best.mle.n250.each[,3]),]
Quantile.best.mle.n250.barbar.no.outlier <- colMeans(Reorder[-c(23,24,25),])
No.outlier <- matrix(0,2,3)
No.outlier[1,] <- abs(Quantile.best.mle.n250.barbar.no.outlier-Quantile.true)
No.outlier[2,] <- abs(Quantile.best.mle.n250.barbar-Quantile.true)#llogis; as computed before
colnames(No.outlier) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)")
rownames(No.outlier) <- c("n=250 (llogis) No outliers (22 bootstrap samples)","n=250 (llogis) With outliers (25 bootstrap samples)")
kable(No.outlier, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors for LogNormal samples of size n=250 without outliers")
Table 2.24: Quantile bootstrap estimation errors for LogNormal samples of size n=250 without outliers
q_boot(50%) q_boot(75%) q_boot(95%)
n=250 (llogis) No outliers (22 bootstrap samples) 1,473.960 790.525 1.262712e+04
n=250 (llogis) With outliers (25 bootstrap samples) 1,485.469 529,456.344 4.167706e+09

The estimation errors have been reduced by eliminating the outliers induced by bootstrap sampling error, and the bootstrap estimates are more reasonable. The summary of the quantile bootstrap errors is given below:

Table2aF <- matrix(0,6,3,dimnames=list(c("n=25 (lnorm)","n=50 (lnorm)","n=100 (lnorm)", "n=250 (llogis) With outliers (25 bootstrap samples)", "n=250 (llogis) No outliers (22 bootstrap samples)", "n=500 (lnorm)"),c("boot(50%)","boot(75%)","boot(95%)")))
Table2aF[1,] <- abs(Quantile.best.mle.n25.barbar-Quantile.true) #lnorm
Table2aF[2,] <- abs(Quantile.best.mle.n50.barbar-Quantile.true) #lnorm
Table2aF[3,] <- abs(Quantile.best.mle.n100.barbar-Quantile.true)#lnorm
Table2aF[4,] <- abs(Quantile.best.mle.n250.barbar-Quantile.true)#llogis
Table2aF[5,] <- No.outlier[1,]#llogis
Table2aF[6,] <- abs(Quantile.best.mle.n500.barbar-Quantile.true)#lnorm
kable(Table2aF, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (3) for all LogNormal samples with and without outliers")
Table 2.25: Quantile bootstrap estimation errors based on (3) for all LogNormal samples with and without outliers
boot(50%) boot(75%) boot(95%)
n=25 (lnorm) 5,118.359 13,267.308 4.505281e+04
n=50 (lnorm) 4,140.955 5,900.422 5.205544e+03
n=100 (lnorm) 2,521.460 1,348.219 8.685972e+03
n=250 (llogis) With outliers (25 bootstrap samples) 1,485.469 529,456.344 4.167706e+09
n=250 (llogis) No outliers (22 bootstrap samples) 1,473.960 790.525 1.262712e+04
n=500 (lnorm) 863.649 2,414.141 8.968366e+03

So far, we have reported the bootstrap errors – see (3). Two additional sets of estimates and their associated estimation errors are now provided. Specifically, we:

\[\begin{eqnarray*} &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(4)\;\text{Report} \;\big|\bar{q}^{MLE}_{boot} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(5)\;\text{Report} \;\big|\bar{q}^{MLE}_{sample} (xx;n)-xx\%\;\mbox{quantile of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ \end{eqnarray*}\] where \(\bar{\bar{q}}_{boot} (xx;n)\) is the quantile bootstrap estimator \[\bar{\bar{q}}_{boot} (xx;n) =\frac{1}{25} \sum_{i=1}^{25} q\big(xx; \hat{\theta}_{MLE}(n,i), Dist00\big),\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\] \(\bar{q}^{MLE}_{boot} (xx;n)\) is the quantile of Dist00 evaluated at the bootstrap-averaged parameter estimates \[\bar{q}^{MLE}_{boot} (xx;n) = q\bigg(xx; \widehat{\Theta}^{boot}_{MLE}(n), Dist00\bigg),\quad\text{where}\;\widehat{\Theta}^{boot}_{MLE}(n)=\frac{1}{25}\sum^{25}_{i=1} \hat{\Theta}^{boot}_{MLE}(n,i),\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\] and \(\bar{q}^{MLE}_{sample} (xx;n)\) is the quantile of Dist00 evaluated at the parameters estimated from the original sample, i.e. without bootstrapping, \[\bar{q}^{MLE}_{sample} (xx;n) = q\bigg(xx; \widehat{\Theta}^{sample}_{MLE}(n), Dist00\bigg),\quad\text{where}\;\widehat{\Theta}^{sample}_{MLE}(n)\;\text{is the MLE computed from the original sample of size}\;n,\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}.\] Note that the \(\bar{q}^{MLE}_{sample} (xx;n)\) estimation errors were already computed in Tables 2.10–2.14, before the bootstrap computations were introduced. All computations are reported for the three quantile levels \(xx=\{50\%, 75\%, 95\%\}\).

Table2b <- matrix(0,5,6,dimnames=list(c("n=25","n=50","n=100","n=250","n=500"),c("boot(50%)","boot(75%)","boot(95%)","table1(50%)","table1(75%)","table1(95%)")))
#Quantile.true
#################n=25, best model: lnorm
Quantile.best.mle.n25 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n25[1], sdlog = MLE.lnorm.boot.mean.n25[2]) 
#Quantile.best.mle.n25
Table2b[1,c(1,2,3)] <- abs(Quantile.best.mle.n25-Quantile.true)
#################n=50, best model: lnorm
Quantile.best.mle.n50 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n50[1], sdlog = MLE.lnorm.boot.mean.n50[2]) 
#Quantile.best.mle.n50
Table2b[2,c(1,2,3)] <- abs(Quantile.best.mle.n50-Quantile.true)
#################n=100, best model: lnorm
Quantile.best.mle.n100 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n100[1], sdlog = MLE.lnorm.boot.mean.n100[2])
#Quantile.best.mle.n100
Table2b[3,c(1,2,3)] <- abs(Quantile.best.mle.n100-Quantile.true)
#################n=250, best model: llogis 
Quantile.best.mle.n250 <- qllogis(c(0.5,0.75,0.95), shape=MLE.llogis.boot.mean.n250[1], scale =MLE.llogis.boot.mean.n250[2])
#Quantile.best.mle.n250
Table2b[4,c(1,2,3)] <- abs(Quantile.best.mle.n250-Quantile.true)
#################n=500, best model: lnorm
Quantile.best.mle.n500 <- qlnorm(c(0.5,0.75,0.95), meanlog = MLE.lnorm.boot.mean.n500[1], sdlog = MLE.lnorm.boot.mean.n500[2])
#Quantile.best.mle.n500
Table2b[5,c(1,2,3)] <- abs(Quantile.best.mle.n500-Quantile.true)
#From Table 1 before
Table2b[1,c(4,5,6)] <- Table1.1[1,c(2,4,6)] #lnorm, i.e first row
Table2b[2,c(4,5,6)] <- Table1.2[1,c(2,4,6)] #lnorm, i.e first row
Table2b[3,c(4,5,6)] <- Table1.3[1,c(2,4,6)] #lnorm, i.e first row
Table2b[4,c(4,5,6)] <- Table1.4[5,c(2,4,6)] #llogis, i.e fifth row
Table2b[5,c(4,5,6)] <- Table1.5[1,c(2,4,6)] #lnorm, i.e first row 
colnames(Table2b) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)", "q_sample(50%)","q_sample(75%)","q_sample(95%)")
rownames(Table2b) <- c("n=25 (lnorm)", "n=50 (lnorm)", "n=100 (lnorm)", "n=250 (llogis)", "n=500 (lnorm)")
kable(Table2b, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (4) and (5) for all LogNormal samples with outliers")
Table 2.26: Quantile bootstrap estimation errors based on (4) and (5) for all LogNormal samples with outliers

|                | q_boot(50%) | q_boot(75%) | q_boot(95%) | q_sample(50%) | q_sample(75%) | q_sample(95%) |
|----------------|------------:|------------:|------------:|--------------:|--------------:|--------------:|
| n=25 (lnorm)   |   5,263.810 |  13,660.436 |  47,146.892 |     4,730.688 |    11,397.029 |    37,486.570 |
| n=50 (lnorm)   |   4,398.817 |   6,809.925 |  10,592.784 |     4,465.168 |     6,877.322 |    10,498.602 |
| n=100 (lnorm)  |   2,422.207 |   1,108.441 |   9,650.478 |     1,176.609 |     1,099.864 |    14,678.478 |
| n=250 (llogis) |   1,485.469 |   4,115.184 |  39,440.623 |     1,148.783 |       224.350 |    10,911.390 |
| n=500 (lnorm)  |     880.347 |   2,446.697 |   9,127.131 |       989.887 |     2,519.477 |     8,799.418 |

The estimation errors show a reasonable pattern: the larger the sample, the smaller the error. Note that outliers are not removed here, and in particular the sample of size \(n=250\) shows erratic behaviour. This is very likely due to the sensitivity of the R optimisation routine used in the MLE computations, which amplified the effect of outliers in individual bootstrap samples. Therefore, one should first check the optimality of the MLE estimates computed by R, and then detect possible outliers.
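One simple way to flag such outliers before averaging is an IQR screen applied to the bootstrap quantile estimates. The sketch below is illustrative only; it assumes the \(25\times 3\) matrix q.boot.all of bootstrap quantile estimates from the sketch above, and the 1.5 IQR multiplier is an arbitrary choice.

## Illustrative IQR-based outlier screen: bootstrap samples flagged in any quantile
## column are dropped before the remaining quantile estimates are averaged.
flag.outlier <- function(x) {
  qs <- quantile(x, c(0.25, 0.75))
  iqr <- qs[2] - qs[1]
  (x < qs[1] - 1.5 * iqr) | (x > qs[2] + 1.5 * iqr)
}
keep <- !apply(apply(q.boot.all, 2, flag.outlier), 1, any)
q.barbar.clean <- colMeans(q.boot.all[keep, , drop = FALSE])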

2.5 Weibull Sample with Moderately Heavy Tail

We now redo the computations from Sections 2.2, 2.3 and 2.4 for Sample 2, which is drawn from the Weibull distribution with parameters \(a=0.5\) and \(b=1\). Note that this particular Weibull distribution is moderately heavy tailed since \(a<1\). We only report the final R outputs, since the intermediate steps are very similar to those of the Sample 1 computations.

Note that set.seed(1234) is used as the seed for generating Sample 2, but no seed is set for the bootstrap computations.

set.seed(1234) # Generating Sample 2 and saving it in a list
n=c(25, 50, 100, 250, 500)
weibull.shape=0.5
weibull.scale=1
Sample2.list=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
  Sample2=rweibull(n[i], shape=weibull.shape, scale=weibull.scale)  # Weibull(0.5, 1) sample of size n[i]
  Sample2.list[[i]]=Sample2
}
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption=("AIC summary for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.27: AIC summary for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|            | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|------------|-----------:|-----------:|------------:|------------:|------------:|
| lnorm      |     86.957 |    153.419 |     254.780 |     517.470 |   1,107.682 |
| exp        |     96.413 |    185.928 |     302.191 |     804.243 |   1,748.775 |
| gamma      |     87.293 |    148.174 |     209.015 |     489.321 |   1,087.917 |
| Weibull    |     85.648 |    145.817 |     216.315 |     469.088 |   1,042.992 |
| llogis     |     87.223 |    150.473 |     243.450 |     499.986 |   1,098.118 |
| invWeibull |     94.412 |    180.450 |     314.424 |     646.758 |   1,315.670 |
| Pareto     |     87.122 |    154.147 |     269.159 |     544.106 |   1,187.851 |
| trgamma    |     87.257 |    147.829 |     210.463 |     471.080 |   1,044.309 |
kable(Table.AIC.min, digits = 3, format.args = list(big.mark = ","), caption=("Best (or DISToo) distribution chosen through AIC for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.28: Best (or DISToo) distribution chosen through AIC for moderately heavy tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|           | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|-----------|------------|------------|-------------|-------------|-------------|
| Min (AIC) | Weibull    | Weibull    | gamma       | Weibull     | Weibull     |

It looks like AIC makes the right choice, except for the Weibull sample of size \(n=100\), where gamma marginally outperforms the true choice, i.e. weibull.

We now perform the usual bootstrap sampling.

The sets of \(n_b=25\) MLE estimates are now averaged to obtain the so-called bootstrap MLE estimates, which are tabulated below.
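As an illustration, a bootstrap MLE estimate such as MLE.weibull.boot.mean.n25 could be obtained along the following lines; the resampling loop below is only a sketch (the object names and structure are assumptions), and no seed is set, in line with the remark above.

nb <- 25                                            # number of bootstrap samples
boot.mle <- t(replicate(nb, {
  xb <- sample(Sample2.list$n25, replace = TRUE)    # bootstrap resample of Sample 2, n = 25
  fitdist(xb, "weibull", method = "mle")$estimate   # Weibull MLE (shape, scale)
}))
MLE.weibull.boot.mean.n25 <- colMeans(boot.mle)     # average of the 25 bootstrap MLEs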

kable(MLE.weibull.boot.mean.n25, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=25)")
Table 2.29: Bootstrap MLE estimates assuming weibull (n=25)
x
shape 0.616
scale 1.715
kable(MLE.weibull.boot.mean.n50, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=50)")
Table 2.30: Bootstrap MLE estimates assuming weibull (n=50)
x
shape 0.539
scale 1.345
kable(MLE.gamma.boot.mean.n100, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming gamma (n=100)")
Table 2.31: Bootstrap MLE estimates assuming gamma (n=100)
x
shape 0.39
rate 0.24
kable(MLE.weibull.boot.mean.n250, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=250)")
Table 2.32: Bootstrap MLE estimates assuming weibull (n=250)
x
shape 0.481
scale 0.826
kable(MLE.weibull.boot.mean.n500, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=500)")
Table 2.33: Bootstrap MLE estimates assuming weibull (n=500)
x
shape 0.479
scale 0.928

As before, we report three sets of estimation errors as follows:

\[\begin{eqnarray*} &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(4)\;\text{Report} \;\big|\bar{q}^{MLE}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(5)\;\text{Report} \;\big|\bar{q}^{MLE}_{sample} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}. \end{eqnarray*}\]

The summary of the quantile bootstrap errors – without any outlier detection – is given below:

kable(Table2a, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (3) for all moderately heavy tailed Weibull samples")
Table 2.34: Quantile bootstrap estimation errors based on (3) for all moderately heavy tailed Weibull samples

|                 | boot(50%) | boot(75%) | boot(95%) |
|-----------------|----------:|----------:|----------:|
| n=25 (weibull)  |     0.450 |     1.055 |     2.223 |
| n=50 (weibull)  |     0.203 |     0.548 |     1.552 |
| n=100 (gamma)   |     0.096 |     0.140 |     2.028 |
| n=250 (weibull) |     0.094 |     0.294 |     0.863 |
| n=500 (weibull) |     0.049 |     0.086 |     0.230 |
colnames(Table2b) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)", "q_sample(50%)","q_sample(75%)","q_sample(95%)")
rownames(Table2b) <- c("n=25 (weibull)", "n=50 (weibull)", "n=100 (gamma)", "n=250 (weibull)", "n=500 (weibull)")
kable(Table2b, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (4) and (5) for all moderately heavy tailed Weibull samples")
Table 2.35: Quantile bootstrap estimation errors based on (4) and (5) for all moderately heavy tailed Weibull samples

|                 | q_boot(50%) | q_boot(75%) | q_boot(95%) | q_sample(50%) | q_sample(75%) | q_sample(95%) |
|-----------------|------------:|------------:|------------:|--------------:|--------------:|--------------:|
| n=25 (weibull)  |       0.465 |       0.992 |       1.204 |         0.430 |         0.849 |         0.576 |
| n=50 (weibull)  |       0.201 |       0.544 |       1.327 |         0.198 |         0.509 |         1.058 |
| n=100 (gamma)   |       0.093 |       0.118 |       2.160 |         0.088 |         0.136 |         2.014 |
| n=250 (weibull) |       0.095 |       0.294 |       0.905 |         0.090 |         0.281 |         0.885 |
| n=500 (weibull) |       0.049 |       0.087 |       0.203 |         0.052 |         0.078 |         0.358 |

The estimation errors show a reasonable pattern: the larger the sample, the smaller the error. Note that outliers are not removed in this section. You should also note that the true quantile values are \(q(50\%)=0.480453\), \(q(75\%)=1.921812\) and \(q(95\%)=8.974412\), so the estimation errors are very small and roughly proportional to the true values.
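These true quantiles can be checked directly in R:

qweibull(c(0.5, 0.75, 0.95), shape = 0.5, scale = 1)
## 0.480453 1.921812 8.974412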

2.6 Weibull Sample with Very Light Tail

We now redo the computations from Sections 2.2, 2.3 and 2.4 for Sample 3, which is drawn from the Weibull distribution with parameters \(a=2\) and \(b=1\). Note that this particular Weibull distribution is very light tailed since \(a>1\). We only report the final R outputs, since the intermediate steps are very similar to those of the Sample 1 computations.

Note that set.seed(1234) is used as the seed for generating Sample 3, but no seed is set for the bootstrap computations.

set.seed(1234) # Generating Sample 3 and saving it in a list
n=c(25, 50, 100, 250, 500)
weibull.shape=2
weibull.scale=1
Sample3.list=list(n25=rep(0,25),n50=rep(0,50),n100=rep(0,100),n250=rep(0,250),n500=rep(0,500))
for (i in 1:length(n)) {
  Sample3=rweibull(n[i], shape=weibull.shape, scale=weibull.scale)  # Weibull(2, 1) sample of size n[i]
  Sample3.list[[i]]=Sample3
}
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption=("AIC summary for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.36: AIC summary for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|            | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|------------|-----------:|-----------:|------------:|------------:|------------:|
| lnorm      |     32.430 |     71.909 |     161.961 |     339.342 |     677.589 |
| exp        |     52.137 |     96.946 |     178.449 |     419.694 |     862.573 |
| gamma      |     30.941 |     65.430 |     135.475 |     299.051 |     620.399 |
| Weibull    |     31.122 |     64.306 |     123.496 |     290.960 |     612.898 |
| llogis     |     32.697 |     68.962 |     150.631 |     321.859 |     668.024 |
| invWeibull |     39.886 |     98.940 |     221.605 |     468.631 |     885.576 |
| Pareto     |     87.122 |    154.147 |     269.159 |     544.106 |   1,187.851 |
| trgamma    |     32.752 |     67.400 |     117.892 |     292.960 |     618.699 |
kable(Table.AIC.min, digits = 3, format.args = list(big.mark = ","), caption=("Best (or DISToo) distribution chosen through AIC for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500"))
Table 2.37: Best (or DISToo) distribution chosen through AIC for very light tailed Weibull samples of size n=25, n=50, n=100, n=250 and n=500

|           | (MLE,n=25) | (MLE,n=50) | (MLE,n=100) | (MLE,n=250) | (MLE,n=500) |
|-----------|------------|------------|-------------|-------------|-------------|
| Min (AIC) | gamma      | Weibull    | trgamma     | Weibull     | Weibull     |

It looks like AIC makes the right choice, except for the Weibull samples of size \(n=25\) and \(n=100\), where gamma and trgamma marginally outperform the `true' choice, namely weibull. Since MLE estimation for trgamma is not very stable (there are three parameters to estimate), we will further assume that the `best' AIC choice for the Weibull sample of size \(n=100\) is weibull, which is the second-best AIC-based choice and is not far from the `best' choice.
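To illustrate the remark about trgamma, fitting the three-parameter transformed gamma with fitdist() typically requires explicit starting values, and the optimisation can be sensitive to them. The sketch below (with arbitrary starting values; all object names are assumptions) indicates how the AIC comparison for the \(n=100\) sample could be reproduced.

x <- Sample3.list$n100
fit.trg <- fitdist(x, "trgamma", method = "mle",
                   start = list(shape1 = 1, shape2 = 1, rate = 1))  # arbitrary starting values
fit.wei <- fitdist(x, "weibull", method = "mle")
c(trgamma = fit.trg$aic, weibull = fit.wei$aic)                     # compare the two AIC values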

We now perform the usual bootstrap sampling.

The sets of \(n_b=25\) MLE estimates are now averaged to obtain the so-called bootstrap MLE estimates, which are tabulated below.

kable(MLE.gamma.boot.mean.n25, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming gamma (n=25)")
Table 2.38: Bootstrap MLE estimates assuming gamma (n=25)
x
shape 5.028
rate 4.946
kable(MLE.weibull.boot.mean.n50, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=50)")
Table 2.39: Bootstrap MLE estimates assuming weibull (n=50)
x
shape 2.245
scale 1.063
kable(MLE.weibull.boot.mean.n100, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=100)")
Table 2.40: Bootstrap MLE estimates assuming weibull (n=100)
x
shape 2.104
scale 0.996
kable(MLE.weibull.boot.mean.n250, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=250)")
Table 2.41: Bootstrap MLE estimates assuming weibull (n=250)
x
shape 1.927
scale 0.937
kable(MLE.weibull.boot.mean.n500, digits = 3, format.args = list(big.mark = ","), caption="Bootstrap MLE estimates assuming weibull (n=500)")
Table 2.42: Bootstrap MLE estimates assuming weibull (n=500)
x
shape 1.898
scale 0.978

As before, we report three sets of estimation errors as follows:

\[\begin{eqnarray*} &&(3)\;\text{Report} \;\big|\bar{\bar{q}}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(4)\;\text{Report} \;\big|\bar{q}^{MLE}_{boot} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\},\\ &&(5)\;\text{Report} \;\big|\bar{q}^{MLE}_{sample} (xx;n)-xx\%\mbox{quantiles of the 'true' distribution}\big|\quad \text{for all}\;n=\{25, 50, 100, 250, 500\}. \end{eqnarray*}\]

The summary of the quantile bootstrap errors – without any outlier detection – is given below:

kable(Table2a, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (3) for all very light tailed Weibull samples")
Table 2.43: Quantile bootstrap estimation errors based on (3) for all very light tailed Weibull samples

|                 | boot(50%) | boot(75%) | boot(95%) |
|-----------------|----------:|----------:|----------:|
| n=25 (gamma)    |     0.116 |     0.105 |     0.157 |
| n=50 (weibull)  |     0.069 |     0.054 |     0.013 |
| n=100 (weibull) |     0.004 |     0.013 |     0.047 |
| n=250 (weibull) |     0.058 |     0.067 |     0.073 |
| n=500 (weibull) |     0.027 |     0.016 |     0.013 |
colnames(Table2) <- c("q_boot(50%)","q_boot(75%)","q_boot(95%)", "q_sample(50%)","q_sample(75%)","q_sample(95%)")
rownames(Table2) <- c("n=25 (gamma)", "n=50 (weibull)", "n=100 (weibull)", "n=250 (weibull)", "n=500 (weibull)")
kable(Table2, digits = 3, format.args = list(big.mark = ","), caption="Quantile bootstrap estimation errors based on (4) and (5) for all very light tailed Weibull samples")
Table 2.44: Quantile bootstrap estimation errors based on (4) and (5) for all very light tailed Weibull samples

|                 | q_boot(50%) | q_boot(75%) | q_boot(95%) | q_sample(50%) | q_sample(75%) | q_sample(95%) |
|-----------------|------------:|------------:|------------:|--------------:|--------------:|--------------:|
| n=25 (gamma)    |       0.118 |       0.098 |       0.128 |         0.105 |         0.079 |         0.096 |
| n=50 (weibull)  |       0.070 |       0.052 |       0.002 |         0.075 |         0.071 |         0.049 |
| n=100 (weibull) |       0.004 |       0.014 |       0.052 |         0.003 |         0.009 |         0.034 |
| n=250 (weibull) |       0.058 |       0.067 |       0.075 |         0.042 |         0.046 |         0.044 |
| n=500 (weibull) |       0.027 |       0.016 |       0.012 |         0.023 |         0.012 |         0.017 |

The estimation errors show a reasonable pattern: the larger the sample, the smaller the error. Note that outliers are not removed in this section. You should also note that the true quantile values are \(q(50\%)=0.8325546\), \(q(75\%)=1.1774100\) and \(q(95\%)=1.7308184\), so the estimation errors are very small and roughly proportional to the true values.

3 Real Data Analysis


In this section, you learn to:


The dataset analysed in this section is the one described in Section 1.2, for which a detailed exploratory data analysis is provided in Section 1.3. We particularly focus on Building Insurance Claim Size for claims recorded in 1990.

We first fit the `best’ MLE distribution from the eight possible choices, and report

\[ \hat{q}^{MLE}_{50\%},\quad\hat{q}^{MLE}_{75\%}\quad\text{and}\quad \hat{q}^{MLE}_{95\%}, \]

i.e. the theoretical quantiles for the distribution (these are three point estimates).

Note that the claims are quite large; therefore, we scale all claims down by a factor of \(100,000\), and the final outputs are later scaled back up by the same factor.

#View(Bivariate_Fire_Insurance_Clean_Data)
Sample.RealData.raw=Bivariate_Fire_Insurance_Clean_Data[[2]][1336:1502]
#head(Bivariate_Fire_Insurance_Clean_Data)
#head(Sample.RealData.raw)
#tail(Sample.RealData.raw)
#In order to use exp and trgamma fitdist(), we scale the data by 100,000
Sample.RealData=Sample.RealData.raw/100000
summary(Sample.RealData)
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
  0.8251   8.1848  10.9736  15.4773  16.5017 116.9554 
kable(Table.AIC, digits = 3, format.args = list(big.mark = ","), caption=("AIC summary for Building Insurance Claim Size recorded in 1990"))
Table 3.1: AIC summary for Building Insurance Claim Size recorded in 1990

|            | AIC (MLE,Sample.RealData) |
|------------|--------------------------:|
| lnorm      |                 1,203.430 |
| exp        |                 1,250.951 |
| gamma      |                 1,222.947 |
| Weibull    |                 1,238.022 |
| llogis     |                 1,186.759 |
| invWeibull |                 1,260.543 |
| Pareto     |                 1,252.947 |
| trgamma    |                 1,208.347 |
kable(Table.AIC.min, digits = 3, format.args = list(big.mark = ","), caption=("Best (or DISToo) distribution chosen through AIC for Building Insurance Claim Size recorded in 1990"))
Table 3.2: Best (or DISToo) distribution chosen through AIC for Building Insurance Claim Size recorded in 1990

|           | AIC(MLE,Best dist) |
|-----------|--------------------|
| Min (AIC) | llogis             |

Thus, llogis is the best AIC-based parametric choice for fitting the Building claims that occurred in 1990. The point estimates are provided in Table 3.3, which gives the non-parametric and `best' AIC-based parametric quantile point estimates.

colnames(TableB.1) <- c("q(50%)","q(75%)","q(95%)")
rownames(TableB.1) <- c("Non-parametric", "AIC Parametric - llogis")
kable(TableB.1, format.args = list(big.mark = ","), caption=("Quantile point estimates for Building Insurance Claim Size recorded in 1990"))
Table 3.3: Quantile point estimates for Building Insurance Claim Size recorded in 1990

|                         |    q(50%) |    q(75%) |    q(95%) |
|-------------------------|----------:|----------:|----------:|
| Non-parametric          | 1,097,360 | 1,650,165 | 3,770,627 |
| AIC Parametric - llogis | 1,143,608 | 1,767,316 | 3,672,177 |

It is interesting to note that the non-parametric high quantile point estimate – at level \(95\%\) – is larger than its parametric estimate. We now use bootstrapping to understand the variability of our point estimates.
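As a point of reference, a minimal sketch of how the two rows of Table 3.3 could be produced is given below; the fit object fit.llogis and the starting values are assumptions, and the factor \(100,000\) undoes the earlier scaling.

p <- c(0.5, 0.75, 0.95)
q.nonpar <- quantile(Sample.RealData.raw, p)        # non-parametric (empirical) quantiles
fit.llogis <- fitdist(Sample.RealData, "llogis", method = "mle",
                      start = list(shape = 1, scale = median(Sample.RealData)))
q.param <- 100000 * qllogis(p, shape = fit.llogis$estimate[1],
                            scale = fit.llogis$estimate[2])  # llogis quantiles, scaled back up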

Our first setting is to bootstrap 100 times, i.e. we generate 100 samples with replacement from the original sample of size 167: \[ \tilde{x}_1,\ldots,\tilde{x}_{100}\quad \text{with}\quad dim(\tilde{x}_i)=167 \quad\text{for all}\;1\le i \le 100. \] We then estimate the parameter(s) \(\hat{\Theta}_{MLE,i}\) of Distoo, i.e. llogis in this case, from the bootstrap sample \(\tilde{x}_{i}\) for all \(1\le i \le 100\). Further, we compute \[ \hat{q}^{MLE,i}_{xx} \quad\text{for all}\quad 1\le i \le 100\quad\text{and}\quad xx\in \{50\%, 75\%, 95\%\}, \] where \(\hat{q}^{MLE,i}_{xx}\) is the \(xx\%\) quantile of Distoo with \(\Theta=\hat{\Theta}_{MLE,i}\). Furthermore, we report

\[\begin{eqnarray*} &&(i)\;\text{Bootstrap Estimate} \quad\hat{q}^{MLE,boot}_{xx}=\frac{1}{100}\sum^{100}_{i=1} \hat{q}^{MLE,i}_{xx},\\ &&(ii)\;\text{Bootstrap Variance} \quad\widehat{Var}\left(\hat{q}^{MLE,boot}_{xx}\right)=\frac{1}{100}\sum^{100}_{i=1}\left(\hat{q}^{MLE,i}_{xx}-\hat{q}^{MLE,boot}_{xx}\right)^2,\\ &&(iii)\;\text{MSE Estimate} \quad\widehat{MSE}\left(\hat{q}^{MLE,boot}_{xx}\right)=\frac{1}{100}\sum^{100}_{i=1}\left(\hat{q}^{MLE,i}_{xx}-\hat{q}^{MLE}_{xx}\right)^2, \end{eqnarray*}\] for all \(xx\in \{50\%, 75\%, 95\%\}\).

The bootstrap samples are generated without fixing the sampling seed, as should be the case for a randomised experiment; a sketch of these computations is given below.
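A minimal sketch of (i) to (iii) for the 100-bootstrap setting, reusing the hypothetical objects fit.llogis and q.param from the previous sketch, where q.param plays the role of \(\hat{q}^{MLE}_{xx}\):

B <- 100
p <- c(0.5, 0.75, 0.95)
q.boot <- t(replicate(B, {
  xb <- sample(Sample.RealData, replace = TRUE)     # bootstrap resample of size 167
  est <- fitdist(xb, "llogis", method = "mle",
                 start = list(shape = 1, scale = median(xb)))$estimate
  100000 * qllogis(p, shape = est[1], scale = est[2])   # quantiles scaled back up
}))
boot.est <- colMeans(q.boot)                        # (i)  bootstrap estimate
boot.var <- colMeans(sweep(q.boot, 2, boot.est)^2)  # (ii) bootstrap variance
mse.est <- colMeans(sweep(q.boot, 2, q.param)^2)    # (iii) MSE against the sample MLE quantiles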

We also bootstrap our claim data 150 times and perform the same computations. All results are reported in Tables 3.4 and 3.5.

colnames(TableB.2a) <- c("q(50%)","q(75%)","q(95%)")
rownames(TableB.2a) <- c("Bootstrap Estimate", "Bootstrap Variance" , "MSE Estimate", "Bootstrap Variance / Bootstrap Estimate", "MSE Estimate / Bootstrap Estimate")
kable(TableB.2a, format.args = list(big.mark = ","), caption=("Summary of bootstrap estimation -- 100 bootstrap samples"))
Table 3.4: Summary of bootstrap estimation – 100 bootstrap samples

|                                         |       q(50%) |       q(75%) |       q(95%) |
|-----------------------------------------|-------------:|-------------:|-------------:|
| Bootstrap Estimate                      | 1.150928e+06 | 1.779085e+06 | 3.709717e+06 |
| Bootstrap Variance                      | 2.744746e+09 | 1.250897e+10 | 1.882482e+11 |
| MSE Estimate                            | 2.770871e+09 | 1.252238e+10 | 1.877750e+11 |
| Bootstrap Variance / Bootstrap Estimate | 2.384812e+03 | 7.031123e+03 | 5.074461e+04 |
| MSE Estimate / Bootstrap Estimate       | 2.407510e+03 | 7.038661e+03 | 5.061705e+04 |
colnames(TableB.2b) <- c("q(50%)","q(75%)","q(95%)")
rownames(TableB.2b) <- c("Bootstrap Estimate", "Bootstrap Variance" , "MSE Estimate", "Bootstrap Variance / Bootstrap Estimate", "MSE Estimate / Bootstrap Estimate")
kable(TableB.2b, format.args = list(big.mark = ","), caption=("Summary of bootstrap estimation -- 150 bootstrap samples"))
Table 3.5: Summary of bootstrap estimation – 150 bootstrap samples

|                                         |       q(50%) |       q(75%) |       q(95%) |
|-----------------------------------------|-------------:|-------------:|-------------:|
| Bootstrap Estimate                      | 1.147084e+06 | 1.769867e+06 | 3.677304e+06 |
| Bootstrap Variance                      | 3.309178e+09 | 1.125144e+10 | 1.468273e+11 |
| MSE Estimate                            | 3.299195e+09 | 1.118293e+10 | 1.458747e+11 |
| Bootstrap Variance / Bootstrap Estimate | 2.392803e+03 | 7.067744e+03 | 5.119189e+04 |
| MSE Estimate / Bootstrap Estimate       | 2.415578e+03 | 7.075322e+03 | 5.106321e+04 |