Compare an empirical data set against several distributions to help identify the one that fits best.

Usage

tidy_distribution_comparison(.x, .distribution_type = "continuous")

Arguments

.x

The data set being passed to the function

.distribution_type

The kind of data supplied; one of "continuous" or "discrete".

Value

An invisible list object. A tibble is printed.

Details

The purpose of this function is to take a data set and try to find the distribution that fits it best. The .distribution_type parameter must be set to either continuous or discrete so that the function tries the appropriate types of distributions.

The following distributions are used:

Continuous:

  • tidy_beta

  • tidy_cauchy

  • tidy_exponential

  • tidy_gamma

  • tidy_logistic

  • tidy_lognormal

  • tidy_normal

  • tidy_pareto

  • tidy_uniform

  • tidy_weibull

Discrete:

  • tidy_binomial

  • tidy_geometric

  • tidy_hypergeometric

  • tidy_poisson

The function returns a list of tibbles. The following tibbles are returned:

  • comparison_tbl

  • deviance_tbl

  • total_deviance_tbl

  • aic_tbl

  • kolmogorov_smirnov_tbl

  • multi_metric_tbl

The comparison_tbl is a long tibble that lists the values of the density function against the given data.

The deviance_tbl and the total_deviance_tbl give the simple difference between the actual (empirical) density and the estimated density for each estimated distribution.
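
As a rough illustration of that idea, and not the package's internal code, the deviance can be pictured as the empirical density minus a fitted density evaluated on the same grid. The sketch below uses base R only and assumes a normal fit parameterised by the sample mean and standard deviation; the names `emp`, `fit_dy`, `dev`, and `abs_tot_dev` are hypothetical, and taking the absolute value of the summed differences is an assumption suggested by the `abs_tot_deviance` column name.

```r
# Illustrative sketch only; not the TidyDensity implementation.
x <- mtcars$mpg

# Empirical kernel density estimate (base R evaluates it on 512 grid points)
emp <- density(x)

# Density of a normal distribution fitted by moments, on the same grid
fit_dy <- dnorm(emp$x, mean = mean(x), sd = sd(x))

# Pointwise difference: empirical minus estimated (assumed sign convention)
dev <- emp$y - fit_dy

# Absolute value of the summed differences (assumed aggregation)
abs_tot_dev <- abs(sum(dev))
```

A value of `abs_tot_dev` close to zero would suggest that the positive and negative differences roughly cancel, which is why this number is best read alongside the other metrics rather than on its own.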

The aic_tbl provides the AIC of an lm model of the estimated density against the empirical density.
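
To make the AIC step concrete, here is a minimal base R sketch under stated assumptions: a normal fit stands in for the estimated distribution, and the empirical density is regressed on the estimated one. The direction of the regression and the names `dens_df`, `mod`, and `aic_value` are assumptions made for illustration; the package may set the model up differently.

```r
# Illustrative sketch only; not the TidyDensity implementation.
x <- mtcars$mpg

# Empirical kernel density estimate
emp <- density(x)

# Empirical and estimated (normal, fitted by moments) densities on one grid
dens_df <- data.frame(
  empirical = emp$y,
  estimated = dnorm(emp$x, mean = mean(x), sd = sd(x))
)

# Linear model of one density against the other (assumed direction)
mod <- lm(empirical ~ estimated, data = dens_df)

# Akaike information criterion of that model
aic_value <- AIC(mod)
```

Because density values are small, the residual variance of such a model is small as well, so the resulting AIC can be negative.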

The kolmogorov_smirnov_tbl currently provides the result of a two.sided ks.test of the estimated density against the empirical density.
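
For readers unfamiliar with `ks.test`, the following base R sketch shows a two-sided, two-sample Kolmogorov-Smirnov test between the data and a sample drawn from a fitted distribution. It assumes a normal fit by moments and a fixed seed; `sim` and `ks` are hypothetical names, and the package's own inputs and p-value method may differ.

```r
# Illustrative sketch only; not the TidyDensity implementation.
x <- mtcars$mpg

# Draw a sample from a normal distribution fitted by moments
set.seed(123)
sim <- rnorm(length(x), mean = mean(x), sd = sd(x))

# Two-sided, two-sample Kolmogorov-Smirnov test.
# Note: x contains ties, which may trigger a warning depending on the R version.
ks <- ks.test(x, sim, alternative = "two.sided")

ks$statistic  # largest distance between the two empirical CDFs
ks$p.value    # p-value of the test
```

A small statistic together with a large p-value means the test found no evidence that the two samples come from different distributions.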

The multi_metric_tbl will summarise all of these metrics into a single tibble.

Author

Steven P. Sanderson II, MPH

Examples

xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.

xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")

output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#>    sim_number     x     y    dx       dy     p     q dist_type
#>    <fct>      <int> <dbl> <dbl>    <dbl> <dbl> <dbl> <fct>    
#>  1 1              1  21    2.97 0.000114 0.625  10.4 Empirical
#>  2 1              2  21    4.21 0.000455 0.625  10.4 Empirical
#>  3 1              3  22.8  5.44 0.00142  0.781  13.3 Empirical
#>  4 1              4  21.4  6.68 0.00355  0.688  14.3 Empirical
#>  5 1              5  18.7  7.92 0.00721  0.469  14.7 Empirical
#>  6 1              6  18.1  9.16 0.0124   0.438  15   Empirical
#>  7 1              7  14.3 10.4  0.0192   0.125  15.2 Empirical
#>  8 1              8  24.4 11.6  0.0281   0.812  15.2 Empirical
#>  9 1              9  22.8 12.9  0.0395   0.781  15.5 Empirical
#> 10 1             10  19.2 14.1  0.0516   0.531  15.8 Empirical
#> # … with 342 more rows
#> 
#> $deviance_tbl
#> # A tibble: 352 × 2
#>    name                      value
#>    <chr>                     <dbl>
#>  1 Empirical                0.451 
#>  2 Beta c(1.11, 1.58, 0)   -0.457 
#>  3 Cauchy c(19.2, 7.38)     0.0778
#>  4 Exponential c(0.05)      0.234 
#>  5 Gamma c(11.47, 1.75)     0.381 
#>  6 Logistic c(20.09, 3.27)  0.179 
#>  7 Lognormal c(2.96, 0.29)  0.300 
#>  8 Pareto c(10.4, 1.62)     0.451 
#>  9 Uniform c(8.34, 31.84)  -0.356 
#> 10 Weibull c(3.58, 22.29)  -0.105 
#> # … with 342 more rows
#> 
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#>    dist_with_params        abs_tot_deviance
#>    <chr>                              <dbl>
#>  1 Cauchy c(19.2, 7.38)              0.0785
#>  2 Beta c(1.11, 1.58, 0)             0.444 
#>  3 Logistic c(20.09, 3.27)           1.15  
#>  4 Gamma c(11.47, 1.75)              1.66  
#>  5 Uniform c(8.34, 31.84)            2.66  
#>  6 Weibull c(3.58, 22.29)            3.36  
#>  7 Gaussian c(20.09, 5.93)           3.47  
#>  8 Lognormal c(2.96, 0.29)           5.64  
#>  9 Exponential c(0.05)               6.19  
#> 10 Pareto c(10.4, 1.62)              9.51  
#> 
#> $aic_tbl
#> # A tibble: 10 × 3
#>    dist_type               aic_value abs_aic
#>    <fct>                       <dbl>   <dbl>
#>  1 Beta c(1.11, 1.58, 0)       -48.9    48.9
#>  2 Pareto c(10.4, 1.62)        106.    106. 
#>  3 Gaussian c(20.09, 5.93)    -167.    167. 
#>  4 Lognormal c(2.96, 0.29)    -169.    169. 
#>  5 Gamma c(11.47, 1.75)       -179.    179. 
#>  6 Weibull c(3.58, 22.29)     -197.    197. 
#>  7 Uniform c(8.34, 31.84)     -207.    207. 
#>  8 Logistic c(20.09, 3.27)    -217.    217. 
#>  9 Cauchy c(19.2, 7.38)       -233.    233. 
#> 10 Exponential c(0.05)        -236.    236. 
#> 
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#>    dist_type               ks_statistic ks_pvalue ks_method      alter…¹ dist_…²
#>    <fct>                          <dbl>     <dbl> <chr>          <chr>   <chr>  
#>  1 Beta c(1.11, 1.58, 0)          0.781  0.000500 Monte-Carlo t… two-si… Beta c…
#>  2 Cauchy c(19.2, 7.38)           0.375  0.0210   Monte-Carlo t… two-si… Cauchy…
#>  3 Exponential c(0.05)            0.531  0.000500 Monte-Carlo t… two-si… Expone…
#>  4 Gamma c(11.47, 1.75)           0.125  0.969    Monte-Carlo t… two-si… Gamma …
#>  5 Logistic c(20.09, 3.27)        0.188  0.619    Monte-Carlo t… two-si… Logist…
#>  6 Lognormal c(2.96, 0.29)        0.312  0.0930   Monte-Carlo t… two-si… Lognor…
#>  7 Pareto c(10.4, 1.62)           0.5    0.00150  Monte-Carlo t… two-si… Pareto…
#>  8 Uniform c(8.34, 31.84)         0.281  0.161    Monte-Carlo t… two-si… Unifor…
#>  9 Weibull c(3.58, 22.29)         0.25   0.277    Monte-Carlo t… two-si… Weibul…
#> 10 Gaussian c(20.09, 5.93)        0.125  0.972    Monte-Carlo t… two-si… Gaussi…
#> # … with abbreviated variable names ¹​alternative, ²​dist_char
#> 
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#>    dist_type             abs_t…¹ aic_v…² abs_aic ks_st…³ ks_pv…⁴ ks_me…⁵ alter…⁶
#>    <fct>                   <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <chr>   <chr>  
#>  1 Cauchy c(19.2, 7.38)   0.0785  -233.    233.    0.375 2.10e-2 Monte-… two-si…
#>  2 Beta c(1.11, 1.58, 0)  0.444    -48.9    48.9   0.781 5.00e-4 Monte-… two-si…
#>  3 Logistic c(20.09, 3.…  1.15    -217.    217.    0.188 6.19e-1 Monte-… two-si…
#>  4 Gamma c(11.47, 1.75)   1.66    -179.    179.    0.125 9.69e-1 Monte-… two-si…
#>  5 Uniform c(8.34, 31.8…  2.66    -207.    207.    0.281 1.61e-1 Monte-… two-si…
#>  6 Weibull c(3.58, 22.2…  3.36    -197.    197.    0.25  2.77e-1 Monte-… two-si…
#>  7 Gaussian c(20.09, 5.…  3.47    -167.    167.    0.125 9.72e-1 Monte-… two-si…
#>  8 Lognormal c(2.96, 0.…  5.64    -169.    169.    0.312 9.30e-2 Monte-… two-si…
#>  9 Exponential c(0.05)    6.19    -236.    236.    0.531 5.00e-4 Monte-… two-si…
#> 10 Pareto c(10.4, 1.62)   9.51     106.    106.    0.5   1.50e-3 Monte-… two-si…
#> # … with abbreviated variable names ¹​abs_tot_deviance, ²​aic_value,
#> #   ³​ks_statistic, ⁴​ks_pvalue, ⁵​ks_method, ⁶​alternative
#> 
#> attr(,".x")
#>  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32