Compare Empirical Data to Distributions
Source: R/util-distribution-comparison.R
tidy_distribution_comparison.Rd
Compare an empirical data set against several candidate distributions to help identify the one that may fit best.
Arguments
- .x
The data set being passed to the function
- .distribution_type
The kind of data being passed; must be one of continuous or discrete
Details
This function takes the data set provided and tries to find the
distribution that may fit it best. The .distribution_type parameter
must be set to either continuous or discrete so that the function
tries the appropriate types of distributions.
The following distributions are used:
Continuous:
tidy_beta
tidy_cauchy
tidy_exponential
tidy_gamma
tidy_logistic
tidy_lognormal
tidy_normal
tidy_pareto
tidy_uniform
tidy_weibull
Discrete:
tidy_binomial
tidy_geometric
tidy_hypergeometric
tidy_poisson
The function returns a list of tibbles. The following tibbles are returned:
comparison_tbl
deviance_tbl
total_deviance_tbl
aic_tbl
kolmogorov_smirnov_tbl
multi_metric_tbl
The comparison_tbl is a long tibble that lists the values of the density
function against the given data.
The deviance_tbl and the total_deviance_tbl give the simple difference
between the actual (empirical) density and the estimated density for each estimated distribution.
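The idea of a density difference can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: the sample `x`, the 50-point grid, the choice of a normal fit, and the names `deviance_vals` and `abs_total_deviance` are all hypothetical, and taking the absolute value of the summed differences is just one plausible reading of the abs_tot_deviance column.

```r
# Sketch only: difference between an empirical density and a fitted density.
set.seed(123)
x <- rnorm(100, mean = 20, sd = 6)   # hypothetical sample

# Empirical kernel density evaluated on a 50-point grid
emp <- density(x, n = 50)

# Estimated density of a normal fit, evaluated on the same grid
est_dy <- dnorm(emp$x, mean = mean(x), sd = sd(x))

# Pointwise difference: actual (empirical) minus estimated
deviance_vals <- emp$y - est_dy

# One possible single-number summary (assumption, see lead-in)
abs_total_deviance <- abs(sum(deviance_vals))
abs_total_deviance
```

A smaller summary value suggests the fitted curve tracks the empirical density more closely on this grid, although positive and negative differences can cancel when summed.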
The aic_tbl provides the AIC of an lm model of the estimated density
against the empirical density.
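A minimal base R sketch of computing an AIC from such a regression follows. It assumes the model regresses estimated density values on empirical density values over a shared grid; the data, the normal fit, and the names `dens_df`, `fit`, and `aic_value` are hypothetical and may differ from what the function does internally.

```r
# Sketch only: AIC of a linear model of estimated vs. empirical density.
set.seed(123)
x <- rnorm(100, mean = 20, sd = 6)   # hypothetical sample

emp <- density(x, n = 50)            # empirical density on a 50-point grid
dens_df <- data.frame(
  emp_dy = emp$y,                                       # empirical density
  est_dy = dnorm(emp$x, mean = mean(x), sd = sd(x))     # estimated density
)

# Regress the estimated density on the empirical density
fit <- lm(est_dy ~ emp_dy, data = dens_df)

# Akaike information criterion of the fitted model
aic_value <- AIC(fit)
aic_value
```

Because density values are small, AIC values from models like this are often negative, which is consistent with the aic_value column shown in the example output.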
The kolmogorov_smirnov_tbl currently provides the result of a ks.test,
using the two.sided alternative, of the estimated density against the empirical density.
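For readers unfamiliar with ks.test, a two-sample, two-sided test in base R looks roughly like this. The sketch assumes a comparison between an empirical sample and draws from a fitted distribution; the data and the names `x`, `y`, and `ks` are hypothetical, and the exact inputs and options the function passes to ks.test are not confirmed here.

```r
# Sketch only: two-sided, two-sample Kolmogorov-Smirnov test.
set.seed(123)
x <- rnorm(50, mean = 20, sd = 6)            # hypothetical empirical sample
y <- rnorm(50, mean = mean(x), sd = sd(x))   # draws from a fitted normal

ks <- ks.test(x, y, alternative = "two.sided")

ks$statistic    # maximum distance between the two empirical CDFs
ks$p.value      # p-value under the null that both samples share a distribution
ks$alternative  # label of the alternative hypothesis
```

A small statistic together with a large p-value gives no evidence against the candidate distribution; it does not by itself establish that the fit is good.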
The multi_metric_tbl summarises all of these metrics in a single tibble.
Examples
xc <- mtcars$mpg
output_c <- tidy_distribution_comparison(xc, "continuous")
#> For the beta distribution, its mean 'mu' should be 0 < mu < 1. The data will
#> therefore be scaled to enforce this.
xd <- trunc(xc)
output_d <- tidy_distribution_comparison(xd, "discrete")
output_c
#> $comparison_tbl
#> # A tibble: 352 × 8
#> sim_number x y dx dy p q dist_type
#> <fct> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 1 1 21 2.97 0.000114 0.625 10.4 Empirical
#> 2 1 2 21 4.21 0.000455 0.625 10.4 Empirical
#> 3 1 3 22.8 5.44 0.00142 0.781 13.3 Empirical
#> 4 1 4 21.4 6.68 0.00355 0.688 14.3 Empirical
#> 5 1 5 18.7 7.92 0.00721 0.469 14.7 Empirical
#> 6 1 6 18.1 9.16 0.0124 0.438 15 Empirical
#> 7 1 7 14.3 10.4 0.0192 0.125 15.2 Empirical
#> 8 1 8 24.4 11.6 0.0281 0.812 15.2 Empirical
#> 9 1 9 22.8 12.9 0.0395 0.781 15.5 Empirical
#> 10 1 10 19.2 14.1 0.0516 0.531 15.8 Empirical
#> # … with 342 more rows
#>
#> $deviance_tbl
#> # A tibble: 352 × 2
#> name value
#> <chr> <dbl>
#> 1 Empirical 0.451
#> 2 Beta c(1.11, 1.58, 0) -0.457
#> 3 Cauchy c(19.2, 7.38) 0.0778
#> 4 Exponential c(0.05) 0.234
#> 5 Gamma c(11.47, 1.75) 0.381
#> 6 Logistic c(20.09, 3.27) 0.179
#> 7 Lognormal c(2.96, 0.29) 0.300
#> 8 Pareto c(10.4, 1.62) 0.451
#> 9 Uniform c(8.34, 31.84) -0.356
#> 10 Weibull c(3.58, 22.29) -0.105
#> # … with 342 more rows
#>
#> $total_deviance_tbl
#> # A tibble: 10 × 2
#> dist_with_params abs_tot_deviance
#> <chr> <dbl>
#> 1 Cauchy c(19.2, 7.38) 0.0785
#> 2 Beta c(1.11, 1.58, 0) 0.444
#> 3 Logistic c(20.09, 3.27) 1.15
#> 4 Gamma c(11.47, 1.75) 1.66
#> 5 Uniform c(8.34, 31.84) 2.66
#> 6 Weibull c(3.58, 22.29) 3.36
#> 7 Gaussian c(20.09, 5.93) 3.47
#> 8 Lognormal c(2.96, 0.29) 5.64
#> 9 Exponential c(0.05) 6.19
#> 10 Pareto c(10.4, 1.62) 9.51
#>
#> $aic_tbl
#> # A tibble: 10 × 3
#> dist_type aic_value abs_aic
#> <fct> <dbl> <dbl>
#> 1 Beta c(1.11, 1.58, 0) -48.9 48.9
#> 2 Pareto c(10.4, 1.62) 106. 106.
#> 3 Gaussian c(20.09, 5.93) -167. 167.
#> 4 Lognormal c(2.96, 0.29) -169. 169.
#> 5 Gamma c(11.47, 1.75) -179. 179.
#> 6 Weibull c(3.58, 22.29) -197. 197.
#> 7 Uniform c(8.34, 31.84) -207. 207.
#> 8 Logistic c(20.09, 3.27) -217. 217.
#> 9 Cauchy c(19.2, 7.38) -233. 233.
#> 10 Exponential c(0.05) -236. 236.
#>
#> $kolmogorov_smirnov_tbl
#> # A tibble: 10 × 6
#> dist_type ks_statistic ks_pvalue ks_method alter…¹ dist_…²
#> <fct> <dbl> <dbl> <chr> <chr> <chr>
#> 1 Beta c(1.11, 1.58, 0) 0.781 0.000500 Monte-Carlo t… two-si… Beta c…
#> 2 Cauchy c(19.2, 7.38) 0.375 0.0210 Monte-Carlo t… two-si… Cauchy…
#> 3 Exponential c(0.05) 0.531 0.000500 Monte-Carlo t… two-si… Expone…
#> 4 Gamma c(11.47, 1.75) 0.125 0.969 Monte-Carlo t… two-si… Gamma …
#> 5 Logistic c(20.09, 3.27) 0.188 0.619 Monte-Carlo t… two-si… Logist…
#> 6 Lognormal c(2.96, 0.29) 0.312 0.0930 Monte-Carlo t… two-si… Lognor…
#> 7 Pareto c(10.4, 1.62) 0.5 0.00150 Monte-Carlo t… two-si… Pareto…
#> 8 Uniform c(8.34, 31.84) 0.281 0.161 Monte-Carlo t… two-si… Unifor…
#> 9 Weibull c(3.58, 22.29) 0.25 0.277 Monte-Carlo t… two-si… Weibul…
#> 10 Gaussian c(20.09, 5.93) 0.125 0.972 Monte-Carlo t… two-si… Gaussi…
#> # … with abbreviated variable names ¹alternative, ²dist_char
#>
#> $multi_metric_tbl
#> # A tibble: 10 × 8
#> dist_type abs_t…¹ aic_v…² abs_aic ks_st…³ ks_pv…⁴ ks_me…⁵ alter…⁶
#> <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr>
#> 1 Cauchy c(19.2, 7.38) 0.0785 -233. 233. 0.375 2.10e-2 Monte-… two-si…
#> 2 Beta c(1.11, 1.58, 0) 0.444 -48.9 48.9 0.781 5.00e-4 Monte-… two-si…
#> 3 Logistic c(20.09, 3.… 1.15 -217. 217. 0.188 6.19e-1 Monte-… two-si…
#> 4 Gamma c(11.47, 1.75) 1.66 -179. 179. 0.125 9.69e-1 Monte-… two-si…
#> 5 Uniform c(8.34, 31.8… 2.66 -207. 207. 0.281 1.61e-1 Monte-… two-si…
#> 6 Weibull c(3.58, 22.2… 3.36 -197. 197. 0.25 2.77e-1 Monte-… two-si…
#> 7 Gaussian c(20.09, 5.… 3.47 -167. 167. 0.125 9.72e-1 Monte-… two-si…
#> 8 Lognormal c(2.96, 0.… 5.64 -169. 169. 0.312 9.30e-2 Monte-… two-si…
#> 9 Exponential c(0.05) 6.19 -236. 236. 0.531 5.00e-4 Monte-… two-si…
#> 10 Pareto c(10.4, 1.62) 9.51 106. 106. 0.5 1.50e-3 Monte-… two-si…
#> # … with abbreviated variable names ¹abs_tot_deviance, ²aic_value,
#> # ³ks_statistic, ⁴ks_pvalue, ⁵ks_method, ⁶alternative
#>
#> attr(,".x")
#> [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
#> [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
#> [31] 15.0 21.4
#> attr(,".n")
#> [1] 32