A number of distributions provided by gbm have model specific parameters associated with them. All distributions have parameters associated with them such as the mean or variance; however, certain distributions require additional data to be defined fully. This additional data is referred to as “model specific parameters”. This document describes how to correctly specify these parameters on construction of the associated GBMDist object as well as their default values.
There are 5 distributions within gbm which have additional parameters associated with them. These distributions are: CoxPH, Pairwise, Quantile, TDist and Tweedie.
The Cox proportional hazards model has several model specific parameters associated with it. All of them are optional but play important roles in the boosting process.
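Of these parameters, prior_node_coeff_var (described later in this section) is perhaps the least intuitive, so a rough illustration may help. The sketch below is plain R and mirrors the description in this document rather than the package's internal implementation; the helper name regularised_log_ratio is hypothetical.

```r
# Hypothetical helper: log ratio of observed to expected events, with
# 1 / prior_node_coeff_var added to both the numerator and the denominator.
regularised_log_ratio <- function(events, expected, prior_node_coeff_var = 1000) {
  prior <- 1 / prior_node_coeff_var
  log((events + prior) / (expected + prior))
}

# Without the prior, zero counts give non-finite values
log(0 / 2)   # -Inf
log(0 / 0)   # NaN

# With the prior the values stay finite
regularised_log_ratio(events = 0, expected = 2)
regularised_log_ratio(events = 0, expected = 0)   # exactly 0
regularised_log_ratio(events = 3, expected = 2)   # close to log(3 / 2)
```

With the default of 1000 the added term is small, so it only matters when the counts themselves are at or near zero.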
strata: a vector of positive integers indicating which stratum each row of data belongs to. If there are multiple rows per observation then this should be reflected in the strata vector. If not specified it is assumed all training data are in the same stratum and all test data are in another stratum.
sorted: a vector specifying how the rows of data are ordered within their strata; the order within strata is the reverse order of the censored times or start times of the survival data. This vector is completely optional and will be calculated by gbmt.
ties: a string specifying the method by which ties are broken. Currently the “breslow” and “efron” approximations are implemented, with the latter being the default.
prior_node_coeff_var: a double used to regularize the model predictions in gbm. It represents the prior on the number of events in the model. The predictions of the GBMFit are given by \(\log(\text{number of events} / \text{expected number of events})\). Both the number of events in a dataset and the model’s expected number of events could be \(0\), leading to non-finite behaviour. The inverse of this parameter is added to both the numerator and denominator appearing in the log ratio so as to ensure the predictions are finite. The default value is \(1000\), representing a base event number of \(1/1000\) events irrespective of the value of the measured or expected number of events.
The “Pairwise” distribution implements ranking measures following the
LambdaMART algorithm. Observations belong to groups, and all pairs of items that belong to the same group but have different labels are used for training. The distribution requires a character vector with the column names of the data that jointly indicate the group an observation belongs to. This character vector is passed to the group argument on construction. When training with a Pairwise distribution, the tree growing algorithm maximises the utility of a chosen information retrieval (IR) metric. The metric parameter stores the selection and currently the IR metrics available are:
conc: fraction of concordant pairs; for binary labels this is equivalent to the area under the ROC curve.
mrr: mean reciprocal rank of the highest-ranked positive instance.
map: mean average precision, a generalisation of mrr to multiple positive instances.
ndcg: normalised discounted cumulative gain.
The default for group is "query" while metric defaults to "ndcg". If map or mrr are selected the response must be in \(\{0, 1\}\). A cut-off in the ranking of items in a group can be specified via max_rank; the default for this is 0 (all ranks taken into account) and it is only applicable for “ndcg” and “mrr”. Finally, the group_index or label can be specified directly; note this is optional and will be calculated by gbmt.
# Create pairwise grouped data
# create query groups, with an average size of 25 items each
N <- 1000
num.queries <- floor(N/25)
query <- sample(1:num.queries, N, replace=TRUE)
# X1 is a variable determined by query group only
query.level <- runif(num.queries)
X1 <- query.level[query]
# X2 varies with each item
X2 <- runif(N)
# X3 is uncorrelated with target
X3 <- runif(N)
# The target
Y <- X1 + X2
# Add some random noise to X2 that is correlated with
# queries, but uncorrelated with items
X2 <- X2 + scale(runif(num.queries))[query]
# Add some random noise to target
SNR <- 5 # signal-to-noise ratio
sigma <- sqrt(var(Y)/SNR)
Y <- Y + runif(N, 0, sigma)
data <- data.frame(Y, query=query, X1, X2, X3)
# Create appropriate Pairwise object
pair_dist <- gbm_dist(name="Pairwise", group="query", max_rank=1, metric="ndcg")
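To make the effect of the max_rank cut-off concrete, the following sketch computes NDCG for a single group in plain R. It assumes one common textbook definition (gain equal to the label, discount of 1 / log2(rank + 1)); the helper names dcg and ndcg are hypothetical and the formula used inside the package may differ.

```r
# Hypothetical helpers: DCG / NDCG for one group, textbook definition.
# max_rank = 0 means no cut-off (all ranks taken into account).
dcg <- function(labels_in_ranked_order, max_rank = 0) {
  n <- length(labels_in_ranked_order)
  k <- if (max_rank > 0) min(max_rank, n) else n
  ranks <- seq_len(k)
  sum(labels_in_ranked_order[ranks] / log2(ranks + 1))
}

ndcg <- function(labels, scores, max_rank = 0) {
  ranked <- labels[order(scores, decreasing = TRUE)]   # items ordered by model score
  ideal  <- sort(labels, decreasing = TRUE)            # best possible ordering
  ideal_dcg <- dcg(ideal, max_rank)
  if (ideal_dcg == 0) return(0)                        # no relevant items in the group
  dcg(ranked, max_rank) / ideal_dcg
}

labels <- c(3, 0, 2, 1)
scores <- c(0.2, 0.9, 0.8, 0.1)   # the irrelevant item is scored highest

ndcg(labels, scores)                # all ranks taken into account
ndcg(labels, scores, max_rank = 1)  # only the top-ranked item counts
ndcg(labels, labels)                # scores that reproduce the ideal ordering
```

A small max_rank makes the metric sensitive only to the top of each ranking, which is why the second call ignores the well-placed items at ranks 2 and 3.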
To perform quantile regression a QuantileGBMDist object must be passed to gbmt. The quantile to estimate is stored in the parameter alpha and this defaults to 0.25.
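The role of alpha can be illustrated with the quantile (pinball) loss that quantile regression minimises. The sketch below is plain R and does not call the package; the helper pinball_loss is hypothetical. It shows that the constant minimising the loss lies close to the empirical alpha-quantile of the response.

```r
# Hypothetical helper: mean quantile (pinball) loss of a constant prediction
pinball_loss <- function(pred, y, alpha) {
  resid <- y - pred
  mean(ifelse(resid >= 0, alpha * resid, (alpha - 1) * resid))
}

set.seed(1)
y <- rexp(2000)
alpha <- 0.25   # the default value quoted above

# Find the constant that minimises the loss over the range of y
best <- optimize(pinball_loss, interval = range(y), y = y, alpha = alpha)$minimum

best
quantile(y, alpha)   # should be close to `best`
```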
The t-distribution requires its degrees of freedom (df) to be set. The default value for this is four but it can be specified on construction of the associated GBMDist object.
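The degrees of freedom control how heavy the tails of the t-distribution are. As a small illustration using only base R's qt and qnorm (not the package), the sketch below compares upper quantiles for a few values of df, including the default of four.

```r
# Upper 97.5% quantiles of the t-distribution for several degrees of freedom,
# compared with the standard normal. Smaller df means heavier tails.
dfs <- c(2, 4, 10, 100)
t_quantiles <- qt(0.975, df = dfs)
names(t_quantiles) <- paste0("df=", dfs)

t_quantiles
qnorm(0.975)   # the limiting value as df grows
```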
The Tweedie distribution relates the variance of the response to its expectation via \(\mathrm{Var}(Y) = E[Y]^p\), where p is the power of the distribution. This parameter is specified through the power named argument on calling gbm_dist and its default value is 1.5.
# Create a TweedieGBMDist object with a power of 2 - equivalent to a Gamma distribution
tweedie_dist <- gbm_dist(name="Tweedie", power=2)
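Tweedie distributions include various more familiar distributions which can be accessed through setting the power parameter:
Normal distribution: power=0.
Poisson distribution: power=1.
Compound Poisson-Gamma distribution: 1 < power < 2.
Gamma distribution: power=2.
Positive stable distributions: 2 < power < 3 and power > 3.
Inverse Gaussian distribution: power=3.
Note no Tweedie models exist for 0 < power < 1.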
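The variance relation quoted above can be checked empirically for two of the familiar special cases. The sketch below uses only base R random generators, not the package: for Poisson data the variance grows like the first power of the mean, and for Gamma data with a fixed shape it grows like the square of the mean. The shape value is an arbitrary choice for illustration, and in the Gamma case the variance carries a constant factor of 1/shape in front of the squared mean.

```r
set.seed(42)
n <- 1e5
mu <- c(2, 4, 8)

# Poisson: Var(Y) = E[Y]^1
pois_var <- sapply(mu, function(m) var(rpois(n, lambda = m)))

# Gamma with fixed shape: Var(Y) = E[Y]^2 / shape, i.e. proportional to E[Y]^2
shape <- 2
gamma_var <- sapply(mu, function(m) var(rgamma(n, shape = shape, rate = shape / m)))

# Estimate the power as the slope of log(variance) against log(mean)
pois_power  <- unname(coef(lm(log(pois_var) ~ log(mu)))[2])
gamma_power <- unname(coef(lm(log(gamma_var) ~ log(mu)))[2])

round(c(poisson = pois_power, gamma = gamma_power), 2)   # roughly 1 and 2
```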