We consider an at-order prediction setting that uses only attributes available at the moment a parcel is accepted into the system. Let
\[
\mathcal{D} = \{(\mathbf{x}_i, t_i^{\mathrm{acc}})\}_{i=1}^{N},
\]
where $\mathbf{x}_i \in \mathcal{X}$ is a typed feature vector observed at acceptance time $t_i^{\mathrm{acc}}$. We define two primary targets. The first is a continuous time-to-event label
\[
y_i^{\mathrm{ETA}} = t_i^{\mathrm{succ}} - t_i^{\mathrm{acc}},
\]
where $t_i^{\mathrm{succ}}$ is the timestamp of the first successful delivery attempt. The second is a binary first-attempt success label
\[
y_i^{\mathrm{PTC}} \in \{0, 1\},
\]
with $y_i^{\mathrm{PTC}} = 1$ if and only if the first attempt succeeds. Because some items never achieve a first success within the business horizon, we attach a right-censoring flag $\delta_i \in \{0, 1\}$, with $\delta_i = 1$ when $y_i^{\mathrm{ETA}}$ is observed and $\delta_i = 0$ for returns, cancellations, or unresolved cases.
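As a concrete illustration, the following is a minimal sketch of how the two labels and the censoring flag could be derived from raw event timestamps. The column names (\texttt{accepted\_at}, \texttt{first\_success\_at}, \texttt{first\_attempt\_ok}) and the helper name are hypothetical placeholders, not the production schema:
\begin{verbatim}
import pandas as pd

def build_labels(df: pd.DataFrame, horizon_hours: float) -> pd.DataFrame:
    """Derive ETA/PTC labels and the right-censoring flag.

    Assumed (hypothetical) columns:
      accepted_at      -- acceptance timestamp t_acc
      first_success_at -- timestamp of first successful delivery (NaT if none)
      first_attempt_ok -- bool, whether the first attempt succeeded
    """
    out = df.copy()
    # Continuous time-to-event label: hours from acceptance to first success.
    elapsed = out["first_success_at"] - out["accepted_at"]
    out["y_eta"] = elapsed.dt.total_seconds() / 3600.0
    # Binary first-attempt success label.
    out["y_ptc"] = out["first_attempt_ok"].astype(int)
    # Censoring flag: 1 when a first success is observed within the horizon,
    # 0 for returns, cancellations, or unresolved cases (y_eta is NaN there).
    observed = out["first_success_at"].notna() & (out["y_eta"] <= horizon_hours)
    out["delta"] = observed.astype(int)
    return out
\end{verbatim}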
The model maps $\mathbf{x}_i$ to two outputs used by operations:
\[
\hat{y}_i = f_{\mathrm{ETA}}(\mathbf{x}_i), \qquad \hat{p}_i = f_{\mathrm{PTC}}(\mathbf{x}_i),
\]
where $\hat{y}_i$ is a point estimate of $y_i^{\mathrm{ETA}}$ and $\hat{p}_i$ is a calibrated probability of first-attempt success. For ETA we report point-error metrics,
\[
\mathrm{MAE} = \frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \big| y_i^{\mathrm{ETA}} - \hat{y}_i \big|,
\qquad
\mathrm{RMSE} = \sqrt{\frac{1}{|\mathcal{D}|} \sum_{i \in \mathcal{D}} \big( y_i^{\mathrm{ETA}} - \hat{y}_i \big)^2},
\]
together with tail summaries $P_{90}$ and $P_{95}$, defined as the 90th and 95th percentiles of $\big| y_i^{\mathrm{ETA}} - \hat{y}_i \big|$ on the evaluation set $\mathcal{D}_{\mathrm{test}}$. For PTC we adopt two complementary classification metrics: (i) \emph{ROC-AUC}, the area under the receiver operating characteristic curve, which is threshold-free and measures the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. Formally, if $s(\mathbf{x})$ denotes the score, then
\[
\mathrm{AUC} = \Pr\!\big( s(\mathbf{x}^{+}) > s(\mathbf{x}^{-}) \big),
\]
making it insensitive to any monotone calibration of $s$ and robust under class imbalance; (ii) \emph{F1 score}, the harmonic mean of precision and recall after thresholding the probability at $\tau$:
\[
\mathrm{F1}(\tau) = \frac{2 \, \mathrm{Precision}(\tau) \, \mathrm{Recall}(\tau)}{\mathrm{Precision}(\tau) + \mathrm{Recall}(\tau)}.
\]
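For reference, a minimal sketch of the evaluation metrics above using NumPy and scikit-learn; the function names and the arrays \texttt{y\_true}, \texttt{y\_pred}, \texttt{p\_hat}, and \texttt{tau} are illustrative placeholders:
\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_auc_score, f1_score, precision_score, recall_score

def eta_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = np.abs(y_true - y_pred)
    return {
        "MAE": err.mean(),
        "RMSE": np.sqrt(((y_true - y_pred) ** 2).mean()),
        "P90": np.percentile(err, 90),  # tail summaries of |error|
        "P95": np.percentile(err, 95),
    }

def ptc_metrics(y_true: np.ndarray, p_hat: np.ndarray, tau: float) -> dict:
    y_bin = (p_hat >= tau).astype(int)
    return {
        "AUC": roc_auc_score(y_true, p_hat),  # threshold-free ranking metric
        "F1": f1_score(y_true, y_bin),
        "Precision": precision_score(y_true, y_bin),
        "Recall": recall_score(y_true, y_bin),
    }
\end{verbatim}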
We select $\tau$ on the validation split by maximizing F1 (unless otherwise specified) and then report test-set F1 at that fixed $\tau$; a sketch of this sweep appears at the end of this paragraph. This provides an operational view of the success/failure trade-off, complementary to AUC's ranking perspective. For transparency, we also report the associated precision and recall at the chosen $\tau$. Features are computed exclusively from the acceptance-time state; aggregates (e.g., per-post-office or per-day load) use leave-one-out logic so that the index parcel does not influence its own features. Categorical vocabularies and numeric normalizers are fit on training data and then frozen. Personally identifiable information such as names, phone numbers, and exact street numbers is dropped or hashed before modeling. Data are split chronologically by $t^{\mathrm{acc}}$ into train/validation/test segments, and we also perform rolling-origin evaluation. Portability and equity are assessed via leave-one-district/post-office-out generalization and parity gaps such as
\[
\Delta_{\mathrm{MAE}} = \max_{g} \mathrm{MAE}_g - \min_{g} \mathrm{MAE}_g,
\]
where $g$ indexes districts or post offices, with analogous gaps for classification metrics.
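As promised above, a minimal sketch of the validation-split threshold sweep, assuming validation arrays \texttt{y\_val} and \texttt{p\_val} (hypothetical names):
\begin{verbatim}
import numpy as np
from sklearn.metrics import f1_score

def select_threshold(y_val: np.ndarray, p_val: np.ndarray) -> float:
    """Pick tau on the validation split by maximizing F1 over a grid."""
    taus = np.linspace(0.01, 0.99, 99)
    f1s = [
        f1_score(y_val, (p_val >= t).astype(int), zero_division=0)
        for t in taus
    ]
    return float(taus[int(np.argmax(f1s))])
\end{verbatim}
The selected $\tau$ is then frozen and reused unchanged on the test split, so test-set F1 reflects a threshold chosen without test-set information.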
Training uses a joint objective that couples robust regression with probabilistic classification:
\[
\mathcal{L} = \lambda_{\mathrm{ETA}} \, \ell_{\mathrm{rob}}\big( y^{\mathrm{ETA}}, \hat{y} \big) + \lambda_{\mathrm{PTC}} \, \ell_{\mathrm{cls}}\big( y^{\mathrm{PTC}}, \hat{p} \big),
\]
where $\ell_{\mathrm{rob}}$ is a robust regression loss (e.g., Huber) and $\ell_{\mathrm{cls}}$ is the binary cross-entropy. The weights $\lambda_{\mathrm{ETA}}$ and $\lambda_{\mathrm{PTC}}$ are fixed, and PTC class imbalance is handled by inverse-frequency example weights. We optimize with AdamW and cosine learning-rate decay with warmup, apply dropout in attention and MLP blocks, and use early stopping on a composite validation score. After training we calibrate $\hat{p}$ using temperature scaling or isotonic regression on the validation split; note that AUC is invariant to monotone calibration, while F1 may change through threshold re-optimization.
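For completeness, a minimal sketch of both post-hoc calibration options on the validation split. The temperature-scaling variant assumes raw logits are available; the isotonic variant uses scikit-learn. Function and argument names are illustrative, not the exact training code:
\begin{verbatim}
import numpy as np
from scipy.optimize import minimize_scalar
from sklearn.isotonic import IsotonicRegression

def fit_temperature(logits_val: np.ndarray, y_val: np.ndarray) -> float:
    """Temperature scaling: find T > 0 minimizing validation NLL of
    sigmoid(logits / T). A single scalar, so it is monotone and
    leaves AUC unchanged."""
    def nll(T: float) -> float:
        p = 1.0 / (1.0 + np.exp(-logits_val / T))
        p = np.clip(p, 1e-7, 1 - 1e-7)
        return -np.mean(y_val * np.log(p) + (1 - y_val) * np.log(1 - p))
    return minimize_scalar(nll, bounds=(0.05, 10.0), method="bounded").x

def fit_isotonic(p_val: np.ndarray, y_val: np.ndarray) -> IsotonicRegression:
    """Isotonic regression: nonparametric monotone map from raw scores
    to calibrated probabilities; apply via iso.predict(p_test)."""
    iso = IsotonicRegression(out_of_bounds="clip")
    iso.fit(p_val, y_val)
    return iso
\end{verbatim}
Because both maps are monotone, the validation-selected $\tau$ can simply be re-optimized on the calibrated scores when reporting F1.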