Assessment

Auto-generated from the signals above. Thresholds are transparent and tunable in recommend.py.
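As a rough illustration of what tunable thresholds could look like, the sketch below encodes the three cut-offs quoted in this report (recall floor 0.85, PSI 0.25, ECE 0.1) and checks a set of signals against them. The names `Thresholds` and `check_signals` are hypothetical and are not taken from recommend.py, whose actual structure may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical container; defaults are the cut-offs quoted in this report."""
    recall_floor: float = 0.85
    psi_max: float = 0.25
    ece_max: float = 0.10


def check_signals(recall: float, psi: float, ece: float,
                  t: Thresholds = Thresholds()) -> list[str]:
    """Return one finding per breached threshold (empty list if none)."""
    findings = []
    if recall < t.recall_floor:
        findings.append(f"recall {recall:.2f} below floor {t.recall_floor:.2f}")
    if psi > t.psi_max:
        findings.append(f"PSI {psi:.2f} above {t.psi_max:.2f}")
    if ece > t.ece_max:
        findings.append(f"ECE {ece:.2f} above {t.ece_max:.2f}")
    return findings


if __name__ == "__main__":
    # Signals reported for the current window.
    for line in check_signals(recall=0.76, psi=0.33, ece=0.14):
        print(line)
```

Keeping the cut-offs in one immutable object means a change to, say, the recall floor is a one-line edit that every check picks up.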

Strengths

Precision is high (0.97): when the model flags a complaint as reportable it is almost always correct, so reviewers aren't swamped by false alarms.

Risks and weaknesses

Recall on reportable complaints is 0.76, below the 0.85 floor. 454 reportable complaints were missed (false negatives) — the most costly error for compliance.
Recall dropped 0.20 vs the benchmark window (0.96 -> 0.76), which points to active degradation rather than a stable shortfall.
F1 fell 0.11 vs benchmark (0.96 -> 0.85).
Significant drift detected (overall PSI 0.33 > 0.25). Inputs/scores have moved materially from the benchmark.
Unseen categories appeared in the current window: Buy now pay later, Cryptocurrency wallet, Peer-to-peer payment app. The model was never trained on these.
Model is poorly calibrated (ECE 0.14 > 0.1); confidence scores do not reflect true correctness and should not gate automation.
Prediction bias: the model is under-flagging reportable cases (predicted positive rate 0.49 vs actual 0.63).
Subgroup skew: 'Cryptocurrency wallet' has recall 0.50, 0.26 below the overall 0.76; the model misses half of the reportable complaints in that segment.
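For readers unfamiliar with the drift and calibration figures quoted above, here is a minimal sketch of how PSI and ECE are commonly computed. It assumes pre-binned proportions for PSI and equal-width confidence bins for ECE; the exact binning and aggregation behind this report's numbers are not stated, so treat the function names and defaults as illustrative only.

```python
import math


def psi(expected: list[float], actual: list[float], eps: float = 1e-6) -> float:
    """Population Stability Index over pre-binned proportions.

    `expected` and `actual` are per-bin shares (each summing to 1) for the
    benchmark and current windows. `eps` guards against empty bins.
    """
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


def ece(confidences: list[float], correct: list[bool], n_bins: int = 10) -> float:
    """Expected Calibration Error with equal-width confidence bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in last bin
        bins[idx].append((conf, ok))
    n = len(confidences)
    total = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        total += (len(members) / n) * abs(avg_conf - accuracy)
    return total


if __name__ == "__main__":
    # Toy inputs for illustration; not the data behind this report.
    print(round(psi([0.5, 0.5], [0.8, 0.2]), 3))
    print(round(ece([0.9, 0.9, 0.9, 0.9], [True, True, True, False]), 3))
```

In the toy ECE call, the model claims 0.9 confidence but is right only three times out of four, which is the kind of gap that makes confidence unsafe for gating automation.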

Recommended controls

TUNING: lower the decision threshold (favour recall) and/or apply class weighting; re-validate precision impact before rollout.
RETRAINING: schedule a refresh on recent labeled data; the input distribution has moved away from the training benchmark.
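RETRAINING / DATA: add labeled examples of the unseen categories (Buy now pay later, Cryptocurrency wallet, Peer-to-peer payment app) to the training set; until then, route complaints in those categories to human review.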
TUNING: apply post-hoc calibration (Platt/temperature scaling) so confidence can safely drive auto-routing thresholds.
DATA / FAIRNESS: oversample the weak subgroup ('Cryptocurrency wallet') and add segment-aware evaluation gates so regressions there are caught early.
MONITORING: schedule this report on a rolling window (e.g. weekly), alert on recall < floor or PSI > 0.25, and keep a human-reviewed gold set for ground-truth validation.
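The threshold-tuning control above can be sketched as a simple sweep: find the highest decision threshold whose recall meets the floor, and report the precision that results so the trade-off can be re-validated. This is a minimal sketch that assumes scores and labels come from a held-out, human-labeled set; the function names are hypothetical and not part of recommend.py.

```python
def recall_precision(scores, labels, threshold):
    """Recall and precision when flagging every score >= threshold."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and not y)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    return recall, precision


def highest_threshold_meeting_floor(scores, labels, recall_floor=0.85):
    """Return (threshold, recall, precision) for the highest candidate
    threshold whose recall meets the floor, or None if no candidate does.

    Candidates are the observed scores. Recall never decreases as the
    threshold is lowered, so the first hit in descending order is the highest.
    """
    for t in sorted(set(scores), reverse=True):
        recall, precision = recall_precision(scores, labels, t)
        if recall >= recall_floor:
            return t, recall, precision
    return None


if __name__ == "__main__":
    # Toy scores and labels for illustration only.
    scores = [0.9, 0.8, 0.6, 0.4, 0.3, 0.2]
    labels = [1, 1, 0, 1, 1, 0]
    print(highest_threshold_meeting_floor(scores, labels))
```

Choosing the highest threshold that still meets the floor gives up as little precision as possible, which matters here because high precision is the model's main strength.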

Recommendation tags — TUNING: threshold/calibration/class-weight changes · RETRAINING: refresh on recent labeled data · DATA/FAIRNESS: fix coverage gaps · MONITORING: scheduling, alerting, gold-set validation.