Assessment
Auto-generated from the signals above. Thresholds are transparent and tunable in recommend.py.
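As a rough illustration of how threshold-driven findings like the ones below can be produced, the following sketch checks three signals against their limits. The limits (recall floor 0.85, PSI 0.25, ECE 0.1) are the ones quoted in this report; the names `THRESHOLDS` and `assess` are hypothetical and are not taken from recommend.py, whose actual structure may differ.

```python
# Hypothetical sketch: names and structure are assumptions, not recommend.py's contents.
THRESHOLDS = {
    "recall_floor": 0.85,  # minimum acceptable recall on reportable complaints
    "psi_max": 0.25,       # PSI above this counts as significant drift
    "ece_max": 0.10,       # ECE above this counts as poor calibration
}


def assess(signals, thresholds=THRESHOLDS):
    """Return a list of (tag, message) findings for every breached limit."""
    findings = []
    recall, psi, ece = signals["recall"], signals["psi"], signals["ece"]

    if recall < thresholds["recall_floor"]:
        floor = thresholds["recall_floor"]
        findings.append(("TUNING", f"recall {recall:.2f} is below the {floor:.2f} floor"))
    if psi > thresholds["psi_max"]:
        limit = thresholds["psi_max"]
        findings.append(("RETRAINING", f"PSI {psi:.2f} exceeds {limit:.2f}"))
    if ece > thresholds["ece_max"]:
        limit = thresholds["ece_max"]
        findings.append(("TUNING", f"ECE {ece:.2f} exceeds {limit:.2f}"))
    return findings


if __name__ == "__main__":
    # Signal values as reported in this assessment.
    for tag, message in assess({"recall": 0.76, "psi": 0.33, "ece": 0.14}):
        print(f"{tag}: {message}")
```

Keeping the limits in one dictionary is one way to make them easy to tune without touching the rule logic.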
Strengths
- Precision is high (0.97): when the model flags a complaint as reportable it is almost always correct, so reviewers aren't swamped by false alarms.
Risks and weaknesses
- Recall on reportable complaints is 0.76, below the 0.85 floor. 454 reportable complaints were missed (false negatives), the most costly error type for compliance.
- Recall dropped 0.20 vs the benchmark window (0.96 -> 0.76), indicating active degradation.
- F1 fell 0.11 vs benchmark (0.96 -> 0.85).
- Significant drift detected (overall PSI 0.33 > 0.25). Inputs/scores have moved materially from the benchmark.
- Unseen categories appeared in the current window: Buy now pay later, Cryptocurrency wallet, Peer-to-peer payment app. The model was never trained on these.
- Model is poorly calibrated (ECE 0.14 > 0.1); confidence scores do not reflect true correctness and should not gate automation.
- Prediction bias: the model is under-flagging reportable cases (predicted positive rate 0.49 vs actual 0.63).
- Subgroup skew: 'Cryptocurrency wallet' has recall 0.50, 0.26 below the overall 0.76; the model misses half of the reportable complaints in that segment.
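For readers unfamiliar with the drift and calibration figures above, the sketch below shows one common way to compute PSI over a categorical feature and ECE with equal-width confidence bins. It is an illustration under assumptions: the function names, the `eps` smoothing value, and the choice of 10 bins are not taken from this report's pipeline, which may bin or smooth differently.

```python
import math
from collections import Counter


def psi(benchmark, current, eps=1e-4):
    """Population Stability Index between two samples of category labels."""
    b_counts, c_counts = Counter(benchmark), Counter(current)
    total = 0.0
    for category in set(benchmark) | set(current):
        # eps (an assumed smoothing value) keeps the log finite when a
        # category is absent from one window, e.g. an unseen category.
        b = max(b_counts[category] / len(benchmark), eps)
        c = max(c_counts[category] / len(current), eps)
        total += (c - b) * math.log(c / b)
    return total


def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error using equal-width confidence bins.

    confidences: model confidence in its predicted class, in [0, 1].
    correct: booleans, True where the prediction matched the label.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[index].append((conf, ok))

    n = len(confidences)
    total = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(conf for conf, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        total += (len(members) / n) * abs(avg_conf - accuracy)
    return total
```

A category that is missing from the benchmark window contributes a large PSI term, which is consistent with unseen categories and a high drift score appearing together.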
Recommended controls
- TUNING: lower the decision threshold (favour recall) and/or apply class weighting; re-validate precision impact before rollout.
- RETRAINING: schedule a refresh on recent labeled data; the input distribution has moved away from the training benchmark.
- RETRAINING / DATA: add labeled examples of the unseen categories listed above to the training set; until then, route them to human review.
- TUNING: apply post-hoc calibration (Platt/temperature scaling) so confidence can safely drive auto-routing thresholds.
- DATA / FAIRNESS: oversample the weak subgroup and add segment-aware evaluation gates so regressions there are caught early.
- MONITORING: schedule this report on a rolling window (e.g. weekly), alert on recall < floor or PSI > 0.25, and keep a human-reviewed gold set for ground-truth validation.
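The threshold-lowering control above can be sketched as a simple sweep: pick the highest decision threshold whose recall still meets the floor, then inspect the precision at that point before rollout. This is a minimal sketch, assuming binary labels (1 = reportable) and model scores from a held-out validation set; `precision_recall` and `pick_threshold` are hypothetical helper names, not part of recommend.py.

```python
def precision_recall(scores, labels, threshold):
    """Precision and recall when scores >= threshold are flagged as reportable."""
    pairs = list(zip(scores, labels))
    tp = sum(1 for s, y in pairs if s >= threshold and y == 1)
    fp = sum(1 for s, y in pairs if s >= threshold and y == 0)
    fn = sum(1 for s, y in pairs if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def pick_threshold(scores, labels, recall_floor=0.85):
    """Return (threshold, precision, recall) for the highest threshold
    that meets the recall floor, or None if no threshold does."""
    # Recall can only rise as the threshold falls, so the first hit when
    # scanning from high to low is the highest qualifying threshold.
    for threshold in sorted(set(scores), reverse=True):
        precision, recall = precision_recall(scores, labels, threshold)
        if recall >= recall_floor:
            return threshold, precision, recall
    return None


if __name__ == "__main__":
    # Toy validation data, made up purely for illustration.
    scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3]
    labels = [1, 1, 0, 1, 1, 0]
    print(pick_threshold(scores, labels, recall_floor=0.85))
```

Returning precision alongside the chosen threshold makes the trade-off visible, so the precision impact can be re-validated as the control recommends.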
Recommendation tags: TUNING (threshold/calibration/class-weight changes) · RETRAINING (refresh on recent labeled data) · DATA/FAIRNESS (fix coverage gaps) · MONITORING (scheduling, alerting, gold-set validation).