What to do when the labels are skewed: tackling the Accuracy Paradox

Since the beginning of 2018 I have been trying to solve various day-to-day problems in my job using simple deep learning models (see Deep Learning on Structured Data Part 1 and Part 2).
The problem I cut my teeth on — predicting time to relief (TTR) for Db2 support tickets — had the advantage of reasonably balanced labels. The input data broke down as follows:
- 0 label — TTR ≤ 1 day: 39%
- 1 label — TTR > 1 day: 61%
On a smaller dataset (~180k records) the accuracy topped out around 73%, with a confusion matrix that looked like this:

So, while the results for this small dataset were not impressive, at least the model was not guessing the same outcome all the time.
I ran into problems when I tried to apply the same approach to a problem with very unbalanced label values: predicting whether a Db2 support ticket would generate a Duty Manager (DM) call.
Clients call the Duty Manager when they have a ticket that needs additional attention. We have live Duty Managers on call around the clock every day of the year. When a DM call comes in, the Duty Manager quickly checks the state of the ticket with the team working on it and then calls the client back to review the plan for the ticket. DM calls are disruptive for clients and they generate additional work for my team. It would be great for everybody if we could predict that a DM call was coming for a ticket and take proactive steps (such as getting additional help for the analyst working on the ticket) so that there is no need for the DM call in the first place.
Only a small proportion of tickets generate Duty Manager calls, so the label values in the input data for the DM prediction project are skewed: over 98% have label = 0 (no DM call).
When we try to run the model on this input data, accuracy looks great:

However, the confusion matrix for the validation set tells a different story — the model is never correctly predicting a DM call on the validation set:

This is a showstopper problem — the value of the model is in accurately predicting DM calls. If the model never predicts DM calls correctly it is worthless.
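To make the paradox concrete: with a roughly 98/2 split, a model that never predicts a DM call still scores about 98% plain accuracy. Here is a toy illustration of that effect (the numbers are invented to mirror the split, and scikit-learn is assumed to be available; this is not from the original notebook):

import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# toy labels mirroring the ~98/2 split, and a "model" that always predicts 0
y_true = np.array([0] * 98 + [1] * 2)
y_pred = np.zeros_like(y_true)

print(accuracy_score(y_true, y_pred))    # 0.98 -- accuracy looks great
print(confusion_matrix(y_true, y_pred))  # [[98 0] [2 0]] -- zero DM calls predicted correctly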
Without a clear idea of what to call the problem I was facing, I did some naive searches and stumbled across 8 Tactics to Combat Imbalanced Classes in Your Machine Learning Dataset. As I read through this excellent article I felt a bit deflated, because the remedies looked like a lot of work. At least my nemesis had a name: the Accuracy Paradox. Incidentally, I think this would make a cool title for a movie in the Mission Impossible franchise.

Luckily, a bit more searching uncovered what I was looking for — a straightforward way to address the Accuracy Paradox in Keras.
First, define variables to express the relative skew in the label values:
zero_weight = 1.0   # weight for the majority class (label 0, no DM call)
one_weight = 72.8   # weight for the rare class (label 1, DM call)
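The 72.8 is presumably the ratio of label-0 to label-1 records in the training data, consistent with the 98%+ skew mentioned above. A minimal sketch of how such a weight could be derived, assuming the training labels live in the dtrain.target Series that appears in the fit call below:

import numpy as np

# count the 0 and 1 labels in the training data
label_counts = np.bincount(dtrain.target)
zero_weight = 1.0
one_weight = label_counts[0] / label_counts[1]  # majority-to-minority ratio, ~72.8 for this data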
Next, update the compile statement to include the weighted_metrics option:
model.compile(loss="binary_crossentropy", optimizer=optimizer, metrics=["accuracy"], weighted_metrics=["accuracy"])
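With weighted_metrics, Keras reports accuracy weighted by the class (or sample) weights alongside plain accuracy, and the weighted version is much harder to fool. A small self-contained illustration of the difference, using the same toy numbers as above and scikit-learn's sample_weight argument to mimic the calculation:

import numpy as np
from sklearn.metrics import accuracy_score

# same toy 98/2 split with all-zero predictions as before
y_true = np.array([0] * 98 + [1] * 2)
y_pred = np.zeros_like(y_true)

# plain accuracy is 0.98; weighting each example by its class weight tells a different story
weights = np.where(y_true == 1, 72.8, 1.0)
print(accuracy_score(y_true, y_pred, sample_weight=weights))  # roughly 0.40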
Finally, update the fit statement to include the class_weight option, with a weight for each label value:
modelfit = model.fit(X_train, dtrain.target, epochs=epochs, batch_size=BATCH_SIZE,
                     validation_data=(X_valid, dvalid.target),
                     class_weight={0: zero_weight, 1: one_weight}, verbose=1)
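For completeness, the confusion matrices shown in this post come from the validation set; one way to produce them is sketched below (assuming scikit-learn and a 0.5 decision threshold, which may not match the original notebook exactly):

from sklearn.metrics import confusion_matrix

# predict probabilities for the validation set and threshold at 0.5
valid_probs = model.predict(X_valid)
valid_preds = (valid_probs > 0.5).astype(int)
print(confusion_matrix(dvalid.target, valid_preds))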
Rerunning the model with these changes, we get a different accuracy picture:

And the confusion matrix for the validation set shows that the model is actually predicting DM calls correctly some of the time:

There is still lots of work to do before DM call prediction accuracy is adequate, but at least now I have a decent starting point and a fighting chance of getting a model that will be useful.