Deep Learning on Structured Data: part 2

Mark Ryan
Mar 12, 2018
Tabular input data post encoding

A couple of weeks ago I posted an overview of a project I have been working on for a few months: applying a simple deep learning model to structured data to solve some bread-and-butter operational problems for my team, which provides support for Db2. The initial problem I decided to tackle was predicting whether the time to relief (TTR) for a given problem ticket would fall before or after a given threshold.

This article is part of a series describing my experience applying deep learning to problems involving tabular, structured data. If you are interested in an end-to-end examination of this topic involving an open data set, I am writing a book for Manning Publications, Deep Learning with Structured Data, an Early Access version of which is available online now. You can use the discount code fccryan to save on the cost of access to this book.

In this post I’ll go into some details of the code for the TTR problem as well as future directions that I would like to take for this project.

You can see a shareable version of the notebook containing the code here.

The goal of the code is to tackle the TTR problem while providing a flexible structure that makes it easy to:

  • add or remove features for the current problem
  • adapt to a new problem involving structured data with a completely different set of columns

The Kaggle submission that inspired this project was a great end-to-end example, but it suffered from hardcoded features. Adding or removing features in that code meant touching many different parts of it (filling in missing data, categorization, definition of Keras variables, and several parts of the model definition itself), which made changing the feature set for the TTR problem error-prone and time-consuming.

Cells with credentials are blocked in the shared notebook. Here is what one of the blocked cells (the one that ingests a Db2 Warehouse on Cloud table into a pandas DataFrame) looks like, so you can see how the connection is made. DSX can generate this kind of ingest code for you automatically once you have defined a connection to the database in your DSX project.
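The generated ingest code boils down to opening a DBAPI connection and reading the table into a DataFrame. Here is a rough sketch, not the notebook's actual generated code: `ingest_table` and the table name are names I've invented, and the commented `ibm_db_dbi` lines show the general connection pattern only.

```python
import pandas as pd

def ingest_table(conn, table_name):
    """Read an entire table into a pandas DataFrame via a DBAPI connection."""
    return pd.read_sql("SELECT * FROM " + table_name, conn)

# In the notebook, the connection to Db2 Warehouse on Cloud would come from
# the ibm_db driver, along these lines (credentials omitted):
#   import ibm_db, ibm_db_dbi
#   conn = ibm_db_dbi.Connection(ibm_db.connect(dsn_string, "", ""))
#   df = ingest_table(conn, "TICKETS")
```

Because `ingest_table` only assumes a DBAPI-style connection, the same function works against any database pandas can read from.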

To make the code more flexible, I attempted to remove (with the exceptions noted below) as many references to input column names as possible and instead deal with each input feature according to which of the following classes it belongs to:

  • Continuous values (like elapsed time or temperature)
  • Categorical values (like country names or days of the week)
  • Text

The code defines a list for each class of feature.

By default all columns of the input dataset are considered, and the columns of interest are narrowed down:

  • nonfeaturelist: columns that are not going to be inputs to the model. In the case of predicting TTR, this includes the columns related to the label (Time_to_relief and TTRgttthresh) along with any columns that are populated after the initial creation of the ticket.
  • textcols: columns that need text processing. For TTR prediction there is just one column in this category, Prob_Abstract_Text, which is the brief description of the problem.
  • contcols: continuous columns that contain numeric values. For TTR there are no such columns.
  • collist: the remaining columns — all of which are treated as categorical features.
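A minimal sketch of how these lists could be built, using an invented toy DataFrame (the real notebook derives the lists from the actual ticket data, and only Time_to_relief, TTRgttthresh, and Prob_Abstract_Text are column names from the article):

```python
import pandas as pd

# Toy stand-in for the ticket DataFrame; Severity and Country are invented
df = pd.DataFrame({
    "Time_to_relief": [10.0, 99.0],
    "TTRgttthresh": [0, 1],
    "Prob_Abstract_Text": ["db2 crash on startup", "slow query"],
    "Severity": ["1", "2"],
    "Country": ["US", "DE"],
})

nonfeaturelist = ["Time_to_relief", "TTRgttthresh"]  # label-related columns
textcols = ["Prob_Abstract_Text"]                    # text features
contcols = []                                        # no continuous features for TTR

# Everything left over is treated as a categorical feature
collist = [c for c in df.columns
           if c not in nonfeaturelist + textcols + contcols]
```

Deriving collist from the DataFrame itself is what makes adding or dropping a feature a one-line change: edit the exclusion lists and the rest of the pipeline follows.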

These lists are used to iterate through the following processing steps:

  • Fill missing data
  • Replace category values with IDs
  • Process text features
  • Define Keras variables
  • Define the model, using the feature lists to define the model inputs and build up the layers associated with each input
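As a rough illustration of the first two steps, here is a minimal pandas sketch. The column names and values are invented; the real notebook loops over the lists built earlier:

```python
import pandas as pd

# Invented sample of two categorical columns with missing entries
df = pd.DataFrame({
    "Severity": ["1", None, "2", "1"],
    "Country": ["US", "DE", None, "US"],
})
collist = ["Severity", "Country"]

# Step 1: fill missing data with a placeholder category
for col in collist:
    df[col] = df[col].fillna("missing")

# Step 2: replace each category value with an integer ID
for col in collist:
    df[col] = df[col].astype("category").cat.codes
```

Treating "missing" as just another category lets the model learn from missingness itself instead of discarding those rows.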
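The text-processing step can be sketched with the Keras tokenizer. The sample problem abstracts below are invented, and the `num_words` and `maxlen` values are assumed hyperparameters, not values from the notebook:

```python
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.preprocessing.text import Tokenizer

# Invented examples of brief problem descriptions (Prob_Abstract_Text)
texts = ["db2 instance crashes on startup",
         "query runs slowly after upgrade"]

tok = Tokenizer(num_words=1000)   # keep the 1,000 most frequent words
tok.fit_on_texts(texts)

# Convert each abstract to a fixed-length sequence of word IDs
padded = pad_sequences(tok.texts_to_sequences(texts), maxlen=10)
```

The fixed-length integer sequences in `padded` can then feed an embedding layer in the model, just like the categorical IDs.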
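The model-definition step, building one embedding input per categorical column in collist, might look roughly like this. The column names, cardinalities, embedding size, and layer widths are all illustrative assumptions, not the notebook's actual values:

```python
from tensorflow.keras.layers import (Concatenate, Dense, Embedding,
                                     Flatten, Input)
from tensorflow.keras.models import Model

# Invented categorical columns and their number of distinct category IDs
collist = ["Severity", "Component", "Country"]
cardinality = {"Severity": 4, "Component": 50, "Country": 30}

inputs, flattened = [], []
for col in collist:
    inp = Input(shape=(1,), name=col)           # one integer ID per row
    emb = Embedding(input_dim=cardinality[col], output_dim=8)(inp)
    flattened.append(Flatten()(emb))
    inputs.append(inp)

# Concatenate all embeddings and predict the binary TTR label
merged = Concatenate()(flattened)
hidden = Dense(32, activation="relu")(merged)
output = Dense(1, activation="sigmoid")(hidden)

model = Model(inputs=inputs, outputs=output)
model.compile(optimizer="adam", loss="binary_crossentropy")
```

Because the loop is driven by collist, adding or dropping a categorical feature changes the model inputs without touching the model-building code itself.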

There are still a couple of places that need additional work:

  • Incorporating continuous features in the model definition
  • Eliminating the remaining hardcoded references to column names in the model definition

With these updates, the code covers the whole process, from ingestion of data from a Db2 on Cloud database to output of results from a simple Keras model, in a way that makes it easy to add and drop features.

I am currently working on another application of this model: predicting Db2 Duty Manager calls. Despite this problem having a significantly different set of features from the TTR prediction problem (including continuous features), the overall model from the TTR problem is working reasonably well.


Mark Ryan

Technical writing manager at Google. Opinions expressed are my own.