Data Analytics Applied

The data analytic techniques used here can be applied in any business model; the question is always which technique best fits the model and the purpose, e.g., order input or pricing. In this client sample, classical time series, random forest feature importance, a long short term memory ('LSTM') neural network, and a neural network with embedded categorical variables are applied.

Sparse Data Correction

Sparse or missing data is usually corrected by inserting a mean, a median, a prior entry or some other simple value. This is referred to as 'imputing'. Imputing is usually effective when only a few empty entries are randomly scattered through the data. In business data, however, it is common for a large portion of a sample, or an entire feature over some time period, to be missing. A typical example is a data feature that was not available early in the time frame, or a respondent who answered only a portion of the query. Such data is often discarded. We apply a matrix factorization scheme to correct sparse data. It has proven twice as accurate as imputing for the data used in this example and is far superior to simply discarding valuable data.
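
The sketch below illustrates the idea with a plain low-rank matrix factorization fitted by stochastic gradient descent over the observed entries only, then used to reconstruct the missing cells. It is a minimal sketch, not the client implementation (which builds on Rendle's factorization machine, cited below); the rank, learning rate and epoch count are illustrative assumptions.

    # Minimal sketch: low-rank matrix factorization to fill missing entries.
    # Illustrative only; rank, learning rate and epochs are assumptions.
    import numpy as np

    def factorize_and_fill(X, rank=3, lr=0.01, reg=0.02, epochs=200, seed=0):
        """X: 2-D array with np.nan marking the missing entries."""
        rng = np.random.default_rng(seed)
        n_rows, n_cols = X.shape
        P = 0.1 * rng.standard_normal((n_rows, rank))     # row factors
        Q = 0.1 * rng.standard_normal((n_cols, rank))     # column factors
        observed = np.argwhere(~np.isnan(X))               # indices of known entries
        for _ in range(epochs):
            for i, j in observed:
                p_i = P[i].copy()
                err = X[i, j] - p_i @ Q[j]                 # error on a known cell
                P[i] += lr * (err * Q[j] - reg * p_i)
                Q[j] += lr * (err * p_i - reg * Q[j])
        filled = X.copy()
        missing = np.isnan(X)
        filled[missing] = (P @ Q.T)[missing]               # reconstruct only the gaps
        return filled

Applied to, say, a partner-by-month order matrix, the factorization borrows strength from the known cells in the same rows and columns instead of discarding them.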

Time Series Analysis

Time series analysis as used here is the historical standard in which forecasts are made by finding statistical correlations between different time periods, rather than the neural network time series computations applied later. This classical analysis uses only a single variable, e.g., sales, recorded over a second variable, time. Time series analysis should always be applied because it is easily explained and exposes trend lines and seasonality. Further, early in a proof of concept, order input and time are often the only two readily available variables.
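
As a concrete illustration, the minimal sketch below fits a single-variable monthly forecast with Facebook's Prophet, which is credited in the citations. The file name, column names and monthly frequency are assumptions about the client data; in the versions current at the time the package is imported as fbprophet, while newer releases use prophet.

    # Minimal single-variable time series sketch with Prophet.
    # File name, columns and monthly frequency are assumptions.
    import pandas as pd
    from fbprophet import Prophet

    orders = pd.read_csv('monthly_orders.csv')             # hypothetical file: date, orders
    df = pd.DataFrame({'ds': pd.to_datetime(orders['date']),
                       'y': orders['orders']})             # Prophet expects columns ds and y

    model = Prophet(yearly_seasonality=True)                # exposes trend and seasonality
    model.fit(df)

    future = model.make_future_dataframe(periods=12, freq='M')
    forecast = model.predict(future)                        # yhat plus trend and seasonal terms
    print(forecast[['ds', 'yhat', 'yhat_lower', 'yhat_upper']].tail())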

Random Forests Feature Importance

A random forest regressor is applied not to forecast, but to rank feature importances. Random forests are a traditional and robust classifier with limitations as a predictor of a continuous target such as order input. Random forests have both strong adherents and dismissive critics, particularly when compared with gradient boosting. Used for the limited purpose here, the method is easily understood and its results can be checked against business sense.
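
A minimal sketch of this limited use follows, assuming a flat feature table with a target column named order_input; the file and column names are hypothetical, and the importances are read off rather than used to forecast.

    # Minimal sketch: random forest used only to rank feature importances.
    # File and column names are hypothetical.
    import pandas as pd
    from sklearn.ensemble import RandomForestRegressor

    data = pd.read_csv('client_features.csv')              # hypothetical feature table
    X = data.drop(columns=['order_input'])
    y = data['order_input']

    rf = RandomForestRegressor(n_estimators=500, random_state=0)
    rf.fit(X, y)

    importances = (pd.Series(rf.feature_importances_, index=X.columns)
                     .sort_values(ascending=False))
    print(importances)                                      # sanity-check against business sense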

Long Short Term Memory Neural Network

Long short term memory (LSTM) is a powerful recurrent neural network technique developed for sequence learning; it finds and predicts recurring patterns in a sequence. When LSTM is adapted to structured business data, as done here, the adaptation begins by parsing the feature variables into time dependent matrices, so that time is folded into the feature variables rather than carried explicitly as an independent time variable. This process combines time dependence with feature importance.
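
A minimal sketch of that parsing step follows: feature rows are stacked into trailing windows so each training sample is a time dependent matrix, and a small PyTorch LSTM predicts the next value. The window length and layer sizes are illustrative assumptions, not the client configuration.

    # Minimal sketch: parse feature rows into time dependent matrices
    # (sliding windows) and feed them to a small PyTorch LSTM.
    # Window length and layer sizes are illustrative assumptions.
    import numpy as np
    import torch
    import torch.nn as nn

    def make_windows(features, target, window=12):
        """features: (n_periods, n_features) array; returns windowed tensors."""
        xs, ys = [], []
        for t in range(window, len(features)):
            xs.append(features[t - window:t])                # trailing window of features
            ys.append(target[t])                             # value to predict
        X = torch.tensor(np.array(xs), dtype=torch.float32)  # (samples, window, n_features)
        y = torch.tensor(np.array(ys), dtype=torch.float32)
        return X, y

    class OrderLSTM(nn.Module):
        def __init__(self, n_features, hidden=32):
            super().__init__()
            self.lstm = nn.LSTM(n_features, hidden, batch_first=True)
            self.head = nn.Linear(hidden, 1)

        def forward(self, x):
            out, _ = self.lstm(x)                            # (batch, window, hidden)
            return self.head(out[:, -1, :]).squeeze(-1)      # predict from the last step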

LSTM can predict changing trend lines, which is not possible with traditional time series analysis. This technique provides the accurate trend analysis in the client example.

Embedded Neural Network

Embedding is an extremely efficient technique for dealing with categorical data in a neural network solution. Categorical data includes region, class, partner id, store id, etc. The usual method for dealing with such data is to use business knowledge to parse the available data into separate databases. For example, it is known that longer term partners will have a different performance metric than shorter term partners, so those partners are split into separate, smaller databases, which keeps comparisons consistent but loses the information available from comparing partners of different terms. Embedding keeps those partners in the same analysis while still recognizing their differences. This technique is used for the accurate forecast by partner in the client example.
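
A minimal sketch follows, assuming a single categorical variable (partner id) carried alongside numeric features; the embedding size and layer widths are illustrative, not the client configuration.

    # Minimal sketch: an embedding layer keeps every partner in one pooled
    # model while still learning a distinct vector per partner.
    # Sizes are illustrative assumptions.
    import torch
    import torch.nn as nn

    class PartnerNet(nn.Module):
        def __init__(self, n_partners, emb_dim=8, n_numeric=10):
            super().__init__()
            self.partner_emb = nn.Embedding(n_partners, emb_dim)   # one vector per partner
            self.body = nn.Sequential(
                nn.Linear(emb_dim + n_numeric, 64),
                nn.ReLU(),
                nn.Linear(64, 1),
            )

        def forward(self, partner_id, numeric):
            e = self.partner_emb(partner_id)                 # (batch, emb_dim)
            return self.body(torch.cat([e, numeric], dim=1)).squeeze(-1)

Partner ids enter the model as integer codes 0..n_partners-1, so all partners stay in one training set and the learned vectors capture their differences.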

Embedding is absolutely essential when making predictions for dealers, stores and independent offices, or by part number quantity. It deals easily with individual identities without losing the information in the pooled data. It can also be used where time intervals are critical, e.g., time since training or time since a promotion was run. Embedding has recently been the winning method in many important data science competitions.

Deep Feature Synthesis

Distributed computer processing networks generate enormous, continuously updated data 'lakes'. The question is whether there is real, actionable information in those lakes. Simply fishing through a lake for potentially valid information is expensive and not reliably beneficial. Automated feature analytics and feature synthesis is a newly emerging field that identifies the 'best' features without knowledge of the business process being modeled. A big lake of data is thereby transformed into an even bigger lake, with features created by aggregation or stacking that are even harder to interpret. We suggest building a model and then finding the right data to power that model, rather than creating indecipherable new features or simply fishing in the data lake.
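
For context, the sketch below writes out by hand the kind of aggregation features that automated synthesis tools generate and stack; the table and column names are assumptions for illustration.

    # Illustrative sketch of aggregation-built features; table and column
    # names are assumptions.
    import pandas as pd

    orders = pd.read_csv('order_lines.csv')                 # hypothetical order-line table

    partner_features = orders.groupby('partner_id').agg(
        order_total_sum=('order_total', 'sum'),
        order_total_mean=('order_total', 'mean'),
        order_count=('order_id', 'count'),
        months_active=('order_month', 'nunique'),
    )
    print(partner_features.head())                           # each column is a synthesized feature

Stacking further aggregations on top of columns like these multiplies the feature count quickly, which is the interpretability concern raised above.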

Citations

The Conda distribution from Continuum Analytics is used as the primary Python environment, along with their Bokeh applications. Special recognition goes to Jeremy Howard's fast.ai for the embedded neural network solution, to Facebook for the Prophet time series and PyTorch programs, and to Steffen Rendle for the factorization machine.

Steffen Rendle (2012): Factorization Machines with libFM. ACM Transactions on Intelligent Systems and Technology, 3(3), May 2012.

BSD License For Prophet software. Copyright (c) 2017-present, Facebook, Inc. All rights reserved.


Production Analytics

The proof of concept code is provided in Python 2.7 and Theano. Both are too slow for a production model but are simple to implement from easily sourced software libraries. Production systems are written in Python 3.5 and PyTorch and run on a GPU using fast.ai.
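
As a minimal sketch of that production setup, the snippet below selects a GPU when one is available and moves a stand-in PyTorch model onto it; the model here is a placeholder, not the client network.

    # Minimal sketch: run a PyTorch model on a GPU when one is available.
    # The Linear layer is a placeholder for the production network.
    import torch
    import torch.nn as nn

    device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
    model = nn.Linear(8, 1).to(device)
    batch = torch.randn(64, 8, device=device)
    with torch.no_grad():
        predictions = model(batch)
    print(predictions.shape, device)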

Fast.ai

PyTorch