Tuesday, 22 January 2019

Receivables Segmentation using Decision Trees: Part 1

Who said we cannot build Decision Trees in Excel? You probably didn't, but I bet you thought it was impossible.

This post is the first in a two-part series exploring how a Receivables Analyst can use Microsoft Excel to build a simple Decision Tree model that helps segment Accounts Receivable.

What is a Decision Tree?
Decision Tree-based prediction models are among the most widely used families of classifiers in Machine Learning. You can have a look at The Victory Lab (by Sasha Issenberg) to understand how decision trees were used in a very concrete case: Obama's successful 2012 reelection campaign.
One of the important advantages of Decision Trees is their simplicity: the decision model can be understood by people with little statistical training, while still helping to predict the value of a binary outcome from several independent variables.

A bit of theory about Decision Trees

Let's start with some theory. A decision tree starts with a root node that includes all the observations (all combinations of attribute values). We then decide which "pertinent" variable to "split" the root node on, i.e. the split that creates the best improvement in class separation. This process is then repeated to build all the branches of the tree.


The split of a node should only take place on impure data, and a node is considered pure if all the observations associated with it have the same value of the dependent variable.
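As a small illustration of the purity check (a minimal sketch, assuming the dependent variable values of the observations in a node sit in cells A2:A100 of our Excel sheet; the range is hypothetical):

=COUNTIF(A2:A100, INDEX(A2:A100, 1)) = COUNTA(A2:A100)

This formula returns TRUE when every observation in the range carries the same value as the first one, i.e. the node is pure and should not be split any further.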

The impurity of a split can be measured as the weighted average of the impurities of the child nodes it creates, where the weight of each child node is proportional to the number of data points it contains.


The impurity of a child node can be calculated in different ways: Entropy, Gini Index, or Classification Error. In this article, we will use entropy to measure the impurity of a node.



To measure the entropy of a node, suppose there are p possible values (0, 1, ..., p-1) for the dependent variable, and assume the child node is defined by the independent variable X being equal to a. The entropy of the child node is then computed with the following equation:


Entropy = -Σ P(xi) * log2(P(xi)),  for i = 0, 1, ..., p-1

where P(xi) is the fraction of observations in class i
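To make this concrete in Excel (a minimal sketch, assuming a binary dependent variable and the two class counts of the node placed in cells B2 and B3; the cell references are just an example):

=-SUMPRODUCT(B2:B3/SUM(B2:B3), LOG(B2:B3/SUM(B2:B3), 2))

Here B2:B3/SUM(B2:B3) produces the class fractions P(xi), and LOG(..., 2) takes the base-2 logarithm. One caveat: LOG returns an error for a zero count; since by convention 0 * log2(0) = 0, a pure node has an entropy of 0, so zero counts would need to be excluded or wrapped in IFERROR.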

The impurity associated with the split is then calculated as:

Impurity = (Σ ni * Entropy(i)) / N

where ni is the number of observations in child node i (those with the split variable equal to i),
and N is the total number of observations in the parent node.
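In Excel, this weighted average is again a one-liner (a sketch, assuming the child node sizes ni are in cells E2:E3 and the corresponding entropies, computed as above, in F2:F3):

=SUMPRODUCT(E2:E3, F2:F3) / SUM(E2:E3)

Comparing this value with the entropy of the parent node tells us how much class separation a candidate split actually gains; among several candidate variables, we would pick the one whose split yields the lowest impurity.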
