CX1115 Lab 5: Classification Tree


(How do you draw the plot for decision trees?)

Lab 4 Review!

  1. Split a dataset into train/test split via `train_test_split()`

  2. Build a linear regression model with sklearn, along with the attributes and functions of the model object (recapped in the sketch after this list):
    Building model: .fit()
    Checking model: .intercept_, .coef_
    Using model: .predict()
    Analysing model: .score(), plus sklearn.metrics.mean_squared_error()
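As a quick recap, here is a minimal end-to-end sketch of that Lab 4 workflow (the synthetic data from `make_regression` is just a stand-in so the snippet runs on its own):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data so the sketch is runnable on its own
X, y = make_regression(n_samples=200, n_features=2, noise=10, random_state=42)

# 1. Train/test split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

# 2. Build, check, use, analyse
linreg = LinearRegression()
linreg.fit(X_train, y_train)                # building the model
print(linreg.intercept_, linreg.coef_)      # checking the fitted parameters
y_pred = linreg.predict(X_test)             # using the model
print(linreg.score(X_test, y_test))         # R^2 on the test set
print(mean_squared_error(y_test, y_pred))   # MSE on the test set
```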

Overview πŸ—ΊοΈ

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

  • After building some linear regression models, we're ready to build classification models!

  • Regression problems can be converted into classification problems (e.g. predicting up/down movement instead of the precise closing stock price; see the snippet after this list)

  • Generally, regression is the tougher problem.
    Classification And Regression Trees (CART) can do both tasks.
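As a sketch of that conversion (the toy `close` column here is made up for illustration, not real market data):

```python
import pandas as pd

prices = pd.DataFrame({"close": [100.0, 101.5, 101.2, 102.8, 102.1]})

# Regression target: tomorrow's exact closing price
prices["next_close"] = prices["close"].shift(-1)

# Classification target: does the price go up (1) or down (0)?
# (the last row has no "tomorrow", so its label should be dropped)
prices["direction"] = (prices["next_close"] > prices["close"]).astype(int)
```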

2 key learning points in Lab 5


  1. SwarmPlot: like a boxplot, but you can see the actual data points
    (though you can't see the quartiles)

  2. Build a decision tree with sklearn, along with the attributes and functions of the model object (both points are sketched in the code after this list):
    Building model: .fit()
    Checking model: sklearn.tree.plot_tree()
    Using model: .predict(), .predict_proba()
    Analysing model: .score(), plus sklearn.metrics.confusion_matrix()
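A hedged sketch of both learning points, using seaborn's bundled iris dataset as a stand-in for the lab's data:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = sns.load_dataset("iris")  # downloads a small sample dataset

# 1. Swarm plot: every observation is drawn as an individual point
sns.swarmplot(data=iris, x="species", y="petal_length")
plt.show()

# 2. Decision tree workflow, mirroring the Lab 4 regression workflow
X = iris[["petal_length", "petal_width"]]
y = iris["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

dectree = DecisionTreeClassifier(max_depth=2)   # shallow tree, easier to read
dectree.fit(X_train, y_train)                   # building the model
print(dectree.predict(X_test[:5]))              # predicted classes
print(dectree.predict_proba(X_test[:5]))        # class probabilities from the leaves
print(dectree.score(X_test, y_test))            # accuracy on the test set
```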

1. How to use decision trees

"Decision Tree is a 'partition strategy' on the data to get probabilities."

  • Refer to the plot in the class notes: we literally slice an axis into two parts with a straight line, so the decision boundaries are axis-aligned. What are the implications?
  • 1 node in the tree = 1 split = 1 line in the plot; conditions are AND-ed as you go down the tree;
    the left branch is taken when the node's condition is True
  • Most implementations offer only binary splits (not the same thing as binary classification!) on numeric independent variables. Why?
  • Another way to plot your trees, and the arrows are actually there! (see the plotting sketch after this list)
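For the built-in option, `sklearn.tree.plot_tree()` draws the nodes and the parent-to-child arrows; a minimal sketch, rebuilt on the iris stand-in so it runs standalone:

```python
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.tree import DecisionTreeClassifier, plot_tree

iris = sns.load_dataset("iris")
dectree = DecisionTreeClassifier(max_depth=2)
dectree.fit(iris[["petal_length", "petal_width"]], iris["species"])

plt.figure(figsize=(12, 6))
plot_tree(dectree,
          feature_names=["petal_length", "petal_width"],
          class_names=list(dectree.classes_),
          filled=True)   # nodes coloured by majority class; left branch = condition True
plt.show()
```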

Evaluation Metrics

  • Confused about confusion matrix? Check this out
  • Accuracy = (TP + TN) / N. What happens if the dataset is imbalanced? Use TPR, TNR, F1, MCC... instead (precision and recall are easy to confuse!); see the metrics sketch below
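A sketch of computing these metrics with `sklearn.metrics`, again on the iris stand-in (the `average="macro"` choice for F1 is one reasonable option among several):

```python
import seaborn as sns
from sklearn.metrics import confusion_matrix, f1_score, matthews_corrcoef
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = sns.load_dataset("iris")
X_train, X_test, y_train, y_test = train_test_split(
    iris[["petal_length", "petal_width"]], iris["species"], random_state=42)

dectree = DecisionTreeClassifier(max_depth=2).fit(X_train, y_train)
y_pred = dectree.predict(X_test)

print(confusion_matrix(y_test, y_pred))            # rows = true class, cols = predicted
print(dectree.score(X_test, y_test))               # accuracy: (TP + TN) / N when binary
print(f1_score(y_test, y_pred, average="macro"))   # per-class F1, averaged
print(matthews_corrcoef(y_test, y_pred))           # MCC copes better with imbalance
```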

2. How do decision trees work?

The goal is to get pure child nodes, i.e. nodes containing only one class: each split should remove impurity.

Grow tree: on train data

  • Which variable to choose, and how to split it?
    Categorical: natural splits. With a binary split, one category can be isolated from the rest.
    Numeric: sort all the values in the dataset, then use each value (or the midpoint of each consecutive pair) as a candidate threshold. For both numeric and categorical variables, try all candidate splits (or a random subset, for speed) and compute the impurity of each.
    Use Gini or entropy to compute the information gain; in practice there is not much difference between the two (see the sketch after this list).
  • Stopping conditions: maximum tree depth, minimum number of samples in a node before/after a split, minimum change in impurity
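To make the impurity arithmetic concrete, here is a small hand-rolled sketch of Gini, entropy, and the information gain of one candidate split (plain Python for illustration, not how sklearn implements it internally):

```python
import math
from collections import Counter

def gini(labels):
    """Gini impurity: 1 - sum(p_k^2) over the classes k."""
    n = len(labels)
    return 1 - sum((c / n) ** 2 for c in Counter(labels).values())

def entropy(labels):
    """Entropy: -sum(p_k * log2(p_k)) over the classes k."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right, impurity=gini):
    """Parent impurity minus the size-weighted impurities of the children."""
    n = len(parent)
    return (impurity(parent)
            - len(left) / n * impurity(left)
            - len(right) / n * impurity(right))

parent = ["up"] * 5 + ["down"] * 5             # 50/50 mix: maximally impure
left = ["up"] * 4 + ["down"]                   # mostly "up" after the split
right = ["up"] + ["down"] * 4                  # mostly "down" after the split

print(gini(parent))                            # 0.5
print(information_gain(parent, left, right))   # ~0.18: the split removed impurity
```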

Prune tree: complex trees might have overfitted, so tune on validation data

  • Apply a penalty on tree complexity (e.g. the number of leaves), a form of regularisation that makes the final tree smaller (sketched below)
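In sklearn this penalty is cost-complexity pruning via `ccp_alpha`; a hedged sketch of sweeping it and scoring each pruned tree on held-out data (here the test split stands in for a proper validation set):

```python
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = sns.load_dataset("iris")
X_train, X_val, y_train, y_val = train_test_split(
    iris[["petal_length", "petal_width"]], iris["species"], random_state=42)

# Candidate penalties come from the unpruned tree's cost-complexity path
full_tree = DecisionTreeClassifier(random_state=42).fit(X_train, y_train)
path = full_tree.cost_complexity_pruning_path(X_train, y_train)

# Larger ccp_alpha => heavier penalty on complexity => smaller final tree
for alpha in path.ccp_alphas:
    pruned = DecisionTreeClassifier(ccp_alpha=alpha, random_state=42)
    pruned.fit(X_train, y_train)
    print(f"alpha={alpha:.4f}  leaves={pruned.get_n_leaves()}  "
          f"val acc={pruned.score(X_val, y_val):.3f}")
```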

Lab 5 Deliverables

Check the 'Assignments' tab in the lab's course site on NTULearn.


Remember to submit it within 48 hours after the end of the lab
(i.e. by 18 Feb, 10.30am)

References


Survey for Lab 6! Please let me know what you want to be covered. :)


This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML), and you can consider using it for simple (tech) presentations! For more advanced customisation you do need CSS and JS, but scripts can easily be googled, and it has good documentation.

There are also more alternatives here.

Slide 1 image