CX1115 Lab 6
A Summary!

Lab 5 Review

Swarm plot: like a box plot, but shows every individual data point
(at the cost of not marking the quartiles)

Build a decision tree with sklearn, and know the methods and helper functions used with the model object:
Building model: .fit()
Checking model: plot_tree(model) from sklearn.tree
Using model: .predict(), .predict_proba()
Analysing model: .score(), and confusion_matrix(y_true, y_pred) from sklearn.metrics
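The four steps above can be sketched end to end as follows. This is a minimal sketch, not the lab's solution: the iris dataset, the split ratio and max_depth=3 are assumptions chosen for illustration, and export_text is used as a text-only stand-in for plot_tree so the example runs without matplotlib.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

# Illustrative data only: iris stands in for the lab's dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Building the model
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Checking the model: export_text prints the learned splits as text;
# sklearn.tree.plot_tree(model) draws the same tree with matplotlib.
print(export_text(model))

# Using the model
y_pred = model.predict(X_test)          # hard class labels
y_proba = model.predict_proba(X_test)   # one probability column per class

# Analysing the model
accuracy = model.score(X_test, y_test)  # mean accuracy on the test split
cm = confusion_matrix(y_test, y_pred)   # rows: true class, columns: predicted class
print(accuracy)
print(cm)
```

Note that plot_tree and confusion_matrix are standalone functions that take the model or its predictions as arguments, whereas fit, predict, predict_proba and score are methods of the model object.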

Overview 🗺️

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

We've seen regression and classification models

Many other models exist, but first understand the basics well

We haven't gone into the details, e.g. regularisation (see Slide 31): lasso, ridge regression, elastic net
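For a first feel of what regularisation does, the sketch below fits plain least squares next to ridge, lasso and elastic net on synthetic data where only two of five features matter. The data and the alpha values are assumptions picked to make the effect visible, not recommended settings.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, Lasso, LinearRegression, Ridge

# Synthetic data (assumption): only the first two of five features affect y.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.1, size=200)

models = {
    "ols": LinearRegression(),                            # no penalty
    "ridge": Ridge(alpha=10.0),                           # L2 penalty: shrinks all coefficients
    "lasso": Lasso(alpha=0.5),                            # L1 penalty: can set coefficients to exactly 0
    "elastic_net": ElasticNet(alpha=0.5, l1_ratio=0.5),   # mix of L1 and L2
}

coefs = {name: m.fit(X, y).coef_ for name, m in models.items()}
for name, c in coefs.items():
    print(f"{name:12s}", np.round(c, 2))
```

In this setup the lasso drives the three irrelevant coefficients to exactly zero, while ridge only shrinks them towards zero.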

4 key learning points in Lab 6

One-hot encoding for categorical variables
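A minimal sketch of one-hot encoding with pandas. The toy DataFrame and its column names (colour, price) are hypothetical and only serve to show the transformation.

```python
import pandas as pd

# Hypothetical toy data; the column names are for illustration only.
df = pd.DataFrame({
    "colour": ["red", "green", "blue", "green"],
    "price": [10.0, 12.5, 9.0, 11.0],
})

# Replace the categorical column with one indicator column per category.
encoded = pd.get_dummies(df, columns=["colour"])
print(encoded)
```

Each row has exactly one of the colour_* indicators set. Inside an sklearn pipeline, sklearn.preprocessing.OneHotEncoder does the same job and remembers the categories seen during fit.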

Dealing with imbalanced datasets (the imblearn package is a good starting point)
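One simple remedy is random oversampling of the minority class. The sketch below does it with sklearn.utils.resample so that it needs nothing beyond sklearn; this is a stand-in for illustration, and imblearn offers RandomOverSampler (and other samplers) for the same purpose. The 90/10 class split is made up.

```python
import numpy as np
from sklearn.utils import resample

# Made-up imbalanced data: 90 samples of class 0, 10 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = np.array([0] * 90 + [1] * 10)

X_maj, y_maj = X[y == 0], y[y == 0]
X_min, y_min = X[y == 1], y[y == 1]

# Sample the minority class with replacement until it matches the majority.
X_min_up, y_min_up = resample(
    X_min, y_min, replace=True, n_samples=len(y_maj), random_state=0
)

X_bal = np.vstack([X_maj, X_min_up])
y_bal = np.concatenate([y_maj, y_min_up])
print(np.bincount(y), np.bincount(y_bal))  # [90 10] [90 90]
```

Resample only the training split; the test split should keep the original class balance so that the reported scores stay honest.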

Random Forest (ensemble of trees)
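A minimal random forest sketch on synthetic data. The dataset from make_classification and the settings n_estimators=100 and max_depth=4 are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (assumption), split into train and test.
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Each of the 100 trees is trained on a bootstrap sample of the training data,
# and the forest averages their predictions.
forest = RandomForestClassifier(n_estimators=100, max_depth=4, random_state=0)
forest.fit(X_train, y_train)

print(len(forest.estimators_))          # the individual fitted trees
print(forest.score(X_test, y_test))     # accuracy on the test split
print(forest.feature_importances_)      # impurity-based importances, summing to 1
```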

Hyperparameter tuning: Cross validation, grid search
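Grid search and cross-validation combine naturally in sklearn's GridSearchCV: every combination in the grid is scored with k-fold cross-validation. The sketch below uses synthetic data and a small, arbitrary grid; both are assumptions for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Synthetic data (assumption).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)

# 3 x 2 = 6 candidate settings, each evaluated with 5-fold cross-validation.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [1, 5]}
search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid,
    cv=5,
    scoring="accuracy",
)
search.fit(X, y)

print(search.best_params_)   # the combination with the best mean CV score
print(search.best_score_)    # that mean cross-validated accuracy
```

By default the best setting is refitted on all the data passed to fit, so search.predict can be used directly afterwards.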

If things have been quite hard

It's (kinda) possible to do data science without too much math and coding!

Higher level APIs: PyCaret instead of sklearn
GUI tools: Tableau / Power BI for visualisation, Weka for automated preprocessing

You should still be aware of the strengths and limitations of each modelling technique, and of the assumptions it makes

Decision tree: very fast inference, axis-aligned (piecewise linear) decision boundaries, interpretable
Linear regression: makes assumptions about the errors; check them using diagnostic plots
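The quantities behind a residuals-vs-fitted diagnostic plot can be computed as sketched below. The data is synthetic and built to satisfy the usual assumptions, and the split into a lower and an upper half of the fitted values is only a rough numeric stand-in for eyeballing the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): a linear trend plus constant-variance noise.
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=200)
y = 2.0 * x + 1.0 + rng.normal(scale=1.0, size=200)

X = x.reshape(-1, 1)
model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# A diagnostic plot would scatter `fitted` against `residuals` and look for
# curvature (non-linearity) or a funnel shape (non-constant variance).
order = np.argsort(fitted)
low, high = residuals[order[:100]], residuals[order[100:]]
print(residuals.mean())        # essentially zero for least squares with an intercept
print(low.std(), high.std())   # similar spreads suggest constant error variance
```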

Beyond the course

What we haven't covered (but still important)

Model interpretability: e.g. identifying salient features via SHAP
Dataset shift: what happens if the (test) distribution changes?
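A toy illustration of dataset shift: a classifier is trained on one distribution and then scored on test data whose feature values have all moved. The two Gaussian classes, the shift of 2.0 and the helper sample are hypothetical choices made for this sketch.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)


def sample(n, shift=0.0):
    """Draw n points from two 1-D Gaussian classes; `shift` moves every feature value."""
    y = rng.integers(0, 2, size=n)
    x = rng.normal(loc=3.0 * y + shift, scale=1.0)
    return x.reshape(-1, 1), y


X_train, y_train = sample(2000)
X_same, y_same = sample(2000)                # same distribution as training
X_shift, y_shift = sample(2000, shift=2.0)   # features shifted at test time

clf = LogisticRegression().fit(X_train, y_train)
acc_same = clf.score(X_same, y_same)
acc_shift = clf.score(X_shift, y_shift)
print(acc_same, acc_shift)
```

The decision threshold learned during training no longer separates the classes once the features move, so accuracy on the shifted data drops even though the model itself has not changed.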

Production-level data science

Dealing with bigger datasets with Vaex and Dask
Adopt a test-driven approach even in DS via Great Expectations
The data engineering side of things: Using Apache Spark, Kubernetes

Lab 6 Deliverables

No submission! :)

All the best for the quiz!

The human side of DS

For insights derived from data to be useful, they have to be actionable (i.e. predictive and prescriptive).

Convincing decision makers with insights from data is hard, especially if they are used to relying on their instincts.

Thick data: information that humans can obtain but that we don't yet know how to encode well for computers to understand (the 'semantic gap')

One more thing!

Aim high but have legitimate backup plans (that you'll enjoy)

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML), so consider using it for simple (tech) presentations! More advanced customisation does need CSS and JS, but reveal.js has good documentation and example scripts are easy to find online.

There are also more alternatives here.
