CX1115 Lab 6
A Summary!


Lab 5 Review

  1. SwarmPlot: like boxplot but can see actual data points
    (but can't see quartiles)

  2. Build a decision tree with sklearn, then work with the fitted model through its methods and sklearn's helper functions:
    Building model: .fit()
    Checking model: plot_tree() (a function in sklearn.tree, not a model method)
    Using model: .predict(), .predict_proba()
    Analysing model: .score(), confusion_matrix() (a function in sklearn.metrics, not a model method)
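
As a reminder of how these pieces fit together, here is a minimal sketch. It assumes the Iris dataset as a stand-in (it is not the lab's data), and the variable names and `max_depth=3` are illustrative choices only.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Assumption: Iris is used purely for illustration, not the lab dataset.
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Building the model
tree = DecisionTreeClassifier(max_depth=3, random_state=0)
tree.fit(X_train, y_train)

# Using the model: hard labels and per-class probabilities
y_pred = tree.predict(X_test)
y_proba = tree.predict_proba(X_test)

# Analysing the model: accuracy and the confusion matrix
accuracy = tree.score(X_test, y_test)
cm = confusion_matrix(y_test, y_pred)

# Checking the model: plot_tree draws with matplotlib, so it is left commented out.
# import matplotlib.pyplot as plt
# from sklearn.tree import plot_tree
# plot_tree(tree, filled=True)
# plt.show()
```

Note that `.score()` on a classifier returns plain accuracy, which equals the diagonal of the confusion matrix divided by the number of test samples.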

Overview 🗺️

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

  • We've seen regression and classification models

  • Many other models exist, but first understand the basics well

  • We haven't gone into the details (e.g. regularisation, see Slide 31: lasso, ridge regression, elastic net)

4 key learning points in Lab 6


  1. One-hot encoding for categorical variables

  2. Dealing with imbalanced datasets (the imblearn library provides resampling tools for this)

  3. Random Forest (ensemble of trees)

  4. Hyperparameter tuning: Cross validation, grid search
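
The four points above can be sketched together in one small pipeline. This is only an illustration under stated assumptions: the toy data, column names and parameter grid are hypothetical, and `class_weight="balanced"` is swapped in for imblearn's resamplers so that the example depends on sklearn and numpy alone.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.preprocessing import OneHotEncoder

rng = np.random.default_rng(0)

# Hypothetical toy data: one categorical feature, one numeric feature,
# and an imbalanced binary label (class 1 is the minority).
n = 300
colour = rng.choice(["red", "green", "blue"], size=n)
size = rng.normal(size=n)
y = ((colour == "red") & (size > 0.5)).astype(int)

# 1. One-hot encoding: one 0/1 column per category.
encoder = OneHotEncoder(handle_unknown="ignore")
colour_onehot = encoder.fit_transform(colour.reshape(-1, 1)).toarray()
X = np.hstack([colour_onehot, size.reshape(-1, 1)])

# Stratify so that both splits keep the same class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# 2. Imbalance: reweight classes inside the model (stand-in for imblearn resampling).
# 3. Random forest: an ensemble of decision trees.
forest = RandomForestClassifier(class_weight="balanced", random_state=0)

# 4. Hyperparameter tuning: grid search with 3-fold cross-validation.
param_grid = {"n_estimators": [25, 50], "max_depth": [2, 4]}
search = GridSearchCV(forest, param_grid, cv=3, scoring="balanced_accuracy")
search.fit(X_train, y_train)

best_params = search.best_params_
test_score = search.score(X_test, y_test)  # balanced accuracy on held-out data
```

Balanced accuracy is used for scoring because plain accuracy can look good on an imbalanced dataset even when the minority class is mostly missed.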

If things have been quite hard

It's (kinda) possible to do data science without too much math and coding!

  • Higher level APIs: PyCaret instead of sklearn
  • GUI: Tableau / Power BI for viz, Weka for automated preprocessing

You should still be aware of the strengths and limitations of each modelling technique, and of the assumptions it makes

  • DT: very fast inference, axis-aligned (piecewise-linear) decision boundaries, interpretable
  • Linear Regression: error assumptions, check using diagnostic plots
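
To make the linear regression point concrete, here is a rough numeric stand-in for a residuals-versus-fitted diagnostic plot. The data are hypothetical and deliberately built so that the noise grows with x, which violates the constant-variance assumption; the `spread_ratio` name is just an illustrative choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Hypothetical data: the noise standard deviation grows with x (heteroscedastic).
x = rng.uniform(1, 10, size=500)
y = 2.0 * x + rng.normal(scale=0.5 * x)

model = LinearRegression().fit(x.reshape(-1, 1), y)
fitted = model.predict(x.reshape(-1, 1))
residuals = y - fitted

# Compare the residual spread in the lower and upper halves of the fitted values.
# Under constant error variance the two spreads should be roughly equal.
median_fit = np.median(fitted)
low = residuals[fitted <= median_fit]
high = residuals[fitted > median_fit]
spread_ratio = high.std() / low.std()

print(f"slope={model.coef_[0]:.2f}, residual spread ratio={spread_ratio:.2f}")
```

A ratio well above 1 corresponds to the funnel shape you would see in a residuals-versus-fitted plot. The slope estimate can still look reasonable here, which is why the diagnostic is worth checking separately.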

Beyond the course

What we haven't covered (but still important)

  • Model interpretability: e.g. identifying salient features via SHAP
  • Dataset shift: what happens if the (test) distribution changes?

Production-level data science

  • Dealing with bigger datasets with Vaex and Dask
  • Adopt a test-driven approach even in DS via Great Expectations
  • The data engineering side of things: Using Apache Spark, Kubernetes

Lab 5 Deliverables

No submission! :)


All the best for the quiz!

The human side of DS

For insights derived from data to be useful, they have to be actionable (i.e. predictive and prescriptive).

Convincing decision makers with insights from data is hard, especially if they are used to relying on their instincts.

Thick data: information that humans can obtain but that we don't yet know how to encode well for computers to understand (the 'semantic gap').

One more thing!

Aim high but have legitimate backup plans (that you'll enjoy)

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML), and you can consider using it for simple (tech) presentations! More advanced customisation does need CSS and JS, but example scripts are easy to find online and reveal.js has good documentation.

There are also more alternatives here.
