CX1115 Lab 4: Linear Regression


Lab 3 Review!

  • Plots for categorical variables and how to interpret them: catplot
    (strip/swarm; (enhanced) boxplot, violin; point, bar, count)

  • Data exploration & cleaning functions:
    Views: `.sample()`, `.sort_values()`, `.size()`, `.value_counts()`
    More views: `.groupby()`, `.unstack()`, `.set_index()`, `.reset_index()`
    Dup: `.unique()`, `.duplicated()`
    NaN: `.dropna()`, `.isnull()`, `.fillna()`
    Clean: `.rename()`, `.apply(lambda)`, `re.sub()`,
    `.copy()`

  • Dealing with categorical variables via `.astype('category')` on Series objects: This results in a different output when calling `.describe()`
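As a quick reminder of that last point, here is a minimal sketch of how `.describe()` changes once a Series is cast to `category`. The `ratings` Series is hypothetical toy data, not something from the lab dataset.

```python
import pandas as pd

# Hypothetical toy data: a rating stored as integers
ratings = pd.Series([1, 2, 2, 3, 2, 1], name="rating")

# Numeric dtype: describe() reports count, mean, std, min, quartiles, max
numeric_summary = ratings.describe()

# Category dtype: describe() reports count, unique, top, freq instead
categorical_summary = ratings.astype("category").describe()

print(numeric_summary)
print(categorical_summary)
```

The categorical summary is usually the more meaningful one when the numbers are labels rather than quantities.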

Overview 🗺️

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

  • After doing some data exploration to familiarise ourselves with the data, we're ready to build a simple model!

  • Linear regression models assume that the dependent variable has a linear relationship with the independent variables. This assumption does not always hold, so check it before trusting the model.

2 key learning points in Lab 4


  1. Split a dataset into train/test split via `train_test_split()`

  2. Build a linear regression model with sklearn, along with the Attributes and Functions for the model object:
    Building model: `.fit()`
    Checking model: `.intercept_`, `.coef_`
    Using model: `.predict()`
    Analysing model: `.score()`, `mean_squared_error()`
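The two learning points can be sketched end to end as follows. This is only an illustration on synthetic data: the true slope (3.0), intercept (2.0), noise level, `test_size` and `random_state` values are all assumptions chosen for the example, not values from the lab.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (assumption): y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# 1. Split into train and test sets (75% / 25% here)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# 2. Build the model on the training set only
model = LinearRegression()
model.fit(X_train, y_train)

# Check the fitted parameters
print("intercept:", model.intercept_)
print("coefficients:", model.coef_)

# Use the model on unseen data
y_test_pred = model.predict(X_test)

# Analyse the model: R^2 and mean squared error on the test set
r2_test = model.score(X_test, y_test)
mse_test = mean_squared_error(y_test, y_test_pred)
print("R^2 (test):", r2_test)
print("MSE (test):", mse_test)
```

For a regressor, `.score()` returns the coefficient of determination R^2, so higher is better, whereas a lower `mean_squared_error()` is better.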

1. More about linear regression

Assumptions

  • Errors are independent of each other
  • Errors follow a normal distribution
  • Errors have constant variance (homoscedasticity)
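These assumptions are usually checked on the residuals (actual minus predicted values). The sketch below is a rough numeric stand-in for a residuals-vs-fitted plot, on synthetic data that is an assumption of this example. Splitting the fitted values into a lower and upper half is just one simple way to eyeball constant variance, not a formal test.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data (assumption): linear signal with constant-variance noise
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(300, 1))
y = 1.5 * X[:, 0] + 4.0 + rng.normal(0, 2.0, size=300)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# With an intercept, least-squares residuals average to (almost exactly) zero
print("mean residual:", residuals.mean())

# Rough homoscedasticity check: compare the residual spread for
# small fitted values against the spread for large fitted values
order = np.argsort(fitted)
low_spread = residuals[order[:150]].std()
high_spread = residuals[order[150:]].std()
print("spread (low fitted):", low_spread)
print("spread (high fitted):", high_spread)
```

If the two spreads differ a lot, or the residuals show a clear pattern against the fitted values, the constant-variance assumption is questionable.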

Diagnostic Plots

2. More about data splitting

Imbalanced classes

  • If one class is much larger, you need to ensure that the test set follows the same ratio
  • Look for the `stratify` option in `train_test_split()`
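A minimal sketch of a stratified split, assuming a hypothetical label array with a 90/10 class imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
y = np.array([0] * 90 + [1] * 10)
X = np.arange(100).reshape(-1, 1)

# stratify=y keeps the class ratio the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("class 1 share in train:", y_train.mean())
print("class 1 share in test:", y_test.mean())
```

Without `stratify`, a random split of such a small minority class could easily leave the test set with too few (or even zero) samples of class 1.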

k-fold cross validation

  • If your dataset is small and splitting it into train/validation/test sets leaves each set very small, consider performing cross-validation instead
  • Check out sklearn.model_selection.KFold
  • Here's a visual guide
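As a rough sketch of how `KFold` could be used with a linear regression model: the synthetic data, the choice of 5 folds and the shuffling settings below are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold

# Small synthetic dataset (assumption): y = 2x + 1 plus noise
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(60, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1.0, size=60)

# 5 folds: each sample is used for validation exactly once
kfold = KFold(n_splits=5, shuffle=True, random_state=0)

fold_scores = []
for train_idx, val_idx in kfold.split(X):
    # Fit on 4 folds, evaluate R^2 on the held-out fold
    model = LinearRegression().fit(X[train_idx], y[train_idx])
    fold_scores.append(model.score(X[val_idx], y[val_idx]))

mean_score = float(np.mean(fold_scores))
print("R^2 per fold:", fold_scores)
print("mean R^2:", mean_score)
```

Averaging over the folds gives a steadier estimate of performance than a single small validation set would.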

Lab 4 Deliverables

Check the 'Assignments' tab in the lab's course site on NTULearn.


Remember that Lab 5 is graded too!
(and due 48 hours after the end of the Lab)

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML) and you can consider using it for simple (tech) presentations! For more advanced customization, you do need CSS and JS but scripts can be easily googled for and it has good documentation.

There are also more alternatives here.

Slide 1 image

CX1115 Telegram Channel!

Some of the TAs have created this initiative to share with y'all interesting stuff (🤩) about DS/AI, join us here! :)

  • NOT the official announcement outlet for the course.
    Those will still be sent through NTULearn.
  • Non-lecture related stuff (e.g. competition, events, interesting articles, datasets, SOTA results linked to what you've learnt)
  • Fill up the poll to indicate your preferences on what should be shared
    (when it's released, end of the week)