CX1115 Lab 4: Linear Regression

Lab 3 Review!

Plots for categorical variables and how to interpret them: catplot
(strip/swarm; (enhanced) boxplot, violin; point, bar, count)

Data exploration & cleaning functions:
Views: `.sample()`, `.sort_values()`, `.size()`, `.value_counts()`
More views: `.groupby()`, `.unstack()`,`.set_index()`, `.reset_index()`
Dup: `.unique()`, `.duplicated()`
NaN: `.dropna()`, `.isnull()`, `.fillna()`
Clean: `.rename()`, `.apply(lambda)`, `re.sub()`, `.copy()`
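
As a quick refresher, here is a minimal sketch that chains a few of the functions listed above. The DataFrame and its column names are made up purely for illustration; they are not from the Lab 3 dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, invented purely for illustration.
raw = pd.DataFrame({
    "Name ": ["ann", "bob", "bob", "cat"],
    "score": [3.0, 4.0, 4.0, np.nan],
})

df = raw.copy()                                # work on a copy, keep raw intact
df = df.rename(columns={"Name ": "name"})      # Clean: tidy a messy column name

dup_mask = df.duplicated()                     # Dup: True where a row repeats an earlier row
deduped = df[~dup_mask].copy()

n_missing = deduped["score"].isnull().sum()    # NaN: count missing values

filled = deduped.copy()
filled["score"] = filled["score"].fillna(filled["score"].mean())  # NaN: fill with the column mean
filled["name"] = filled["name"].apply(lambda s: s.title())        # Clean: capitalise names

print(filled.sort_values("score", ascending=False))  # Views: sorted view
print(df["name"].value_counts())                     # Views: frequency of each name
```

Filling with the mean is only one possible choice here; whether to drop or fill missing values depends on the dataset.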

Dealing with categorical variables via `.astype('category')` on Series objects: This results in a different output when calling `.describe()`
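
To see the difference, here is a small sketch using a made-up Series of integer codes (the name `grade_code` is hypothetical). The same values give a numeric summary before the conversion and a categorical summary after it.

```python
import pandas as pd

# Hypothetical column: integer codes that really stand for categories.
codes = pd.Series([1, 2, 2, 3, 2], name="grade_code")

# Treated as numbers: count, mean, std, min, quartiles, max.
numeric_summary = codes.describe()

# Treated as categories: count, unique, top (most frequent value), freq.
category_summary = codes.astype("category").describe()

print(numeric_summary)
print(category_summary)
```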

Overview 🗺️

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

After doing some data exploration to familiarise ourselves with the data, we're ready to build a simple model!

Linear Regression models are used based on the assumption that the independent variables have a linear relationship with the dependent variable. This is not always true.

2 key learning points in Lab 4

Split a dataset into train/test split via `train_test_split()`

Build a linear regression model with sklearn, along with the Attributes and Functions for the model object:
Building model: `.fit()`
Checking model: `.intercept_`, `.coef_`
Using model: `.predict()`
Analysing model: `.score()`, `mean_squared_error()`
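
The two learning points can be sketched together as below. This is only an illustration on synthetic data (a made-up relationship y = 3x + 2 plus noise), not the lab dataset; the `test_size` and `random_state` values are arbitrary choices.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data (an assumption for illustration): y = 3x + 2 plus Gaussian noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))          # sklearn expects a 2D feature array
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=200)

# Split: 75% train, 25% test.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)

# Building model
model = LinearRegression()
model.fit(X_train, y_train)

# Checking model: fitted intercept and slope(s)
print("intercept:", model.intercept_)
print("coef:", model.coef_)

# Using model
y_pred = model.predict(X_test)

# Analysing model: R^2 and mean squared error on the held-out test set
r2 = model.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("R^2:", r2, "MSE:", mse)
```

Note that `.score()` returns R² for regressors, so values closer to 1 are better, while a lower `mean_squared_error()` is better.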

1. More about linear regression

Assumptions

Errors are independent of each other
Errors follow a normal distribution
Errors have constant variance (homoscedasticity)

Diagnostic Plots

The performance of a regression-based model depends on how well the data adhere to the assumptions made
There are diagnostic plots in R (and in Python) that can help you check whether the assumptions hold, and that show what happens when one of them is violated
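
One possible way to draw two common diagnostic plots in Python is sketched below: residuals against fitted values (for constant variance) and a normal Q-Q plot (for normality of errors). The helper name `diagnostic_plots` and the synthetic data are assumptions for illustration; this is not necessarily the approach used in the linked material.

```python
import matplotlib.pyplot as plt
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression


def diagnostic_plots(model, X, y):
    """Hypothetical helper: residuals-vs-fitted and normal Q-Q plots."""
    fitted = model.predict(X)
    residuals = y - fitted

    fig, (ax_resid, ax_qq) = plt.subplots(1, 2, figsize=(10, 4))

    # Residuals vs fitted: look for a patternless band of roughly constant width.
    ax_resid.scatter(fitted, residuals, alpha=0.6)
    ax_resid.axhline(0, color="grey", linestyle="--")
    ax_resid.set_xlabel("Fitted values")
    ax_resid.set_ylabel("Residuals")
    ax_resid.set_title("Residuals vs Fitted")

    # Q-Q plot: points close to the straight line suggest roughly normal errors.
    stats.probplot(residuals, dist="norm", plot=ax_qq)
    ax_qq.set_title("Normal Q-Q")

    fig.tight_layout()
    return fig, residuals


# Synthetic data (an assumption for illustration): linear signal plus Gaussian noise.
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 1.5 * X[:, 0] + 4.0 + rng.normal(0, 1.0, size=200)

model = LinearRegression().fit(X, y)
fig, residuals = diagnostic_plots(model, X, y)
# In a notebook the figure renders automatically; in a script, call plt.show().
```

A funnel shape in the left panel hints at non-constant variance, and strong curvature in the Q-Q panel hints at non-normal errors.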

2. More about data splitting

Imbalanced classes

If one class is much larger than the others, you need to ensure that the test set follows the same class ratio
Look for the `stratify` option in `train_test_split()`
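
A minimal sketch of a stratified split, assuming a made-up binary label with a 90:10 imbalance. Stratification applies to class labels, so this is relevant when the target is categorical rather than the continuous target of a linear regression.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 samples of class 1.
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class ratio the same in the train and test sets.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

print("train class counts:", np.bincount(y_train))
print("test class counts: ", np.bincount(y_test))
```

Without `stratify`, a random split of such data could leave the test set with very few (or even zero) samples of the minority class.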

k-fold cross validation

If your dataset is small and splitting it into train/validation/test sets leaves very little data in each, you can consider performing cross-validation instead
Check out `sklearn.model_selection.KFold`
Here's a visual guide
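
As a rough sketch, 5-fold cross-validation of a linear regression might look like this. The data are synthetic and the choice of 5 folds is arbitrary; `cross_val_score` is used here as a convenience wrapper around `KFold`.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Small synthetic dataset (an assumption for illustration): only 40 samples.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(40, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1.0, size=40)

# Each sample is used for testing exactly once across the 5 folds.
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# For regressors the default score is R^2, computed on each held-out fold.
scores = cross_val_score(LinearRegression(), X, y, cv=kf)

print("R^2 per fold:", scores)
print("mean R^2:", scores.mean())
```

Reporting the mean (and spread) of the fold scores gives a more stable estimate than a single small test set.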

Lab 4 Deliverables

Check the 'Assignments' tab in the lab's course site on NTULearn.

Remember that Lab 5 is graded too!
(and due 48 hours after the end of the Lab)

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML) and you can consider using it for simple (tech) presentations! For more advanced customization, you do need CSS and JS, but example scripts are easy to find online and reveal.js has good documentation.

There are also more alternatives here.


CX1115 Telegram Channel!

Some of the TAs have created this initiative to share with y'all interesting stuff (🤩) about DS/AI, join us here! :)

NOT the official announcement outlet for the course.
Those will still be sent through NTULearn.
Non-lecture related stuff (e.g. competition, events, interesting articles, datasets, SOTA results linked to what you've learnt)
Fill in the poll to indicate your preferences on what should be shared
(when it's released, end of the week)