Dealing with categorical variables via `.astype('category')` on Series objects: This results in a different output when calling `.describe()`
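A minimal sketch of the difference, assuming a small made-up Series of numeric codes (the name `codes` and its values are illustrative only):

```python
import pandas as pd

# Hypothetical data: integers that really stand for categories
codes = pd.Series([1, 2, 1, 3, 1])

# As a numeric Series, describe() gives count, mean, std, min, quartiles, max
numeric_summary = codes.describe()
print(numeric_summary)

# As a categorical Series, describe() gives count, unique, top, freq
cat = codes.astype("category")
cat_summary = cat.describe()
print(cat_summary)
```

The categorical summary reports the most frequent value and how often it occurs, rather than a mean that would be meaningless for labels.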
Overview 🗺️
(Business) Problem → Data Acquisition → Data Cleaning → Data Exploration / Visualisation → Modelling → Reporting → Deployment
After doing some data exploration to familiarise ourselves with the data, we're ready to build a simple model!
Linear regression models assume that the independent variables have a linear relationship with the dependent variable. This assumption does not always hold.
2 key learning points in Lab 4
Split a dataset into train/test split via `train_test_split()`
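A minimal sketch of a train/test split, assuming a made-up array of 10 samples (the data, `test_size=0.2` and `random_state=42` are illustrative choices, not values from the lab):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each, and one target per sample
X = np.arange(20).reshape(10, 2)
y = np.arange(10)

# Hold out 20% of the rows for testing; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

print(X_train.shape, X_test.shape)
```

Rows of `X` and `y` are shuffled together, so each feature row stays paired with its target.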
Build a linear regression model with sklearn, and use the attributes and functions of the model object: building the model with `.fit()`; checking the model with `.intercept_` and `.coef_`; using the model with `.predict()`; analysing the model with `.score()` and `mean_squared_error()`
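The calls listed above could fit together roughly as follows. This is a sketch on synthetic data (a made-up line with slope 3 and intercept 2 plus noise), not the lab's dataset:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Hypothetical data: y = 3x + 2 plus Gaussian noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Building the model
model = LinearRegression()
model.fit(X_train, y_train)

# Checking the model: fitted intercept and one coefficient per feature
print(model.intercept_, model.coef_)

# Using the model
y_pred = model.predict(X_test)

# Analysing the model: R^2 and mean squared error on the held-out test set
r2 = model.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print(r2, mse)
```

Note that `mean_squared_error()` is a function from `sklearn.metrics` rather than a method of the model object.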
1. More about linear regression
Assumptions
Errors are independent of each other
Errors follow a normal distribution
Errors have constant variance (homoscedasticity)
Diagnostic Plots
The performance of a regression-based model is contingent on the data's adherence to the assumptions made
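A rough sketch of the quantities behind two common diagnostic plots, residuals vs fitted values and a normal Q-Q plot. The data here is synthetic, and the use of `scipy.stats.probplot` for the Q-Q values is one possible choice, not necessarily what the lab uses:

```python
import numpy as np
from scipy import stats
from sklearn.linear_model import LinearRegression

# Hypothetical data that satisfies the assumptions: linear trend, normal errors
rng = np.random.default_rng(1)
X = rng.uniform(0, 10, size=(200, 1))
y = 1.5 * X[:, 0] + 4.0 + rng.normal(0, 1, size=200)

model = LinearRegression().fit(X, y)
fitted = model.predict(X)
residuals = y - fitted

# Residuals vs fitted: plotting (fitted, residuals) should show an even band
# around zero, with no curve (non-linearity) or funnel (non-constant variance)
print(residuals.mean(), residuals.std())

# Normal Q-Q: ordered residuals against theoretical normal quantiles;
# r close to 1 means the points lie near a straight line
(theoretical_q, ordered_resid), (slope, intercept, r) = stats.probplot(
    residuals, dist="norm"
)
print(r)
```

Passing `fitted` and `residuals`, or `theoretical_q` and `ordered_resid`, to a scatter plot gives the two diagnostic plots.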
If one class is much larger than the others, make sure the test set keeps the same class ratio
Look for the `stratify` option in `train_test_split()`
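A minimal sketch of a stratified split, assuming a made-up label array with a 90:10 class imbalance:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 90 samples of class 0, 10 of class 1
X = np.arange(100).reshape(100, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the class proportions the same in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Fraction of class 1 in each split
print(y_train.mean(), y_test.mean())
```

Without `stratify`, a random split of such data could leave the test set with very few samples of the smaller class, or none.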
k-fold cross validation
If your dataset is small and splitting it into train/validation/test sets leaves each set very small, consider performing cross-validation instead
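A sketch of 5-fold cross-validation on a small synthetic dataset. The choice of 5 folds, the shuffling, and the data itself are assumptions made for illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical small dataset: 30 samples from y = 2x + 1 plus noise
rng = np.random.default_rng(2)
X = rng.uniform(0, 10, size=(30, 1))
y = 2.0 * X[:, 0] + 1.0 + rng.normal(0, 1, size=30)

# Each of the 5 folds serves once as the validation set
# while the other 4 folds are used for training
cv = KFold(n_splits=5, shuffle=True, random_state=0)

# For a regressor, the default score is R^2 (the same as model.score())
scores = cross_val_score(LinearRegression(), X, y, cv=cv)
print(scores, scores.mean())
```

Every sample is used for both training and validation across the folds, which makes better use of a small dataset than a single fixed split.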
Check the 'Assignments' tab in the lab's course site on NTULearn.
Remember that Lab 5 is graded too!
(and due 48 hours after the end of the Lab)
References
This set of slides is made using reveal.js.
It's really easy to make a basic set of slides (just HTML) and you can consider using it for simple (tech) presentations!
For more advanced customisation you do need CSS and JS, but scripts are easy to find online and reveal.js has good documentation.