CX1115 Lab 2: Basic Statistics & dataviz

Lab 1 Review

Data Science pipeline
Using Jupyter Notebooks, shortcuts, JupyterLab

Many file formats: .csv, .txt, .xls, .json, .html (and .data, etc.)
DataFrame attributes and functions ; .info() vs .describe()
Dangers of pd.read_csv(): automatic inference of data types
Override default params: df.head(n=20);
pd.read_csv(data, index_col=0,header=None,sep='\t')
df.columns to overwrite the column headers for the Olympic tables
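
The `pd.read_csv()` points above can be sketched as follows. This is a minimal illustration only: the tab-separated table, its column names and its values are made up here and are not the lab's Olympic data.

```python
import io
import pandas as pd

# Hypothetical tab-separated data with no header row; the first column is an ID.
raw = "001\tSingapore\t5\n002\tMalaysia\t3\n003\tThailand\t7\n"

# Default inference: the first row is consumed as the header
# and the remaining IDs are parsed as integers (leading zeros are lost).
inferred = pd.read_csv(io.StringIO(raw), sep="\t")

# Overriding the defaults: no header row, and read every column as a string first.
df = pd.read_csv(io.StringIO(raw), sep="\t", header=None, dtype=str)
df.columns = ["id", "country", "medals"]   # overwrite the column headers
df["medals"] = df["medals"].astype(int)    # convert types explicitly
df = df.set_index("id")                    # similar in effect to index_col=0

print(inferred.columns.tolist())
print(df.head(n=2))
df.info()             # dtypes, non-null counts, memory usage
print(df.describe())  # summary statistics of the numeric columns
```

Reading everything as strings and converting afterwards is one possible way to avoid surprises from automatic type inference; passing a per-column `dtype` dictionary is another.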

Web scraping: pd.read_html() and beyond
Learn (how and what) to Google, try to remember basic syntax

Overview 🗺️

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

(Not uni-directional, rather iterative in real-life settings)

Purpose: familiarise yourself with the data, know which features might be relevant, prepare data for modelling
Keep the end goal of data analysis in mind (e.g. predict house prices)
If your mind is set on DS, your (personal) goal should be to have most of these attributes and functions at your fingertips

Let's move to the Preparatory Notebooks and Discussion questions!

2 key learning points in Lab 2

Plots for numeric variables and how to interpret them:
boxplot, hist, violin, joint, pair/scatter, heatmap
Attributes and Functions for data exploration
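
To help with interpreting a boxplot, here is a rough sketch of the numbers it typically draws: the quartiles, the interquartile range (IQR) and the 1.5 × IQR fences beyond which points are shown as outliers. The price values are hypothetical, and the 1.5 × IQR rule is assumed here as the common plotting default.

```python
import pandas as pd

# Hypothetical house prices (in thousands); 900 is a deliberately extreme value.
prices = pd.Series([200, 220, 250, 260, 280, 300, 320, 900], name="price")

# The box spans Q1 to Q3, with a line at the median.
q1, median, q3 = prices.quantile([0.25, 0.5, 0.75])
iqr = q3 - q1

# Points outside these fences are usually drawn as individual outlier markers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print("outliers:", outliers.tolist())
print("mean above median (hint of right skew):", prices.mean() > median)
```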

1. Some common viz mistakes

A guide on the limits of our visual system and how to not fall for traps.
(slide numbers in brackets)

Do not put graphs with different scales side by side or together (34)
Lengths are easier to compare than areas (48)
Avoid 3D if the data isn't actually 3D (62-63)
Avoid chartjunk (64-69)
Small multiples are better than (uncontrollable) animations (97,98)
Do not use too many colours - follow convention ; use different sets of colours for different data
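
A few of the points above (a common scale, lengths rather than areas, small multiples, restrained colour) can be sketched with matplotlib. The regions and sales figures below are invented for illustration; this is one possible layout, not the approach from the referenced slides.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical monthly sales for three regions.
sales = {
    "North": [10, 12, 15, 14],
    "South": [40, 42, 38, 45],
    "East": [5, 6, 4, 7],
}
months = ["Jan", "Feb", "Mar", "Apr"]

# Small multiples: one panel per region.
# sharey=True forces the same y-scale on every panel, so bar lengths are comparable.
fig, axes = plt.subplots(1, len(sales), sharey=True, figsize=(9, 3))
for ax, (region, values) in zip(axes, sales.items()):
    ax.bar(months, values, color="tab:blue")  # a single colour: panels differ only in data
    ax.set_title(region)
axes[0].set_ylabel("Sales")
fig.tight_layout()
```

Without `sharey=True`, each panel would pick its own y-range and the "East" bars could look as tall as the "South" bars.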

2. Attributes and Functions
for data exploration

`.corr()`,
`pd.concat()`, `.index`, `.reindex()`,
`.select_dtypes()`
`.drop()`
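
A minimal sketch of how these fit together during exploration, assuming a tiny made-up housing table (the column names and values are hypothetical):

```python
import pandas as pd

houses = pd.DataFrame(
    {
        "area": [50, 80, 100, 120],
        "price": [200, 310, 400, 490],
        "town": ["A", "B", "A", "C"],
    }
)
more = pd.DataFrame({"area": [150], "price": [600], "town": ["B"]})

# concat is a top-level pandas function, not a DataFrame method.
combined = pd.concat([houses, more], ignore_index=True)

numeric = combined.select_dtypes(include="number")  # keep numeric columns only
corr = numeric.corr()                               # pairwise Pearson correlation

no_town = combined.drop(columns=["town"])  # drop a column
no_first = combined.drop(index=0)          # drop a row by its index label

print(no_first.index.tolist())  # the remaining row labels

# .reindex() conforms the rows to the given labels;
# labels that no longer exist (here 0) come back as rows of NaN.
realigned = no_first.reindex([0, 1, 2])
print(corr)
print(realigned)
```

None of these calls modifies `combined` in place; each returns a new object, which is why the results are assigned to new names.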

Series vs DataFrame
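
A short illustration of the difference, using a hypothetical two-column table: selecting with single brackets returns a Series, while double brackets return a one-column DataFrame.

```python
import pandas as pd

df = pd.DataFrame({"area": [50, 80], "price": [200, 310]})

s = df["area"]      # single brackets -> Series (one-dimensional)
sub = df[["area"]]  # double brackets -> DataFrame (two-dimensional, one column)

print(type(s).__name__, s.shape)
print(type(sub).__name__, sub.shape)
```

This matters in practice because some functions expect two-dimensional input and will complain when handed a Series.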

Pandas has some 'duplicated' functions: see this link and this link.

Lab 2 Deliverables

None! Just make sure you go through the materials. :)

Data Visualisation, in greater detail

Much more than just default plots from Seaborn.

Relevant at 2 stages of the pipeline: exploration and reporting

Know the alternatives available for different levels of expertise: PowerBI (free), Tableau ; matplotlib / Seaborn / Plotly / Dash / ggplot (R) ; d3.js gallery and tutorial
Choose colours wisely: basics, use palettes or even create your own
Which type of visualisation to use? Here's a guide and here's another and here's one with code in Python.
A portal to another world of data science: full time dataviz!
Here's a uni course on data visualisation

One more thing!

Keep a concise set of notes - something that you can quickly read through days before an interview to recall what you’ve learnt

Uni teaches you a lot, but knowledge has a short half-life
Use spaced repetition: read before class ; use the lecture to fill knowledge gaps, do the tutorial during class (see direct link and application) and clear doubts immediately ; condense notes at the end of each week and sem ; test yourself before the next class (on the previous class's content) via simple recall questions
Having a side project helps you to practise applying the concepts
In the weeks before internship applications, practise interview questions

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML) and you can consider using it for simple (tech) presentations! For more advanced customisation you do need CSS and JS, but scripts are easy to find online and reveal.js has good documentation.

There are also more alternatives here.

Slide 1 image