CX1115 Lab 2: Basic Statistics & dataviz

Lab 1 Review

  • Data Science pipeline
  • Using Jupyter Notebooks, shortcuts, JupyterLab

  • Many file formats: .csv, .txt, .xls, .json, .html (and .data, etc.)
  • DataFrame attributes and functions; .info() vs .describe()
  • Dangers of pd.read_csv(): automatic inference of data types
  • Override default params: df.head(n=20);
    pd.read_csv(data, index_col=0,header=None,sep='\t')
  • df.columns to overwrite the column headers for the Olympic tables

  • Web scraping: pd.read_html() and beyond
  • Learn (how and what) to Google, try to remember basic syntax
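
A minimal sketch of the read_csv points above. The data, column names and values are hypothetical, made up only to illustrate type inference and overriding defaults; they are not from the Lab 1 datasets.

```python
import io
import pandas as pd

# Hypothetical CSV: postal codes with leading zeros (illustrative values only).
raw = "postal_code,price\n018956,12.5\n049315,7.0\n"

# Default behaviour: pandas infers postal_code as an integer column,
# so the leading zeros are silently dropped.
inferred = pd.read_csv(io.StringIO(raw))

# Overriding the inferred type keeps the codes as text.
explicit = pd.read_csv(io.StringIO(raw), dtype={"postal_code": str})

# .info() prints column dtypes and non-null counts;
# .describe() returns summary statistics for the numeric columns.
inferred.info()
summary = inferred.describe()  # includes (meaningless) statistics for postal_code

# Overriding other defaults: tab-separated text with no header row,
# using the first column as the index.
tsv = "a\t1\t2\nb\t3\t4\n"
table = pd.read_csv(io.StringIO(tsv), index_col=0, header=None, sep="\t")
table.columns = ["x", "y"]  # overwrite the default integer column labels

print(inferred["postal_code"].tolist())
print(explicit["postal_code"].tolist())
print(table.head(n=20))
```

Checking .info() right after loading is a cheap way to catch an inferred type that does not match what the column means.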

Overview 🗺️

(Business) Problem ↔ Data Acquisition ↔ Data Cleaning ↔ Data Exploration / Visualisation ↔ Modelling ↔ Reporting ↔ Deployment

(Not uni-directional, rather iterative in real-life settings)


  • Purpose: familiarise yourself with the data, know which features might be relevant, prepare data for modelling
  • Keep the end goal of data analysis in mind (e.g. predict house prices)
  • If your mind is set on DS, your (personal) goal should be to have most of these attributes and functions at your fingertips

Let's move to the Preparatory Notebooks and Discussion questions!

2 key learning points in Lab 2

  1. Plots for numeric variables and how to interpret them:
    boxplot, hist, violin, joint, pair/scatter, heatmap
  2. Attributes and Functions for data exploration

1. Some common viz mistakes

A guide on the limits of our visual system and how to not fall for traps.
(slide numbers in brackets)

  • Do not put graphs with different scales side by side or together (34)
  • Lengths are easier to compare than areas (48)
  • Avoid 3D if the data isn't actually 3D (62-63)
  • Avoid chartjunk (64-69)
  • Small multiples are better than (uncontrollable) animations (97,98)
  • Do not use too many colours; follow convention and use different sets of colours for different data
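
One way to avoid the different-scales trap in side-by-side graphs is to force the panels to share an axis. The sketch below uses matplotlib's `sharey=True`; the group names and sales figures are made up for illustration.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Hypothetical values: the second year is on a much larger scale.
groups = ["A", "B", "C"]
sales_2022 = [10, 12, 9]
sales_2023 = [40, 55, 38]

# sharey=True makes both panels use the same y-axis limits,
# so bar lengths can be compared directly across panels.
fig, (left, right) = plt.subplots(1, 2, sharey=True, figsize=(8, 3))
left.bar(groups, sales_2022)
left.set_title("2022")
right.bar(groups, sales_2023)
right.set_title("2023")
left.set_ylabel("sales (hypothetical units)")
```

Without `sharey=True`, each panel would autoscale on its own and the two years would look similar in height despite the large difference.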

2. Attributes and Functions for data exploration

`.corr()`, `pd.concat()`, `.index`, `.reindex()`,
`.select_dtypes()`, `.drop()`

  • Series vs DataFrame
  • Pandas has some 'duplicated' functions: see this link and this link.
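
The attributes and functions above can be sketched on a tiny DataFrame. The column names, index labels and values here are hypothetical and chosen only to keep the output easy to read.

```python
import pandas as pd

# Hypothetical data with a non-default index.
df = pd.DataFrame(
    {"area": [50, 80, 120, 65],
     "price": [300, 450, 700, 380],
     "town": ["A", "B", "A", "C"]},
    index=[10, 11, 12, 13],
)

# Series vs DataFrame: single brackets give a Series, double brackets a DataFrame.
price = df["price"]
price_df = df[["price"]]

# Keep only numeric columns, then compute pairwise correlations.
numeric = df.select_dtypes(include="number")
corr = numeric.corr()

# .drop() returns a new object; the original df is unchanged.
no_town = df.drop(columns="town")
no_first_row = df.drop(index=10)

# pd.concat() is a top-level pandas function, not a DataFrame method.
extra = pd.DataFrame({"area": [90], "price": [500], "town": ["B"]}, index=[14])
combined = pd.concat([df, extra])

# .index exposes the row labels; .reindex() reorders them and
# fills labels that do not exist (99 here) with NaN.
print(df.index.tolist())
reordered = df.reindex([13, 12, 11, 10, 99])
print(reordered)
```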

Lab 2 Deliverables

None! Just make sure you go through the materials. :)

Data Visualisation, in greater detail

Much more than just default plots from Seaborn.

Relevant at 2 stages of the pipeline: exploration and reporting

One more thing!

Keep a concise set of notes - something that you can quickly read through days before an interview to recall what you’ve learnt

  • Uni teaches you a lot, but knowledge has a short half-life
  • Use spaced repetition:
    • Read before class
    • Lecture = time to fill knowledge gaps
    • Do the tutorial during class (see direct link and application) and clear doubts immediately
    • Condense notes at the end of each week and semester
    • Test yourself before the next class (on the previous class's content) via simple recall questions
  • Having a side project helps you to practice applying the concepts
  • In the weeks before internship applications, practise interview questions

References

This set of slides is made using reveal.js. A basic set of slides is really easy to make (just HTML), so you can consider using it for simple (tech) presentations! More advanced customisation does need CSS and JS, but scripts are easy to find online and reveal.js has good documentation.

There are also more alternatives here.
