CX1115 Lab 1: Data Acquisition

(models are cool, but many other things are also important)

Some admin matters

  • We will begin around 8.35am - please try to come on time
  • Briefing: scaffolding for labs + extra stuff + experience sharing
  • Main aim is to give you time to code and gain practical experience
  • Expectations: It's an introductory course, so the main content covers the basics. If it's too simple, feel free to work on the extra stuff (optional, only if you're done with the basics).
  • Let me know what you want to learn from the lab sessions: https://forms.gle/VJpwE6R53AFKiv7GA

Overview 🗺️

(Business) Problem → Data Acquisition → Data Cleaning → Visualisation → Modelling → Reporting → Deployment → Scaling up
Two main approaches:
  1. Hypothesis-driven data exploration: Collect the relevant data after forming a hypothesis (from domain knowledge / experience)
  2. Quite often, companies simply have datasets lying around, and your boss wants to see what insights you can extract

The former is generally better (when you collect the data yourself, you have more control over data quality); the latter can still pay off if the data really does contain some gems to unearth, but such data is usually harder to clean

3 key learning points in Lab 1

  1. Jupyter (check out JupyterLab too)
  2. Basic data exploration: `.describe()`, `.info()`
  3. Data acquisition (web scraping)

1. Jupyter

Pros

  • Interactive, instant feedback → good for individual exploratory work
  • Presents text and plots together → good for presenting, if content flows linearly

Cons

  • Easy to mess up state if you accidentally reuse variables, especially in long notebooks (or when jumping up and down while editing)
  • Hard to diff with Git (you can convert first, but that's extra work!)

Takeaways

  • Disciplined usage - stick to individual exploration or reporting (or teaching!)
  • For anything else, create / convert to .py or .txt (e.g. nbconvert, jupytext - see the commands below)
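
Both tools can do the conversion from the command line; a rough sketch, assuming they're installed (exact flags may vary by version):

    jupyter nbconvert --to script notebook.ipynb   # writes notebook.py
    jupytext --to py notebook.ipynb                # same idea; jupytext can also keep .ipynb and .py in sync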

2. Basic data exploration


							
    import pandas as pd

    df = pd.DataFrame(data)   # `data` stands in for your raw input (dict, list of records, ...)
    df.info()                 # column names, dtypes, non-null counts, memory usage
    df.describe()             # summary statistics for the numeric columns
    df.head()                 # first 5 rows; df.tail() shows the last 5

Watch out for the default parameters used by the functions, e.g. what's the default delimiter of `pd.read_csv()`?

  • Choose delimiters wisely, if possible. Many datasets contain characters commonly used as delimiters (e.g. `,`)
  • One way to deal with this is to set `quotechar`; another is to use an obscure delimiter like `|` (see the sketch below)
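
A minimal sketch of both options, assuming hypothetical files `sales.csv` (comma-delimited, with quoted fields) and `sales.psv` (pipe-delimited):

    import pandas as pd

    # Fields containing commas are safe as long as they are quoted
    df = pd.read_csv("sales.csv", sep=",", quotechar='"')

    # If you control the export format, an unusual delimiter sidesteps the issue
    df = pd.read_csv("sales.psv", sep="|")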

Also watch out for subtle behaviors: e.g. Pandas infers each column's data type if you don't specify it. The inference isn't always correct - it's especially dangerous when a Categorical variable is inferred as Numeric.
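
As a sketch (the `customers.csv` file and `zip_code` column here are hypothetical): left to inference, zip codes like 02139 get parsed as integers and silently lose the leading zero.

    import pandas as pd

    # Force the column to be read as a string, then treat it as categorical
    df = pd.read_csv("customers.csv", dtype={"zip_code": "string"})
    df["zip_code"] = df["zip_code"].astype("category")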

3. Data acquisition



    import pandas as pd

    tables = pd.read_html(html_code)  # returns a list of DataFrames, one per <table> found
    df = tables[0]                    # pick the table you want

You will need to set some parameters in `read_html()` - a sketch follows.
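
For example, here is a hedged sketch against a Wikipedia page (the URL, `match` pattern, and `attrs` filter are illustrative assumptions - adjust them to the table you actually want):

    import pandas as pd

    url = "https://en.wikipedia.org/wiki/Demographics_of_Singapore"  # hypothetical target page
    # `match` keeps only tables whose text matches the regex;
    # `attrs` filters on the <table> tag's HTML attributes
    tables = pd.read_html(url, match="Population", attrs={"class": "wikitable"})
    df = tables[0]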

This seems very simple, but it hides an enormous amount of complexity, and it only covers a small part of web scraping (Wikipedia is a static site). Check out the slides after this!

If you scrape data at a larger scale, remember to adopt an Agile approach: start with a small sample, check the output, then iterate and scale up.

Lab 1 Deliverables

None! Just make sure you go through the materials. :)

Web scraping, in greater detail

Much more than just `pd.read_html()`.
  • Actual web scraping is slightly more complicated
    (need to know some HTML, regex is useful)
  • Static sites vs dynamic sites (see the sketch after this list)
  • Retrieving (text from) images and PDF files
  • Web scrapers break when the website changes; for major changes there's pretty much no workaround other than modifying the code
  • Refer to this set of slides for more
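
To make the static-site case concrete, here is a minimal sketch using requests and BeautifulSoup (the URL and the `.price` CSS class are hypothetical - real selectors depend on the site's HTML):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/listings")  # hypothetical static page
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # Collect the text of every element with the (hypothetical) class "price"
    prices = [tag.get_text(strip=True) for tag in soup.select(".price")]

Dynamic sites render their content with JavaScript, so a plain GET returns very little; those typically call for browser automation tools such as Selenium or Playwright.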

One more thing!

  • If you want to be a data scientist, make sure you have a specific domain you're interested in (e.g. finance, healthcare). Data science is a generic skill - to create useful insights, you need domain knowledge
    • One way to get that is via internships! If you want to do a DS internship, make sure you have a piece of work to talk about (coursework, or a personal project)
    • Try to do at least 2 internships (in the same or different industries)

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML), and you can consider using it for simple (tech) presentations! For more advanced customization you do need CSS and JS, but snippets are easy to find online and it has good documentation.

There are also more alternatives here.

An alternative view on Jupyter Notebooks.
