CX1115 Lab 1: Data Acquisition

(models are cool, but many other things are also important)

Some admin matters

  • We will begin around 8.35am - please try to come on time
  • Briefing: scaffolding for labs + extra stuff + experience sharing
  • Main aim is to give you time to code and gain practical experience
  • Expectations: It's an introductory course, so the main content covers the basics. If it's too simple, feel free to work on the extra stuff (optional, only if you're done with the basics).
  • Let me know what you want to learn from the lab sessions: https://forms.gle/VJpwE6R53AFKiv7GA

Overview 🗺️

(Business) Problem → Data Acquisition → Data Cleaning → Visualisation → Modelling → Reporting → Deployment → Scaling up
Two main approaches:
  1. Hypothesis-driven data exploration: Collect the relevant data after forming a hypothesis (from domain knowledge / experience)
  2. Quite often, companies simply have datasets lying around, and your boss wants to see what insights you can extract

The former is generally better (when you collect the data yourself, you have more control over data quality); the latter can still pay off if the data really does contain some gems to unearth, but such data is usually harder to clean

3 key learning points in Lab 1

  1. Jupyter (check out JupyterLab too)
  2. Basic data exploration: `.describe()`, `.info()`
  3. Data acquisition (web scraping)

1. Jupyter

Pros

  • Interactive, instant feedback → good for individual exploratory work
  • Presents text and plots together → good for presenting, if content flows linearly

Cons

  • Easy to mess up state if you accidentally reuse variables, especially in long notebooks (or when jumping up and down while editing)
  • Hard to diff with Git (you can convert first, but that's extra work!)

Takeaways

  • Disciplined usage - stick to individual exploration or reporting (or teaching!)
  • For anything else, create / convert to .py or .txt (e.g. nbconvert, jupytext - see the commands below)
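
Both tools can do the conversion from the command line; a rough sketch, assuming they're installed (exact flags may vary by version):

    jupyter nbconvert --to script notebook.ipynb   # writes notebook.py
    jupytext --to py notebook.ipynb                # same idea; jupytext can also keep .ipynb and .py in sync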

2. Basic data exploration


							
    import pandas as pd

    df = pd.DataFrame(data)   # `data` stands in for your raw input (dict, list of records, ...)
    df.info()                 # column names, dtypes, non-null counts, memory usage
    df.describe()             # summary statistics for the numeric columns
    df.head()                 # first 5 rows; df.tail() shows the last 5

Watch out for the default parameters used by the functions, e.g. what's the default delimiter of `pd.read_csv()`?

  • Choose delimiters wisely, if possible. Many datasets contain characters commonly used as delimiters (e.g. `,`)
  • One way to deal with this is to set `quotechar`; another is to use an obscure delimiter like `|` (see the sketch below)
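
A minimal sketch of both options, assuming hypothetical files `sales.csv` (comma-delimited, with quoted fields) and `sales.psv` (pipe-delimited):

    import pandas as pd

    # Fields containing commas are safe as long as they are quoted
    df = pd.read_csv("sales.csv", sep=",", quotechar='"')

    # If you control the export format, an unusual delimiter sidesteps the issue
    df = pd.read_csv("sales.psv", sep="|")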

Also watch out for subtle behaviors: e.g. Pandas infers each column's data type if you don't specify it. The inference isn't always correct - it's especially dangerous when a Categorical variable is inferred as Numeric.
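
As a sketch (the `customers.csv` file and `zip_code` column here are hypothetical): left to inference, zip codes like 02139 get parsed as integers and silently lose the leading zero.

    import pandas as pd

    # Force the column to be read as a string, then treat it as categorical
    df = pd.read_csv("customers.csv", dtype={"zip_code": "string"})
    df["zip_code"] = df["zip_code"].astype("category")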

3. Data acquisition



    import pandas as pd

    tables = pd.read_html(html_code)  # returns a list of DataFrames, one per <table> found
    df = tables[0]                    # pick the table you want

You will need to set some parameters in `read_html()` - a sketch follows.
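
For example, here is a hedged sketch against a Wikipedia page (the URL, `match` pattern, and `attrs` filter are illustrative assumptions - adjust them to the table you actually want):

    import pandas as pd

    url = "https://en.wikipedia.org/wiki/Demographics_of_Singapore"  # hypothetical target page
    # `match` keeps only tables whose text matches the regex;
    # `attrs` filters on the <table> tag's HTML attributes
    tables = pd.read_html(url, match="Population", attrs={"class": "wikitable"})
    df = tables[0]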

This seems very simple, but it hides an enormous amount of complexity, and it only covers a small part of web scraping (Wikipedia is a static site). Check out the slides after this!

If you scrape data at a larger scale, remember to adopt an Agile approach: start with a small sample, check the output, then iterate and scale up.

Lab 1 Deliverables

None! Just make sure you go through the materials. :)

Web scraping, in greater detail

Much more than just `pd.read_html()`.
  • Actual web scraping is slightly more complicated
    (need to know some HTML, regex is useful)
  • Static sites vs dynamic sites (see the sketch after this list)
  • Retrieving (text from) images and PDF files
  • Web scrapers break when the website changes; for major changes there's pretty much no workaround other than modifying the code
  • Refer to this set of slides for more
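
To make the static-site case concrete, here is a minimal sketch using requests and BeautifulSoup (the URL and the `.price` CSS class are hypothetical - real selectors depend on the site's HTML):

    import requests
    from bs4 import BeautifulSoup

    resp = requests.get("https://example.com/listings")  # hypothetical static page
    resp.raise_for_status()

    soup = BeautifulSoup(resp.text, "html.parser")
    # Collect the text of every element with the (hypothetical) class "price"
    prices = [tag.get_text(strip=True) for tag in soup.select(".price")]

Dynamic sites render their content with JavaScript, so a plain GET returns very little; those typically call for browser automation tools such as Selenium or Playwright.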

One more thing!

  • If you want to be a data scientist, make sure you have a specific domain you're interested in (e.g. finance, healthcare). Data science is a generic skill - to create useful insights, you need domain knowledge
    • One way to get that is via internships! If you want to do a DS internship, make sure you have a piece of work to talk about (coursework, or a personal project)
    • Try to do at least 2 internships (in the same or different industries)

References

This set of slides is made using reveal.js. It's really easy to make a basic set of slides (just HTML), and you can consider using it for simple (tech) presentations! For more advanced customization you do need CSS and JS, but snippets are easy to find online and it has good documentation.

There are also more alternatives here.

An alternative view on Jupyter Notebooks.
