Survey | Slides
The former is generally better: when you collect the data yourself, you have more scope for controlling its quality. The latter can still be useful if the data hides some gems worth mining, but it is usually harder to clean.
Pros
Cons
Takeaways
```python
import pandas as pd

# Toy data for illustration
data = {"name": ["Ann", "Bob"], "score": [0.9, 0.7]}
df = pd.DataFrame(data)

df.info()      # column dtypes and non-null counts
df.describe()  # summary statistics for the numeric columns
df.head()      # first few rows; df.tail() for the last few
```
Watch out for the default parameters these functions use, e.g. what is the default delimiter of `pd.read_csv()`?
Also watch out for subtle behaviors: pandas infers each column's data type if you don't specify it. The inference is not always correct, and it is especially dangerous when a categorical variable (say, a ZIP code) is inferred as numeric; see the sketch below.
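A minimal sketch guarding against both pitfalls; the file name and the `zip_code` column are hypothetical, for illustration only:

```python
import pandas as pd

# Hypothetical semicolon-delimited file with a categorical code column.
df = pd.read_csv(
    "data.csv",
    sep=";",                         # the default is ","; override it explicitly
    dtype={"zip_code": "category"},  # stop a categorical code being inferred as numeric
)
```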
```python
import pandas as pd

tables = pd.read_html(html_code)  # returns a list of DataFrames, one per <table>
df = tables[0]                    # pick out the table you need
```
You will usually need to set some parameters in `read_html()`, e.g. `match`, `header`, or `attrs`, to select the right table, as in the sketch below.
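A hedged example; the URL and the `match` pattern are illustrative and assume the page still serves such a table:

```python
import pandas as pd

# `match` keeps only tables whose text matches the pattern;
# `header=0` uses the first row as column names.
url = "https://en.wikipedia.org/wiki/List_of_countries_by_population_(United_Nations)"
tables = pd.read_html(url, match="Population", header=0)
df = tables[0]
```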
This seems very simple, but it hides an enormous amount of complexity, and it only covers a small part of web scraping (Wikipedia is a static site). Check out the slides after this!
If you scrape data at a larger scale, remember to adopt an Agile approach: start with a small sample of pages, validate the output, then scale up incrementally (see the sketch below).
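A minimal sketch of one such incremental loop, assuming a hypothetical list `page_urls` and checkpointing every few pages so partial progress survives a crash:

```python
import time
import pandas as pd

frames = []
for i, url in enumerate(page_urls):     # page_urls is assumed to exist
    frames.extend(pd.read_html(url))    # collect every table on the page
    time.sleep(1)                       # be polite: throttle requests
    if (i + 1) % 10 == 0:               # checkpoint every 10 pages
        pd.concat(frames).to_csv(f"checkpoint_{i + 1}.csv", index=False)
```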
There are also more alternatives here.
An alternative view on Jupyter Notebooks.