CX1115 Introduction to DS and AI

AY20/21, Semester 2

TA for labs on introduction to DS and AI, taught via Python


Quicklinks for slides

Here are some additional resources that build on top of the lab materials. They go beyond what’s covered in the lab manual, but these are questions you should think about. Answers are hidden ;)

(Hint: this is written using Markdown - you can either check out the GitHub repo or view source)

Lab 1

In the text below, df refers to any arbitrary DataFrame.

M1 DataAcqusition.ipynb

  1. Look at the documentation for pd.DataFrame(data) to better understand what it can / cannot do.

    Series, arrays, constants, dataclass or list-like objects. If data is a dict, column order follows insertion-order.

    In practice you’ll mostly pass lists, dictionaries, pd.Series or np.array. List-like objects include tuples, but not sets - sets are rejected because they are unordered. Do check out dataclass, but it’s relatively new and not that widely used yet (you might need to know OOP to fully appreciate it too). There’s a small sketch of the common constructors after this list.

  2. All the data imports in the notebook work well and load into the df nicely, but real-life data is messier than this! We’ll see more examples along the way.

  3. When importing HTML tables into a df, why are there NaNs? Open the link, right-click on any of the tables (e.g. Cast) and click Inspect to look at the code.

  4. Stick with the ways of accessing information shown in the notebook, i.e. canteens_df["Name"] and canteens_df.iloc[0].

    You might chance upon other ways of accessing columns (e.g. canteens_df.Name) or individual cells (canteens_df.iat[0, 0]). The former doesn’t work with column names that contain spaces. For the latter, .iat is slightly faster than .iloc, but if performance is really an issue you should stick to NumPy - a df is generally much slower. See this link for more details, and the sketch after this list.

  5. What exactly does class 'pandas.core.frame.DataFrame' mean when you type type(df)?

    DataFrame is a class defined in a file called frame.py in a folder named core which is in a folder named pandas. That’s just how the Pandas developers organised their code.

    You can see the file in the pandas GitHub repository.
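
A minimal sketch pulling together points 1 and 4 (the values below are made up; only the Name column comes from the notebook’s canteens_df):

    import pandas as pd

    # construct a DataFrame from a dict of lists - column order follows insertion order
    canteens_df = pd.DataFrame({
        "Name": ["North Spine", "Koufu"],
        "Stalls": [20, 15],
    })

    # column access: brackets always work; attribute access fails for names with spaces
    print(canteens_df["Name"])
    print(canteens_df.Name)        # same Series as above

    # row access by position, and single-cell access by (row, column) position
    print(canteens_df.iloc[0])     # first row as a Series
    print(canteens_df.iat[0, 1])   # one cell; slightly faster than .iloc[0, 1]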

Exercise1.ipynb

You should be able to complete the two problems by yourself.

For the bonus problem, here are the column names of the UCI dataset to spare you from some manual entry:

['age', 'workclass', 'fnlwgt', 'education', 'education-num', 'marital-status', 'occupation', 'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

You’ll still need to take care of some parameters in read_csv() - see the sketch below.

The definition of fnlwgt isn’t very clear - my interpretation is that it’s an approximation of the number of people in the US with demographic info similar to that row. If you try to sum the column up, you’ll get 6 billion - that’s probably because there’s quite a lot of overlap.
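
Here’s a rough sketch of how the import might look - the file name and the parameter choices (no header row, "?" as the missing-value marker, stray spaces after commas) are assumptions about the raw file, so verify them against what you actually downloaded:

    import pandas as pd

    cols = ['age', 'workclass', 'fnlwgt', 'education', 'education-num',
            'marital-status', 'occupation', 'relationship', 'race', 'sex',
            'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 'income']

    # header=None because the raw file has no header row (assumption)
    # skipinitialspace strips the space after each comma
    # na_values="?" turns the dataset's "?" placeholders into NaN
    adult_df = pd.read_csv("adult.data", names=cols, header=None,
                           skipinitialspace=True, na_values="?")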

  1. Notice that there are several columns with NaN. We will learn how to deal with this in later classes (generally we drop the column if there are too many NaNs; if there are only a few missing cells we can fill them in - there are many ways to do so and what’s appropriate depends on the dataset). A rough sketch of both strategies is included after this list.

  2. As usual, be very careful about the data type inferred by Pandas - make sure you double check its correctness.

  3. For the UCI Adult dataset, some columns are rather redundant (e.g. education-num) - these can be dropped.
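
As mentioned in point 1, a minimal sketch of the drop-vs-fill idea (df is any DataFrame, as defined above; the 50% threshold and the choice of median are arbitrary):

    import pandas as pd

    na_fraction = df.isna().mean()    # fraction of missing cells per column

    # too many NaNs: drop the whole column
    df = df.drop(columns=na_fraction[na_fraction > 0.5].index)

    # only a few missing cells: fill them in, e.g. numeric columns with the median
    for col in df.columns:
        if df[col].isna().any() and pd.api.types.is_numeric_dtype(df[col]):
            df[col] = df[col].fillna(df[col].median())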

Lab 2

M2 BasicStatistics.ipynb

  1. The notebook shows many types of plots (boxplot, histogram, kernel density, violin plot, joint plot, heatmap, pair plot). When choosing which kind of plot to use, bear in mind the level of expertise of your target audience (not just you!). For example, violin plots seem more efficient (boxplot + kernel density plot), but a person without DS training will not understand one immediately without guidance.

  2. In the first set of plots for the bi-variate statistics, we can see various plots for both HP and Attack. Be careful not to draw insights just by comparing the plots in this form - the axes have different values! Make sure to set the axes to be the same range if you want to make a visual comparison.

     # assuming axes is the 2-D grid of Axes returned by plt.subplots()
     for ax in axes:
         for x in ax:
             x.set_xlim(0, 255)
    
  3. Watch out for the interpretation of correlation values. Here, we are looking at Pearson’s correlation (the default option, since we didn’t specify which type). So a correlation score of 0.02 between Defense and Speed means that there is essentially no linear correlation - i.e. we can’t explain the relationship between the features with a straight line (\(y = ax + b\) for some constants \(a\) and \(b\)). Looking at the pair plots gives some intuition: in the plot between Defense and Speed, there are no Pokemon with both very high Defense and Speed. A majority of them are actually close to the mean (~70), and both features have a rather high standard deviation (~30). Looking at the formula for correlation will also give you an idea of why having either X or Y close to the mean (and extremely few points on the top right) leads to a small correlation score. For any Pokemon to have a significant impact on the correlation score, its values generally have to be more than 1 standard deviation away from the mean for both features.

    \[\rho = \frac{\frac{1}{n}\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sigma_x \sigma_y}\]

    It’s also useful to note that there are many other ways to compute correlation, and even more ways to measure relationships between distributions. A small sketch comparing a few of pandas’ built-in options appears after this list.

  4. More details about Kernel Density Estimator.

  5. What exactly does seaborn.set() do? Look at the differences here.
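
Following up on point 3, a quick sketch of how to compute the different correlation measures in pandas (the column names assume the Pokemon df from the notebook):

    # Pearson is the default; Spearman and Kendall are rank-based alternatives
    print(df[["Defense", "Speed"]].corr())                    # method="pearson"
    print(df[["Defense", "Speed"]].corr(method="spearman"))
    print(df["Defense"].corr(df["Speed"], method="kendall"))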

M2 ExploratoryAnalysis.ipynb

  1. Why is the dtype of some columns (Name, Type) object instead of a string type (hint: pointers)? Here’s a hint and see the other link in that post too.

  2. In this dataset, the NaNs in column Type 2 are true NaNs that can’t be filled with any other values - some Pokemon do not have a secondary type.

  3. Watch out for the whiskers of the boxplot! By default, the left whisker extends to Q1 - 1.5×IQR and the right whisker to Q3 + 1.5×IQR. But you will realise that in some plots the whiskers do not have equal length - that’s because each whisker actually stops at the smallest / largest data point within that range.

  4. Do not use pairplots directly on all the numeric columns - choose the columns you want to visualise (see the sketch after this list).

  5. To re-emphasise, it’s good practice to create a deep copy of your df via df.copy() (deep by default) when you’re cleaning data.
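
A small sketch tying points 4 and 5 together (the column names are from the Pokemon dataset used above):

    import seaborn as sns

    clean_df = df.copy()    # deep copy by default, so the raw df stays untouched
    # ... cleaning steps on clean_df ...

    # pick a handful of columns rather than pairplotting every numeric column
    sns.pairplot(clean_df[["HP", "Attack", "Defense", "Speed"]])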

Lab 3

Exercise3.ipynb

  1. In this exercise, you need to deal with categorical variables. Although pd.read_csv() automatically infers column types, it doesn’t mark columns as categorical. You’ll have to do it manually (e.g. df["NumBedrooms"].astype("category")), but there are some pseudo-automatic methods too. Basically, they take advantage of the fact that categorical variables are likely to have a small ratio of unique values to number of rows - see the sketch at the end of this lab’s notes.

  2. As you start picking up data cleaning skills, you might wonder - can’t we just do these on Excel?

  • Why Excel?
    • Okay for small datasets; non-CS folks can use it and work together
    • Plots are easy to generate in Excel, but Power BI gives you a much wider range of options
    • Useful Excel formula / functions: XLOOKUP (to draw information from one sheet to another), Power Pivot, What-If Analysis tools (Scenario, Goal Seek, Data Table)
  • Why not Excel?
    • Easier to understand the logic behind the cleaning process via code (vs breaking down formulas)
    • Excel can’t show more than \(2^{20}\) ≈ 1 million rows
    • Can’t do complicated analysis (else it takes eons to open your files)
    • Excel sheets are a pain to understand (you need to look at the formulas cell by cell), but Power Query gives it a strong boost
    • Excel has VBA and add-ins like XLMiner and the Analysis ToolPak, but Python libraries and R packages are much more user-friendly
  • Python vs R
    • Both can do pretty much the same thing, but Python has a bigger community and more libraries (and more stackoverflow qns/ans)
    • RStudio’s layout is very good for data analysis - in Python there’s Spyder
    • R has a learning curve and can be quite a pain at times, while Python has its own fair share of quirks but still feels cleaner
    • Some libraries (especially science-related ones, e.g. bioinformatics) are still only available in R
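
Back to point 1: a minimal sketch of a pseudo-automatic conversion. Keying off the ratio of unique values to rows is just one possible heuristic, and the 5% threshold is arbitrary:

    # treat low-cardinality object columns as categorical
    for col in df.select_dtypes(include="object").columns:
        if df[col].nunique() / len(df) < 0.05:
            df[col] = df[col].astype("category")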

Lab 4

Exercise4.ipynb

  1. Why do we need to divide by \(n-2\) to calculate the standard error? (Hint: degrees of freedom. Here are some links to understand what it means.) The relevant formulas are sketched after this list.

  2. Why is the vertical distance (from a point to the best-fit line) used to compute the error? There are actually many variants of linear regression models. What we’re learning is Ordinary Least Squares (OLS), where we assume the independent variable (x) is known exactly and thus the dependent variable (y) is the source of errors. If you minimise the horizontal distance instead, that’s basically regressing x on y - but note that it’s not actually the same as regressing y on x! Besides vertical and horizontal distances, what about the ‘orthogonal’ distance from the point to the best-fit line? That’s called Deming regression for the case of 2 variables, or total least squares for the more general case.
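
For point 1, the standard error formulas for simple linear regression (using \(\hat{y}_i\) for the fitted values and \(\hat{b}\) for the estimated slope - notation assumed here, check against the lab’s): the residual variance divides by \(n-2\) because two parameters, the slope and the intercept, have already been estimated from the same data.

\[s^2 = \frac{1}{n-2}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2, \qquad \mathrm{SE}(\hat{b}) = \frac{s}{\sqrt{\sum_{i=1}^{n}(x_i - \bar{x})^2}}\]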


Last updated: 12 April 2021