Homework 04
PCA on a kidney disease dataset
Your goal here is, basically, to “do PCA” on a kidney disease dataset¹, provided here:
The main question of interest is how the other variables relate to the disease classification, which is either “CKD” (for “chronic kidney disease”) or “not CKD”.
Please turn in a Jupyter notebook with interspersed code and text (the text should describe what the code is doing and interpret the results) that covers the following:
- (very) briefly describes the data and removes missing data (describing what has been removed)
- performs PCA on the dataset, using all variables except classification
- reports the proportion of variance in the dataset explained by each of the 24 PCs
- visualizes the results, in a way that shows how disease classification is arranged on the PCs and how the variables load on the PCs
- carefully explains the result in the context of chronic kidney disease diagnosis.
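As a rough illustration of the steps listed above, here is a minimal sketch of one way the computation could go. It runs on made-up synthetic data rather than the kidney disease dataset, so every name in it (`X`, `labels`, the five variables, the class labels) is a hypothetical stand-in. It also assumes that all variables are numeric and that standardizing before PCA is appropriate; any non-numeric columns in the real data would need to be encoded as numbers first, and whether to standardize is a choice to justify in the notebook.

```python
import numpy as np

# Synthetic stand-in data (hypothetical): 200 rows, 5 numeric variables.
# This is NOT the kidney disease dataset.
rng = np.random.default_rng(0)
n, p = 200, 5
latent = rng.normal(size=(n, 2))
mixing = rng.normal(size=(2, p))
X = latent @ mixing + 0.3 * rng.normal(size=(n, p))
labels = np.where(latent[:, 0] > 0, "ckd", "notckd")  # made-up class labels

# Knock out one value in each of 15 distinct rows to mimic missing data.
missing_rows = rng.choice(n, size=15, replace=False)
missing_cols = rng.integers(0, p, size=15)
X[missing_rows, missing_cols] = np.nan

# 1. Remove rows with any missing value, and record how many were dropped.
complete = ~np.isnan(X).any(axis=1)
X_complete = X[complete]
labels_complete = labels[complete]
n_dropped = n - int(complete.sum())

# 2. Standardize each variable (mean 0, standard deviation 1).
Z = (X_complete - X_complete.mean(axis=0)) / X_complete.std(axis=0, ddof=1)

# 3. PCA via the singular value decomposition Z = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s      # rows: observations, columns: PC coordinates
loadings = Vt.T     # rows: variables, columns: PCs

# 4. Proportion of variance explained by each PC.
explained = s**2 / np.sum(s**2)

print(f"dropped {n_dropped} rows with missing data")
print("proportion of variance explained:", np.round(explained, 3))
```

From here, a scatter plot of `scores[:, 0]` against `scores[:, 1]`, colored by `labels_complete`, would show how the classes are arranged on the first two PCs, and the columns of `loadings` show how each variable contributes to each PC.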
Footnotes
1. From the UCI ML archive, and that is definitely a real dataset of real people.