Homework 04

Author

Peter Ralph

Published

January 26, 2026

PCA on a kidney disease dataset

Your goal here is, basically, to “do PCA” on a kidney disease dataset¹, provided here:

a README describing the variables, and
a CSV of the data.

The main question of interest is how the other variables relate to the disease classification, which is either “CKD” (for “chronic kidney disease”) or “not CKD”.

Please turn in a Jupyter notebook that includes interspersed code and text (the text should describe what the code is doing and interpret the results) and that covers the following:

(very) briefly describes the data and removes missing data (describing what has been removed)
performs PCA on the dataset, using all variables except classification
reports the proportion of variance in the dataset explained by each of the 24 PCs
visualizes the results, in a way that shows how disease classification is arranged on the PCs and how the variables load on the PCs
carefully explains the result in the context of chronic kidney disease diagnosis.
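
As a rough illustration of the mechanics in the steps above (not a solution to the assignment), here is a minimal sketch of dropping incomplete rows, standardizing, and computing PCA through the singular value decomposition. Everything in it is an assumption for illustration: the data are randomly generated stand-ins rather than the kidney disease CSV, the names `X`, `label`, `scores`, `loadings`, and `prop_var` are made up, and all variables are taken to be numeric (categorical columns in a real dataset would need to be encoded or otherwise handled first).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-in data: 200 rows, 5 numeric variables, and a binary label.
# None of this comes from the real dataset.
n, p = 200, 5
label = rng.integers(0, 2, size=n)  # 1 = "CKD", 0 = "not CKD" (made up)
shift = np.array([2.0, 1.0, 0.0, 0.0, -1.0])
X = rng.normal(size=(n, p)) + label[:, None] * shift
X[rng.random(size=(n, p)) < 0.02] = np.nan  # sprinkle in some missing values

# 1. Remove rows with any missing value, and report how many were dropped.
keep = ~np.isnan(X).any(axis=1)
X_complete, label_complete = X[keep], label[keep]
print(f"dropped {n - keep.sum()} of {n} rows with missing values")

# 2. Standardize each variable (mean 0, SD 1) so that variables measured
#    on large scales do not dominate the PCs. The label is not included.
Z = (X_complete - X_complete.mean(axis=0)) / X_complete.std(axis=0, ddof=1)

# 3. PCA via the SVD: Z = U diag(s) Vt.
U, s, Vt = np.linalg.svd(Z, full_matrices=False)
scores = U * s      # position of each row on each PC
loadings = Vt.T     # column k gives the weights of the variables on PC k
prop_var = s**2 / np.sum(s**2)  # proportion of variance explained by each PC

print("proportion of variance explained:", np.round(prop_var, 3))
```

From here, a scatterplot of the first two columns of `scores` colored by `label_complete` would show how the classification is arranged on the PCs, and the columns of `loadings` show how each variable contributes. Whether to standardize, and how to treat missing values, are choices that should be described and justified in the notebook.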

Footnotes

¹ From the UCI ML archive; this is definitely a real dataset of real people.