Missing Data and the Sound of One Hand Clapping
Anyway, when I was first introduced to the problem of missing data in building a machine learning model, my response was similar to how I felt about the clapping koan. As it turns out, that was the wrong response. There is a solution to the problem of missing data. It's called imputation.
One method of imputation is to replace all missing values in a column with the mean of that column. Another method is to replace all missing values with zero. Both of these methods lead to problems. Mean imputation reduces the variance, because every filled-in value sits exactly at the center of the column and adds no spread. This can cause issues with hypothesis testing. The width of a confidence interval depends on the estimated variance, so if the variance is artificially reduced, confidence intervals will shrink. This can lead to rejecting a null hypothesis when the data do not warrant it. Setting the missing values to zero will change the mean, unless the mean happened to be zero already. This too can cause problems with hypothesis tests.
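To make those two effects concrete, here is a small sketch in Python using only the standard library. The data are made up for illustration: a hypothetical column of 1,000 normally distributed values with about 30% removed at random. It is not taken from my capstone, and the exact numbers depend on the random seed.

```python
import random
import statistics

random.seed(0)  # fixed seed so the run is repeatable

# Hypothetical complete column: 1,000 draws from a normal(50, 10) distribution.
complete = [random.gauss(50, 10) for _ in range(1000)]

# Remove roughly 30% of the values completely at random (None marks missing).
observed = [x if random.random() > 0.3 else None for x in complete]
present = [x for x in observed if x is not None]

# Mean imputation: fill every gap with the mean of the observed values.
col_mean = statistics.mean(present)
mean_imputed = [col_mean if x is None else x for x in observed]

# Zero imputation: fill every gap with 0.
zero_imputed = [0.0 if x is None else x for x in observed]

print("variance, observed only:", statistics.variance(present))
print("variance, mean imputed: ", statistics.variance(mean_imputed))
print("mean, observed only:    ", col_mean)
print("mean, zero imputed:     ", statistics.mean(zero_imputed))
```

The mean-imputed column keeps the same mean as the observed values but has a smaller sample variance, since the filled-in values contribute nothing to the spread while still increasing the count. The zero-imputed column has a lower mean, because the zeros pull the average toward zero.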
There are imputation methods that do not have these issues. An important one is called MICE, which stands for Multivariate Imputation by Chained Equations. This method is the subject of my capstone. The programming language R has had an implementation of MICE since 2000. Python has its own version of MICE. This version is more recent and may not be as stable as the R version. In my capstone, I compare the performance of R MICE vs Python MICE. The link to my capstone is here.