Spreadsheets are the most ubiquitous form of data storage. Microsoft Excel has become a standard tool in data analysis, despite its many known pitfalls. An exemplary case was in the UK when Public Health England (PHE) missed reporting nearly
16,000 coronavirus cases because the API converted CSV files to XLS files mindlessly. An XLS file can only handle 65,000 rows of data, and the UK government agency neglected all reports by third parties once the spreadsheet was filled. There are many more
horror stories.
The paper starts with a wonderful line: “Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.” Karl and Kara give many useful tips on how data scientists should handle spreadsheets.
First, be consistent. Use consistent codes for categorical variables like sex, missing variables, etc. Second, use useful names. Develop your rules for variable names and identifiers, and follow them. Check the
Tidyverse style guide if you need it. Third, write
dates as YYYY-MM-DD. Dates are stored as numbers and differently in Mac or Windows, in India or the US. Fourth, avoid empty cells or comments (instead put them in a text file with more explanation). And many more. Check
Karl’s tweet about it too!