Spreadsheets are the most ubiquitous form of data storage. Microsoft Excel has become a standard tool in data analysis, despite its many known pitfalls. An exemplary case was in the UK when Public Health England (PHE) missed reporting nearly 16,000 coronavirus cases
because the API converted CSV files to XLS files mindlessly. An XLS file can only handle 65,000 rows of data, and the UK government agency neglected all reports by third parties once the spreadsheet was filled. There are many more horror stories
The paper starts with a wonderful line: “Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.” Karl and Kara give many useful tips on how data scientists should handle spreadsheets.
First, be consistent. Use consistent codes for categorical variables like sex, missing variables, etc. Second, use useful names. Develop your rules for variable names and identifiers, and follow them. Check the Tidyverse style guide
if you need it. Third, write dates
as YYYY-MM-DD. Dates are stored as numbers and differently in Mac or Windows, in India or the US. Fourth, avoid empty cells or comments (instead put them in a text file with more explanation). And many more. Check Karl’s tweet
about it too!