
Data Organisation in Spreadsheets and Pop Lyrics | Next Issue #17

Hi there!
Two weeks ago, I got my copy of Atlas of the Invisible. It is an astonishingly beautiful book of visualisations and maps from all over the world documenting thousands of activities. One that caught my eye was on tracking ancestry through DNA. Over 30 million people have bought DNA test kits so far. However, the science behind it is incredibly inaccurate. Why? Because DNA has more meaningful information to store than our parental records. In fact, in just over ten generations, our family tree will have sixteen times more data than our DNA. It is easier to store data along the way than to try and retrieve it later.
Today I will talk about storing and organising data in spreadsheets and how quickly that gets troublesome. You may have noticed pop song lyrics getting more repetitive these days, and Colin will test that for us. We will also learn some data analysis principles, how to make bar plots, and hypothesis testing. Do check out Simpson’s paradox in the Jargon section as well.

Five Stories
Spreadsheets are the most ubiquitous form of data storage. Microsoft Excel has become a standard tool in data analysis, despite its many known pitfalls. A notable case came from the UK, where Public Health England (PHE) failed to report nearly 16,000 coronavirus cases because CSV files were being converted to XLS files without any checks. An XLS file can hold only about 65,000 rows (65,536, to be exact), so once the spreadsheet was full, further results reported by third parties were silently dropped. There are many more horror stories.
The paper, Data Organization in Spreadsheets by Karl Broman and Kara Woo, starts with a wonderful line: “Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.” Karl and Kara give many useful tips on how data scientists should handle spreadsheets.
First, be consistent. Use consistent codes for categorical variables like sex, for missing values, and so on. Second, use useful names. Develop your own rules for variable names and identifiers, and follow them. Check the Tidyverse style guide if you need it. Third, write dates as YYYY-MM-DD. Excel stores dates as numbers, with different conventions on Mac and Windows, and the way dates are written differs between India and the US. Fourth, avoid empty cells and comments (instead, put comments in a text file with more explanation). And many more. Check Karl’s tweet about it too!
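The date tip is easy to show in code. Here is a minimal sketch in Python (standard library only), assuming you already know which convention each column was written in. The `to_iso` helper and the sample value are made up for illustration.

```python
from datetime import datetime

def to_iso(date_text, source_format):
    """Parse date_text with the given strptime format and return YYYY-MM-DD."""
    return datetime.strptime(date_text, source_format).date().isoformat()

# The same string means two different days depending on the convention.
print(to_iso("03/04/2022", "%d/%m/%Y"))  # day first:   2022-04-03
print(to_iso("03/04/2022", "%m/%d/%Y"))  # month first: 2022-03-04
```

Once everything is YYYY-MM-DD, sorting the text also sorts the dates, which is a nice side effect.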
Are pop lyrics getting more repetitive? Short answer, yes. Colin set out to test this with 15,000 songs that charted on the Billboard Hot 100 between 1958 and 2017. You can easily tell that a song is repetitive, but translating that heuristic into an algorithm is not easy. Using the Lempel-Ziv algorithm, which is used for lossless compression, he measures how much a song’s lyrics can be reduced. For Sia’s Cheap Thrills, the lyrics shrink by 76% in size.
In the 1960s, the average song was 45% compressible. By 1980, it had risen to 85%. 2014 was the most repetitive year. In fact, the top 10 songs were, on average, more repetitive than the rest of the chart in every single year between 1960 and 2015. Compressibility of course varies by genre and artist. Eminem is consistently non-repetitive; Rihanna is the most repetitive. Check the blog for an app to find out about your favourite artists.
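If you want to play with the idea, here is a rough sketch in Python. It is not Colin's exact method: it uses `zlib`, whose DEFLATE format combines LZ77 (a Lempel-Ziv variant) with Huffman coding, and the two sample texts are invented for illustration.

```python
import zlib

def compressibility(lyrics):
    """Return the fraction by which DEFLATE shrinks the text.

    0 means no reduction; very short texts can even come out negative
    because of the fixed overhead of the compressed format.
    """
    raw = lyrics.encode("utf-8")
    if not raw:
        return 0.0
    packed = zlib.compress(raw, 9)
    return 1 - len(packed) / len(raw)

# Invented sample texts: one chorus repeated 40 times, one non-repeating sentence.
repetitive = "na na na na hey hey goodbye\n" * 40
varied = "The quick brown fox jumps over the lazy dog while seven wizards quietly box."

print(f"repetitive: {compressibility(repetitive):.0%}")
print(f"varied:     {compressibility(varied):.0%}")
```

The repeated chorus shrinks dramatically, while the single sentence barely shrinks at all.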
ggplot is not trivial, period. But it is powerful. This longish post on bar plots presents various ideas on how to present data through a bar plot. The tutorial enumerates eight tips on how to make better bar plots in R.
First, it explains the difference between geom_col() and geom_bar() for making bar plots. Then, it introduces coord_flip() to flip axes. Finally, it shows how to make grouped bar plots (something that I keep forgetting even to this day).
I like two short texts: Zen of Python by Tim Peters and data analysis principles by Karl Broman. Zen of Python describes simple axioms around which Python is designed. My favourite lines are the following two.
Now is better than never.
Although never is often better than *right* now.
“Data analysis principles” is also a short list of axioms and tips on data analysis. Again, here are two of my favourite ones.
follow up on all aberrations
remember Simpson’s paradox
Statistical hypothesis testing is one of the first tools you learn as a statistics student. It is critical in determining whether something has changed from existing beliefs. Can you say X happened?
Finnstats wrote an introductory guide on hypothesis testing in R. First, it explains the basics of hypothesis testing: what do the terms hypothesis and testing mean? Then, it introduces null and alternative hypotheses, type I and II errors, and one- and two-tailed tests, among other jargon. All of these are supported with external links to tutorials on testing.
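To make the null hypothesis and the two-tailed test concrete, here is a small sketch in Python (standard library only). It is my own toy example, not taken from the Finnstats guide: an exact binomial test of whether a coin is fair. The function name `binomial_p_value` is hypothetical.

```python
from math import comb

def binomial_p_value(successes, trials, p_null=0.5):
    """Two-sided exact binomial test.

    Sums the probabilities, under the null hypothesis, of every outcome
    that is no more likely than the one actually observed.
    """
    def prob(k):
        return comb(trials, k) * p_null**k * (1 - p_null) ** (trials - k)

    observed = prob(successes)
    # Small relative tolerance so that equally likely outcomes are not lost to rounding.
    return sum(prob(k) for k in range(trials + 1) if prob(k) <= observed * (1 + 1e-9))

# Null hypothesis: the coin is fair. Observation: 8 heads in 10 flips.
print(binomial_p_value(8, 10))  # 0.109375
```

With a 5% significance level we would not reject the null here. Rejecting a null that is actually true would be a type I error; failing to reject one that is false would be a type II error.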
Four Packages
ggtech provides several themes for ggplot2 based on how big technology companies present their data. I especially liked how Airbnb does that and have incorporated some in my theme. Check the vignette here.
Robyn is an experimental, semi-automated, open-sourced Marketing Mix Modeling (MMM) package in R from Facebook Marketing Science. It uses various machine learning techniques to define media channel efficiency. Website. GitHub.
In the last edition, I talked about fuzzyjoin, which gives capabilities for non-exact joins. powerjoin improves on it by adding extra checks, among other things. Check the vignette here.
Do you want to build a personal stock trading tool? If yes, TTR is here to help. It provides a collection of 50+ stock trading indicators to design trading rules. Check the documentation here.
Three Jargons
Simpson’s Paradox: A pattern seen in the complete data can disappear, or even reverse, when the data are partitioned into groups (and the other way round). One of the best-known examples is UC Berkeley’s 1973 graduate admission statistics, which in aggregate suggested a bias towards men. When the same data were broken down by department, it appeared that 6 out of 85 departments were significantly biased against men and 4 were significantly biased against women.
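A tiny worked example helps here. The numbers below are invented for illustration (they are not the Berkeley figures). In each department women are admitted at a higher rate, yet men come out ahead overall, because most women applied to the department that admits fewer people.

```python
# Hypothetical (applicants, admitted) counts, invented for illustration.
admissions = {
    "Dept A": {"men": (100, 80), "women": (20, 18)},
    "Dept B": {"men": (20, 4), "women": (100, 30)},
}

def rate(applicants, admitted):
    """Admission rate as a fraction between 0 and 1."""
    return admitted / applicants

def overall_rate(group):
    """Admission rate for a group after pooling all departments."""
    applicants = sum(dept[group][0] for dept in admissions.values())
    admitted = sum(dept[group][1] for dept in admissions.values())
    return rate(applicants, admitted)

for name, groups in admissions.items():
    print(name, {group: rate(*counts) for group, counts in groups.items()})
print("Overall", {group: overall_rate(group) for group in ("men", "women")})
```

Per department the rates are 80% vs 90% and 20% vs 30% in favour of women, while the pooled rates are 70% for men and 40% for women.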
qplot() is a generic ggplot2 function to create histograms, boxplots, scatterplots, and many more. For simple exploration, this is the fastest way after base R’s plot(), boxplot() and hist(). Learn more here.
Sometimes it is useful to categorise the values of a continuous variable into a factor variable. cut() in R does it efficiently. You provide the breaks and labels and it does its job. Learn more here.
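To show what the binning does, here is a rough Python imitation (standard library only). It is only a sketch of the idea, not R's implementation: it assumes sorted breaks and intervals that are open on the left and closed on the right, which is how cut() behaves by default. The ages and labels are made up.

```python
from bisect import bisect_left

def cut(values, breaks, labels):
    """Label each value by the interval (breaks[i], breaks[i + 1]] it falls in.

    breaks must be sorted in increasing order. Values outside the breaks
    get None, similar to NA in R.
    """
    if len(labels) != len(breaks) - 1:
        raise ValueError("need exactly one label per interval")
    result = []
    for value in values:
        index = bisect_left(breaks, value) - 1
        result.append(labels[index] if 0 <= index < len(labels) else None)
    return result

ages = [5, 18, 19, 42, 65, 90, 0]
print(cut(ages, breaks=[0, 18, 65, 120], labels=["child", "adult", "senior"]))
# ['child', 'child', 'adult', 'adult', 'adult', 'senior', None]
```

Note how 18 and 65 land in the lower interval and 0 falls outside, because each interval excludes its left edge.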
Two Tweets
JD Long
watching @allison_horst explain how she gets grad students to pretend to be a whale shark and think about how they have to tilt their whale face in order to get the most krill... to introduce PCA!

I'll now always call PCA "whale faced dimensionality reduction"
R-Ladies Nairobi
We welcome you to our first meetup of 2022 by @OscarBaruffa where will be taken through "Job hunting for Data Analysts: Why it's hard and what one can do about it"


All genders are welcome!

#rladies #rstats #rladiesnairobi
One Meme
That's a wrap!
I hope you learnt something new today. As always, my inbox is open for feedback. See you next week!
Yours truly,
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

List of all packages covered in past issues:

Knoxville, Tennessee.