
What happened in the last two months? | Next Issue #18

Harshvardhan
Hi there!
When I first arrived in the US, I was appalled by its lack of public transportation. My previous experience in European countries had heightened my expectations of a “developed” nation. I was utterly disappointed. Over the last six months, I thought deeply about it. My conclusion: American individualism promotes individual infrastructure like cars more than public goods like railways.
Some technologies do not make logical sense initially. Public transport system. Open-source software. But over time, you realise what enablement of the masses can achieve. R is a prominent example. Such a wonderful community!
Today’s letter is a recap edition. All stories are taken from the past eight editions, except the memes; jokes shouldn’t be repeated. Dive in!

Five Stories
When you hit shuffle on Spotify (or Apple Music, or YouTube), do you get a random list of songs to play one after another? If you have ten artists, each with around 100 songs, you will likely have 13 back-to-back songs by the same artist by the time you finish the first 100 songs (little simulation).
Spotify was bombarded with complaints from users, and they worked intensely on this. Initially, Spotify used the Fisher-Yates shuffle algorithm for creating a perfectly random queue. It isn’t complicated. 
  1. You begin by writing down numbers from 1 to N, where N is the total number of songs.
  2. Choose a random k between 1 and the number of songs still on the list. Find the kth remaining song, note it on a separate list, and strike it from the first list.
  3. Repeat till there are no songs left in the first list.
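The three steps above can be sketched in a few lines of Python. This is a minimal illustration of the pencil-and-paper procedure, not Spotify's code; the function names `fisher_yates_shuffle` and `back_to_back_count` and the toy library of (artist, track) pairs are made up for this example.

```python
import random

def fisher_yates_shuffle(songs, rng=None):
    """Return a new list with the songs in uniformly random order."""
    rng = rng or random.Random()
    remaining = list(songs)   # step 1: the list we strike songs from
    queue = []                # the separate list we note songs on
    while remaining:          # step 3: repeat until nothing is left
        k = rng.randrange(len(remaining))   # step 2: random position among remaining songs
        queue.append(remaining.pop(k))      # note the song and strike it
    return queue

def back_to_back_count(queue, artist_of):
    """Count adjacent pairs in the queue that share an artist."""
    return sum(artist_of(a) == artist_of(b) for a, b in zip(queue, queue[1:]))

# Hypothetical library: ten artists with 100 tracks each.
library = [(artist, track) for artist in range(10) for track in range(100)]
queue = fisher_yates_shuffle(library)
print(back_to_back_count(queue[:100], lambda song: song[0]))
```

Running the last line a few times gives a feel for how often a truly random queue repeats an artist.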
Simple, right? Well, listeners didn’t like the results. The engineering team had to define a “more-random-sounding” algorithm. Ultimately, they borrowed the idea behind Floyd-Steinberg dithering (an algorithm primarily used in image processing) to space out songs by the same artist. The algorithm can then be used recursively to space out songs from the same album.
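To picture what “spacing out” could look like, here is a rough Python sketch. It is an assumption-laden illustration, not Spotify's implementation: each artist's songs are laid out evenly along a line from 0 to 1 with a random starting offset and a little jitter, and the queue is read off in order of position. The function name `spread_by_artist` and the jitter size are invented for this example.

```python
import random
from collections import defaultdict

def spread_by_artist(songs, artist_of, rng=None):
    """Order songs so that each artist's tracks are spread across the queue."""
    rng = rng or random.Random()
    by_artist = defaultdict(list)
    for song in songs:
        by_artist[artist_of(song)].append(song)

    placed = []
    for tracks in by_artist.values():
        rng.shuffle(tracks)              # random order within one artist
        n = len(tracks)
        offset = rng.uniform(0, 1 / n)   # random start so artists do not line up
        for i, song in enumerate(tracks):
            jitter = rng.uniform(-0.1 / n, 0.1 / n)  # assumed small nudge, avoids a rigid pattern
            placed.append((offset + i / n + jitter, song))

    placed.sort(key=lambda pair: pair[0])  # read the line from left to right
    return [song for _, song in placed]
```

Applying the same function to one artist's songs, keyed by album instead of artist, would mimic the recursive step.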
Are pop lyrics getting more repetitive? Short answer, yes. Colin set out to test this with 15,000 songs that charted on the Billboard Hot 100 between 1958 and 2017. You can easily tell when a song is repetitive, but translating that intuition into an algorithm is not easy. Using the Lempel-Ziv algorithm, which is used for lossless compression, he measures how much a song’s lyrics can be reduced. For Sia’s Cheap Thrills, the lyrics reduce by 76% in size.
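The compression idea is easy to try yourself. The sketch below uses Python's `zlib` as a stand-in: its DEFLATE format pairs a Lempel-Ziv style matcher with Huffman coding, so the numbers will not match the ones in the blog. The function name `compressibility` and both sample texts are made up for illustration.

```python
import zlib

def compressibility(lyrics: str) -> float:
    """Fraction by which the lyrics shrink under compression (0 means no reduction)."""
    raw = lyrics.encode("utf-8")
    if not raw:
        return 0.0
    packed = zlib.compress(raw, 9)
    # Very short texts can grow slightly because of format overhead; clamp at 0.
    return max(0.0, 1 - len(packed) / len(raw))

# Hypothetical lyrics: one line repeated 40 times versus a single varied sentence.
repetitive = "la la la, sing it again\n" * 40
varied = "The quick brown fox jumps over the lazy dog while seven wizards quietly juggle boxes."

print(f"repetitive: {compressibility(repetitive):.0%}")
print(f"varied:     {compressibility(varied):.0%}")
```

The repeated chorus shrinks far more than the varied sentence, which is exactly the signal used to score a song.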
In the 1960s, the average song was 45% compressible. By 1980, that had increased to 85%. 2014 was the most repetitive year. In fact, the top 10 songs were more repetitive on average than the rest of the chart in every single year between 1960 and 2015. Compressibility, of course, varies by genre and artist. Eminem is consistently non-repetitive; Rihanna is the most repetitive. Check the blog for an app to find out about your favourite artists.
Hans Rosling was famous for making seemingly uninteresting statistics relevant for everyone. His book Factfulness was an eye-opener to many, including me, and prompted Bill Gates to stop talking about the “developing world”. MI2 Data Lab does the same thing for ML and AI models: explaining their parts through comics.
The comic book, combined with explanatory notes and R code, is super helpful in understanding how predictive models work, and more importantly, how people build predictive models. Beta and Bit are two coworkers tasked with building a model for COVID-19 mortality using a tonne of data. Both are smart, but one wants to “get things done”, and the other wants to “be sure enough”: two different archetypes of engineers.
Every comic strip playfully introduces a concept before explaining the theory and implementing it in R. At 50 pages, it won’t take more than an afternoon. Whether you are an experienced analyst or a budding one, bookmark it for your Sunday reading.
You can’t ignore spreadsheets. “Spreadsheets, for all of their mundane rectangularness, have been the subject of angst and controversy for decades.” I’ve found visualising data wrangling in Excel to be tremendously helpful. Bruno Rodrigues’s show_in_excel() function is handy.
DataEditR is a Shiny-based tool that gives Excel-like data editing capability to data frames in R.
The package also provides an add-in that makes editing data frames a point-and-click job. Its capabilities in selecting, filtering and editing are also packaged as functions that you can use in your Shiny apps. It also offers functionality for branding by changing the title and logo of the app.
If you have created a ggplot2 plot and want to add interactivity, here’s a simple way to do it. Just pass the plot p to ggplotly().
Making interactive plots with ggplotly()
And by interactive, I mean you can hover on a point and get all the details, select points using a box or lasso tool, zoom in and zoom out, and even download the resulting graphics. With no additional code, they show up in R Markdown as well.
Example plot produced by ggplotly().
Four Packages
ggtech provides several themes for ggplot2 based on how big technology companies present their data. I especially liked how Airbnb does that and have incorporated some in my theme. Check the vignette here.
flexdashboard provides methods to make easy interactive dashboards using R Markdown. It supports related visualisations in a single pane, htmlwidgets, flexible and straightforward layouts for organisation, Shiny apps, and a creative “storyboard” layout for presenting a sequence of visualisations with commentary. Check the vignette here.
insight provides methods to recover intermediate information when developing a model — beyond coefficient estimates and estimates of fit. It supports 200+ models currently. Check the vignette here.
spiralize is a visualisation package for creating Archimedean spiral plots in R. See the fifth story in this letter to know more. Vignette here.
Here’s the link to the spreadsheet with all packages mentioned in past editions.
Three Jargons
Simpson’s Paradox: A pattern visible in the complete data can vanish or even reverse when the data is partitioned into groups. One of the best-known examples is UC Berkeley’s admission statistics, which, taken as a whole, suggested a bias towards men in admission. When the same data was broken down by department, it appeared that 6 out of 85 departments were significantly biased against men, and four were significantly biased against women.
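A tiny numeric example makes the reversal concrete. The counts below are hypothetical and are not the Berkeley figures; the department names and the helper names `admit_rate` and `overall_rate` are invented for this sketch.

```python
# Hypothetical applicant counts, NOT the Berkeley data: (applied, admitted)
applications = {
    "Dept A": {"men": (100, 60), "women": (20, 14)},
    "Dept B": {"men": (20, 4), "women": (100, 25)},
}

def admit_rate(applied, admitted):
    """Share of applicants who were admitted."""
    return admitted / applied

def overall_rate(group):
    """Admission rate for one group with all departments pooled together."""
    applied = sum(dept[group][0] for dept in applications.values())
    admitted = sum(dept[group][1] for dept in applications.values())
    return admit_rate(applied, admitted)

for name, dept in applications.items():
    print(name, {group: f"{admit_rate(*dept[group]):.1%}" for group in dept})
print("Overall", {group: f"{overall_rate(group):.1%}" for group in ("men", "women")})
```

In this made-up data, women are admitted at a higher rate in both departments, yet men come out ahead overall, because most women applied to the more selective Dept B.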
Collaborative Filtering: Imagine Netflix wanting to know what to recommend to a user, u. It finds ten users who are similar to u. Then, it averages their ratings of movies and shows that u has not watched yet and picks the one that best matches u’s preferences, that is, the one with the highest average. That’s collaborative filtering.
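Here is a minimal Python sketch of that recipe. It is only an illustration of user-based collaborative filtering, not how Netflix does it; the similarity measure (based on the average rating gap over shared titles), the function names and the toy ratings are all assumptions made for this example.

```python
def similarity(a, b):
    """Score in (0, 1]: higher when two users rate their shared titles alike."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    mean_gap = sum(abs(a[t] - b[t]) for t in shared) / len(shared)
    return 1 / (1 + mean_gap)

def recommend(ratings, user, k=10):
    """Suggest the unseen title that the k most similar users rate highest on average."""
    others = [u for u in ratings if u != user]
    neighbours = sorted(
        others, key=lambda u: similarity(ratings[user], ratings[u]), reverse=True
    )[:k]
    totals, counts = {}, {}
    for u in neighbours:
        for title, score in ratings[u].items():
            if title not in ratings[user]:   # only titles the user has not rated
                totals[title] = totals.get(title, 0) + score
                counts[title] = counts.get(title, 0) + 1
    if not totals:
        return None                          # nothing new to suggest
    return max(totals, key=lambda t: totals[t] / counts[t])

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "u": {"Show A": 5, "Show B": 1},
    "v": {"Show A": 5, "Show B": 1, "Show C": 5, "Show D": 2},
    "w": {"Show A": 4, "Show B": 2, "Show C": 4, "Show D": 1},
    "x": {"Show A": 1, "Show B": 5, "Show C": 1, "Show D": 5},
}
print(recommend(ratings, "u", k=2))
```

With k=2, the neighbours are v and w, whose tastes match u's, so the suggestion comes from their ratings rather than from x's.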
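curly curly operator: {{ var }} When working with the tidyverse, supplying a variable name as an argument to a function can lead to errors, because the verbs inside the function look for a column literally named after the argument instead of the column you passed. Placeholder pipes cannot handle back operation, and this is where curly curly comes into the picture: wrapping the argument as {{ var }} inside the function forwards the column the caller supplied. Check the example here.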
Two Tweets
Amelia McNamara
"Things that are still on your computer are approximately useless." -@drob #eUSR #eUSR2017 https://t.co/nS3IBiRHBn
David Scholz
Navigate to one of your github repositories and press "." (period). Well, I didn't know yet.
One Meme
That's a wrap!
Hope you enjoyed today’s letter. Share it with a friend or a colleague. My inbox is open for feedback. See you next week!
Yours truly,
Harsh
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

Personal website: https://harsh17.in.

List of all packages covered in past issues: https://www.harsh17.in/nextpackages/.

Created with Revue by Twitter.
Knoxville, Tennessee.