
Comics, COVID-19 Models and Brewing Your Own Beer | Next Issue #14

Hi there!
How are you doing? I hope you and your loved ones are safe amidst this COVID-19 wave. I’m no expert but definitely recommend masks and vaccinations.
Today’s edition covers a part-comic, part-tutorial guide to ethical machine learning, an attempt to brew beers based on county statistics, a Shiny app to solve Wordle, a way to make Archimedean spiral plots, and more stories. There is also a tip on how to embed files in R Markdown using the xfun package.
Let’s dive in.

Five Stories
The Hitchhiker’s Guide to Responsible Machine Learning
Hans Rosling was famous for making seemingly uninteresting statistics relevant for everyone. His book Factfulness was an eye-opener to many, including me, and prompted Bill Gates to stop talking about the “developing world”. MI2 Data Lab does the same for ML and AI models, explaining their parts through comics.
The Hitchhiker's Guide to Responsible Machine Learning
The comic book, combined with explanatory notes and R code, is super helpful in understanding how predictive models work and, more importantly, how people build them. Beta and Bit are two coworkers tasked with building a model for COVID-19 mortality using a tonne of data. Both are smart, but one wants to “get things done” and the other wants to “be sure enough”: two different archetypes of engineers.
Each comic strip playfully introduces a concept, then explains the theory and implements it in R. At 50 pages, it won’t take more than an afternoon. Whether you are an experienced analyst or a budding one, bookmark it for your Sunday reading.
Brewing Multivariate Beer
Imagine you chose five metrics for every US county: population density, education rate, household income, etc. Now, you map each metric to a beer ingredient, such as the quantity and type of grains or hops. What would the beer taste like for New York? Boston?
Nathan Yau (FlowingData) wrote an R function for the recipe and brewed beers for Aroostook, Maine (low population density), Arlington, Virginia (high median household income), Bronx, New York (high population density), and Marin, California (high education rate).
Beers for Aroostook, Arlington, Bronx and Marin.
In the end, the beers looked the same but tasted different. Aroostook was mild overall; Arlington had a “thick head, some aroma, and an obvious rye taste” because of its high education rate, employment and healthcare coverage; the Bronx was unbalanced with higher hops as the county had low income and high population density.
Excel-like Data Editing in R
Once we start analysing real-world data in R, more and more of the analysis happens in the background. Say we ask dplyr to group by order number and find the total sales: how do we check that it did so accurately? I’ve found viewing the wrangled data in Excel to be tremendously helpful. Bruno Rodrigues’s show_in_excel() function is handy.
DataEditR is a Shiny-based tool that gives Excel-like data editing capability to data frames in R.
DataEditR Capabilities. Source: GitHub Repository.
The package also provides an add-in that makes editing data frames a point-and-click job. Its capabilities in selecting, filtering and editing are also packaged as functions that you can use in your Shiny apps. It also offers functionality for branding by changing the title and logo of the app.
Using Shiny and regex to solve Wordle
On Sunday, Neal Freyman from Sunday Brew wrote:
Software engineer Josh Wardle knew that his partner, Palak Shah, loved word games, and, bored during the pandemic, decided to create one for her as a kind gesture. You might have heard of it: Wordle. After months of playing it as a family, Wardle released the game to the public in October. It had more than 2.7 million players last Monday.
The game is more popular than ever right now. Pachá added something clever to it. (Whether he created it to help his partner is unknown to me.) He made a Shiny app that uses regular expressions to match words from an English dictionary against the colour clues.
It takes the letters from the user's input and subsets an English dictionary to the words that match the green letters. It then keeps only those that contain the yellow letters. Finally, it eliminates the words that contain grey letters. Voilà!
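The three filtering steps can be sketched in a few lines. This is a minimal Python illustration of the idea, not Pachá's Shiny code: the function names, the clue format (`greens`, `yellows`, `greys`) and the tiny word list are all assumptions made for the example, and repeated letters are handled only in a simplified way.

```python
import re

def wordle_pattern(greens, yellows, greys, length=5):
    """Build a regex for words consistent with the colour clues.

    greens:  {position: letter} for letters fixed at a 0-based position
    yellows: {letter: set of positions where the letter is known NOT to be}
    greys:   letters assumed to be absent from the word
    (This clue format is hypothetical, chosen for the sketch.)
    """
    present = set(greens.values()) | set(yellows)
    # Simplification: a grey letter that is also green or yellow is ignored.
    absent = set(greys) - present
    # One lookahead per yellow letter: it must appear somewhere in the word.
    lookaheads = "".join(f"(?=.*{re.escape(ch)})" for ch in sorted(yellows))
    slots = []
    for pos in range(length):
        if pos in greens:
            slots.append(re.escape(greens[pos]))
            continue
        # Ban grey letters, plus yellow letters already tried at this position.
        banned = absent | {ch for ch, spots in yellows.items() if pos in spots}
        slots.append(f"[^{''.join(sorted(banned))}]" if banned else "[a-z]")
    return re.compile("^" + lookaheads + "".join(slots) + "$")

def filter_words(words, greens, yellows, greys):
    """Keep only the lowercase words that match every clue."""
    pattern = wordle_pattern(greens, yellows, greys)
    return [w for w in words if pattern.match(w)]

# Guess "slate": s and l are grey, a and e are green, t is yellow in slot 3.
words = ["crane", "crate", "trace", "grace", "brace", "caret", "stage"]
print(filter_words(words, greens={2: "a", 4: "e"},
                   yellows={"t": {3}}, greys={"s", "l"}))
# → ['trace']
```

Folding all clues into one pattern keeps the dictionary pass to a single match per word; filtering in three separate passes, as the description above reads, would give the same result.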
A Quick and Easy Way to Make Spiral Charts in R
Two weeks ago, The New York Times published a guest essay with a spiral chart (technically called an Archimedean spiral plot) to represent new COVID-19 cases in the United States.
New Covid-19 cases, United States. New York Times. Jeffrey Shaman.
We could argue for hours about whether it best represents the data, but two things are sure. It caught our attention and showed the wavy nature of new cases. The same layout could be insightful for many time series. Nathan Yau (from R brewery) wrote a short instructional article on creating such plots. All we need is the spiralize package by Zuguang Gu. You will have your result in about five lines of code.
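For intuition about what such a plot does under the hood, here is a minimal Python sketch of the geometry only. It does not use or reproduce the spiralize API; the function name and its parameters (`days_per_turn`, `start_radius`, `gap`) are assumptions for illustration. An Archimedean spiral has radius r = a + b·θ, so mapping one year to one full turn makes the same calendar day line up across years.

```python
import math

def spiral_points(n_days, days_per_turn=365, start_radius=1.0, gap=1.0):
    """Place days 0..n_days-1 on an Archimedean spiral r = a + b*theta.

    One full turn covers `days_per_turn` days and the radius grows by
    `gap` per turn. All parameter names are hypothetical.
    """
    b = gap / (2 * math.pi)  # radial growth per radian
    points = []
    for day in range(n_days):
        theta = 2 * math.pi * day / days_per_turn
        r = start_radius + b * theta
        points.append((r * math.cos(theta), r * math.sin(theta)))
    return points

# Two years of daily positions; day 0 and day 365 share the same angle.
pts = spiral_points(731)
print(pts[0], pts[365])
```

To turn this into a chart, the case count for each day would set the thickness or colour of the band drawn at that point.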
Four Packages
ggstatsplot: Statistical tests are commonly used in research to showcase results from analyses. ggstatsplot joins modelling with visualisation to create graphics with statistical details. Check the vignette here.
plotly is an R package for interactive web-based plots. You can click on a point to learn more, pinch to zoom, and all other familiar gestures work as well. Check the guidebook for different plots you can make.
DataEditR is an interactive editor for viewing, filtering, selecting and editing data frames in R. Check the third story in this letter to know more. Vignette here.
spiralize is a visualisation package for creating Archimedean spiral plots in R. See the fifth story in this letter to know more. Vignette here.
Three Jargons
Golden Ratio: If you are reading this, you have likely heard about the golden ratio at some point. It seems to appear magically in how flowers are arranged, how financial trades happen, and many other places. Watch this video by Numberphile to learn why it’s so irrational.
Association Rules: Association rules are methods in data mining to find statistical connections (“associations”) between observations in a dataset.
Asymptotically Unbiased Estimator: An estimator whose bias tends to zero as the sample size increases to infinity. A biased estimator can be asymptotically unbiased. Every unbiased estimator is asymptotically unbiased, but the converse is not necessarily true.
Tokenisation: While processing unstructured textual datasets, tokenisation is splitting text into smaller units such as words, numbers or other characters (called tokens).
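To make the tokenisation entry concrete, here is a minimal Python sketch using a single regular expression. The pattern and the function name are assumptions chosen for illustration; real tokenisers handle contractions, hyphens, URLs and much more.

```python
import re

# Numbers (with an optional decimal part), then words, then any single
# non-space symbol. The order matters: "2.7" should stay one token.
TOKEN_PATTERN = re.compile(r"\d+(?:\.\d+)?|\w+|[^\w\s]")

def tokenise(text):
    """Split text into word, number and punctuation tokens."""
    return TOKEN_PATTERN.findall(text)

print(tokenise("Wordle had 2.7 million players!"))
# → ['Wordle', 'had', '2.7', 'million', 'players', '!']
```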
Two Tweets
R Function A Day
Sometimes the rmarkdown source file alone isn't enough to reproduce the report, and additional files (e.g. data) need to be embedded.

The {embed_*} function family from {xfun} 📦 does so by encoding the files to base64 format! 🎁

#rstats #DataScience
Lucy D’Agostino McGowan
Last night, @nbcsnl said that seeing Spider-Man: No Way Home was causing the surge in COVID-19 cases, so I decided to take a look!

😱Looks like the *opposite* is true? Could more people seeing Spiderman lead to fewer cases? Let's take a causal look👇
One Meme
Machine learning instead of intellectual learning.
That's a wrap!
If there is a blog, meme, or anything else that you’d like to see shared in this newsletter, send it in. You can also forward this letter to a friend who’d like to learn more about R and data science. See you next week.
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

List of all packages covered in past issues:

Knoxville, Tennessee.