
What happened in the last two months? | Next - Issue #29

Hi there!
There are some days when your morning coffee isn’t enough. Your jaws are dropping, and your eyelids are closing shut. The best way to gain alertness is by looking up to the ceiling and holding that for 10-15 seconds. It lights up some of the areas of the brain which are associated with wakefulness.
I almost forgot to write the recap edition for the last two months. In today’s letter, I will (re)present the most exciting stories, packages, jargon, tweets, and memes from the last two and a half months. Dive in!

Five Stories
It’s conference season once again. This blog describes upcoming R-related conferences. Some interesting ones:
Statistics is all about quantifying uncertainty. The problem is we don’t know what we don’t know. This blog post by data scientists at Google gives a structure to explore uncertainties.
There are three types of uncertainty. The first is statistical uncertainty, the gap between true population measures and sample measures. The second is representational uncertainty, the gap between the concept we care about and the data we actually have: we want to measure ability, but all we have is SAT scores. The third is interventional uncertainty, which captures that an intervention might change more than we expected.
Josh writes the NYC Data Jobs & Events newsletter, a weekly listing of new data positions, events, and conferences in New York City. As part of that and beyond, he read many articles and wrote a listicle of the ones that profoundly influenced his thinking. Jump in to read the ten best articles. Some good ones:
To be honest, all of the ten articles could be part of this newsletter.
Montana, Hawaii and Tennessee may not be among the more populous states in the nation — but when it comes to providing material to America’s singers, they more than pull their weight.
This blog by Julia Silge started with a simple question: which states are popular in song lyrics? The initial answer is simple: California. But the results change once you control for population and look at per-capita mentions!
Below is the distorted cartogram to show which places are famous in songs. If you want a technical introduction, jump to Julia’s blog!
In the last edition of this letter, I presented a bonus puzzle.
Last week our village had a fête. One of the competitions on offer was to guess the number of balls in a bag. There were N balls in the bag, and they were numbered 1, 2, 3, …, N. To help competitors make a sensible guess, they were allowed to take out four balls and note the numbers on them.
When I took part, I pulled out balls numbered 24, 87, 14 and 35.
There is a prize for the person who guesses the correct number exactly. How many balls should I estimate are in the bag?
I don’t know what Significance magazine’s solution is, but here’s an approach. It’s called the German tank problem.
During World War II, the Allies needed a method to know how many tanks the Germans had. The spies weren’t reliable; some said thousands, some said hundreds. The problem came to statisticians. All the “data” they had was the numbers painted on tanks. Like 24, 87, 14 and 35. Statisticians assumed these were serial numbers and designed a simple formula to estimate the maximum.
N = m + (m - k)/k = m + m/k - 1
Initially, our best estimate was 87, the sample maximum m, which is also the maximum likelihood estimate. But the true count is almost certainly larger than the largest number we happened to draw, so statisticians made an adjustment. If we have seen k numbers and assume they are roughly equally spaced, the average gap between them is (m - k)/k, so we can estimate how far beyond the maximum the next number would fall, right?
That’s what the second term in the formula does. And the estimates were surprisingly accurate. My estimate for the village fête? 107. It was wrong though.
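The arithmetic behind that estimate can be sketched in a few lines of R. This is a minimal illustration, assuming the balls were drawn without replacement from 1 to N; the function name `german_tank_estimate` is made up for this example and does not come from any package.

```r
# Estimate the population size N from observed serial numbers,
# assuming they were drawn without replacement from 1..N.
german_tank_estimate <- function(serials) {
  m <- max(serials)     # sample maximum
  k <- length(serials)  # number of observations
  m + (m - k) / k       # maximum plus the average gap between observations
}

balls <- c(24, 87, 14, 35)
german_tank_estimate(balls)
#> [1] 107.75
```

The result, 107.75, is where the guess of 107 comes from.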
Four Packages
MLDataR provides real-world datasets for machine learning applications. See the vignette here.
simplevis makes plotting with ggplot2 easier. You don’t need to provide layer-by-layer instructions for creating a visualisation. A single function will do! See website.
udpipe is a fantastic text mining package suite. It provides methods for tokenisation, POS tagging, lemmatisation, multi-word expressions, keyword detection, sentiment analysis and semantic similarity analysis. Best of all? It supports over 50 languages. See the website here.
summarytools: One of the first things to do when starting with a new dataset is to create a summary table of it. This package significantly simplifies the process. You’ll love it. Check the vignette here.
Three Jargons
.Renviron contains environment variables to be set in R sessions. It is most useful for storing credentials such as GitHub or Twitter API keys. The easiest way to edit it is with usethis::edit_r_environ().
.Rprofile contains R code that runs whenever R starts up. You can set up your personal welcome messages. And no, putting library(tidyverse) is not a good idea.
The breakdown point is the proportion of arbitrarily large observations that an estimator can handle before it gives arbitrarily large, incorrect results.
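To see the breakdown point in action, here is a small sketch comparing the mean and the median. The numbers are made up purely for illustration. The mean can be dragged arbitrarily far by a single bad observation (its breakdown point tends to 0 as the sample grows), while the median tolerates corruption in just under half of the sample.

```r
# Illustrative data only; not from any real dataset.
x <- c(2, 3, 5, 7, 11)

# Corrupt one of the five observations with an arbitrarily large value.
x_bad <- replace(x, 5, 1e9)

mean(x)        # 5.6
mean(x_bad)    # explodes along with the corrupted value

median(x)      # 5
median(x_bad)  # still 5: one bad point out of five does not move it
```

Try corrupting three of the five values instead: the median finally breaks too.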
Two Tweets
Alison Presmanes Hill
Over the weekend, I wrote up my notes about using and teaching Quarto, based on my experiences working with the development team for over a year.

I think (hope?) it is safe to talk about it now 😆
Instructors using #rstats, if you're teaching machine learning, please please please teach tidymodels and not individual packages like glmnet, caret, etc.
One Meme
Letters in Last Two Months
That's a wrap!
I hope you enjoyed today’s letter. Could you share it with a friend or a colleague? See you next week!
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.


Knoxville, Tennessee.