German Tank Problem, Unix Philosophy and Accidental aRt | Next - Issue #24

Harshvardhan
Hi there!
Did you know Twitter was started so that “Groups of friends could keep tabs on what each other were doing based on their status updates. Like texting, but not.”? That’s why there was a 140-character limit: an SMS holds 160 characters, and Twitter kept 20 of them for the username. Wow.
Today’s letter will start with one solution to the bonus puzzle from the last edition. I will also talk about the Unix philosophy, Stata and R, and accidental aRt. It’s going to be a ride; fasten your seatbelts.
By the way, INFORMS University of Tennessee Student Chapter is organising its first workshop, and I am speaking on personal websites for academics. It’s on Thursday at 1 pm Eastern time. You can tune in via Zoom too!

Five Stories
In the last edition of this letter, I presented a bonus puzzle.
Last week our village had a fête. One of the competitions on offer was to guess the number of balls in a bag. There were N balls in the bag, and they were numbered 1, 2, 3, …, N. To help competitors make a sensible guess, they were allowed to take out four balls and note the numbers on them.
When I took part, I pulled out balls numbered 24, 87, 14 and 35.
There is a prize for the person who guesses the correct number exactly. How many balls should I estimate are in the bag?
I don’t know what Significance magazine’s solution is, but here’s one approach. It’s called the German tank problem.
During World War II, the Allies needed a method to know how many tanks the Germans had. The spies weren’t reliable; some said thousands, some said hundreds. The problem came to statisticians. All the “data” they had was the numbers painted on tanks. Like 24, 87, 14 and 35. Statisticians assumed these were serial numbers and designed a simple formula to estimate the maximum.
N = m + (m - k)/k
Initially, our best estimate was 87: the sample maximum m, which is also the maximum likelihood estimate. But the true N can never be smaller than m and is almost always larger, so m on its own is biased low, and statisticians adjusted it. If we have seen k numbers and assume they are roughly equally spaced, the average gap between them is (m - k)/k, and we can expect about that many unseen numbers above m, right?
That’s what the second term in the formula does. And the estimates were surprisingly accurate. My estimate for the village fête? 87 + (87 - 4)/4 = 107.75, so 107.
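Here is a minimal sketch of that calculation in Python, using the standard form of the estimator, m + m/k - 1 (the same thing as m plus the average gap). The function name german_tank_estimate is my own choice for illustration, not something from the puzzle or from Significance magazine.

```python
def german_tank_estimate(serials):
    """Estimate the largest serial number N from a sample drawn without replacement."""
    k = len(serials)   # number of serials observed
    m = max(serials)   # largest serial observed
    # Sample maximum plus the average gap between observed serials.
    # Algebraically identical to m + m/k - 1.
    return m + (m - k) / k


balls = [24, 87, 14, 35]
estimate = german_tank_estimate(balls)
print(estimate)       # 107.75
print(int(estimate))  # 107, rounding down to a whole number of balls
```

Rounding down is a judgment call: 108 is just as defensible a guess, since the estimator lands between the two.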
As we mature towards becoming better progRammers, we need more non-technical resources to drive our thinking. What evaluations deserve to be in a function? What should be in a loop, an lapply() or a purrr::map()? These are important philosophical questions. The Unix philosophy emphasises creating simple, short, modular and clear programs. Some of my favourite tidbits:
  1. Make each program do one thing well. To do a new job, build afresh rather than complicate old programs by adding new features. [Don’t dump all you know in a single R file.]
  2. Fancy algorithms are slow when n is small, and n is usually small. Fancy algorithms have big constants. Until you know that n is frequently going to be big, don’t get fancy. [Important for optimisation code.]
  3. Rule of Diversity: Distrust all claims for “one true way”. [This is in sharp contrast with the Zen of Python, which says “There should be one—and preferably only one—obvious way to do it.”]
  4. When in doubt, use brute force. [I only thought of adding three, but this is too good.]
In my experience of using Stata, I found that Stata makes certain things very easy and other things all but impossible. R has a much steeper learning curve but a flatter plateau. This book is for people trying to learn either language. It provides a line-by-line comparison of code written in R and Stata. Econometrics people, bookmark it now!
Plots can go wrong and we all know it. But what if you accidentally create art? This Twitter account posts visualisations gone wrong beautifully. This is my recent favourite. It also has a Tumblr blog which hasn’t been updated in a year. Enjoy!
When you’re publishing a plot in your research, small details such as fonts matter a lot. cowplot provides themes, plot annotations, ways to mix plots with images and so on. I particularly like its ability to make ggplots share a common legend. Check its vignettes to know more.
Four Packages
rscript: rscript is a Stata package for running R code from Stata. Check the vignette here.
patchwork: This is how you patch ggplots together. p1 | p2 / p3, simple as that: p1 sits beside a column with p2 stacked on p3, because / binds tighter than |. Check the vignette here.
MetBrewer: Are you looking for ways to make your plots look different from everyone else’s? This provides a fantastic collection of colour palettes to use. Check the vignette here.
cowplot: Looking for new themes? Looking for how to satisfy Reviewer 3’s requirements for Figure 3? This package is here to help. Check the vignette here.
Three Jargons
Rule of Three: If a certain event didn’t happen in a sample of n subjects, an approximate 95% confidence interval for its rate of happening is 0 to 3/n. This is particularly useful in the initial stages of clinical trials. Read more.
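Where does the 3 come from? If the true rate is p, the chance of seeing no events in n independent subjects is (1 - p)^n. Setting that equal to 0.05 and solving gives p = 1 - 0.05^(1/n), which is roughly -ln(0.05)/n, or about 3/n. Here is a small Python sketch comparing the two; the function names are mine, and it assumes independent subjects.

```python
def rule_of_three_upper(n):
    """Approximate 95% upper bound on the event rate after 0 events in n subjects."""
    return 3 / n


def exact_upper(n, confidence=0.95):
    """Exact upper bound: the rate p at which (1 - p)**n equals 1 - confidence."""
    return 1 - (1 - confidence) ** (1 / n)


for n in (30, 100, 1000):
    print(n, round(rule_of_three_upper(n), 4), round(exact_upper(n), 4))
```

The shortcut always sits slightly above the exact bound, so it errs on the cautious side, and the two get closer as n grows.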
In classification problems, statistical theory describes two general ways of modelling how the data relate to their classes.
Generative models assume a joint probability distribution between observations’ features and the class they belong to, i.e. P(X, Y).
Discriminative models directly model the conditional probability of the class given an observation’s features, i.e. P(Y | X = x).
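To make the difference concrete, here is a toy sketch in Python. The data are invented purely for illustration. The generative route estimates the full joint table P(X, Y) and then derives P(Y | X = x) from it; a discriminative model (logistic regression, say) would fit P(Y | X = x) directly and never estimate how the features themselves are distributed.

```python
from collections import Counter

# Hypothetical toy data, invented for illustration: (feature x, class y) pairs.
data = [
    ("long", "spam"), ("long", "spam"), ("long", "ham"), ("long", "spam"),
    ("short", "ham"), ("short", "ham"), ("short", "spam"), ("short", "ham"),
]

# Generative view: estimate the joint distribution P(X, Y) from counts.
n = len(data)
joint = {pair: count / n for pair, count in Counter(data).items()}


def conditional(x):
    """P(Y | X = x), obtained from the joint by dividing by P(X = x)."""
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)
    return {y: p / p_x for (xi, y), p in joint.items() if xi == x}


print(joint[("long", "spam")])  # 0.375
print(conditional("long"))      # {'spam': 0.75, 'ham': 0.25}
```

The joint table carries more information than a classifier needs: it also describes how often each feature value occurs, which is what lets a generative model simulate new observations.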
Two Tweets
Blake Robert Mills
✨✨ PACKAGE UPDATE ✨✨
MetBrewer 0.2.0 is officially out 🥳 Available both on Cran and through my GitHub here: https://t.co/wYHMEKoJLH

New Palettes are out and some amazing new features. Full description of what has changed below :)

#MetBrewer #rstats #r4ds #dataviz https://t.co/XU3q7h7ZJT
Lauren Yee 🦇👩🏻‍💻🌈
Going to start an R package called: loneR for when you're the only R data scientist at your company. 😟#rstats
One Meme
Today, I have a comic on p-hacking. Check it out!
That's a wrap!
I hope you enjoyed today’s letter. Could you share it with a friend or a colleague? See you next week!
Harsh
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

Personal website: https://harsh17.in.

List of all packages covered in past issues: https://www.harsh17.in/nextpackages/.

Knoxville, Tennessee.