Uncertainties, Non-academic Careers and Text Mining | Next - Issue #26

Harshvardhan
Hi there!
I’ve been listening to Dune’s soundtrack over and over again for the last week. Today’s newsletter is also a byproduct of that extensive listening. Enjoy!

Five Stories
This is yet another freely available book on statistics. It gently introduces the concept of statistics and its limitations, ideally suited for beginners. The first chapter on Why do we learn statistics? is a treat to read.
The book has six units: background, introduction to R, working with data, statistical theory including hypothesis testing, statistical tools like ANOVA, and finally a brief taste of Bayesian statistics. The book has been translated into French and ported to Python, which speaks to its quality.
Statistics is all about quantifying uncertainty. The problem is we don’t know what we don’t know. This blog post by data scientists at Google gives a structure to explore uncertainties.
There are three types of uncertainty. The first is statistical uncertainty, the gap between the true population value and what we measure in a sample. The second is representational uncertainty, the gap between what we want to measure and what we actually can measure: we want to measure ability, but all we have is SAT scores. The third is interventional uncertainty: our intervention might change more than we expected.
Julia Silge, currently working at RStudio, gave a talk on Non-academic Careers for Astronomers and Physicists. This slide deck is her note to fellow academics on how to look for non-academic positions.
The deck talks about how only a fraction of PhDs are employed by academia, while most could easily become data scientists. My Twitter feed is full of academics looking for resources on exactly this: how to pivot their careers from hard-science academia to anything else. I think it suits them.
Her suggestions: have a GitHub, write a professional blog, and contribute to open-source.
Montana, Hawaii and Tennessee may not be among the more populous states in the nation — but when it comes to providing material to America’s singers, they more than pull their weight.
This blog post by Julia Silge started with a simple question: which states are popular in song lyrics? The initial answer is simple: California. But the results change once you control for population size and look at per-capita mentions!
Below is the distorted cartogram to show which places are famous in songs. If you want a technical introduction, jump to Julia’s blog!
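To see why the per-capita view can flip the ranking, here is a minimal Python sketch. The place names and every number in it are made up for illustration; they are not the results from Julia's blog.

```python
# Made-up counts for two hypothetical places (NOT data from the blog post).
mentions = {"Big State": 500, "Small State": 40}
population = {"Big State": 40_000_000, "Small State": 1_000_000}

def per_capita(mentions, population, per=1_000_000):
    """Song mentions per `per` residents for each place."""
    return {place: mentions[place] * per / population[place] for place in mentions}

rates = per_capita(mentions, population)

top_raw = max(mentions, key=mentions.get)   # winner by raw count
top_rate = max(rates, key=rates.get)        # winner after adjusting for population

print("Most mentions overall:", top_raw)
print("Most mentions per million residents:", top_rate)
```

The big place wins on raw counts, while the small one wins once you divide by population.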
Topic modelling is a type of statistical learning where the goal is to learn about some “hidden” topics that occur in a collection of documents. A standard package for it in R is stm, which stands for Structural Topic Model.
In this blog post, Julia covers topic modelling of songs by the Spice Girls. It also has a screencast if you'd like to follow the analysis step by step!
Four Packages
ruimtehol provides methods for text classification, word, sentence and document embeddings, sentence and document similarity, ranking documents, and recommendations, among others. See website.
simplevis makes plotting with ggplot easier. You don’t need to provide layer by layer instructions for creating a visualisation. A single function will do! See website.
wikipediapreview provides popups for Wikipedia pages in R Markdown documents. This might be the only popup you ever wanted. See the intro tweet!
mischelper provides many helper functions for completing simple tasks in RStudio. It is an add-in, so no code is needed! I couldn’t find documentation on which functions are included. See GitHub.
Three Jargons
The breakdown point is the largest proportion of arbitrarily large observations (outliers) that an estimator can handle before it gives arbitrarily large, incorrect results.
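A quick way to get a feel for the breakdown point is to corrupt a sample bit by bit and watch the estimators. Below is a minimal Python sketch comparing the mean and the median; the sample, the `contaminate` helper and the "arbitrarily large" value of 1e9 are all my own choices for illustration.

```python
from statistics import mean, median

def contaminate(data, n_bad, bad_value=1e9):
    """Replace the last n_bad observations with an arbitrarily large value."""
    return list(data[:len(data) - n_bad]) + [bad_value] * n_bad

clean = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

# Corrupt 0%, 10%, 40% and 50% of the sample and compare the two estimators.
for n_bad in (0, 1, 4, 5):
    sample = contaminate(clean, n_bad)
    print(f"{n_bad} bad of {len(sample)}: mean={mean(sample):.1f}, median={median(sample):.1f}")
```

A single bad value is enough to wreck the mean, while the median stays put until half of the sample is corrupted.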
Confidence intervals give us a range of plausible values for an unknown parameter, after accounting for sampling uncertainty. They do not measure uncertainty in data collection. If I repeat my sampling 100 times and build a 95% confidence interval (L, M) from each sample, about 95 of those 100 intervals would contain my parameter of interest. Any single interval either contains the parameter or it doesn't.
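The "95%" describes the procedure, not one particular interval: repeat the sampling many times, and roughly 95% of the intervals you build should cover the true value. Here is a minimal simulation sketch in Python. It assumes normally distributed data and a normal-approximation interval (mean ± 1.96 standard errors); the true mean, spread, sample size and seed are arbitrary choices of mine.

```python
import random
from statistics import mean, stdev

def ci_95(sample):
    """Normal-approximation 95% confidence interval for the mean."""
    m = mean(sample)
    half_width = 1.96 * stdev(sample) / len(sample) ** 0.5
    return m - half_width, m + half_width

def coverage(true_mean=10.0, sd=2.0, n=50, repeats=1000, seed=42):
    """Fraction of repeated-sample intervals that contain the true mean."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        low, high = ci_95(sample)
        if low <= true_mean <= high:
            hits += 1
    return hits / repeats

print(f"Share of intervals covering the true mean: {coverage():.3f}")
```

The printed share should land close to 0.95, though not exactly on it, since the simulation is itself random and the interval is an approximation.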
The goal of a comparative experiment is to compare several populations. Typically, we design the experiment so that each combination of factor levels receives a treatment, while keeping a control group to compare against.
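To make "each combination of factors" concrete, here is a small Python sketch that lists every run of a full factorial design. The factors (fertiliser, watering) and their levels are hypothetical, picked only for illustration; runs where fertiliser is "none" act as the control set.

```python
from itertools import product

# Hypothetical factors and levels, chosen only for illustration.
factors = {
    "fertiliser": ["none", "low", "high"],  # "none" is the control level
    "watering": ["daily", "weekly"],
}

def factorial_design(factors):
    """Return one dict per combination of factor levels (a full factorial)."""
    names = list(factors)
    return [dict(zip(names, combo)) for combo in product(*factors.values())]

design = factorial_design(factors)
controls = [run for run in design if run["fertiliser"] == "none"]

for run in design:
    print(run)
```

Three fertiliser levels times two watering levels gives six runs, two of which are controls.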
Two Tweets
Alison Presmanes Hill
Over the weekend, I wrote up my notes about using and teaching Quarto, based on my experiences working with the development team for over a year.

I think (hope?) it is safe to talk about it now 😆

https://t.co/qg98fEsW7g
Ryan Cavanaugh
"Do senior devs still have to look up stuff on StackOverflow?"

Gentle reader, I have gone to StackOverflow to read answers that I wrote about behavior that I myself designed and implemented.
One Meme
abhishek
Post your favorite machine learning memes ⬇️ Here's one of my favorites 🤣 https://t.co/1pJTavTDMR
Bonus: Pomodoro
Pomodoro makes time management simple and effective. You work for 25 minutes and then you take a short break. Recently, I found a website called Kaizen Flow (free), a Pomodoro timer that adds background “focus” music as well. The Forest app (paid) works great too. Finally, there’s the fantastic Mac tool called Pandan (free), which shows you how long you have been working without a break, leaving the decision to take breaks to you.
Fun fact: Francesco Cirillo, who came up with the idea of Pomodoro, used a tomato-shaped timer to track his 25 minutes. Pomodoro actually means tomato in Italian! 🍅
That's a wrap!
I hope you enjoyed today’s letter. Could you share it with a friend or a colleague? See you next week!
Harsh
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

Personal website: https://harsh17.in.

List of all packages covered in past issues: https://www.harsh17.in/nextpackages/.

Created with Revue by Twitter.
Knoxville, Tennessee.