Notebooks for Data Analysis, etc. | Next - Issue #25

Harshvardhan
Hi there!
Today’s letter starts with two critiques: the first about using notebooks (R Markdown and Jupyter) for data analysis and the second about the Tidyverse, mainly tidyr. Then, there is a biographical blog post by an Rcpp developer; a comparison between save() and saveRDS(); and a tutorial on creating custom URLs with blogdown.
Let’s dive in!

Five Stories
This four-year-old blog post by Yihui Xie, creator of knitr, blogdown, bookdown and many other packages, is a response to Joel Grus’s talk “I Don’t Like Notebooks”, presented at JupyterCon, a data science conference by Jupyter. Yihui takes Joel’s arguments and checks whether they also apply to R Markdown.
My main point is that if you use notebooks for software engineering, you are probably using the wrong tool, no matter how popular it is.
When we do data analysis, we need an accompanying notebook to record which decisions we took along the way and why. Comments are not enough; Markdown adds rich text formatting to plain text writing. Software engineering is different: there, we work with tools to create new tools.
Yihui covers eleven problems raised by Joel and provides a point-by-point rebuttal for both Jupyter Notebooks and R Markdown. It is a long post and I encourage you to read it. I also made some highlights and notes along the way; you can check them here.
Prof Norman Matloff from UC Davis wrote this well-researched critique of the Tidyverse, especially dplyr. His book “The Art of R Programming” is considered one of the leading works of its kind. He presents several arguments for how “Tidy” complicates concepts for new learners by forcing them to learn pipes and functional programming.
He presents case studies of pipes, tapply() and base R’s plot functions. He also criticizes some people’s insistence on never using base R’s plot(), $ or the right assignment operator ->.
On a personal note, I do agree with him that plot() and $ make the job simple; I use them every day. However, as an analysis grows, the $s add up and become prone to mistakes. I do thoroughly enjoy functional programming; it is my biggest reason for favouring R over Python for data analysis. Here are my highlights.
Romain’s biographical blog post on his career in R is a fantastic read. While at college in Montpellier, he was drawn to R and quickly made coding in R his job. I cannot do justice to his ideas by summarising his tumultuous journey in a few lines. However, I will present one quote from when he attempted independent consulting as a career:
On top of not being a consultant, dealing with paperwork, invoices, spending time on the phone, estimating the time something will take me, multiplying that by two, but still being way off, coming up with how much I think my time is worth, … I was not ready for any of that.
A long must-read biographical essay.
save() and saveRDS() are both important functions in R. They let you write R objects to a file on disk so that you can read them back later. Many prefer saveRDS() over save() because it stores a single object without its name: readRDS() returns the value and you assign it to whatever name you like. save() stores objects together with their names, and load() restores them under those same names. If you have forgotten the name, the object is hard to find after load(), and load() can silently overwrite an existing object that has the same name.
Bottom line: use saveRDS() to save objects and readRDS() to read saved objects.
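To make the difference concrete, here is a minimal sketch in base R. The object name scores and the temporary file paths are made up for illustration; they do not come from the linked post.

```r
# A small named vector to save; the name "scores" is purely illustrative.
scores <- c(a = 1, b = 2, c = 3)

# Temporary files so the sketch does not touch your working directory.
rds_path <- tempfile(fileext = ".rds")
rdata_path <- tempfile(fileext = ".RData")

# saveRDS() stores one object without its name; readRDS() returns the value,
# so the reader decides what to call it.
saveRDS(scores, rds_path)
restored <- readRDS(rds_path)

# save() stores the object together with its name; load() recreates it under
# that original name and invisibly returns the names it restored.
save(scores, file = rdata_path)
rm(scores)
loaded_names <- load(rdata_path)

print(restored)      # the same values, under a name chosen at read time
print(loaded_names)  # the name(s) that load() put back into the workspace
```

Capturing the return value of load(), as done here, is one way to find out which names it restored if you have forgotten them.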
I recently learned that I can create short URLs if I own the domain (and host the site with blogdown). This tutorial gives you the three-step process to create any short URL with a redirect file. I end with some tips on which short URLs to use and how to troubleshoot.
Four Packages
rolldown is an R Markdown extension that builds scroll-driven HTML documents for storytelling. See examples one and two. Here’s the project’s GitHub.
animation provides several methods to create statistical animations in HTML, PDF, GIF or video formats. See the website to learn more.
udpipe is a fantastic text mining package. It provides methods for tokenisation, POS tagging, lemmatisation, multi-word expressions, keyword detection, sentiment analysis and semantic similarity analysis. Best of all? It supports over 50 languages. See the website here.
vitae is a package for creating a résumé or CV using R Markdown. Check the GitHub here.
Three Jargons
Tokenisation (in Text Mining) breaks a text into smaller fragments called tokens that could be words, characters or subwords.
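As a rough illustration, word and character tokens can be produced with base R alone. This is only a sketch: the sample sentence and the split rule (anything that is not a letter a to z) are simplifying assumptions, and packages such as udpipe handle punctuation and other languages far more carefully. Subword tokenisation usually relies on a trained vocabulary and is not shown.

```r
text <- "Tokenisation breaks a text into smaller fragments called tokens."

# Word tokens: lower-case the text, then split on runs of non-letter characters.
word_tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
word_tokens <- word_tokens[word_tokens != ""]  # drop any empty pieces

# Character tokens: splitting on the empty string yields single characters.
char_tokens <- strsplit("tokens", "")[[1]]

print(word_tokens)
print(char_tokens)
```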
BLAS (Basic Linear Algebra Subprograms) is a specification of low-level routines for everyday linear algebra operations such as vector addition, scalar multiplication, dot products, linear combinations, and matrix multiplication.
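The operations in question look like this in R. The values are arbitrary, and the sketch only shows the kinds of computation BLAS covers; which of these calls actually reach a BLAS routine depends on how your R was built, although matrix products with %*% typically do.

```r
x <- c(1, 2, 3)
y <- c(4, 5, 6)
a <- 2

# Vector-vector operations
lin_comb <- a * x + y   # linear combination of two vectors
dot_xy <- sum(x * y)    # dot product

# Matrix-vector and matrix-matrix operations
A <- matrix(1:6, nrow = 2)  # 2 x 3 matrix, filled column by column
Ax <- A %*% x               # matrix-vector product (2 x 1 result)
AAt <- A %*% t(A)           # matrix-matrix product (2 x 2 result)

print(lin_comb)
print(dot_xy)
print(Ax)
print(AAt)
```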
Extreme Value Theory is the study of extreme deviations from the median of a probability distribution. It aims to estimate, from an observed sample, the probability of an event more extreme than any previously observed.
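Here is one hedged way to see that idea in R, using the block-maxima approach on simulated data. Everything in it is an assumption made for illustration: the exponential "daily" values, the 200 blocks of 365 observations, and the quick method-of-moments Gumbel fit (real analyses usually fit by maximum likelihood with a dedicated package).

```r
set.seed(25)  # arbitrary seed so the sketch is reproducible

# Simulate 200 "years" of 365 daily values and keep each year's maximum.
daily <- matrix(rexp(200 * 365, rate = 1), nrow = 200, ncol = 365)
block_max <- apply(daily, 1, max)

# Method-of-moments fit of a Gumbel distribution to the yearly maxima.
euler_gamma <- 0.5772156649
gumbel_scale <- sd(block_max) * sqrt(6) / pi
gumbel_loc <- mean(block_max) - euler_gamma * gumbel_scale

# Gumbel survival function: P(maximum > x).
gumbel_exceed <- function(x, loc, scale) {
  1 - exp(-exp(-(x - loc) / scale))
}

# Estimated chance that a new yearly maximum beats the largest one seen so far.
record <- max(block_max)
p_new_record <- gumbel_exceed(record, gumbel_loc, gumbel_scale)
print(p_new_record)
```

The empirical data alone would put this probability at zero, since nothing larger has been observed; the fitted distribution is what lets you extrapolate beyond the sample.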
Two Tweets
Lisa Lendway, she/her
Anyone have a pretty simple webscraping example using rvest? My old example no longer works and I can't figure out the right selectors to scrape it anymore. I really hate webscraping.
Alex Cookson
#rstats friends! GitHub repos are commonly recommended for managing data science work.

I'm setting one up at work and want to know all your recommendations and best practices for a smooooooth workflow! 😁
One Meme
That's a wrap!
I hope you enjoyed today’s letter. Could you share it with a friend or a colleague? See you next week!
Harsh
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

List of all packages covered in past issues: https://www.harsh17.in/nextpackages/.

Created with Revue by Twitter.
Knoxville, Tennessee.