Answers to your data science career questions | Next — Issue #21

Harshvardhan
Hi there!
After being cancelled once due to the pandemic, rstudio::conf(2022) is here! The conference was announced this Monday and will take place from July 25 to 28 in National Harbor, DC. It will have two days of workshops followed by two days of talks. I am particularly excited about the keynote on Quarto, R Markdown’s successor. You can attend the talks online for free (registration required); however, the workshops require paid registration. Check it out!
Today’s letter will cover some practical tips on data science as a career, some listicles (list + articles) on data, and my bookmarked article on how to take your ggplot from 1 to 10. And lots more. Let’s dive in!

Five Stories
Playing a sport for the first time is daunting. You are not sure of the norms and rules; you don’t even fully know your role. Choosing data science as a career is no different. This blog post by Emily Robinson collects her answers to frequently asked questions about data science careers. She is also the author of “Build a Career in Data Science” that I wrote about in my last letter. Here are some great mentions:
Josh writes the NYC Data Jobs & Events newsletter, a weekly listing of new data positions, events, and conferences in New York City. As part of that and beyond, he read many articles and wrote a listicle (list of articles) of the ones that profoundly influenced his thinking. Jump on to read the best ten articles. Some good ones:
To be honest, all ten articles could be part of this newsletter.
If you have worked with real data and real companies, you would dread the response “the guy who did that left”. Data, unfortunately, does not come with its birth certificate or a medical record. The question is: what keeps us from creating one?
Caitlin makes a persuasive case for why data dictionaries are needed. They establish a framework to store institutional knowledge and build on it when necessary. A good data dictionary should contain details about each table: its description, field descriptions, definitions, example values and field notes. Finally, she adds four other tips for new data dictionary projects.
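To make the idea concrete, here is a minimal sketch of a data dictionary stored as a plain R data frame. The `orders` table, its fields and all the values are hypothetical, made up only to illustrate the components listed above; Caitlin's article may well recommend a different layout.

```r
# Hypothetical data dictionary for an illustrative `orders` table.
# One row per field; the columns mirror the components named in the text.
data_dictionary <- data.frame(
  table             = c("orders", "orders", "orders"),
  table_description = "One row per customer order",
  field             = c("order_id", "order_date", "status"),
  field_description = c(
    "Unique identifier of the order",
    "Date the order was placed",
    "Current fulfilment status"
  ),
  definition        = c(
    "Assigned by the ordering system at checkout",
    "Calendar date in the customer's time zone",
    "One of: pending, shipped, returned"
  ),
  example_value     = c("100234", "2022-03-18", "shipped"),
  field_notes       = c(
    "Never reused",
    "Missing for some legacy orders",
    "Older records may use retired status codes"
  ),
  stringsAsFactors  = FALSE
)

# Look up everything recorded about a single field.
subset(data_dictionary, field == "status")
```

Keeping the dictionary as data (rather than a document) means it can be versioned, joined against the actual tables, and checked for fields that are missing a description.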
I started learning R before tidyverse was built. Although I love ggplot2, creating a histogram with hist(x) is way easier than ggplot(tibble(x), aes(x)) + geom_histogram(). But that is only because it is ultimately a simple histogram. In modern data science, we aren’t making good old histograms anymore.
Our plots should show the richness of our data. And as I quickly realised, adding legends in base R was a pain. In fact, it was a major reason why I started using ggplot2. David Robinson eloquently argues that ggplot2 is actually better for beginners.
I guess it is like LaTeX (vs Word). It will take you some time to learn, but once you learn, you’ll fly (not walk).
If David Robinson’s article didn’t convince you, here is Cedric Scherer to show how ggplot2 changes the game. Just watch the following GIF.
Source: Blog
I have this article bookmarked; I keep going through it every now and then.
Four Packages
tidylog: tidylog provides live feedback on dplyr operations. dplyr operations run silently, so we don’t know exactly what each manipulation did. Just by attaching the tidylog package, you get detailed feedback on your wrangling. See the vignette here.
geomtextpath: geomtextpath lets you place text along curved paths. geom_text plots text at points, whereas this package provides functions that draw sentences along lines. See the vignette here.
anomalize: Detecting anomalies is not trivial. A simple (yet popular) approach is to identify observations two standard deviations away from the mean as anomalies. This package provides methods to detect anomalies in general data as well as time-series data. Finance enthusiasts: run it to detect hints of economic turbulence. See the two-minute introduction here.
ggCyberPunk: So you want to impress geeks at your company with cyberpunk-styled plots. This package is here to help. See the vignette here.
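The two-standard-deviation rule mentioned under anomalize is easy to sketch in base R. This is only an illustration of that simple rule, not of how anomalize works internally; `flag_outliers` and the `daily_sales` numbers are made up for the example.

```r
# Flag values more than `k` standard deviations away from the mean.
# `flag_outliers` is a hypothetical helper, not a function from anomalize.
flag_outliers <- function(x, k = 2) {
  z <- (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
  abs(z) > k
}

# Made-up data: nine ordinary days and one spike.
daily_sales <- c(10, 11, 9, 10, 12, 10, 11, 9, 10, 50)

is_anomaly <- flag_outliers(daily_sales)
daily_sales[is_anomaly]  # only the spike of 50 is flagged
```

Note that the spike itself inflates both the mean and the standard deviation, which is one reason this rule is considered simple: more robust methods are less easily fooled by the very outliers they are trying to find.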
Three Jargons
GIS: A geographic information system (GIS) is a database containing geographic data combined with software tools for managing, analysing and visualising spatial and locational data.
cut(): How do you convert a numeric variable to a categorical one by grouping its values? cut(). It is a damn useful function; commit it to your memory. See this for more details.
Julia: Julia is a general-purpose programming language with a strong focus on numerical and statistical computing. It can also call R and Python packages. Here’s how its creators defined their vision ten years ago:
We want a language that’s open source, with a liberal license. We want the speed of C with the dynamism of Ruby. We want a language that’s homoiconic, with true macros like Lisp, but with obvious, familiar mathematical notation like Matlab. We want something as usable for general programming as Python, as easy for statistics as R, as natural for string processing as Perl, as powerful for linear algebra as Matlab, as good at gluing programs together as the shell. Something that is dirt simple to learn, yet keeps the most serious hackers happy. We want it interactive and we want it compiled.
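Going back to cut() from the list above, here is a small sketch of how it groups a numeric variable into categories. The ages, break points and labels are made up for illustration.

```r
# Made-up ages to group into three categories.
ages <- c(5, 17, 18, 42, 65, 80)

# right = FALSE makes each interval closed on the left: [0, 18), [18, 65), [65, Inf).
age_group <- cut(
  ages,
  breaks = c(0, 18, 65, Inf),
  labels = c("child", "adult", "senior"),
  right  = FALSE
)

# The result is a factor with one level per interval.
table(age_group)
```

The `right` argument is worth remembering: by default cut() uses intervals that are closed on the right, so an age of exactly 18 would land in the first group instead of the second.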
Two Tweets
RStudio
rstudio::conf is back and registration is open!🎉

Since we're always better when we're together, join us on July 25-28th in National Harbor, DC!

✍️ Register before April 15th to get an early bird discount on registration!

Link for more info: https://t.co/s61qFKxwHU https://t.co/wIrgqGuTWe
David Robinson
Something often missed in discussions of programming language performance is that Python/R written by an expert can often be faster than Java/C/C++ written by a beginner

Performance lies along a boundary, not a spectrum, and you don’t get speed “for free” w/language choice
One Meme
That's a wrap!
Hope you enjoyed today’s letter. Share it with a friend or a colleague. See you next week!
Harsh
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

Personal website: https://harsh17.in.

List of all packages covered in past issues: https://www.harsh17.in/nextpackages/.

Knoxville, Tennessee.