
Data Science in Industry | Next Issue #12

Hi there,
Happy New Year 2022!
It felt great to get back to writing after a good break. During this short break, I read many blogs by researchers at Meta, Spotify, Netflix, and Alphabet. If you aren’t aware of them yet, I’d suggest checking them out. In today’s edition, I will present some of those articles on how their data science teams operate.
Let’s dive in.

Five Stories
Making our displacement maps more representative, Meta Research
Natural disasters force millions of people out of their homes regularly. In 2020 alone, an estimated 30.7 million people were displaced due to natural causes. Facebook’s Data for Good provides displacement maps of people who moved from one location to another. However, such an estimate is biased because not all people use Facebook (FB is less common with lower-income groups), and not all share their location history.
To reduce bias, researchers wanted the sample of Facebook users to be representative of the population in terms of age, gender, relative wealth and rural-urban characteristics. They solved this challenge by weighting. First, they partition the map into 2.4 km wide tiles and calculate the average relative wealth index and road density within each tile. Then, using users’ characteristics such as age and gender, they re-weight to account for sparsely covered locations and “zoom out” to include the whole area affected by a disaster.
In both steps, they use inverse probability weighting (IPW) and Lasso regression. Facebook location users have different probabilities of being selected into the sample compared to the rest of the population. IPW weights each person by the inverse of their probability of selection.
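To make the weighting idea concrete, here is a minimal sketch of IPW in Python. The sample, the selection probabilities and the function names are all hypothetical; the probabilities are simply assumed to be known, and how they are estimated is outside this sketch.

```python
def ipw_weights(selection_probs):
    """Weight each sampled person by 1 / P(selected into the sample)."""
    for p in selection_probs:
        if not 0.0 < p <= 1.0:
            raise ValueError("selection probabilities must lie in (0, 1]")
    return [1.0 / p for p in selection_probs]


def weighted_share(flags, weights):
    """Weighted fraction of people whose flag is True."""
    return sum(w for f, w in zip(flags, weights) if f) / sum(weights)


# Hypothetical sample: five urban users followed by one rural user.
# Assumed selection probabilities: urban residents are far more likely
# to appear in the location data (0.5) than rural residents (0.1).
selection_probs = [0.5, 0.5, 0.5, 0.5, 0.5, 0.1]
displaced = [True, False, False, False, False, True]

weights = ipw_weights(selection_probs)
naive = sum(displaced) / len(displaced)        # treats every user equally
weighted = weighted_share(displaced, weights)  # corrects for who is in the sample

print(f"naive share displaced:    {naive:.2f}")
print(f"weighted share displaced: {weighted:.2f}")
```

In this toy sample the single rural user stands in for ten people, so the weighted share of displaced people (0.6) is much higher than the naive one-in-three.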
Rachel Bittner: Senior Research Scientist at Spotify
My Beat is a blog series by Spotify Research that spotlights technical employees from different areas and roles and showcases a day in their lives. I love the series because it gives me a peek into experts’ lives, like Tribe of Mentors but way shorter. Rachel Bittner is a senior research scientist at Spotify working on audio classification models.
Although she works from her home in Paris, she always gets a coffee and a bagel (not croissants) for breakfast. Classic New Yorker. Until late afternoon, she catches up on news, emails and Slack messages. Mid-afternoon is when New York comes online and she starts meetings with product teams and other stakeholders. She might also squeeze in a few research papers if time permits. The day ends with a long cooking session.
My Weekly Breakdown, Rachel Bittner, Spotify.
Representation of music creators on Wikipedia, Spotify Research
Wikipedia is arguably one of the best things that ever happened to the internet. Presence on the most extensive repository of knowledge can impact one’s reach. Spotify analysed how the 50,000 most streamed Spotify artists are represented on Wikipedia.
They found that streaming popularity is correlated with Wikipedia representation. The association isn’t linear: 90% of the top 1,000 artists have a presence, and the percentage drops below 50% after the 10,000th artist. Top all-female solo/group artists are slightly but significantly more represented on Wikipedia than all-male solo/group artists, although female artists are underrepresented in the top 50,000 Spotify artists.
Representation also varies by genre: 85% of rock artists are on Wikipedia, but only 33% of dance/electronic, 28% of hip hop and 21% of Latin artists are. By these figures, rock artists have roughly 2.5 times the representation of dance/electronic artists, three times that of hip hop artists and four times that of Latin artists. The blog goes on to explore the content of Wikipedia pages.
Creating New Content at Netflix
Scrolling through thousands of titles on Netflix, I have always wondered what goes into content creation decisions. Serving more than 195 million users in 190 countries is not easy. Could there be a more streamlined model for deciding which titles to fund?
Netflix’s data science team dissects this management decision problem into two important questions about historical data.
  1. What existing titles are comparable, and in what ways?
  2. What audience size can we expect, and in what regions?
In this blog post, researchers explain how they use machine learning and statistical modelling for these tasks, which are challenging with conventional methods that rely on box office sales and Nielsen ratings. Their approach is rooted in Transfer Learning (creatively marketed as “Machine Learning’s Next Frontier”).
First, they use a large dataset of historical titles and text summaries curated by an expert team to create numerical representations, or embeddings, for each title. With the embeddings as inputs, the researchers then work with a smaller set of titles directly relevant to content decisions. All computations are performed using Metaflow, Netflix’s open-source framework. It also has an official R engine.
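One common way to use embeddings for the first question (which existing titles are comparable?) is to rank titles by cosine similarity. The sketch below assumes this technique for illustration only; it is not necessarily what Netflix does, and the title names and three-dimensional vectors are made up.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


def most_similar(query, catalog, k=2):
    """Return the k catalog titles whose embeddings are closest to the query."""
    scored = [(name, cosine_similarity(query, vec)) for name, vec in catalog.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hypothetical embeddings; real ones would have many more dimensions.
catalog = {
    "crime_drama_a": [0.9, 0.1, 0.0],
    "crime_drama_b": [0.8, 0.2, 0.1],
    "romcom_a": [0.1, 0.9, 0.2],
    "nature_doc_a": [0.0, 0.1, 0.9],
}
new_title = [0.85, 0.15, 0.05]  # embedding of a proposed, not yet produced title

top = most_similar(new_title, catalog, k=2)
for name, score in top:
    print(f"{name}: {score:.3f}")
```

Here the two hypothetical crime dramas come out as the closest matches to the proposed title, which is the kind of "comparable titles" shortlist a decision maker might start from.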
Supporting content decision makers with machine learning
2021 at RStudio: A Year in Review
2021 was a great year for R — a big thanks to the community of R developers. On average, 140 packages were published on CRAN every week, peaking in the latter half of the year!
Number of R packages published on CRAN in 2021. Visualisation by Harshvardhan.
RStudio wrote a blog post earlier today highlighting some of their work this year. Interoperability and support for Python were the headlines. Tidyverse and Tidymodels published their own year-in-review posts to highlight changes. (I particularly liked the informative error messages coming up in rlang.)
The most important development, I think, was the set of updates to the RStudio IDE. Rainbow parentheses improve code readability by miles. The visual editor’s table editor and citation manager are extremely useful. If there is one link you should click in this letter, let it be this one and jump to the quick tour of RStudio 1.4.
Four Packages
ggx: This package is a helper for plotting with ggplot2. It can convert a plain-English request such as “rotate x-axis labels by 90 degrees” into ggplot2-speak. Check the package’s GitHub repository for more details.
IndiaPIN: Two days ago, I was bored and decided to write a dataset package for Indian PIN codes (equivalent to ZIP codes). IndiaPIN has PIN codes for all post offices in India with their geographic details, including longitude and latitude. See the GitHub repository.
pingr: Sometimes, you need to check if a remote computer or web server is up and running. ping() from pingr allows you to do that. See the GitHub repository.
archive: When you need to compress your output files straight from R, do so with functions such as archive_write() from the archive package. See the GitHub repository.
Three Jargons
curly curly operator: {{ var }} When you write your own functions around tidyverse verbs, supplying a column name as an argument can lead to errors, because the verb looks for a column literally named after the argument. Wrapping the argument in curly curly tells the verb to use the column the caller supplied. Check the example here.
bang bang operator: !! The older sibling of curly curly, it does almost the same thing but less elegantly: you first capture the argument with enquo() and then unquote it with !!. Check the example here.
Lasso Regression: It is a regression analysis method that performs both variable selection and regularization in order to enhance the prediction accuracy and interpretability of the model. Wikipedia.
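To see the variable selection part of Lasso in action, here is a minimal sketch in plain Python. It assumes the common objective (1/(2n)) * sum of squared residuals + lam * sum of |beta_j|, fitted by coordinate descent with soft-thresholding and no intercept. The data, the penalty value and the function names are hypothetical; in practice you would reach for a tested library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 if it falls inside."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iter=100):
    """Minimise (1/(2n)) * ||y - X beta||^2 + lam * ||beta||_1 (no intercept)."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                others = sum(X[i][k] * beta[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - others)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            beta[j] = soft_threshold(rho, lam) / z if z > 0 else 0.0
    return beta


# Hypothetical centred data: y = 2 * x1 + 0.1 * x2, with x1 and x2 orthogonal.
X = [[-2.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [2.0, 1.0]]
y = [-3.9, -2.1, 1.9, 4.1]

beta = lasso_coordinate_descent(X, y, lam=0.5)
print(beta)
```

With this penalty the strong coefficient is shrunk from 2 to about 1.8 (regularization), while the weak one is set to exactly zero (variable selection).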
Two Tweets
Ilya Kashnitsky
Which countries have more researchers registered in @Publons and where did they contribute more into fuelling the peer-review engine of academia in the last year 🔎🗺️

🔗#rstats code:
One Meme
That's a wrap!
As always, feedback is welcome. Hope you have a wonderful new year!
Harshvardhan @harshbutjust

A short and sweet curated collection of R-related works. Five stories. Four packages. Three jargons. Two tweets. One Meme.

List of all packages covered in past issues:

Knoxville, Tennessee.