How Reading Papers Helps You Be a More Effective Data Scientist




By Eugene Yan, Applied Scientist at Amazon

“Instead of manually checking our data, why not try what LinkedIn did? It helped them achieve 95% precision and 80% recall.”

My teammate then shared how LinkedIn used k-nearest neighbours to identify inconsistent labels in job titles. LinkedIn then trained a support vector machine (SVM) on the consistent labels and used the SVM to update the inconsistent ones. This helped them achieve 95% precision on their job title classifier.
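For illustration, here’s a minimal sketch of this clean-then-relabel idea in scikit-learn. It is not LinkedIn’s actual pipeline; the pre-computed title embeddings, neighbourhood size, agreement threshold, and linear kernel are all assumptions.

```python
# A minimal sketch (not LinkedIn's actual pipeline): flag labels that disagree
# with their k nearest neighbours, then relabel them with an SVM trained only
# on the labels that look consistent. Embeddings and thresholds are assumptions.
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def clean_labels(X, y, k=5, agreement=0.8):
    """X: (n, d) array of job-title embeddings; y: (n,) array of noisy labels."""
    X, y = np.asarray(X), np.asarray(y)

    # Compare each point's label against its k nearest neighbours.
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)                  # idx[:, 0] is the point itself
    neighbour_labels = y[idx[:, 1:]]           # labels of the k neighbours
    consistency = (neighbour_labels == y[:, None]).mean(axis=1)
    consistent = consistency >= agreement      # label agrees with most neighbours

    # Train an SVM on the consistent subset, then use it to relabel the rest.
    svm = SVC(kernel="linear").fit(X[consistent], y[consistent])
    y_clean = y.copy()
    if (~consistent).any():
        y_clean[~consistent] = svm.predict(X[~consistent])
    return y_clean
```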

This suggestion was the most useful one in our discussion. Following up on it led to our product classifier’s eventual accuracy of 95%. I asked how she was able to contribute that critical insight. “Oh, I just read papers every now and then”, she replied. To be specific, she tries to read 1–2 papers weekly, usually on topics the team is working on.

By reading papers, we learn what others (e.g., LinkedIn) have found to work (and what doesn’t). We can then adapt their approach and not have to reinvent the rocket. This helps us deliver a working solution with less time and effort.

 

If I have seen further than others, it is by standing upon the shoulders of giants.
— Isaac Newton

 

Reading papers also widens our perspective. Though we may work in narrow slices of data science, developments in tangential research are often helpful. For example, the ideas of word embeddings and graphs have been useful in recommender systems. Similarly, ideas from computer vision — such as transfer learning and data augmentation — have been helpful for natural language processing (NLP).

Reading papers also keeps us up to date. The field of NLP has made large advances in the past decade. Nonetheless, by reading the 10 or so most crucial papers, we can quickly get up to speed. Being up to date makes us more effective at work, requiring less time and effort. We then have more time to read and learn, leading to a virtuous cycle.

 

How to choose what papers to read?

 
If we’re just starting this habit, we can read whatever interests us — most papers will have something to teach us. Reading about topics we’re interested in also makes it easier to build the habit.

We could also select papers based on practicality. For example, we might need to quickly understand a domain for a project. Before starting on a project, I almost always set aside time for a literature review. Spending a couple of days diving into papers can save weeks, if not months, of dead-ends and unnecessarily reinventing the wheel.

Recommendations are also a handy way to identify useful papers to read. One hack is to follow people we admire on social media, or subscribe to curated newsletters — I’ve found these sources to have a high information-to-noise ratio.

What papers do I read? Out of practicality, I mostly read papers related to work. This allows me to immediately apply what I’ve read and thus reinforce my learning. Outside of work, I have an interest in sequences and tend to read about NLP and reinforcement learning. I’m especially fond of papers that share what worked and what didn’t, such as through ablation studies. This includes the papers on Word2vec, BERT, and T5.

 

How to read papers?

 
A Google search for “how to read papers” returns innumerable useful results. But if you find it overwhelming, here are a couple that I found helpful:

My method is similar to the three-pass approach. In the example below, I’ll share how I read several recsys papers to learn about the metrics of novelty, diversity, serendipity, etc.

In the first pass, I scan the abstract to understand if the paper has what I need. If it does, I skim through the headings to identify the problem statement, methods, and results. In this example, I’m specifically looking for formulas for calculating the various metrics. I give all papers on my list a first pass (and resist starting on a second pass until I’ve completed the list). In this example, about half of the papers made it to the second pass.

Figure: After the first pass, 30+ papers were reduced to 14 — that’s good effort saved.

 

In the second pass, I go over each paper again and highlight the relevant sections. This helps me quickly spot important portions when I refer to the paper later. Then, I take notes for each paper. In this example, the notes were mostly around metrics (i.e., methods, formulas). If it were a literature review for an application (e.g., recsys, product classification, fraud detection), the notes would focus on the methods, system design, and results.

Figure: Sample notes from three papers; notes specific to metrics boxed in red.

 

For most papers, the second pass suffices. I’ve captured the key information and can refer to it in future if needed. Nonetheless, I sometimes do a third pass if I’m reading papers as part of a literature review, or if I want to cement my knowledge.

 

Reading furnishes the mind only with materials of knowledge; it is thinking that makes what we read ours.
— John Locke

 

In the third pass, I synthesize the common concepts across papers into their own notes. Various papers have their own methods to measure novelty, diversity, serendipity, etc. I consolidate them into a single note and compare their pros and cons. While doing this, I often find gaps in my notes and knowledge and have to revisit the original paper.

Figure: Sample notes on the serendipity and unexpectedness metric.
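As a flavour of the kind of formula such notes distil, here’s a minimal sketch of one common serendipity formulation: the share of recommended items that are both unexpected (not surfaced by an obvious baseline recommender) and relevant to the user. The function and inputs are placeholders, not taken from any specific paper above.

```python
# A hedged sketch of one common serendipity formulation: the share of
# recommended items that are both unexpected (absent from an obvious
# baseline's recommendations) and relevant. Names are placeholders.
def serendipity(recommended, baseline, relevant):
    """recommended: ordered list of item ids from the model under evaluation;
    baseline: set of item ids a popularity/obvious baseline would surface;
    relevant: set of item ids the user actually engaged with."""
    if not recommended:
        return 0.0
    unexpected_and_relevant = [
        item for item in recommended
        if item not in baseline and item in relevant
    ]
    return len(unexpected_and_relevant) / len(recommended)

# Example: 2 of 5 recommendations are both off-baseline and relevant -> 0.4
print(serendipity(
    recommended=["a", "b", "c", "d", "e"],
    baseline={"a", "b"},
    relevant={"c", "d"},
))
```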

 

Lastly, if I think it’ll be useful for others, I write about what I’ve learnt and publish it online. Relative to starting from scratch, having my notes as a reference makes writing much easier. This has led to pieces such as:

 

Try it for yourself

 
Before diving into your next project, spend a day or two scanning a couple of relevant papers. I’m confident it’ll save you time and effort in the medium to long term. Not sure where to begin? Here are some useful resources to start with:

 
Bio: Eugene Yan works at the intersection of machine learning & product to build pragmatic, customer-facing ML systems. He's currently an Applied Scientist at Amazon. Previously, he led the data science teams at Lazada and uCare.ai. He writes & speaks about data science, data/ML systems, and career growth at eugeneyan.com and tweets at @eugeneyan.

Original. Reposted with permission.
