
How We Use Data to Suggest Tags for Your Story

Here on Medium, we envision tags as central to organizing and connecting ideas. Follow the tags you’re interested in, and Medium will help deliver the right content to you. To do that, we’d like as many writers as possible to tag their posts. For writers, we’d love to help you find your audience.

So how can we use data to improve how tags are used?

Our solution: suggest tags.

What Are Tag Suggestions?

Just as you’re about to publish your draft, we’ll suggest a couple of tags for you to use based on what you’ve written. Our goal is not only to increase the number of tagged posts on Medium, but also to help users discover the right tags to use.

Let’s look at the suggested tags for a few excerpts from NY Times articles.

Kanye West declared himself “the greatest living rock star on the planet” at Britain’s Glastonbury festival. But that didn’t prevent a prankster from invading the rapper’s performance and upstaging him.
That is because Mr. Trump’s popularity — his support in some polls is now double that of his closest competitors — is built on his unfettered style, rather than on his positions, which have proved highly fungible.
Reddit, the popular community news site, formalized a new set of guidelines that aim to restrict some of the risqué and potentially offensive content posted to the site.

How It Works

Overview

In our algorithm, we use what is called a nearest neighbors approach. This means that for your post, we consider the tags of the posts that are most similar to what you’ve drafted. We then aggregate these tags and rank them by a tag score based on similarity.

Here is a simplified visualization:

Tags of nearest neighbors
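To make the ranking step concrete, here is a minimal sketch in Python. The exact scoring formula isn’t spelled out above, so this example simply assumes each tag is scored by the summed similarity of the neighbors that carry it, and the neighbor data is made up.

```python
from collections import defaultdict

def score_tags(neighbors):
    """neighbors: list of (similarity, tags) pairs for the most similar posts."""
    scores = defaultdict(float)
    for similarity, tags in neighbors:
        for tag in tags:
            # Assumed scoring: a tag's score is the sum of the similarities
            # of the neighboring posts that use it.
            scores[tag] += similarity
    # Rank tags from highest to lowest score.
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)

# Hypothetical nearest neighbors of a drafted post.
neighbors = [
    (0.91, ["Basketball", "Sports"]),
    (0.85, ["Sports", "NBA"]),
    (0.78, ["Basketball"]),
]
print(score_tags(neighbors))
# "Sports" ranks first (0.91 + 0.85), then "Basketball" (0.91 + 0.78), then "NBA".
```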

In order to use this nearest neighbors method, we need to find a way to compare posts. We can do this by representing posts as vectors in a high-dimensional space and then quantifying how similar two posts are by calculating a distance metric.

Vectorizing Posts

In vectorizing posts, we use something called tf-idf. tf is short for term frequency, and it measures how frequently a certain word occurs in a document. idf, or inverse document frequency, measures how rare, and therefore how distinctive, a certain word is across a collection of documents.

Thus, tf-idf is the product of these two statistics. It reflects how important a word is to a given post within a collection of posts. For example, let’s consider a post written about basketball. The word “layup” will have a high tf-idf value because not only does it occur frequently in the post, but it is also a term that is very specific to basketball. On the other hand, “pass” has multiple meanings and can be used in many different contexts, so it may not have a high tf-idf value.

In our tag suggestions algorithm, we use tf-idf vectors to represent the content of a post.
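To make this concrete, here’s a minimal sketch using scikit-learn’s TfidfVectorizer on a made-up corpus; it illustrates the idea, and isn’t necessarily the tooling or configuration behind our production system.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# A tiny made-up corpus standing in for the collection of posts.
posts = [
    "The layup drill is the first thing every basketball coach teaches.",
    "A good chest pass is the foundation of basketball offense.",
    "The quarterback threw a short pass on third down.",
]

# norm='l2' (the default) gives us unit-length vectors, which pays off later
# when we compute cosine similarities with a plain dot product.
vectorizer = TfidfVectorizer(stop_words="english", norm="l2")
post_vectors = vectorizer.fit_transform(posts)  # sparse matrix: one row per post

print(post_vectors.shape)                      # (3, size of vocabulary)
print(vectorizer.get_feature_names_out()[:5])  # a peek at the learned vocabulary
```

In this toy corpus, “pass” shows up in two of the three documents, so its idf (and therefore its weight) is lower than that of “layup,” which appears in only one, mirroring the basketball example above.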

Defining A Distance Metric

With our tf-idf post vectors, we can measure how similar two posts are by using cosine similarity. Cosine similarity is the cosine of the angle between two vectors. A pair of post vectors pointing in the same direction will have a cosine similarity of 1, while post vectors that share no terms at all (and are therefore orthogonal) will have a cosine similarity of 0. Now we have a way of calculating post similarity!
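As a quick sketch with toy vectors (NumPy here, purely for illustration):

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy tf-idf-style vectors (non-negative entries).
a = np.array([1.0, 1.0, 0.0])
b = np.array([2.0, 2.0, 0.0])  # same direction as a, different length
c = np.array([0.0, 0.0, 3.0])  # shares no terms with a

print(cosine_similarity(a, b))  # ~1.0
print(cosine_similarity(a, c))  # 0.0
```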

Finding The Nearest Neighbors

Once a writer drafts a new post, we get a new post vector. A nice property of our tf-idf vectors is that they are normalized to unit length, which lets us calculate cosine similarities with a simple dot product. By representing the entire collection of posts as a matrix with n rows (each row being a post vector), we can take the dot product of this matrix with the new post vector and get an n-dimensional vector of cosine similarities.

Now we can look for the largest values in this cosine similarity vector to find our nearest neighbors! Finally, we’ll aggregate the tags of these nearest neighbors to determine which ones to suggest.
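Putting the pieces together, here’s a minimal end-to-end sketch. The matrix, tags, and summed-similarity tag score are all illustrative assumptions; the real system works over the full corpus of Medium posts.

```python
import numpy as np
from collections import Counter

# Toy data: each row of `post_matrix` is an L2-normalized tf-idf vector,
# and `post_tags[i]` holds the tags of post i. Everything here is made up.
post_matrix = np.array([
    [0.8, 0.6, 0.0, 0.0],
    [0.0, 0.6, 0.8, 0.0],
    [0.0, 0.0, 0.6, 0.8],
])
post_tags = [
    ["Basketball", "Sports"],
    ["Sports", "Fitness"],
    ["Fitness", "Health"],
]

def suggest_tags(new_post_vector, k=2, n_suggestions=2):
    # One matrix-vector dot product gives the cosine similarity of the
    # draft against every existing post (all vectors are unit length).
    similarities = post_matrix @ new_post_vector

    # The k largest similarities identify the nearest neighbors.
    neighbor_idx = np.argsort(similarities)[::-1][:k]

    # Aggregate the neighbors' tags, weighting each tag by the similarity
    # of the post it came from (the same assumed scoring sketched earlier).
    scores = Counter()
    for i in neighbor_idx:
        for tag in post_tags[i]:
            scores[tag] += similarities[i]
    return [tag for tag, _ in scores.most_common(n_suggestions)]

draft = np.array([0.6, 0.8, 0.0, 0.0])  # the new post's unit-length tf-idf vector
print(suggest_tags(draft))              # ['Sports', 'Basketball']
```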