Designing a URL Shortening service like TinyURL

Why do we need URL shortening?

URL shortening is used to create shorter aliases for long URLs. We call these shortened aliases “short links.” Users are redirected to the original URL when they hit these short links. Short links save a lot of space when displayed, printed, messaged, or tweeted. Additionally, users are less likely to mistype shorter URLs.

For example, if we shorten this page through TinyURL:

https://www.tuturself.com/posts/view?menuId=3&postId=1329

We would get:

https://tinyurl.com/y4h3knv3

The shortened URL is nearly one-third the size of the actual URL. URL shortening is used for optimizing links across devices, tracking individual links to analyze audience and campaign performance, and hiding affiliated original URLs. If you haven’t used tinyurl.com before, please try creating a new shortened URL and spend some time going through the various options their service offers. This will help you a lot in understanding this chapter.

Following are the high-level requirements of the tiny URL system.

Functional Requirements:

  1. Given a URL, our service should generate a shorter and unique alias of it. This is called a short link. This link should be short enough to be easily copied and pasted into applications.
  2. When users access a short link, our service should redirect them to the original link.
  3. Links will expire after a standard default timespan. Users should be able to specify the expiration time.

Non-Functional Requirements:

  1. The system should be highly available. This is required because, if our service is down, all the URL redirections will start failing.
  2. URL redirection should happen in real-time with minimal latency.
  3. Shortened links should not be guessable (not predictable).

Extended Requirements:

  1. Analytics; e.g., how many times a redirection happened?
  2. Our service should also be accessible through REST APIs by other services.

 

System APIs

We can have REST APIs to expose the functionality of our service. The following could be the definitions of the APIs for creating and deleting URLs:

createURL(api_key, original_url, custom_alias=Optional, user_name=Optional, expire_date=Optional)

Parameters:

Parameter Type Description
api_key String   The API developer key of a registered account. This will be used to, among other things, throttle users based on their allocated quota
original_url String Original URL to be shortened
custom_alias String The optional custom key for the URL.
user_name String Optional user name to be used in encoding
expire_date String The optional expiration date for the shortened URL


Returns: (String)
A successful insertion returns the shortened URL; otherwise, it returns an error code.

deleteURL(api_dev_key, url_key)

Where “url_key” is a string representing the shortened URL to be retrieved. A successful deletion returns ‘URL Removed’.

How do we detect and prevent abuse? A malicious user can put us out of business by consuming all URL keys in the current design. To prevent abuse, we can limit users via their api_key. Each api_key can be limited to a certain number of URL creations and redirections per some time period (which may be set to a different duration per developer key). This can be done using Gateway applications like Netflix-Zuul or spring-cloud-gateway

 

Database Design

?      Defining the DB schema in the early stages of the interview would help to understand the data flow among various components and later would guide towards data partitioning.

A few observations about the nature of the data we will store:

  1. We need to store billions of records.
  2. Each object we store is small (less than 1K).
  3. There are no relationships between records—other than storing which user created a URL.
  4. Our service is read-heavy.

Database Schema:

We would need two tables: one for storing information about the URL mappings, and one for the user’s data who created the short link.

What kind of database should we use? Since we anticipate storing billions of rows, and we don’t need to use relationships between objects – a NoSQL key-value store like DynamoDBCassandra or Riak is a better choice. A NoSQL choice would also be easier to scale. Please see SQL vs NoSQL for more details.

 

Basic System Design and Algorithm

The problem we are solving here is, how to generate a short and unique key for a given URL. In the TinyURL example in Section 1, the shortened URL is “https://tinyurl.com/y4h3knv3. The last six characters of this URL is the short key we want to generate. We’ll explore two solutions here:

a. Encoding actual URL

We can compute a unique hash (e.g., MD5 or SHA256, etc.) of the given URL. The hash can then be encoded for display. This encoding could be base36 ([a-z ,0-9]) or base62 ([A-Z, a-z, 0-9]) and if we add ‘-’ and ‘.’ we can use base64 encoding. A reasonable question would be, what should be the length of the short key? 6, 8 or 10 characters.

Using base64 encoding, a 6 letter long key would result in 64^6 = ~68.7 billion possible strings
Using base64 encoding, an 8 letter long key would result in 64^8 = ~281 trillion possible strings

With 68.7B unique Strings, let’s assume six letter keys would suffice for our system.

If we use the MD5 algorithm as our hash function, it’ll produce a 128-bit hash value. After base64 encoding, we’ll get a string having more than 21 characters (since each base64 character encodes 6 bits of the hash value). Since we only have space for 8 characters per short key, how will we choose our key then? We can take the first 6 (or 8) letters for the key. This could result in key duplication though, upon which we can choose some other characters out of the encoding string or swap some characters.

What are the different issues with our solution? We have the following couple of problems with our encoding scheme:

  1. If multiple users enter the same URL, they can get the same shortened URL, which is not acceptable.
  2. What if parts of the URL are URL-encoded? e.g., https://www.tuturself.com/posts/view?menuId=3&postId=1329, and https%3A%2F%2Fwww.tuturself.com%2Fposts%2Fview%3FmenuId%3D3%26postId%3D1329 are identical except for the URL encoding.

Workaround for the issues: We can append an increasing sequence number to each input URL to make it unique, and then generate a hash of it. We don’t need to store this sequence number in the databases, though. Possible problems with this approach could be an ever-increasing sequence number. Can it overflow? Appending an increasing sequence number will also impact the performance of the service.

Another solution could be to append user id (which should be unique) to the input URL. However, if the user has not signed in, we would have to ask the user to choose a uniqueness key. Even after this, if we have a conflict, we have to keep generating a key until we get a unique one.

Request flow for shortening of a URL:

Generating keys offline

We can have a standalone Key Generation Service (KGS) that generates random six letter strings beforehand and stores them in a database (let’s call it key-DB). Whenever we want to shorten a URL, we will just take one of the already-generated keys and use it. This approach will make things quite simple and fast. Not only are we not encoding the URL, but we won’t have to worry about duplications or collisions. KGS will make sure all the keys inserted into key-DB are unique

Can concurrency cause problems? As soon as a key is used, it should be marked in the database to ensure it doesn’t get used again. If there are multiple servers reading keys concurrently, we might get a scenario where two or more servers try to read the same key from the database. How can we solve this concurrency problem?

Servers can use KGS to read/mark keys in the database. KGS can use two tables to store keys: one for keys that are not used yet, and one for all the used keys. As soon as KGS gives keys to one of the servers, it can move them to the used keys table. KGS can always keep some keys in memory so that it can quickly provide them whenever a server needs them. For simplicity, as soon as KGS loads some keys in memory, it can move them to the used keys table. This ensures each server gets unique keys. If KGS dies before assigning all the loaded keys to some server, we will be wasting those keys–which is acceptable, given the huge number of keys we have. KGS also has to make sure not to give the same key to multiple servers. For that, it must synchronize (or get a lock on) the data structure holding the keys before removing keys from it and giving them to a server.

The Original Article can be found at https://www.educative.io/collection/page/5668639101419520/5649050225344512/5668600916475904

Core Java