This is Part 5 of the Web Development Overview series. In previous parts, we looked at content, design/UX, and backend (Part 1), building apps from scratch and frontend development (Part 2), testing and hosting (Part 3), and performance optimization (Part 4).
In this installment, we will cover how to handle more traffic without significant investment in infrastructure – caching.
Website Caching Basics
Why do we need caching for our websites?
As mentioned in Part 4, our overall goal is to make the page render as fast as possible for all our users, regardless of the amount of traffic. This can be achieved by adding more servers, or by upgrading current ones to more powerful versions. This will work and sometimes it’s the only option, but it also has a serious downside: cost. Adding more servers means paying for more resources, and not every aspect of infrastructure is easy to run in parallel.
A good example is MySQL, where you can run a cluster of master and slave nodes, but in most cases all writes are performed on the master and reads on the slaves. This means there’s a delay between the moment data is written on the master and the moment it becomes available for reading on a slave, because it needs to be synced first. In some cases this doesn’t matter, but it is something that needs to be thought through, analyzed, and monitored constantly. In this case, one would need to pay for more servers, and also for the ongoing maintenance of a more complex setup.
So, what advantage does caching give us over beefing up the infrastructure? This is hard to answer as it depends on the project. In most cases, caching:
- is easier to set up
- may require additional servers (but not always)
- gives you a bigger bang for your buck.
For example, Varnish Cache alone can, according to its developers, speed up content delivery by a factor of 300 to 1000, depending on the site’s architecture.
When do we need caching?
Now let’s look at our “when” question. Sometimes I hear that website caching is required only for high traffic sites and is too much effort to worry about on small installations. I disagree. I think that any site that has hope for success should have some level of caching introduced. After all, we aim for the quickest load times.
It doesn’t need to be sophisticated, but some of the things one can do with the default settings of the available solutions are just too good not to use. I’ll guide you through the most common solutions we use on our projects. We will start with the things that reside deep in the bowels of the servers and finish with the ones that catch the request before it leaves the client’s browser.
Website Caching on Different Locations
What are the locations where your site (or its parts) can be cached? There are quite a few, and I’ll also classify them by what is being cached. As I said, we start from the lowest level:
Database Cache
Database caching can happen on two separate levels and is usually fairly easy to set up:
- Internal query cache. Some database engines have a built-in query cache. Older MySQL versions and MariaDB ship one with conservative settings, while MySQL 8.0 removed it entirely, so check whether it is available and enabled on your version. Where it is, the defaults may require tweaking, but usually they are a good starting point.
- External query cache. It’s a very good idea to complement the DB cache by installing a key/value store server and using it to cache query results further. Popular solutions are Memcached and Redis. This can reduce database load considerably, in some cases by as much as 80%.
The idea behind this caching option is that the website constantly requests data from the database, and it often requests the same things over and over again. For example, when you request this post from the server, you trigger almost the same DB traffic as a colleague requesting it. If the query is the same, why would we want to go back to the DB to get the result?
In most cases, it’s safe to load it from cache. Even if one considers loading two different posts one after another, some things are identical and can be cached. That’s because creating the page content for the browser is rarely done with a single DB query. Usually, there are multiple requests made to the DB, and some are common to most requests – and that’s a caching opportunity.
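To make this concrete, here is a minimal sketch of the external query cache idea in Python. It is only an illustration: the `QueryCache` class is a hypothetical in-process stand-in for a Memcached or Redis server, `cached_query` is a made-up helper name, and SQLite replaces MySQL so the example runs anywhere.

```python
import sqlite3
import time


class QueryCache:
    """Hypothetical in-process key/value store standing in for Memcached or Redis."""

    def __init__(self, ttl_seconds=60.0):
        self.ttl_seconds = ttl_seconds
        self._store = {}  # key -> (expires_at, value)

    def get(self, key):
        entry = self._store.get(key)
        if entry is None:
            return None
        expires_at, value = entry
        if time.monotonic() >= expires_at:
            # The entry is too old: drop it so the caller goes back to the DB.
            del self._store[key]
            return None
        return value

    def set(self, key, value):
        self._store[key] = (time.monotonic() + self.ttl_seconds, value)


def cached_query(conn, cache, sql, params=()):
    """Return rows for a query, touching the database only on a cache miss."""
    key = (sql, tuple(params))
    rows = cache.get(key)
    if rows is None:
        rows = conn.execute(sql, params).fetchall()
        cache.set(key, rows)
    return rows


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT)")
    conn.execute("INSERT INTO posts (title) VALUES ('Caching basics')")
    cache = QueryCache(ttl_seconds=30)
    sql = "SELECT title FROM posts WHERE id = ?"
    print(cached_query(conn, cache, sql, (1,)))  # goes to the database
    print(cached_query(conn, cache, sql, (1,)))  # answered from the cache
```

The expiry time is the trade-off to tune: a longer one saves more database work, but visitors may see stale data for longer after a write.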
Programming Language Cache
We know that there are some repeatable things done on the database level that are good candidates for our new caching strategy. The same is true of the framework or programming language. Here we have even more options when we think about website caching:
- Website source code cache. There is usually a significant optimization opportunity in the application code itself. By using mechanisms like memoization, one can significantly speed up the load times of the application.
- Code precompilation. The problem with most common web programming languages like PHP or Ruby is that they are scripts that need to be read, processed, compiled, and then run when the request comes to the server. All this takes quite a lot of time, and is usually just the same operation repeated millions of times. To some extent, this can be avoided. For example, when using PHP one can enable the OPcache module, which stores precompiled PHP code in shared memory for easy and fast access. Another option is HHVM, a Facebook-born virtual machine that runs code through a just-in-time compiler (newer HHVM releases support only Hack, so for PHP this applies to older versions). These options offer significant speed improvements.
- Framework cache. These days, performance is on everyone’s mind, so framework communities offer options for efficient website caching as well. For example, Ruby on Rails offers built-in caching mechanisms, and ASP.NET MVC gives access to the System.Web.Caching.Cache class, which serves this purpose.
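As a small illustration of memoization, here is a sketch using Python’s standard `functools.lru_cache` decorator. The `render_sidebar` function is a hypothetical example of an expensive helper. It is not taken from any particular framework.

```python
from functools import lru_cache


@lru_cache(maxsize=256)
def render_sidebar(category: str) -> str:
    """Hypothetical expensive helper, e.g. one that builds HTML from several queries."""
    # Imagine slow work here. With memoization it runs once per distinct category.
    return f"<ul><li>Latest posts in {category}</li></ul>"
```

Calling `render_sidebar("php")` twice runs the function body only once. The second call is answered from memory, and `render_sidebar.cache_info()` reports the hits and misses. Memoization like this only suits functions whose result depends solely on their arguments.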
Reverse Proxy Cache
Having covered the most common website caching options in the application stack (database and application code), it’s time to step back a bit and add another layer in front of our web application. There are usually multiple options to choose from:
- Web server cache. Since it’s almost certain that a given website is already using a web server, this option may be the easiest to set up. A good example of this technique is setting up the FastCGI cache on the Nginx web server. The server stores the resulting HTML of web pages and serves it in the same way as static files, so the underlying framework and database are not involved and the page is returned quickly. This option has a downside: even if the content is served from cache, the web server still needs to handle the request. That may block access to the web server for requests that need to bypass the cache.
- Varnish Cache. It’s a stand-alone server that acts as the primary handler of all traffic going into the given website. It can run on the same machine as Apache or Nginx (or other web server software), but it can very easily be moved to a separate machine to further optimise the load on the infrastructure and allow unobstructed access to the uncached pages.
- Third party services. This is a good way to go if one does not want to maintain another server or worry about proper configuration of software like Varnish. It’s also possible that the service of choice will offer additional valuable features. A good example is CloudFlare.com, where one gets a fairly configurable cache service plus DDoS (Distributed Denial of Service) protection, an SSL certificate, Always Online, and more. It’s important to review what a service offers before committing to it.
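All three options share the same core behaviour, which can be sketched in a few lines of Python. This is a simplified model and not how Nginx or Varnish are actually implemented: `PageCache`, `backend`, and the `has_session` flag are hypothetical names used only for illustration.

```python
import time
from typing import Callable, Dict, Tuple


class PageCache:
    """Simplified model of a full-page cache sitting in front of an application."""

    def __init__(self, backend: Callable[[str], str], ttl_seconds: float = 120.0):
        self.backend = backend  # the "real" application that renders a page
        self.ttl_seconds = ttl_seconds
        self._pages: Dict[str, Tuple[float, str]] = {}  # path -> (expires_at, html)

    def handle(self, path: str, has_session: bool = False) -> Tuple[str, str]:
        """Return (html, status), where status is "HIT", "MISS", or "BYPASS"."""
        if has_session:
            # Requests tied to a user session always go to the application.
            return self.backend(path), "BYPASS"
        now = time.monotonic()
        entry = self._pages.get(path)
        if entry is not None and entry[0] > now:
            return entry[1], "HIT"
        html = self.backend(path)
        self._pages[path] = (now + self.ttl_seconds, html)
        return html, "MISS"
```

Only the first anonymous request for a path reaches the application. Later ones get the stored HTML until the entry expires.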
Browser Cache
The last stop is the browser itself. It’s possible that the user is re-visiting the same page, or that the same file is used on multiple pages. If a resource was previously obtained, the browser checks whether its copy is still valid. Every resource downloaded from the server can be flagged with Cache-Control headers that tell the browser how long the given file may be reused before it should be refreshed. It’s a good practice to set long expiration times for static resources (JS, CSS, images) and force their refresh by adding cache busting parameters to their URLs when a new version is used on the site.
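Here is one way this could look, sketched in Python. The function names, the `v` query parameter, the list of extensions, and the one-year lifetime are assumptions chosen for the example. Your framework or build tool most likely has its own mechanism for this.

```python
import hashlib
from pathlib import PurePosixPath

# Assumed set of extensions treated as static assets in this example.
STATIC_EXTENSIONS = {".js", ".css", ".png", ".jpg", ".svg"}
ONE_YEAR = 365 * 24 * 60 * 60  # lifetime in seconds


def versioned_url(path: str, content: bytes) -> str:
    """Append a cache busting parameter derived from the file content."""
    digest = hashlib.sha256(content).hexdigest()[:8]
    return f"{path}?v={digest}"


def cache_headers(path: str) -> dict:
    """Pick a Cache-Control header: long-lived for static files, revalidate otherwise."""
    if PurePosixPath(path).suffix in STATIC_EXTENSIONS:
        return {"Cache-Control": f"public, max-age={ONE_YEAR}"}
    return {"Cache-Control": "no-cache"}
```

Because the parameter is derived from the file content, the URL changes only when the file does, so browsers keep their cached copy until a new version is released.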
Common Issues With Website Caching
Web caching is a very delicate thing. It’s essential to get it right, as issues are hard to notice and can lead to very unexpected and unwanted results. Here are a few things to keep in mind:
- It’s super important not to cache pages that are user specific. For example, caching the cart page on an e-commerce site will result in multiple users seeing the same cart contents. Even worse, caching the order confirmation page will show personal information to other users.
- If we use website caching techniques that involve external tools (web server caching, Varnish, third party tools), we need to make sure that the cache is cleared after releasing a new version.
- On sites with high traffic, it’s inadvisable to clear all caches at once, because when that’s done, there’s a sudden spike in requests that go all the way back to the website to be generated and stored again. It may result in the site being unavailable for some or all users. It’s better to plan cache clearing in advance. Maybe do it server by server, or only clear the paths that changed in the given release.
- It’s important to perform rudimentary caching/performance audits after every caching change to make sure nothing got cached that shouldn’t, and that everything that should be cached is still being cached.
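The advice about not clearing everything at once can be sketched like this in Python. The cache here is a plain dictionary keyed by URL path, and `purge_in_batches` is a hypothetical helper. A real setup would call the purge mechanism of whatever caching tool is in use.

```python
from typing import Dict, Iterable, Iterator, List


def purge_in_batches(
    cache: Dict[str, str],
    changed_prefixes: Iterable[str],
    batch_size: int = 100,
) -> Iterator[List[str]]:
    """Remove only entries under the changed path prefixes, one small batch at a time."""
    prefixes = tuple(changed_prefixes)
    matching = [path for path in cache if path.startswith(prefixes)]
    for start in range(0, len(matching), batch_size):
        batch = matching[start:start + batch_size]
        for path in batch:
            cache.pop(path, None)
        # Yield so the caller can pause between batches and let the cache refill.
        yield batch
```

A release script could loop over the batches and wait a few seconds after each one, so the application regenerates pages gradually instead of all at once.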
So that’s it on this complex and broad subject! I’m sure I haven’t covered all of it, but this is a good overview of the subject.
Is there something I forgot to mention? Leave a comment below.