We often receive site optimization requests after SEO audits. In most cases, we share one and the same list of recommendations that are aimed at making a Drupal-powered website more crawler-friendly. The exceptions are sites that are distinctly different (unique) under the hood.
However, there is one point in that list of recommendations that is typically ignored by SEO companies doing the optimization. It is the one that tells how to handle server responses (HTTP status codes) to search engine crawler (indexing bot) requests. For most SEO specialists, this type of server-side stuff is kind of blurry, so they simply discard the recommendation as not worth the time to act on.
Also, some servers/hosting environments are set up in a way that it is the web server itself that controls status codes sent to browsers/crawlers. This means you would not have the problem discussed in this article as long as you enjoy such an environment, but it may well emerge once you move your site to another server or your hosting provider changes configs around your account.
All things considered, correct server responses can boost site indexing significantly and save some bandwidth by keeping indexing bots away from pages that have not changed since their last visit and telling these bots exactly where to find the new content.
304 Not Modified HTTP status code and caching
First, let us see what this HTTP status code means and what its function is.
When a user navigates to a site, the browser caches the visited pages locally, on the user’s computer, as long as the site allows that and the user is not browsing in a private (incognito) mode. Directed to the same site the next time, the browser asks the site if the pages it cached during the previous visit are still up to date, and if so, skips downloading the content from the web and takes it from the local cache instead, thus speeding up delivery and saving bandwidth.
There are different caching mechanisms browsers and sites rely on today, but we are interested in those that use a timestamp. Among the header lines the server (site) sends to the browser (crawler) in response to a request is the Last-Modified line, which, obviously, tells when the page in question was last modified. Visiting the page for the first time, the browser records this information, and when it has to open the same page the next time, it extends its request with the If-Modified-Since line that cites the timestamp gathered during the previous visit.
We are interested in the caching mechanism that uses a timestamp. When a page is visited, the browser and the server exchange headers. The headers the browser sends to the server are called request headers, and the headers the server sends back are called response headers.
If there were no changes to the page since the previous visit, the server simply tells the browser so, and the browser takes the content out of its cache. A search engine crawler skips the page entirely. If the page was updated, the server sends the new content along with the new Last-Modified timestamp.
The HTTP status code the server sends in the first, “no changes” case is 304 Not Modified. This is the one that can help you speed up indexing by search engine bots.
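The exchange described above boils down to one comparison on the server side. Here is a minimal Python sketch of it, assuming the page's last change time is known; it is not any particular server's implementation, and the function name `respond` and the sample date are made up for illustration.

```python
from datetime import datetime, timezone
from email.utils import format_datetime, parsedate_to_datetime


def respond(last_modified, if_modified_since=None):
    """Return (status, headers) for a GET request.

    `last_modified` is an aware UTC datetime; `if_modified_since` is the raw
    request header value, or None if the header was not sent.
    """
    # HTTP dates have one-second resolution, so drop the microseconds.
    last_modified = last_modified.replace(microsecond=0)
    headers = {"Last-Modified": format_datetime(last_modified, usegmt=True)}
    if if_modified_since is not None:
        try:
            since = parsedate_to_datetime(if_modified_since)
        except (TypeError, ValueError):
            since = None  # an unparsable date is ignored
        if since is not None and since.tzinfo is not None:
            if last_modified <= since:
                return 304, headers  # nothing new: the client reuses its copy
    return 200, headers  # first visit or changed page: send the full body


# Hypothetical page, last changed on 1 June 2019 at 12:00 UTC.
changed = datetime(2019, 6, 1, 12, 0, tzinfo=timezone.utc)
first_status, first_headers = respond(changed)
repeat_status, _ = respond(changed, first_headers["Last-Modified"])
```

The first request carries no If-Modified-Since and gets a 200 with a Last-Modified header; the repeat request echoes that value back and gets a 304.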
ETag caching and timestamp validity
The ETag caching mechanism was introduced with HTTP/1.1 in the late 1990s. However, timestamp-based cache relevance verification is still important.
The thing is, search engines do not store ETag and Last-Modified data, but they do record when their crawlers last visited each page. So, coming to a site, a search engine crawler fills the If-Modified-Since line of the request with the timestamp of its last visit. And if the server responds that no changes were made, the crawler skips the page and continues searching for updates and new content.
Boosting indexing with correct server HTTP status codes sent as responses to search engine crawlers: an example
Let’s say we have an online store with 10000 products. In a single go, a search engine bot indexes 1000 products tops (this is not a real-life figure, same as all the other figures in this example). The bot visits the site every other day, and all pages have already been indexed.
Since the last go of the crawler we updated 2000 products (descriptions and prices) and added 500 products to the store.
If the server does not handle If-Modified-Since correctly and never answers with a 304, the bot will have to go through all 10,500 pages of the site before the updates and the new products are indexed, which will take it 11 visits and 22 days. In other words, your new products and new descriptions/prices will be shown on the SERPs correctly only 22 days after you made the changes.
However, if the server returns a correct 304 Not Modified to the crawler, the latter will index all the new stuff in as little as 6 days, which is a major improvement and, in the real-life online store case, means the difference between money lost and money earned.
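The arithmetic behind these two figures can be checked with a few lines of Python. The numbers are the illustrative ones from this example, not real crawler budgets, and the helper name `days_to_index` is made up.

```python
import math

PAGES_PER_VISIT = 1000   # pages the bot indexes in one go (illustrative)
DAYS_BETWEEN_VISITS = 2  # the bot comes every other day


def days_to_index(pages_to_fetch):
    """Return (visits, days) the bot needs to get through the given pages."""
    visits = math.ceil(pages_to_fetch / PAGES_PER_VISIT)
    return visits, visits * DAYS_BETWEEN_VISITS


existing, updated, added = 10_000, 2_000, 500

# No 304 responses: every page, old and new, has to be fetched again.
without_304 = days_to_index(existing + added)  # (11, 22)

# Correct 304 responses: only updated and new pages cost a full fetch.
with_304 = days_to_index(updated + added)      # (3, 6)
```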
Server responses generated by Drupal
Drupal is notorious for the status codes it sends in response to requests from browsers and crawlers, and 304 Not Modified is one of them. Request headers need to meet three criteria for Drupal to return a 304:
- the request includes both If-None-Match and If-Modified-Since;
- the If-Modified-Since value matches the Last-Modified value;
- the If-None-Match value is the same as the ETag value.
The above has some discrepancies with the official specification:
- RFC 7232 (Section 3.3) states that the recipient (server) must ignore If-Modified-Since if the request contains If-None-Match;
- RFC 2616 (Section 13.3.4) says that If-Modified-Since should not result in a 304 Not Modified if any of the other conditional headers in the request fails validation.
Thus, given the second point, If-Modified-Since is validated only when If-None-Match is either absent or valid, and If-None-Match is validated before the If-Modified-Since header.
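The evaluation order the specification asks for can be sketched in Python as follows. This is a simplified illustration, not Drupal's code and not a full RFC implementation: it assumes a single ETag compared as an exact string (no lists, no `*`, no weak comparison) and timestamps already parsed into Unix seconds. The function name and the sample values are hypothetical.

```python
def compliant_304(if_none_match, if_modified_since, etag, last_modified):
    """Decide whether to answer 304, checking If-None-Match first.

    Timestamps are Unix seconds; a header that was not sent is None.
    """
    if if_none_match is not None:
        # If-None-Match is present: it decides, If-Modified-Since is ignored.
        return if_none_match == etag
    if if_modified_since is not None:
        # No If-None-Match: fall back to the timestamp. The page counts as
        # unchanged if it was last modified at or before the given time.
        return last_modified <= if_modified_since
    return False  # unconditional request: always send the full page


PAGE_ETAG = '"abc123"'            # hypothetical ETag
PAGE_CHANGED = 1559390400         # 2019-06-01 12:00:00 UTC
LAST_CRAWL = PAGE_CHANGED + 2 * 86400

# A crawler sends only the time of its last visit, two days after the change.
crawler_unchanged = compliant_304(None, LAST_CRAWL, PAGE_ETAG, PAGE_CHANGED)
# The page is edited one day after that visit: the crawler must get it again.
crawler_after_edit = compliant_304(None, LAST_CRAWL, PAGE_ETAG, LAST_CRAWL + 86400)
```

With this order, a crawler that sends only If-Modified-Since gets a 304 for an unchanged page and the full page after an edit.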
But the biggest problem with Drupal and headers originates with the third criterion, “the If-None-Match value is the same as the ETag value”. As mentioned above, browsers store the validators they receive (the Last-Modified timestamp and the ETag) and send them back to the server for verification on every subsequent visit. Search engine crawlers store neither: they submit only the timestamp of their last visit, and in 99.99% of cases it does not match the date the page was last modified. This means that a Drupal-powered site will not respond with a 304 to a crawler’s request.
Both Drupal 7 and Drupal 8 have this problem.
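To see why a crawler never gets a 304 under the three criteria listed earlier, here is a Python sketch of that strict check. It illustrates the logic as described in this article and is not Drupal's actual (PHP) code; the function name and the sample values are hypothetical, and timestamps are assumed to be already parsed into Unix seconds.

```python
def strict_304(if_none_match, if_modified_since, etag, last_modified):
    """True only if both validators are present and both match exactly.

    Timestamps are Unix seconds; a header that was not sent is None.
    """
    if if_none_match is None or if_modified_since is None:
        return False  # criterion 1: both headers are required
    return (
        if_modified_since == last_modified  # criterion 2: exact timestamp
        and if_none_match == etag           # criterion 3: exact ETag
    )


ETAG = '"abc123"'              # hypothetical ETag
LAST_MODIFIED = 1559390400     # 2019-06-01 12:00:00 UTC

# A browser echoes back both values it received on its previous visit.
browser = strict_304(ETAG, LAST_MODIFIED, ETAG, LAST_MODIFIED)
# A crawler sends no ETag, only the time of its own last visit.
crawler = strict_304(None, LAST_MODIFIED + 2 * 86400, ETAG, LAST_MODIFIED)
```

The browser request passes all three checks, while the crawler request fails the first one even though the page has not changed since the crawler's visit.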
Drupal browser/bot response problem solved: indexing boosting patches from Initlab
Long story short, we have developed patches for versions 7 and 8 that address the issue.
The one for Drupal 7 is here: https://www.drupal.org/project/drupal/issues/3055984#comment-13213469
Drupal 8 patch is here: https://www.drupal.org/project/drupal/issues/2259489#comment-13134143
Drupal 8 has dependency injection implemented, which is of great help here, since it allows modifying pretty much anything through service class redefinition. So our Last Modified since header fix module relies on that capability to boost your site indexing until the patch is accepted by the Drupal 8 core developers.
For Drupal 7 users the news is not as good: we only have a core patch for them, and there is no other way to fix the problem in that version.
So, there you have it. Another opportunity to boost your site’s performance, get noticed and indexed faster, all without much pain. Feel free to drop us a line should you need further clarifications or want us to set everything up for you.