Bugs and website breakdowns: how to protect your project from the most frequent errors


The "fix one thing, break another" scenario is very unpleasant for a site owner. Adding a new feature or redesigning one section can affect the entire project, especially when the code base already has problems. There are several reasons this happens, and we will cover them today.

The Initlab team uses enterprise-grade development tools to improve the quality of building and supporting projects. This lets us catch errors at the earliest stages and fix them in time, before site visitors find them instead of us. Putting out small hotspots is easier than fighting a full-blown fire.

The approach has two broad parts: testing and monitoring. Testing checks changes during development, while monitoring watches the working state of the live site. Together they get the most out of keeping a project in good condition.

In this article, we discuss the most frequent causes of bugs on a site and explain how testing and monitoring can prevent them. We also share our current workflow, which all but eliminates critical problems on projects where bugs are especially costly.

The most frequent site errors and their causes

The most common errors that occur on the site when making edits:

Styling errors

Example. We increased the spacing around one block, and the change was applied to the entire site or to some random page.

Or we changed the font of a product heading, and every heading on the site changed with it. In other words, styling errors are any unplanned visual changes that spoil the site's appearance.

Such errors are unlikely when the site uses modern component-based development and its code base is in order. However, the sites we take on for support are not always recent or built to development standards, so you should be prepared for this kind of trouble.

Logic errors

Example. We added a new feature that unexpectedly affected processes critical to the site: ordering, product search, logging into the customer account.

While you might notice a broken layout by accident, a logic error is not visible at a glance: you have to follow the same path as a customer, press different buttons, try non-standard actions, and so on. Moreover, serious logic errors cost the client revenue and reputation.

When these errors can occur

When making changes or migrating. Ideally, every edit to the site goes through a test copy first.

First, a test copy can be painlessly reverted if the client is unhappy with how the change looks or works.

Second, the edits live in a controlled environment: even if they introduce logic or styling errors, the live site is unaffected, and we can safely fix everything before releasing it to production.

Even these precautions will not guarantee a bug-free site, but they at least minimise the damage compared with editing the live site directly.

When working alongside another team. On some projects we work together with the client's own development team. This is normal practice, but it can cause problems if the two teams' work is not tracked.

One team makes a change or introduces a new feature, and the site goes down. This is not a question of professionalism: these are routine mishaps that can happen to anyone. For example, when two teams work on the same functionality, branches can be merged incorrectly, producing a logic or styling error.

How we protect against errors: the tools we use

To break the "fix one thing, break another" cycle, we rely on test coverage and rules for organising teamwork, which together keep the probability of errors on the site to a minimum.

Tests

Tests can help prevent the problems we wrote about above:

visual regression tests help to prevent styling errors;
functional tests help to prevent logic errors.

The difference is simple. Take a button: if it shifts out of place, a visual test will catch it; if it cannot be clicked, you need a functional test.

How visual regression tests work. They notify us of changes in the site's appearance. The mechanism is simple: we take "reference" screenshots of pages so that the test can compare them with how the site looks after changes have been made. When it finds differences, the system notifies the developer that the "before" and "after" results do not match. This happens when changes are uploaded to the test copy of the site.
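The comparison step described above can be illustrated with a minimal, library-free sketch. It assumes the screenshots have already been decoded into equally sized grids of RGB tuples; the function names, the per-channel `tolerance` and the `max_diff_ratio` threshold are illustrative choices, not Playwright's internals (Playwright's screenshot assertions use their own algorithm and expose their own tolerance options, such as `maxDiffPixelRatio`).

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (red, green, blue), each 0-255
Image = List[List[Pixel]]     # rows of pixels


def pixels_differ(a: Pixel, b: Pixel, tolerance: int) -> bool:
    """Treat two pixels as different if any channel differs by more than `tolerance`."""
    return any(abs(ca - cb) > tolerance for ca, cb in zip(a, b))


def compare_screenshots(reference: Image, current: Image,
                        tolerance: int = 10,
                        max_diff_ratio: float = 0.001) -> Tuple[bool, float]:
    """Return (passed, diff_ratio) for a reference screenshot and a new one.

    A size mismatch always fails, because the page dimensions themselves changed.
    """
    if len(reference) != len(current) or any(
        len(r) != len(c) for r, c in zip(reference, current)
    ):
        return False, 1.0
    total = sum(len(row) for row in reference)
    if total == 0:
        return True, 0.0
    # Count pixels that changed beyond the tolerance (True counts as 1).
    diff = sum(
        pixels_differ(p, q, tolerance)
        for ref_row, cur_row in zip(reference, current)
        for p, q in zip(ref_row, cur_row)
    )
    ratio = diff / total
    return ratio <= max_diff_ratio, ratio
```

The two knobs pull in opposite directions: the per-channel tolerance absorbs tiny rendering noise, while a small `max_diff_ratio` still flags a block that has shifted, since a shift changes many pixels at once.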

How functional tests work. This type of test runs a specific scenario on the site to check that it completes. The test follows the same path as a customer: it clicks buttons, sends requests and checks the result of each interaction. If something that should happen does not, the test notifies the developer.

Tools. We use Playwright for functional and visual regression tests.

During visual regression tests, Playwright compares the current page against reference screenshots taken beforehand, when the page was in its normal, "ideal" state. If it finds any differences, it sends us a report on them.

In functional tests, Playwright checks that a user who follows specified links and clicks the right buttons lands on the expected pages and sees the expected text. The virtual browser runs through a complete scenario, such as a purchase: Playwright adds a product to the cart, checks that the order buttons are clickable, confirms by name that the product is in the cart, proceeds to checkout, enters the test customer's details, selects a delivery and payment method, and places the order. If the scenario does not complete within the time set in the test, or any step fails, the system notifies the developer.

So while visual tests compare screenshots, functional tests check for the expected URLs and text on the page.
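The pass/fail logic of such a scenario can be sketched without a browser. In this hypothetical sketch, each `Step` is a function that returns the resulting page as a `(url, text)` pair. The `run_scenario` runner checks the expected URL and text after every step, stops at the first failure and fails the run if it exceeds a time budget. In a real Playwright test the steps would drive the browser; here they are stand-ins.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

Page = Tuple[str, str]  # (current URL, visible text)


@dataclass
class Step:
    name: str
    action: Callable[[], Page]  # performs the step and returns the resulting page
    expected_url: str           # substring the URL must contain
    expected_text: str          # text that must appear on the page


@dataclass
class Result:
    passed: bool
    failed_step: Optional[str] = None
    reason: str = ""


def run_scenario(steps: List[Step], time_budget_s: float) -> Result:
    """Run steps in order; report the first failing step or a timeout."""
    started = time.monotonic()
    for step in steps:
        try:
            url, text = step.action()
        except Exception as exc:  # any error inside a step fails the scenario
            return Result(False, step.name, f"error: {exc}")
        if step.expected_url not in url:
            return Result(False, step.name, f"unexpected URL {url}")
        if step.expected_text not in text:
            return Result(False, step.name, f"missing text {step.expected_text!r}")
        if time.monotonic() - started > time_budget_s:
            return Result(False, step.name, "time budget exceeded")
    return Result(True)
```

Reporting which step failed, rather than a bare failure, lets the developer go straight to the broken part of the purchase flow.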

We previously used BackstopJS for testing, but Playwright won out: it is faster and has more features. In essence, it drives a headless browser on a server that can open site pages, take screenshots and run scripts in place of a real person.

Tests are written separately for each project

Because the way tests work sounds simple, it may seem that there is some off-the-shelf software you can point at a site and it will start finding errors.

Drupal core, for example, ships with tests that check basic functionality and security. But they are designed to test the core architecture itself, which development standards tell us not to modify anyway. Since there is little point in looking for errors where there are probably none, we rarely run Drupal core tests. The other CMSs and frameworks we develop on come with no test coverage from their developers at all.

We write visual regression and functional tests separately for each project.

The reason is simple: every site has its own interface and interaction mechanics, so tests cannot be universal. They share a general structure, but they cannot be written once and run on every project.

We always put every change through a test copy so that critical errors never reach the live site. We work on a local copy and move the result to the test copy to check that the changes work. We compare the state of the site "before" and "after" the changes, which helps catch both critical and minor errors that could easily be overlooked. This improves stability and saves the time we would otherwise spend fixing problems introduced by editing the live site.

Linters and static analysers

Linters track syntax errors, logic errors and coding-standard violations in the code base as changes are migrated. They are hard to do without when two teams work on a project at the same time.

Static analysers check code logic more deeply. They analyse the code base and automatically find and flag errors. This reduces the time spent on code review and helps monitor the quality of each team's work. Collaboration conflicts drop significantly, because errors are pointed out by an impartial robot rather than by colleagues. In general, these tools are needed to comply with the development standards of any CMS (Drupal, Bitrix, etc.), even when only one team works on the project.

How linters check code style. They flag formatting problems, that is, places where the code does not look the way it should. It is a check of the code's appearance.

Tools. PHP_CodeSniffer checks files for violations of a defined coding standard and helps keep code clean and consistent.

How static analysers check code logic. They detect errors that may make a script behave incorrectly.

Tools. PHPStan finds bugs in code without running it.

Linters and static analysers run automatically when changes are migrated to the test copy and immediately show what needs fixing. They can also be run locally before pushing changes to the test copy.
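Running both tools locally before a push can be wrapped in a small script. This is a minimal sketch. It assumes a Composer project with PHP_CodeSniffer, the Drupal coding standard (from the `drupal/coder` package) and PHPStan installed under `vendor/bin`. The helper names `php_files`, `build_commands` and `run_checks` are hypothetical. The command runner is injectable, so the logic can be exercised without PHP installed.

```python
import subprocess
from typing import Callable, List, Sequence

# Extensions Drupal code commonly uses; adjust for other CMSs.
PHP_EXTENSIONS = (".php", ".module", ".inc", ".install", ".theme")


def php_files(paths: Sequence[str]) -> List[str]:
    """Keep only the files the PHP tools should check."""
    return [p for p in paths if p.endswith(PHP_EXTENSIONS)]


def build_commands(files: Sequence[str]) -> List[List[str]]:
    """Commands for a coding-standard check and a static analysis pass."""
    if not files:
        return []
    return [
        ["vendor/bin/phpcs", "--standard=Drupal",
         "--extensions=php,module,inc,install,theme", *files],
        ["vendor/bin/phpstan", "analyse", "--no-progress", *files],
    ]


def run_checks(paths: Sequence[str],
               runner: Callable[..., subprocess.CompletedProcess] = subprocess.run) -> bool:
    """Run every command (even after a failure); True only if all exit with code 0."""
    ok = True
    for cmd in build_commands(php_files(paths)):
        result = runner(cmd)
        ok = ok and result.returncode == 0
    return ok
```

As a Git pre-push hook, a script like this could feed the output of `git diff --name-only` into `run_checks` and abort the push when it returns `False`.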

Linters and static analysers are useful not only for ongoing work but also for auditing the code base of a finished site. This gives a picture of the state of its code and a rough idea of the scope of work if static analysis errors or coding-standard violations are to be fixed.
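For an audit, it helps to turn raw findings into a rough scope estimate. The sketch below assumes the structure of PHP_CodeSniffer's JSON report (`phpcs --report=json`). In that report, each file entry carries a list of messages with a `source` (the sniff that fired) and a `fixable` flag. The `summarize_phpcs_report` helper and its grouping are our own illustration.

```python
import json
from collections import Counter
from typing import Dict


def summarize_phpcs_report(report_json: str, top: int = 5) -> Dict[str, object]:
    """Aggregate a PHP_CodeSniffer JSON report into audit-friendly numbers."""
    report = json.loads(report_json)
    by_sniff: Counter = Counter()
    fixable = 0
    files_with_issues = 0
    for file_data in report.get("files", {}).values():
        messages = file_data.get("messages", [])
        if messages:
            files_with_issues += 1
        for msg in messages:
            by_sniff[msg.get("source", "unknown")] += 1
            if msg.get("fixable"):
                fixable += 1
    total = sum(by_sniff.values())
    return {
        "total": total,
        "fixable": fixable,            # candidates for phpcbf's automatic fixes
        "manual": total - fixable,     # need a developer's attention
        "files_with_issues": files_with_issues,
        "top_sniffs": by_sniff.most_common(top),
    }
```

Separating fixable from manual findings matters for the estimate, since the fixable share is largely mechanical work.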

We're working on improving the tools

Our team likes to get involved in improving the tools we use. It is ironic to mention bugs in programs designed to catch bugs in an article about bugs, but the point is that we care about code quality at every level.

Recently our colleague Andriy Tymchuk noticed a problem with PHPStan validation on one of our support projects: CI always installed the latest version, and with it an error unrelated to the project code appeared. Andriy filed a detailed report, and the problem was fixed in the next release.

This is how we try to contribute to the development of tools for code verification and improving the quality of development.

Monitoring

So far we have talked about testing changes on a test copy. Now let's look at how to track problems on the live site that users see.

Testing cannot guarantee a complete absence of errors, for two reasons:

1) tests cannot cover 100% of site scenarios: full coverage is hard to set up and expensive to maintain;

2) the site can be brought down by external causes: an outage of a third-party service, server problems, hacking, a virus, etc.

So monitoring is needed to protect the live version of the site (prod).

Monitoring is the automatic, periodic checking of the live site for current bugs and for potential problems that threaten its operation.

How monitoring works. We have assembled our own toolkit of checks that apply to most sites and cover their technical foundation. It is a set of scripts that we consider standard, covering the following:

Monitoring of SSL certificate and domain expiry dates. We often see site owners forget to renew domain names and SSL certificates. Our monitoring warns about expiry in advance so the site does not go offline.
Monitoring of security updates. Every good CMS regularly releases updates that close vulnerabilities and fix bugs from previous versions, improving site security. Our monitoring tracks when such updates are released and reports them.
Monitoring of site availability. If the site goes down for any reason, monitoring reports it, and our job is to find the cause and restore access as quickly as possible.
Monitoring of changes to the .htaccess and robots.txt files. These files ship with the base CMS installation, and sometimes they get reset to their default state. Yet robots.txt holds a lot of SEO work: what is hidden from indexing, what is open to it, directives for search engines, and so on. If the file is reset, unwanted pages can get indexed. Monitoring tracks changes to these files.
Monitoring of server resources (CPU, memory, HDD). When a site is under constant development, free disk space gradually shrinks, but few people watch it. At some point the disk fills up and the site goes down. To prevent this, our monitoring alerts when disk usage reaches 80%, which tells us we need to either free up space or move to a larger plan. This monitoring also alerts on increased load. During a surge of visitors the server may not cope and the site starts to hang. The response depends on the cause: a DDoS attack means the site needs protection, while an expected growth in audience or a resource-intensive update from the developers means the site needs optimising, and so on. In any case, if the site starts to stall we should be the first to know, and monitoring makes sure of that.
Monitoring of the email queue and DNS blacklists. If a server lands on a blacklist, receiving mail servers start rejecting email sent from it as spam. As a result, order notifications or password recovery emails reach no one. Monitoring detects stuck mail so we can fix the problem.
Antivirus scanning of files. This one needs no special explanation.
S.M.A.R.T. monitoring of server disk drives. Almost all drives made in recent years support S.M.A.R.T. (Self-Monitoring, Analysis and Reporting Technology). Each drive reports health attributes needed for stable operation; if some of them cross a critical threshold, it is time to replace the drive, and monitoring will notify us.
Google PageSpeed monitoring. Ongoing work on the site, increased traffic or changes to Google's algorithms can lower your site's speed score. Our monitoring tracks changes in the key metrics, as we described in the second part of a separate article.
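Several of these checks reduce to simple comparisons. The sketch below covers four of them with the Python standard library only: certificate expiry, disk usage against the 80% threshold, changes to a file such as robots.txt, and a DNS blacklist lookup. All function names, the DNSBL `zone` parameter and the other thresholds are illustrative assumptions, not our production scripts. Domain expiry, security updates and S.M.A.R.T. need external data sources and are left out.

```python
import hashlib
import shutil
import socket
import ssl
from datetime import datetime, timezone
from pathlib import Path
from typing import Callable, Optional


def cert_days_left(not_after: str, now: Optional[datetime] = None) -> int:
    """Days until a certificate's notAfter date, e.g. 'Jun  1 12:00:00 2030 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    now_ts = (now or datetime.now(timezone.utc)).timestamp()
    return int((expires - now_ts) // 86400)


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Read notAfter from the certificate a live server presents (needs network)."""
    ctx = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with ctx.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def disk_usage_alert(path: str = "/", threshold: float = 0.8) -> bool:
    """True when used space on the filesystem holding `path` reaches the threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total >= threshold


def file_fingerprint(path: Path) -> str:
    """SHA-256 of a file such as robots.txt or .htaccess."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def file_changed(path: Path, baseline: str) -> bool:
    """Compare the current fingerprint with one stored when the file was known good."""
    return file_fingerprint(path) != baseline


def dnsbl_query_name(ipv4: str, zone: str) -> str:
    """DNSBLs are queried with the IPv4 octets reversed: 1.2.3.4 -> 4.3.2.1.<zone>."""
    return ".".join(reversed(ipv4.split("."))) + "." + zone


def is_blacklisted(ipv4: str, zone: str,
                   resolve: Callable[[str], str] = socket.gethostbyname) -> bool:
    """A listed address resolves (typically to 127.0.0.x); an unlisted one does not."""
    try:
        resolve(dnsbl_query_name(ipv4, zone))
        return True
    except socket.gaierror:
        return False
```

On a schedule, `cert_days_left(fetch_not_after("example.com"))` could raise an alert once the result drops below a chosen number of days, well before visitors see a browser warning.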

Application monitoring: checking the site's key operations

Experience has taught us to prevent disasters rather than run around in a panic looking for the cause once everything is down. We compiled a list of problems typical of most sites and built functionality that protects against the most common threats.

This monitoring system is our gold standard for keeping projects running continuously. The checks above are included in all our server support plans and site support services; they cover parameters common to every project.

Monitoring business logic specific to a particular site calls for application monitoring, which can be configured for any scenario on the project.

Say we have a site whose key function is ordering. For the owner, it matters that nothing breaks during checkout. And it can break for reasons other than developer error: the server runs out of space or the payment system fails, and the site can no longer do its most important job, which is selling.

You could place test orders yourself three times a day, or hire someone to do it, but the best option is monitoring: a robot that checks that ordering works as many times a day as needed.

With application monitoring, we get an immediate alert if the layout breaks, a button stops working, or any other critical event happens.
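A robot that repeats a check several times a day also needs a rule for when to raise the alarm. One common approach is sketched here as an assumption, not as a description of our system. It alerts only after a number of consecutive failures, so a single network blip does not trigger an alarm, and sends one recovery message once the check passes again. `CheckState` and its threshold are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckState:
    """Tracks one periodic check, e.g. a hypothetical 'test order' scenario."""
    fail_threshold: int = 2        # consecutive failures before alerting
    consecutive_failures: int = 0
    alerting: bool = False

    def record(self, passed: bool) -> Optional[str]:
        """Feed one check result; return 'ALERT', 'RECOVERED' or None."""
        if passed:
            self.consecutive_failures = 0
            if self.alerting:
                self.alerting = False
                return "RECOVERED"
            return None
        self.consecutive_failures += 1
        if not self.alerting and self.consecutive_failures >= self.fail_threshold:
            self.alerting = True
            return "ALERT"
        return None  # either below the threshold or already alerted
```

The threshold trades speed for noise: a threshold of 2 delays the alert by one check interval, so a business-critical check such as ordering may warrant running more often.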

Application monitoring can check:

website;
web service;
integration with an external service.

Developing project-specific monitoring is not easy and requires time and investment. If such checks are critical for the project, it is worth considering. If the project has no operations that critical, our standard set of checks is at your service.

The ideal chain of work on a site

We have put together a chain of work under which a site is very unlikely to be threatened by errors and bugs. In our experience this scheme is close to ideal, though most projects use a somewhat simplified form of it, depending on their specifics.

Sandbox → Test copy → Live site → Monitoring

Sandbox - a draft copy of the site where developers implement the first versions of edits. At this stage you can already check whether the code passes the linters.

Test copy - a version of the site for testing changes and demonstrating edits to the client. When changes move to this stage, the linters run automatically, in case we did not check everything in the sandbox ourselves. Here the tests also confirm that the changes work.

Live site - the public production version, visible to the whole internet. Only what has been approved by the client, the developers, the tests and the linters gets here.

Monitoring - the layer that keeps the site from going down. It automatically detects when something goes wrong, helping prevent downtime and other threats to the project.
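The gates between stages can be expressed as a simple rule: a change moves forward only when every check required at its current stage has passed. The stage names follow the chain above, with the last stage called "production". Which checks guard which stage is an illustrative assumption drawn from the description. A real pipeline would encode this in its CI configuration rather than in a script like this.

```python
from typing import Dict, List, Set

# Stages of the chain, in order; monitoring watches the last one.
STAGES: List[str] = ["sandbox", "test copy", "production"]

# Checks that must pass before a change may leave each stage (illustrative).
REQUIRED: Dict[str, Set[str]] = {
    "sandbox": {"linters"},
    "test copy": {"linters", "static analysis", "visual tests",
                  "functional tests", "developer review", "client approval"},
}


def missing_checks(current: str, passed: Set[str]) -> Set[str]:
    """Checks that still block a change from leaving `current`."""
    return REQUIRED.get(current, set()) - passed


def next_stage(current: str, passed: Set[str]) -> str:
    """Return the stage a change may move to, or `current` if it is blocked."""
    idx = STAGES.index(current)
    if idx == len(STAGES) - 1:
        return current  # already in production; monitoring takes over
    return current if missing_checks(current, passed) else STAGES[idx + 1]
```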

This chain also makes two teams working together far more efficient. If both sides take every precaution and trust machines to find what the human eye can miss, the site is as secure as it can be.

Extra protection for the biggest changes

When a project plans to ship some very complex functionality and there is reason to suspect that no test or monitoring will head off the problems ahead, there is an extra safety buffer: canary releases.

A canary release rolls out new functionality to only a portion of users, so you can watch how monitoring, performance metrics and tests respond. If the results are poor, you can easily roll everything back and send it for rework.
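Choosing which users see the canary is often done by hashing a stable user identifier into a bucket. Each user then consistently lands on the same side while the rollout percentage grows. This is a generic sketch of that technique, not a description of a specific setup. The `in_canary` function and the `salt` value are hypothetical.

```python
import hashlib


def in_canary(user_id: str, percent: float, salt: str = "new-checkout") -> bool:
    """Deterministically place `percent`% of users in the canary group.

    Hashing salt + user_id gives every user a stable bucket in [0, 10000),
    so the same user sees the same version on every request, and raising
    `percent` only adds users to the canary without moving anyone out.
    """
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    return bucket < percent * 100
```

Rolling back then amounts to setting `percent` to 0, and a fresh `salt` per release reshuffles who goes first.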

This is part of the ideal scheme and is worth applying when both the cost and the risk of an error are high.

Even a site with a bad codebase can be kept in order

Sometimes clients come to us with sites in a deplorable state. Putting up new wallpaper in a condemned house is not the best idea, but sometimes the client simply lacks the budget or time to put the site in order, while updating the catalogue or introducing new features is vital for the business.

Even on a site with bad code patched together by several previous teams, we can make sure our new edits do not break it, or at least notice immediately if they do and start fixing it.

Linters and static analysers might find, say, a thousand errors across a site's whole code base. Fixing them all at once would take hundreds of person-hours. Instead, we can run linter checks only on the code that a specific task changes. That way our new code is always of high quality, and we can also fix its "neighbourhood": nearby code fragments related to the task at hand. As a result, we not only ship bug-free edits but also gradually improve the quality of the sites we support.

With seven nannies, the site stays bug-free

This many layers of safety net are critical for sites with high traffic, a reputation to protect and a large volume of orders.

At Initlab, we believe error tracking should be automated as much as possible, leaving people to step in only to fix what it finds. Let questions like "Does the shopping cart work?" or "Has the catalogue layout shifted?" be asked by well-established tools, not by the customer.

If you need quality support for your project, contact us.

Source: https://medium.com/@Initlab/bugs-and-website-breakdowns-how-to-protect-your-project-from-the-most-frequent-errors-f8a9e717ecb3
