A Google SRE once said, "If a human operator needs to touch your system during normal operations, you have a bug. The definition of normal changes as your systems grow." As discussed earlier in the course, a key pillar of DevOps philosophy for Google is leveraging tooling and automation. Prioritizing this allows your engineering teams to focus on development work instead of operational work. SREs do this by eliminating that operational work, which we call toil.

So what exactly do we mean by toil? Toil is work directly tied to a service that is manual, repetitive, automatable, tactical, without enduring value, or that scales linearly as the service grows. Toil isn't just administrative work or work that you don't want to do, because that kind of work can still be very important. Different types of people like different types of work, and administrative work, such as team meetings or HR paperwork, can be necessary overhead. This type of work also isn't tied to running a production service.

By eliminating toil, SREs can focus the majority of their time on work that will either reduce future toil or add service features, which generally focus on improving reliability, performance, or utilization. Until now, you've learned a lot about the reliability part of Site Reliability Engineering. Reducing toil and scaling up services is the engineering part of Site Reliability Engineering. Engineering work is what enables an SRE team to scale up and to manage services more efficiently than either a peer dev team or a peer ops team. By keeping your SREs working on toil less than 50% of the time, you're also distinguishing the SRE role as clearly different from a typical operations role.

So why is toil really a problem? Toil can create multiple issues in your organization. Toil can lead to career stagnation. Individual team members' career progress will slow down or stop if they spend too little time on projects.
While it's true that Google rewards undesirable work when it's inevitable and has a large positive impact, you can't make a career out of it.

Toil lowers morale. People have different levels of tolerance for how much toil they can do, but everyone has a limit. Too much toil leads to burnout, boredom, and discontent. It creates confusion. At Google, we work hard to ensure that everyone who works in or with the SRE organization understands that we are an engineering organization. Individuals or teams within SRE that engage too much in toil undermine the clarity of that communication and confuse people about the SRE role. Toil slows progress. Excessive toil makes a team less productive. A product's feature velocity will slow if the SRE team is too busy with manual and reactive work to roll out new features promptly.

It sets a precedent. If you're too willing to take on toil, your developer counterparts will have incentives to load you down with even more toil, sometimes shifting operational tasks that should rightfully be performed by developers to SRE. Other teams may also start expecting SREs to take on such work, further perpetuating the issue. It promotes attrition. Even if you're not personally unhappy with toil, your current or future teammates might like it much less. If you build too much toil into your team's procedures, you motivate the team's best engineers to start looking elsewhere for a more rewarding job. Lastly, toil causes a breach of faith. New hires or transfers who joined SRE with the promise of project work will feel cheated, which is bad for morale.

Even though a lot of toil is unhealthy when running a service, there are some positives to having a little bit of toil. Toil doesn't make everyone unhappy all the time, especially in small amounts. Predictable and repetitive tasks can be quite calming. They produce a sense of accomplishment and quick wins. They can be low-risk, low-stress activities.
Some people gravitate towards tasks involving toil and may even enjoy that type of work, but it should never be the primary work for an SRE. Toil isn't always and invariably bad, and everyone needs to be absolutely clear that some amount of toil is unavoidable in the SRE role and in almost any engineering role. Toil becomes toxic when experienced in large quantities. You should be concerned if your teams complain about being burdened with too much toil.

Now, you may be wondering how you can balance toil with project work. Toil must be a bounded part of the SRE role. If SREs don't have time for anything else, they're doing the traditional sysadmin tasks that DevOps advocates against. If you set a threshold for toil at 50% for SREs, they are free to do project work that supports your engineering and reliability goals the rest of the time. Priority project work for SREs is work that impacts or might impact the team's SLOs. After that, their focus should be on addressing the work that causes SREs toil.

A key aspect of eliminating toil is automation. SREs strive to automate this year's job away, that is, to determine what to automate, under what conditions, and how to automate it. Automation in a production service provides several kinds of value. First, it can provide consistency. Any action performed by a human is prone to error, especially the same action performed hundreds of times. A person isn't likely to be as consistent as a machine. Lack of consistency leads to mistakes, oversights, issues with data quality, and even reliability problems. Automation remedies this by creating consistency.

Next, automated systems provide a platform that can be extended and applied to more systems. A platform also provides a way to centralize mistakes so that a bug is fixed once in one place. With humans, you'd have to communicate that fix across multiple people, and there's more room for error and for the bug to be reintroduced.
Additionally, a platform can execute tasks faster and more accurately than humans, and can also export performance metrics more easily than a manual system. If automation runs regularly and successfully enough, any common faults can be resolved more quickly. You can then spend your time on other tasks instead, which promotes increased developer velocity since you don't have to spend time either preventing a problem or, more commonly, cleaning up after it. A problem discovered later in the product life cycle is more expensive to fix. Generally, problems that occur in actual production are the most expensive to fix, both in terms of time and money. This means that an automated system looking for problems as soon as they arise has a good chance of lowering the total cost of the system.

Another value of automation is faster action. Machines react faster than humans. For large production services, automating is necessary for survival, since the amount of work required is usually beyond a manageable manual threshold. Finally, automation saves time. Even though it may be a significant time investment to code a particular automated process, once done, there is no need for continual training of humans in maintenance of the process. Once a task is automated, anyone can execute it.

If automation is not common in your organization, you're likely to see some resistance to change from your teams as you start to introduce it or any SRE practices. In the next video, you'll learn about the psychology of change and resistance to it, and how you can help support your teams through SRE adoption.