A Zamboni Fire Drill
What is a Fire Drill?
At NoRedInk we recently conducted a fire drill with Team Zamboni, a team responsible for “smoothing over the ice” for internal departments by providing engineering tooling and automation solutions.
A fire drill is a problem deliberately introduced in production by a couple of team members. They temporarily break something that will trigger an alert, prompting the rest of the team to swarm on it and resolve the issue. The result is a learning experience for the whole team.
Why conduct a fire drill?
There are several reasons I wanted to lead the team through a fire drill.
- Test our fire response readiness. When executing on a project it’s a good idea to add error handling, logging, and alerting, but how do you know if those measures are enough? A fire drill effectively serves as the retroactive acceptance criteria for those non-functional requirements.
- Identify areas of improvement. A fire drill puts our documentation and notifications to the test. Ultimately, it is difficult to anticipate what is sufficient for troubleshooting issues and a fire drill will highlight where we can make improvements to make fighting a real fire go even better.
- Team bonding 🙂. With the entire team working remotely, most of our communication is asynchronous. Bringing everyone together to solve a problem is an opportunity to have synchronous time to learn, to socialize, and to grow together as a team.
Pre-work: Assign Roles
We identified four roles for our first Zamboni fire drill: Arsonist, Fire Captain, Fire Fighter(s), and Supervisor.
The Arsonist 🔥
The Arsonist is the person who causes an alert to trigger and ultimately commences the fire drill. There are three types of “arson” we identified.
- Smoke - A benign action that forces an alert to trigger even though there is nothing wrong. Temporarily lowering a CPU usage alert threshold is an example of smoke.
- Ember - A real problem that, if left unchecked for an extended period of time, could escalate to the level of a fire and impact users.
- Fire - Something that immediately impacts users and would otherwise require on-call individuals to get involved immediately. We want to avoid this option during a fire drill, even accidentally, as much as we are able to!
Overall, the Arsonist initiates the fire drill, observes the fire fighters, and captures suggestions. They are also encouraged to “play nice” in the sense that the fire drill should be easily resolvable within 30 minutes.
The Fire Captain and Fire Fighters 🚒
The Fire Captain and Fire Fighters are the people who put out the fake fire during the fire drill. The Fire Captain is a fire fighter with a few additional responsibilities related to logistics. Those responsibilities include:
- Identify the triggered alert as a fire drill fire
- Create a Slack thread and a Zoom room for everyone to join
- Create a collaborative document for capturing notes
The Supervisor 📋
The Supervisor has one foot in the fire drill and one foot in normal everyday affairs. Overall they help the fire drill go smoothly. This person is primarily responsible for communication, which means shielding fire drill participants from distractions during the drill and aborting the fire drill if something urgent comes up.
The Plan
For our first fire drill we decided to go with the ember option. We recently did some work to take site usage data for individual users and push that data to a third-party CRM. This data is re-synced with the CRM periodically through several scheduled batch jobs. Forcing these CRM sync jobs to break was the perfect opportunity for a fire drill for a few reasons.
- Students and teachers, our end-users, are not impacted. In other words, it’s not a real fire.
- Our internal departments are not impacted. We were well within the non-functional requirement which specified how often the usage data should be kept up to date. Temporarily disabling one or more of these CRM sync jobs for a couple of hours would not put that SLA at risk.
- It’s top of mind because we just worked in this area. We have what we think is sufficient documentation and monitoring. Identifying gaps through a fire drill gives us a short feedback loop, which makes it a good learning opportunity.
The Ember
The ember involved forcing the credentials used by our scheduled batch jobs to be incorrect, causing our recently created batch jobs to fail. That would ultimately cause an alert to be sent to our notifications Slack channel. This sounds easy enough, but there are several precautions to keep in mind.
- There’s a difference between setting an incorrect username and setting an incorrect password. The former is safer because the latter can lead to that specific service account becoming locked and therefore lead to a real fire.
- Even changing the username can be risky because some third-party services will automatically drop repeated failed authentication requests. Again, this would cause a larger problem for us.
- The ember needs to be easily disabled. Killswitches and feature flags are a great option here. If needed, we should be able to abort the fire drill with the push of a button.
We decided to push some code to production in advance which would allow us to remotely override the username used by these jobs. About an hour before the fire drill was scheduled, we changed the username so that an hourly monitor would trigger close to the start time of the fire drill.
Right on cue, our Slack notification channel received an alert that something was clearly wrong with the sync jobs. The Fire Captain worked through their checklist, everyone jumped on the Zoom, and most of the team fought the fire while the non-fire fighters observed and kept an eye on things.
After about 30 minutes the team homed in on the root cause and fixed it. We immediately held a debrief on what the fire drill taught us and the improvements we should try to make.
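As a rough illustration of that kind of remote override, here is a minimal Python sketch. It is not our actual code. The flag names `FIRE_DRILL_ENABLED` and `FIRE_DRILL_USERNAME_OVERRIDE`, the `Credentials` type, and the `resolve_credentials` helper are all hypothetical, and the flags are assumed to arrive as a simple string mapping from whatever feature-flag or config system is in use.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class Credentials:
    username: str
    password: str


def resolve_credentials(real: Credentials, flags: Mapping[str, str]) -> Credentials:
    """Return the credentials a sync job should use.

    Hypothetical flags (assumptions, not a real API):
      FIRE_DRILL_ENABLED           - kill switch; anything but "true" disables the drill
      FIRE_DRILL_USERNAME_OVERRIDE - the deliberately wrong username to use
    """
    drill_enabled = flags.get("FIRE_DRILL_ENABLED", "false").strip().lower() == "true"
    override = flags.get("FIRE_DRILL_USERNAME_OVERRIDE", "").strip()

    # Only the username is swapped. The password is left alone so repeated
    # failures are less likely to lock the real service account.
    if drill_enabled and override:
        return Credentials(username=override, password=real.password)

    # Kill switch off, or no override set: behave exactly as normal.
    return real
```

The point of the design is that flipping a single flag back to `false` ends the drill without a deploy, and the default with no flags set is always the real credentials.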
What a Zamboni fire drill taught us
We have a few valuable takeaways from conducting this fire drill.
Prefer terse notifications that precisely identify the issue. When our notification Slack channel received the fire drill alert, the message had what we initially thought was sufficient detail. It was sufficient, but the amount of detail also obscured the part that precisely identified the root cause. A more concise notification with a link to more verbose details would have helped steer us toward the root cause sooner.
Monitoring dashboards can be really helpful. When fighting a fire, having relevant information in an at-a-glance format helps you orient yourself and build context. However, adding a monitoring dashboard is a task that can be easily missed since it is not a user-facing feature or, in most cases, even the fulfillment of a non-functional requirement. This fire drill taught me that we should consider making it a non-functional requirement in future projects.
The more context sharing and cross-pollination the better. As projects are executed, team members naturally specialize in certain aspects of the project. One person is more familiar with our data reconciliation while another is more familiar with the nuances of our CRM. In building these specialties we build up helpful context and little tips and tricks which are easily siloed. We acknowledged that by continuing to have all team members pair with each other we can spread context outside of individual silos, which should directly help share fire fighting knowledge across the team.
Double check the “fire” didn’t spread too far. When we broke authentication for the scheduled jobs we forgot to account for a separate workflow that uses the same credentials! Fortunately, this additional breakage was not impactful enough to be considered a real fire and we were able to address the broken workflow by the next day.
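To make the idea concrete, here is a small Python sketch of a terse alert message: one line naming the job and the cause, followed by a link to the verbose details. The `format_alert` function, its parameters, and the example URL are hypothetical and only illustrate the shape of the message, not our actual alerting code.

```python
def format_alert(job_name: str, cause: str, details_url: str, max_len: int = 120) -> str:
    """Build a two-line alert: a short summary, then a link to full details."""
    summary = f"{job_name} failed: {cause}"

    # Keep the summary line short so the cause is not buried in detail.
    if len(summary) > max_len:
        summary = summary[: max_len - 3].rstrip() + "..."

    return f"{summary}\nDetails: {details_url}"


if __name__ == "__main__":
    print(
        format_alert(
            job_name="crm-sync",
            cause="401 Unauthorized for service account",
            details_url="https://example.invalid/runs/123",
        )
    )
```

Everything that does not fit in the summary line (stack traces, payloads, retry history) would live behind the link rather than in the channel message.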
Consider a fire drill on your team
Everything considered, Team Zamboni invested a little more than an hour of team time towards the fire drill. As a result, we gathered very tangible takeaways to be incorporated into future projects. Even though it took the team away from the project for part of a day, I expect to see this investment very quickly pay for itself many times over.
Should you incorporate fire drills? I think it’s at least worth a try! Please feel free to reach out to me if you have specific questions. @ckoster22 on Twitter