Incident Response Plan
There are several ways that DevOps engineers can be notified of an incident:
- Customer signup notification (notifies us when unexpected behavior occurs while someone is signing up for TruConversion)
- Customer complaint notification (notifies us when unexpected behavior occurs while someone is using any TruConversion feature)
- Service monitoring notification (notifies us when unexpected behavior occurs in the TruConversion service infrastructure)
Infrastructure Failure Incident
If our cloud hosting provider suffers an isolated incident in one of our instances, we should be able to remain online thanks to the High Availability setups we have configured. In case of an outage in a single Availability Zone, we should also remain online. In case of a widespread outage, we will go offline until the upstream service is restored. In any case, you should:
- Confirm that the issue really is with the upstream provider.
- Check the upstream provider's status updates and share them internally so we can inform users.
- If the service issue lasts longer than a couple of hours, start assessing the possibility of migrating the affected service to another Region.
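The steps above can be read as a small decision procedure. The following is a minimal sketch only: the names `OutageScope`, `MIGRATION_THRESHOLD`, and `next_steps` are hypothetical, and reading "a couple of hours" as two hours is an assumption, not a value defined by this plan.

```python
from datetime import timedelta
from enum import Enum
from typing import List


class OutageScope(Enum):
    """Outage scopes described in this section (names are illustrative)."""
    SINGLE_INSTANCE = "single instance"
    SINGLE_AZ = "single Availability Zone"
    WIDESPREAD = "widespread"


# Assumption: "a couple of hours" is interpreted as two hours.
MIGRATION_THRESHOLD = timedelta(hours=2)


def next_steps(upstream_confirmed: bool, scope: OutageScope,
               elapsed: timedelta) -> List[str]:
    """Return the actions this section suggests for the given situation."""
    if not upstream_confirmed:
        # First step: make sure the provider is really the cause.
        return ["Confirm whether the upstream provider is really the cause."]

    steps = ["Follow the provider's status updates and relay them internally."]
    if scope is OutageScope.WIDESPREAD:
        steps.append("Expect to be offline until the upstream service is restored.")
    else:
        steps.append("Expect to stay online through the High Availability setup.")
    if elapsed > MIGRATION_THRESHOLD:
        steps.append("Start assessing a migration of the affected service to another Region.")
    return steps


if __name__ == "__main__":
    for step in next_steps(True, OutageScope.WIDESPREAD, timedelta(hours=3)):
        print("-", step)
```

The threshold is kept as a named constant so the team can adjust it once "a couple of hours" is agreed on more precisely.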
Security Breach Incident
If you notice a security breach of any kind, you should:
- Report the issue internally so it can be escalated and communicated to users. We may need to contact affected customers directly.
- Collect and analyse evidence that made you classify this as a security breach.
In case of affected instances:
- Turn them off and create snapshots for future investigation.
- Rotate any credential that might have been present in the instances.
In case of compromised credentials, for example through email phishing:
- Rotate any credential that might have been compromised.
- Assume more things have been compromised and investigate other possible affected targets.
Security breaches include, but are not limited to:
- Loss or theft of personal computing devices used to store or access TruConversion systems.
- Breaches of any TruConversion systems.
- Unintended disclosure of TruConversion sensitive information.
Reacting to Incidents
- Ensure the whole team knows by announcing it on the TruConversion Team channel. Use @channel to attract everyone’s attention.
- Try to identify which services are affected. If this takes more than a couple of minutes, coordinate with other online engineers and ask for help. This might mean starting a Slack or WhatsApp chat where you can discuss your findings during the incident without stopping the actual remediation efforts.
- When you’ve identified the affected services, decide on the severity of the incident:
- Was there a security breach?
- Is customer data affected?
- Is the incident part of a larger vendor outage (for example, AWS)?
- Will a reliable fix be easy to produce?
- Can you do it on your own?
- How long will it take you to deploy it?
- Do you need someone to review your fix before and after you deploy it?
- Do we need to go into maintenance mode in the meantime?
- Are you sure what you are fixing is the actual root cause of the problem?
- Make sure the DevOps team is aware of the issue. If none of them are online, contact them immediately by phone. They will most likely know about the issue before anyone else, but verify this if you are unsure.
- Create an activity log to track what changes are being made and what is known about the outage. This can be as simple as posting short updates in a channel like #TruConversion. The log is very useful for hand-overs and for writing the post-mortem.
- Discuss in the TruConversion Team channel if we should enter maintenance mode. Maintenance mode should be used if the outage is expected to take more than a few minutes. If it’s decided that we should enter maintenance mode, a developer should immediately do so.
- If users contact us, use the "Incident Reply – Maintenance Mode" saved reply if TruConversion is in maintenance mode, and the "Incident Reply – Not in Maintenance Mode" saved reply if it is not.
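As an illustration of the triage and maintenance-mode steps above, here is a minimal sketch. The severity labels ("critical", "major", "minor"), the `Triage` fields, and the five-minute reading of "a few minutes" are all assumptions made for the example; this plan does not define them. Only the two saved-reply names come from the plan itself.

```python
from dataclasses import dataclass
from datetime import timedelta

# Assumption: "more than a few minutes" is interpreted as more than five minutes.
MAINTENANCE_THRESHOLD = timedelta(minutes=5)


@dataclass
class Triage:
    """Answers to some of the triage questions (field names are hypothetical)."""
    security_breach: bool
    customer_data_affected: bool
    expected_downtime: timedelta


def severity(t: Triage) -> str:
    """Map triage answers to an illustrative severity label."""
    if t.security_breach or t.customer_data_affected:
        return "critical"
    if t.expected_downtime > MAINTENANCE_THRESHOLD:
        return "major"
    return "minor"


def should_enter_maintenance(t: Triage) -> bool:
    """Maintenance mode is suggested when the outage outlasts the threshold."""
    return t.expected_downtime > MAINTENANCE_THRESHOLD


def saved_reply(in_maintenance_mode: bool) -> str:
    """Pick the saved reply named in this plan for the current mode."""
    if in_maintenance_mode:
        return "Incident Reply – Maintenance Mode"
    return "Incident Reply – Not in Maintenance Mode"


if __name__ == "__main__":
    t = Triage(security_breach=False, customer_data_affected=False,
               expected_downtime=timedelta(minutes=30))
    print(severity(t), should_enter_maintenance(t))
    print(saved_reply(should_enter_maintenance(t)))
```

The final decision to enter maintenance mode still belongs to the discussion in the team channel; the function only encodes the rule of thumb.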
After the Incident is Solved
- Verify that the incident has indeed been resolved.
- Update the team in the TruConversion Team channel.
- Make sure we’ve left maintenance mode if it was enabled.
- If the maintenance took much longer than planned, prepare an email to users explaining what happened.
- Verify that monitoring is in place to detect this issue in the future.
- If the incident was long in duration or affected a broad range of services, create a post-mortem analysis with a detailed timeline so we can better understand the root cause and improve the process in the future.
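The activity log kept during the incident can feed the post-mortem timeline directly. The sketch below shows one possible way to do that; `ActivityLog`, its methods, and the sample authors and notes are hypothetical, and it assumes all timestamps are timezone-aware UTC values.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional, Tuple

Entry = Tuple[datetime, str, str]  # (timestamp, author, note)


@dataclass
class ActivityLog:
    """In-memory activity log; a real one would likely live in the team channel."""
    entries: List[Entry] = field(default_factory=list)

    def record(self, author: str, note: str, at: Optional[datetime] = None) -> None:
        # Default to the current UTC time when no timestamp is supplied.
        self.entries.append((at or datetime.now(timezone.utc), author, note))

    def timeline(self) -> str:
        """Render entries in chronological order, one line per entry."""
        ordered = sorted(self.entries, key=lambda entry: entry[0])
        return "\n".join(
            f"{at:%Y-%m-%d %H:%M} UTC  {author}: {note}"
            for at, author, note in ordered
        )


if __name__ == "__main__":
    log = ActivityLog()
    log.record("alice", "Restarted worker",
               at=datetime(2024, 1, 1, 10, 30, tzinfo=timezone.utc))
    log.record("bob", "Alert fired",
               at=datetime(2024, 1, 1, 10, 0, tzinfo=timezone.utc))
    print(log.timeline())
```

Sorting by timestamp means updates can be recorded in any order during the incident and still produce a readable timeline afterwards.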