Outages
Processes and SLAs for planned and unplanned outages on Party Bus
Planned Outages
Process
For planned Party Bus updates that could result in an outage or downtime for customers, Party Bus will communicate all updates at least 24 hours in advance.
- Communication method
  - Planned outages are primarily infrastructure upgrades or patches. They are usually managed by Party Bus Operations (PB Ops), but Mission DevOps (MDO) and the CollabTools Team can also schedule them.
  - Primary communication: Mattermost (MM) bot notifications
- What is the process for customers to request that an outage be rescheduled due to mission impact?
  - Customers may do one of the following:
    - Create a help ticket
    - Comment in the Value Stream - Party Bus - Support channel on IL2 MM
- For Party Bus planning purposes, the total Levels of Effort (LOEs) are categorized into the following bins. Note that these LOEs do not equate to outage durations; expected downtime is stated in the SLAs.
  - Low: 30 min
  - Medium: 60 min
  - High: 2 biz hours
  - Extra High (very rare): 4 biz hours
Service Level Agreements (SLAs)
Party Bus makes every effort to adhere to the following SLAs:
- What is the maximum downtime expected during planned outages?
  - Low: < 15 min
  - Medium: 16 - 45 min
  - High: 46 min - 1.5 biz hours
  - Extra High (very rare): 1.5+ biz hours
- The Party Bus team aims to minimize downtime during planned outages, ideally to 0 minutes per update.
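
The planned-outage downtime bins can be expressed as a small lookup, which may help when tagging outage records consistently. The Python sketch below is illustrative only and is not part of Party Bus tooling: the function name `planned_downtime_bin` is hypothetical, durations are assumed to be given in business minutes (so 1.5 biz hours = 90 min), and because the published ranges do not assign boundary values such as exactly 15 min, the sketch assumes each boundary falls into the lower bin.

```python
def planned_downtime_bin(minutes: float) -> str:
    """Return the planned-outage SLA bin for a downtime given in business minutes.

    Assumption (not stated in the SLA): boundary values (15, 45, 90 min)
    belong to the lower bin.
    """
    if minutes < 0:
        raise ValueError("downtime cannot be negative")
    if minutes <= 15:
        return "Low"
    if minutes <= 45:
        return "Medium"
    if minutes <= 90:  # 1.5 biz hours
        return "High"
    return "Extra High"


if __name__ == "__main__":
    for sample in (5, 30, 75, 120):
        print(sample, "min ->", planned_downtime_bin(sample))
```

A different boundary convention would only change the comparison operators; the bin names and thresholds come directly from the SLA list.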
Unplanned Outages
Process
- Upon learning of an unplanned outage, the Party Bus team will immediately triage the event. Note that working hours are normally limited to those listed in Party Bus's Terms and Conditions (T&Cs), unless otherwise coordinated directly with a customer.
- Determine outage severity
  - Party Bus's Disaster Recovery Plan (DRP) determines the severity of the unplanned outage. Note: The DRP is a work in progress. See the Disaster Recovery Planning document in Confluence.
  - Levels of severity:
    - Low: < 2 biz hours
    - Medium: 2 - 4 biz hours
    - High: 4 - 8 biz hours (requires After Action Report, AAR)
    - Critical: 8+ biz hours (24 calendar hours, requires AAR)
- What is the process for resolving unplanned outages, and how are those outages communicated?
  - The DRP provides additional information about this process.
  - Party Bus uses the following communication plan for unplanned outages.
| Outage type | Primary | Alternate | Contingency |
|---|---|---|---|
| Pipeline Outages | GitLab | Mattermost | Email via Odoo |
| All Other Outages | Mattermost | Email via Odoo | TBD |
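
The communication plan above is a fallback chain: use the primary channel, and move to the next one only if the preceding channel is unavailable. A minimal Python sketch of that lookup follows. The names `COMM_PLAN` and `pick_channel` and the keys `"pipeline"` and `"other"` are hypothetical, chosen only for illustration; the contingency channel for non-pipeline outages is listed as TBD, so it is left out rather than guessed.

```python
# Channels in order: primary, alternate, contingency (taken from the table).
COMM_PLAN = {
    "pipeline": ["GitLab", "Mattermost", "Email via Odoo"],
    # Contingency for all other outages is TBD in the plan, so only two entries.
    "other": ["Mattermost", "Email via Odoo"],
}


def pick_channel(outage_type, unavailable=frozenset()):
    """Return the first channel in the plan that is not marked unavailable.

    Returns None when every listed channel is unavailable.
    """
    for channel in COMM_PLAN[outage_type]:
        if channel not in unavailable:
            return channel
    return None


if __name__ == "__main__":
    # GitLab itself is down during a pipeline outage, so fall back.
    print(pick_channel("pipeline", unavailable={"GitLab"}))
```

Keeping the plan as ordered lists means a newly agreed contingency channel can be appended without changing the lookup logic.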
For unplanned outages, Party Bus will publish an AAR following the event (pending sensitivity and classification) that details the following:
- An overview of what happened
- How it was solved, including technical details
- What steps Party Bus is taking to prevent a similar outage from happening again
Once available, AARs may be requested via a link posted by the MM notification bot after the outage is resolved.
Archived AARs are posted on Party Bus's internal IL4 Confluence server.
Service Level Agreements (SLAs)
- Party Bus categorizes unplanned outages into the following severities:
  - Low: < 2 biz hours
  - Medium: 2 - 4 biz hours
  - High: 4 - 8 biz hours (requires AAR)
  - Critical: 8+ biz hours (24 calendar hours, requires AAR)
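
The severity levels above, together with the AAR requirement, can be sketched as a simple classifier. This Python example is an illustration under stated assumptions, not Party Bus's actual triage logic: `Severity` and `classify_unplanned` are hypothetical names, the input is assumed to be outage duration in business hours, and shared boundaries (4 and 8 hours) are assumed to belong to the higher severity, consistent with "8+" for Critical. The "24 calendar hours" condition is not modeled.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Severity:
    name: str
    requires_aar: bool  # After Action Report required for High and Critical


def classify_unplanned(biz_hours: float) -> Severity:
    """Map an unplanned outage's duration (business hours) to a severity.

    Assumption: intervals are half-open, so exactly 2, 4, or 8 hours
    falls into the higher severity.
    """
    if biz_hours < 0:
        raise ValueError("duration cannot be negative")
    if biz_hours < 2:
        return Severity("Low", False)
    if biz_hours < 4:
        return Severity("Medium", False)
    if biz_hours < 8:
        return Severity("High", True)
    return Severity("Critical", True)


if __name__ == "__main__":
    for hours in (1, 3, 6, 10):
        print(hours, "biz hours ->", classify_unplanned(hours))
```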