Business Continuity & Disaster Recovery Frameworks
A comprehensive reference for BC/DR planning, recovery objectives, business impact analysis, and testing methodologies.
Core Concepts
Business Continuity (BC)
The capability of an organisation to continue delivery of products or services at acceptable predefined levels following a disruptive incident.
Disaster Recovery (DR)
The process, policies, and procedures related to preparing for recovery or continuation of technology infrastructure critical to an organisation after a natural or human-induced disaster.
Key Difference
- BC = Keeping the business running (people, processes, facilities)
- DR = Recovering IT systems and data
Recovery Objectives
Recovery Time Objective (RTO)
Definition: The maximum acceptable time that a system, application, or function can be down after a disruption.
Examples:
- Critical payment system: RTO = 1 hour
- Email system: RTO = 4 hours
- Internal file storage: RTO = 24 hours
Business Question: "How long can we survive without this?"
Recovery Point Objective (RPO)
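Definition: The maximum acceptable amount of data loss, measured as the period of time before the disruption for which data may be unrecoverable. It sets the minimum frequency of backups or replication.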
Examples:
- Financial transactions: RPO = 0 (no data loss is acceptable)
- CRM system: RPO = 1 hour (up to 1 hour of data can be lost)
- Document management: RPO = 24 hours
Business Question: "How much data can we afford to lose?"
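For periodic backups, the link between RPO and backup frequency comes down to simple arithmetic: a failure just before the next backup loses everything written since the previous one, so the worst-case loss is roughly one backup interval. The sketch below illustrates that check. The function name `meets_rpo` is hypothetical, and the model deliberately ignores the time a backup takes to complete or transfer.

```python
from datetime import timedelta


def meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Return True if periodic backups at this interval can satisfy the RPO.

    Simplifying assumption: worst-case data loss equals one full backup
    interval (backup duration and transfer time are ignored).
    """
    if backup_interval <= timedelta(0):
        raise ValueError("backup_interval must be positive")
    return backup_interval <= rpo


# CRM system from the examples above: RPO = 1 hour.
crm_rpo = timedelta(hours=1)
print(meets_rpo(timedelta(hours=1), crm_rpo))   # hourly backups
print(meets_rpo(timedelta(hours=24), crm_rpo))  # nightly backups
```

Under this model hourly backups satisfy a 1-hour RPO and nightly backups do not. An RPO of 0 can never be met by any positive interval, which is why zero-data-loss targets call for synchronous replication rather than scheduled backups.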
Maximum Tolerable Downtime (MTD)
Definition: The time after which an organisation's viability will be threatened if normal operations cannot be resumed.
Business Question: "At what point does this outage become an existential threat?"
Recovery Consistency Objective (RCO)
Definition: The required level of data consistency across interdependent systems once they have been recovered.
Example: Customer orders, inventory, and payment systems must all recover to the same point in time to maintain data integrity.
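One way to make the consistency requirement testable is to compare the point in time each system was restored to. The following is a minimal sketch under stated assumptions: the function name, the system names, and the idea of a configurable `tolerance` are all illustrative, not part of any standard.

```python
from datetime import datetime, timedelta


def recovery_points_consistent(
    recovery_points: dict[str, datetime],
    tolerance: timedelta = timedelta(0),
) -> bool:
    """Check that all systems were restored to (nearly) the same point in time.

    recovery_points maps a system name to the timestamp of the last
    transaction present after recovery. The spread between the earliest
    and latest recovery point must not exceed the tolerance.
    """
    if not recovery_points:
        return True  # nothing to compare
    points = list(recovery_points.values())
    return max(points) - min(points) <= tolerance


restored_to = datetime(2024, 1, 1, 12, 0)
aligned = {"orders": restored_to, "inventory": restored_to, "payments": restored_to}
skewed = {**aligned, "payments": restored_to - timedelta(minutes=5)}

print(recovery_points_consistent(aligned))  # all three share one recovery point
print(recovery_points_consistent(skewed))   # payments lags by five minutes
```

In the skewed case, orders could reference payments that no longer exist after recovery, which is the integrity problem the example above describes.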
BC/DR Planning Frameworks
ISO 22301:2019 - Business Continuity Management
Region: International
Purpose: International standard for Business Continuity Management Systems (BCMS).
Key Components:
- Context of the organisation
- Leadership and planning
- Business impact analysis and risk assessment
- BC strategy and solutions
- Exercising and testing
- Performance evaluation and continual improvement
Best For: Organisations seeking certification or formal BCMS
Link: ISO 22301:2019
NIST SP 800-34 Rev. 1 - Contingency Planning Guide
Region: United States
Purpose: Federal guidance for IT system contingency planning.
Key Components:
- Contingency planning policy
- Business impact analysis
- Preventive controls
- Contingency strategies
- Plan development, testing, and maintenance
Best For: US federal agencies, contractors, and organisations following NIST guidance
Link: NIST SP 800-34
BS 25999-2 (Superseded by ISO 22301)
Region: United Kingdom
Note: Withdrawn in 2012 and replaced by ISO 22301. Still referenced in some legacy documentation.
BCI Good Practice Guidelines (GPG)
Region: International
Purpose: Professional practice guidance from the Business Continuity Institute.
Key Components:
- Policy and programme management
- Embedding BC in the organisation's culture
- Analysis (BIA and risk assessment)
- Design (strategies and solutions)
- Implementation
- Validation (exercising, testing, maintenance)
Best For: BC practitioners seeking professional development and implementation guidance
Link: BCI Good Practice Guidelines
DRII Professional Practices
Region: International
Purpose: Professional practice framework from DRI International (Disaster Recovery Institute International), an organisation separate from the BCI.
Key Components: 10 Professional Practice areas covering the BC lifecycle
Link: DRI International
Business Impact Analysis (BIA) Process
Purpose
Identify and quantify the impacts of disruptions to critical business functions and the resources required to support them.
BIA Steps
1. Identify Business Functions
2. Assess Impact Over Time
3. Determine RTO/RPO Requirements
4. Identify Critical Resources
5. Document Dependencies
6. Prioritise Recovery
7. Present Findings to Leadership
Impact Categories to Assess
| Impact Type | Examples |
|---|---|
| Financial | Lost revenue, fines, compensation costs |
| Operational | Inability to deliver services, supply chain disruption |
| Reputational | Customer confidence, media coverage, brand damage |
| Regulatory/Legal | Compliance breaches, contractual penalties |
| Health & Safety | Risk to staff or public safety |
BIA Output Examples
| Business Function | MTD | RTO | RPO | Impact (4hr) | Impact (24hr) |
|---|---|---|---|---|---|
| Customer payments | 2hr | 1hr | 0 | £50k loss, regulatory breach | Business-critical |
| Customer support portal | 8hr | 4hr | 1hr | Reputation damage | £20k loss, SLA breach |
| Internal email | 24hr | 8hr | 4hr | Productivity impact | Minor impact |
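The BIA output lends itself to two mechanical checks: each RTO should fit within its MTD, and recovery can be ordered by how quickly each function becomes critical. The sketch below uses the three rows from the table above. The `BiaEntry` class and both function names are hypothetical, and sorting by MTD then RTO is only one plausible prioritisation rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BiaEntry:
    function: str
    mtd_hours: float
    rto_hours: float
    rpo_hours: float


def validate(entry: BiaEntry) -> list[str]:
    """Return a list of problems found in one BIA row (empty if none)."""
    problems = []
    if min(entry.mtd_hours, entry.rto_hours, entry.rpo_hours) < 0:
        problems.append(f"{entry.function}: objectives must not be negative")
    if entry.rto_hours > entry.mtd_hours:
        problems.append(f"{entry.function}: RTO exceeds MTD")
    return problems


def prioritise(entries: list[BiaEntry]) -> list[BiaEntry]:
    """Order functions for recovery: shortest MTD first, then shortest RTO."""
    return sorted(entries, key=lambda e: (e.mtd_hours, e.rto_hours))


ENTRIES = [
    BiaEntry("Internal email", mtd_hours=24, rto_hours=8, rpo_hours=4),
    BiaEntry("Customer payments", mtd_hours=2, rto_hours=1, rpo_hours=0),
    BiaEntry("Customer support portal", mtd_hours=8, rto_hours=4, rpo_hours=1),
]

for entry in prioritise(ENTRIES):
    print(entry.function, validate(entry) or "OK")
```

With the table's values, customer payments is recovered first, then the support portal, then internal email, and every row passes validation.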
BC/DR Strategy Development
Strategy Options by Recovery Speed
| Strategy | RTO Range | Cost | Description |
|---|---|---|---|
| Hot Site | Minutes-1hr | High | Fully equipped, continuously synchronised alternate site |
| Warm Site | 4-24hrs | Medium | Partially equipped site with some infrastructure ready |
| Cold Site | Days-weeks | Low | Empty facility with power and connectivity only |
| Cloud DR | Minutes-hours | Medium | Cloud-based recovery using IaaS/PaaS |
| Mobile Recovery | 24-72hrs | Medium | Transportable recovery facilities |
Backup Strategies by RPO
| RPO Target | Backup Strategy | Technology Examples |
|---|---|---|
| 0 (zero data loss) | Synchronous replication or journalling | Database mirroring, synchronous SAN replication, synchronous redundant database writes |
| Minutes | Asynchronous replication | Continuous data protection, near-real-time replication |
| Hours | Frequent backups | Hourly incremental backups, log shipping |
| 24 hours | Daily backups | Nightly full or incremental backups |
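The table above can be read as a lookup from RPO target to backup strategy. A rough sketch of that lookup follows. The exact boundaries (under 1 hour for "Minutes", under 24 hours for "Hours") are assumptions made for illustration, since the table gives only approximate bands, and the function name is hypothetical.

```python
from datetime import timedelta


def backup_strategy_for(rpo: timedelta) -> str:
    """Map an RPO target to the strategy band from the table.

    Band boundaries are assumed: < 1 hour counts as "Minutes",
    < 24 hours counts as "Hours", anything longer as daily backups.
    """
    if rpo < timedelta(0):
        raise ValueError("RPO cannot be negative")
    if rpo == timedelta(0):
        return "Synchronous replication or journalling"
    if rpo < timedelta(hours=1):
        return "Asynchronous replication"
    if rpo < timedelta(hours=24):
        return "Frequent backups"
    return "Daily backups"


for target in (timedelta(0), timedelta(minutes=15), timedelta(hours=4), timedelta(hours=24)):
    print(target, "->", backup_strategy_for(target))
```

In practice the choice also depends on cost, data volume, and network capacity, so a lookup like this is a starting point for discussion rather than a decision rule.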
Testing Methodologies
Test Types (Progressive Complexity)
1. Tabletop Exercise
Description: Discussion-based session where team members walk through scenarios verbally.
Duration: 2-4 hours
Frequency: Quarterly
Advantages:
- Low cost and disruption
- Good for training and identifying gaps
- Tests understanding and decision-making
Disadvantages:
- Doesn't test actual systems
- May not reveal technical issues
Example Scenario: "The primary data centre has lost power and cooling. Walk through your response steps."
2. Simulation Test
Description: Teams respond to a scenario in near-real-time, without affecting production systems.
Duration: 4-8 hours
Frequency: Semi-annually
Advantages:
- Tests coordination and communication
- Identifies process gaps
- Minimal business disruption
Disadvantages:
- Doesn't validate technical recovery
- Requires significant planning
3. Parallel Test
Description: Recovery systems are activated alongside production systems without failover.
Duration: 1-2 days
Frequency: Annually
Advantages:
- Tests actual recovery capability
- No business disruption
- Validates backup data integrity
Disadvantages:
- Costly
- Doesn't test full failover process
4. Full Interruption Test
Description: Production systems are shut down and operations fail over fully to the recovery environment.
Duration: Varies (planned outage window)
Frequency: Every 1-3 years (rarely performed)
Advantages:
- Complete validation of DR capability
- Tests all aspects including staff response
Disadvantages:
- High risk and cost
- Significant business disruption
- Requires executive approval
Note: Typically only performed for critical systems with mature DR programmes.
Test Documentation Requirements
Pre-Test:
- Test objectives and scope
- Success criteria
- Participants and roles
- Test scenario details
- Rollback procedures
During Test:
- Actions taken (timestamped)
- Issues encountered
- Decisions made
Post-Test:
- Results vs. success criteria
- Lessons learned
- Action items for plan improvement
- Updated RTO/RPO actuals
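Timestamped actions recorded during a test make it straightforward to derive the achieved recovery time and compare it with the target. The sketch below assumes a simple log of `(timestamp, event)` pairs. The event labels, the sample times, and the function name are all hypothetical.

```python
from datetime import datetime, timedelta


def measured_recovery_time(
    log: list[tuple[datetime, str]],
    start_event: str,
    end_event: str,
) -> timedelta:
    """Return the elapsed time between two named events in a test log."""
    times = {event: timestamp for timestamp, event in log}
    return times[end_event] - times[start_event]


# Illustrative log from a recovery test of a system with RTO = 1 hour.
TEST_LOG = [
    (datetime(2024, 3, 1, 9, 0), "outage declared"),
    (datetime(2024, 3, 1, 9, 20), "failover started"),
    (datetime(2024, 3, 1, 9, 50), "service restored"),
]
RTO_TARGET = timedelta(hours=1)

actual = measured_recovery_time(TEST_LOG, "outage declared", "service restored")
print(f"Actual recovery time: {actual}, RTO met: {actual <= RTO_TARGET}")
```

Here the measured recovery time is 50 minutes against a 1-hour target, so the success criterion is met and the figure can be recorded as the updated RTO actual.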
BC/DR Plan Components
Essential Plan Elements
- Plan Activation Criteria
  - Who can invoke the plan
  - Triggering events
  - Decision tree
- Emergency Contact Information
  - Crisis management team
  - Key vendors/suppliers
  - Emergency services
  - Notification cascades
- Roles and Responsibilities
  - Crisis management team structure
  - Recovery team leaders
  - Communication coordinators
- Recovery Procedures
  - Step-by-step technical recovery tasks
  - System dependencies and sequence
  - Estimated timeframes
- Communication Plan
  - Internal communications (staff)
  - External communications (customers, suppliers, media)
  - Regulatory notifications
  - Templates for common scenarios
- Alternative Working Arrangements
  - Remote working capabilities
  - Alternative facilities
  - Equipment and supplies
- Vendor and Third-Party Contact Details
  - Support contracts and escalation paths
  - SLA reference information
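For the recovery procedures element, the sequence in which systems are restored follows from their dependencies: a system can only be brought back once everything it relies on is running. One way to derive that sequence is a topological sort, sketched below with Python's standard `graphlib` module (Python 3.9+). The system names and the dependency map are invented for illustration.

```python
from graphlib import TopologicalSorter


def recovery_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return systems in an order where each appears after its dependencies.

    dependencies maps each system to the set of systems it needs.
    Raises graphlib.CycleError if the dependencies are circular.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map: key depends on every system in its set.
DEPENDENCIES = {
    "network": set(),
    "database": {"network"},
    "payment service": {"database"},
    "customer portal": {"database", "payment service"},
}

for step, system in enumerate(recovery_order(DEPENDENCIES), start=1):
    print(f"{step}. Recover {system}")
```

A circular dependency raises an error instead of producing an order, which is itself a useful finding to resolve before an incident rather than during one.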
Industry-Specific Requirements
Financial Services
- PRA/FCA (UK): Operational resilience requirements
- FFIEC (US): IT Examination Handbook, Business Continuity Management booklet (formerly Business Continuity Planning)
- Basel Committee: Principles for operational resilience
Healthcare
- NHS England: Business continuity guidance for NHS organisations
- HIPAA (US): Contingency plan requirements under the Security Rule (45 CFR 164.308(a)(7))
Critical Infrastructure
- NIS Regulations (UK): BC requirements for operators of essential services
- NIS2 Directive (EU): Enhanced resilience measures
Quick Selection Guide
| Organisation Profile | Recommended Framework | Testing Frequency |
|---|---|---|
| Small business (<50 staff) | Simplified BCI GPG approach | Annual tabletop |
| Medium enterprise | ISO 22301 or BCI GPG | Quarterly tabletop, Annual simulation |
| Large enterprise | ISO 22301 + industry-specific | Monthly tabletop, Quarterly simulation, Annual parallel test |
| US Federal/Contractor | NIST SP 800-34 | Per agency requirements |
| Financial services (UK) | ISO 22301 + PRA/FCA guidance | Quarterly minimum |
| Healthcare (UK) | ISO 22301 + NHS guidance | Semi-annual minimum |
Key Metrics and KPIs
| Metric | Description | Target |
|---|---|---|
| Plan Currency | % of plans reviewed within last 12 months | 100% |
| Staff Awareness | % of staff who know how to access BC plans | >80% |
| Test Coverage | % of critical systems tested annually | 100% |
| RTO Achievement | % of recovery tests meeting RTO targets | >95% |
| RPO Achievement | % of recoveries meeting RPO targets | >95% |
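Each metric in the table is a simple ratio, so it can be computed directly from test records. As a sketch, the RTO Achievement figure could be derived as follows. The function name and the sample results are hypothetical, and each result is assumed to be a pair of actual and target recovery times in hours.

```python
def rto_achievement(results: list[tuple[float, float]]) -> float:
    """Return the percentage of recovery tests that met their RTO target.

    Each result is (actual_hours, target_hours); a test passes when the
    actual recovery time does not exceed the target.
    """
    if not results:
        raise ValueError("at least one test result is required")
    met = sum(1 for actual, target in results if actual <= target)
    return 100.0 * met / len(results)


# Hypothetical results from four recovery tests.
RESULTS = [(0.8, 1.0), (5.0, 4.0), (7.5, 8.0), (20.0, 24.0)]

score = rto_achievement(RESULTS)
print(f"RTO Achievement: {score:.0f}% (target: >95%)")
```

Three of the four sample tests meet their target, giving 75%, which falls short of the >95% target and would prompt a review of the failed recovery.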
Common Pitfalls
- Plans Not Maintained: Plans become outdated as technology and staff change
- Insufficient Testing: Tabletop exercises only, no validation of actual recovery
- Single Points of Failure: Key person dependencies or single-vendor reliance
- Inadequate Documentation: Plans are too high-level or too technical
- No Alternative Communications: Primary communication method fails and no backup exists
- Backup Data Not Tested: Backups exist but restoration has never been validated
- Scope Creep: Trying to protect everything instead of focusing on critical functions