Electric Power Utilities’ Cybersecurity for Contingency Operations
WG D2.50 built this Technical Brochure on extensive research by others: 86 references are included in the bibliography and cited in the main body text and annexes of the Technical Brochure. Working Group subject matter experts then tailored the findings of this research to the restoration of cybersecurity protection after a widespread disaster. The approach used is well aligned with the concepts described for the “Grid Architecture” in the September/October 2019 issue of IEEE Power and Energy magazine. Using model-based systems engineering processes, multiple solutions were analyzed to improve cooperation among all participating electric utilities and supporting organizations (e.g., government agencies, law enforcement, contractors) involved in a widespread disaster recovery and reconstitution activity. Most important is the need for well-defined agreements between utilities to establish the chain of command needed to recognize legitimate new players with access and use control privileges as they join and leave the disaster response team.
Convenor
(US)
D.K. HOLSTEIN
Secretary
(US)
T.W. CEASE
I. PATRIOTA DE SIQUEIRA (BR), M. HAMALAINEN (FI), H. FALK (US), U. CARMO (BR), J. STEWART (US), V. VEYSI (IR), H. LI (UK), G. TAYLOR (UK), P.K. AGARWAL (IN), M. TALJAARD (ZA), J. WACK (US), M. ALEMAM (US), G. PULMAN (UK), C. WORRAD (AU), I. BOLOTOV (UK), L. GARRIGIES (FR), D. ARRIBAS (ES), D. CAMPARA (BA), J. HONG (US), L. WILLIAMS (AU), E. MORALES (CL), G. RASCHE (US), A. JOHNSON (US)
Introduction
WG D2.50 is a continuation of earlier Study Committee B5 and D2 Working Groups, specifically JWG B5/D2.46, WG D2.31, WG D2.38, WG D2.40, and WG D2.46. The scope of the TB includes assessments and supporting data in existing technical reports (TRs), Technical Brochures (TBs), and open-source documentation to characterize the evolving threat and the imposition of emerging laws and regulations, and so to focus attention on the cyber-physical security (CPS) challenges during electric power utility (EPU) contingency operations. From a regulation point of view, a good example of a cybersecurity cost-benefit analysis is described in [1]. The objective of this review is to:
1. Identify the situations that require consideration of CPS over-rides to gain access to and use of protection, automation, and control system (PACS) assets.
2. Provide supporting analysis and use cases for enabling over-ride mechanisms to ensure access to and use of PACS assets.
3. Identify CPS over-ride requirements that should be included in policies, procedures, and organizational directives (PP&ODs) related to PACS operation.
- Describe the processes and interaction of responsible organizational units (ROUs) to ensure CPS over-ride requirements are implemented in accordance with local norms, laws and regulations.
- Identify the skills and training required to effectively manage over-ride operations.
- Describe the processes needed to reactivate security controls to return to normal secure PACS operations.
4. Develop classes of metrics that can be used by other CIGRE Study Committees to quantify CPS security solutions in terms of deployment rate, response rate, and degree of complexity. An approach like the Defense Innovation Board metrics for software development, among others, is considered.
The challenge
WG D2.50 deliverables (Technical Brochures, tutorials, and other presentations) include summaries of key findings or key points. This Technical Brochure uses a standardized template for presenting the findings and supporting analysis. Following is an example of WG D2.50’s template. The objective is to clearly state the question being addressed, the findings or conclusions reached by the WG’s analysis, and the meaning in terms of interpretation of the findings.
Question: Statement of scope, e.g., What are the challenges for field crews to access substation operational communication networks during contingency operations?
Findings: Statement of conclusions, e.g., In the analysis of field crews using personal computers, the primary concern is the high possibility of spreading malware to critical protection, automation, and control devices attached to the operational communication networks.
Meaning: Statement of interpretation and impact, e.g., Only approved company computers that are routinely checked for malware and cannot be used for personal work should be used for access to substation communication networks.
To support the summaries of key findings, the Working Group includes supporting analysis that describes the importance of the analysis; the objectives and methods used, including data and criteria; and the results in terms of relevance, outcomes, and measures.
For this Technical Brochure, WG D2.50 used a generic description of an EPU distribution system architecture. Figure 1 illustrates a typical interface between the IT and OT networks with an emphasis on securely separating various zones of operation. Access control management under contingency conditions is one of the issues addressed in this Technical Brochure.
Multiple regulatory guidelines focus attention on contingency operations. For example, the National Association of Regulatory Utility Commissioners (NARUC) published a cybersecurity tabletop exercise guide to help public utility commissions (PUCs) gather and evaluate information from utilities about their cybersecurity risk management and preparedness [2].
In summary, the principal goal of this Technical Brochure is to determine the right activities to focus on and to clarify who is responsible for their implementation under stress during contingency operations. The objective of this focus is to minimize the impact on system reliability as measured by the loss of load probability (LOLP) and expected energy not supplied (EENS). One example using these metrics is discussed in [3].
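As a rough illustration of the two reliability metrics named above, the following sketch computes LOLP and EENS for a toy system. The generating units, their forced outage rates, and the hourly load samples are hypothetical values chosen for illustration. The state enumeration assumes independent two-state units and is not the method used in [3].

```python
from itertools import product

# Hypothetical generating units: (capacity in MW, forced outage rate).
UNITS = [(50.0, 0.10), (50.0, 0.10), (100.0, 0.05)]

# Hypothetical hourly load samples in MW, one value per hour.
HOURLY_LOAD_MW = [80.0, 120.0, 160.0, 190.0]


def capacity_states(units):
    """Enumerate (available capacity, probability) for independent two-state units."""
    states = []
    for up_flags in product([True, False], repeat=len(units)):
        capacity = 0.0
        probability = 1.0
        for (cap, outage_rate), is_up in zip(units, up_flags):
            if is_up:
                capacity += cap
                probability *= 1.0 - outage_rate
            else:
                probability *= outage_rate
        states.append((capacity, probability))
    return states


def lolp_and_eens(units, hourly_load_mw):
    """Return (LOLP, EENS in MWh) over the given hourly load samples."""
    states = capacity_states(units)
    loss_probability_sum = 0.0
    eens_mwh = 0.0
    for load in hourly_load_mw:
        for capacity, probability in states:
            shortfall = load - capacity
            if shortfall > 0:
                loss_probability_sum += probability
                eens_mwh += probability * shortfall  # each sample covers one hour
    return loss_probability_sum / len(hourly_load_mw), eens_mwh


lolp, eens = lolp_and_eens(UNITS, HOURLY_LOAD_MW)
print(f"LOLP = {lolp:.4f}, EENS = {eens:.3f} MWh")
```

Here LOLP is taken as the average, over the sampled hours, of the probability that available capacity falls short of load, and EENS as the probability-weighted shortfall summed over those hours. A delayed restoration of CPS controls could be represented in such a model by raising the assumed outage rates, although that mapping is an assumption rather than a result of this Technical Brochure.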
Our technical approach
Restoring the grid after a major disruption
IEEE Power Engineering Society’s Power System Relaying Committee commissioned Working Group H22 to develop a guide for categorizing security needs for protection, automation, and control related data files [4]. This reference will be helpful to better understand the relative importance of specific data files and prioritize the need to reconstitute the CPS mechanisms.
Enhancing the Resilience of the Nation’s Electricity System [5] focuses attention on restoring grid function after a major disruption, which is discussed in detail in Chapter 6 of the reference. This reference is a foundational document describing the four restoration steps utilities use to bring their systems back online as quickly as possible. The following steps are extracted from [5].
- Assess the extent, locations, and severity of damage to the electricity system.
- Provide the physical and human resources required for repairs.
- Prioritize sites/components for repair based on factors including the criticality of the load and the availability of resources to complete the needed repairs.
- Implement the needed repairs and reassess system state.
As noted in [5], “these general processes are carried out simultaneously by different organizations operating across all elements of the power system.” Figure 2 is a high-level overview of the many organizations that have their own restoration plans; restoration therefore requires collaboration among multiple organizations with different skills and maturity. In addition to local laws and regulations, mutual assistance agreements provide additional resources on an as-required basis.
One of many examples of the interactions between the field crews dispatched by each utility and external agencies is extracted from Chapter 6 of [5]. This example led to an important finding.
When physical disruption of the power system occurs, it is important that utility crews be able to gain rapid access to damaged substations and other facilities so they can safely isolate and de-energize hazardous components, retain and gain access to emergency communications equipment and supplies, promptly assess damage, and start the process of restoration. In that context, the issue of working with law enforcement to gain access becomes critical, both for reasons of safety and because supplying power can be a key component of disaster recovery and of avoiding further risks and damages.
Due to a lack of standing arrangements with law enforcement and other first responders, this is not always possible; informed high-level agreements about access do not always result in smooth operations among key personnel on the ground.
Lastly, it should be noted that Chapter 6 [5] does address disruptions that involve damage to the cyber monitoring and control systems. This Technical Brochure does recognize that CPS systems are damaged and must be restored in a timely manner. However, we do not assume that the widespread disaster resulted from a cyber-attack; that is a subject for others.
Next, we built on the intrinsic nature of incremental restoration of the grid, with emphasis on cyber-physical security issues during each phase of restoration, recovery, and reconstitution. In this respect, the models used in this Technical Brochure should be applied to the incremental system or segment of interest.
Align management actions with a customized disaster recovery plan
Effective management requires the development of a disaster recovery plan that includes the steps described in Figure 3. We assume that an approved disaster recovery plan is in place and is well understood by all members of each utility’s disaster team. Thus, WG D2.50 assumes that emergency contacts have the requisite authority to authorize specified actions, that all roles and responsibilities are assigned, and, most importantly, that the plan has been tested and maintained. Data and backup locations supported by emergency communications are assumed to be available on demand.
Protection of backup and restoration hardware, firmware, and software components includes both physical and technical safeguards. Backup and restoration software includes, for example, router tables, compilers, and other security-relevant system software. It is important to note the following:
An organization should provide the capability to restore information system components within organization-defined restoration time-periods from configuration-controlled and integrity-protected information representing a known, operational state for the components.
Restoring systems often includes the use of emergency communications systems (ECS), that is, communications channels other than the norm. As such, data protection should reside at the data object level in the form of self-protecting data objects. Protecting the data itself makes the protection independent of the communications channel and network topology. The same principle affords the differential access to content necessary in any complex system.
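The idea of a self-protecting data object can be sketched as follows. This is a minimal illustration, not a mechanism specified by the Working Group: it assumes a pre-shared key distributed out of band, the field names and the role label are hypothetical, and it covers only integrity and a simple role check. Confidentiality would additionally require encryption, which is not shown.

```python
import hashlib
import hmac
import json


def seal(payload: dict, required_role: str, key: bytes) -> dict:
    """Wrap a payload with its access label and an HMAC-SHA256 tag."""
    body = {"payload": payload, "required_role": required_role}
    canonical = json.dumps(body, sort_keys=True).encode("utf-8")
    tag = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    return {"body": body, "tag": tag}


def open_sealed(obj: dict, role: str, key: bytes) -> dict:
    """Verify the tag, then release the payload only to the labelled role."""
    canonical = json.dumps(obj["body"], sort_keys=True).encode("utf-8")
    expected = hmac.new(key, canonical, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, obj["tag"]):
        raise ValueError("integrity check failed")
    if role != obj["body"]["required_role"]:
        raise PermissionError("role not permitted for this object")
    return obj["body"]["payload"]


# Hypothetical usage: the object is serialized and sent over any available channel.
shared_key = b"example-key-distributed-out-of-band"
sealed = seal({"device": "relay-01", "settings_version": 12}, "protection_engineer", shared_key)
wire_message = json.dumps(sealed)
received = json.loads(wire_message)
print(open_sealed(received, "protection_engineer", shared_key))
```

Because the tag and the access label travel with the data, the receiver can check the object in the same way whether it arrived over the normal network or over an emergency channel.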
Focus on the use case to initiate disaster recovery operations
Given an approved disaster recovery (DR) plan, Figure 4 shows a general use case to initiate disaster recovery operations. In accordance with the approved plan, a crisis management operator has the assigned authority to execute the plan and direct the operations. As directed by the crisis management operator, each DR team derives its role and responsibility from the DR plan. Using the emergency communication system (ECS), positive control between the DR team and the crisis management operator can be effectively managed. Each DR team’s field instructions include directions to use the ECS to retrieve backup data from one or more specified locations. The cardinality associated with the DR team is important.
Each DR team knows one or more (1..*) locations for the backup data, and each location may service no DR team or multiple DR teams (0..*).
Each crisis management operator is responsible for 1 or more (1..*) DR teams, but each DR team reports to only one crisis management operator (1).
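The cardinalities stated above can be expressed as a small data model with a consistency check. This is a sketch only; the class names, function name, and example instances are hypothetical and are not taken from Figure 4.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class BackupLocation:
    name: str


@dataclass
class DRTeam:
    name: str
    backup_locations: List[BackupLocation] = field(default_factory=list)


@dataclass
class CrisisManagementOperator:
    name: str
    teams: List[DRTeam] = field(default_factory=list)


def cardinality_violations(operators, teams):
    """Return human-readable violations; an empty list means the model is valid."""
    problems = []
    for team in teams:
        # Each DR team knows 1..* backup locations.
        if not team.backup_locations:
            problems.append(f"{team.name}: no backup location (needs 1..*)")
        # Each DR team reports to exactly one crisis management operator.
        owners = [op.name for op in operators if any(t is team for t in op.teams)]
        if len(owners) != 1:
            problems.append(f"{team.name}: reports to {len(owners)} operators (needs 1)")
    for op in operators:
        # Each crisis management operator is responsible for 1..* DR teams.
        if not op.teams:
            problems.append(f"{op.name}: no DR team (needs 1..*)")
    return problems


site_a = BackupLocation("primary backup site")
site_b = BackupLocation("spare backup site")  # serves no team, which 0..* allows
team_1 = DRTeam("DR team 1", [site_a])
team_2 = DRTeam("DR team 2", [site_a])
operator = CrisisManagementOperator("crisis management operator", [team_1, team_2])
print(cardinality_violations([operator], [team_1, team_2]))
```

A check of this kind could be run whenever teams from other utilities are added, so that no team is left without a backup location or with an ambiguous reporting line.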
The rake icon in the use case “initiate DR operations” identifies the need to describe a typical timeline of the activity to initiate the recovery operations. System engineers focus attention on activities because they provide the context in which actions execute. Activities are used, and more importantly reused, through call actions. Call actions allow the composition of DR activities into arbitrarily deep hierarchies, so that a DR activity model can scale from descriptions of simple functions to the extraordinarily complex algorithms and processes inherent in a robust security posture.
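The reuse of activities through call actions can be illustrated with a simple nested structure. The sketch below is an analogy in ordinary code, not a SysML tool export, and the activity and action names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Union


@dataclass
class Action:
    name: str


@dataclass
class Activity:
    name: str
    # A step is either a plain action or a call to another activity.
    steps: List[Union["Action", "Activity"]] = field(default_factory=list)


def flatten(activity: Activity, depth: int = 0) -> List[str]:
    """Expand nested activities into an ordered, indented list of step names."""
    lines = ["  " * depth + activity.name]
    for step in activity.steps:
        if isinstance(step, Activity):
            # Call action: the whole sub-activity is reused at this point.
            lines.extend(flatten(step, depth + 1))
        else:
            lines.append("  " * (depth + 1) + step.name)
    return lines


retrieve_backup = Activity(
    "Retrieve backup data",
    [Action("Connect via ECS"), Action("Download backup")],
)
initiate = Activity(
    "Initiate DR operations",
    [Action("Assign DR teams"), retrieve_backup],
)
restore_device = Activity(
    "Restore device",
    [retrieve_backup, Action("Load configuration")],  # same activity, reused
)

for line in flatten(initiate) + flatten(restore_device):
    print(line)
```

The activity "Retrieve backup data" is defined once and called from two parent activities, which is the reuse property that lets a model grow in depth without duplicating definitions.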
Stressed timeline to initiate wide-area disaster recovery operations
A wide-area disaster is of key importance to the economy because all other domains rely on the availability of electricity. A power outage can therefore have a direct impact on the availability of other services (e.g., transport, finance, communications, water supply) where backup power is not available or the power restoration time exceeds the backup autonomy. In this situation, utilities cannot set their own schedules.
A stressed timeline to initiate wide-area disaster recovery operations is shown in Figure 5. This scenario assumes that all utility resources (field crews) are stressed to their limit. To respond to the crisis, the utility in charge (crisis management operator) implements preplanned cooperative agreements to use field crews from other utilities. Partitions indicate which behaviors are the responsibility of the allocated objective, which in turn specifies the functional requirements of a system or component. Within the context of a use case, the partitions indicate the behaviors that are the responsibility of a specified role.
As noted in Figure 3, each instance of the DR team is constrained by its assigned roles and responsibilities. Given this situation, the time from incident detection until the DR teams are on site for disaster recovery and fully supported with ECS and backup data capability is 3 days. A good example is the DENSK project [6], which provides a European Energy Information Sharing and Analysis Center (EE-ISAC), an information sharing platform (ISP), and a situational awareness network (SAN) [7]. Implementing the DENSK project provides the needed support infrastructure for the processes described in Figure 4. Specifically,
- the actions to activate, establish, notify, and declare a disaster should be defined in the implementing instructions as an executable checklist, and
- the processes to mobilize, activate, and validate will require automation to efficiently execute the plan.
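An executable checklist of the kind described above might look like the following sketch. It assumes that the four actions are performed strictly in the order listed, which is an assumption for illustration; the class names and the operator identifier are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class ChecklistItem:
    action: str
    completed_by: Optional[str] = None
    completed_at: Optional[datetime] = None


@dataclass
class Checklist:
    items: List[ChecklistItem]

    def next_item(self) -> Optional[ChecklistItem]:
        """Return the first item not yet completed, or None if all are done."""
        for item in self.items:
            if item.completed_at is None:
                return item
        return None

    def complete(self, action: str, operator: str) -> None:
        """Mark the next pending action done; reject out-of-order completion."""
        pending = self.next_item()
        if pending is None:
            raise RuntimeError("checklist already complete")
        if pending.action != action:
            raise RuntimeError(f"out of order: expected '{pending.action}'")
        pending.completed_by = operator
        pending.completed_at = datetime.now(timezone.utc)

    def is_complete(self) -> bool:
        return self.next_item() is None


# Assumed ordering of the four actions named in the text.
declare_disaster = Checklist(
    [ChecklistItem(a) for a in ("activate", "establish", "notify", "declare")]
)
for step in ("activate", "establish", "notify", "declare"):
    declare_disaster.complete(step, operator="crisis-management-operator")
print(declare_disaster.is_complete())
```

Recording who completed each step and when also yields the timestamps needed to report on how long each phase of the plan took.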
Each activity shown in Figure 5 includes a rake symbol designating a process activity. How these activities are to be executed is a local matter, as defined by each utility in their disaster recovery policies, procedures, organizational directives, and implementing instructions. Executing the over-ride of cybersecurity controls must be performed under positive control to minimize interruption of operations and be completed in a timely manner to return the system to a secured state. Effective management requires some metrics such as:
- Time from program launch to deployment of simplest functionality.
- Time to field high-priority functions.
- Time required for full regression test (automated) and cybersecurity audit/penetration testing.
- Time required to restore service after an interruption or outage.
- Time required to complete CPS breach postmortem forensic analysis.
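Time-based metrics of this kind can be derived from a log of timestamped events. In the sketch below, the event names, the timestamps, and the metric definitions are hypothetical; only the 3-day interval from incident detection to DR teams being on site is taken from the scenario described earlier.

```python
from datetime import datetime, timedelta

# Hypothetical event log for one recovery operation.
EVENTS = {
    "incident_detected": datetime(2021, 3, 1, 6, 0),
    "dr_teams_on_site": datetime(2021, 3, 4, 6, 0),
    "cps_override_started": datetime(2021, 3, 4, 8, 0),
    "cps_controls_reinstated": datetime(2021, 3, 6, 20, 0),
    "service_restored": datetime(2021, 3, 7, 12, 0),
}

# Each metric is the elapsed time between a start event and an end event.
METRIC_DEFINITIONS = {
    "time_to_on_site": ("incident_detected", "dr_teams_on_site"),
    "cps_override_window": ("cps_override_started", "cps_controls_reinstated"),
    "time_to_restore_service": ("incident_detected", "service_restored"),
}


def compute_metrics(events, definitions):
    """Return a dict of metric name -> timedelta, rejecting reversed intervals."""
    results = {}
    for name, (start_key, end_key) in definitions.items():
        elapsed = events[end_key] - events[start_key]
        if elapsed < timedelta(0):
            raise ValueError(f"{name}: end event precedes start event")
        results[name] = elapsed
    return results


for name, elapsed in compute_metrics(EVENTS, METRIC_DEFINITIONS).items():
    print(f"{name}: {elapsed.total_seconds() / 3600:.1f} h")
```

The length of the over-ride window is included here because the text requires the over-ride to be completed in a timely manner; treating it as a reported metric is an assumption of this sketch.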
Conclusions
This Technical Brochure offers an in-depth view of the issues, benefits, and concerns of proposed solutions that should be considered by EPU disaster recovery teams. These solutions focus on the need for improved people skills, dramatic changes to policies, procedures and organizational directives to assign responsibility and accountability for maintaining a mature security posture, and the use of advanced technologies and tools to implement a proactive or anticipatory security strategy.
Some of the key take-away points from this work are:
- Advances in digitization and ubiquitous connectivity enabled by such standards as IEC 61850 have dramatically increased the attack surface of operational systems. During all stages of disaster recovery and reconstitution, access to critical information is needed by all players, including multiple utilities participating with their field crews and federal, state, and local government support agencies. This Technical Brochure addressed two situations:
- During the initial stages of disaster recovery, selected players need access to and use of substation and related communication networks and power system automation and control devices. Access control mechanisms enabled in these devices need to be disabled to allow access until the system can be restored.
- Prior to restoring service to the customers, access control and use privileges for network devices and power system automation and control devices need to be reinstated and where necessary improved.
- Safety and security, including all aspects of cyber-physical security, are implemented using different standards and regulations and are managed by different stakeholders, so they cannot easily be merged into one program. As discussed in IEC 62443, such independent coupling becomes more complex with the overlay or embedding of security mechanisms.
- Multiple NERC CIP requirements are qualified in terms of “exceptional circumstances.” These circumstances are described in the NERC glossary of terms as a situation that involves or threatens to involve one or more of the following, or similar conditions that impact safety or bulk electric system (BES) reliability: a risk of injury or death; a natural disaster; civil unrest; an imminent or existing hardware, software, or equipment failure; a cybersecurity incident requiring emergency assistance; a response by emergency services; the enactment of a mutual assistance agreement; or an impediment of large scale workforce availability. These situations establish the framework for contingency operations addressed in this technical brochure.
- This CIGRE study further suggests that all CPS handling mechanisms during contingency operations and restoration be assessed using a risk-based approach based on type and criticality of the applications under consideration.
- An agile management scheme, including manual reset, is needed to manage digital certificates during over-ride and restoration activities. If the disaster is not widespread, it is handled by the utility’s own recovery team of employees and contractors. If, however, the disaster is widespread, the recovery team(s) may comprise members from cooperating utilities and support organizations. The latter situation involves multiple certificate authorities, registration authorities, and repositories that must be coordinated by the primary crisis management operator. For example, access control and use control privileges must be managed in a timely and seamless manner as organizations join and leave the disaster response team.
- Considering the dynamics and rapid evolution of the threat landscape, CPS commercial management systems should have an auditable technology readiness level (TRL=7) that necessitates system prototypes in operational environments.
- If cybersecurity restoration is significantly delayed, adversaries, such as nation states and crime organizations, have ample opportunity to probe the system to expose critical asset and network vulnerabilities in the early stages of the kill chain. Thus, it is imperative that CPS systems be reconstituted as soon as possible.
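The requirement that access and use control privileges follow organizations as they join and leave the disaster response team can be sketched as a roster maintained by the primary crisis management operator. This is an illustration under simplifying assumptions: it does not model certificate authorities or cross-certification, and the class, organization, member, and privilege names are hypothetical.

```python
from datetime import datetime, timezone


class ResponseRoster:
    """Tracks which organizations and members currently hold which privileges."""

    def __init__(self):
        self._grants = {}    # organization -> {member: set of privileges}
        self.audit_log = []  # (timestamp, event, organization) tuples

    def _record(self, event, organization):
        self.audit_log.append((datetime.now(timezone.utc), event, organization))

    def join(self, organization, members):
        """Admit an organization; `members` maps member name -> privileges."""
        self._grants[organization] = {m: set(p) for m, p in members.items()}
        self._record("join", organization)

    def leave(self, organization):
        """Remove an organization and revoke all of its members' privileges."""
        if self._grants.pop(organization, None) is not None:
            self._record("leave", organization)

    def is_authorized(self, organization, member, privilege):
        return privilege in self._grants.get(organization, {}).get(member, set())


roster = ResponseRoster()
roster.join("cooperating-utility-A", {"crew-lead-1": ["substation-access", "ied-configure"]})
print(roster.is_authorized("cooperating-utility-A", "crew-lead-1", "ied-configure"))
roster.leave("cooperating-utility-A")
print(roster.is_authorized("cooperating-utility-A", "crew-lead-1", "ied-configure"))
```

Revoking at the organization level when it leaves, rather than member by member, keeps the roster consistent with the mutual assistance agreement that admitted the organization. The audit log supports later review of who held access during the over-ride period.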
This Technical Brochure identified the need for future work as follows:
- The National Association of Regulatory Utility Commissioners (NARUC) described a cybersecurity tabletop exercise guide that provides a framework that can be tailored to include the constraints introduced in this Technical Brochure. Model-based systems engineering (MBSE) models of the problem domain expose these constraints. A new working group is recommended to describe the situations that offer an adversary the opportunity to perform the reconnaissance needed to identify exposed vulnerabilities in operational networks and intelligent electronic devices.
- The field crew may comprise personnel from multiple utilities and other agencies. A well-defined agreement between utilities is needed to establish the chain of command. Another difficulty is how to recognize a legitimate new player with access and use control privileges. One suggestion is that all personnel deployed to recovery sites must have in their possession a verifiable type two mechanism of authentication. This can be accomplished by implementing agreements that establish a centralized management authority and decentralize execution responsibility. Based on how utilities manage digital certification in general, cross-certificate signing between utilities and supporting agencies exposes trust issues that also need to be addressed. Developing such a strategy is left to a future working group.
References
- [1] E. Ragazzi, A. Stefanini, D. Benintendi, U. Finardi, and D. Holstein, "Evaluating the Prudency of Cybersecurity Investments: Guidelines for Energy Regulators," CNR-IRCrES, NARUC, Guideline, May 2020. [Online]. Available: pubs.naruc.org/pub.cfm
- [2] L. P. Costantini and A. Raffety, "Cybersecurity Tabletop Exercise Guide," National Association of Regulatory Utility Commissioners, Washington DC, Guide September 2020. [Online]. Available: www.naruc.org
- [3] S. Tang, A. Liu, and L. Wang, "Power System Reliability Analysis Considering External and Insider Attacks on the SCADA System," presented at the IEEE PES, 2019, Technical summary of thesis.
- [4] Draft Guide for Categorizing Security Needs for Protection, Automation and Control Related Data Files, Technical Guide WG-H22, Unpublished.
- [5] "Enhancing the Resilience of the Nation’s Electricity System," Washington, DC, 2017. [Online]. Available: http://nap.edu/24836
- [6] www.ee-ISAC.eu
- [7] R. Leszczyna, T. Wallis, and M. R. Wróbel, "Developing novel solutions to realise the European Energy–Information Sharing & Analysis Centre," Decision Support Systems, vol. 122, p. 113067, 2019.