On a typical day, careful orchestration keeps the many components of the Advanced Light Source (ALS) running. When atypical situations arise, that orchestration is even more important. This fall, the ALS faced a new challenge when Pacific Gas and Electric (PG&E) announced its strategy for fire prevention: on days predicted to be hot, dry, and windy, PG&E would turn off the electricity in high fire-threat zones, then inspect the power lines and restore power once it deemed conditions safe, hours or days later. In this piece, we’ll walk through the timeline of safely shutting down a synchrotron light source and revisit how the ALS team prepared for and recovered from the PG&E Public Safety Power Shutoff (PSPS) during the second week of October 2019.
Information cascade
The flow of information is important during time-critical situations. The protocol is for the Lab directorate to communicate decisions to the areas, and for the areas to communicate with their divisions. Once notified of an impending power shutoff, ALS management put out the order to safely power down the ALS in advance of the PSPS.
Not a typical shutdown
When the ALS is shut down for routine maintenance, many machines are still connected to power, and all the normal staff are on hand. Because we had to prepare for a power outage of unknown start time and duration, it was important to safely disconnect and power down all equipment in the correct order. We also had to trust that certain pieces of equipment, like the superconducting bend magnets, would switch smoothly from normal power to emergency power and back again.
We interviewed the following people to find out how the ALS responded to the PSPS earlier in October. Through their efforts and those of many more around the ALS, we were able to safely restore beam.
- Marc Allaire, Berkeley Center for Structural Biology beamline scientist
- Carol Baumann, electrical maintenance technician
- Cobber Lam, computer systems engineer
- Mike Martin, group leader for photon science operations
- Alpha N’Diaye, research scientist
- Dula Parkinson, deputy for photon science operations
- Steve Rossi, deputy for business operations
- Fernando Sannibale, deputy for accelerator operations
- Tom Scarvie, operations supervisor for the accelerator and floor operators
- Scott Taylor, environment, safety, and health lead
Tuesday, October 8
2:30 p.m.
ALS management was notified that Berkeley Lab would be closed at midnight and that only personnel performing emergency functions would be allowed on-site after that.
ALS management decided that the ALS would be shut down in advance of the Berkeley Lab closure.
Tom Scarvie and the operators, EMs (electrical maintenance technicians), and Controls group began implementing the Extended ALS Shutdown Checklist.
Scarvie: We started from the normal shutdown list and extended it to turning off all the power breakers to ensure that any power surge when the power returned would not get to the equipment. A lot of that is computer equipment in the server room in Building 15 for the accelerator and the beamlines. The collaboratively developed list was a testament to all the systems experts—they know their systems very well.
The ALS was in an accelerator physics period, which is time that is scheduled for accelerator staff to test and improve accelerator operations. This meant that no users were scheduled for beamtime.
Scarvie: It was a lucky coincidence that there was very little demand on running the accelerator at that point. We were able to start the shutdown significantly earlier than we would have otherwise.
3:00 p.m.
Beam operation was interrupted.
3:00–10:00 p.m.
Each subsystem and area leader had their own list to prepare for the shutdown. Sue Bailey (User Services group leader) began notifying users who had beamtime scheduled for the rest of the week. Stacey Ortega (senior administrator) was in contact with the Berkeley Center for Structural Biology users.
Sannibale: The operators and EMs are the people who do a regular machine shutdown. This time, a lot of additional people were involved because it was a total shutdown, including the control systems people, Unix group, and accelerator group.
Rossi: At the ALS, we’ve got 400 procedures that guide our work, and a number of them deal with shutdowns and securing the facility, but none of them addressed this unknown, multiday, hard power outage.
Sannibale: We knew that if the ac power went out, we’d lose LCW (low conductivity water), so we needed to switch the water cooling for the superbend compressors to city water. The main thing with the accelerator components is that they don’t like spikes in voltage, which can happen when the power is turned back on.
Taylor: You try to identify those things that can go bad. For example, we have toxic gases on the floor in hazardous gas cabinets that are vented and alarmed, but that’s not necessarily going to work when the power’s out. So, Doug Taube [chemical safety specialist] and I went through and turned off or removed all the gases.
Lam: We have a lot of virtual machines—about 60. A dozen of them are control related, so I couldn’t shut those down until the control room was ready to go. I also needed to notify users that machines in our Building 15 server room would be turned off.
We have two clusters. That’s seven physical machines hosting about 60 virtual machines, so it took about three hours to shut down all those virtual machines and hosting computers. These were spread out over two locations—Buildings 80 and 15. After that was done, I walked around the ring to make sure everything was turned off to prevent them from getting damaged. Then I inspected the server room in Building 15 to make sure all computers were shut down.
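The ordering Lam describes, shutting down the guest virtual machines before the physical hosts that run them, is the kind of task that can be scripted. Below is a minimal sketch using the libvirt Python bindings; the connection URIs and the polling interval are hypothetical, and this is not the ALS’s actual tooling.

```python
"""Minimal sketch of a guests-before-hosts shutdown using the libvirt
Python bindings. Host URIs and timing are hypothetical, not ALS tooling."""
import time

import libvirt

# Hypervisors in one cluster (hypothetical connection URIs).
HOSTS = ["qemu+ssh://vmhost1/system", "qemu+ssh://vmhost2/system"]

for uri in HOSTS:
    conn = libvirt.open(uri)

    # Ask every running guest on this host to shut down cleanly (an ACPI
    # shutdown request, so each VM can stop services and unmount disks).
    for dom in conn.listAllDomains():
        if dom.isActive():
            dom.shutdown()

    # Wait until no guests remain running before moving on; the physical
    # host itself would be powered off afterward, at the console.
    while any(d.isActive() for d in conn.listAllDomains()):
        time.sleep(10)

    conn.close()
```

Requesting a clean shutdown and waiting until no guests remain running, rather than cutting power to the hosts directly, is what avoids corrupted file systems and lost state when everything is later powered back up.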
Each beamline scientist also directly communicated with their users and then worked to safely power down and secure the equipment at their beamlines. Vacuum valves were shut to isolate ultra-high-vacuum sections, and control systems were turned off and unplugged.
10:00 p.m.
Extended ALS Shutdown Checklist completed
ALS Director Steve Kevan, Tom Scarvie, and Scott Taylor did a sweep of the buildings.
Scarvie: This is important because there are a lot of users here, and they have varying levels of understanding of what’s going on institutionally. We wanted to make sure nobody was sitting in their office waiting for the beam to come back.
Wednesday, October 9
Early hours
City water was used to cool the superbends; the server room was no longer chilled. Low conductivity water ran until the power went off.
Starting on Wednesday, the ALS management team held calls two to three times a day to touch base and respond to information provided by the Lab.
Thursday, October 10
12:18 a.m.
Power to the Lab was disconnected.
The emergency power systems switched on correctly.
Rossi: We have a couple generators that supply power to the buildings. There’s a transfer switch, and when the switch detects a loss of power, it transfers the load from the mainline to the emergency generator. At the ALS, we don’t have that much connected to backup power; it’s only things that will get damaged from the power loss and are safety critical. So, it’s our blowers on the roof, fume hood controls, superbend compressors.
Scarvie: The superconducting bend magnets need to be kept cold; otherwise, they can go into a heating chain reaction that can require multiple days to recover from. We fully expected that we would not be able to remotely monitor them.
Morning
Taylor: Doug and I walked around the floor to make sure there were no hazards. We have a huge advantage at the ALS—it’s really well lit from emergency lighting and all the windows on the dome. The nice thing about the high roof is that if something did leak, there’s a large volume of air to dissipate it. So, there aren’t too many hazards here.
Allaire: Our users send in samples in dry shippers, which don’t have liquid nitrogen but can keep the samples fresh for about three days. The ones at the beamline have liquid nitrogen. A critical thing for us was to pick up the dewars that were being shipped to the FedEx facility in Emeryville. Stacey had been in constant communication with users and tracking their shipments. There were five in transit, and each dewar contained probably $100,000 of samples. I arranged with colleagues at SLAC [National Accelerator Laboratory] and JBEI [Joint BioEnergy Institute] to take the dewars there to add liquid nitrogen if necessary.
I heard that it might be possible for one of us to come take care of our samples, so then I could bring the dewars directly to the beamline and top them off with liquid nitrogen. I was able to do this for samples at the BCSB beamlines, 4.2.2, and 8.3.1. There were probably 15 dewars on the floor, which would be $1.5 million in samples.
I was with Scott Taylor and Doug Taube. It was really quiet. We’re used to the life of the ALS when it’s running, and then—there was nothing.
Afternoon
Scarvie: Warren Byrne and I came back to check the systems—the temperature of the server room, whether there are giant ice balls on the house liquid nitrogen system, and the superbends. Everything was fine. I have to say, the ALS is a very different and quite peaceful place when everything is quiet and shut off.
6:00 p.m.
Shortly after suggesting that the outage could last for days, PG&E restored power to the Lab around 6:00 p.m. The Lab planned for Facilities to begin re-energizing the buildings on Friday morning.
Scarvie: When we were told that we’d be able to come back in, I contacted the operators and the vacuum group. The main thing we worry about when everything is shut off is the state of the vacuum inside the accelerator and the beamlines.
Friday, October 11
1:00 p.m.
The ALS operators, EMs, vacuum technicians, safety personnel, and managers showed up, and power was restored to the buildings in the following order:
- Building 80
- Building 6 (several different transformers)
- Building 37 (low conductivity water)
- Building 34 (chilled water plant)
- Building 15
Rossi: Chillers have this oil reservoir that needs to be heated up before they can operate again. So, we had to wait 12 hours for that oil basin to heat up before the chillers could be restarted.
Taylor: HVAC specialists came in to make sure the ventilation was up to speed. Doug and I did the EH&S walkthrough because the ALS is more complicated than individual labs. We made sure it was safe for the mechanical technicians and electrical maintenance technicians to come in.
Sannibale: As soon as we got the okay to come back, we reestablished 24/7 shifts of EMs and operators. On the machine side, we had a few thousand switches that needed to be turned back on. This is more extensive than the list for a standard shutdown, so that took extra time. Not too many things broke, but a few things here and there had to be replaced.
Allaire: The vacuum group did a fantastic job in going across all the beamlines and restarting. When I came in, the vacuum on our beamlines was almost all correct. It was amazing.
Taylor: Mike DeCool, Don MacGill [vacuum systems technical associates], and the mechanical technicians—the whole crew came in. They worked really hard on restoring everything.
Baumann: First, they had to make sure the fire alarm, the HVAC, and all of that was okay. Around 2:00 p.m., we started turning on the smaller power supplies.
Scarvie: It took another day to get the full cooling capacity for the server room, so we made the decision not to turn on the beamline computers. We were closely watching the temperatures, keeping the door propped open, and running fans and air conditioners.
Lam: They were starting up the control instruments, and the facility also needed a server running on our virtual infrastructure and a dozen virtual machines for the control room. There are nine physical machines that support the whole virtualization infrastructure. That’s the minimum I need to get started to support the controls. Jackie Scoggins and Susan James from the Scientific Computing Services group also played a big part in the ALS shutdown and power-up process. Some of the critical infrastructure that runs the ALS is on Linux servers.
Saturday, October 12
Martin: Beamline scientists were allowed to enter the otherwise closed Lab to begin turning on and testing their equipment. Most vacuum systems were in good shape and started back up quickly. A handful of motor controllers lost key information; some only needed to be re-homed, while others needed recalibration too, and a few would not turn back on and had to be replaced. A couple of computer data systems and a couple of backing pumps also had issues turning back on. A superconducting magnet at a beamline lost all its liquid helium and required a significant amount of LHe to recover operations. But on the whole, the vast majority of systems came back on well, and beamline operations started up about as well as we could expect.
Baumann: We lost a couple of ion gauge controllers for the storage ring as well as on the beamlines. Also, the power supply for the superbend gateway died. Eric Williams [software developer] was able to implement an alternative solution for that. I was leaving for the day when we found a failed power supply in the booster pit. Unfortunately, this was a 150-pound supply, and we needed to ask Monroe Thomas [mechanical technician] to come in that night. He ended up staying to crane a couple of supplies over for us.
Sunday, October 13
Morning
Booster power supplies turned on.
2:00 p.m.
Beam could be stored, but lifetime was really poor, and there were limitations on how high the current could go.
Monday, October 14
Morning
Lam: When I started turning everything back on, there was a problem with Hiroshi Nishimura’s machine, and he’s one of the biggest control room application developers. So, I spent three days just trying to fix that one machine.
4:00 p.m.
Shutters open for beamline scientists to test. Many beamlines realigned and tested all of their motors.
Tuesday, October 15
5:00 a.m.
Shutters closed for maintenance and testing.
Wednesday, October 16
8:00 a.m.
Beam available. Beamlines were ready for user operations at different times, depending on the condition of their equipment upon power restoration.
Allaire: The BCSB team put an outstanding effort into recovering our five beamlines and preparing them for operations. Anthony Rozales, whose shift is 5:00 a.m. to 1:00 p.m., was ready to start four hours earlier than expected on one beamline. We were able to reschedule someone who had lost beamtime the week before. We were explicitly telling users, “As soon as you’re done, let us know so that we can piggyback someone else onto the beamline.” We also had a couple of cancellations that helped with the rescheduling process.
Parkinson: I had a user who had been growing plants specifically for this scheduled beamtime. Another user had prepared rat brains for this time, and in that prepared state, they won’t last another four months until beamtime in the next cycle.
N’Diaye: My collaborator was going to measure magnetism in ancient rocks. The power outage delayed the experiment, but for 3-billion-year-old samples, waiting three more weeks won’t make a big difference.
In the midst of the PSPS, the team effort at the ALS was a light in the darkness. Sannibale applauded the group for their contributions, saying, “One thing that really impressed me was that all of this work was on a voluntary basis. No one was forced to come in. The average response was very positive.”
Similarly, Martin saw great cooperation among the beamline staff. He noted, “Often our PS Ops Programs staff worked together across their Program to help and be a second pair of eyes for both safe shutdown and startup activities. It was a great use of our relatively new Photon Science structure, and we continue to learn from each other about what went well and what lessons we can apply to the next event at the ALS.”