Oracle Node Shared State Postmortem (Incident #3)

Date: 2022-05-14

Authors: dcale

Status: Complete, action items in progress

Summary:

Impact: Denial of service for smart contracts relying on the get_price entrypoint. No funds were directly at risk at any time during the incident.

Root Causes: Data transmitters were using multiple nodes to retrieve oracle jobs and fulfill them. One node (letzbake) got stuck while its RPC remained available, which produced inconsistent behaviour and effectively blocked the data transmitters.

Trigger: Adaptation of the oracle job following a certificate change on one of the CEX servers.

Resolution: Data transmitters have been updated to ignore nodes that are not in sync.

Detection: Data transmitters were not providing data consistently.

Action Items:

Action Item	Type	Owner	State
Identify reason for failure	mitigate	dcale	DONE
Add more data transmitters for resilience	mitigate	florin	DONE
Adapt client software to use only nodes that are in sync	mitigate	dcale	DONE
Trigger upgrade request of data transmitters	mitigate	pascal	DONE
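
As an illustration of the "use only nodes that are in sync" item, the following is a minimal sketch and not the actual client code. It assumes the transmitter can read each node's head block level and head timestamp through its RPC. The `NodeStatus` shape, the `in_sync_nodes` helper, the node names, and both thresholds are hypothetical and chosen only for the example. A node whose RPC still answers but whose head is old or far behind the best known head is dropped.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class NodeStatus:
    """Snapshot of one node as reported by its RPC (hypothetical shape)."""
    name: str
    head_level: int            # height of the node's current head block
    head_timestamp: datetime   # timestamp of that head block, in UTC


def in_sync_nodes(statuses, now,
                  max_head_age=timedelta(minutes=2),  # assumed threshold
                  max_level_lag=2):                   # assumed threshold
    """Keep nodes whose head is recent and close to the best known head."""
    if not statuses:
        return []
    best_level = max(s.head_level for s in statuses)
    return [
        s for s in statuses
        if now - s.head_timestamp <= max_head_age
        and best_level - s.head_level <= max_level_lag
    ]


if __name__ == "__main__":
    # Illustrative values only, not data from the incident.
    now = datetime(2022, 1, 1, 12, 0, tzinfo=timezone.utc)
    statuses = [
        NodeStatus("node-a", 1000, now - timedelta(seconds=30)),
        # Stuck node: the RPC still responds, but the head stopped advancing.
        NodeStatus("node-b", 940, now - timedelta(minutes=30)),
    ]
    print([s.name for s in in_sync_nodes(statuses, now)])
```

Checking the head age in addition to the level lag matters when every configured node is stuck at the same level: the relative comparison alone would then consider all of them healthy.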

Lessons Learned

What went well

The client software was fixed and upgraded on a weekend within a short amount of time.
Data transmitters updated their client software with the new fix.

What went wrong

Wrong assumptions on the behaviour of the RPC if the node is out of sync.

Where we got lucky

The market was stable at the time, so the impact was minimal.

Conclusion

The data transmitter software is crucial to the oracle's health.
Updates to the transmitter software should only happen on weekdays.
Edge cases and uncommon scenarios should be tested more.
No assumptions about the health of infrastructure can be made; it has to be checked.
Upgrading data transmitters requires a long time.

Timeline

2022-05-14 (all times UTC)

07:47 A new oracle job is published (oot4G5gCabsvEYgQxXtT478uJkCGyANbhXcUo81H6fmBLrmbNqg)
08:00 Data transmitters fail to fulfill job
09:00 Oracle blocks get_price as expected, monitoring goes off
11:00 Cause of the failure is debugged and a fix is being worked on
14:00 Signed binaries for upgrade are ready
19:44 Taurus updates data transmitter with fix
20:00 The oracle is back up

Note: After the hotfix, the proper adaptation of the data transmitters was done. Distributing the new version and upgrading the data transmitters took the various parties 2 weeks.