Oracle Node Shared State Postmortem (Incident #3)
Date: 2022-05-14
Authors: dcale
Status: Complete, action items in progress
Summary:
Impact: Denial of Service for smart contracts relying on the get_price entrypoint. No funds were directly at risk at any time during the incident.
Root Causes: Data transmitters were using multiple nodes to retrieve oracle jobs and fulfill them. One node (letzbake) got stuck while its RPC remained available, yielding inconsistent behaviour that effectively blocked the data transmitters.
Trigger: Adaptation of the oracle job due to a certificate change on one of the CEX servers.
Resolution: Data transmitters have been updated to ignore nodes that are not in sync.
Detection: Data transmitters not providing data consistently.
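The resolution above (ignoring nodes that are not in sync) can be illustrated with a minimal sketch. This is not the actual data transmitter code: the `NodeStatus` type, the `in_sync_nodes` helper, and both thresholds are hypothetical, and the sketch assumes each node's RPC reports the level and timestamp of its current head block. The point is that a node answering RPC calls is not necessarily healthy, so the head it reports is checked for freshness and for lag against the other nodes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class NodeStatus:
    """Head information reported by one node (hypothetical structure)."""
    name: str
    head_level: Optional[int]           # None if the RPC did not answer
    head_timestamp: Optional[datetime]  # None if the RPC did not answer


def in_sync_nodes(
    statuses: List[NodeStatus],
    now: datetime,
    max_head_age: timedelta = timedelta(minutes=5),  # assumed threshold
    max_level_lag: int = 2,                          # assumed threshold
) -> List[str]:
    """Return the names of nodes that look in sync.

    A node is kept only if it answered, its head is recent enough, and it
    is not lagging behind the highest head level seen across all nodes.
    """
    reachable = [
        s for s in statuses
        if s.head_level is not None and s.head_timestamp is not None
    ]
    if not reachable:
        return []
    best_level = max(s.head_level for s in reachable)
    return [
        s.name
        for s in reachable
        if best_level - s.head_level <= max_level_lag
        and now - s.head_timestamp <= max_head_age
    ]


if __name__ == "__main__":
    # Illustrative values only, not data from the incident.
    now = datetime(2022, 5, 14, 8, 0, tzinfo=timezone.utc)
    statuses = [
        NodeStatus("stuck-node", 100, now - timedelta(hours=1)),
        NodeStatus("healthy-node", 220, now - timedelta(seconds=20)),
        NodeStatus("unreachable-node", None, None),
    ]
    print(in_sync_nodes(statuses, now))
```

The age check matters on its own: if every configured node is stuck at the same level, the lag check passes, but the stale head timestamp still excludes them.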
Action Items:
Action Item | Type | Owner | State |
---|---|---|---|
Identify reason for failure | mitigate | dcale | DONE |
Add more data transmitters for resilience | mitigate | florin | DONE |
Adapt client software to use only nodes that are in sync | mitigate | dcale | DONE |
Trigger upgrade request of data transmitters | mitigate | pascal | DONE |
Lessons Learned:
What went well:
- The client software was fixed and upgraded on a weekend in a short amount of time
- Data transmitters updated their client software with the new fix
What went wrong:
- Wrong assumptions about the behaviour of the RPC when the node is out of sync.
Where we got lucky:
- The market was stable at the time, so the impact was minimal.
Conclusion:
- The data transmitter software is crucial to the oracle health.
- Updates to the transmitter software should only happen on weekdays.
- Edge cases and uncommon scenarios should be tested more.
- No assumptions about the health of infrastructure can be made; we need to check it.
- Upgrading data transmitters requires a long time.
Timeline: 2022-05-14 (all times UTC)
- 07:47 A new oracle job is published (oot4G5gCabsvEYgQxXtT478uJkCGyANbhXcUo81H6fmBLrmbNqg)
- 08:00 Data transmitters fail to fulfill job
- 09:00 Oracle blocks get_price as expected, monitoring goes off
- 11:00 Reason is debugged and fix is being worked on
- 14:00 Signed binaries for upgrade are ready
- 19:44 Taurus updates data transmitter with fix
- 20:00 The oracle is back up
Note: After the hotfix, the proper adaptation of the data transmitters was done; distributing and upgrading them took the various parties 2 weeks.