Status: Complete, action items in progress
Impact: Denial of service for smart contracts relying on the get_price entrypoint. No funds were directly at risk at any time during the incident.
Root Causes: Data transmitters were using multiple nodes to retrieve oracle jobs and fulfill them. One node (letzbake) got stuck while its RPC remained available, yielding inconsistent behaviour that effectively blocked the data transmitter.
Trigger: Adaptation of the oracle job following a certificate change on one of the CEX servers.
Resolution: Data transmitters have been updated to ignore nodes that are not in sync.
Detection: Data transmitters not providing data consistently.
|Identify reason for failure
|Add more data transmitters for resilience
|Adapt client software to use only nodes that are in sync
|Trigger upgrade request of data transmitters
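
The action item on using only nodes that are in sync can be illustrated with a minimal sketch. It is not the actual data transmitter code: the `NodeStatus` structure, the `in_sync_nodes` function, and both thresholds are hypothetical and chosen for illustration only. The idea is that an RPC that still answers is not enough; the client also checks that the node's head block is recent and not far behind the best level seen across the configured nodes.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class NodeStatus:
    """Hypothetical snapshot of what a node's RPC reported about its head block."""
    name: str
    head_level: int            # block height reported by the node
    head_timestamp: datetime   # timestamp of the node's head block


def in_sync_nodes(statuses, now,
                  max_head_age=timedelta(minutes=2),  # assumed threshold
                  max_level_lag=2):                   # assumed threshold
    """Return only the nodes whose head is recent and close to the best known level."""
    if not statuses:
        return []
    best_level = max(s.head_level for s in statuses)
    return [
        s for s in statuses
        if now - s.head_timestamp <= max_head_age
        and best_level - s.head_level <= max_level_lag
    ]


# Example: one healthy node and one stuck node whose RPC still responds.
now = datetime(2022, 5, 14, 8, 0, tzinfo=timezone.utc)
statuses = [
    NodeStatus("healthy", 1000, now - timedelta(seconds=30)),
    NodeStatus("stuck", 990, now - timedelta(minutes=20)),
]
usable = [s.name for s in in_sync_nodes(statuses, now)]
print(usable)
```

The head-age check matters even when every configured node is stuck at the same level: in that case the level comparison alone would pass, but the stale timestamps would still exclude all of them.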
- The client software was fixed and upgraded on a weekend within a short amount of time.
- Data transmitters updated their client software with the new fix
- Wrong assumptions were made about the behaviour of the RPC when a node is out of sync.
- The market was stable at the time, so the impact was minimal.
- The data transmitter software is crucial to the oracle health.
- Updates to the transmitter software should only happen on weekdays.
- Edge cases and uncommon scenarios should be tested more.
- No assumptions about the health of the infrastructure can be made; it needs to be checked.
- Upgrading data transmitters takes a long time.
2022-05-14 (all times UTC)
- 07:47 A new oracle job is published (oot4G5gCabsvEYgQxXtT478uJkCGyANbhXcUo81H6fmBLrmbNqg)
- 08:00 Data transmitters fail to fulfill job
- 09:00 Oracle blocks get_price as expected, monitoring goes off
- 11:00 Cause is identified and a fix is being worked on
- 14:00 Signed binaries for upgrade are ready
- 19:44 Taurus updates data transmitter with fix
- 20:00 The oracle is back up
Note: After the hotfix, the proper adaptation of the data transmitters was carried out. Distributing and upgrading the data transmitters took the various parties 2 weeks.