Oracle Node Shared State Postmortem (Incident #3)

Date: 2022-05-14

Authors: dcale

Status: Complete, action items in progress

Summary:

Impact: Denial of service for smart contracts relying on the get_price entrypoint. No funds were directly at risk at any time during the incident.

Root Causes: Data transmitters used multiple nodes to retrieve oracle jobs and fulfill them. One node (letzbake) got stuck while its RPC remained reachable. This produced inconsistent behaviour that effectively blocked the data transmitters.

Trigger: Adaptation of the oracle job following a certificate change on one of the CEX servers.

Resolution: Data transmitters have been updated to ignore nodes that are not in sync.
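
The in-sync filter described in the resolution could look roughly like the sketch below. It is not the actual data transmitter code: the `NodeStatus` type, the `in_sync_nodes` function, and both thresholds (`max_head_age`, `max_level_lag`) are hypothetical names and values chosen for illustration. It assumes each node's RPC can report the level and timestamp of its current head block.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class NodeStatus:
    """Hypothetical snapshot of what a node's RPC reports about its head block."""
    name: str
    head_level: int           # block height of the node's current head
    head_timestamp: datetime  # timestamp of that head block (UTC)


def in_sync_nodes(statuses, now,
                  max_head_age=timedelta(minutes=5),  # assumed threshold
                  max_level_lag=2):                   # assumed threshold
    """Keep only nodes whose head is recent and close to the best known level.

    A node that still answers RPC calls but has stopped advancing fails the
    age check (and usually the lag check too), so it is ignored.
    """
    if not statuses:
        return []
    best_level = max(s.head_level for s in statuses)
    return [
        s for s in statuses
        if now - s.head_timestamp <= max_head_age
        and best_level - s.head_level <= max_level_lag
    ]


if __name__ == "__main__":
    now = datetime(2022, 5, 14, 8, 0, tzinfo=timezone.utc)
    statuses = [
        NodeStatus("node-a", 1000, now - timedelta(seconds=30)),
        NodeStatus("stuck-node", 940, now - timedelta(minutes=30)),
    ]
    print([s.name for s in in_sync_nodes(statuses, now)])  # ['node-a']
```

Checking head age as well as level lag matters when every configured node is stuck: the lag check alone would pass, because all nodes agree on the same stale level, while the age check returns an empty list and lets the caller fail loudly.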

Detection: Data transmitters not providing data consistently.

Action Items:

| Action Item | Type | Owner | State |
| --- | --- | --- | --- |
| Identify reason for failure | mitigate | dcale | DONE |
| Add more data transmitters for resilience | mitigate | florin | DONE |
| Adapt client software to use only nodes that are in sync | mitigate | dcale | DONE |
| Trigger upgrade request of data transmitters | mitigate | pascal | DONE |

Lessons Learned

What went well

  • The client software was fixed and upgraded on a weekend in a short amount of time
  • Data transmitters updated their client software with the new fix

What went wrong

  • Wrong assumptions about how the RPC behaves when the node is out of sync.

Where we got lucky

  • The market was stable at the time, so the impact was minimal.

Conclusion

  • The data transmitter software is crucial to the oracle health.
  • Updates to the transmitter software should only happen on weekdays.
  • Edge cases and uncommon scenarios should be tested more.
  • No assumptions about the health of infrastructure can be made; we need to check.
  • Upgrading data transmitters takes a long time.

Timeline

2022-05-14 (all times UTC)

  • 07:47 A new oracle job is published (oot4G5gCabsvEYgQxXtT478uJkCGyANbhXcUo81H6fmBLrmbNqg)
  • 08:00 Data transmitters fail to fulfill the job
  • 09:00 Oracle blocks get_price as expected, monitoring alert goes off
  • 11:00 The cause has been debugged and a fix is being worked on
  • 14:00 Signed binaries for upgrade are ready
  • 19:44 Taurus updates its data transmitter with the fix
  • 20:00 The oracle is back up

Note: After the hotfix, the data transmitters were properly adapted. Distributing and upgrading the data transmitters took the various parties 2 weeks.