Oracle Ithaca Upgrade Postmortem (Incident #2)
Date: 2022-04-01
Authors: dcale
Status: Complete, action items in progress
Summary:
Impact: Denial of service for smart contracts relying on the get_price entrypoint. No funds were directly at risk at any time during the incident.
Root Causes: The upgrade to Ithaca introduced a 7x increase in gas consumption for contracts in a non-cached state. Because the data transmitters used a fixed value for gas consumption, their transactions exceeded that value and failed.
Trigger: Ithaca upgrade.
Resolution: Data transmitters have been updated with new client software that allows gas usage to be configured.
Detection: Failed transactions after the Ithaca upgrade.
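The root cause and resolution above can be illustrated with a minimal sketch. It is not the actual data transmitter code: `JobConfig`, `operation_succeeds`, and all numeric values are hypothetical and chosen only to show why a hard-coded gas value breaks when consumption rises 7x, and why a per-job configurable value avoids a client release for every change.

```python
from dataclasses import dataclass


@dataclass
class JobConfig:
    """Hypothetical per-job settings for a data transmitter."""
    gas_limit: int      # upper bound on gas the operation may consume
    storage_limit: int  # upper bound on storage the operation may allocate


def operation_succeeds(gas_needed: int, config: JobConfig) -> bool:
    """An operation fails when it needs more gas than its declared limit."""
    return gas_needed <= config.gas_limit


# Illustrative numbers only, not measured values from the incident.
GAS_BEFORE_UPGRADE = 10_000
GAS_AFTER_UPGRADE_UNCACHED = 7 * GAS_BEFORE_UPGRADE  # the 7x increase

# Fixed value baked into the old client: enough before the upgrade only.
hardcoded = JobConfig(gas_limit=12_000, storage_limit=100)

# Value supplied through the job after the fix: can be raised without
# shipping new client software.
reconfigured = JobConfig(gas_limit=80_000, storage_limit=100)

print(operation_succeeds(GAS_BEFORE_UPGRADE, hardcoded))             # True
print(operation_succeeds(GAS_AFTER_UPGRADE_UNCACHED, hardcoded))     # False
print(operation_succeeds(GAS_AFTER_UPGRADE_UNCACHED, reconfigured))  # True
```

Keeping the limit in the job definition rather than in the client means a future change in gas consumption can be handled by the scheduler, without waiting for every transmitter operator to upgrade.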
Action Items:
Action Item | Type | Owner | State |
---|---|---|---|
Identify reason for failure | mitigate | dcale | DONE |
Send no-op transaction as hotfix to bring oracle in storage | mitigate | florin | DONE |
Adapt client software to allow for gas usage to be specified in job | mitigate | dcale | DONE |
Deploy new scheduler to allow specification of gas and storage usage | mitigate | florin | DONE |
Trigger upgrade request of data transmitters | mitigate | pascal | DONE |
Lessons Learned:
What went well:
- The Tezos community helped to identify the issue. (Thank you!!!)
What went wrong:
- Wrong assumptions about the protocol were made: the data transmitters relied on a fixed gas value.
Where we got lucky:
- The market was stable at the time, so the impact was minimal.
Conclusion:
- Tezos is an evolving protocol. This brings many advantages, but protocol upgrades can change behavior that clients rely on, such as gas consumption.
- Upgrading data transmitters takes a long time.
Timeline:
2022-04-01 (all times UTC)
- 19:06 First operations start to fail
- 19:34 Asked the Tezos developer chat about Ithaca changes that could cause this
- 21:20 Gas limit is discovered as a symptom of the failures
- 22:12 Protocol caching is discovered as cause
- 09:23+1 Understanding what influences cache and identifying "no-op" as potential fix
- 10:41+1 Signatures requested from multisig participants
- 13:22+1 Hotfix to bring oracle back into cache (onxPpj6W9E1PKxVd4y3jBeXC7fbvYJ42YDLJx2t3rfY6Z3JYayG)
Note: After the hotfix, the data transmitters were properly adapted. Distributing the new software and upgrading the data transmitters took the various parties 2 weeks.