ENG-771: Add upgrade-info json file to determine when upgrades are applied + rpc + cli #793
Conversation
Haven't seen it yet, but I expected that it will be looking for heights that it does not have hardcoded: heights coming from the upgrade JSON file that it learned about, but doesn't know how to execute. The tests in the description seem to be in line with this, though 👍
```rust
/// whether this is a required upgrade or not. A required upgrade
/// will cause the node to freeze and not process any more blocks
/// if it does not have the corresponding Upgrade defined to
/// migrate to that version
#[arg(long)]
pub required: bool,
```
Just thinking about when an upgrade wouldn't be required. I know we talked about upgrades where you can download the new software and replace the existing one with it right away, because the binary is backwards compatible, and it will apply the upgrade when the time comes. Why the upgrade list would know this, though...

OTOH if some nodes update the `app_version`, then everyone must do it, because it's part of the ledger. So an optional upgrade must not update the app version.
I'm also not sure how an optional upgrade would be handled if the node with the upgrade is started later than the height. Nothing will trigger that upgrade, so we don't know if it's been applied or not?
```rust
/// Query the current upgrade schedule
UpgradeSchedule,
```
I am not exactly sure what an upgrade schedule is; maybe the docstring could explain a bit more.
For example I understand it can be the contents of the JSON file, but I was wondering if it could also include the versions that the node is capable of executing.
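A sketch of what a fuller docstring could say; the enum name here is a stand-in and the description reflects my reading of the PR, not its actual wording:

```rust
// Sketch only: `Query` is a hypothetical stand-in for the enum this
// variant lives in; the docstring wording is mine.
pub enum Query {
    /// Query the current upgrade schedule: the list of upgrades
    /// (block height, app version, required flag) loaded from the
    /// upgrade JSON file. This is what the node knows about, not
    /// necessarily what this binary is capable of executing.
    UpgradeSchedule,
}
```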
```diff
@@ -235,7 +237,7 @@ async fn test_applying_upgrades() {
     tester.commit().await.unwrap();

     // check that the app_version was upgraded to 1
-    assert_eq!(tester.state_params().app_version, 1);
+    assert_eq!(tester.state_params().app_version, block_height as u64 + 10);
```
I think `block_height + 10` as `app_version` is puzzling. I kept wondering why that should be true, and it's true in this particular set of hardcoded values, but it looks like some kind of rule that one has to make sense of. Perhaps it would be more straightforward to just iterate over the known list of (height, version) pairs and assert them (see the sketch below), rather than have a formula for it that works now, when it's not clear where the upgrades are defined that this offset must be adhered to.
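A minimal sketch of the table-driven assertion I have in mind, to go inside `test_applying_upgrades`; `execute_block` is a hypothetical helper for however the test advances the chain, and the (height, version) pairs are illustrative, not the PR's actual values:

```rust
// Hypothetical (height, app_version) pairs; use the ones the test's
// upgrade schedule actually defines.
let expected: &[(i64, u64)] = &[(11, 1), (12, 2), (13, 3)];
for (height, version) in expected {
    // `execute_block` stands in for however the test reaches `height`.
    tester.execute_block(*height).await.unwrap();
    tester.commit().await.unwrap();
    // assert the exact pair instead of deriving it from a formula
    assert_eq!(tester.state_params().app_version, *version);
}
```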
```diff
@@ -19,6 +23,9 @@ use super::{
     FvmMessage, FvmMessageInterpreter,
 };
+
+/// Indicates whether the node has been frozen due to not having a required upgrade.
+pub static IS_FROZEN: AtomicBool = AtomicBool::new(false);
```
Can this live inside the `FvmMessageInterpreter` instead of being a public static global field? The way it is, it can be set from anywhere, by anyone. There has been no rule so far that there can be only one interpreter instance; it's just that the application only creates one, so in practice there is no problem with multiple conflicting values.
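A minimal sketch of what I mean, keeping the atomic but making it a private field; the method names are mine:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Sketch only: the real FvmMessageInterpreter has more fields.
pub struct FvmMessageInterpreter {
    /// Set when a required upgrade is missing and the node must stop.
    is_frozen: AtomicBool,
}

impl FvmMessageInterpreter {
    fn freeze(&self) {
        self.is_frozen.store(true, Ordering::Relaxed);
    }

    fn is_frozen(&self) -> bool {
        self.is_frozen.load(Ordering::Relaxed)
    }
}
```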
```rust
});

tracing::info!(app_version = state.app_version(), "upgraded app version");
if let Some(upgrade_info) = self.upgrade_schedule.get(state.block_height()) {
```
I was thinking whether we should also add this check to the `ProposalInterpreter` implementations (which this interpreter does not implement), but I think you are right, this should be sufficient. If a validator proposes a block without knowing whether it will be able to execute it, it can still be finalized and cause all nodes to freeze there during delivery.

If we live with this model, then it's also probably right to keep the execution of the upgrade in `end`, because the decision about what transactions to put into the block has been made by the current version of the software... but no, that doesn't make any difference: if the new version has to execute the block, it will have to be able to handle all the transactions that the previous one already put into it.

So that could be prevented by adding the freeze already to the proposal handling: if for example we wanted to get rid of a certain transaction type, we could just say "we're no longer eligible to create the next block", rather than put it in, then consider it invalid.

If we want to handle such cases by marking these transactions as invalid, then I reckon the upgrade should be done in `begin` after all, which signifies: once this block begins, it's the new rules, even if the transactions are old. Versus suggesting that "until the end, it's the old rules", which could leave the new version in a conundrum about whether to be backwards compatible just for the sake of a single block.
```rust
}
FvmQuery::NodeState => {
    let ret = NodeState {
        frozen: IS_FROZEN.load(Ordering::Relaxed),
```
So this one is executed by the same interpreter; it could access the flag as a member variable.
Maybe it could tell which upgrade it has been frozen on, instead of just a bool.
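Something along these lines; all the names and type shapes here are my assumptions:

```rust
// Sketch of a more informative response; all names are assumptions.
type ChainEpoch = i64; // stand-in for the fvm_shared alias

pub struct UpgradeInfo {
    pub height: ChainEpoch,
    pub app_version: u64,
    pub required: bool,
}

pub struct NodeState {
    /// None while running normally; otherwise the upgrade the node
    /// froze on because it had no migration for it.
    pub frozen_on: Option<UpgradeInfo>,
}
```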
```rust
pub struct UpgradeSchedule {
    // schedule of upgrades to be executed at given height
    schedule: BTreeMap<ChainEpoch, UpgradeInfo>,
}
```
What I see missing here is a file system watcher. The JSON is only read when the node starts. We want the node to stop when a new upgrade is announced and its height arrives while the old version is still running; if it doesn't refresh the file, it won't know about the upgrade.

I reckon it could keep a reference to the file and start a background loop that checks it every now and then, then loads new items into the `schedule`; see the sketch below.

This is also important because without a governance module, the upgrade schedule somehow has to make it to all full nodes, which is why I think the visor should be able to download it and overwrite the file on disk. Not sure how to make sure that doesn't happen for the guy who is adding upgrades by hand.
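A rough sketch of the polling loop I have in mind, assuming tokio and serde_json; the file format, the interval, and the merge rules are all assumptions:

```rust
use std::{path::PathBuf, time::Duration};
use serde::Deserialize;

// Assumed shape of one entry in the upgrade JSON file.
#[derive(Deserialize)]
struct UpgradeInfo {
    height: i64,
    app_version: u64,
    required: bool,
}

/// Re-read the upgrade file periodically and merge new entries.
async fn watch_upgrade_file(path: PathBuf) -> anyhow::Result<()> {
    loop {
        let contents = tokio::fs::read_to_string(&path).await?;
        let infos: Vec<UpgradeInfo> = serde_json::from_str(&contents)?;
        // Merge `infos` into the shared schedule here, e.g. behind a
        // Mutex<UpgradeSchedule>, skipping heights already present.
        let _ = infos;
        tokio::time::sleep(Duration::from_secs(60)).await;
    }
}
```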
```rust
pub fn add(&mut self, upgrade_info: UpgradeInfo) -> anyhow::Result<()> {
    match self.schedule.entry(upgrade_info.height) {
```
Maybe `add` should check not only that the height is unoccupied, but that the version fits between the previous and next upgrades. I'm not sure what kind of semantics we put on the app version, but since it identifies the upgrade, I reckon making it strictly monotonic is a reasonable default; something like the sketch below.
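A sketch of the check, written against a bare height-to-version map to keep it short; the real code would look at the `UpgradeInfo` values in `schedule` instead:

```rust
use std::collections::BTreeMap;
use anyhow::bail;

/// Check that `app_version` sits strictly between the versions of the
/// previous and next scheduled upgrades.
fn check_monotonic(
    schedule: &BTreeMap<i64, u64>, // height -> app_version
    height: i64,
    app_version: u64,
) -> anyhow::Result<()> {
    // nearest upgrade scheduled before this height, if any
    if let Some((_, prev)) = schedule.range(..height).next_back() {
        if *prev >= app_version {
            bail!("app version must be greater than the previous upgrade's");
        }
    }
    // nearest upgrade scheduled after this height, if any
    if let Some((_, next)) = schedule.range(height + 1..).next() {
        if *next <= app_version {
            bail!("app version must be less than the next upgrade's");
        }
    }
    Ok(())
}
```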
Great 👏

I added a lot of exploratory comments about design. Curious to hear others' opinions on the semantics of app version, for example, as any upgrade that touches it is consensus breaking, and I find it hard to reason about `required` because of this.

What I'd be curious about as an operator is whether I can upgrade my node to any particular future version and expect it to work before the upgrades are applied. For example, maybe a migration changes the data schema of an IPC actor. Until it's executed, we cannot expect the new bindings to work, so we couldn't start a fendermint that has new bindings ahead of time. Maybe fendermint could have a minimum application version that it checks at the start, and quits if it's not time for this particular binary to run yet.
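For example, something like this at startup; `MIN_APP_VERSION` and the whole check are hypothetical, not part of this PR:

```rust
// Hypothetical startup guard: refuse to run a binary whose bindings
// need a migration that hasn't been executed on chain yet.
const MIN_APP_VERSION: u64 = 1;

fn check_startup_version(current_app_version: u64) -> anyhow::Result<()> {
    if current_app_version < MIN_APP_VERSION {
        anyhow::bail!(
            "this binary requires app version >= {MIN_APP_VERSION}, \
             but the chain is at {current_app_version}; keep running \
             the previous release until the upgrade is applied"
        );
    }
    Ok(())
}
```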
Closing while working on addressing remaining comments
As a general practice, I find it best to revert to draft rather than close. It keeps things observable while clearly signalling that it's not ready for review.
I agree, and this is the first time I have closed a PR instead of putting it to draft. The reason was that I have on several occasions had my PRs reviewed while they were in draft, and I thought this time that might happen again, as I had pushed several commits and didn't want someone to review it before I push the remaining ones. But sure, I'll move this to draft, and let's see what happens :)
Context
This PR updates fendermint to support the future ipcvisor tool we are building to make upgrades easier to orchestrate for our partners (see post).

In order to facilitate the orchestration, this PR updates fendermint to:

- Have hardcoded `Upgrade` migrations and a separate upgrade schedule (a list of `UpgradeInfo` structs) which maps block heights to the hardcoded upgrade migrations.
- Freeze when it reaches an upgrade which is `required` but which it does not have the source code of.

This PR introduces all that functionality, which should be enough for the `ipcvisor` to do its thing.

Note, the following are not implemented in this PR:
Test plan
Test the new functionality, including that fendermint freezes and that the RPC/CLI work as expected.

By default, when fendermint starts it will automatically create an empty upgrade-migration.json file:

Let's try adding a new upgrade info which we don't have the hardcoded Upgrade for; let's also make it `required` to make things interesting:

Now observe that fendermint freezes at the specified upgrade height, since it does not have the hardcoded upgrade:

Let's query the RPC and confirm it's frozen:
Verify that updating fendermint with a new version can successfully proceed
In the previous test, we confirmed that fendermint stopped once it progressed to a block height which had a `required` upgrade associated with it that it didn't have.

Here we want to test that when this happens, and we replace the running (frozen) fendermint with a newer version which has the upgrade, it will successfully proceed from where the previous version stopped.
To test this, I created a test-required-upgrade branch which employs a custom kernel with a new syscall, as well as adding support for upgrading to app version 1.