Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENG-771: Add upgrade-info json file to determine when upgrades are applied + rpc + cli #793

Draft
wants to merge 13 commits into
base: main
Choose a base branch
from

Conversation

fridrik01
Copy link
Contributor

@fridrik01 fridrik01 commented Mar 13, 2024

Context

This PR updates fendermint to support future ipcvisor tool we are building to make upgrades easier to orchistrate for our partners (see post).

In order to facilitate the orchistration, this PR updates fendermint to:

  • Split the upgrade_scheduler into hardcoded Upgrade migrations and separate upgrade schedule (a list of UpgradeInfo structs) which maps block heights to the hardcoded upgrade migrations.
  • Stop processing when it reaches a block height which has an associated hardcoded Upgrade which is required it does not have the source code of.
  • Have RPC methods to query the node state (i.e check if its frozen)
  • CLI commands to manage the upgrade-info.json file and check node state

This PR introduces all that functionality which should be enough for the ipcvisor to do its thing.

Note, the following are not implemented in this PR:

  • Add verification of how the app version progresses progresses (ie. should it be strictly incrementing, can it decrease, etc)
  • Verify CometBFT version at startup

Test plan

Test new functionality, including that fendermint should freeze and that rpc/cli work as expected

By default when fendermint starts it will automatically create an empty upgrade-migration.json file:

# start fenderming
cargo run -p fendermint_app --release -- run
...
2024-03-13T13:54:55.515089Z  INFO fendermint/app/src/cmd/run.rs:118: loaded upgrade_schedule path="/Users/fridrik/.fendermint/upgrade_info.json" schedule=UpgradeSchedule { schedule: {} }
...
CTRL-C

# confirm the upgrade_info.json file was created
cat ~/.fendermint/upgrade_info.json
[]

Lets try adding a new upgrade info which we don't have the hardcoded Upgrade for, lets also make it required to make things interisting

# add a new required upgrade
cargo run -p fendermint_app --release -- upgrade --upgrade-file=/Users/fridrik/.fendermint/upgrade_info.json  add-upgrade --height=1000 --new-app-version=1 --cometbft-version=0.37 --required

# confirm that the upgrade info was added
cat /Users/fridrik/.fendermint/upgrade_info.json
[
  {
    "height": 1000,
    "new_app_version": 1,
    "cometbft_version": "0.37",
    "required": true
  }
]

Now observe that fenderming freezes at the specifiec upgrade height since it does not have the hardcoded upgrade:

cargo run -p fendermint_app --release -- run
...
2024-03-13T14:01:42.959472Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=998
2024-03-13T14:01:43.994470Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=999
2024-03-13T14:01:45.034063Z  INFO fendermint/vm/interpreter/src/fvm/exec.rs:223: upgrade scheduled app_version=0 new_app_version=1 height=1000 upgrade_info=UpgradeInfo { height: 1000, new_app_version: 1, cometbft_version: "0.37", required: true }
2024-03-13T14:01:45.034080Z  WARN fendermint/vm/interpreter/src/fvm/exec.rs:242: upgrade not found height=1000 upgrade_info=UpgradeInfo { height: 1000, new_app_version: 1, cometbft_version: "0.37", required: true }
2024-03-13T14:01:45.034084Z ERROR fendermint/vm/interpreter/src/fvm/exec.rs:254: node frozen, please restart with a newer version which has the required upgrade height=1000 upgrade_info=UpgradeInfo { height: 1000, new_app_version: 1, cometbft_version: "0.37", required: true }

Lets query the rpc and confirm its frozen:

cargo run -p fendermint_app --release -- rpc query node-state
{
  "response": {
    "height": "999",
    "value": {
      "frozen": true
    }
  }
}

Verify that updating fendermint with new version can successfully proceed

In the previous test, we confirmed that fendermint stopped once it progressed to a block height which had a required upgrade associated that it didn't have.

Here we want to test that when this happens, that when we stop the running (frozen) fendermint with a newer version which has the upgrade, that it will successfully proceed from where the previous version stopped.

To test this, I created a test-required-upgrade branch which employs a custom kernel with new syscall as well with adding support for upgrading to app version 1.

# stop the previous frozen fendermint
CTRL+C

# start the fendermint from the test-required-upgrade branch:
cargo run -p fendermint_app --release -- run
...
2024-03-13T15:42:09.886900Z  INFO fendermint/vm/interpreter/src/fvm/exec.rs:223: upgrade scheduled app_version=0 new_app_version=1 height=1010 upgrade_info=UpgradeInfo { height: 1010, new_app_version: 1, cometbft_version: "0.37", required: true }
2024-03-13T15:42:09.886931Z  INFO fendermint/vm/interpreter/src/fvm/exec.rs:239: upgrade successful app_version=1
2024-03-13T15:42:09.886940Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=1010
2024-03-13T15:42:10.249421Z  INFO fendermint/app/src/app.rs:670: event=ProposalProcessed is_accepted=true height=1011 size=0 hash="7EDD19B7B473CA1B006B4D694BF3B5D7771702C72D8F7F3352AC9245A5D6FD4C" proposer="9CDE756FDEFC4FAC219F01844D80309D8D800C32"
2024-03-13T15:42:12.614231Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=1011
2024-03-13T15:42:12.642061Z  INFO fendermint/app/src/app.rs:670: event=ProposalProcessed is_accepted=true height=1012 size=0 hash="EF9CC72C12A5DC33430CCBF2F4D6CD37BE0A80B71C5B722EF94C6B6BBEA0873F" proposer="9CDE756FDEFC4FAC219F01844D80309D8D800C32"
2024-03-13T15:42:12.675509Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=1012
2024-03-13T15:42:13.683986Z  INFO fendermint/app/src/app.rs:670: event=ProposalProcessed is_accepted=true height=1013 size=0 hash="A1D6DB8BF88F616426783667CEA3415D12F42875340F310F99E41391FE3920C7" proposer="9CDE756FDEFC4FAC219F01844D80309D8D800C32"
2024-03-13T15:42:13.710567Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=1013
2024-03-13T15:42:14.712736Z  INFO fendermint/app/src/app.rs:670: event=ProposalProcessed is_accepted=true height=1014 size=0 hash="CEC377F99B00787E909939D3C845DBE68463FE491834F4DB215D0DF0FF817AB5" proposer="9CDE756FDEFC4FAC219F01844D80309D8D800C32"
2024-03-13T15:42:14.736308Z  INFO fendermint/app/src/app.rs:771: event=NewBlock height=1014
...

@fridrik01 fridrik01 requested a review from a team March 13, 2024 15:49
@fridrik01 fridrik01 marked this pull request as ready for review March 13, 2024 15:54
@raulk raulk requested a review from aakoshh March 14, 2024 11:32
@fridrik01 fridrik01 changed the title Add upgrade-info json file to determine when upgrades are applied + rpc + cli ENG-771: Add upgrade-info json file to determine when upgrades are applied + rpc + cli Mar 15, 2024
Copy link

linear bot commented Mar 15, 2024

@aakoshh
Copy link
Contributor

aakoshh commented Mar 18, 2024

Stop processing when it reaches a block height which has an associated hardcoded Upgrade

Haven't seen it yet, but I expected that it will be looking for heights that it does not have hardcoded, heights looking from the upgrade JSON file that it learned about, but doesn't know how to execute.

The tests in the description seem to be in line with this, though 👍

Comment on lines +37 to +42
/// whether this is a required upgrade or not. A required upgrade
/// will cause the node to freeze and not process any more blocks
/// if it does not have the corresponding Upgrade defined to
/// migrate to that version
#[arg(long)]
pub required: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just thinking when an upgrade wouldn't be required. I know we talked about upgrades where you can download the new software and replace the existing one with it now, because the binary is backwards compatible, and it will apply the upgrade when the time comes. Why the upgrade list would know this though..

OTOH if some nodes update the app_version then everyone must do it, because it's part of the ledger. So an optional upgrade must not update the app version.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also not sure how an optional upgrade would be handled if the node with the upgrade is started later than the height. Nothing will trigger that upgrade, so we don't know if it's been applied or not?

Comment on lines +95 to +96
/// Query the current upgrade schedule
UpgradeSchedule,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not exactly sure what an upgrade schedule is; maybe the docstring could explain a bit more.

For example I understand it can be the contents of the JSON file, but I was wondering if it could also include the versions that the node is capable of executing.

@@ -235,7 +237,7 @@ async fn test_applying_upgrades() {
tester.commit().await.unwrap();

// check that the app_version was upgraded to 1
assert_eq!(tester.state_params().app_version, 1);
assert_eq!(tester.state_params().app_version, block_height as u64 + 10);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think block_height + 10 as app_version is puzzling. I kept wondering why that should be true, and it's true in this particular set of hardcoded values, but it looks like some kind of rule that one has to make sense of. Perhaps it would be more straight forward to just iterate over the known list of (height, version) pairs and assert them, rather than have a formula for it that works now, but it's not clear where the upgrades are defined that this offset must be adhered to.

@@ -19,6 +23,9 @@ use super::{
FvmMessage, FvmMessageInterpreter,
};

/// Indicates whether the node has been frozen due to not having a required upgrade.
pub static IS_FROZEN: AtomicBool = AtomicBool::new(false);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this live inside the FvmMessageInterpreter instead of being a public static global field? The way it is, it can be set from anywhere, by anyone. There has been no rule so far that there can be only one interpreter instance. The application only creates one, so there is no problem with multiple conflicting values.

});

tracing::info!(app_version = state.app_version(), "upgraded app version");
if let Some(upgrade_info) = self.upgrade_schedule.get(state.block_height()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was thinking whether we should also add this check to the ProposalInterpreter implementations (which this interpreter does not implement), but I think you are right, this should be sufficient. If a validator proposes a block without knowing whether it will be able to execute it, it can still be finalized and cause all nodes to freeze there during delivery.

If we live with this model, then it's also probably right to keep the execution of the upgrade in end, because the decision about what transactions to put into the block has been made the the current version of the software... but no, that doesn't make any difference, if the new version will have to execute it, it will have to be able to handle all the transactions that the previous one already put into the block.

So that could be prevented by adding the freeze already to the proposal handling: if for example we wanted to get rid of a certain transaction type, we could just say "we're no longer eligible to create the next block", rather than put it in, then consider it invalid.

If we want to handle such cases by marking these transactions as invalid, then I reckon the upgrade should be done in begin after all, which signifies: once this block begins, it's the new rules, even if the transactions are old. Versus suggesting that "until the end, it's the old rules", which could leave the new version in a conundrum whether to be backwards compatible just for the sake of a single block.

}
FvmQuery::NodeState => {
let ret = NodeState {
frozen: IS_FROZEN.load(Ordering::Relaxed),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So this one is executed by the same interpreter, it can access it as a member variable.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe it could tell on which upgrade it has been frozen, instead of just a bool.

pub struct UpgradeSchedule {
// schedule of upgrades to be executed at given height
schedule: BTreeMap<ChainEpoch, UpgradeInfo>,
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I see missing here is a file system watcher. The JSON is only read when the node starts. We want it to stop when a new upgrade is announced, and the height arrives with the old version still running. If it doesn't refresh the file, then it won't know.

I reckon it could keep a reference to the file and start a background loop that checks it every now and then, then loads new items into the schedule.

This is also important because without a governance module, the upgrade schedule somehow has to make it to all full nodes, which is why I think the visor should be able to download it and overwrite the file on the disk. Not sure how to make sure that doesn't happen for the guy who is adding upgrades by hand.

Comment on lines +55 to +56
pub fn add(&mut self, upgrade_info: UpgradeInfo) -> anyhow::Result<()> {
match self.schedule.entry(upgrade_info.height) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe add should check not only that the height is unoccupied, but that the version fits between the previous and next upgrades. I'm not sure what kind of semantics we put on the app version, but since it identifies the upgrade, I reckon making it strictly monotonic is a reasonable default.

Copy link
Contributor

@aakoshh aakoshh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great 👏

I added a lot of exploratory comments about design. Curious to hear others' opinions on the semantics of app version for example, as any upgrade that touches it is consensus breaking, and I find hard to reason about required because of this.

What I'd be curious about as an operator is whether I an upgrade my node to any particular future version and expect it to work before the upgrades are applied. For example maybe a migration changes the data schema of an IPC actor. Until it's executed, we cannot expect the new bindings to work, so we couldn't start a fendermint that has new bindings ahead of time. Maybe fendermint could have a minimum application version that it checks at the start, and quits if it's not time for this particular binary to run yet.

@raulk raulk added the Fluence label Mar 19, 2024
@raulk raulk removed the Fluence label Mar 19, 2024
@fridrik01
Copy link
Contributor Author

Closing while working on addressing remaining comments

@fridrik01 fridrik01 closed this Mar 19, 2024
@jsoares
Copy link
Contributor

jsoares commented Mar 19, 2024

As a general practice, I find it best to revert to draft rather than close. It keeps things observable while clearly signalling that it's not ready for review.

@fridrik01
Copy link
Contributor Author

fridrik01 commented Mar 19, 2024

As a general practice, I find it best to revert to draft rather than close. It keeps things observable while clearly signalling that it's not ready for review.

I agree and this is the first time I have closed a PR instead of putting to draft. Reason was that I have on several occations had my PRs reviewed while they were in draft and I thought this time that might happen again as I had pushed several commits and didn't want someone to review it before I push the remaining commits. But sure, I'll move this to draft, and lets see what happens :)

@fridrik01 fridrik01 reopened this Mar 19, 2024
@fridrik01 fridrik01 marked this pull request as draft March 19, 2024 17:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants