Crash during 3.13 -> 4.0 upgrade when there is an enabled feature flag of a disabled plugin #12963
-
Describe the bug
When a plugin is enabled, its feature flags are discovered; however, when the same plugin is disabled, its feature flags remain discovered. It is even possible to enable such a feature flag, leading to various problems, crashes, and nodes refusing to start. We originally faced this issue with the MQTT plugin and the rabbit_mqtt_qos0_queue feature flag. On a new 3.13.7 cluster the MQTT plugin was enabled, then disabled. Then all feature flags were enabled.
Boot crash:
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> BOOT FAILED
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> ===========
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> Exception during startup:
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0>
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> error:{badmatch,#{feature_flags => #{classic_mirrored_queue_version => #{name => classic_mirrored_queue_version,desc => "Support setting version for classic mirrored queues",stability => required,provided_by => rabbit},classic_queue_type_delivery_support => #{name => classic_queue_type_delivery_support,desc => "Bug fix for classic queue deliveries using mixed versions",stability => required,doc_url => "https://github.com/rabbitmq/rabbitmq-server/issues/5931",depends_on => [stream_queue],provided_by => rabbit},detailed_queues_endpoint => #{name => detailed_queues_endpoint,desc => "Add a detailed queues HTTP API endpoint.
Reduce number of metrics in the default endpoint.",stability => stable,depends_on => [feature_flags_v2],provided_by => rabbitmq_management},direct_exchange_routing_v2 => #{name => direct_exchange_routing_v2,desc => "v2 direct exchange routing implementation",stability => required,depends_on => [feature_flags_v2,implicit_default_bindings],provided_by => rabbit},drop_unroutable_metric => #{name => drop_unroutable_metric,desc => "Count unroutable publishes to be dropped in stats",stability => required,provided_by => rabbitmq_management_agent},empty_basic_get_metric => #{name => empty_basic_get_metric,desc => "Count AMQP `basic.get` on empty queues in stats",stability => required,provided_by => rabbitmq_management_agent},feature_flags_v2 => #{name => feature_flags_v2,desc => "Feature flags subsystem V2",stability => required,provided_by => rabbit},implicit_default_bindings => #{name => implicit_default_bindings,desc => "Default bindings are now implicit, instead of being stored in the database",stability => required,provided_by => rabbit},listener_records_in_ets => #{name => listener_records_in_ets,desc => "Store listener records in ETS instead of Mnesia",stability => required,depends_on => [feature_flags_v2],provided_by => rabbit},maintenance_mode_status => #{name => maintenance_mode_status,desc => "Maintenance mode status",stability => required,provided_by => rabbit},message_containers => #{name => message_containers,desc => "Message containers.",stability => stable,depends_on => [feature_flags_v2],provided_by => rabbit},message_containers_deaths_v2 => #{name => message_containers_deaths_v2,desc => "Bug fix for dead letter cycle detection",stability => stable,doc_url => "https://github.com/rabbitmq/rabbitmq-server/issues/11159",depends_on => [message_containers],provided_by => rabbit},quorum_queue => #{name => quorum_queue,desc => "Support queues of type `quorum`",stability => required,doc_url => "https://www.rabbitmq.com/quorum-queues.html",provided_by => 
rabbit},quorum_queue_non_voters => #{name => quorum_queue_non_voters,desc => "Allows new quorum queue members to be added as non voters initially.",stability => stable,depends_on => [quorum_queue],provided_by => rabbit},restart_streams => #{name => restart_streams,desc => "Support for restarting streams with optional preferred next leader argument.Used to implement stream leader rebalancing",stability => stable,depends_on => [stream_queue],provided_by => rabbit},stream_filtering => #{name => stream_filtering,desc => "Support for stream filtering.",stability => stable,depends_on => [stream_queue],provided_by => rabbit},stream_queue => #{name => stream_queue,desc => "Support queues of type `stream`",stability => required,doc_url => "https://www.rabbitmq.com/stream.html",depends_on => [quorum_queue],provided_by => rabbit},stream_sac_coordinator_unblock_group => #{name => stream_sac_coordinator_unblock_group,desc => "Bug fix to unblock a group of consumers in a super stream partition",stability => stable,doc_url => "https://github.com/rabbitmq/rabbitmq-server/issues/7743",depends_on => [stream_single_active_consumer],provided_by => rabbit},stream_single_active_consumer => #{name => stream_single_active_consumer,desc => "Single active consumer for streams",stability => required,doc_url => "https://www.rabbitmq.com/stream.html",depends_on => [stream_queue],provided_by => rabbit},stream_update_config_command => #{name => stream_update_config_command,desc => "A new internal command that is used to update streams as part of a policy.",stability => stable,depends_on => [stream_queue],provided_by => rabbit},tracking_records_in_ets => #{name => tracking_records_in_ets,desc => "Store tracking records in ETS instead of Mnesia",stability => required,depends_on => [feature_flags_v2],provided_by => rabbit},user_limits => #{name => user_limits,desc => "Configure connection and channel limits for a user",stability => required,provided_by => rabbit},virtual_host_metadata => #{name => 
virtual_host_metadata,desc => "Virtual host metadata (description, tags, etc)",stability => required,provided_by => rabbit},transient_nonexcl_queues => #{messages => #{when_permitted => "Feature `transient_nonexcl_queues` is deprecated.\nBy default, this feature can still be used for now.\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.transient_nonexcl_queues = true\"\nTo test RabbitMQ as if the feature was removed, set this in your configuration:\n \"deprecated_features.permit.transient_nonexcl_queues = false\"",when_denied => "Feature `transient_nonexcl_queues` is deprecated.\nIts use is not permitted per the configuration (overriding the default, which is permitted):\n \"deprecated_features.permit.transient_nonexcl_queues = false\"\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.transient_nonexcl_queues = true\""},name => transient_nonexcl_queues,deprecation_phase => permitted_by_default,doc_url => "https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/#removal-of-transient-non-exclusive-queues",provided_by => rabbit},global_qos => #{messages => #{when_permitted => "Feature `global_qos` is deprecated.\nBy default, this feature can still be used for now.\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted 
by default, set the following parameter in your configuration:\n \"deprecated_features.permit.global_qos = true\"\nTo test RabbitMQ as if the feature was removed, set this in your configuration:\n \"deprecated_features.permit.global_qos = false\"",when_denied => "Feature `global_qos` is deprecated.\nIts use is not permitted per the configuration (overriding the default, which is permitted):\n \"deprecated_features.permit.global_qos = false\"\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.global_qos = true\""},name => global_qos,deprecation_phase => permitted_by_default,doc_url => "https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/#removal-of-global-qos",provided_by => rabbit},khepri_db => #{name => khepri_db,desc => "Use the new Khepri Raft-based metadata store",stability => experimental,doc_url => [],depends_on => [feature_flags_v2,direct_exchange_routing_v2,maintenance_mode_status,user_limits,virtual_host_metadata,tracking_records_in_ets,listener_records_in_ets,classic_queue_mirroring,ram_node_type],provided_by => rabbit},classic_queue_mirroring => #{messages => #{when_permitted => "Classic mirrored queues are deprecated.\nBy default, they can still be used for now.\nTheir use will not be permitted by default in the next minorRabbitMQ version (if any) and they will be removed from RabbitMQ 4.0.0.\nTo continue using classic mirrored queues when they are not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.classic_queue_mirroring = true\"\nTo test RabbitMQ as if they were removed, set this in your configuration:\n \"deprecated_features.permit.classic_queue_mirroring = false\"",when_denied => "Classic mirrored 
queues are deprecated.\nTheir use is not permitted per the configuration (overriding the default, which is permitted):\n \"deprecated_features.permit.classic_queue_mirroring = false\"\nTheir use will not be permitted by default in the next minor RabbitMQ version (if any) and they will be removed from RabbitMQ 4.0.0.\nTo continue using classic mirrored queues when they are not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.classic_queue_mirroring = true\""},name => classic_queue_mirroring,callbacks => #{enable => {rabbit_deprecated_features,enable_underlying_feature_flag_cb},is_feature_used => {rabbit_mirror_queue_misc,are_cmqs_used}},deprecation_phase => permitted_by_default,doc_url => "https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/#removal-of-classic-queue-mirroring",provided_by => rabbit},ram_node_type => #{messages => #{when_permitted => "Feature `ram_node_type` is deprecated.\nBy default, this feature can still be used for now.\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.ram_node_type = true\"\nTo test RabbitMQ as if the feature was removed, set this in your configuration:\n \"deprecated_features.permit.ram_node_type = false\"",when_denied => "Feature `ram_node_type` is deprecated.\nIts use is not permitted per the configuration (overriding the default, which is permitted):\n \"deprecated_features.permit.ram_node_type = false\"\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following 
parameter in your configuration:\n \"deprecated_features.permit.ram_node_type = true\""},name => ram_node_type,deprecation_phase => permitted_by_default,doc_url => "https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/#removal-of-ram-nodes",provided_by => rabbit},management_metrics_collection => #{messages => #{when_permitted => "Feature `management_metrics_collection` is deprecated.\nBy default, this feature can still be used for now.\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.management_metrics_collection = true\"\nTo test RabbitMQ as if the feature was removed, set this in your configuration:\n \"deprecated_features.permit.management_metrics_collection = false\"",when_denied => "Feature `management_metrics_collection` is deprecated.\nIts use is not permitted per the configuration (overriding the default, which is permitted):\n \"deprecated_features.permit.management_metrics_collection = false\"\nIts use will not be permitted by default in a future minor RabbitMQ version and the feature will be removed from a future major RabbitMQ version; actual versions to be determined.\nTo continue using this feature when it is not permitted by default, set the following parameter in your configuration:\n \"deprecated_features.permit.management_metrics_collection = true\""},name => management_metrics_collection,deprecation_phase => permitted_by_default,doc_url => "https://blog.rabbitmq.com/posts/2021/08/4.0-deprecation-announcements/#disable-metrics-delivery-via-the-management-api--ui",provided_by => rabbitmq_management_agent}},states_per_node => #{'rabbit@healthy-cobalt-greyhound-03' => #{rabbit_mqtt_qos0_queue => true,classic_mirrored_queue_version => 
true,classic_queue_type_delivery_support => true,detailed_queues_endpoint => true,direct_exchange_routing_v2 => true,drop_unroutable_metric => true,empty_basic_get_metric => true,feature_flags_v2 => true,implicit_default_bindings => true,listener_records_in_ets => true,maintenance_mode_status => true,message_containers => true,message_containers_deaths_v2 => true,quorum_queue => true,quorum_queue_non_voters => true,restart_streams => true,stream_filtering => true,stream_queue => true,stream_sac_coordinator_unblock_group => true,stream_single_active_consumer => true,stream_update_config_command => true,tracking_records_in_ets => true,user_limits => true,virtual_host_metadata => true,transient_nonexcl_queues => false,global_qos => false,khepri_db => false,classic_queue_mirroring => false,ram_node_type => false,management_metrics_collection => false,delete_ra_cluster_mqtt_node => false,mqtt_v5 => false},'rabbit@healthy-cobalt-greyhound-02' => #{rabbit_mqtt_qos0_queue => true,classic_mirrored_queue_version => true,classic_queue_type_delivery_support => true,detailed_queues_endpoint => true,direct_exchange_routing_v2 => true,drop_unroutable_metric => true,empty_basic_get_metric => true,feature_flags_v2 => true,implicit_default_bindings => true,listener_records_in_ets => true,maintenance_mode_status => true,message_containers => true,message_containers_deaths_v2 => true,quorum_queue => true,quorum_queue_non_voters => true,restart_streams => true,stream_filtering => true,stream_queue => true,stream_sac_coordinator_unblock_group => true,stream_single_active_consumer => true,stream_update_config_command => true,tracking_records_in_ets => true,user_limits => true,virtual_host_metadata => true,transient_nonexcl_queues => false,global_qos => false,khepri_db => false,classic_queue_mirroring => false,ram_node_type => false,management_metrics_collection => false,delete_ra_cluster_mqtt_node => false,mqtt_v5 => false}},applications_per_node => #{'rabbit@healthy-cobalt-greyhound-03' 
=> [rabbit,rabbit_common,rabbitmq_consistent_hash_exchange,rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_management_agent,rabbitmq_prelaunch,rabbitmq_prometheus,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_web_dispatch],'rabbit@healthy-cobalt-greyhound-02' => [rabbit,rabbit_common,rabbitmq_consistent_hash_exchange,rabbitmq_federation,rabbitmq_federation_management,rabbitmq_management,rabbitmq_management_agent,rabbitmq_prelaunch,rabbitmq_prometheus,rabbitmq_shovel,rabbitmq_shovel_management,rabbitmq_web_dispatch]}}}
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0>
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> rabbit_ff_controller:-check_one_way_compatibility/2-fun-0-/3, line 514
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> lists:all_1/2, line 1520
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> rabbit_ff_controller:are_compatible/2, line 496
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> rabbit_ff_controller:check_node_compatibility_task1/4, line 437
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> rabbit_db_cluster:check_compatibility/1, line 376
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> rabbit_mnesia:check_cluster_consistency/2, line 687
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> lists:foldl/3, line 1594
2024-12-16 12:39:58.423260+00:00 [error] <0.267.0> rabbit_mnesia:check_cluster_consistency/0, line 648
But similar issues can be reproduced on a new 4.0.4 cluster with the rabbitmq_management plugin. It is possible that feature flag handling changed between 3.13.7 and 4.0, so that this crash, which happened during the upgrade, cannot happen in later upgrades (e.g. 4.0.x -> 4.1.x).
Reproduction steps
Expected behavior
Additional context
No response
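One way to read the inventory in the crash log above: `rabbit_mqtt_qos0_queue` is marked `true` under `states_per_node` for both nodes, yet there is no entry for it in the `feature_flags` map, and `rabbitmq_mqtt` is missing from `applications_per_node`. The sketch below is a simplified, hypothetical Python model of that inventory (trimmed to a few flags) that lists such "enabled but unknown" flags. The function name and the trimmed data are illustrative assumptions, not RabbitMQ code, and this is an observation about the log rather than a statement of the root cause.

```python
# Hypothetical, trimmed model of the inventory term printed in the crash log.
# Only the three top-level keys and the flag/node names come from the log;
# everything else is an illustration.
NODES = [
    "rabbit@healthy-cobalt-greyhound-02",
    "rabbit@healthy-cobalt-greyhound-03",
]

inventory = {
    # Flags whose definitions are known (no MQTT flags, as in the log).
    "feature_flags": {
        "quorum_queue": {"provided_by": "rabbit"},
        "stream_queue": {"provided_by": "rabbit"},
        "khepri_db": {"provided_by": "rabbit"},
    },
    # Per-node enabled/disabled state, including MQTT flags.
    "states_per_node": {
        node: {
            "quorum_queue": True,
            "stream_queue": True,
            "khepri_db": False,
            "rabbit_mqtt_qos0_queue": True,
            "delete_ra_cluster_mqtt_node": False,
            "mqtt_v5": False,
        }
        for node in NODES
    },
    # Applications per node (rabbitmq_mqtt is absent, as in the log).
    "applications_per_node": {
        node: ["rabbit", "rabbitmq_management"] for node in NODES
    },
}


def orphaned_enabled_flags(inv):
    """Return {node: [flag, ...]} for flags that are enabled on a node
    but have no definition in the feature_flags map."""
    known = set(inv["feature_flags"])
    result = {}
    for node, states in inv["states_per_node"].items():
        orphans = sorted(
            name for name, enabled in states.items()
            if enabled and name not in known
        )
        if orphans:
            result[node] = orphans
    return result


print(orphaned_enabled_flags(inventory))
```

With the trimmed data above, both nodes report `rabbit_mqtt_qos0_queue` as the only enabled flag without a definition, which matches what can be seen by eye in the full log.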
Replies: 7 comments 21 replies
-
One difference between 3.13.7 with the MQTT plugin and 4.0.x with the management plugin is that on 3.13.7, with the MQTT plugin disabled, I got the following error printout
And out of the 3 feature flags defined by the mqtt plugin (
-
Such exotic scenarios where not all nodes have all the plugins enabled, or they are being disabled and re-enabled during an upgrade… are they a really good use of our time? I highly doubt any feature flag- or upgrades-related integration suites enable plugins "mid-upgrade". Feature flags are considered and documented as cluster-wide state. This means that the enabled set of plugins must be stable during an upgrade, or else the state is mutated by something the feature flag controller has no knowledge of, or control over.
-
@gomoripeti I have a fundamental problem with our small team's resources being spent on highly exotic scenarios like this. Disabling and re-enabling plugins during a version upgrade is not a good idea. During a routine restart where the list of plugins changes, feature flags should play no role if the version is stable. In a hypothetical scenario where one of the enabled plugins automatically needs to enable feature flags, it will not be able to do that until all nodes are compatible. We can spend a few person-weeks on this, or we can spend them on improvements such as Ra 2.16, which deliver double-digit efficiency improvements on workflows that affect most QQ users.
-
The set of enabled plugins is a node's state. Feature flags are cluster-wide state. It would likely raise a lot of eyebrows if a plugin was listed as "disabled" or unavailable due to feature flag state.
This would imply that when a plugin is disabled, so should be its feature flags, and a feature flag cannot be disabled once enabled, or at least it won't be possible while other nodes have the plugin and the FF enabled. Instead of trying to change RabbitMQ, reduce the number of changes that are allowed during a version upgrade in your system. A version upgrade is not the time when plugins should be enabled or disabled, but they can be enabled or disabled once all nodes are on the same version, in which case you won't run into any surprises around feature flags.
-
I echo Michael's thoughts. I think this scenario is quite unusual, perhaps even artificial. That being said, I think we should print a friendly error, as we did in 3.13.x.
-
This scenario is already supported. This is clearly a bug: the feature flag controller isn’t supposed to run a callback on a node that doesn’t know the feature flag. The list of nodes is already filtered, so there is an issue with the filtering or the input of that filtering.
This is expected because a node can have a plugin enabled while other nodes in the cluster don’t have it enabled or don’t have it at all. When that plugin is enabled elsewhere, it is supposed to pick up the feature state from the rest of the cluster. The sync is the same as when a node joins a cluster or is restarted.
-
Sorry for the misleading and bad repro steps using 4.0.x and the management plugin. I tried to come up with an example on a recent version, but I failed. I'd like to clarify that there is no change happening during the upgrade. The real-world issue that we faced is the following (which I also described and attached the logs for)
...while I was typing, the core team already provided a potential patch :) Thank you very much, I am giving it a try.
I did experiment a bit more with this, and I came to the same conclusion: on 4.0.x and above the same problem cannot happen (even with contrived scenarios like enabling a plugin/feature flag during a netsplit). It is specific to the difference in feature flag inventory representation between 3.13 and 4.0, in the very specific case of a plugin being enabled and then disabled.
For the 3.13.x -> 4.0.x rolling upgrade case there is a workaround: enable the MQTT plugin and its feature flags before the upgrade; the plugin can then be disabled again after the upgrade.
Therefore I consider this issue resolved. Thank you again for the help with it.
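The workaround described above for the 3.13.x -> 4.0.x rolling upgrade could be sketched as the following command plan. This is a minimal sketch under stated assumptions, not an official procedure: the three flag names are taken from the `states_per_node` section of the crash log and are assumed to be the MQTT plugin's flags; the helper functions are hypothetical; and the script only prints the commands unless `dry_run=False` is passed. `rabbitmq-plugins enable/disable` and `rabbitmqctl enable_feature_flag` are the standard CLI commands, and plugin commands would need to be run against each node.

```python
import subprocess

# Assumption: these MQTT-related flag names are the ones that appear in the
# crash log's states_per_node; confirm with `rabbitmqctl list_feature_flags`
# on the 3.13.x cluster before relying on this list.
MQTT_FLAGS = [
    "delete_ra_cluster_mqtt_node",
    "rabbit_mqtt_qos0_queue",
    "mqtt_v5",
]


def pre_upgrade_commands(flags=MQTT_FLAGS):
    """Commands to run on 3.13.x before the upgrade: enable the MQTT plugin
    so its flags are known, then enable each of its feature flags."""
    cmds = [["rabbitmq-plugins", "enable", "rabbitmq_mqtt"]]
    cmds += [["rabbitmqctl", "enable_feature_flag", flag] for flag in flags]
    return cmds


def post_upgrade_commands():
    """Commands to run once all nodes are on 4.0.x: the plugin can be
    disabled again if it is not needed."""
    return [["rabbitmq-plugins", "disable", "rabbitmq_mqtt"]]


def run(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)


# Dry run: only prints the plan.
run(pre_upgrade_commands())
run(post_upgrade_commands())
```

Keeping the plan as data and defaulting to a dry run makes it easy to review the exact commands before touching a production cluster.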