-
Describe the bug
When a queue that has many bindings is deleted, we observe that publishing any message is blocked for a significant amount of time.

Reproduction steps
Minimal steps to reproduce:
Initial state:
Actions:
Behavior observed:
Given the number of bindings (we sometimes have up to 10,000) and some other factor not yet identified, this lock can last up to 30 minutes in production. If we explicitly remove all bindings before deleting the queue, we do not observe any lock.

Expected behavior
Deleting a queue should not affect exchange performance.

Additional context
RabbitMQ 3.11.13, clusters of 3 nodes (8 CPU, 64 GB each), 400 messages per second, 250 exchanges, 1600 queues, 3000 channels, 900 connections.
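A minimal sketch of the workaround mentioned above (explicitly removing every binding before deleting the queue). This is illustrative only: it assumes Node.js 18+, amqplib, the management plugin on its default port, and placeholder queue names and credentials.

import amqplib from 'amqplib';

const QUEUE = 'com.queue';                              // illustrative name
const AMQP_URL = 'amqp://guest:guest@localhost:5672/';  // assumption
const MGMT_URL = 'http://localhost:15672';              // management plugin, default port (assumption)
const AUTH = 'Basic ' + Buffer.from('guest:guest').toString('base64');

(async () => {
  const conn = await amqplib.connect(AMQP_URL);
  const ch = await conn.createChannel();

  // AMQP 0-9-1 cannot enumerate bindings, so list them via the management HTTP API
  const res = await fetch(`${MGMT_URL}/api/queues/%2F/${QUEUE}/bindings`, {
    headers: { Authorization: AUTH },
  });
  const bindings = await res.json();

  // drop every explicit binding first, so the queue deletion itself no longer
  // has to remove thousands of bindings at once
  for (const b of bindings) {
    if (b.source === '') continue; // skip the implicit default-exchange binding
    await ch.unbindQueue(QUEUE, b.source, b.routing_key, b.arguments);
  }

  await ch.deleteQueue(QUEUE);
  await conn.close();
})();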
-
@lchenay FYI, all "upvote fests" are perceived very negatively by our team. A 31-minute-old issue immediately attracts a few 👍s; I wonder what may be going on… We are not dummies.
We need a way to reproduce against […]. I have good news for you: in […]. Finally, I am not intelligent enough to know what "KPI of control for the bugs" even means.
-
Another observation: a queue with thousands of bindings is a very rare workload. An exchange with thousands of bindings is fairly common. If that exchange is a fanout, these days you arguably should use a single stream (streams shipped in 3.9.x) with non-destructive, repeated consumption instead of fanning out to thousands of queues, which need thousands of bindings, which in turn need a lot of schema data store locking; that is expensive and will affect publishers, because routing needs to access those bindings. While not specifically useful for avoiding mass deletions of a large number of bindings, 3.11.x even includes super streams, which shows that by 3.11 streams were fairly mature and had features built on top of them, in both open source and Tanzu RabbitMQ.
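A rough sketch of that stream-based approach (the queue name, URL and offset below are illustrative assumptions, not anything from this thread): declare one durable stream and let every consumer read it repeatedly and non-destructively, instead of binding thousands of classic queues to a fanout.

import amqplib from 'amqplib';

(async () => {
  const conn = await amqplib.connect('amqp://guest:guest@localhost:5672/');
  const ch = await conn.createChannel();

  // one stream replaces thousands of per-consumer queues and their bindings
  await ch.assertQueue('events.stream', {
    durable: true,
    arguments: { 'x-queue-type': 'stream' },
  });

  // publish through the default exchange straight to the stream
  ch.sendToQueue('events.stream', Buffer.from('hello'));

  // streams consumed over AMQP 0-9-1 need a prefetch limit and manual acks;
  // each consumer keeps its own offset and messages are never removed
  await ch.prefetch(100);
  await ch.consume('events.stream', (msg) => {
    console.log(msg.content.toString());
    ch.ack(msg);
  }, { noAck: false, arguments: { 'x-stream-offset': 'first' } });
})();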
-
Thank you sincerely for the quick answer! I was not expecting that much reactivity.
@mkuratczyk I will work on an exact reproduction script, as agnostic as possible; share it; and run it against as many versions as possible to add data to the discussion.
@michaelklishin I had missed the end-of-support information. I don't have that visibility and will ping the infrastructure team right now.
@michaelklishin Sorry for the "emoji fest". All of them are from my company and are, I suppose, more an expression of joy after 2 weeks of intense investigation into our multiple production incidents than an attempt to bypass issue prioritisation. I will pass the feedback on to them.
@michaelklishin I will carefully read all of that material on streams. Clearly our implementation is not good; having thousands of bindings on a single queue clearly seems to be a misuse of the tooling.
-
Using this quick docker-compose to simulate a RabbitMQ cluster locally, testing versions 3.11.13, 3.13.7 and 4.0.4:
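Roughly along these lines (a sketch only; the image tag, hostnames, cookie value and mounted rabbitmq.conf are placeholders, and the same layout can be re-used with the 3.13.7 and 4.0.4 images):

# three nodes clustered via classic-config peer discovery
services:
  rabbit1:
    image: rabbitmq:3.11.13-management
    hostname: rabbit1
    environment:
      RABBITMQ_ERLANG_COOKIE: "local-test-cookie"
    ports: ["5672:5672", "15672:15672"]
    volumes: ["./rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro"]
  rabbit2:
    image: rabbitmq:3.11.13-management
    hostname: rabbit2
    environment:
      RABBITMQ_ERLANG_COOKIE: "local-test-cookie"
    volumes: ["./rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro"]
  rabbit3:
    image: rabbitmq:3.11.13-management
    hostname: rabbit3
    environment:
      RABBITMQ_ERLANG_COOKIE: "local-test-cookie"
    volumes: ["./rabbitmq.conf:/etc/rabbitmq/rabbitmq.conf:ro"]

# rabbitmq.conf mounted into every container:
#   cluster_formation.peer_discovery_backend = classic_config
#   cluster_formation.classic_config.nodes.1 = rabbit@rabbit1
#   cluster_formation.classic_config.nodes.2 = rabbit@rabbit2
#   cluster_formation.classic_config.nodes.3 = rabbit@rabbit3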
Here are the results I got locally:
Code to automate all the bindings / measurements:

import amqplib from 'amqplib';

// publish one message on a confirm channel and log how long the broker
// takes to confirm it
const sendAndMeasure = async (ch1, exchangeName, cb) => {
  const start = new Date();
  ch1.publish(exchangeName, '', Buffer.from('Hello World!'), undefined, () => {
    console.log('Time taken to publish 1 message:', (new Date()) - start, 'ms');
    cb && cb();
  });
};

(async () => {
  const exchangeName = "com.exchange";
  const queueName = "com.queue";
  const conn = await amqplib.connect('amqp://guest:[email protected]:5672/');
  const ch1 = await conn.createConfirmChannel();
  await ch1.assertExchange(exchangeName, 'fanout', { durable: false });
  await ch1.assertQueue(queueName, { durable: false, autoDelete: true });

  // measure when nothing is bound to the queue
  sendAndMeasure(ch1, exchangeName);

  const subCh = await conn.createChannel();
  for (let i = 0; i < 10000; i++) {
    await subCh.bindQueue(queueName, exchangeName, 'random_binding_' + (Math.round(Math.random() * 100000)));
  }

  // deliberately not awaited, so the publish below overlaps with the deletion
  subCh.deleteQueue(queueName);

  // measure during the deletion of the queue
  sendAndMeasure(ch1, exchangeName, process.exit);
})();
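To run the script above (assuming Node.js 18+ and the file saved with an .mjs extension, e.g. measure.mjs — the filename is just an example):

npm install amqplib
node measure.mjs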
-
Adding a 1 ms delay on the network using tc in each container (to simulate inter-AZ latency and be more realistic):
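That is, a netem rule along these lines (the interface name inside the container is an assumption):

tc qdisc add dev eth0 root netem delay 1ms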
I did reproduce it:
With Khepri, version 4.0.4 and 4 nodes:
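(For anyone reproducing this: on 4.0.x Khepri is opt-in; as far as I know it is enabled with the khepri_db feature flag, e.g. rabbitmqctl enable_feature_flag khepri_db.)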
In that case, I will not dig deeper into this Mnesia issue / reproduction case.
I will close the topic and deal with a potential upgrade.
Thanks all!