| Commit message | Author | Age | Files | Lines |
were actually fine before: a) the rabbit_disk_queue table is local_content and b) only one process ever accesses that table - thus there is no reason why any transaction will ever retry. However, this change is probably still beneficial. The only slight loss is that tx-commit is no longer atomic (ref counting of messages in ets, not mnesia, was resulting in non-idempotency, so it was moved outside the transaction). This means that you could have the msgs in a tx committed but the acks not enforced, in the event of power failure or other catastrophic event.
All tests pass.
amqqueue
it though because I decided to only start it up when in mixed mode and when the amqqueue_process starts to hibernate (otherwise, we start it up too soon, it doesn't make much progress and then we just have to shut it down anyway). However, other manual tests definitely exercise it and it seems to be very effective. Certainly can't make it crash now.
happened to spot by reading the code. I am deeply alarmed by how many of these sorts of bugs I am finding and how many more there must be. OTOH, they do seem to crop up much more in code which has been changed substantially and repeatedly, though it's very possible that's just because I'm looking there more than elsewhere.
undelivered when in fact they have been delivered when converting to disk_only mode. In truth, this bug didn't exist, because there is no way a message which had previously been delivered could end up in that form in the mixed_queue. However, that will change when the prefetcher comes in, necessitating that this "bug" be fixed.
The solution is to make tx_commit take not just a list of msg ids in the txn, but a list of {msgid, delivered} tuples. In this way it mirrors the disk_queue:publish function, in that the delivery flag can be set explicitly.
Tests adjusted. All tests pass.
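A minimal sketch of the idea, in illustrative Python rather than the real Erlang: the store layout and function shape here are assumptions; the point is just that tx_commit carries an explicit delivered flag per message, mirroring publish.

```python
# Illustrative sketch (assumed structure, not the real disk_queue):
# tx_commit takes (msg_id, delivered) pairs so the delivery flag can be
# set explicitly for each message in the transaction.

def tx_commit(store, queue, pubs):
    """pubs is a list of (msg_id, delivered) tuples."""
    for msg_id, delivered in pubs:
        store.setdefault(queue, []).append(
            {"msg_id": msg_id, "delivered": delivered})

store = {}
tx_commit(store, "q1", [("m1", False), ("m2", True)])
```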
prefetcher and its limitations.
revision at the end of bug21087 on the grounds that the min_pri stuff wasn't enormously compelling and added a good chunk of complexity. Also, I don't believe it'll be needed for the new prefetcher. All tests pass.
really didn't want to have to deal with merging, and the other information about this bug is in the comments above in 20980, so it's in here.
Now, on commit, we test to see whether we need to sync the current file. If so, we just store all the txn details in state for dealing with later. If not, we really do the commit there and then, and reply. Interestingly, performance is actually better now than it was (see details in bug20470): e.g. the one-in-one-out at altitude test has further reduced fsyncs from 21 to 6 and now completes in 2.1 seconds, not 3.6 (altitude of 1000, then 5000 @ one in, one out, then 1000 drain). All tests pass.
We now guarantee that the messages will be fsync'd to disk _before_ anything is done to mnesia, in all cases of a txn_commit.
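The coalescing decision described above can be sketched as follows (illustrative Python; the state fields and function names are assumptions, and the fsync and mnesia write are simulated):

```python
# Illustrative sketch of commit coalescing: if the current file needs an
# fsync, park the txn details in state and deal with them after the
# sync; otherwise commit and reply immediately. The (simulated) sync
# always completes before deferred commits are applied.

def handle_commit(state, txn):
    if state["needs_sync"]:
        state["pending"].append(txn)    # defer: reply after the sync
        return "deferred"
    state["committed"].append(txn)      # no sync needed: commit now
    return "committed"

def sync(state):
    state["needs_sync"] = False         # the fsync itself (simulated)
    state["committed"].extend(state["pending"])
    state["pending"].clear()

state = {"needs_sync": True, "pending": [], "committed": []}
handle_commit(state, "t1")              # deferred behind the sync
sync(state)
```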
queue in disk-only mode would not be marked delivered even if they were persistent, thus resulting in redelivery on broker startup without the message being marked as redelivered.
Also, spotted (not fixed yet) a bug in commit coalescing in which the mnesia transaction always commits before the messages are flushed to disk. What should happen is that if coalescing is going to happen, the mnesia transaction should be delayed too, and happen only _after_ the disk sync. I.e. it doesn't matter if we disk sync and then the mnesia txn fails, but it does matter if the mnesia txn succeeds and then the disk sync fails.
Also, I think I've worked out how to do prefetching properly. It's not actually that complex.
I decided the right thing to do is to prefer older messages in the cache to younger ones. This is because they're more likely to be used sooner. Which means we just fill it up and then leave it alone, which is nice and simple.
Things are pretty much ok with it now, but the whole notion of prefetch is still wrong and needs to be changed to be driven by the mixed_queue, not the disk_queue. For one thing, currently, if two or more queues issue prefetch requests and the first fills the cache up, then the 2nd won't do anything. The cache is useful, but shouldn't be abused for prefetching purposes. The two things are separate.
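The "prefer older messages" policy amounts to an admit-while-room cache: fill it up, then leave it alone. A minimal sketch (illustrative Python; the class and its interface are assumptions, not the real disk_queue cache):

```python
# Illustrative sketch of the cache policy: once full, new messages are
# simply not admitted, rather than evicting older entries. Older
# messages are more likely to be needed sooner, so we fill the cache up
# and then leave it alone.

class PrefetchCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def insert(self, msg_id, body):
        if len(self.entries) >= self.capacity:
            return False                # full: keep the older entries
        self.entries[msg_id] = body
        return True

cache = PrefetchCache(capacity=2)
cache.insert("old1", b"a")
cache.insert("old2", b"b")
admitted = cache.insert("young", b"c")  # rejected, cache already full
```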
isn't too bad with big queues, and memory use is stable. In disk_queue, when iterating through the mnesia table, do the normal limited batching for removal of non-persistent messages.
show stopper, and rightly so. I personally don't like the idea of adding additional tokens to the disk queue to indicate a queue switch, because it can substantially increase the number of OS calls, writes and reads from disk, and, e.g., getting the queue length and memory size right is made a fair bit more complex. So abandon the two-queues idea.
Instead, store the persistent flag in the stop byte on disk. Then on startup, the persistent flag turns up in the MsgLocations ets table. This is all done and all tests pass.
The next stage is that on start up, go through each queue and just wipe out non-persistent messages. This should be pretty fast. Then call the shuffle_up function as is currently being done. This will eliminate the gaps in sequences. This really should be enough. Then the mixed_queue can go back to just talking about a single queue.
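The startup plan above can be sketched as follows (illustrative Python; the data layout and the name `recover` are assumptions, and `shuffle_up` is modelled simply as renumbering to close the gaps):

```python
# Illustrative sketch of the described startup: wipe non-persistent
# messages from each queue, then "shuffle up" so the remaining sequence
# ids are contiguous again.

def recover(queues):
    for name, msgs in queues.items():
        kept = [m for m in msgs if m["persistent"]]  # wipe transient msgs
        for new_seq, m in enumerate(kept):           # shuffle_up: close gaps
            m["seq"] = new_seq
        queues[name] = kept
    return queues

q = {"q1": [{"seq": 0, "persistent": True},
            {"seq": 1, "persistent": False},
            {"seq": 2, "persistent": True}]}
recover(q)
```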
foldl on the entire queue. This seems excessive, but it does work. It only takes 75 seconds on my machine to get through 1e6 1024-byte messages, and 160 seconds to get through 2e6 1024-byte messages. So that doesn't worry me any more. Also, it's done in constant memory... ish[0].
Also fixed the queue_mode_manager. Registration does not now produce a mode. Instead, it assumes you're starting up in disk only mode and then the first memory_report will result in the correct mode being set. This is safe and prevents a potentially deadly prefetch being sent when a queue starts up in mixed mode only to be sent to disk_only mode.
However, the disk_queue has to start up in mixed mode because if it doesn't it has no way to estimate its memory use for disk mode. As such, it registers and then sends a report of 0 memory use. This guarantees that it can be put in mixed mode, thus it can then respond as necessary to the queue_mode_manager.
I've not done anything further at this stage with the use of the erlang queue in the mixed_queue module when in disk mode (the potential per-message cost). Really you don't want to send individual entries here to the disk_queue, you want to batch them up... makes this rather more complex.
[0] Sort of wrong. It can use the cache, and if you think about not-too-big queues sharing messages, this is clearly a good thing. But if there are lots of shared messages then it all goes wrong, because the cache will get over-populated and exhaust memory. Furthermore, the foldl runs entirely in the disk_queue process. This means that during the foldl it won't be reporting memory, and it won't be able to respond to requests to change its mode.
All of which points pretty strongly to the requirement that the prefetch needs to be somewhat more sophisticated.
Previously, persistent and non-persistent messages went into the same queue on disk. The advantage of this is that you don't need to track which queue you're currently reading from and for how many messages. However, the downside to this is that on queue recovery you need to iterate through the entire queue and delete all non-persistent messages. This takes a huge amount of time.
So now this is changed. Each amqqueue is now two on-disk queues: one for persistent messages and one for non-persistent messages. Thus queue recovery is now trivial - just delete the non-persistent queue. However, we now _always_ use the erlang queue in mixed_queue to track (in disk mode) how many of each queue we need to read (i.e. run-length encoding). This, in the worst case (alternating persistent and non-persistent), has per-message cost. It's possible we need some sort of disk-based queue (AGH!). Not sure. Provided the queue only contains one sort of message, it degenerates to a simple single counter.
All tests pass. However, there is a bug, which is that on recovery, the size of the queue (RAM cost) is not known. As such, the reporting of the queue to the queue_mode_manager on queue recovery is incorrect (it starts at 0 and can go negative). I've not decided how to fix this yet, because I do not want to have to iterate through all the messages to get the queue size out!
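The run-length encoding described above can be sketched like this (illustrative Python; the representation as (is_persistent, count) runs is an assumption about the real Erlang structure):

```python
# Illustrative sketch of the run-length encoding the mixed_queue keeps
# in disk mode: a queue of [is_persistent, count] runs saying how many
# messages to read from each of the two on-disk queues in turn. With
# only one kind of message it degenerates to a single counter;
# alternating kinds is the worst case (one run per message).

from collections import deque

def push(runs, is_persistent):
    if runs and runs[-1][0] == is_persistent:
        runs[-1][1] += 1                # extend the current run
    else:
        runs.append([is_persistent, 1]) # start a new run

runs = deque()
for p in [True, True, False, True]:
    push(runs, p)
```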
on disk, it issues a low priority prefetch instruction to the disk queue, which populates the disk_queue's cache. Note that this shouldn't impact on memory as by virtue of the mixed_queue being in mixed mode, the contents of the queue are already accounted for in memory even though they were on disk. The effect of this is that when the deliver comes, it doesn't need to go to disk to read the message as the messages are already in cache. Testing:
A 100,000 * 1Kb msg queue takes 15 seconds to drain (basic.get, noack) when the messages are in memory, in the mixed queue.
On disk, without prefetch, it takes 32 seconds.
On disk, with prefetch and the cache hot, it takes 25 seconds.
The next step is to get the disk queue to signal back to the queue that the prefetch is done and for the queue to grab the messages from the disk_queue in advance, thus meaning that on delivery, all that is needed is the async acks being sent to the disk_queue (assuming the messages are not actually persistent).
the documentation!
queues, give 2 queues the same length such that they can't both fit in memory, and slowly trickle in messages. As each one gets a message, it'll force the other one out to disk (the other one will either be in the hibernating or low-rate groups). This is bad. Therefore, adjusted the conditions under which we bring a queue back in from disk to exclude queues that are either hibernating or low rate (don't forget, even a list_queues will wake up a queue and cause it to report memory). If you have two fast queues then neither of them will be in the low-rate or hibernating groups, so neither will be a candidate for eviction, and the problem doesn't exist there; instead, if they need more memory and can't fit in ram, they'll evict themselves to disk rather than anyone else.
Also realised that a million queues isn't unreasonable, so minimum number of tokens in the system should be more like 1e7 if not higher.
that I failed to spot last night, but apparently came to me during my dreams. I have no idea how the tests managed to pass last night...
is on disk or not. It does not use any sequence numbers, nor does it try to correlate queue position with sequence numbers in the disk_queue. Therefore, there is absolutely no reason for the disk_queue to have all the complexity associated with being able to cope with non-contiguous sequence ids. Thus all removed. This has made the disk_queue a good bit simpler, and slightly faster in a few cases too. All tests pass.
mixed_queue:to_disk_only_mode. This function moves the next N messages from the front of the queue to the back, and is MUCH more efficient than calling phantom_deliver and then requeue_with_seqs. This means that a queue which has been sent to disk, then converted back to mixed mode, had some minor work done, and then been sent back to disk takes almost no time in transitions beyond the first one. The test of this:
1) declare durable queue
2) send 100,000 persistent messages to it
3) send 100,000 non-persistent messages to it
4) send 100,000 persistent messages to it
5) now pin it to disk - it'll make two calls to requeue_next_n and should be rather quick as it's only the middle 100,000 messages that actually have to be written, the other 200,000 don't even get sent between the disk_queue and mixed_queue in either direction. A total of 100,003 calls are necessary for this transition: 2 requeue_next_n, 100,000 tx_publish, 1 tx_commit
6) now unpin it from disk and list the queues to wake it up. The transition to mixed_mode is one call, zero reads, and instantaneous
7) now repin it to disk. The mixed queue knows everything is still on disk, so it makes one call to requeue_next_n with N = 300,000. The disk_queue sees this is the whole queue and so doesn't need to do any work at all and so is instant.
All tests pass.
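Conceptually, requeue_next_n is just a rotation of the queue: the next N messages at the front move to the back without the message bodies travelling anywhere. A minimal sketch (illustrative Python; the real thing operates on sequence ids inside the Erlang disk_queue):

```python
# Illustrative sketch of requeue_next_n: rotate the next N messages from
# the front of the queue to the back. When N equals the whole queue
# length the rotation is a no-op, matching step 7 above, where the
# disk_queue does no work at all.

from collections import deque

def requeue_next_n(q, n):
    q.rotate(-n)                        # front N move to the back

q = deque(["m1", "m2", "m3", "m4"])
requeue_next_n(q, 2)
```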
a process comes out of hibernation, then does under 10 seconds' work before hibernating again, it'll only issue a memory report when it goes back into hibernation, and thus it'll always claim to the queue_mode_manager that it's hibernating. So now, when hibernating, or when receiving the report_memory message, we set state so that when the next normal message comes in, we always send a memory report after that message. This ensures that when a process wakes up and does some real work, the queue_mode_manager will be informed.
Applied this, and the ability to hibernate to the disk_queue too. Plus some minor refactoring and better state field names. All tests pass, and the disk_queue really does hibernate with the binary backoff as I wanted.
queue_mode_manager rather than directly talking to the queues. This means the queues and the queue manager can't disagree on the mode a queue should be in.
mixed_queue so that it does batching. This means that it won't just flood the disk_queue with a billion messages, thus exhausting memory. Instead it does batching, and uses tx_commit to demarcate the batches. This means the conversion happens as quickly as possible and does not exhaust memory. Dropped the memory alarms to 0.8. This is a good idea because converting queues between modes transiently takes a fair chunk of memory, and leaving the alarms up at 0.95 was proving too high, making the mode transitions exhaust ram and swap to buggery.
However, there is a problem when going to disk_mode in mixed queue where messages in the queue are already on disk. A million calls to phantom deliver is not a good idea, and locks a CPU core at 100% for a very long time.
Obviously, converting a mixed queue to disk does take some time, and the values are deliberately set low to save memory, because on this transition the disk_queue mailbox will go insane and eat lots of memory very quickly. But it seems about the right balance. I'll add documentation next.
is no need for the emergency tokens, nor any need for the weird doubling. So it's basically got much simpler.
We hold two queues, one of hibernating queues (ordered by when they hibernated) and another priority_queue of lowrate queues (ordered by the amount of memory allocated to them). We evict to disk from the hibernated and then the lowrate queues in their relevant orders. Seems to work. Oh and disk_queue is now managed by the tokens too.
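The eviction order can be sketched as follows (illustrative Python; the structures and the direction of the memory ordering are assumptions - smallest-first here, which may differ from the real manager):

```python
# Illustrative sketch of the eviction order: a FIFO of hibernating
# queues (ordered by when they hibernated) is drained first, then a
# priority queue of low-rate queues ordered by allocated memory
# (smallest first here; an assumption).

import heapq
from collections import deque

hibernating = deque()                   # oldest hibernator at the left
lowrate = []                            # heap of (memory, queue_name)

def pick_victim():
    if hibernating:
        return hibernating.popleft()    # hibernators evicted first
    if lowrate:
        return heapq.heappop(lowrate)[1]
    return None

hibernating.append("q_idle")
heapq.heappush(lowrate, (512, "q_slow"))
first = pick_victim()                   # the hibernating queue
second = pick_victim()                  # then the low-rate queue
```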
a) when we're not hibernating, every 10 seconds
b) immediately prior to hibernating
c) as soon as we stop hibernating
mixed mode, we may as well do it really lazily and not bother with any communication with the disk_queue. We just have a token in the queue which indicates how many messages we are expecting to get from the disk queue. This makes disk -> mixed almost instantaneous. This also means that performance is not initially brilliant. Maybe we need some way for the queue to know that both it and the disk_queue are idle, and to decide to prefetch. Even batching could work well. It's an endless trade-off between getting operations to happen quickly and being able to get good performance. Dunno what the third thing is, probably not necessary, as you can't even have both of those, let alone pick 2 from 3!
apply_after, not apply_interval, and then after reporting memory use, don't set a new timer going (but do set a new timer going on every other message (other than timeouts)). This means that if nothing is going on after a memory report, the process can wait as long as it needs to before the hibernate timeout fires.
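The one-shot timer discipline can be sketched like this (illustrative Python with the timer simulated as a flag; the real code uses Erlang's timer module):

```python
# Illustrative sketch of the one-shot timer discipline: after a memory
# report fires, no new timer is set; any other (non-timeout) message
# re-arms the one-shot timer. An idle process therefore has no pending
# timer and can sit until the hibernate timeout fires.

def on_report_fired(state):
    state["timer_armed"] = False        # don't re-arm after reporting

def on_message(state):
    if not state["timer_armed"]:
        state["timer_armed"] = True     # re-arm on any normal message

state = {"timer_armed": True}
on_report_fired(state)                  # idle now: free to hibernate
idle = not state["timer_armed"]
on_message(state)                       # real work arrived: re-armed
```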
10 seconds, which means the memory_report timer will always fire and reset the timeout - thus the queue process will never hibernate.
publish. This massively reduces the number of sync calls to the disk_queue, potentially to one, if every message in the queue is non-persistent (or the queue is non-durable).
solve the problems. I don't quite buy this though, as all I was doing was stopping and starting the app so I don't understand why this was affecting the clustering configuration or causing issues _much_ further down the test line. But still, it seems to be repeatedly passing for me atm.
"All replicas on diskfull nodes are not active yet".