Commit message / Author / Age / Files / Lines
* Removed some transactions and made all transaction bodies idempotent. [bug20980] Matthew Sackman, 2009-08-05, 1 file, -74/+56
|   They were actually fine before: (a) the rabbit_disk_queue table is local_content, and (b) only one process ever accesses that table, so there is no reason why any transaction would ever retry. However, this change is probably still beneficial. The only slight loss is that tx-commit is no longer atomic (the reference counting of messages in ets, not mnesia, was non-idempotent, so it was moved outside the transaction). This means that, in the event of a power failure or other catastrophic event, you could have the messages in a tx committed but the acks not enforced. All tests pass.
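The idempotency point in the commit above can be sketched in Python (all names invented; the real code is Erlang with mnesia and ets): writes into the table are safe to repeat if the transaction manager retries the body, whereas a reference-count increment is not, so the count is bumped exactly once outside the transaction.

```python
# Hypothetical model of the commit above. A transaction body that may be
# retried must be idempotent: re-running it must leave the same state.
class DiskQueueCommit:
    def __init__(self):
        self.table = {}        # msg_id -> queue position (idempotent writes)
        self.ref_counts = {}   # msg_id -> count (NOT safe inside a retryable body)

    def _txn_body(self, msg_ids, start_seq):
        # Idempotent: writing the same key/value twice is harmless on retry.
        for offset, msg_id in enumerate(msg_ids):
            self.table[msg_id] = start_seq + offset

    def commit(self, msg_ids, start_seq, retries=0):
        # Simulate the transaction manager retrying the body.
        for _ in range(retries + 1):
            self._txn_body(msg_ids, start_seq)
        # Non-idempotent step runs exactly once, after the transaction.
        for msg_id in msg_ids:
            self.ref_counts[msg_id] = self.ref_counts.get(msg_id, 0) + 1
```

Even with retries, the table ends up identical and each refcount is bumped once, which is exactly why moving the counting outside the transaction makes retries harmless.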
* merge in from 21087. Matthew Sackman, 2009-08-03, 22 files, -931/+4582
|\
| * Make set_mode a pcast in the disk_queue, just as it is in the amqqueue. Matthew Sackman, 2009-08-01, 1 file, -1/+1
| |
| * merging in from 21087. All tests pass. Matthew Sackman, 2009-07-29, 3 files, -96/+163
| |\
| * | Prefetcher appears to be done and working well. Matthew Sackman, 2009-07-21, 4 files, -97/+262
| | |   None of the tests exercise it, though, because I decided to start it only when in mixed mode and when the amqqueue_process starts to hibernate (otherwise we start it too soon, it makes little progress, and we just have to shut it down again). Other manual tests definitely exercise it, however, and it seems to be very effective. Certainly can't make it crash now.
| * | Sigh, another stupid bug which none of the tests catch and which I only spotted by reading the code. Matthew Sackman, 2009-07-21, 1 file, -2/+2
| | |   I am deeply alarmed by how many of these sorts of bugs I am finding, and how many more there must be. On the other hand, they do seem to crop up much more in code which has been changed substantially and repeatedly, though it is quite possible that is just because I am looking there more than elsewhere.
| * | Fixed a bug in the mixed_queue which could mark messages as undelivered when converting to disk_only mode, even though they had in fact been delivered. Matthew Sackman, 2009-07-21, 4 files, -19/+29
| | |   In truth this bug could not yet be triggered: there is no way a previously delivered message could end up in that form in the mixed_queue. That will change when the prefetcher comes in, however, so this "bug" needs fixing now. The solution is to make tx_commit take not just a list of msg ids in the txn, but a list of {MsgId, Delivered} tuples. It thereby mirrors the disk_queue:publish function in that the delivered flag can be set explicitly. Tests adjusted. All tests pass.
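The {MsgId, Delivered} change can be sketched as follows. This is an illustrative Python model with invented names, not the Erlang API: the point is simply that the delivered flag rides along with each message id into the commit, rather than being inferred.

```python
# Toy model: tx_commit records an explicit delivered flag per message,
# mirroring a publish call that can set the flag explicitly.
def tx_commit(store, entries):
    """entries: list of (msg_id, delivered) pairs committed together."""
    for msg_id, delivered in entries:
        store[msg_id] = {"delivered": delivered}
    return store
```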
| * | minor doc typos and formatting. Matthew Sackman, 2009-07-21, 1 file, -12/+13
| | |
| * | bare, non-functioning skeleton of prefetcher. Essay written on the design of the prefetcher and its limitations. Matthew Sackman, 2009-07-21, 3 files, -2/+206
| | |
| * | Stripping out the old, broken prefetch. Matthew Sackman, 2009-07-21, 3 files, -254/+83
| | |   Also reverted gen_server2 back to the revision at the end of bug21087, on the grounds that the min_pri stuff was not enormously compelling and added a good chunk of complexity; I don't believe it will be needed for the new prefetcher either. All tests pass.
| * | Fixed the commit bug. Matthew Sackman, 2009-07-19, 1 file, -51/+48
| | |   Really this should probably be in bug20470, but I did not want to deal with merging, and the other information about this bug is in the comments on bug20980, so it is in here. On commit, we now test whether the current file needs a sync. If so, we just store all the txn details in state for dealing with later; if not, we really do the commit there and then, and reply. Interestingly, performance is actually better now than it was (see details in bug20470): for example, the one-in-one-out-at-altitude test has further reduced fsyncs from 21 to 6 and now completes in 2.1 seconds rather than 3.6 (altitude of 1000, then 5000 at one in, one out, then 1000 drained). All tests pass. We now guarantee, in all cases of a txn commit, that the messages are fsync'd to disk before anything is done to mnesia.
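The coalescing rule described above can be modelled roughly like this (a hypothetical Python sketch with invented names; the real implementation lives in the Erlang disk_queue): transactions arriving while the file still needs a sync are parked, and one fsync later covers all of them.

```python
# Toy model of commit coalescing: park txns that need a sync, reply to the
# rest immediately, and let a single fsync complete every parked txn.
class CommitCoalescer:
    def __init__(self):
        self.pending = []   # txns parked until the next fsync
        self.fsyncs = 0

    def commit(self, txn, file_needs_sync):
        if file_needs_sync:
            self.pending.append(txn)   # coalesce: reply only after the sync
            return None
        return ("committed", txn)      # fast path: reply now

    def sync(self):
        self.fsyncs += 1               # one fsync covers every parked txn
        done, self.pending = [("committed", t) for t in self.pending], []
        return done
```

Two parked transactions cost one fsync between them, which is the source of the fsync reduction reported above.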
| * | Spotted and corrected some mistakes where messages published to the mixed queue in disk-only mode would not be marked delivered even if they were persistent, resulting in redelivery on broker startup without the messages being marked redelivered. Matthew Sackman, 2009-07-19, 2 files, -7/+7
| | |   Also spotted (but not yet fixed) a bug in commit coalescing: the mnesia transaction always commits before the messages are flushed to disk. What should happen is that if coalescing is going to happen, the mnesia transaction is delayed too, and happens only after the disk sync. That is, it does not matter if we sync to disk and then the mnesia txn fails, but it does matter if the mnesia txn succeeds and then the disk sync fails. Also, I think I have worked out how to do prefetching properly; it is not actually that complex.
| * | OK: limits on the cache, and on prefetch. Matthew Sackman, 2009-07-17, 1 file, -18/+30
| | |   I decided the right thing to do is to prefer older messages in the cache to younger ones, because they are more likely to be used sooner. That means we just fill the cache up and then leave it alone, which is nice and simple. Things are pretty much OK with it now, but the whole notion of prefetch is still wrong and needs to be driven by the mixed queue, not the disk_queue. For one thing, currently, if two or more queues issue prefetch requests and the first fills the cache, the second achieves nothing. The cache is useful, but should not be abused for prefetching purposes; the two things are separate.
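The fill-up-and-leave-alone policy might be sketched like so (a toy Python model; the actual cache is an ets table and the names here are invented): once full, new entries are rejected, so the oldest entries, which are expected to be used soonest, stay resident.

```python
# Toy model of an "older entries win" cache: fill to capacity, then leave
# it alone; later insert attempts are simply refused.
class FillOnceCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self.entries = {}

    def put(self, msg_id, payload):
        if msg_id in self.entries or len(self.entries) < self.capacity:
            self.entries[msg_id] = payload
            return True
        return False   # full: the older residents are preferred

    def get(self, msg_id):
        return self.entries.get(msg_id)
```

This also makes the complaint above concrete: if one queue's prefetch fills the cache, a second queue's prefetch is refused entirely.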
| * | Part 2 done: the mixed_queue is back to using only one queue. Matthew Sackman, 2009-07-17, 3 files, -152/+114
| | |   Start-up time is not too bad with big queues, and memory use is stable. In the disk_queue, when iterating through the mnesia table, do the normal limited batching for the removal of non-persistent messages.
| * | The use of the in-memory run-length queue in the disk_only queue is considered a show stopper, and rightly so. Matthew Sackman, 2009-07-17, 3 files, -67/+98
| | |   I personally don't like the idea of adding additional tokens to the disk queue to indicate a queue switch: it can substantially increase the number of OS calls, writes, and reads from disk, and it makes getting the queue length and memory size right a fair bit more complex. So abandon the two-queues idea. Instead, store the persistent flag in the stop byte on disk; on startup, the flag then turns up in the MsgLocations ets table. This is all done and all tests pass. The next stage is, on start-up, to go through each queue and just wipe out non-persistent messages, which should be pretty fast, then call the shuffle_up function as currently done, which eliminates the gaps in the sequences. That really should be enough, and the mixed_queue can then go back to talking about a single queue.
| * | Well, it's better: the memory size is now recovered at start-up by doing a foldl over the entire queue. Matthew Sackman, 2009-07-16, 5 files, -73/+94
| | |   This seems excessive, but it works: it takes only 75 seconds on my machine to get through 1e6 1024-byte messages, and 160 seconds for 2e6 1024-byte messages, so it no longer worries me. It is also done in constant memory... ish [0]. Also fixed the queue_mode_manager: registration no longer produces a mode. Instead, a queue is assumed to start in disk-only mode, and the first memory report results in the correct mode being set. This is safe, and prevents a potentially deadly prefetch being sent when a queue starts up in mixed mode only to be sent to disk_only mode. The disk_queue, however, has to start in mixed mode, because otherwise it has no way to estimate its memory use for disk mode; so it registers and then sends a report of zero memory use. That guarantees it can be put in mixed mode, and it can then respond as necessary to the queue_mode_manager. I have done nothing further at this stage about the use of the Erlang queue in the mixed_queue module when in disk mode (the potential per-message cost). Really, you do not want to send individual entries to the disk_queue here; you want to batch them up, which makes this rather more complex.
| | |   [0] Sort of wrong. The foldl can use the cache, and for not-too-big queues sharing messages that is clearly a good thing; but if there are lots of shared messages it all goes wrong, because the cache gets over-populated and exhausts memory. Furthermore, the foldl runs entirely in the disk_queue process, so during the foldl it cannot report memory or respond to requests to change its mode. All of which points pretty strongly to the requirement that the prefetch be somewhat more sophisticated.
| * | Substantial changes to mixed_queue. Matthew Sackman, 2009-07-15, 2 files, -224/+264
| | |   Previously, persistent and non-persistent messages went into the same queue on disk. The advantage is that you need not track which queue you are currently reading from, or for how many messages; the downside is that on queue recovery you must iterate through the entire queue and delete all non-persistent messages, which takes a huge amount of time. So this has changed: each amqqueue is now two on-disk queues, one for persistent messages and one for non-persistent. Queue recovery is thus trivial: just delete the non-persistent queue. However, in disk mode we now always use the Erlang queue in mixed_queue to track how many messages to read from each queue (i.e. run-length encoding). In the worst case (alternating persistent and non-persistent messages) this is per-message cost; it is possible we need some sort of disk-based queue (AGH!), not sure. Provided the queue contains only one sort of message, it degenerates to a simple single counter. All tests pass. However, there is a bug: on recovery, the RAM size of the queue is not known, so the report to the queue_mode_manager on queue recovery is incorrect (it starts at 0 and can go negative). I have not decided how to fix this yet, because I do not want to iterate through all the messages just to get the queue size!
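The run-length idea can be sketched as follows (illustrative Python with invented names; the real structure is an Erlang queue inside mixed_queue): each run records which on-disk queue the next messages come from and how many, so a uniform queue collapses to one counter while alternating messages degenerate to one run per message, exactly the worst case noted above.

```python
from collections import deque

# Toy run-length queue: runs of (is_persistent, count) tell the reader
# which of the two on-disk queues the next message must be taken from.
class RunLengthQueue:
    def __init__(self):
        self.runs = deque()   # each element: [is_persistent, count]

    def publish(self, is_persistent):
        if self.runs and self.runs[-1][0] == is_persistent:
            self.runs[-1][1] += 1          # extend the current run
        else:
            self.runs.append([is_persistent, 1])

    def next_source(self):
        # Which on-disk queue the next message is read from.
        is_persistent, count = self.runs[0]
        if count == 1:
            self.runs.popleft()
        else:
            self.runs[0][1] -= 1
        return is_persistent
```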
| * | Just adding a bit more testing to bump the code coverage up. Matthew Sackman, 2009-07-10, 2 files, -1/+46
| | |
| * | Prefetch, part 1. Matthew Sackman, 2009-07-10, 3 files, -40/+85
| | |   When a mixed_queue sees that the next item in its queue is on disk, it issues a low-priority prefetch instruction to the disk queue, which populates the disk_queue's cache. Note that this should not impact memory: because the mixed_queue is in mixed mode, the contents of the queue are already accounted for in memory even though they were on disk. The effect is that when the deliver comes, there is no need to go to disk to read the message, as it is already in cache. Testing: a queue of 100,000 1 KB messages takes 15 seconds to drain (basic.get, noack) when the messages are in memory in the mixed queue; on disk without prefetch, 32 seconds; on disk with prefetch and a hot cache, 25 seconds. The next step is for the disk queue to signal back to the queue that the prefetch is done, and for the queue to grab the messages from the disk_queue in advance, so that on delivery all that is needed is the async acks being sent to the disk_queue (assuming the messages are not actually persistent).
| * | Adjusted documentation. Matthew Sackman, 2009-07-10, 1 file, -7/+11
| | |
| * | *cough*. Matthew Sackman, 2009-07-09, 1 file, -1/+2
| | |
| * | ...and with some testing and debugging, it might even work as described in the documentation! Matthew Sackman, 2009-07-09, 2 files, -28/+24
| | |
| * | additional documentation. Matthew Sackman, 2009-07-09, 1 file, -1/+9
| | |
| * | Initial work to support low-priority background tasks. Matthew Sackman, 2009-07-09, 5 files, -78/+236
| | |
| * | Merging in from 21087. Matthew Sackman, 2009-07-09, 2 files, -16/+17
| |\ \
| | | |   In testing, observed oscillation: with lots of queues, give two queues the same length such that they cannot both fit in memory, and slowly trickle in messages. As each receives a message, it forces the other one out to disk (the other being in either the hibernating or the low-rate group). This is bad. Therefore, the conditions under which we bring a queue back in from disk were adjusted to exclude queues that are either hibernating or low-rate (don't forget, even a list_queues wakes a queue and causes it to report memory). If you have two fast queues, neither will be in the low-rate or hibernating groups, so neither is a candidate for eviction and the problem does not exist there; instead, if they need more memory and cannot fit in RAM, they evict themselves to disk rather than anyone else. Also realised that a million queues is not unreasonable, so the minimum number of tokens in the system should be more like 1e7, if not higher.
| * | | length is never used in disk_queue, so removed. Matthew Sackman, 2009-07-09, 1 file, -21/+12
| | | |
| * | | minor documentation fix. Matthew Sackman, 2009-07-09, 1 file, -2/+1
| | | |
| * | | Fixes from removing the non-contiguous sequences support from the disk queue that I failed to spot last night, but which apparently came to me in my dreams. Matthew Sackman, 2009-07-09, 1 file, -31/+20
| | | |   I have no idea how the tests managed to pass last night...
| * | | The mixed queue tracks, in its own queue, whether the next message is on disk or not. Matthew Sackman, 2009-07-08, 4 files, -190/+83
| | | |   It does not use any sequence numbers, nor does it try to correlate queue position with sequence numbers in the disk_queue. There is therefore absolutely no reason for the disk_queue to carry all the complexity needed to cope with non-contiguous sequence ids, so it has all been removed. This makes the disk_queue a good bit simpler, and slightly faster in a few cases too. All tests pass.
| * | | Added requeue_next_n to the disk_queue and made use of it in mixed_queue:to_disk_only_mode. Matthew Sackman, 2009-07-08, 3 files, -28/+62
| | | |   This function moves the next N messages from the front of the queue to the back, and is MUCH more efficient than calling phantom_deliver and then requeue_with_seqs. It means that a queue which has been sent to disk, converted back to mixed mode, had some minor work done, and then been sent back to disk takes almost no time in transitions beyond the first one. The test: (1) declare a durable queue; (2) send 100,000 persistent messages to it; (3) send 100,000 non-persistent messages; (4) send 100,000 persistent messages; (5) pin it to disk: it makes two calls to requeue_next_n and should be rather quick, since only the middle 100,000 messages actually have to be written; the other 200,000 are not even sent between the disk_queue and the mixed_queue in either direction. A total of 100,003 calls are needed for this transition: 2 requeue_next_n, 100,000 tx_publish, 1 tx_commit. (6) Unpin it from disk and list the queues to wake it up: the transition to mixed mode is one call, zero reads, and instantaneous. (7) Repin it to disk: the mixed queue knows everything is still on disk, so it makes one call to requeue_next_n with N = 300,000; the disk_queue sees this covers the whole queue, so it need do no work at all and is instant. All tests pass.
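A minimal model of what requeue_next_n(N) does to the logical queue (a Python sketch; the function name comes from the commit above, the implementation here is invented): the front N messages move to the back in one operation, order preserved, with no per-message round trip.

```python
from collections import deque

def requeue_next_n(q, n):
    # Move the front n messages to the back in one step, preserving order;
    # the per-message phantom_deliver/requeue round trip disappears.
    q.rotate(-n)
    return q
```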
| * | | Found a bug in the memory reports in combination with hibernation. Matthew Sackman, 2009-07-08, 3 files, -52/+78
| | | |   If a process comes out of hibernation and then does under ten seconds' work before hibernating again, it only issues a memory report when it goes back into hibernation, and so always claims to the queue_mode_manager that it is hibernating. Now, when hibernating or when receiving the report_memory message, we set state such that when the next normal message comes in, we always send a memory report after that message. This ensures that when a process wakes up and does some real work, the queue_mode_manager is informed. Applied this, and the ability to hibernate, to the disk_queue too; plus some minor refactoring and better state field names. All tests pass, and the disk_queue really does hibernate with the binary backoff, as I wanted.
| * | | Sorted out rabbitmqctl so that it sends pinning commands to the queue_mode_manager rather than talking directly to the queues. Matthew Sackman, 2009-07-07, 3 files, -33/+100
| | | |   This means the queues and the queue mode manager cannot disagree on the mode a queue should be in.
| * | | Lots of tuning and testing. Totally rewrote to_disk_only_mode in mixed_queue so that it batches. Matthew Sackman, 2009-07-07, 6 files, -57/+103
| | | |   This means it no longer just floods the disk_queue with a billion messages, exhausting memory; instead it batches, using tx_commit to demarcate the batches, so the conversion happens as quickly as possible without exhausting memory. Dropped the memory alarms to 0.8. This is a good idea because converting queues between modes transiently takes a fair chunk of memory, and leaving the alarms up at 0.95 was proving too high, making the mode transitions exhaust RAM and swap to buggery. However, there is a problem when going to disk mode in the mixed queue where messages in the queue are already on disk: a million calls to phantom_deliver is not a good idea, and locks a CPU core at 100% for a very long time.
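The batching scheme might look roughly like this (a hypothetical Python sketch; tx_publish/tx_commit are the disk_queue calls named in the commit above, everything else, including the batch size, is invented): commits demarcate bounded batches, so the disk_queue mailbox never holds more than one batch's worth of unacknowledged work.

```python
# Toy model of a batched mixed -> disk transition: publish each message,
# committing every batch_size messages so memory stays bounded.
def to_disk_only_mode(messages, tx_publish, tx_commit, batch_size=1000):
    batch = []
    for msg in messages:
        batch.append(msg)
        tx_publish(msg)
        if len(batch) >= batch_size:
            tx_commit(batch)   # demarcates the batch; bounds the mailbox
            batch = []
    if batch:
        tx_commit(batch)       # flush the final partial batch
```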
| * | | oh look, another merge in from 21087. Matthew Sackman, 2009-07-06, 1 file, -7/+8
| |\ \ \
| * \ \ \ \ (another) merge from 21087 (not 21097, as mentioned in the previous commit). Matthew Sackman, 2009-07-06, 1 file, -28/+49
| |\ \ \ \
| * \ \ \ \ \ \ merge from bug21097. Matthew Sackman, 2009-07-06, 20 files, -919/+3984
| |\ \ \ \ \
| | * | | | | Added documentation. Matthew Sackman, 2009-07-06, 1 file, -0/+66
| | | | | | |
| | * | | | | Testing shows these values work well; the whole thing works pretty well. Matthew Sackman, 2009-07-06, 3 files, -4/+4
| | | | | | |   Obviously, converting a mixed queue to disk does take some time, and the values are deliberately set low to save memory, because on this transition the disk_queue mailbox will go insane and eat lots of memory very quickly. But it seems about the right balance. I'll add documentation next.
| | * | | | | Reworked. Because the disk -> mixed transition does not eat up any RAM, there is no need for the emergency tokens, nor for the weird doubling, so it has become much simpler. Matthew Sackman, 2009-07-03, 8 files, -212/+435
| | | | | | |   We hold two queues: one of hibernating queues (ordered by when they hibernated) and a priority_queue of low-rate queues (ordered by the amount of memory allocated to them). We evict to disk from the hibernated queues and then from the low-rate queues, in their respective orders. Seems to work. Oh, and the disk_queue is now managed by the tokens too.
| | * | | | | report memory: (a) every 10 seconds when not hibernating; (b) immediately prior to hibernating; (c) as soon as we stop hibernating. Matthew Sackman, 2009-07-03, 1 file, -5/+5
| | | | | | |
| | * | | | | wip, dnc. Matthew Sackman, 2009-07-02, 2 files, -27/+89
| | | | | | |
| | * | | | | cosmetic. Matthew Sackman, 2009-07-02, 1 file, -2/+2
| | | | | | |
| | * | | | | If we are not actually going to pull messages off disk when going to mixed mode, we may as well do it really lazily and not bother with any communication with the disk_queue at all. Matthew Sackman, 2009-07-02, 3 files, -84/+29
| | | | | | |   We just put a token in the queue indicating how many messages we are expecting to get from the disk queue. This makes disk -> mixed almost instantaneous. It also means performance is not initially brilliant; maybe we need some way for the queue to know that both it and the disk_queue are idle, and to decide to prefetch. Even batching could work well. It is an endless trade-off between making operations happen quickly and getting good performance. Dunno what the third thing is, probably not necessary, as you can't even have both of those, let alone pick two from three!
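The token trick can be sketched like this (toy Python, all names invented): the disk -> mixed transition enqueues a single marker recording how many messages are still owed by the disk queue, so the transition itself is O(1).

```python
from collections import deque

# Toy model: instead of reading messages back, enqueue one token that says
# "the next N messages live in the disk queue".
class MixedQueue:
    def __init__(self):
        self.q = deque()

    def to_mixed_mode(self, on_disk_count):
        if on_disk_count:
            self.q.append(("on_disk", on_disk_count))   # O(1) transition

    def expected_from_disk(self):
        return sum(n for tag, n in self.q if tag == "on_disk")
```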
| | * | | | | Sorted out the timer versus the hibernate binary backoff. Matthew Sackman, 2009-07-02, 1 file, -10/+17
| | | | | | |   The trick is to use apply_after, not apply_interval, and then, after reporting memory use, not to set a new timer going (but do set a new timer going on every other message, other than timeouts). This means that if nothing is going on after a memory report, the process can wait as long as it needs to before the hibernate timeout fires.
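The one-shot timer trick can be modelled as a small state machine (illustrative Python with invented names; the real code uses Erlang's timer:apply_after): the timeout itself never re-arms the timer, only a genuine message does, so an idle process has no pending timer and is free to hibernate.

```python
# Toy model of the apply_after discipline: one-shot timer, re-armed only
# by real messages, never by its own firing.
class MemoryReporter:
    def __init__(self):
        self.timer_armed = False
        self.reports = 0

    def on_timeout(self):
        self.reports += 1
        self.timer_armed = False   # one-shot: nothing re-armed while idle

    def on_message(self):
        if not self.timer_armed:
            self.timer_armed = True   # a real message re-arms the timer
```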
| | * | | | | Merge in from 21087. Matthew Sackman, 2009-07-02, 19 files, -908/+3675
| | |\ \ \ \ \
| | | | | | | |   Behaviour is now broken, because the timeout can exceed 10 seconds, which means the memory_report timer will always fire and reset the timeout, so the queue process will never hibernate.
| | | * | | | | When converting to disk mode, use tx_publish and tx_commit instead of publish. Matthew Sackman, 2009-07-01, 1 file, -9/+16
| | | | | | | |   This massively reduces the number of sync calls to the disk_queue, potentially to one if every message in the queue is non-persistent (or the queue is non-durable).
| | | * | | | | Well, after all that pain, simply doing the disk queue tests first seems to solve the problems. Matthew Sackman, 2009-07-01, 1 file, -3/+1
| | | | | | | |   I don't quite buy this, though: all I was doing was stopping and starting the app, so I don't understand why it was affecting the clustering configuration or causing issues much further down the test line. But still, it seems to be passing repeatedly for me at the moment.
| | | * | | | | Merge, but it still does not work. Matthew Sackman, 2009-06-30, 1 file, -5/+14
| | | |\ \ \ \ \
| | | | | | | | |   Sometimes it blows up on clustering with "All replicas on diskfull nodes are not active yet".
| | | | * | | | | Well, this seems to work. [bug19662] Matthew Sackman, 2009-06-30, 1 file, -4/+13
| | | | | | | | |
| | | | * | | | | and now clustering seems to work again... Matthew Sackman, 2009-06-30, 1 file, -4/+4
| | | | | | | | |