path: root/src/mem3
* Add mem3_util:find_dirty_shards function [fix-create-db-options] (Russell Branca, 2020-03-26; 1 file, -0/+42)
* Ensure shards are created with db options (Russell Branca, 2020-03-25; 3 files, -2/+41)
* Cleanup mem3 shards_db config lookups (Russell Branca, 2020-03-25; 3 files, -18/+10)
* Fix mem3_sync_event_listener test (Paul J. Davis, 2020-02-27; 1 file, -1/+13)

  There's a race between the meck:wait call in setup and killing the config_event process. It's possible that we could kill and restart the config_event process after meck:wait returns, but before gen_event:add_sup_handler is called. More likely, we could end up killing the config_event gen_event process before it has fully handled the add_sup_handler message and linked the notifier pid.

  This avoids the race by waiting for config_event to report that it has processed the add_sup_handler message, instead of relying on meck:wait for the subscription call.
* Replace Triq with PropEr (Nick Vatamaniuc, 2020-01-21; 2 files, -5/+34)

  PropEr was already used in the IOQ2 work, so all the plumbing to pull it in during dev testing was there, and it seemed awkward to have two different property testing frameworks for just a few tests. It is still an optional component and is not included in the release.
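  For illustration, a minimal PropEr property in the style these tests use (a generic sketch, not one of the actual mem3 properties; assumes proper is available as a test-only dependency):

  ```
  -module(mem3_prop_example).
  -include_lib("proper/include/proper.hrl").
  -export([prop_reverse_twice/0]).

  %% Reversing a list twice should yield the original list.
  prop_reverse_twice() ->
      ?FORALL(L, list(integer()),
              lists:reverse(lists:reverse(L)) =:= L).

  %% Run from a shell or an EUnit wrapper:
  %%   true = proper:quickcheck(prop_reverse_twice(), [{numtests, 100}]).
  ```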
* Debug mem3_sync_event_listener flakiness (Nick Vatamaniuc, 2020-01-16; 1 file, -0/+4)

  Noticed mem3_sync_event_listener tests still fail intermittently; add a debug log to it to hopefully find the cause of the failure.
* Eliminate multiple compiler warnings (Joan Touzet, 2020-01-13; 1 file, -1/+0)

  We now only support OTP 20+, with 19 at a stretch. erlang:now/0 was deprecated in OTP 18, so we can now suppress these warnings:

  ```
  /home/joant/couchdb/src/dreyfus/src/dreyfus_index_updater.erl:62: Warning: erlang:now/0: Deprecated BIF. See the "Time and Time Correction in Erlang" chapter of the ERTS User's Guide for more information.
  /home/joant/couchdb/src/dreyfus/src/dreyfus_index_updater.erl:83: Warning: erlang:now/0: Deprecated BIF. See the "Time and Time Correction in Erlang" chapter of the ERTS User's Guide for more information.
  ```

  Also, some unused variables were removed:

  ```
  /home/joant/couchdb/src/couch/src/couch_bt_engine.erl:997: Warning: variable 'NewSeq' is unused
  /home/joant/couchdb/src/mem3/src/mem3_rep.erl:752: Warning: variable 'TMap' is unused
  /home/joant/couchdb/src/dreyfus/src/dreyfus_httpd.erl:76: Warning: variable 'LimitValue' is unused
  /home/joant/couchdb/src/dreyfus/src/dreyfus_util.erl:345: Warning: variable 'Db' is unused
  ```

  PRs to follow in ets_lru, hyper, ibrowse to track the rest of `erlang:now/0` deprecations.
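  For reference, the usual replacements for the deprecated erlang:now/0, depending on what a call site needs (a general sketch, not the exact dreyfus change):

  ```
  -module(now_replacement_example).
  -export([wall_clock/0, elapsed_usec/1]).

  %% Wall-clock timestamp in the old {MegaSecs, Secs, MicroSecs} shape.
  wall_clock() ->
      os:timestamp().

  %% Elapsed time of Fun in microseconds, using the monotonic clock.
  elapsed_usec(Fun) ->
      T0 = erlang:monotonic_time(microsecond),
      Fun(),
      erlang:monotonic_time(microsecond) - T0.
  ```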
* Debug mem3 eunit error [reset-corrupt-view-index] (Paul J. Davis, 2020-01-10; 1 file, -1/+1)
* When shard splitting make sure to reset the targets before any retries (Nick Vatamaniuc, 2020-01-10; 2 files, -15/+5)

  Previously the target was reset only when the whole job started, but not when the initial copy phase restarted on its own. If that happened, we left the target around, so the retry always failed with the `eexist` error.

  Target reset has a check to make sure the shards are not in the global shard map, in case someone manually added them, for example. If they are found there, the job panics and exits.
* Lock shard splitting targets during the initial copy phase (Nick Vatamaniuc, 2020-01-05; 1 file, -1/+34)

  During the initial copy phase, target shards are opened outside the couch_server. Previously it was possible to manually (via remsh, for instance) open the same targets through the couch_server, for example with the `couch_db:open/2` API. That could lead to corruption as there would be two writers for the same DB file.

  In order to prevent such a scenario, introduce a mechanism for the shard splitter to lock the target shards such that any regular open call would fail during the initial copy phase.

  The locking mechanism is generic and would allow local locking of shards for possibly other reasons in the future as well.
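  A minimal sketch of the idea (illustrative only; the module, table, and error names here are hypothetical, not the actual couch/mem3 API): the open path consults a lock table before opening a target shard.

  ```
  -module(shard_lock_example).
  -export([init/0, lock_shard/2, unlock_shard/1, open_if_unlocked/2]).

  init() ->
      ets:new(shard_locks, [named_table, public, set]).

  lock_shard(ShardName, Reason) ->
      case ets:insert_new(shard_locks, {ShardName, Reason}) of
          true -> ok;
          false -> {error, already_locked}
      end.

  unlock_shard(ShardName) ->
      true = ets:delete(shard_locks, ShardName),
      ok.

  %% A regular open call checks the lock table first and refuses to open a
  %% shard that the splitter currently holds.
  open_if_unlocked(ShardName, OpenFun) ->
      case ets:lookup(shard_locks, ShardName) of
          [{ShardName, Reason}] -> {error, {locked, Reason}};
          [] -> OpenFun(ShardName)
      end.
  ```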
* Speedup eunit: mem3_sync_event_listener (Paul J. Davis, 2019-12-25; 1 file, -18/+35)
* Speedup eunit: mem3_shards (Paul J. Davis, 2019-12-25; 1 file, -18/+36)
* Speedup eunit: mem3_rep (Paul J. Davis, 2019-12-25; 1 file, -1/+1)
* Remove delayed commits option (Nick Vatamaniuc, 2019-09-26; 1 file, -2/+0)

  This effectively removes a lot of couch_db:ensure_full_commit/1,2 calls.

  Low-level fsync configuration options are also removed, as it might be tempting to start using those instead of delayed commits; however, unlike delayed commits, changing those defaults could lead to data corruption.

  The `/_ensure_full_commit` HTTP API was left as is since the replicator from older versions of CouchDB would call it; it just returns the start time as if the ensure_commit function was called.

  Issue: https://github.com/apache/couchdb/issues/2165
* Keep database property after overwriting shard map [keep-dbprop-after-rewriting-shardmap] (jiangph, 2019-08-24; 1 file, -1/+15)
* Fix mem3_sync_event_listener EUnit test (Nick Vatamaniuc, 2019-07-30; 1 file, -20/+8)

  Fix a race condition in state matching; also parameterize the state field in wait_state.
* Move eunit tests into test/eunit directory (ILYA Khlopotov, 2019-07-29; 10 files, -0/+0)
* Fix EUnit timeouts (#2087) (Adam Kocoloski, 2019-07-28; 5 files, -103/+129)

  * Proactively increase timeout for PBKDF2 test. This test was taking 134s in a recent run, which is uncomfortably close to the threshold.
  * Extend timeouts for all reshard API tests. We're observing timeouts on various tests in this suite, so let's keep it consistent and increase timeouts across the board.
  * Bump default timeout for all mem3_reshard tests. A couple of these tests were exceeding the default timeout under normal circumstances, but many of them do a significant amount of work, so for simplicity we set a module-wide timeout and apply it consistently throughout.
  * Modernize the sync_security test setup/teardown. This test actually doesn't do much real work, but I think what was happening is that the setup and teardown time was being charged to the test itself. I've refactored it to use a more modern scaffolding following some of our more recent additions to the test suite, but have left the timeout at the default to test this hypothesis.
  * Increase timeouts on more heavyweight mem3 tests.
  * Extend timeouts for replication tests.
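  The module-wide timeout pattern referred to above, sketched as a generic EUnit example (not the actual mem3_reshard test code):

  ```
  -module(timeout_example_tests).
  -include_lib("eunit/include/eunit.hrl").

  %% One module-wide timeout, applied consistently to heavyweight tests.
  -define(TIMEOUT, 60). % seconds

  slow_operation_test_() ->
      {timeout, ?TIMEOUT, fun() ->
          ok = timer:sleep(100),
          ?assert(true)
      end}.
  ```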
* Fix flaky mem3_sync_event_listener EUnit test (Nick Vatamaniuc, 2019-07-28; 1 file, -8/+31)

  Config setting was asynchronous and the waiting function was not waiting for the actual state value to change, just for the state function to return. The fix is to wait for the config value to propagate to the state.
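  The general shape of that fix is to poll until the observed value itself changes rather than until the setter returns. A self-contained sketch of such a helper (CouchDB's test_util has its own wait functions; the config key in the usage comment is made up for illustration):

  ```
  -module(wait_example).
  -export([wait_until/2]).

  wait_until(Fun, TimeoutMsec) ->
      wait_until(Fun, TimeoutMsec, 50).

  wait_until(_Fun, Timeout, _Delay) when Timeout =< 0 ->
      {error, timeout};
  wait_until(Fun, Timeout, Delay) ->
      case Fun() of
          true -> ok;
          false ->
              timer:sleep(Delay),
              wait_until(Fun, Timeout - Delay, Delay)
      end.

  %% Usage in a test: wait for the asynchronous config change to actually land.
  %%   ok = wait_until(fun() ->
  %%       config:get("mem3", "sync_frequency") =:= "100"
  %%   end, 5000).
  ```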
* Make mem3_rep:go work when target shards are not yet present in shard map (Nick Vatamaniuc, 2019-06-12; 1 file, -3/+31)

  Before shard splitting it was possible to replicate shards even if they were not in the shard map. This commit brings back that behavior.
* Handle database re-creation edge case in internal replicator (Nick Vatamaniuc, 2019-05-01; 3 files, -6/+33)

  Previously, if a database was deleted and re-created while the internal replication request was pending, the job would have been retried continuously. The mem3:targets_map/2 function would return an empty targets map, and mem3_rep:go would raise a function clause exception if the database was present but was an older "incarnation" of it (with shards living on different target nodes). Because it was an exception and not an {error, ...} result, the process would exit with an error. Subsequently, mem3_sync would try to handle the process exit and check if the database was deleted, but it also didn't account for the case when the database was re-created, so it would resubmit the job into the queue again.

  To fix it, we introduce a function to check if the database shard is part of the current database shard map, and perform the check both before building the targets map and also on job retries.
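  A sketch of that kind of membership check (illustrative; the actual helper added to mem3 may be named and structured differently, and the #shard{} record fields come from mem3's header):

  ```
  -module(shard_map_example).
  -export([is_current_shard/1]).

  -include_lib("mem3/include/mem3.hrl").

  %% True if a shard with this exact name and node is part of the current
  %% shard map for its database, i.e. it belongs to the live "incarnation".
  is_current_shard(#shard{name = Name, node = Node}) ->
      Shards = mem3:shards(mem3:dbname(Name)),
      lists:any(fun(#shard{name = N, node = No}) ->
          N =:= Name andalso No =:= Node
      end, Shards).
  ```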
* Increase max number of resharding jobs (Nick Vatamaniuc, 2019-04-24; 1 file, -1/+1)

  Also make it divisible by the default Q and N.
* Allow restricting resharding parameters (Nick Vatamaniuc, 2019-04-23; 2 files, -0/+51)

  To avoid inadvertently splitting all the shards in all the ranges due to user error, introduce options to enforce the presence of the node and range job creation parameters.
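  In configuration terms this looks roughly like the following (key names as in the CouchDB resharding configuration docs; treat them as an assumption and verify against your version):

  ```
  [reshard]
  ; refuse job creation requests that do not name a specific node
  require_node_param = true
  ; refuse job creation requests that do not name a specific shard range
  require_range_param = true
  ```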
* Fix upgrade clause for mem3_rpc:load_checkpoint/4,5 (Nick Vatamaniuc, 2019-04-11; 1 file, -1/+5)

  When upgrading, the new mem3_rpc:load_checkpoint with a filter hash arg won't be available on older nodes. Filter hashes are not currently used anyway, so to avoid crashes on a mixed cluster, call the older version without the filter hash part when the filter hash has the default <<>> value.
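  The shape of such an upgrade clause, heavily simplified (the real mem3_rpc goes through rexi, and the remote module/function in the rpc calls below is only a placeholder):

  ```
  -module(upgrade_clause_example).
  -export([load_checkpoint/5]).

  %% When the caller passes the default (unused) filter hash, fall back to the
  %% older 4-argument remote call so not-yet-upgraded nodes can still answer.
  load_checkpoint(Node, DbName, SourceNode, SourceUUID, <<>>) ->
      rpc:call(Node, mem3_rpc, load_checkpoint_rpc,
               [DbName, SourceNode, SourceUUID]);
  load_checkpoint(Node, DbName, SourceNode, SourceUUID, FilterHash) ->
      rpc:call(Node, mem3_rpc, load_checkpoint_rpc,
               [DbName, SourceNode, SourceUUID, FilterHash]).
  ```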
* Implement resharding HTTP API (Nick Vatamaniuc, 2019-04-03; 6 files, -0/+2522)

  This implements the API as defined in RFC #1920.

  The handlers live in `mem3_reshard_httpd`, and helpers, like validators, live in the `mem3_reshard_api` module.

  There are also a bunch of high level (HTTP & fabric) API tests that check that shard splitting happens properly, jobs behave as defined in the RFC, etc.

  Co-authored-by: Eric Avdey <eiri@eiri.ca>
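  For orientation, an example of the request shape this API accepts (based on the documented resharding endpoints; exact fields may vary by version, so treat this as a sketch):

  ```
  POST /_reshard/jobs
  {"type": "split", "db": "mydb", "node": "couchdb@node1.example.com", "range": "00000000-7fffffff"}

  GET /_reshard          # overall resharding state and job counts
  GET /_reshard/jobs     # list individual jobs and their states
  ```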
* Resharding supervisor and job manager (Nick Vatamaniuc, 2019-04-03; 8 files, -4/+1196)

  Most of the resharding logic lives in the mem3 application under the `mem3_reshard_sup` supervisor. `mem3_reshard_sup` has three children:

  1) `mem3_reshard`: The main resharding job manager.
  2) `mem3_reshard_job_sup`: A simple-one-for-one supervisor to keep track of individual resharding jobs.
  3) `mem3_reshard_dbdoc`: Helper gen_server used to update the shard map.

  The `mem3_reshard` gen_server is the central point in the resharding logic. It is a job manager which accepts new jobs, monitors jobs when they run, checkpoints their status as they make progress, and knows how to restore their state when a node reboots. Jobs are represented as instances of the `#job{}` record defined in the `mem3_reshard.hrl` header. There is also a global resharding state represented by a `#state{}` record.

  The `mem3_reshard` gen_server maintains an ets table of "live" `#job{}` records as part of its gen_server state, which is represented by `#state{}`. When jobs are checkpointed or the user updates the global resharding state, `mem3_reshard` will use the `mem3_reshard_store` module to persist those updates to `_local/...` documents in the shards database. The idea is to allow jobs to persist across node or application restarts.

  After a job is added, if the global state is not `stopped`, the `mem3_reshard` manager will ask `mem3_reshard_job_sup` to spawn a new child. That child will be running in a gen_server defined in the `mem3_reshard_job` module (included in subsequent commits). Each child process will periodically ask the `mem3_reshard` manager to checkpoint when it jumps to a new state. `mem3_reshard` checkpoints and then informs the child to continue its work.
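  Roughly, the supervision tree described above could be expressed like this (a simplified sketch with an assumed restart strategy and abbreviated child specs, not the actual mem3_reshard_sup code):

  ```
  -module(reshard_sup_example).
  -behaviour(supervisor).
  -export([start_link/0, init/1]).

  start_link() ->
      supervisor:start_link({local, ?MODULE}, ?MODULE, []).

  init([]) ->
      Children = [
          %% Job manager: accepts jobs, checkpoints them, restores them on boot.
          #{id => mem3_reshard,
            start => {mem3_reshard, start_link, []}},
          %% simple_one_for_one supervisor for the individual running jobs.
          #{id => mem3_reshard_job_sup,
            start => {mem3_reshard_job_sup, start_link, []},
            type => supervisor},
          %% Helper gen_server that serializes shard map document updates.
          #{id => mem3_reshard_dbdoc,
            start => {mem3_reshard_dbdoc, start_link, []}}
      ],
      {ok, {#{strategy => one_for_all, intensity => 5, period => 10}, Children}}.
  ```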
* Shard splitting job implementation (Nick Vatamaniuc, 2019-04-03; 5 files, -0/+1575)

  This is the implementation of the shard splitting job. The `mem3_reshard` manager spawns `mem3_reshard_job` instances via the `mem3_reshard_job_sup` supervisor.

  Each job is a gen_server process that starts in `mem3_reshard_job:init/1` with a `#job{}` record instance as the argument. Then the job goes through recovery, so it can handle resuming in cases where the job was interrupted previously and it was initialized from a checkpointed state. Checkpointing happens in the `mem3_reshard` manager with the help of the `mem3_reshard_store` module (introduced in a previous commit).

  After recovery, processing starts in the `switch_state` function. The states are defined as a sequence of atoms in a list in `mem3_reshard.hrl`. In the `switch_state()` function, the state and history are updated in the `#job{}` record, then the `mem3_reshard` manager is asked to checkpoint the new state. The job process waits for the `mem3_reshard` manager to notify it when checkpointing has finished so it can continue processing the new state. That happens when the `do_state` gen_server cast is received.

  The `do_state` function has state-matching heads for each state. Usually, if there are long running tasks to be performed, `do_state` will spawn a few workers and perform all the work in there. In the meantime the main job process will simply wait for all the workers to exit. When that happens, it will call `switch_state` to switch to the new state, checkpoint again, and so on.

  Since there are quite a few steps needed to split a shard, some of the helper functions needed are defined in separate modules, such as:

  * mem3_reshard_index: Index discovery and building.
  * mem3_reshard_dbdoc: Shard map updates.
  * couch_db_split: Initial (bulk) data copy (added in a separate commit).
  * mem3_rep: To perform "top-offs" in between some steps.
* Update internal replicator to handle split shards (Nick Vatamaniuc, 2019-04-03; 3 files, -192/+849)

  Shard splitting will result in uneven shard copies. Previously the internal replicator knew how to replicate from one shard copy to another, but now it needs to know how to replicate from one source to possibly multiple targets.

  The main idea is to reuse the same logic and "pick" function as `couch_db_split`. But to avoid the penalty of calling the custom hash function for every document even in cases where there is just a single target, there is a special "1 target" case where the hash function is `undefined`.

  Another case where the internal replicator is used is to top off replication and to replicate the shard map dbs to and from the current node (used in the shard map update logic). For that reason there are a few helper mem3_util and mem3_rpc functions.
* Uneven shard copy handling in mem3 and fabric (Nick Vatamaniuc, 2019-04-03; 7 files, -30/+453)

  The introduction of shard splitting will eliminate the constraint that all document copies are located in shards with the same range boundaries. That assumption was made by default in mem3 and fabric functions that do shard replacement, worker spawning, unpacking `_changes` update sequences, and some others. This commit updates those places to handle the case where document copies might be in different shard ranges.

  A good place to start from is the `mem3_util:get_ring()` function. This function returns a full non-overlapping ring from a set of possibly overlapping shards. This function is used by almost everything else in this commit:

  1) It's used when only a single copy of the data is needed, for example in _all_docs or _changes processing.

  2) Used when checking if progress is possible after some nodes died. `get_ring()` returning `[]` when it cannot find a full ring is used to indicate that progress is not possible.

  3) During shard replacement. This is perhaps the most complicated case. During replacement, besides just finding a possible covering of the ring from the set of shards, it is also desirable to find one that minimizes the number of workers that have to be replaced. A neat trick used here is to provide `get_ring` with a custom sort function, which prioritizes certain shard copies over others. In the case of replacements it prioritizes shards for which workers have already spawned. In the default case `get_ring()` will prioritize longer ranges over shorter ones, so for example, to cover the interval [00-ff] with either [00-7f, 80-ff] or [00-ff] shard ranges, it will pick the single [00-ff] range instead of the [00-7f, 80-ff] pair.

  Co-authored-by: Paul J. Davis <davisp@apache.org>
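  A minimal sketch of how a caller might lean on `mem3_util:get_ring/1` as described above (usage shape only; the wrapper function below is made up for illustration):

  ```
  -module(ring_example).
  -export([pick_ring/1]).

  %% Given the currently live shard copies (possibly overlapping after a
  %% split), pick one full non-overlapping covering of the range. get_ring/1
  %% returns [] when no such covering exists, meaning progress is impossible.
  pick_ring(LiveShards) ->
      case mem3_util:get_ring(LiveShards) of
          [] -> {error, no_full_ring};
          Ring -> {ok, Ring}
      end.
  ```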
* Add new /{db}/_sync_shards endpoint (admin-only) (#1811) (Joan Touzet, 2019-01-18; 2 files, -1/+20)

  This server admin-only endpoint forces an n-way sync of all shards across all nodes on which they are hosted.

  This can be useful for an administrator adding a new node to the cluster, after updating _dbs so that the new node hosts an existing db with content, to force the new node to sync all of that db's shards.

  Users may want to bump their `[mem3] sync_concurrency` value to a larger figure for the duration of the shards sync.

  Closes #1807
* Implement partitioned dbs (Paul J. Davis, 2019-01-18; 1 file, -2/+10)

  This change introduces the ability for users to place a group of documents in a single shard range by specifying a "partition key" in the document id. A partition key is denoted by everything preceding a colon ':' in the document id. Every document id (except for design documents) in a partitioned database is required to have a partition key.

  Co-authored-by: Garren Smith <garren.smith@gmail.com>
  Co-authored-by: Robert Newson <rnewson@apache.org>
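  The partition-key convention spelled out above, as a small sketch (not the actual CouchDB validation code):

  ```
  -module(partition_example).
  -export([partition_of/1]).

  %% Everything before the first ':' in a (non-design) document id names the
  %% partition the document belongs to.
  partition_of(<<"_design/", _/binary>>) ->
      undefined;  % design documents are exempt
  partition_of(DocId) when is_binary(DocId) ->
      case binary:split(DocId, <<":">>) of
          [Partition, _Rest] -> {ok, Partition};
          [_NoColon]         -> {error, missing_partition}
      end.
  ```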
* Implement configurable hash functions (Paul J. Davis, 2019-01-18; 6 files, -21/+116)

  This provides the capability for features to specify alternative hash functions for placing documents in a given shard range. While the functionality exists with this implementation, it is not yet actually used.
* Remove explicit modules list from .app.src files (Jay Doane, 2018-12-27; 1 file, -14/+0)

  The modules lists in .app files are automatically generated by rebar from .app.src files, so these explicit lists are unnecessary and prone to being out of date.
* Reduce number of behaviour undefined compiler warnings (Jay Doane, 2018-12-27; 1 file, -4/+9)

  Move couch_event and mem3 modules earlier in the list of SubDirs to suppress behaviour undefined warnings.

  This has the side effect of running the tests in the new order, which induces failures in couch_index tests. Those failures are related to quorum, and can be traced to mem3 seeds tests leaving a _nodes db containing several node docs in the tmp/data directory, ultimately resulting in badmatch errors, e.g. when a test expects 'ok' but gets 'accepted' instead. To prevent test failures, a cleanup function is implemented which deletes any existing "nodes_db" left after test completion.
* Suppress crypto and random compiler warnings (Jay Doane, 2018-12-27; 1 file, -2/+12)

  Replace deprecated crypto:rand_uniform/2 and 'random' module functions with equivalent couch_rand:uniform/1 calls, or eliminate the offending code entirely if unused.

  Note that crypto:rand_uniform/2 takes two parameters which have different semantics than the single argument couch_rand:uniform/1.

  Tests in mem3 are also provided to validate that the random rotation of node lists was converted correctly.
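  To make the semantic difference concrete: crypto:rand_uniform(Lo, Hi) returned an integer in [Lo, Hi), while a one-argument uniform/1 in the style of rand:uniform/1 returns an integer in [1, N], so a like-for-like replacement has to shift the range. A generic sketch using stdlib rand rather than couch_rand:

  ```
  -module(rand_example).
  -export([uniform_between/2]).

  %% Uniform integer in [Lo, Hi), mirroring the old crypto:rand_uniform/2
  %% contract on top of rand:uniform/1.
  uniform_between(Lo, Hi) when is_integer(Lo), is_integer(Hi), Hi > Lo ->
      Lo + rand:uniform(Hi - Lo) - 1.
  ```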
* Suppress unused variable and type compiler warnings (Jay Doane, 2018-12-27; 2 files, -2/+1)
* Filter out empty missing_revs results in mem3_rep (Nick Vatamaniuc, 2018-12-06; 1 file, -2/+6)

  This avoids needlessly making cross-cluster fabric:update_docs(Db, [], Opts) calls.
* Enable cluster auto-assembly through a seedlist (#1658) (Adam Kocoloski, 2018-11-10; 8 files, -17/+270)

  This introduces a new config setting which allows an administrator to configure an initial list of nodes that should be contacted when a node boots up:

    [cluster]
    seedlist = couchdb@node1.example.com,couchdb@node2.example.com,couchdb@node3.example.com

  If configured, CouchDB will add every node in the seedlist to the _nodes DB automatically, which will trigger a distributed Erlang connection and a replication of the internal system databases to the local node. This eliminates the need to explicitly add each node using the HTTP API.

  We also modify the /_up endpoint to reflect the progress of the initial seeding of the node. If a seedlist is configured the endpoint will return 404 until the local node has updated its local replica of each of the system databases from one of the members of the seedlist. Once the status flips to "ok" the endpoint will return 200 and it's safe to direct requests to the new node.
* Allow disabling off-heap messages (Nick Vatamaniuc, 2018-09-06; 1 file, -1/+1)

  Off-heap messages are an Erlang 19 feature: http://erlang.org/doc/man/erlang.html#process_flag_message_queue_data

  It is advisable to use that setting for processes which expect to receive a lot of messages. CouchDB sets it for couch_server, couch_log_server and a bunch of others as well.

  In some cases the off-heap behavior could alter the timing of message receives and expose subtle bugs that have been lurking in the code for years, or could slightly reduce performance, so as a safety measure allow disabling it.
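  The flag in question is the standard process_flag/2 option; a minimal gen_server sketch of how a process opts in (generic example, not CouchDB's wrapper):

  ```
  -module(off_heap_example).
  -behaviour(gen_server).
  -export([start_link/0]).
  -export([init/1, handle_call/3, handle_cast/2]).

  start_link() ->
      gen_server:start_link(?MODULE, [], []).

  %% Keep this process's message queue off its main heap so a long queue does
  %% not inflate garbage collection cost.
  init([]) ->
      process_flag(message_queue_data, off_heap),
      {ok, #{}}.

  handle_call(_Msg, _From, State) -> {reply, ok, State}.
  handle_cast(_Msg, State) -> {noreply, State}.
  ```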
* Log warning when changes seq rewinds to 0 (Jay Doane, 2018-09-04; 1 file, -0/+7)
* Implement convenience `mem3:ping/2` function (ILYA Khlopotov, 2018-08-30; 1 file, -0/+24)

  Sometimes in operations it is helpful to re-establish the connection between Erlang nodes. Usually this is achieved by calling `net_adm:ping/1`. However, the `ping` function provided by OTP uses an `infinity` timeout, which causes an indefinite hang in some cases. This PR adds a convenience function to be used instead of `net_adm:ping/1`.
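  A sketch of what a bounded ping can look like (illustrative, not necessarily mem3's implementation): ask the remote net_kernel the same `is_auth` question that net_adm:ping/1 sends, but with an explicit timeout.

  ```
  -module(ping_example).
  -export([ping/2]).

  ping(Node, Timeout) when is_atom(Node), is_integer(Timeout) ->
      try gen_server:call({net_kernel, Node}, {is_auth, node()}, Timeout) of
          yes -> pong;
          _   -> pang
      catch
          _:_ -> pang  % timeout, connection refused, or node down
      end.
  ```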
* Fix dialyzer warning of shard record construction (jiangph, 2018-08-28; 1 file, -6/+6)

  Fix the dialyzer warning that record construction #shard violates the declared type in fabric_doc_open_revs.erl, cpse_test_purge_replication.erl and other files.

  Fixes #1580
* Improve validation of database creation parameters [improve-dbcreate-validation] (Robert Newson, 2018-08-27; 2 files, -5/+12)
* [07/10] Clustered Purge: Internal replication (Paul J. Davis, 2018-08-22; 4 files, -14/+287)

  This commit implements the internal replication of purge requests. This part of the anti-entropy process is important for ensuring that shard copies continue to be eventually consistent even if updates happen to shards independently due to a network split or other event that prevents a successful purge request to a given copy.

  The main addition to internal replication is that we both pull and push purge requests between the source and target shards. The push direction is obvious given that internal replication is in the push direction already. Pull isn't quite as obvious but is required so that we don't push an update that was already purged on the target.

  Of note is that internal replication also has to maintain _local doc checkpoints to prevent compaction removing old purge requests, or else shard copies could end up missing purge requests, which would prevent the shard copies from ever reaching a consistent state.

  COUCHDB-3326

  Co-authored-by: Mayya Sharipova <mayyas@ca.ibm.com>
  Co-authored-by: jiangphcn <jiangph@cn.ibm.com>
* Simplify logic in mem3_rep [COUCHDB-3326-clustered-purge-pr2-simplify-mem3-rep] (Paul J. Davis, 2018-08-21; 1 file, -10/+9)

  Previously there were two separate database references and it was not clear which was used where. This simplifies things by reducing it to a single instance so that the logic is simpler.
* Make MD5 hash implementation configurable (#1171) (rokek, 2018-07-16; 1 file, -3/+3)
* Call `set_mqd_off_heap` for critical processes (Paul J. Davis, 2018-06-19; 1 file, -0/+1)

  This uses the new `couch_util:set_mqd_off_heap/0` function to set message queues to off_heap for some of our critical processes that receive a significant amount of load in terms of message volume.
* Fix mem3 tests (#1285) (Eric Avdey, 2018-04-17; 4 files, -93/+36)

  The changes listener started in the setup of the mem3_shards test was crashing when it tried to register on an unstarted couch_event server, so the test was either fast enough to do its assertions before that happened or it failed on a dead listener process.

  This change removes the dependency on mocking and uses test_util's standard start and stop of couch. The module start is moved into the test body to avoid masking potential failures in a setup. Also the tests mem3_sync_security_test and mem3_util_test have been modified to avoid setup and teardown side effects.
* Fix shard substitution in changes feeds (Paul J. Davis, 2018-03-28; 1 file, -3/+1)
* Implement pluggable storage engines (Paul J. Davis, 2018-02-28; 7 files, -60/+75)

  This change moves the main work of storage engines to run through the new couch_db_engine behavior. This allows us to replace the storage engine with different implementations that can be tailored to specific work loads and environments.

  COUCHDB-3287
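  The mechanism is the ordinary Erlang behaviour pattern: a callback contract that each engine module implements and that the database layer dispatches through. A toy illustration (the callback names below are invented for the sketch; couch_db_engine's real callback list is much larger):

  ```
  %% file: toy_db_engine.erl -- the behaviour an engine implements
  -module(toy_db_engine).
  -callback init(FilePath :: string(), Options :: list()) -> {ok, State :: term()}.
  -callback write_doc(State :: term(), Doc :: term()) -> {ok, NewState :: term()}.
  -callback terminate(Reason :: term(), State :: term()) -> ok.
  ```

  ```
  %% file: toy_mem_engine.erl -- a trivial in-memory engine behind that contract
  -module(toy_mem_engine).
  -behaviour(toy_db_engine).
  -export([init/2, write_doc/2, terminate/2]).

  init(_Path, _Opts) -> {ok, []}.
  write_doc(Docs, Doc) -> {ok, [Doc | Docs]}.
  terminate(_Reason, _Docs) -> ok.
  ```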