| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's a race between the meck:wait call in setup and killing the
config_event process. It's possible that we could kill and restart the
config_event process after meck:wait returns, but before
gen_event:add_sup_handler is called. More likely, we could end up
killing the config_event gen_event process before it has fully handled the
add_sup_handler message and linked the notifier pid.
This avoids the race by waiting for config_event to confirm that it has
processed the add_sup_handler message, instead of relying on meck:wait
for the subscription call.
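A minimal sketch of the waiting approach (handler id and helper name assumed, not the actual patch): because gen_event processes requests in order, a synchronous request such as `which_handlers/1` only returns after the earlier add_sup_handler message has been handled and the notifier pid linked.
```
-module(wait_config_sub_sketch).
-export([wait_subscribed/1]).

%% Poll config_event until the expected handler shows up. The
%% synchronous round-trip guarantees add_sup_handler was processed.
wait_subscribed(HandlerId) ->
    case lists:member(HandlerId, gen_event:which_handlers(config_event)) of
        true ->
            ok;
        false ->
            timer:sleep(10),
            wait_subscribed(HandlerId)
    end.
```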
|
|
|
|
|
|
|
|
| |
It was already used in the IOQ2 work, so all the plumbing to pull it in during
dev testing was there, and it seems awkward to have two different property
testing frameworks for just a few tests.
It is still an optional component and is not included in the release.
|
|
|
|
|
| |
Noticed that the mem3_sync_event_listener tests still fail intermittently; add a
debug log to hopefully find the cause of the failure.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We now only support OTP 20+, with 19 at a stretch. erlang:now/0
was deprecated in OTP 18, so we can now address these warnings:
```
/home/joant/couchdb/src/dreyfus/src/dreyfus_index_updater.erl:62: Warning: erlang:now/0: Deprecated BIF. See the "Time and Time Correction in Erlang" chapter of the ERTS User's Guide for more information.
/home/joant/couchdb/src/dreyfus/src/dreyfus_index_updater.erl:83: Warning: erlang:now/0: Deprecated BIF. See the "Time and Time Correction in Erlang" chapter of the ERTS User's Guide for more information.
```
Also, some unused variables were removed:
```
/home/joant/couchdb/src/couch/src/couch_bt_engine.erl:997: Warning: variable 'NewSeq' is unused
/home/joant/couchdb/src/mem3/src/mem3_rep.erl:752: Warning: variable 'TMap' is unused
/home/joant/couchdb/src/dreyfus/src/dreyfus_httpd.erl:76: Warning: variable 'LimitValue' is unused
/home/joant/couchdb/src/dreyfus/src/dreyfus_util.erl:345: Warning: variable 'Db' is unused
```
PRs to follow in ets_lru, hyper, ibrowse to track the rest of `erlang:now/0`
deprecations.
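The replacement pattern is mechanical; a typical before/after is sketched below (the exact API chosen may differ per call site):
```
-module(now_replacement_sketch).
-export([elapsed_us/1]).

%% os:timestamp/0 returns the same {MegaSecs, Secs, MicroSecs} triple
%% as the deprecated erlang:now/0 (minus its uniqueness guarantee),
%% so timer:now_diff/2 keeps working unchanged.
%% Before: timer:now_diff(erlang:now(), Start)
elapsed_us(Start) ->
    timer:now_diff(os:timestamp(), Start).
```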
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously the target was reset only when the whole job started, but not when
the initial copy phase restarted on its own. If that happened, we left the
target around, so the retry always failed with an `eexist` error.
The target reset has a check to make sure the shards are not in the global shard
map, in case someone added them manually, for example. If they are found there,
the job panics and exits.
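A rough sketch of the reset guard described above (names assumed, not the actual code):
```
-module(reset_target_sketch).
-export([reset_target/1]).

-include_lib("mem3/include/mem3.hrl").
-include_lib("couch/include/couch_db.hrl").

%% Never delete a target shard file if that shard is still present in
%% the global shard map; otherwise remove it so the initial copy phase
%% can restart cleanly.
reset_target(#shard{name = Name, dbname = DbName}) ->
    InGlobalMap = lists:any(
        fun(#shard{name = N}) -> N =:= Name end,
        mem3:shards(DbName)
    ),
    case InGlobalMap of
        true -> error({target_shard_in_global_map, Name});
        false -> couch_server:delete(Name, [?ADMIN_CTX])
    end.
```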
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
During the initial copy phase, target shards are opened outside the
couch_server. Previously, it was possible to manually (via remsh, for instance)
open the same targets through the couch_server by using the `couch_db:open/2`
API. That could lead to corruption, as there would be two writers for
the same DB file.
In order to prevent such a scenario, introduce a mechanism for the shard
splitter to lock the target shards so that any regular open call fails
during the initial copy phase.
The locking mechanism is generic and would also allow local locking of shards
for other reasons in the future.
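Purely as an illustration of the idea (the lock/unlock entry points below are hypothetical names, not necessarily the real API):
```
-module(lock_targets_sketch).
-export([lock_targets/1, unlock_targets/1]).

%% Lock every target before the bulk copy so regular couch_server
%% opens fail with a descriptive reason, then unlock once the initial
%% copy phase is done.
lock_targets(TargetNames) ->
    lists:foreach(fun(Name) ->
        ok = couch_server:lock(Name, <<"shards being built by the shard splitter">>)
    end, TargetNames).

unlock_targets(TargetNames) ->
    lists:foreach(fun(Name) ->
        ok = couch_server:unlock(Name)
    end, TargetNames).
```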
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This effectively removes a lot of couch_db:ensure_full_commit/1,2 calls.
Low-level fsync configuration options are also removed: it might be tempting
to start using those in place of delayed commits, but unlike delayed
commits, changing those defaults could lead to data corruption.
The `/_ensure_full_commit` HTTP API was left as-is since replicators from older
versions of CouchDB still call it; it simply returns the instance start time as
if the ensure-commit function had been called.
Issue: https://github.com/apache/couchdb/issues/2165
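For reference, the retained endpoint still answers like this (request/response shown for illustration; the instance_start_time value is a placeholder):
```
POST /dbname/_ensure_full_commit HTTP/1.1
Accept: application/json
Content-Type: application/json

HTTP/1.1 201 Created
Content-Type: application/json

{"ok": true, "instance_start_time": "0"}
```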
|
| |
|
|
|
|
|
| |
Fix a race condition in state matching; also parameterize the state
field in wait_state.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* Proactively increase timeout for PBKDF2 test
This test was taking 134s in a recent run, which is uncomfortably close
to the threshold.
* Extend timeouts for all reshard API tests
We're observing timeouts on various tests in this suite so let's keep
it consistent and increase timeouts across the board.
* Bump default timeout for all mem3_reshard tests
A couple of these tests were exceeding the default timeout under normal
circumstances, but many of them do a significant amount of work, so for
simplicity we set a module-wide timeout and apply it consistently
throughout (a minimal sketch of the pattern follows this list).
* Modernize the sync_security test setup/teardown
This test actually doesn't do much real work, but I think what was
happening is that the setup and teardown time was being charged to the
test itself. I've refactored it to use a more modern scaffolding
following some of our more recent additions to the test suite, but have
left the timeout at the default to test this hypothesis.
* Increase timeouts on more heavyweight mem3 tests
* Extend timeouts for replication tests
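A minimal sketch of the module-wide timeout pattern mentioned above (timeout value assumed; each suite picks its own):
```
-module(timeout_sketch_tests).
-include_lib("eunit/include/eunit.hrl").

-define(TIMEOUT, 60). % seconds; illustrative value

heavyweight_test_() ->
    %% wrap the test body so eunit allows it ?TIMEOUT seconds
    {timeout, ?TIMEOUT, fun() ->
        %% slow test body goes here
        ?assert(true)
    end}.
```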
|
|
|
|
|
|
|
|
| |
The config setting was asynchronous, and the waiting function was not
waiting for the actual state value to change, only for the state
function to return.
The fix is to wait for the config value to propagate into the state.
|
|
|
|
|
| |
Before shard splitting it was possible to replicate shards even if they were
not in the shard map. This commit brings back that behavior.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, if a database was deleted and re-created while the internal
replication request was pending, the job would have been retried continuously.
The mem3:targets_map/2 function would return an empty targets map and mem3_rep:go
would raise a function clause exception if the database was present but was an
older "incarnation" of it (with shards living on different target nodes).
Because it was an exception and not an {error, ...} result, the process would
exit with an error. Subsequently, mem3_sync would try to handle the process exit
and check if the database was deleted, but it didn't account for the case where
the database had been re-created, so it would resubmit the job into the queue again.
To fix it, we introduce a function to check if the database shard is part of
the current database shard map, and perform the check both before building the
targets map and on job retries.
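A hypothetical sketch of such a check (function name and error reason assumed):
```
-module(shard_map_check_sketch).
-export([shard_in_current_db/1]).

-include_lib("mem3/include/mem3.hrl").

%% A shard is only a valid internal replication source/target if the
%% *current* shard map for its database still lists it.
shard_in_current_db(ShardName) when is_binary(ShardName) ->
    DbName = mem3:dbname(ShardName),
    try mem3:shards(DbName) of
        Shards ->
            lists:any(fun(#shard{name = N}) -> N =:= ShardName end, Shards)
    catch
        %% error reason assumed: raised when the db is gone entirely
        error:database_does_not_exist ->
            false
    end.
```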
|
|
|
|
| |
Also make it divisible by default Q and N
|
|
|
|
|
|
| |
To avoid inadvertently splitting all the shards in all the ranges due to user
error, introduce an option to enforce the presence of the node and range
job creation parameters.
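With the option enabled, job creation requests that omit `node` or `range` are rejected; the knobs would look something like this (key names shown for illustration):
```
[reshard]
; reject job creation requests that do not explicitly name a node
require_node_param = true
; reject job creation requests that do not explicitly give a range
require_range_param = true
```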
|
|
|
|
|
|
|
|
|
| |
When upgrading, the new mem3_rpc:load_checkpoint with a filter hash argument won't
be available on older nodes.
Filter hashes are not currently used anyway, so to avoid crashes on a mixed
cluster, call the older version without the filter hash part when the filter has
the default <<>> value.
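A sketch of the compatibility shim (argument list assumed for illustration):
```
-module(load_checkpoint_compat_sketch).
-export([load_checkpoint/5]).

%% When the filter hash is the default <<>>, fall back to the old RPC
%% arity that pre-upgrade nodes still export.
load_checkpoint(Node, DbName, SourceNode, SourceUUID, <<>>) ->
    mem3_rpc:load_checkpoint(Node, DbName, SourceNode, SourceUUID);
load_checkpoint(Node, DbName, SourceNode, SourceUUID, FilterHash) ->
    mem3_rpc:load_checkpoint(Node, DbName, SourceNode, SourceUUID, FilterHash).
```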
|
|
|
|
|
|
|
|
|
|
|
|
| |
This implements the API as defined in RFC #1920.
The handlers live in `mem3_reshard_httpd`, and helpers, such as validators, live
in the `mem3_reshard_api` module.
There are also a bunch of high-level (HTTP & fabric) API tests that check that
shard splitting happens properly, jobs behave as defined in the RFC, etc.
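For example, a split job for a single shard range can be created and then inspected through the new endpoints (values illustrative):
```
POST /_reshard/jobs HTTP/1.1
Content-Type: application/json

{"type": "split", "db": "mydb", "node": "couchdb@node1.example.com", "range": "00000000-7fffffff"}

GET /_reshard/jobs/{jobid}   (inspect a single job)
GET /_reshard/state          (global resharding state)
```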
Co-authored-by: Eric Avdey <eiri@eiri.ca>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the resharding logic lives in the mem3 application under the
`mem3_reshard_sup` supervisor. `mem3_reshard_sup` has three children:
1) `mem3_reshard` : The main resharding job manager.
2) `mem3_reshard_job_sup` : A simple-one-for-one supervisor to keep track of
individual resharding jobs.
3) `mem3_reshard_dbdoc` : Helper gen_server used to update the shard map.
The `mem3_reshard` gen_server is the central point in the resharding logic. It is
a job manager which accepts new jobs, monitors jobs while they run, checkpoints
their status as they make progress, and knows how to restore their state when a
node reboots.
Jobs are represented as instances of the `#job{}` record defined in the
`mem3_reshard.hrl` header. There is also a global resharding state represented
by a `#state{}` record.
The `mem3_reshard` gen_server maintains an ets table of "live" `#job{}` records
as part of its gen_server state, which is represented by `#state{}`. When jobs
are checkpointed or the user updates the global resharding state, `mem3_reshard`
will use the `mem3_reshard_store` module to persist those updates to `_local/...`
documents in the shards database. The idea is to allow jobs to persist across
node or application restarts.
After a job is added, if the global state is not `stopped`, the `mem3_reshard`
manager will ask `mem3_reshard_job_sup` to spawn a new child. That child will be
running a gen_server defined in the `mem3_reshard_job` module (included in
subsequent commits). Each child process will periodically ask the `mem3_reshard`
manager to checkpoint when it jumps to a new state. `mem3_reshard` checkpoints
and then informs the child to continue its work.
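A rough sketch of that supervision layout (child specs simplified, not the literal module):
```
-module(mem3_reshard_sup_sketch).
-behaviour(supervisor).
-export([start_link/0, init/1]).

start_link() ->
    supervisor:start_link({local, ?MODULE}, ?MODULE, []).

init([]) ->
    Children = [
        {mem3_reshard_dbdoc,
            {mem3_reshard_dbdoc, start_link, []},
            permanent, 5000, worker, [mem3_reshard_dbdoc]},
        {mem3_reshard_job_sup,
            {mem3_reshard_job_sup, start_link, []},
            permanent, infinity, supervisor, [mem3_reshard_job_sup]},
        {mem3_reshard,
            {mem3_reshard, start_link, []},
            permanent, 5000, worker, [mem3_reshard]}
    ],
    {ok, {{one_for_all, 5, 10}, Children}}.
```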
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is the implementation of the shard splitting job. The `mem3_reshard` manager
spawns `mem3_reshard_job` instances via the `mem3_reshard_job_sup` supervisor.
Each job is a gen_server process that starts in `mem3_reshard_job:init/1` with a
`#job{}` record instance as the argument.
Then the job goes through recovery, so it can handle resuming in cases where
the job was interrupted previously and was initialized from a checkpointed
state. Checkpointing happens in the `mem3_reshard` manager with the help of the
`mem3_reshard_store` module (introduced in a previous commit).
After recovery, processing starts in the `switch_state` function. The states
are defined as a sequence of atoms in a list in `mem3_reshard.hrl`.
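For orientation, that list is just an ordered sequence of atoms; a hypothetical illustration of its shape (the exact atoms live in `mem3_reshard.hrl` and may differ):
```
%% hypothetical illustration; see mem3_reshard.hrl for the real list
-define(SPLIT_STATES, [
    new,
    initial_copy,
    topoff1,
    build_indices,
    topoff2,
    copy_local_docs,
    update_shardmap,
    wait_source_close,
    topoff3,
    source_delete,
    completed
]).
```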
In the `switch_state()` function, the state and history are updated in the
`#job{}` record, then the `mem3_reshard` manager is asked to checkpoint the new
state. The job process waits for the `mem3_reshard` manager to notify it when
checkpointing has finished so it can continue processing the new state. That
happens when the `do_state` gen_server cast is received.
The `do_state` function has state-matching heads for each state. Usually, if there
are long-running tasks to be performed, `do_state` will spawn a few workers and
perform all the work in there. In the meantime the main job process will simply
wait for all the workers to exit. When that happens, it will call
`switch_state` to switch to the new state, checkpoint again, and so on.
Since there are quite a few steps needed to split a shard, some of the helper
functions needed are defined in separate modules, such as:
* mem3_reshard_index : Index discovery and building.
* mem3_reshard_dbdoc : Shard map updates.
* couch_db_split : Initial (bulk) data copy (added in a separate commit).
* mem3_rep : To perform "top-offs" in between some steps.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Shard splitting will result in uneven shard copies. Previously the internal
replicator knew how to replicate from one shard copy to another, but now it
needs to know how to replicate from one source to possibly multiple targets.
The main idea is to reuse the same logic and "pick" function as
`couch_db_split`.
But to avoid the penalty of calling the custom hash function for every document
even in cases where there is just a single target, there is a special "1
target" case where the hash function is `undefined`.
Another case where the internal replicator is used is to top off replication and
to replicate the shard map dbs to and from the current node (used in the shard
map update logic). For that reason there are a few helper mem3_util and mem3_rpc
functions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The introduction of shard splitting will eliminate the constraint that all
document copies are located in shards with the same range boundaries. That
assumption was made by default in mem3 and fabric functions that do shard
replacement, worker spawning, unpacking `_changes` update sequences, and some
others. This commit updates those places to handle the case where document
copies might be in different shard ranges.
A good place to start is the `mem3_util:get_ring()` function. This
function returns a full, non-overlapping ring from a set of possibly overlapping
shards.
This function is used by almost everything else in this commit:
1) It's used when only a single copy of the data is needed, for example when
processing _all_docs or _changes.
2) It's used when checking if progress is possible after some nodes died:
`get_ring()` returning `[]` when it cannot find a full ring is used to indicate
that progress is not possible.
3) During shard replacement. This is perhaps the most complicated case. During
replacement, besides just finding a possible covering of the ring from the set
of shards, it is also desirable to find one that minimizes the number of
workers that have to be replaced. A neat trick used here is to provide
`get_ring` with a custom sort function, which prioritizes certain shard copies
over others. In the case of replacements it prioritizes shards for which workers
have already been spawned. In the default case `get_ring()` will prioritize longer
ranges over shorter ones; for example, to cover the interval [00-ff] with
either the [00-7f, 80-ff] or the [00-ff] shard ranges, it will pick the single
[00-ff] range instead of the [00-7f, 80-ff] pair (a simplified sketch of this
idea follows).
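A simplified model of the idea, using integer ranges instead of `#shard{}` records and only the default preference for longer ranges (not the real `mem3_util:get_ring/1`):
```
-module(ring_cover_sketch).
-export([cover/3]).

%% Build a full cover of [Start, End] from possibly overlapping ranges,
%% preferring the candidate that reaches furthest and backtracking if a
%% choice leads to a dead end.
cover(Start, End, Ranges) ->
    Candidates = [R || {B, _} = R <- Ranges, B =:= Start],
    ByLongest = lists:reverse(lists:keysort(2, Candidates)),
    try_candidates(ByLongest, End, Ranges).

try_candidates([], _End, _Ranges) ->
    [];
try_candidates([{_, E} = R | _Rest], End, _Ranges) when E >= End ->
    [R];
try_candidates([{_, E} = R | Rest], End, Ranges) ->
    case cover(E + 1, End, Ranges) of
        []   -> try_candidates(Rest, End, Ranges);
        Tail -> [R | Tail]
    end.

%% For example, covering 16#00 to 16#ff with overlapping copies picks
%% the single long range over the two shorter ones:
%%   cover(16#00, 16#ff, [{16#00, 16#7f}, {16#80, 16#ff}, {16#00, 16#ff}])
%%   => [{0,255}]
```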
Co-authored-by: Paul J. Davis <davisp@apache.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This server admin-only endpoint forces an n-way sync of all shards
across all nodes on which they are hosted.
This can be useful for an administrator adding a new node to the
cluster, after updating _dbs so that the new node hosts an existing db
with content, to force the new node to sync all of that db's shards.
Users may want to bump their `[mem3] sync_concurrency` value to a
larger figure for the duration of the shard sync.
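Usage is a single admin request per database (status and body shown for illustration):
```
POST /dbname/_sync_shards HTTP/1.1
Accept: application/json

HTTP/1.1 202 Accepted

{"ok": true}
```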
Closes #1807
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This change introduces the ability for users to place a group of
documents in a single shard range by specifying a "partition key" in the
document id. A partition key is denoted by everything preceding a colon
':' in the document id.
Every document id (except for design documents) in a partitioned
database is required to have a partition key.
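For example (database and field names illustrative), both documents below share the `sensor-260` partition key and therefore land in the same shard range:
```
PUT /mydb?partitioned=true        (create a partitioned database)

PUT /mydb/sensor-260:reading-0001
{"value": 40}

PUT /mydb/sensor-260:reading-0002
{"value": 41}
```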
Co-authored-by: Garren Smith <garren.smith@gmail.com>
Co-authored-by: Robert Newson <rnewson@apache.org>
|
|
|
|
|
|
|
| |
This provides the capability for features to specify alternative hash
functions for placing documents in a given shard range. While the
functionality exists with this implementation it is not yet actually
used.
|
|
|
|
|
|
| |
The modules lists in .app files are automatically generated by rebar
from .app.src files, so these explicit lists are unnecessary and prone
to being out of date.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Move couch_event and mem3 modules earlier in the list of SubDirs to
suppress behaviour undefined warnings.
This has the side effect of running the tests in the new order, which
induces failures in couch_index tests. Those failures are related to
quorum, and can be traced to mem3 seeds tests leaving a _nodes db
containing several node docs in the tmp/data directory, ultimately
resulting in badmatch errors e.g. when a test expects 'ok' but gets
'accepted' instead.
To prevent test failures, a cleanup function is implemented which
deletes any existing "nodes_db" left after test completion.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Replace deprecated crypto:rand_uniform/2 and 'random' module functions
with equivalent couch_rand:uniform/1 calls, or eliminate the offending
code entirely if unused.
Note that crypto:rand_uniform/2 takes two parameters which have
different semantics than the single argument couch_rand:uniform/1.
Tests in mem3 are also provided to validate that the random rotation of
node lists was converted correctly.
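The conversion is mechanical, but the semantics differ; a typical before/after (example function name assumed):
```
-module(rand_pick_sketch).
-export([random_element/1]).

%% Picking a random list element. Note the differing semantics:
%% crypto:rand_uniform(Lo, Hi) returned Lo =< X < Hi, whereas
%% couch_rand:uniform(N) returns 1 =< X =< N.
random_element(List) when List =/= [] ->
    %% previously: lists:nth(crypto:rand_uniform(1, length(List) + 1), List)
    lists:nth(couch_rand:uniform(length(List)), List).
```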
|
| |
|
|
|
|
|
| |
This avoids needlessly making cross-cluster fabric:update_docs(Db, [], Opts)
calls.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Enable cluster auto-assembly through a seedlist
This introduces a new config setting which allows an administrator to
configure an initial list of nodes that should be contacted when a node
boots up:
[cluster]
seedlist = couchdb@node1.example.com,couchdb@node2.example.com,couchdb@node3.example.com
If configured, CouchDB will add every node in the seedlist to the _nodes
DB automatically, which will trigger a distributed Erlang connection and
a replication of the internal system databases to the local node. This
eliminates the need to explicitly add each node using the HTTP API.
We also modify the /_up endpoint to reflect the progress of the initial seeding
of the node. If a seedlist is configured the endpoint will return 404 until the
local node has updated its local replica of each of the system databases from
one of the members of the seedlist. Once the status flips to "ok" the endpoint
will return 200 and it's safe to direct requests to the new node.
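A rough illustration of the `/_up` behavior during seeding (bodies abridged; exact fields may differ):
```
GET /_up
HTTP/1.1 404 Object Not Found
{"status": "seeding", "seeds": { ...per-seed progress... }}

GET /_up
HTTP/1.1 200 OK
{"status": "ok"}
```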
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Off-heap messages are an Erlang 19 feature:
http://erlang.org/doc/man/erlang.html#process_flag_message_queue_data
It is advisable to use that setting for processes which expect to receive a
lot of messages. CouchDB sets it for couch_server, couch_log_server and a bunch
of others as well.
In some cases the off-heap behavior could alter the timing of message receives
and expose subtle bugs that have been lurking in the code for years. It could
also slightly reduce performance, so as a safety measure we allow disabling it.
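What the setting amounts to inside a process (an OTP 19+ flag; the safety switch above just decides whether it gets set):
```
-module(off_heap_sketch).
-export([init/1]).

%% Keep the (potentially huge) mailbox off the process heap so garbage
%% collection does not have to scan it.
init(Args) ->
    process_flag(message_queue_data, off_heap),
    {ok, Args}.
```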
|
| |
|
|
|
|
|
|
|
|
| |
Sometimes in operations it is helpful to re-establish a connection between
Erlang nodes. Usually this is achieved by calling `net_adm:ping/1`. However,
the `ping` function provided by OTP uses an `infinity` timeout, which causes
an indefinite hang in some cases. This PR adds a convenience function to be
used instead of `net_adm:ping/1`.
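A sketch of such a convenience function, mirroring what `net_adm:ping/1` does internally but with a finite timeout (details assumed, not necessarily the committed code):
```
-module(ping_sketch).
-export([ping/2]).

%% Same request net_adm:ping/1 makes, but any timeout or failure maps
%% to pang instead of hanging forever.
ping(Node, Timeout) when is_atom(Node) ->
    try gen:call({net_kernel, Node}, '$gen_call', {is_auth, node()}, Timeout) of
        {ok, yes} -> pong;
        _ -> pang
    catch
        _:_ -> pang
    end.
```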
|
|
|
|
|
|
|
|
| |
- Fix dialyzer warnings that the #shard record construction violates
the declared type in fabric_doc_open_revs.erl,
cpse_test_purge_replication.erl and other files
Fixes #1580
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This commit implements the internal replication of purge requests. This
part of the anti-entropy process is important for ensuring that shard
copies continue to be eventually consistent even if updates happen to
shards independently due to a network split or other event that prevents
a successful purge request from reaching a given copy.
The main addition to internal replication is that we both pull and push
purge requests between the source and target shards. The push direction
is obvious given that internal replication is in the push direction
already. Pull isn't quite as obvious but is required so that we don't
push an update that was already purged on the target.
Of note is that internal replication also has to maintain _local doc
checkpoints to prevent compaction removing old purge requests or else
shard copies could end up missing purge requests which would prevent the
shard copies from ever reaching a consistent state.
COUCHDB-3326
Co-authored-by: Mayya Sharipova <mayyas@ca.ibm.com>
Co-authored-by: jiangphcn <jiangph@cn.ibm.com>
|
|
|
|
|
|
| |
Previously there were two separate database references and it was not
clear which was used where. This reduces them to a single instance so
that the logic is simpler.
|
| |
|
|
|
|
|
|
| |
This uses the new `couch_util:set_mqd_off_heap/0` function to set
message queues to off_heap for some of our critical processes that
receive a significant amount of load in terms of message volume.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The changes listener started in the setup of the mem3_shards test
was crashing when it tried to register with an unstarted couch_event
server, so the test was either fast enough to do its assertions
before that happened or failed on a dead listener process.
This change removes the dependency on mocking and uses
the standard test_util start and stop of couch. The module start
was moved into the test body to avoid masking a potential failure
in the setup.
Also the tests mem3_sync_security_test and mem3_util_test have
been modified to avoid setup and teardown side effects.
|
| |
|
|
|
|
|
|
|
|
|
| |
This change moves the main work of storage engines to run through the
new couch_db_engine behavior. This allows us to replace the storage
engine with different implementations that can be tailored to specific
work loads and environments.
COUCHDB-3287
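Engines are selected per database file extension through configuration; the default mapping looks roughly like this (section and key shown for illustration):
```
[couchdb_engines]
; file extension -> module implementing the couch_db_engine behaviour
couch = couch_bt_engine
```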
|