From 0e79aa787877cdbffc8900952115de9173f41732 Mon Sep 17 00:00:00 2001 From: Sergey Poznyakoff Date: Sat, 31 Jul 2021 00:32:38 +0300 Subject: Update the documentation --- README_crash_tolerance.txt | 197 --------------------------------------------- doc/gdbm.3 | 9 ++- doc/gdbm.texi | 188 ++++++++++++++++++++++++++++++++++++++---- doc/gdbmtool.1 | 13 ++- 4 files changed, 190 insertions(+), 217 deletions(-) delete mode 100644 README_crash_tolerance.txt diff --git a/README_crash_tolerance.txt b/README_crash_tolerance.txt deleted file mode 100644 index 5aaf483..0000000 --- a/README_crash_tolerance.txt +++ /dev/null @@ -1,197 +0,0 @@ - -Crash Tolerance for GNU dbm -=========================== - -This file describes a new (as of release 1.21) feature that can be -enabled at compile time and used in environments with appropriate -support from the OS (currently Linux) and filesystem (currently XFS, -BtrFS, and OCFS2). The feature is a "pure opt-in," in the sense that -it has no effect whatsoever unless it is explicitly enabled at -compile time and used by applications. It has been tested on -late-2020-vintage Fedora Linux and XFS. - -See the "Drill Bits" column in the July/August 2021 issue of ACM -_Queue_ magazine for a broader discussion of crash-tolerant GNU dbm. -If for whatever reason you can't access this column, contact the -author (Kelly). - -Read and thoroughly understand this file before attempting to use the -new feature. Address questions/feedback to the maintainer(s) and to -Terence Kelly, tpkelly@{acm.org, cs.princeton.edu, eecs.umich.edu}. - -- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -Background: - -Historically GNU dbm did not tolerate crashes: An ill-timed crash due -to a power outage, an operating system kernel panic, or an abnormal -application process termination could corrupt or destroy data in the -database file. 
Corruption is likely if a crash occurs during updates -to the GDBM database file, e.g., during a gdbm_store() or gdbm_sync() -call. Therefore GNU dbm was not suitable for applications that -require the ability to recover an up-to-date consistent state of -their persistent data following a crash. Such applications resorted -instead to alternative "transactional" NoSQL data stores such as -BerkeleyDB or Kyoto Cabinet, or even full-blown SQL databases such as -MySQL or SQLite. Which is unfortunate if all the application really -needs is a crash-tolerant GDBM. - -New crash-tolerance feature: - -GNU dbm now includes an optional crash-tolerance mechanism that, when -used correctly, guarantees that a consistent recent state of -application data can be recovered followng a crash. Specifically, it -guarantees that the state of the database file corresponding to the -most recent successful gdbm_sync() call can be recovered. Crash -tolerance must be enabled when the GNU dbm library is compiled, and -applications must request crash tolerance for each GDBM_FILE by -calling a new API. - -If the new mechanism is used correctly, crashes such as power -outages, OS kernel panics, and (some) application process crashes -will be tolerated. Non-tolerated failures include physical -destruction of storage devices and corruption due to bugs in -application logic. For example, the new mechanism won't help if a -pointer bug in your application corrupts gdbm's private in-memory -data which in turn corrupts the database file. - -Using crash tolerance: - -(1) The GNU dbm library must be built with an additional C compiler -#define flag. After unpacking the tarball, from the C-shell command -line it suffices to do the following before running make: - - % setenv CFLAGS -DGDBM_FAILURE_ATOMIC - % ./configure CFLAGS=-DGDBM_FAILURE_ATOMIC >& configure.out - -(2) You must use a filesystem that supports reflink copying. -Currently XFS, BtrFS, and OCFS2 support reflink. 
You can create such -a filesystem if you don't have one already. (Note that reflink -support may require that special options be specified at the time of -filesystem creation; this is true of XFS.) The most conventional way -to create a filesystem is on a dedicated storage device. However it -is also possible to create a filesystem *within an ordinary file* on -some other filesystem. For example, executing the following commands -from the C-shell command line will create a smallish XFS filesystem -inside a file on an ext4 filesystem: - - % mkdir XFS - % cd XFS - % sudo truncate --size 512m XFSfile - % sudo mkfs.xfs -m crc=1 -m reflink=1 XFSfile - % sudo mkdir XFSmountpoint - % sudo mount -o loop XFSfile XFSmountpoint - % sudo xfs_info XFSmountpoint - % cd XFSmountpoint - % sudo mkdir test - % set me = `whoami`':'`whoami` - % sudo chown $me test - % cd test - % echo foo > bar - % ls -l bar - -After executing the commands above, from the diretory where you -started you should see a directory XFS/XFSmountpoint/test/ where your -unprivileged user account may create and delete files. Reflink -copying via ioctl(FICLONE) should work for files in and below this -directory. You can test reflink copying using the GNU "cp" -command-line utility: "cp --reflink=always file1 file2". Read the -manpage for the Linux-specific API "ioctl_ficlone(2)" for additional -information. - -Your GNU dbm database file and two other files described below must -all reside on the same reflink-capable filesystem. - -(3) In your application source code, #define GDBM_FAILURE_ATOMIC -before you #include <gdbm.h>. - -(4) Open a GNU dbm database with gdbm_open(). Unless you know what -you are doing, do *not* specify the GDBM_SYNC flag when opening the -database. The reason is that you want your application to explicitly -control when gdbm_sync() is called; you don't want an implicit sync -on every database operation. 
- -(5) Request crash tolerance by invoking the following new interface: - - gdbm_failure_atomic(GDBM_FILE dbf, const char *even, const char *odd); - -"even" and "odd" are the pathnames of two files that will be created -and filled with snapshots of the database file. These two files must -*not* exist when gdbm_failure_atomic() is called and must reside on the -same filesystem as the database file. The filesystem must support -reflink copying, i.e., ioctl(FICLONE) must work. - -After you call gdbm_failure_atomic(), every call to gdbm_sync() will -make an efficient reflink snapshot of the database file in either the -"even" or the "odd" snapshot file; consecutive gdbm_sync() calls -alternate between the two, hence the names. The permission bits and -last-mod timestamps on the snapshot files determine which one -contains the state of the database file corresponding to the most -recent successful gdbm_sync(). Post-crash recovery is described -below. - -(6) When your application knows that the state of the database is -consistent (i.e., all relevant application-level invariants hold), -you may call gdbm_sync(). For example, if your application manages -bank accounts, transferring money from one account to another should -maintain the invariant that the sum of the two accounts is the same -before and after the transfer: It is correct to decrement account A -by $7, increment account B by $7, and then call gdbm_sync(). However -it is *not* correct to call gdbm_sync() *between* the decrement of A -and the increment of B, because a crash immediately after that call -would destroy money. The general rule is simple, sensible, and -memorable: Call gdbm_sync() only when the database is in a state from -which you are willing and able to recover following a crash. (If you -think about it you'll realize that there's never any other moment -when you'd really want to call gdbm_sync(), regardless of whether -crash-tolerance is enabled. 
Why on earth would you push the state of -an inconsistent unrecoverable database down to durable media?). - -(7) If a crash occurs, the snapshot file ("even" or "odd") containing -the database state reflecting the most recent successful gdbm_sync() -call is the snapshot file whose permission bits are read-only and -whose last-modification timestamp is greatest. If both snapshot -files are readable, we choose the one with the most recent -last-modification timestamp. Following a crash, *do not* do anything -that could change the file permissions or last-mod timestamp on -either snapshot file! - -The gdbm_latest() function takes two filename arguments---the "even" -and "odd" snapshot filenames---and tells you which is the most recent -readable file. That's the snapshot file that should replace the -original database file, which may have been corrupted by the crash. - -Return values: - -Both new functions, gdbm_failure_atomic() and gdbm_latest(), pinpoint -mishaps by returning the *negation* of the source code line number on -which something went wrong: "return (-1 * __LINE__)". So to diagnose -problems, "use the Source, Luke!" - -Note that the values returned by the gdbm_sync() function may change -as a result of enabling crash tolerance. Applications unprepared for -the new return values might become confused. - -Performance: - -The purpose of a parachute is not to hasten descent. Crash tolerance -is a safety mechanism, not a performance accelerator. Reflink -copying is designed to be as efficient as possible, but making -snapshots of the GNU dbm database file on every gdbm_sync() call -entails overheads. The performance impact of GDBM crash tolerance -will depend on many factors including the type and configuration of -the underlying storage system, how often the application calls -gdbm_sync(), and the extent of changes to the database file between -consecutive calls to gdbm_sync(). 
- -Availability: - -To ensure that application data can survive the failure of one or -more storage devices, replicated storage (e.g., RAID) may be used -beneath the reflink-capable filesystem. Some cloud providers offer -block storage services that mimic the interface of individual storage -devices but that are implemented as high-availability fault-tolerant -replicated distributed storage systems. Installing a reflink-capable -filesystem atop a high-availability storage system is a good starting -point for a high-availability crash-tolerant GDBM. - diff --git a/doc/gdbm.3 b/doc/gdbm.3 index 963b9f0..6f569dc 100644 --- a/doc/gdbm.3 +++ b/doc/gdbm.3 @@ -13,7 +13,7 @@ .\" .\" You should have received a copy of the GNU General Public License .\" along with GDBM. If not, see . */ -.TH GDBM 3 "June 25, 2021" "GDBM" "GDBM User Reference" +.TH GDBM 3 "July 31, 2021" "GDBM" "GDBM User Reference" .SH NAME GDBM \- The GNU database manager. Includes \fBdbm\fR and \fBndbm\fR compatibility. @@ -446,9 +446,10 @@ of the underlying database. This mechanism requires OS and filesystem support and must be requested when \fBgdbm\fR is compiled. The crash-tolerance mechanism is a "pure opt-in" feature, in the sense that it has no effects whatsoever except on those applications -that explicitly request it. See file "README_crash_tolerance.txt" -in the distribution tarball for details. - +that explicitly request it. For details, see the chapter +.B "Crash Tolerance" +in the +.BR "GDBM manual" . .SH LINKING This library is accessed by specifying \fI\-lgdbm\fR as the last parameter to the compile line, e.g.: diff --git a/doc/gdbm.texi b/doc/gdbm.texi index 84cc3aa..7a9198c 100644 --- a/doc/gdbm.texi +++ b/doc/gdbm.texi @@ -107,6 +107,7 @@ Functions: * Sequential:: Sequential access to records. * Reorganization:: Database reorganization. * Sync:: Insure all writes to disk have competed. +* Database format:: GDBM database formats. * Flat files:: Export and import to Flat file format. 
* Errors:: Error handling. * Recovery:: Recovery from fatal errors. @@ -404,9 +405,6 @@ the database and wants it created if it does not already exist. If created, regardless of whether one existed, and wants read and write access to the new database. -@kwindex GDBM_SYNC -@kwindex GDBM_NOLOCK -@kwindex GDBM_NOMMAP The following constants may also be logically or'd into the database flags: @@ -423,6 +421,14 @@ A reverse of @code{GDBM_SYNC}. Synchronize writes only when needed. This is the default. The flag is provided for compatibility with previous versions of @command{GDBM}. +@kwindex GDBM_NUMSYNC +@item GDBM_NUMSYNC +Useful only together with @code{GDBM_NEWDB}, this bit instructs +@code{gdbm_open} to create new database in @dfn{extended database +format}, suitable for effective crash recovery. @xref{Numsync}, for a +detailed discussion of this format, and @ref{Crash Tolerance}, for a +discussion of crash recovery. + @kwindex GDBM_NOLOCK @item GDBM_NOLOCK Don't lock the database file. Use this flag if you intend to do @@ -870,6 +876,46 @@ immediately after the set of changes have been made. describing the error and returns -1. @end deftypefn +@node Database format +@chapter Changing database format +As of version @value{VERSION}, @command{GDBM} supports databases in +two formats: @dfn{standard} and @dfn{extended}. The standard format +is used most often. The @dfn{extended} database format is used to +provide additional crash resistance (@pxref{Crash Tolerance}). + +Depending on the value of the @var{flags} parameter in a call to +@code{gdbm_open} (@pxref{Open}), a database can be created in either +format. + +The format of an existing database can be changed using the +@code{gdbm_convert} function: + +@deftypefn {gdbm interface} int gdbm_convert (GDBM_FILE @var{dbf}, @ + int @var{flag}) +Changes the format of the database file @var{dbf}. Allowed values for +@var{flag} are: + +@table @code +@item 0 +Convert database to the standard format. 
+ +@kwindex GDBM_NUMSYNC +@item GDBM_NUMSYNC +Convert database to the extended @dfn{numsync} format (@pxref{Numsync}). +@end table + +On success, the function returns 0. In this case, it should be +followed by a call to @code{gdbm_sync} (@pxref{Sync}) or +@code{gdbm_close} (@pxref{Close}) to ensure the changes are written to +the disk. + +On error, returns -1 and sets the @code{gdbm_errno} variable +(@pxref{Variables, gdbm_errno}). + +If the database is already in the requested format, the function +returns success (0) without doing anything. +@end deftypefn + @node Flat files @chapter Export and Import @cindex Flat file format @@ -1345,11 +1391,11 @@ support from the OS and the filesystem. As of version @value{VERSION}, this means a Linux kernel 5.12.12 or later and a filesystem that supports reflink copying, such as XFS, BtrFS, or OCFS2. If these prerequisites are met, crash tolerance code will -be enabled automaticaly by the @command{configure} script when +be enabled automatically by the @command{configure} script when building the package. The crash-tolerance mechanism, when used correctly, guarantees that a -consistent recent state of application data can be recovered followng +consistent recent state of application data can be recovered following a crash. Specifically, it guarantees that the state of the database file corresponding to the most recent successful gdbm_sync() call can be recovered. @@ -1359,7 +1405,7 @@ outages, OS kernel panics, and (some) application process crashes will be tolerated. Non-tolerated failures include physical destruction of storage devices and corruption due to bugs in application logic. For example, the new mechanism won't help if a -pointer bug in your application corrupts gdbm's private in-memory +pointer bug in your application corrupts @command{GDBM} private in-memory data which in turn corrupts the database file. To enable crash tolerance in your application, follow these steps. 
@@ -1391,7 +1437,7 @@ The XFS filesystem is now available in directory unprivileged user account may create and delete files: @example -mkdir XFSmountpoint +cd XFSmountpoint mkdir test chown @var{user}:@var{group} test @end example @@ -1415,11 +1461,14 @@ all reside on the same reflink-capable filesystem. @heading Enabling crash tolerance -Open a GNU dbm database with @code{gdbm_open}. Unless you know what -you are doing, do not specify the @code{GDBM_SYNC} flag when opening the -database. The reason is that you want your application to explicitly -control when @code{gdbm_sync} is called; you don't want an implicit sync -on every database operation. +Open a GNU dbm database with @code{gdbm_open}. Whenever possible, use +the extended @command{GDBM} format. Generally speaking, this means +using the @code{GDBM_NUMSYNC} flag when creating the database +(@pxref{Numsync}). Unless you know what you are doing, do not specify +the @code{GDBM_SYNC} flag when opening the database. The reason is that +you want your application to explicitly control when @code{gdbm_sync} +is called; you don't want an implicit sync on every database +operation. Request crash tolerance by invoking the following interface: @@ -1470,9 +1519,11 @@ containing the database state reflecting the most recent successful @code{gdbm_sync} call is the snapshot file whose permission bits are read-only and whose last-modification timestamp is greatest. If both snapshot files are readable, we choose the one with the most recent -last-modification timestamp. Following a crash, @emph{do not} do -anything that could change the file permissions or last-mod timestamp on -either snapshot file! +last-modification timestamp@footnote{The experimental @dfn{numsync} +extension is provided to handle such case gracefully. @xref{Numsync}, +for details.}. Following a crash, @emph{do not} do anything that +could change the file permissions or last-mod timestamp on either +snapshot file! 
The @code{gdbm_latest_snapshot} function is provided, that selects the right snapshot among the two. Invoke it as: @@ -1502,6 +1553,19 @@ switch (gdbm_latest_snapshot (even, odd, &recovery_file)) case GDBM_SNAPSHOT_SAME: fprintf (stderr, "Both snapshots have the same date!\n); exit (1); + + case GDBM_SNAPSHOT_SUSPICIOUS: + /* + * That can occur only in databases with extended numsync header + * enabled. @xref{Numsync}. + */ + fprintf (stderr, "returned snapshot %s is suspicious\n", recovery_file); + fprintf (stderr, "examine it and take action\n"); + /* + * Switch to interactive mode letting the user examine the + * snapshot and take appropriate action + */ + @} @end group @end example @@ -1529,6 +1593,76 @@ replicated distributed storage systems. Installing a reflink-capable filesystem atop a high-availability storage system is a good starting point for a high-availability crash-tolerant GDBM. +@node Numsync +@section Numsync Extension + +In @ref{Crash recovery}, we have shown that for database recovery, +one should select the snapshot whose permission bits are read-only and +whose last-modification timestamp is greatest. However, there may be +cases when a crash occurs at such a time that both snapshot files +remain readable. It may also happen, that their permissions and/or +modification times are inadvertently changed before recovery. To +make it possible to select the right snapshot in such cases, a new +@dfn{extended database format} was introduced in @command{GDBM} +version 1.21. This format adds to the database header the +@code{numsync} field, that holds the number of synchronizations the +database underwent before being closed or abandoned due to a crash. + +Each snapshot is an exact copy of the database at a given point of +time. Thus, if both snapshots of a database in extended format are +readable, it will suffice to examine their @code{numsync} counters +and select the one whose @code{numsync} is greater. 
That's what +the @code{gdbm_latest_snapshot} function does in this case. + +It is worth noting that the two counters should differ exactly by +one. If the difference is greater than that, @code{gdbm_latest_snapshot} +will still select the snapshot with the greater @code{numsync} value, +but will return a special status code, @code{GDBM_SNAPSHOT_SUSPICIOUS}, +indicating that the proposed snapshot file has been chosen based on +suspicious or unreliable data. If, during a recovery attempt, you get +this status code, we recommend proceeding with manual recovery, +e.g. by examining both snapshot files using @command{gdbmtool -r} +(@pxref{gdbmtool}). + +To create a database in extended format, call @code{gdbm_open} with +both @code{GDBM_NEWDB} and @code{GDBM_NUMSYNC} flags: + +@example +dbf = gdbm_open(dbfile, 0, GDBM_NEWDB|GDBM_NUMSYNC, 0600, NULL); +@end example + +@noindent +Notice that this flag must always be used together with +@code{GDBM_NEWDB} (@pxref{Open}). + +A standard @command{GDBM} database can be converted to the extended +format. To convert an existing database to the extended format, use the +@code{gdbm_convert} function (@pxref{Database format}): + +@example + rc = gdbm_convert(dbf, GDBM_NUMSYNC); +@end example + +You can do the same using the @command{gdbmtool} utility +(@pxref{commands, upgrade}): + +@example +gdbmtool @var{dbname} upgrade +@end example + +The conversion is reversible. To convert a database from extended +format back to the standard @command{GDBM} format, do: + +@example + rc = gdbm_convert(dbf, 0); +@end example + +To do the same from the command line: + +@example +gdbmtool @var{dbname} downgrade +@end example + @node Crash Tolerance API @section Crash Tolerance API @@ -1581,6 +1715,16 @@ select between the two snapshots (this means they are both readable and have exactly the same @code{mtime} timestamp), the function returns @code{GDBM_SNAPSHOT_SAME}. 
+@kwindex GDBM_SNAPSHOT_SUSPICIOUS +If the @samp{numsync} extension is enabled (@pxref{Numsync}), the +function can also return the @code{GDBM_SNAPSHOT_SUSPICIOUS} status +code. This happens when the @code{numsync} counters in the two +snapshots differ by more than one. In this case, the function selects +the snapshot with the greater @code{numsync} value. If you get this +status code when recovering from a crash, it is recommended to switch +to manual recovery procedure, letting the user examine the snapshots +and take the appropriate action. + If any value other than @code{GDBM_SNAPSHOT_OK} is returned, it is guaranteed that the function don't touch @var{retval}. @end deftypefn @@ -2911,6 +3055,11 @@ Delete record with the given @var{key} Print hash directory. @end deffn +@deffn {command verb} downgrade +Downgrade the database from extended to the standard database format. +@xref{Numsync}. +@end deffn + @anchor{gdbmtool export} @deffn {command verb} export @var{file-name} [truncate] [binary|ascii] Export the database to the flat file @var{file-name}. @xref{Flat files}, @@ -3077,6 +3226,15 @@ Store the @var{data} with @var{key} in the database. If @var{key} already exists, its data will be replaced. @end deffn +@deffn {command verb} sync +Synchronize the database with the disk storage (@pxref{Sync}). +@end deffn + +@deffn {command verb} upgrade +Upgrade the database from standard to extended database format. +@xref{Numsync}. +@end deffn + @deffn {command verb} version Print the version of @command{gdbm}. @end deffn diff --git a/doc/gdbmtool.1 b/doc/gdbmtool.1 index d15b7cd..20c7c27 100644 --- a/doc/gdbmtool.1 +++ b/doc/gdbmtool.1 @@ -13,7 +13,7 @@ .\" .\" You should have received a copy of the GNU General Public License .\" along with GDBM. If not, see . 
*/ -.TH GDBMTOOL 1 "June 27, 2018" "GDBM" "GDBM User Reference" +.TH GDBMTOOL 1 "July 31, 2021" "GDBM" "GDBM User Reference" .SH NAME gdbmtool \- examine and modify a GDBM database .SH SYNOPSIS @@ -179,6 +179,10 @@ Delete record with the given \fIKEY\fR. .BR dir Print hash directory. .TP +.BR downgrade +Downgrade the database from the extended \fInumsync\fR format to the +standard format. +.TP \fBexport\fR \fIFILE\-NAME\fR [\fBtruncate\fR] [\fBbinary\fR|\fBascii\fR] Export the database to the flat file \fIFILE\-NAME\fR. This is equivalent to .BR gdbm_dump (1). @@ -270,6 +274,13 @@ Print current program status. Store the \fIDATA\fR with the given \fIKEY\fR in the database. If the \fIKEY\fR already exists, its data will be replaced. .TP +.B sync +Synchronize the database file with the disk storage. +.TP +.B upgrade +Upgrade the database from the standard to the extended \fInumsync\fR +format. +.TP \fBunset\fR \fIVARIABLE\fR... Unsets listed variables. .TP -- cgit v1.2.1
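
Note on the numsync rule documented by this patch: the new Numsync section says that when both snapshots are readable, the one with the greater numsync counter is selected, and that a difference of more than one is reported as suspicious. The following is a minimal sketch of that decision rule only, not the library's implementation. The enum values and the helper `choose_by_numsync` are hypothetical names invented for illustration; treating equal counters as undecidable and ignoring counter wraparound are assumptions of this sketch, not statements from the manual.

```c
#include <stdint.h>

/* Hypothetical status codes for this sketch. They echo the names used in
   the manual text but are NOT the constants defined in <gdbm.h>. */
enum snapshot_status { SNAP_OK, SNAP_SAME, SNAP_SUSPICIOUS };

/* Hypothetical helper: given the numsync counters taken from the "even"
   and "odd" snapshot headers, decide which snapshot is newer.
   On SNAP_OK or SNAP_SUSPICIOUS, *use_even is set to 1 if the "even"
   snapshot has the greater counter and to 0 otherwise.
   On SNAP_SAME (an assumption of this sketch), *use_even is untouched. */
enum snapshot_status
choose_by_numsync(uint32_t even_numsync, uint32_t odd_numsync, int *use_even)
{
    uint32_t diff;

    if (even_numsync == odd_numsync)
        return SNAP_SAME;

    *use_even = even_numsync > odd_numsync;
    diff = *use_even ? even_numsync - odd_numsync
                     : odd_numsync - even_numsync;

    /* Consecutive syncs alternate between the two files, so healthy
       snapshots differ by exactly one. */
    return diff == 1 ? SNAP_OK : SNAP_SUSPICIOUS;
}
```

A caller that gets the suspicious status would still have a candidate snapshot, but, as the manual text recommends, should let the user examine both files before restoring either.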