summaryrefslogtreecommitdiff
path: root/src/include/storage
Commit message (Collapse)AuthorAgeFilesLines
* Add writeback to pg_stat_ioAndres Freund2023-05-171-2/+3
| | | | | | | | | | | | | | | 28e626bde00 added the concept of IOOps but neglected to include writeback operations. ac8d53dae5 added time spent doing these I/O operations. Without counting writeback, checkpointer write time in the log often differed substantially from that in pg_stat_io. To fix this, add IOOp IOOP_WRITEBACK and track writeback in pg_stat_io. Bumps catversion. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reported-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20230419172326.dhgyo4wrrhulovt6%40awork3.anarazel.de
* Update parameter name context to wb_contextAndres Freund2023-05-171-2/+2
| | | | | | | | | | For clarity of review, renaming the function parameter "context" in ScheduleBufferTagForWriteback() and IssuePendingWritebacks() to "wb_context" is a separate commit. The next commit adds an "io_context" parameter and "wb_context" makes it more clear which is which. Author: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/CAAKRu_acc6iL4M3hvOTeztf_ZPpsB3Pqio5aVHgZ5q=Pi3BZKg@mail.gmail.com
* Remove bogus #include added by d4e71df6d75.Thomas Munro2023-04-261-1/+0
| | | | | | | | | The recently added inclusion of guc.h in smgr.h is not necessary and introduces more server-related stuff. Removing the directive helps avoid potential issues with including sgmr.h in frontends. Author: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Discussion: https://postgr.es/m/20230425.115748.2130383825066921512.horikyota.ntt%40gmail.com
* Remove vacuum_defer_cleanup_ageAndres Freund2023-04-241-1/+0
| | | | | | | | | | | | | | | | | | | | vacuum_defer_cleanup_age was introduced before hot_standby_feedback and replication slots existed. It is hard to use reasonably - commonly it will either be set too low (not preventing recovery conflicts, while still causing some bloat), or too high (causing a lot of bloat). The alternatives do not have that issue. That on its own might not be sufficient reason to remove vacuum_defer_cleanup_age, but it also complicates computation of xid horizons. See e.g. the bug fixed in be504a3e974. It also is untested. This commit removes TransactionIdRetreatSafely(), as there are no users anymore. There might be potential future users, hence noting that here. Reviewed-by: Daniel Gustafsson <daniel@yesql.se> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230317230930.nhsgk3qfk7f4axls@awork3.anarazel.de
* Fix some typos and some incorrectly duplicated wordsDavid Rowley2023-04-181-1/+1
| | | | | | Author: Justin Pryzby Reviewed-by: David Rowley Discussion: https://postgr.es/m/ZD3D1QxoccnN8A1V@telsasoft.com
* Harmonize some more function parameter names.Peter Geoghegan2023-04-131-1/+1
| | | | | | | | | | | | | | Make sure that function declarations use names that exactly match the corresponding names from function definitions in a few places. These inconsistencies were all introduced relatively recently, after the code base had parameter name mismatches fixed in bulk (see commits starting with commits 4274dc22 and 035ce1fe). pg_bsd_indent still has a couple of similar inconsistencies, which I (pgeoghegan) have left untouched for now. Like all earlier commits that cleaned up function parameter names, this commit was written with help from clang-tidy.
* Handle logical slot conflicts on standbyAndres Freund2023-04-082-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During WAL replay on the standby, when a conflict with a logical slot is identified, invalidate such slots. There are two sources of conflicts: 1) Using the information added in 6af1793954e, logical slots are invalidated if required rows are removed 2) wal_level on the primary server is reduced to below logical Uses the infrastructure introduced in the prior commit. FIXME: add commit reference. Change InvalidatePossiblyObsoleteSlot() to use a recovery conflict to interrupt use of a slot, if called in the startup process. The new recovery conflict is added to pg_stat_database_conflicts, as confl_active_logicalslot. See 6af1793954e for an overall design of logical decoding on a standby. Bumps catversion for the addition of the pg_stat_database_conflicts column. Bumps PGSTAT_FILE_FORMAT_ID for the same reason. Author: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Amit Khandekar <amitdkhan.pg@gmail.com> (in an older version) Reviewed-by: "Drouvot, Bertrand" <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Reviewed-by: Robert Haas <robertmhaas@gmail.com> Reviewed-by: FabrĂ­zio de Royes Mello <fabriziomello@gmail.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Reviewed-by: Alvaro Herrera <alvherre@alvh.no-ip.org> Discussion: https://postgr.es/m/20230407075009.igg7be27ha2htkbt@awork3.anarazel.de
* Add io_direct setting (developer-only).Thomas Munro2023-04-082-0/+8
| | | | | | | | | | | | | | | | | | | | | | | Provide a way to ask the kernel to use O_DIRECT (or local equivalent) where available for data and WAL files, to avoid or minimize kernel caching. This hurts performance currently and is not intended for end users yet. Later proposed work would introduce our own I/O clustering, read-ahead, etc to replace the facilities the kernel disables with this option. The only user-visible change, if the developer-only GUC is not used, is that this commit also removes the obscure logic that would activate O_DIRECT for the WAL when wal_sync_method=open_[data]sync and wal_level=minimal (which also requires max_wal_senders=0). Those are non-default and unlikely settings, and this behavior wasn't (correctly) documented. The same effect can be achieved with io_direct=wal. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Author: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/CA%2BhUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg%40mail.gmail.com
* Introduce PG_IO_ALIGN_SIZE and align all I/O buffers.Thomas Munro2023-04-081-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to have the option to use O_DIRECT/FILE_FLAG_NO_BUFFERING in a later commit, we need the addresses of user space buffers to be well aligned. The exact requirements vary by OS and file system (typically sectors and/or memory pages). The address alignment size is set to 4096, which is enough for currently known systems: it matches modern sectors and common memory page size. There is no standard governing O_DIRECT's requirements so we might eventually have to reconsider this with more information from the field or future systems. Aligning I/O buffers on memory pages is also known to improve regular buffered I/O performance. Three classes of I/O buffers for regular data pages are adjusted: (1) Heap buffers are now allocated with the new palloc_aligned() or MemoryContextAllocAligned() functions introduced by commit 439f6175. (2) Stack buffers now use a new struct PGIOAlignedBlock to respect PG_IO_ALIGN_SIZE, if possible with this compiler. (3) The buffer pool is also aligned in shared memory. WAL buffers were already aligned on XLOG_BLCKSZ. It's possible for XLOG_BLCKSZ to be configured smaller than PG_IO_ALIGNED_SIZE and thus for O_DIRECT WAL writes to fail to be well aligned, but that's a pre-existing condition and will be addressed by a later commit. BufFiles are not yet addressed (there's no current plan to use O_DIRECT for those, but they could potentially get some incidental speedup even in plain buffered I/O operations through better alignment). If we can't align stack objects suitably using the compiler extensions we know about, we disable the use of O_DIRECT by setting PG_O_DIRECT to 0. This avoids the need to consider systems that have O_DIRECT but can't align stack objects the way we want; such systems could in theory be supported with more work but we don't currently know of any such machines, so it's easier to pretend there is no O_DIRECT support instead. That's an existing and tested class of system. Add assertions that all buffers passed into smgrread(), smgrwrite() and smgrextend() are correctly aligned, unless PG_O_DIRECT is 0 (= stack alignment tricks may be unavailable) or the block size has been set too small to allow arrays of buffers to be all aligned. Author: Thomas Munro <thomas.munro@gmail.com> Author: Andres Freund <andres@anarazel.de> Reviewed-by: Justin Pryzby <pryzby@telsasoft.com> Discussion: https://postgr.es/m/CA+hUKGK1X532hYqJ_MzFWt0n1zt8trz980D79WbjwnT-yYLZpg@mail.gmail.com
* Add VACUUM/ANALYZE BUFFER_USAGE_LIMIT optionDavid Rowley2023-04-071-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | Add new options to the VACUUM and ANALYZE commands called BUFFER_USAGE_LIMIT to allow users more control over how large to make the buffer access strategy that is used to limit the usage of buffers in shared buffers. Larger rings can allow VACUUM to run more quickly but have the drawback of VACUUM possibly evicting more buffers from shared buffers that might be useful for other queries running on the database. Here we also add a new GUC named vacuum_buffer_usage_limit which controls how large to make the access strategy when it's not specified in the VACUUM/ANALYZE command. This defaults to 256KB, which is the same size as the access strategy was prior to this change. This setting also controls how large to make the buffer access strategy for autovacuum. Per idea by Andres Freund. Author: Melanie Plageman Reviewed-by: David Rowley Reviewed-by: Andres Freund Reviewed-by: Justin Pryzby Reviewed-by: Bharath Rupireddy Discussion: https://postgr.es/m/20230111182720.ejifsclfwymw2reb@awork3.anarazel.de
* bufmgr: Introduce infrastructure for faster relation extensionAndres Freund2023-04-052-0/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The primary bottlenecks for relation extension are: 1) The extension lock is held while acquiring a victim buffer for the new page. Acquiring a victim buffer can require writing out the old page contents including possibly needing to flush WAL. 2) When extending via ReadBuffer() et al, we write a zero page during the extension, and then later write out the actual page contents. This can nearly double the write rate. 3) The existing bulk relation extension infrastructure in hio.c just amortized the cost of acquiring the relation extension lock, but none of the other costs. Unfortunately 1) cannot currently be addressed in a central manner as the callers to ReadBuffer() need to acquire the extension lock. To address that, this this commit moves the responsibility for acquiring the extension lock into bufmgr.c functions. That allows to acquire the relation extension lock for just the required time. This will also allow us to improve relation extension further, without changing callers. The reason we write all-zeroes pages during relation extension is that we hope to get ENOSPC errors earlier that way (largely works, except for CoW filesystems). It is easier to handle out-of-space errors gracefully if the page doesn't yet contain actual tuples. This commit addresses 2), by using the recently introduced smgrzeroextend(), which extends the relation, without dirtying the kernel page cache for all the extended pages. To address 3), this commit introduces a function to extend a relation by multiple blocks at a time. There are three new exposed functions: ExtendBufferedRel() for extending the relation by a single block, ExtendBufferedRelBy() to extend a relation by multiple blocks at once, and ExtendBufferedRelTo() for extending a relation up to a certain size. To avoid duplicating code between ReadBuffer(P_NEW) and the new functions, ReadBuffer(P_NEW) now implements relation extension with ExtendBufferedRel(), using a flag to tell ExtendBufferedRel() that the relation lock is already held. Note that this commit does not yet lead to a meaningful performance or scalability improvement - for that uses of ReadBuffer(P_NEW) will need to be converted to ExtendBuffered*(), which will be done in subsequent commits. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
* bufmgr: Support multiple in-progress IOs by using resownerAndres Freund2023-04-051-1/+1
| | | | | | | | | | | | | | | | | | A future patch will add support for extending relations by multiple blocks at once. To be concurrency safe, the buffers for those blocks need to be marked as BM_IO_IN_PROGRESS. Until now we only had infrastructure for recovering from an IO error for a single buffer. This commit extends that infrastructure to multiple buffers by using the resource owner infrastructure. This commit increases the size of the ResourceOwnerData struct, which appears to have a just about measurable overhead in very extreme workloads. Medium term we are planning to substantially shrink the size of ResourceOwnerData. Short term the increase is small enough to not worry about it for now. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de Discussion: https://postgr.es/m/20221029200025.w7bvlgvamjfo6z44@awork3.anarazel.de
* bufmgr: Add Pin/UnpinLocalBuffer()Andres Freund2023-04-051-0/+2
| | | | | | | | So far these were open-coded in quite a few places, without a good reason. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
* bufmgr: Add some more error checking [infrastructure] around pinningAndres Freund2023-04-051-0/+1
| | | | | | | | | | | This adds a few more assertions against a buffer being local in places we don't expect, and extracts the check for a buffer being pinned exactly once from LockBufferForCleanup() into its own function. Later commits will use this function. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Discussion: http://postgr.es/m/419312fd-9255-078c-c3e3-f0525f911d7f@iki.fi
* Add smgrzeroextend(), FileZero(), FileFallocate()Andres Freund2023-04-053-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | smgrzeroextend() uses FileFallocate() to efficiently extend files by multiple blocks. When extending by a small number of blocks, use FileZero() instead, as using posix_fallocate() for small numbers of blocks is inefficient for some file systems / operating systems. FileZero() is also used as the fallback for FileFallocate() on platforms / filesystems that don't support fallocate. A big advantage of using posix_fallocate() is that it typically won't cause dirty buffers in the kernel pagecache. So far the most common pattern in our code is that we smgrextend() a page full of zeroes and put the corresponding page into shared buffers, from where we later write out the actual contents of the page. If the kernel, e.g. due to memory pressure or elapsed time, already wrote back the all-zeroes page, this can lead to doubling the amount of writes reaching storage. There are no users of smgrzeroextend() as of this commit. That will follow in future commits. Reviewed-by: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Reviewed-by: Kyotaro Horiguchi <horikyota.ntt@gmail.com> Reviewed-by: David Rowley <dgrowleyml@gmail.com> Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://postgr.es/m/20221029025420.eplyow6k7tgu6he3@awork3.anarazel.de
* Track shared buffer hits in pg_stat_ioAndres Freund2023-03-301-1/+1
| | | | | | | | | | | | Among other things, this should make it easier to calculate a useful cache hit ratio by excluding buffer reads via buffer access strategies. As buffer access strategies reuse buffers (and thus evict the prior buffer contents), it is normal to see reads on repeated scans of the same data. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Bertrand Drouvot <bertranddrouvot.pg@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAAKRu_beMa9Hzih40%3DXPYqhDVz6tsgUGTrhZXRo%3Dunp%2Bszb%3DUA%40mail.gmail.com
* Remove empty function BufmgrCommit().Tom Lane2023-03-291-1/+0
| | | | | | | | | | | | | | This function has been a no-op for over a decade. Even if bufmgr regains a need to be called during commit, it seems unlikely that the most appropriate call points would be precisely here, so it's not doing us much good as a placeholder either. Now, removing it probably doesn't save any noticeable number of cycles --- but the main call is inside the commit critical section, and the less work done there the better. Matthias van de Meent Discussion: https://postgr.es/m/CAEze2Wi1=tLKbxZnXzcD+8fYKyKqBtivVakLQC_mYBsP4Y8qVA@mail.gmail.com
* Update types in smgr APIPeter Eisentraut2023-02-272-6/+6
| | | | | | | | Change data buffer to void *, from char *, and add const where appropriate. This makes it match the File API (see also 2d4f1ba6cfc2f0a977f1c30bda9848041343e248) and stdio. Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
* Consolidate ItemPointer to Datum conversion functionsPeter Eisentraut2023-02-131-0/+20
| | | | | | | | | Instead of defining the same set of macros several times, define it once in an appropriate header file. In passing, convert to inline functions. Reviewed-by: Heikki Linnakangas <hlinnaka@iki.fi> Discussion: https://www.postgresql.org/message-id/flat/844dd4c5-e5a1-3df1-bfaf-d1e1c2a16e45%40enterprisedb.com
* pgstat: Track more detailed relation IO statisticsAndres Freund2023-02-092-4/+11
| | | | | | | | | | | | | | | | | | | Commit 28e626bde00 introduced the infrastructure for tracking more detailed IO statistics. This commit adds the actual collection of the new IO statistics for relations and temporary relations. See aforementioned commit for goals and high-level design. The changes in this commit are fairly straight-forward. The bulk of the change is to passing sufficient information to the callsites of pgstat_count_io_op(). A somewhat unsightly detail is that it currently is hard to find a better place to count fsyncs than in md.c, whereas the other pgstat_count_io_op() calls are in bufmgr.c/localbuf.c. As the number of fsyncs is tied to md.c implementation details, it's not obvious there is a better answer. Author: Melanie Plageman <melanieplageman@gmail.com> Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/20200124195226.lth52iydq2n2uilq@alap3.anarazel.de
* Avoid type cheats for invalid dsa_handles and dshash_table_handles.Tom Lane2023-01-252-3/+3
| | | | | | | | | | | | | | Invent separate macros for "invalid" values of these types, so that we needn't embed knowledge of their representations into calling code. These are all zeroes anyway ATM, so this is not fixing any live bug, but it makes the code cleaner and more future-proof. I (tgl) also chose to move DSM_HANDLE_INVALID into dsm_impl.h, since it seems like it should live beside the typedef for dsm_handle. Hou Zhijie, Nathan Bossart, Kyotaro Horiguchi, Tom Lane Discussion: https://postgr.es/m/OS0PR01MB5716860B1454C34E5B179B6694C99@OS0PR01MB5716.jpnprd01.prod.outlook.com
* Track logrep apply workers' last start times to avoid useless waits.Tom Lane2023-01-221-0/+2
| | | | | | | | | | | | | | Enforce wal_retrieve_retry_interval on a per-subscription basis, rather than globally, and arrange to skip that delay in case of an intentional worker exit. This probably makes little difference in the field, where apply workers wouldn't be restarted often; but it has a significant impact on the runtime of our logical replication regression tests (even though those tests use artificially-small wal_retrieve_retry_interval settings already). Nathan Bossart, with mostly-cosmetic editorialization by me Discussion: https://postgr.es/m/20221122004119.GA132961@nathanxps13
* Add new GUC reserved_connections.Robert Haas2023-01-201-1/+1
| | | | | | | | | | | | This provides a way to reserve connection slots for non-superusers. The slots reserved via the new GUC are available only to users who have the new predefined role pg_use_reserved_connections. superuser_reserved_connections remains as a final reserve in case reserved_connections has been exhausted. Patch by Nathan Bossart. Reviewed by Tushar Ahuja and by me. Discussion: http://postgr.es/m/20230119194601.GA4105788@nathanxps13
* Remove SHM_QUEUEAndres Freund2023-01-191-22/+0
| | | | | | | | | | Prior patches got rid of all the uses of SHM_QUEUE. ilist.h style lists are more widely used and have an easier to use interface. As there are no users left, remove SHM_QUEUE. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version) Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
* Use dlists instead of SHM_QUEUE for predicate lockingAndres Freund2023-01-191-32/+16
| | | | | | | | | Part of a series to remove SHM_QUEUE. ilist.h style lists are more widely used and have an easier to use interface. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version) Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
* Constify proclist.hPeter Eisentraut2023-01-191-3/+3
| | | | | | | This is a follow-up to c8ad4d81. Author: Aleksander Alekseev Discussion: https://www.postgresql.org/message-id/flat/CAJ7c6TM084Ai_8%3DfZaWtULJBLtT1bgzL%3Dk9vHMYom3eyZsekAA%40mail.gmail.com
* Use dlists instead of SHM_QUEUE for syncrep queueAndres Freund2023-01-181-1/+1
| | | | | | | | | Part of a series to remove SHM_QUEUE. ilist.h style lists are more widely used and have an easier to use interface. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version) Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
* Use dlist/dclist instead of PROC_QUEUE / SHM_QUEUE for heavyweight locksAndres Freund2023-01-182-19/+13
| | | | | | | | | | | Part of a series to remove SHM_QUEUE. ilist.h style lists are more widely used and have an easier to use interface. As PROC_QUEUE is now unused, remove it. Reviewed-by: Thomas Munro <thomas.munro@gmail.com> (in an older version) Discussion: https://postgr.es/m/20221120055930.t6kl3tyivzhlrzu2@awork3.anarazel.de Discussion: https://postgr.es/m/20200211042229.msv23badgqljrdg2@alap3.anarazel.de
* Constify the arguments of copydir.h functionsMichael Paquier2023-01-181-2/+2
| | | | | | | | | | This makes sure that the internal logic of these functions does not attempt to change the value of the arguments constified, and it removes one unconstify() in basic_archive.c. Author: Nathan Bossart Reviewed-by: Andrew Dunstan, Peter Eisentraut Discussion: https://postgr.es/m/20230114231126.GA2580330@nathanxps13
* Add BufFileRead variants with short read and EOF detectionPeter Eisentraut2023-01-161-1/+3
| | | | | | | | | | | | | | | Most callers of BufFileRead() want to check whether they read the full specified length. Checking this at every call site is very tedious. This patch provides additional variants BufFileReadExact() and BufFileReadMaybeEOF() that include the length checks. I considered changing BufFileRead() itself, but this function is also used in extensions, and so changing the behavior like this would create a lot of problems there. The new names are analogous to the existing LogicalTapeReadExact(). Reviewed-by: Amit Kapila <amit.kapila16@gmail.com> Discussion: https://www.postgresql.org/message-id/flat/f3501945-c591-8cc3-5ef0-b72a2e0eaa9c@enterprisedb.com
* Perform apply of large transactions by parallel workers.Amit Kapila2023-01-093-2/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, for large transactions, the publisher sends the data in multiple streams (changes divided into chunks depending upon logical_decoding_work_mem), and then on the subscriber-side, the apply worker writes the changes into temporary files and once it receives the commit, it reads from those files and applies the entire transaction. To improve the performance of such transactions, we can instead allow them to be applied via parallel workers. In this approach, we assign a new parallel apply worker (if available) as soon as the xact's first stream is received and the leader apply worker will send changes to this new worker via shared memory. The parallel apply worker will directly apply the change instead of writing it to temporary files. However, if the leader apply worker times out while attempting to send a message to the parallel apply worker, it will switch to "partial serialize" mode - in this mode, the leader serializes all remaining changes to a file and notifies the parallel apply workers to read and apply them at the end of the transaction. We use a non-blocking way to send the messages from the leader apply worker to the parallel apply to avoid deadlocks. We keep this parallel apply assigned till the transaction commit is received and also wait for the worker to finish at commit. This preserves commit ordering and avoid writing to and reading from files in most cases. We still need to spill if there is no worker available. This patch also extends the SUBSCRIPTION 'streaming' parameter so that the user can control whether to apply the streaming transaction in a parallel apply worker or spill the change to disk. The user can set the streaming parameter to 'on/off', or 'parallel'. The parameter value 'parallel' means the streaming will be applied via a parallel apply worker, if available. The parameter value 'on' means the streaming transaction will be spilled to disk. The default value is 'off' (same as current behaviour). In addition, the patch extends the logical replication STREAM_ABORT message so that abort_lsn and abort_time can also be sent which can be used to update the replication origin in parallel apply worker when the streaming transaction is aborted. Because this message extension is needed to support parallel streaming, parallel streaming is not supported for publications on servers < PG16. Author: Hou Zhijie, Wang wei, Amit Kapila with design inputs from Sawada Masahiko Reviewed-by: Sawada Masahiko, Peter Smith, Dilip Kumar, Shi yu, Kuroda Hayato, Shveta Mallik Discussion: https://postgr.es/m/CAA4eK1+wyN6zpaHUkCLorEWNx75MG0xhMwcFhvjqm2KURZEAGw@mail.gmail.com
* Update copyright for 2023Bruce Momjian2023-01-0256-56/+56
| | | | Backpatch-through: 11
* Add const to BufFileWritePeter Eisentraut2022-12-301-1/+1
| | | | | | | | Make data buffer argument to BufFileWrite a const pointer and bubble this up to various callers and related APIs. This makes the APIs clearer and more consistent. Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
* Allow parent's WaitEventSets to be freed after fork().Thomas Munro2022-12-231-0/+1
| | | | | | | | | | | | | An epoll fd belonging to the parent should be closed in the child. A kqueue fd is automatically closed by fork(), but we should still adjust our counter. For poll and Windows systems, nothing special is required. On all systems we free the memory. No caller yet, but we'll need this if we start using WaitEventSet in the postmaster as planned. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com
* Add WL_SOCKET_ACCEPT event to WaitEventSet API.Thomas Munro2022-12-231-0/+7
| | | | | | | | | | | | | | | To be able to handle incoming connections on a server socket with the WaitEventSet API, we'll need a new kind of event to indicate that the the socket is ready to accept a connection. On Unix, it's just the same as WL_SOCKET_READABLE, but on Windows there is a different underlying kernel event that we need to map our abstraction to. No user yet, but a proposed patch would use this. Reviewed-by: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CA%2BhUKG%2BZ-HpOj1JsO9eWUP%2Bar7npSVinsC_npxSy%2BjdOMsx%3DGg%40mail.gmail.com
* Add copyright notices to meson filesAndrew Dunstan2022-12-201-0/+2
| | | | Discussion: https://postgr.es/m/222b43a5-2fb3-2c1b-9cd0-375d376c8246@dunslane.net
* Expose some information about backend subxact status.Robert Haas2022-12-191-1/+3
| | | | | | | | | | | | | | | A new function pg_stat_get_backend_subxact() can be used to get information about the number of subtransactions in the cache of a particular backend and whether that cache has overflowed. This can be useful for tracking down performance problems that can result from overflowed snapshots. Dilip Kumar, reviewed by Zhihong Yu, Nikolay Samokhvalov, Justin Pryzby, Nathan Bossart, Ashutosh Sharma, Julien Rouhaud. Additional design comments from Andres Freund, Tom Lane, Bruce Momjian, and David G. Johnston. Discussion: http://postgr.es/m/CAFiTN-ut0uwkRJDQJeDPXpVyTWD46m3gt3JDToE02hTfONEN=Q@mail.gmail.com
* Static assertions cleanupPeter Eisentraut2022-12-151-0/+3
| | | | | | | | | | | | | | | | | | | | | Because we added StaticAssertStmt() first before StaticAssertDecl(), some uses as well as the instructions in c.h are now a bit backwards from the "native" way static assertions are meant to be used in C. This updates the guidance and moves some static assertions to better places. Specifically, since the addition of StaticAssertDecl(), we can put static assertions at the file level. This moves a number of static assertions out of function bodies, where they might have been stuck out of necessity, to perhaps better places at the file level or in header files. Also, when the static assertion appears in a position where a declaration is allowed, then using StaticAssertDecl() is more native than StaticAssertStmt(). Reviewed-by: John Naylor <john.naylor@enterprisedb.com> Discussion: https://www.postgresql.org/message-id/flat/941a04e7-dd6f-c0e4-8cdf-a33b3338cbda%40enterprisedb.com
* Update types in File APIPeter Eisentraut2022-12-081-3/+3
| | | | | | | | | | | | | Make the argument types of the File API match stdio better: - Change the data buffer to void *, from char *. - Change FileWrite() data buffer to const on top of that. - Change amounts to size_t, from int. In passing, change the FilePrefetch() amount argument from int to off_t, to match the underlying posix_fadvise(). Discussion: https://www.postgresql.org/message-id/flat/11dda853-bb5b-59ba-a746-e168b1ce4bdb%40enterprisedb.com
* Improve heuristics for compressing the KnownAssignedXids array.Tom Lane2022-11-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, we'd compress only when the active range of array entries reached Max(4 * PROCARRAY_MAXPROCS, 2 * pArray->numKnownAssignedXids). If max_connections is large, the first term could result in not compressing for a long time, resulting in much wastage of cycles in hot-standby backends scanning the array to take snapshots. Get rid of that term, and just bound it to 2 * pArray->numKnownAssignedXids. That however creates the opposite risk, that we might spend too much effort compressing. Hence, consider compressing only once every 128 commit records. (This frequency was chosen by benchmarking. While we only tried one benchmark scenario, the results seem stable over a fairly wide range of frequencies.) Also, force compression when processing RecoveryInfo WAL records (which should be infrequent); the old code could perform compression then, but would do so only after the same array-range check as for the transaction-commit path. Also, opportunistically run compression if the startup process is about to wait for WAL, though not oftener than once a second. This should prevent cases where we waste lots of time by leaving the array not-compressed for long intervals due to low WAL traffic. Lastly, add a simple check to keep us from uselessly compressing when the array storage is already compact. Back-patch, as the performance problem is worse in pre-v14 branches than in HEAD. Simon Riggs and Michail Nikolaev, with help from Tom Lane and Andres Freund. Discussion: https://postgr.es/m/CALdSSPgahNUD_=pB_j=1zSnDBaiOtqVfzo8Ejt5J_k7qZiU1Tw@mail.gmail.com
* lwlock: Fix quadratic behavior with very long wait listsAndres Freund2022-11-202-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now LWLockDequeueSelf() sequentially searched the list of waiters to see if the current proc is still is on the list of waiters, or has already been removed. In extreme workloads, where the wait lists are very long, this leads to a quadratic behavior. #backends iterating over a list #backends long. Additionally, the likelihood of needing to call LWLockDequeueSelf() in the first place also increases with the increased length of the wait queue, as it becomes more likely that a lock is released while waiting for the wait list lock, which is held for longer during lock release. Due to the exponential back-off in perform_spin_delay() this is surprisingly hard to detect. We should make that easier, e.g. by adding a wait event around the pg_usleep() - but that's a separate patch. The fix is simple - track whether a proc is currently waiting in the wait list or already removed but waiting to be woken up in PGPROC->lwWaiting. In some workloads with a lot of clients contending for a small number of lwlocks (e.g. WALWriteLock), the fix can substantially increase throughput. As the quadratic behavior arguably is a bug, we might want to decide to backpatch this fix in the future. Author: Andres Freund <andres@anarazel.de> Reviewed-by: Bharath Rupireddy <bharath.rupireddyforpostgres@gmail.com> Discussion: https://postgr.es/m/20221027165914.2hofzp4cvutj6gin@awork3.anarazel.de Discussion: https://postgr.es/m/CALj2ACXktNbG=K8Xi7PSqbofTZozavhaxjatVc14iYaLu4Maag@mail.gmail.com
* Standardize rmgrdesc recovery conflict XID output.Peter Geoghegan2022-11-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Standardize on the name snapshotConflictHorizon for all XID fields from WAL records that generate recovery conflicts when in hot standby mode. This supersedes the previous latestRemovedXid naming convention. The new naming convention places emphasis on how the values are actually used by REDO routines. How the values are generated during original execution (details of which vary by record type) is deemphasized. Users of tools like pg_waldump can now grep for snapshotConflictHorizon to see all potential sources of recovery conflicts in a standardized way, without necessarily having to consider which specific record types might be involved. Also bring a couple of WAL record types that didn't follow any kind of naming convention into line. These are heapam's VISIBLE record type and SP-GiST's VACUUM_REDIRECT record type. Now every WAL record whose REDO routine calls ResolveRecoveryConflictWithSnapshot() passes through the snapshotConflictHorizon field from its WAL record. This is follow-up work to the refactoring from commit 9e540599 that made FREEZE_PAGE WAL records use a standard snapshotConflictHorizon style XID cutoff. No bump in XLOG_PAGE_MAGIC, since the underlying format of affected WAL records doesn't change. Author: Peter Geoghegan <pg@bowt.ie> Reviewed-By: Andres Freund <andres@anarazel.de> Discussion: https://postgr.es/m/CAH2-Wzm2CQUmViUq7Opgk=McVREHSOorYaAjR1ZpLYkRN7_dPw@mail.gmail.com
* Allow use of __sync_lock_test_and_set for spinlocks on any machine.Tom Lane2022-11-021-23/+44
| | | | | | | | | | | | | | | | | | | | | | | | | If we have no special-case code in s_lock.h for the current platform, but the compiler has __sync_lock_test_and_set, use that instead of failing. It's unlikely that anybody's __sync_lock_test_and_set would be so awful as to be worse than our semaphore-based fallback, but if it is, they can (continue to) use --disable-spinlocks. This allows removal of the RISC-V special case installed by commit c32fcac56, which generated exactly the same code but only on that platform. Usefully, the RISC-V buildfarm animals should now test at least the int variant of this patch. I've manually tested both variants on ARM by dint of removing the ARM-specific stanza. We don't want to drop that, because it already has some special knowledge and is likely to grow more over time. Likewise, this is not meant to preclude installing special cases for other arches if that proves worthwhile. Per discussion of a request to install the same code for loongarch64. Like the previous patch, we might as well back-patch to supported branches. Discussion: https://postgr.es/m/761ac43d44b84d679ba803c2bd947cc0@HSMAILSVR04.hs.handsome.com.cn
* Clean up some inconsistencies with GUC declarationsMichael Paquier2022-10-311-0/+9
| | | | | | | | | | | | | | | | | | | | This is similar to 7d25958, and this commit takes care of all the remaining inconsistencies between the initial value used in the C variable associated to a GUC and its default value stored in the GUC tables (as of pg_settings.boot_val). Some of the initial values of the GUCs updated rely on a compile-time default. These are refactored so as the GUC table and its C declaration use the same values. This makes everything consistent with other places, backend_flush_after, bgwriter_flush_after, port, checkpoint_flush_after doing so already, for example. Extracted from a larger patch by Peter Smith. The spots updated in the modules are from me. Author: Peter Smith, Michael Paquier Reviewed-by: Nathan Bossart, Tom Lane, Justin Pryzby Discussion: https://postgr.es/m/CAHut+PtHE0XSfjjRQ6D4v7+dqzCw=d+1a64ujra4EX8aoc_Z+w@mail.gmail.com
* Move pg_pwritev_with_retry() to src/common/file_utils.cMichael Paquier2022-10-271-6/+0
| | | | | | | | | | | | | This commit moves pg_pwritev_with_retry(), a convenience wrapper of pg_writev() able to handle partial writes, to common/file_utils.c so that the frontend code is able to use it. A first use-case targetted for this routine is pg_basebackup and pg_receivewal, for the zero-padding of a newly-initialized WAL segment. This is used currently in the backend when the GUC wal_init_zero is enabled (default). Author: Bharath Rupireddy Reviewed-by: Nathan Bossart, Thomas Munro Discussion: https://postgr.es/m/CALj2ACUq7nAb7=bJNbK3yYmp-SZhJcXFR_pLk8un6XgDzDF3OA@mail.gmail.com
* Fix some comments in proc.hMichael Paquier2022-10-151-3/+3
| | | | | | | | There was a typo and two places where delayChkpt was still mentioned, but it is called delayChkptFlags these days. Author: David Christensen Discussion: https://postgr.es/m/CAOxo6XLB=ab_Y9jRw4iKyMZDns0wo=EGSRvijhhaL67RzqbtMg@mail.gmail.com
* C comment: explain procArray->pgprocnos[]Bruce Momjian2022-10-111-1/+5
| | | | | | | | | | Reported-by: Aleksander Alekseev Discussion: https://postgr.es/m/CAJ7c6TOs9Dh3KNR2kiQJ3Ow0=TBucL_57DAbm--2p8w5x_8YXQ@mail.gmail.com Author: Aleksander Alekseev Backpatch-through: master
* Revert 56-bit relfilenode change and follow-up commits.Robert Haas2022-09-283-59/+15
| | | | | | | | There are still some alignment-related failures in the buildfarm, which might or might not be able to be fixed quickly, but I've also just realized that it increased the size of many WAL records by 4 bytes because a block reference contains a RelFileLocator. The effect of that hasn't been studied or discussed, so revert for now.
* Fix alignment problems with SharedInvalSmgrMsg.Robert Haas2022-09-281-2/+5
| | | | | | | | | | | | SharedInvalSmgrMsg can't require 8-byte alignment, because then SharedInvalidationMessage will require 8-byte alignment, which will then cause ParseCommitRecord to fail on machines that are picky about alignment, because it assumes that everything that gets packed into a commit record requires only 4-byte alignment. Another problem with 05d4cbf9b6ba708858984b01ca0fc56d59d4ec7c. Discussion: http://postgr.es/m/3825454.1664310917@sss.pgh.pa.us
* In BufTagGetForkNum, cast to the correct type.Robert Haas2022-09-271-1/+1
| | | | | | | | Another defect in 05d4cbf9b6ba708858984b01ca0fc56d59d4ec7c. Per CI, via Justin Pryzby. Discussion: http://postgr.es/m/20220927200712.GH6256@telsasoft.com