doc/src/sgml/high-availability.sgml


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
1001
1002
1003
1004
1005
1006
1007
1008
1009
1010
1011
1012
1013
1014
1015
1016
1017
1018
1019
1020
1021
1022
1023
1024
1025
1026
1027
1028
1029
1030
1031
1032
1033
1034
1035
1036
1037
1038
1039
1040
1041
1042
1043
1044
1045
1046
1047
1048
1049
1050
1051
1052
1053
1054
1055
1056
1057
1058
1059
1060
1061
1062
1063
1064
1065
1066
1067
1068
1069
1070
1071
1072
1073
1074
1075
1076
1077
1078
1079
1080
1081
1082
1083
1084
1085
1086
1087
1088
1089
1090
1091
1092
1093
1094
1095
1096
1097
1098
1099
1100
1101
1102
1103
1104
1105
1106
1107
1108
1109
1110
1111
1112
1113
1114
1115
1116
1117
1118
1119
1120
1121
1122
1123
1124
1125
1126
1127
1128
1129
1130
1131
1132
1133
1134
1135
1136
1137
1138
1139
1140
1141
1142
1143
1144
1145
1146
1147
1148
1149
1150
1151
1152
1153
1154
1155
1156
1157
1158
1159
1160
1161
1162
1163
1164
1165
1166
1167
1168
1169
1170
1171
1172
1173
1174
1175
1176
1177
1178
1179
1180
1181
1182
1183
1184
1185
1186
1187
1188
1189
1190
1191
1192
1193
1194
1195
1196
1197
1198
1199
1200
1201
1202
1203
1204
1205
1206
1207
1208
1209
1210
1211
1212
1213
1214
1215
1216
1217
1218
1219
1220
1221
1222
1223
1224
1225
1226
1227
1228
1229
1230
1231
1232
1233
1234
1235
1236
1237
1238
1239
1240
1241
1242
1243
1244
1245
1246
1247
1248
1249
1250
1251
1252
1253
1254
1255
1256
1257
1258
1259
1260
1261
1262
1263
1264
1265
1266
1267
1268
1269
1270
1271
1272
1273
1274
1275
1276
1277
1278
1279
1280
1281
1282
1283
1284
1285
1286
1287
1288
1289
1290
1291
1292
1293
1294
1295
1296
1297
1298
1299
1300
1301
1302
1303
1304
1305
1306
1307
1308
1309
1310
1311
1312
1313
1314
1315
1316
1317
1318
1319
1320
1321
1322
1323
1324
1325
1326
1327
1328
1329
1330
1331
1332
1333
1334
1335
1336
1337
1338
1339
1340
1341
1342
1343
1344
1345
1346
1347
1348
1349
1350
1351
1352
1353
1354
1355
1356
1357
1358
1359
1360
1361
1362
1363
1364
1365
1366
1367
1368
1369
1370
1371
1372
1373
1374
1375
1376
1377
1378
1379
1380
1381
1382
1383
1384
1385
1386
1387
1388
1389
1390
1391
1392
1393
1394
1395
1396
1397
1398
1399
1400
1401
1402
1403
1404
1405
1406
1407
1408
1409
1410
1411
1412
1413
1414
1415
1416
1417
1418
1419
1420
1421
1422
1423
1424
1425
1426
1427
1428
1429
1430
1431
1432
1433
1434
1435
1436
1437
1438
1439
1440
1441
1442
1443
1444
1445
1446
1447
1448
1449
1450
1451
1452
1453
1454
1455
1456
1457
1458
1459
1460
1461
1462
1463
1464
1465
1466
1467
1468
1469
1470
1471
1472
1473
1474
1475
1476
1477
1478
1479
1480
1481
1482
1483
1484
1485
1486
1487
1488
1489
1490
1491
1492
1493
1494
1495
1496
1497
1498
1499
1500
1501
1502
1503
1504
1505
1506
1507
1508
1509
1510
1511
1512
1513
1514
1515
1516
1517
1518
1519
1520
1521
1522
1523
1524
1525
1526
1527
1528
1529
1530
1531
1532
1533
1534
1535
1536
1537
1538
1539
1540
1541
1542
1543
1544
1545
1546
1547
1548
1549
1550
1551
1552
1553
1554
1555
1556
1557
1558
1559
1560
1561
1562
1563
1564
1565
1566
1567
1568
1569
1570
1571
1572
1573
1574
1575
1576
1577
1578
1579
1580
1581
1582
1583
1584
1585
1586
1587
1588
1589
1590
1591
1592
1593
1594
1595
1596
1597
1598
1599
1600
1601
1602
1603
1604
1605
1606
1607
1608
1609
1610
1611
1612
1613
1614
1615
1616
1617
1618
1619
1620
1621
1622
1623
1624
1625
1626
1627
1628
1629
1630
1631
1632
1633
1634
1635
1636
1637
1638
1639
1640
1641
1642
1643
1644
1645
1646
1647
1648
1649
1650
1651
1652
1653
1654
1655
1656
1657
1658
1659
1660
1661
1662
1663
1664
1665
1666
1667
1668
1669
1670
1671
1672
1673
1674
1675
1676
1677
1678
1679
1680
1681
1682
1683
1684
1685
1686
1687
1688
1689
1690
1691
1692
1693
1694
1695
1696
1697
1698
1699
1700
1701
1702
1703
1704
1705
1706
1707
1708
1709
1710
1711
1712
1713
1714
1715
1716
1717
1718
1719
1720
1721
1722
1723
1724
1725
1726
1727
1728
1729
1730
1731
1732
1733
1734
1735
1736
1737
1738
1739
1740
1741
1742
1743
1744
1745
1746
1747
1748
1749
1750
1751
1752
1753
1754
1755
1756
1757
1758
1759
1760
1761
1762
1763
1764
1765
1766
1767
1768
1769
1770
1771
1772
1773
1774
1775
1776
1777
1778
1779
1780
1781
1782
1783
1784
1785
1786
1787
1788
1789
1790
1791
1792
1793
1794
1795
1796
1797
1798
1799
1800
1801
1802
1803
1804
1805
1806
1807
1808
1809
1810
1811
1812
1813
1814
1815
1816
1817
1818
1819
1820
1821
1822
1823
1824
1825
1826
1827
1828
1829
1830
1831
1832
1833
1834
1835
1836
1837
1838
1839
1840
1841
1842
1843
1844
1845
1846
1847
1848
1849
1850
1851
1852
1853
1854
1855
1856
1857
1858
1859
1860
1861
1862
1863
1864
1865
1866
1867
1868
1869
1870
1871
1872
1873
1874
1875
1876
1877
1878
1879
1880
1881
1882
1883
1884
1885
1886
1887
1888
1889
1890
1891
1892
1893
1894
1895
1896
1897
1898
1899
1900
1901
1902
1903
1904
1905
1906
1907
1908
1909
1910
1911
1912
1913
1914
1915
1916
1917
1918
1919
1920
1921
1922
1923
1924
1925
1926
1927
1928
1929
1930
1931
1932
1933
1934
1935
1936
1937
1938
1939
1940
1941
1942
1943
1944
1945
1946
1947
1948
1949
1950
1951
1952
1953
1954
1955
1956
1957
1958
1959
1960
1961
1962
1963
1964
1965
1966
1967
1968
1969
1970
1971
1972
1973
1974
1975
1976
1977
1978
1979
1980
1981
1982
1983
1984
1985
1986
1987
1988
1989
1990
1991
1992
1993
1994
1995
1996
1997
1998
1999
2000
2001
2002
2003
2004
2005
2006
2007
2008
2009
2010
2011
2012
2013
2014
2015
2016
2017
2018
2019
2020
2021
2022
2023
2024
2025
2026
2027
2028
2029
2030
2031
2032
2033
2034
2035
2036
2037
2038
2039
2040
2041
2042
2043
2044
2045
2046
2047
2048
2049
2050
2051
2052
2053
2054
2055
2056
2057
2058
2059
2060
2061
2062
2063
2064
2065
2066
2067
2068
2069
2070
2071
2072
2073
2074
2075
2076
2077
2078
2079
2080
2081
2082
2083
2084
2085
2086
2087
2088
2089
2090
2091
2092
2093
2094
2095
2096
2097
2098
2099
2100
2101
2102
2103
2104
2105
2106
2107
2108
2109
2110
2111
2112
2113
2114
2115
2116
2117
2118
2119
2120
2121
2122
2123
2124
2125
2126
2127
2128
2129
2130
2131
2132
2133
2134
2135
2136
2137
2138
2139
2140
2141
2142
2143
2144
2145
2146
2147
2148
2149
2150
2151
2152
2153
2154
2155
2156
2157
2158
2159
2160
2161
2162
2163
2164
2165
2166
2167
2168
2169
2170
2171
2172
2173
2174
2175
2176
2177
2178
2179
2180
2181
2182
2183
2184
2185
2186
2187
2188
2189
2190
2191
2192
2193
2194
2195
2196
2197
2198
2199
2200
2201
2202
2203
2204
2205
2206
2207
2208
2209
2210
2211
2212
2213
2214
2215
2216
2217
2218
2219
2220
2221
2222
2223
2224
2225
2226
2227
2228
2229
2230
2231
2232
2233
2234
2235
2236
2237
2238
2239
2240
2241
2242
2243
2244
2245
2246
2247
2248
2249
2250
2251
2252
2253
2254
2255
2256
2257
2258
2259
2260
2261
2262
2263
2264
2265
2266
2267
2268
2269
2270
2271
2272
2273
2274
2275
2276
2277
2278
2279
2280
2281
2282
2283
2284
2285
2286
2287
2288
2289
2290
2291
2292
2293
2294
2295
2296
2297
2298
2299
2300
2301
2302
2303
2304
2305
2306
2307
2308
2309
2310
2311
2312
2313
2314
2315
2316
2317
2318
2319
2320
2321
2322
2323
2324
2325
2326
2327
2328

<!-- doc/src/sgml/high-availability.sgml -->

<chapter id="high-availability">
 <title>High Availability, Load Balancing, and Replication</title>

 <indexterm><primary>high availability</primary></indexterm>
 <indexterm><primary>failover</primary></indexterm>
 <indexterm><primary>replication</primary></indexterm>
 <indexterm><primary>load balancing</primary></indexterm>
 <indexterm><primary>clustering</primary></indexterm>
 <indexterm><primary>data partitioning</primary></indexterm>

 <para>
  Database servers can work together to allow a second server to
  take over quickly if the primary server fails (high
  availability), or to allow several computers to serve the same
  data (load balancing).  Ideally, database servers could work
  together seamlessly.  Web servers serving static web pages can
  be combined quite easily by merely load-balancing web requests
  to multiple machines.  In fact, read-only database servers can
  be combined relatively easily too.  Unfortunately, most database
  servers have a read/write mix of requests, and read/write servers
  are much harder to combine.  This is because though read-only
  data needs to be placed on each server only once, a write to any
  server has to be propagated to all servers so that future read
  requests to those servers return consistent results.
 </para>

 <para>
  This synchronization problem is the fundamental difficulty for
  servers working together.  Because there is no single solution
  that eliminates the impact of the sync problem for all use cases,
  there are multiple solutions.  Each solution addresses this
  problem in a different way, and minimizes its impact for a specific
  workload.
 </para>

 <para>
  Some solutions deal with synchronization by allowing only one
  server to modify the data.  Servers that can modify data are
  called read/write, <firstterm>master</firstterm> or <firstterm>primary</firstterm> servers.
  Servers that track changes in the primary are called <firstterm>standby</firstterm>
  or <firstterm>secondary</firstterm> servers. A standby server that cannot be connected
  to until it is promoted to a primary server is called a <firstterm>warm
  standby</firstterm> server, and one that can accept connections and serves read-only
  queries is called a <firstterm>hot standby</firstterm> server.
 </para>

 <para>
  Some solutions are synchronous,
  meaning that a data-modifying transaction is not considered
  committed until all servers have committed the transaction.  This
  guarantees that a failover will not lose any data and that all
  load-balanced servers will return consistent results no matter
  which server is queried. In contrast, asynchronous solutions allow some
  delay between the time of a commit and its propagation to the other servers,
  opening the possibility that some transactions might be lost in
  the switch to a backup server, and that load balanced servers
  might return slightly stale results.  Asynchronous communication
  is used when synchronous would be too slow.
 </para>

 <para>
  Solutions can also be categorized by their granularity.  Some solutions
  can deal only with an entire database server, while others allow control
  at the per-table or per-database level.
 </para>

 <para>
  Performance must be considered in any choice.  There is usually a
  trade-off between functionality and
  performance.  For example, a fully synchronous solution over a slow
  network might cut performance by more than half, while an asynchronous
  one might have a minimal performance impact.
 </para>

 <para>
  The remainder of this section outlines various failover, replication,
  and load balancing solutions.
 </para>

 <sect1 id="different-replication-solutions">
 <title>Comparison of Different Solutions</title>

 <variablelist>

  <varlistentry>
   <term>Shared Disk Failover</term>
   <listitem>

    <para>
     Shared disk failover avoids synchronization overhead by having only one
     copy of the database.  It uses a single disk array that is shared by
     multiple servers.  If the main database server fails, the standby server
     is able to mount and start the database as though it were recovering from
     a database crash.  This allows rapid failover with no data loss.
    </para>

    <para>
     Shared hardware functionality is common in network storage devices.
     Using a network file system is also possible, though care must be
     taken that the file system has full <acronym>POSIX</acronym> behavior (see <xref
     linkend="creating-cluster-nfs"/>).  One significant limitation of this
     method is that if the shared disk array fails or becomes corrupt, the
     primary and standby servers are both nonfunctional.  Another issue is
     that the standby server should never access the shared storage while
     the primary server is running.
    </para>

   </listitem>
  </varlistentry>

  <varlistentry>
   <term>File System (Block Device) Replication</term>
   <listitem>

    <para>
     A modified version of shared hardware functionality is file system
     replication, where all changes to a file system are mirrored to a file
     system residing on another computer.  The only restriction is that
     the mirroring must be done in a way that ensures the standby server
     has a consistent copy of the file system &mdash; specifically, writes
     to the standby must be done in the same order as those on the primary.
     <productname>DRBD</productname> is a popular file system replication solution
     for Linux.
    </para>

<!--
https://forge.continuent.org/pipermail/sequoia/2006-November/004070.html

Oracle RAC is a shared disk approach and just send cache invalidations
to other nodes but not actual data. As the disk is shared, data is
only committed once to disk and there is a distributed locking
protocol to make nodes agree on a serializable transactional order.
-->

   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Write-Ahead Log Shipping</term>
   <listitem>

    <para>
     Warm and hot standby servers can be kept current by reading a
     stream of write-ahead log (<acronym>WAL</acronym>)
     records.  If the main server fails, the standby contains
     almost all of the data of the main server, and can be quickly
     made the new primary database server.  This can be synchronous or
     asynchronous and can only be done for the entire database server.
    </para>
    <para>
     A standby server can be implemented using file-based log shipping
     (<xref linkend="warm-standby"/>) or streaming replication (see
     <xref linkend="streaming-replication"/>), or a combination of both. For
     information on hot standby, see <xref linkend="hot-standby"/>.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Logical Replication</term>
   <listitem>
    <para>
     Logical replication allows a database server to send a stream of data
     modifications to another server.  <productname>PostgreSQL</productname>
     logical replication constructs a stream of logical data modifications
     from the WAL.  Logical replication allows replication of data changes on
     a per-table basis.  In addition, a server that is publishing its own
     changes can also subscribe to changes from another server, allowing data
     to flow in multiple directions.  For more information on logical
     replication, see <xref linkend="logical-replication"/>.  Through the
     logical decoding interface (<xref linkend="logicaldecoding"/>),
     third-party extensions can also provide similar functionality.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Trigger-Based Primary-Standby Replication</term>
   <listitem>

    <para>
     A trigger-based replication setup typically funnels data modification
     queries to a designated primary server. Operating on a per-table basis,
     the primary server sends data changes (typically) asynchronously to the
     standby servers.  Standby servers can answer queries while the primary is
     running, and may allow some local data changes or write activity.  This
     form of replication is often used for offloading large analytical or data
     warehouse queries.
    </para>

    <para>
     <productname>Slony-I</productname> is an example of this type of
     replication, with per-table granularity, and support for multiple standby
     servers.  Because it updates the standby server asynchronously (in
     batches), there is possible data loss during fail over.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>SQL-Based Replication Middleware</term>
   <listitem>

    <para>
     With SQL-based replication middleware, a program intercepts
     every SQL query and sends it to one or all servers.  Each server
     operates independently.  Read-write queries must be sent to all servers,
     so that every server receives any changes.  But read-only queries can be
     sent to just one server, allowing the read workload to be distributed
     among them.
    </para>

    <para>
     If queries are simply broadcast unmodified, functions like
     <function>random()</function>, <function>CURRENT_TIMESTAMP</function>, and
     sequences can have different values on different servers.
     This is because each server operates independently, and because
     SQL queries are broadcast rather than actual data changes.  If
     this is unacceptable, either the middleware or the application
     must determine such values from a single source and then use those
     values in write queries.  Care must also be taken that all
     transactions either commit or abort on all servers, perhaps
     using two-phase commit (<xref linkend="sql-prepare-transaction"/>
     and <xref linkend="sql-commit-prepared"/>).
     <productname>Pgpool-II</productname> and <productname>Continuent Tungsten</productname>
     are examples of this type of replication.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Asynchronous Multimaster Replication</term>
   <listitem>

    <para>
     For servers that are not regularly connected or have slow
     communication links, like laptops or
     remote servers, keeping data consistent among servers is a
     challenge.  Using asynchronous multimaster replication, each
     server works independently, and periodically communicates with
     the other servers to identify conflicting transactions.  The
     conflicts can be resolved by users or conflict resolution rules.
     Bucardo is an example of this type of replication.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Synchronous Multimaster Replication</term>
   <listitem>

    <para>
     In synchronous multimaster replication, each server can accept
     write requests, and modified data is transmitted from the
     original server to every other server before each transaction
     commits.  Heavy write activity can cause excessive locking and
     commit delays, leading to poor performance.  Read requests can
     be sent to any server.  Some implementations use shared disk
     to reduce the communication overhead.  Synchronous multimaster
     replication is best for mostly read workloads, though its big
     advantage is that any server can accept write requests &mdash;
     there is no need to partition workloads between primary and
     standby servers, and because the data changes are sent from one
     server to another, there is no problem with non-deterministic
     functions like <function>random()</function>.
    </para>

    <para>
     <productname>PostgreSQL</productname> does not offer this type of replication,
     though <productname>PostgreSQL</productname> two-phase commit (<xref
     linkend="sql-prepare-transaction"/> and <xref
     linkend="sql-commit-prepared"/>)
     can be used to implement this in application code or middleware.
    </para>
   </listitem>
  </varlistentry>

 </variablelist>

 <para>
  <xref linkend="high-availability-matrix"/> summarizes
  the capabilities of the various solutions listed above.
 </para>

 <table id="high-availability-matrix">
  <title>High Availability, Load Balancing, and Replication Feature Matrix</title>
  <tgroup cols="9">
   <colspec colname="col1" colwidth="1.1*"/>
   <colspec colname="col2" colwidth="1*"/>
   <colspec colname="col3" colwidth="1*"/>
   <colspec colname="col4" colwidth="1*"/>
   <colspec colname="col5" colwidth="1*"/>
   <colspec colname="col6" colwidth="1*"/>
   <colspec colname="col7" colwidth="1*"/>
   <colspec colname="col8" colwidth="1*"/>
   <colspec colname="col9" colwidth="1*"/>
   <thead>
    <row>
     <entry>Feature</entry>
     <entry>Shared Disk</entry>
     <entry>File System Repl.</entry>
     <entry>Write-Ahead Log Shipping</entry>
     <entry>Logical Repl.</entry>
     <entry>Trigger-&zwsp;Based Repl.</entry>
     <entry>SQL Repl. Middle-ware</entry>
     <entry>Async. MM Repl.</entry>
     <entry>Sync. MM Repl.</entry>
    </row>
   </thead>

   <tbody>

    <row>
     <entry>Popular examples</entry>
     <entry align="center">NAS</entry>
     <entry align="center">DRBD</entry>
     <entry align="center">built-in streaming repl.</entry>
     <entry align="center">built-in logical repl., pglogical</entry>
     <entry align="center">Londiste, Slony</entry>
     <entry align="center">pgpool-II</entry>
     <entry align="center">Bucardo</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Comm. method</entry>
     <entry align="center">shared disk</entry>
     <entry align="center">disk blocks</entry>
     <entry align="center">WAL</entry>
     <entry align="center">logical decoding</entry>
     <entry align="center">table rows</entry>
     <entry align="center">SQL</entry>
     <entry align="center">table rows</entry>
     <entry align="center">table rows and row locks</entry>
    </row>

    <row>
     <entry>No special hardware required</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Allows multiple primary servers</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No overhead on primary</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>No waiting for multiple servers</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">with sync off</entry>
     <entry align="center">with sync off</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
    </row>

    <row>
     <entry>Primary failure will never lose data</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">with sync on</entry>
     <entry align="center">with sync on</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Replicas accept read-only queries</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">with hot standby</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>Per-table granularity</entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
    </row>

    <row>
     <entry>No conflict resolution necessary</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
     <entry align="center">&bull;</entry>
     <entry align="center"></entry>
     <entry align="center">&bull;</entry>
    </row>

   </tbody>
  </tgroup>
 </table>

 <para>
  There are a few solutions that do not fit into the above categories:
 </para>

 <variablelist>

  <varlistentry>
   <term>Data Partitioning</term>
   <listitem>

    <para>
     Data partitioning splits tables into data sets.  Each set can
     be modified by only one server.  For example, data can be
     partitioned by offices, e.g., London and Paris, with a server
     in each office.  If queries combining London and Paris data
     are necessary, an application can query both servers, or
     primary/standby replication can be used to keep a read-only copy
     of the other office's data on each server.
    </para>
   </listitem>
  </varlistentry>

  <varlistentry>
   <term>Multiple-Server Parallel Query Execution</term>
   <listitem>

    <para>
     Many of the above solutions allow multiple servers to handle multiple
     queries, but none allow a single query to use multiple servers to
     complete faster.  This solution allows multiple servers to work
     concurrently on a single query.  It is usually accomplished by
     splitting the data among servers and having each server execute its
     part of the query and return results to a central server where they
     are combined and returned to the user. This can be implemented using the
     <productname>PL/Proxy</productname> tool set.
    </para>

   </listitem>
  </varlistentry>

 </variablelist>

  <para>
   It should also be noted that because <productname>PostgreSQL</productname>
   is open source and easily extended, a number of companies have
   taken <productname>PostgreSQL</productname> and created commercial
   closed-source solutions with unique failover, replication, and load
   balancing capabilities.  These are not discussed here.
  </para>

 </sect1>


 <sect1 id="warm-standby">
 <title>Log-Shipping Standby Servers</title>


  <para>
   Continuous archiving can be used to create a <firstterm>high
   availability</firstterm> (HA) cluster configuration with one or more
   <firstterm>standby servers</firstterm> ready to take over operations if the
   primary server fails. This capability is widely referred to as
   <firstterm>warm standby</firstterm> or <firstterm>log shipping</firstterm>.
  </para>

  <para>
   The primary and standby server work together to provide this capability,
   though the servers are only loosely coupled. The primary server operates
   in continuous archiving mode, while each standby server operates in
   continuous recovery mode, reading the WAL files from the primary. No
   changes to the database tables are required to enable this capability,
   so it offers low administration overhead compared to some other
   replication solutions. This configuration also has relatively low
   performance impact on the primary server.
  </para>

  <para>
   Directly moving WAL records from one database server to another
   is typically described as log shipping. <productname>PostgreSQL</productname>
   implements file-based log shipping by transferring WAL records
   one file (WAL segment) at a time. WAL files (16MB) can be
   shipped easily and cheaply over any distance, whether it be to an
   adjacent system, another system at the same site, or another system on
   the far side of the globe. The bandwidth required for this technique
   varies according to the transaction rate of the primary server.
   Record-based log shipping is more granular and streams WAL changes
   incrementally over a network connection (see <xref
   linkend="streaming-replication"/>).
  </para>

  <para>
   It should be noted that log shipping is asynchronous, i.e., the WAL
   records are shipped after transaction commit. As a result, there is a
   window for data loss should the primary server suffer a catastrophic
   failure; transactions not yet shipped will be lost.  The size of the
   data loss window in file-based log shipping can be limited by use of the
   <varname>archive_timeout</varname> parameter, which can be set as low
   as a few seconds.  However such a low setting will
   substantially increase the bandwidth required for file shipping.
   Streaming replication (see <xref linkend="streaming-replication"/>)
   allows a much smaller window of data loss.
  </para>

  <para>
   Recovery performance is sufficiently good that the standby will
   typically be only moments away from full
   availability once it has been activated. As a result, this is called
   a warm standby configuration which offers high
   availability. Restoring a server from an archived base backup and
   rollforward will take considerably longer, so that technique only
   offers a solution for disaster recovery, not high availability.
   A standby server can also be used for read-only queries, in which case
   it is called a <firstterm>hot standby</firstterm> server. See
   <xref linkend="hot-standby"/> for more information.
  </para>

  <indexterm zone="high-availability">
   <primary>warm standby</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>PITR standby</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>standby server</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>log shipping</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>witness server</primary>
  </indexterm>

  <indexterm zone="high-availability">
   <primary>STONITH</primary>
  </indexterm>

  <sect2 id="standby-planning">
   <title>Planning</title>

   <para>
    It is usually wise to create the primary and standby servers
    so that they are as similar as possible, at least from the
    perspective of the database server.  In particular, the path names
    associated with tablespaces will be passed across unmodified, so both
    primary and standby servers must have the same mount paths for
    tablespaces if that feature is used.  Keep in mind that if
    <xref linkend="sql-createtablespace"/>
    is executed on the primary, any new mount point needed for it must
    be created on the primary and all standby servers before the command
    is executed. Hardware need not be exactly the same, but experience shows
    that maintaining two identical systems is easier than maintaining two
    dissimilar ones over the lifetime of the application and system.
    In any case the hardware architecture must be the same &mdash; shipping
    from, say, a 32-bit to a 64-bit system will not work.
   </para>

   <para>
    In general, log shipping between servers running different major
    <productname>PostgreSQL</productname> release
    levels is not possible. It is the policy of the PostgreSQL Global
    Development Group not to make changes to disk formats during minor release
    upgrades, so it is likely that running different minor release levels
    on primary and standby servers will work successfully. However, no
    formal support for that is offered and you are advised to keep primary
    and standby servers at the same release level as much as possible.
    When updating to a new minor release, the safest policy is to update
    the standby servers first &mdash; a new minor release is more likely
    to be able to read WAL files from a previous minor release than vice
    versa.
   </para>

  </sect2>

  <sect2 id="standby-server-operation" xreflabel="Standby Server Operation">
   <title>Standby Server Operation</title>

   <para>
    A server enters standby mode if a
    <anchor id="file-standby-signal" xreflabel="standby.signal"/>
    <filename>standby.signal</filename>
    <indexterm><primary><filename>standby.signal</filename></primary></indexterm>
    file exists in the data directory when the server is started.
   </para>

   <para>
    In standby mode, the server continuously applies WAL received from the
    primary server. The standby server can read WAL from a WAL archive
    (see <xref linkend="guc-restore-command"/>) or directly from the primary
    over a TCP connection (streaming replication). The standby server will
    also attempt to restore any WAL found in the standby cluster's
    <filename>pg_wal</filename> directory. That typically happens after a server
    restart, when the standby replays again WAL that was streamed from the
    primary before the restart, but you can also manually copy files to
    <filename>pg_wal</filename> at any time to have them replayed.
   </para>

   <para>
    At startup, the standby begins by restoring all WAL available in the
    archive location, calling <varname>restore_command</varname>. Once it
    reaches the end of WAL available there and <varname>restore_command</varname>
    fails, it tries to restore any WAL available in the <filename>pg_wal</filename> directory.
    If that fails, and streaming replication has been configured, the
    standby tries to connect to the primary server and start streaming WAL
    from the last valid record found in archive or <filename>pg_wal</filename>. If that fails
    or streaming replication is not configured, or if the connection is
    later disconnected, the standby goes back to step 1 and tries to
    restore the file from the archive again. This loop of retries from the
    archive, <filename>pg_wal</filename>, and via streaming replication goes on until the server
    is stopped or failover is triggered by a trigger file.
   </para>

   <para>
    Standby mode is exited and the server switches to normal operation
    when <command>pg_ctl promote</command> is run, or
    <function>pg_promote()</function> is called. Before failover,
    any WAL immediately available in the archive or in <filename>pg_wal</filename>
    will be restored, but no attempt is made to connect to the primary.
   </para>
  </sect2>

  <sect2 id="preparing-primary-for-standby">
   <title>Preparing the Primary for Standby Servers</title>

   <para>
    Set up continuous archiving on the primary to an archive directory
    accessible from the standby, as described
    in <xref linkend="continuous-archiving"/>. The archive location should be
    accessible from the standby even when the primary is down, i.e., it should
    reside on the standby server itself or another trusted server, not on
    the primary server.
   </para>

   <para>
    If you want to use streaming replication, set up authentication on the
    primary server to allow replication connections from the standby
    server(s); that is, create a role and provide a suitable entry or
    entries in <filename>pg_hba.conf</filename> with the database field set to
    <literal>replication</literal>.  Also ensure <varname>max_wal_senders</varname> is set
    to a sufficiently large value in the configuration file of the primary
    server. If replication slots will be used,
    ensure that <varname>max_replication_slots</varname> is set sufficiently
    high as well.
   </para>

   <para>
    Take a base backup as described in <xref linkend="backup-base-backup"/>
    to bootstrap the standby server.
   </para>
  </sect2>

  <sect2 id="standby-server-setup">
   <title>Setting Up a Standby Server</title>

   <para>
    To set up the standby server, restore the base backup taken from primary
    server (see <xref linkend="backup-pitr-recovery"/>). Create a file
    <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary></indexterm>
    in the standby's cluster data
    directory. Set <xref linkend="guc-restore-command"/> to a simple command to copy files from
    the WAL archive. If you plan to have multiple standby servers for high
    availability purposes, make sure that <varname>recovery_target_timeline</varname> is set to
    <literal>latest</literal> (the default), to make the standby server follow the timeline change
    that occurs at failover to another standby.
   </para>

   <note>
     <para>
     <xref linkend="guc-restore-command"/> should return immediately
     if the file does not exist; the server will retry the command again if
     necessary.
    </para>
   </note>

   <para>
     If you want to use streaming replication, fill in
     <xref linkend="guc-primary-conninfo"/> with a libpq connection string, including
     the host name (or IP address) and any additional details needed to
     connect to the primary server. If the primary needs a password for
     authentication, the password needs to be specified in
     <xref linkend="guc-primary-conninfo"/> as well.
   </para>

   <para>
    If you're setting up the standby server for high availability purposes,
    set up WAL archiving, connections and authentication like the primary
    server, because the standby server will work as a primary server after
    failover.
   </para>

   <para>
    If you're using a WAL archive, its size can be minimized using the <xref
    linkend="guc-archive-cleanup-command"/> parameter to remove files that are no
    longer required by the standby server.
    The <application>pg_archivecleanup</application> utility is designed specifically to
    be used with <varname>archive_cleanup_command</varname> in typical single-standby
    configurations, see <xref linkend="pgarchivecleanup"/>.
    Note however, that if you're using the archive for backup purposes, you
    need to retain files needed to recover from at least the latest base
    backup, even if they're no longer needed by the standby.
   </para>

   <para>
    A simple example of configuration is:
<programlisting>
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass options=''-c wal_sender_timeout=5000'''
restore_command = 'cp /path/to/archive/%f %p'
archive_cleanup_command = 'pg_archivecleanup /path/to/archive %r'
</programlisting>
   </para>

   <para>
    You can have any number of standby servers, but if you use streaming
    replication, make sure you set <varname>max_wal_senders</varname> high enough in
    the primary to allow them to be connected simultaneously.
   </para>

  </sect2>

  <sect2 id="streaming-replication">
   <title>Streaming Replication</title>

   <indexterm zone="high-availability">
    <primary>Streaming Replication</primary>
   </indexterm>

   <para>
    Streaming replication allows a standby server to stay more up-to-date
    than is possible with file-based log shipping. The standby connects
    to the primary, which streams WAL records to the standby as they're
    generated, without waiting for the WAL file to be filled.
   </para>

   <para>
    Streaming replication is asynchronous by default
    (see <xref linkend="synchronous-replication"/>), in which case there is
    a small delay between committing a transaction in the primary and the
    changes becoming visible in the standby. This delay is however much
    smaller than with file-based log shipping, typically under one second
    assuming the standby is powerful enough to keep up with the load. With
    streaming replication, <varname>archive_timeout</varname> is not required to
    reduce the data loss window.
   </para>

   <para>
    If you use streaming replication without file-based continuous
    archiving, the server might recycle old WAL segments before the standby
    has received them.  If this occurs, the standby will need to be
    reinitialized from a new base backup.  You can avoid this by setting
    <varname>wal_keep_size</varname> to a value large enough to ensure that
    WAL segments are not recycled too early, or by configuring a replication
    slot for the standby.  If you set up a WAL archive that's accessible from
    the standby, these solutions are not required, since the standby can
    always use the archive to catch up provided it retains enough segments.
   </para>

   <para>
    To use streaming replication, set up a file-based log-shipping standby
    server as described in <xref linkend="warm-standby"/>. The step that
    turns a file-based log-shipping standby into streaming replication
    standby is setting the <varname>primary_conninfo</varname> setting
    to point to the primary server. Set
    <xref linkend="guc-listen-addresses"/> and authentication options
    (see <filename>pg_hba.conf</filename>) on the primary so that the standby server
    can connect to the <literal>replication</literal> pseudo-database on the primary
    server (see <xref linkend="streaming-replication-authentication"/>).
   </para>

   <para>
    On systems that support the keepalive socket option, setting
    <xref linkend="guc-tcp-keepalives-idle"/>,
    <xref linkend="guc-tcp-keepalives-interval"/> and
    <xref linkend="guc-tcp-keepalives-count"/> helps the primary promptly
    notice a broken connection.
   </para>

   <para>
    Set the maximum number of concurrent connections from the standby servers
    (see <xref linkend="guc-max-wal-senders"/> for details).
   </para>

   <para>
    When the standby is started and <varname>primary_conninfo</varname> is set
    correctly, the standby will connect to the primary after replaying all
    WAL files available in the archive. If the connection is established
    successfully, you will see a <literal>walreceiver</literal> in the standby, and
    a corresponding <literal>walsender</literal> process in the primary.
   </para>

   <sect3 id="streaming-replication-authentication">
    <title>Authentication</title>
    <para>
     It is very important that the access privileges for replication be set up
     so that only trusted users can read the WAL stream, because it is
     easy to extract privileged information from it.  Standby servers must
     authenticate to the primary as an account that has the
     <literal>REPLICATION</literal> privilege or a superuser. It is
     recommended to create a dedicated user account with
     <literal>REPLICATION</literal> and <literal>LOGIN</literal>
     privileges for replication. While <literal>REPLICATION</literal>
     privilege gives very high permissions, it does not allow the user to
     modify any data on the primary system, which the
     <literal>SUPERUSER</literal> privilege does.
    </para>

    <para>
     Client authentication for replication is controlled by a
     <filename>pg_hba.conf</filename> record specifying <literal>replication</literal> in the
     <replaceable>database</replaceable> field. For example, if the standby is running on
     host IP <literal>192.168.1.100</literal> and the account name for replication
     is <literal>foo</literal>, the administrator can add the following line to the
     <filename>pg_hba.conf</filename> file on the primary:

<programlisting>
# Allow the user "foo" from host 192.168.1.100 to connect to the primary
# as a replication standby if the user's password is correctly supplied.
#
# TYPE  DATABASE        USER            ADDRESS                 METHOD
host    replication     foo             192.168.1.100/32        md5
</programlisting>
    </para>
    <para>
     The host name and port number of the primary, connection user name,
     and password are specified in the <xref linkend="guc-primary-conninfo"/>.
     The password can also be set in the <filename>~/.pgpass</filename> file on the
     standby (specify <literal>replication</literal> in the <replaceable>database</replaceable>
     field).
     For example, if the primary is running on host IP <literal>192.168.1.50</literal>,
     port <literal>5432</literal>, the account name for replication is
     <literal>foo</literal>, and the password is <literal>foopass</literal>, the administrator
     can add the following line to the <filename>postgresql.conf</filename> file on the
     standby:

<programlisting>
# The standby connects to the primary that is running on host 192.168.1.50
# and port 5432 as the user "foo" whose password is "foopass".
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
</programlisting>
    </para>
   </sect3>

   <sect3 id="streaming-replication-monitoring">
    <title>Monitoring</title>
    <para>
     An important health indicator of streaming replication is the amount
     of WAL records generated in the primary, but not yet applied in the
     standby. You can calculate this lag by comparing the current WAL write
     location on the primary with the last WAL location received by the
     standby. These locations can be retrieved using
     <function>pg_current_wal_lsn</function> on the primary and
     <function>pg_last_wal_receive_lsn</function> on the standby,
     respectively (see <xref linkend="functions-admin-backup-table"/> and
     <xref linkend="functions-recovery-info-table"/> for details).
     The last WAL receive location in the standby is also displayed in the
     process status of the WAL receiver process, displayed using the
     <command>ps</command> command (see <xref linkend="monitoring-ps"/> for details).
    </para>
    <para>
     You can retrieve a list of WAL sender processes via the
     <link linkend="monitoring-pg-stat-replication-view"><structname>
     pg_stat_replication</structname></link> view. Large differences between
     <function>pg_current_wal_lsn</function> and the view's <literal>sent_lsn</literal> field
     might indicate that the primary server is under heavy load, while
     differences between <literal>sent_lsn</literal> and
     <function>pg_last_wal_receive_lsn</function> on the standby might indicate
     network delay, or that the standby is under heavy load.
    </para>
    <para>
     On a hot standby, the status of the WAL receiver process can be retrieved
     via the <link linkend="monitoring-pg-stat-wal-receiver-view">
     <structname>pg_stat_wal_receiver</structname></link> view.  A large
     difference between <function>pg_last_wal_replay_lsn</function> and the
     view's <literal>flushed_lsn</literal> indicates that WAL is being
     received faster than it can be replayed.
    </para>
   </sect3>
  </sect2>

  <sect2 id="streaming-replication-slots">
   <title>Replication Slots</title>
   <indexterm>
    <primary>replication slot</primary>
    <secondary>streaming replication</secondary>
   </indexterm>
   <para>
    Replication slots provide an automated way to ensure that the primary does
    not remove WAL segments until they have been received by all standbys,
    and that the primary does not remove rows which could cause a
    <link linkend="hot-standby-conflict">recovery conflict</link> even when the
    standby is disconnected.
   </para>
   <para>
    In lieu of using replication slots, it is possible to prevent the removal
    of old WAL segments using <xref linkend="guc-wal-keep-size"/>, or by
    storing the segments in an archive using
    <xref linkend="guc-archive-command"/> or <xref linkend="guc-archive-library"/>.
    However, these methods often result in retaining more WAL segments than
    required, whereas replication slots retain only the number of segments
    known to be needed.  On the other hand, replication slots can retain so
    many WAL segments that they fill up the space allocated
    for <literal>pg_wal</literal>;
    <xref linkend="guc-max-slot-wal-keep-size"/> limits the size of WAL files
    retained by replication slots.
   </para>
   <para>
    Similarly, <xref linkend="guc-hot-standby-feedback"/> on its own, without
    also using a replication slot, provides protection against relevant rows
    being removed by vacuum, but provides no protection during any time period
    when the standby is not connected.  Replication slots overcome these
    disadvantages.
   </para>
   <sect3 id="streaming-replication-slots-manipulation">
    <title>Querying and Manipulating Replication Slots</title>
    <para>
     Each replication slot has a name, which can contain lower-case letters,
     numbers, and the underscore character.
    </para>
    <para>
     Existing replication slots and their state can be seen in the
     <link linkend="view-pg-replication-slots"><structname>pg_replication_slots</structname></link>
     view.
    </para>
    <para>
     Slots can be created and dropped either via the streaming replication
     protocol (see <xref linkend="protocol-replication"/>) or via SQL
     functions (see <xref linkend="functions-replication"/>).
    </para>
   </sect3>
   <sect3 id="streaming-replication-slots-config">
    <title>Configuration Example</title>
    <para>
     You can create a replication slot like this:
<programlisting>
postgres=# SELECT * FROM pg_create_physical_replication_slot('node_a_slot');
  slot_name  | lsn
-------------+-----
 node_a_slot |

postgres=# SELECT slot_name, slot_type, active FROM pg_replication_slots;
  slot_name  | slot_type | active
-------------+-----------+--------
 node_a_slot | physical  | f
(1 row)
</programlisting>
     To configure the standby to use this slot, <varname>primary_slot_name</varname>
     should be configured on the standby. Here is a simple example:
<programlisting>
primary_conninfo = 'host=192.168.1.50 port=5432 user=foo password=foopass'
primary_slot_name = 'node_a_slot'
</programlisting>
    </para>
   </sect3>
  </sect2>

  <sect2 id="cascading-replication">
   <title>Cascading Replication</title>

   <indexterm zone="high-availability">
    <primary>Cascading Replication</primary>
   </indexterm>

   <para>
    The cascading replication feature allows a standby server to accept replication
    connections and stream WAL records to other standbys, acting as a relay.
    This can be used to reduce the number of direct connections to the primary
    and also to minimize inter-site bandwidth overheads.
   </para>

   <para>
    A standby acting as both a receiver and a sender is known as a cascading
    standby.  Standbys that are more directly connected to the primary are known
    as upstream servers, while those standby servers further away are downstream
    servers.  Cascading replication does not place limits on the number or
    arrangement of downstream servers, though each standby connects to only
    one upstream server which eventually links to a single primary server.
   </para>

   <para>
    A cascading standby sends not only WAL records received from the
    primary but also those restored from the archive. So even if the replication
    connection in some upstream connection is terminated, streaming replication
    continues downstream for as long as new WAL records are available.
   </para>

   <para>
    Cascading replication is currently asynchronous. Synchronous replication
    (see <xref linkend="synchronous-replication"/>) settings have no effect on
    cascading replication at present.
   </para>

   <para>
    Hot standby feedback propagates upstream, whatever the cascaded arrangement.
   </para>

   <para>
    If an upstream standby server is promoted to become the new primary, downstream
    servers will continue to stream from the new primary if
    <varname>recovery_target_timeline</varname> is set to <literal>'latest'</literal> (the default).
   </para>

   <para>
    To use cascading replication, set up the cascading standby so that it can
    accept replication connections (that is, set
    <xref linkend="guc-max-wal-senders"/> and <xref linkend="guc-hot-standby"/>,
    and configure
    <link linkend="auth-pg-hba-conf">host-based authentication</link>).
    You will also need to set <varname>primary_conninfo</varname> in the downstream
    standby to point to the cascading standby.
   </para>
  </sect2>

  <sect2 id="synchronous-replication">
   <title>Synchronous Replication</title>

   <indexterm zone="high-availability">
    <primary>Synchronous Replication</primary>
   </indexterm>

   <para>
    <productname>PostgreSQL</productname> streaming replication is asynchronous by
    default. If the primary server
    crashes then some transactions that were committed may not have been
    replicated to the standby server, causing data loss. The amount
    of data loss is proportional to the replication delay at the time of
    failover.
   </para>

   <para>
    Synchronous replication offers the ability to confirm that all changes
    made by a transaction have been transferred to one or more synchronous
    standby servers. This extends that standard level of durability
    offered by a transaction commit. This level of protection is referred
    to as 2-safe replication in computer science theory, and group-1-safe
    (group-safe and 1-safe) when <varname>synchronous_commit</varname> is set to
    <literal>remote_write</literal>.
   </para>

   <para>
    When requesting synchronous replication, each commit of a
    write transaction will wait until confirmation is
    received that the commit has been written to the write-ahead log on disk
    of both the primary and standby server. The only possibility that data
    can be lost is if both the primary and the standby suffer crashes at the
    same time. This can provide a much higher level of durability, though only
    if the sysadmin is cautious about the placement and management of the two
    servers.  Waiting for confirmation increases the user's confidence that the
    changes will not be lost in the event of server crashes but it also
    necessarily increases the response time for the requesting transaction.
    The minimum wait time is the round-trip time between primary and standby.
   </para>

   <para>
    Read-only transactions and transaction rollbacks need not wait for
    replies from standby servers. Subtransaction commits do not wait for
    responses from standby servers, only top-level commits. Long
    running actions such as data loading or index building do not wait
    until the very final commit message. All two-phase commit actions
    require commit waits, including both prepare and commit.
   </para>

   <para>
    A synchronous standby can be a physical replication standby or a logical
    replication subscriber.  It can also be any other physical or logical WAL
    replication stream consumer that knows how to send the appropriate
    feedback messages.  Besides the built-in physical and logical replication
    systems, this includes special programs such
    as <command>pg_receivewal</command> and <command>pg_recvlogical</command>
    as well as some third-party replication systems and custom programs.
    Check the respective documentation for details on synchronous replication
    support.
   </para>

   <sect3 id="synchronous-replication-config">
    <title>Basic Configuration</title>

   <para>
    Once streaming replication has been configured, configuring synchronous
    replication requires only one additional configuration step:
    <xref linkend="guc-synchronous-standby-names"/> must be set to
    a non-empty value.  <varname>synchronous_commit</varname> must also be set to
    <literal>on</literal>, but since this is the default value, typically no change is
    required.  (See <xref linkend="runtime-config-wal-settings"/> and
    <xref linkend="runtime-config-replication-primary"/>.)
    This configuration will cause each commit to wait for
    confirmation that the standby has written the commit record to durable
    storage.
    <varname>synchronous_commit</varname> can be set by individual
    users, so it can be configured in the configuration file, for particular
    users or databases, or dynamically by applications, in order to control
    the durability guarantee on a per-transaction basis.
   </para>

   <para>
    After a commit record has been written to disk on the primary, the
    WAL record is then sent to the standby. The standby sends reply
    messages each time a new batch of WAL data is written to disk, unless
    <varname>wal_receiver_status_interval</varname> is set to zero on the standby.
    In the case that <varname>synchronous_commit</varname> is set to
    <literal>remote_apply</literal>, the standby sends reply messages when the commit
    record is replayed, making the transaction visible.
    If the standby is chosen as a synchronous standby, according to the setting
    of <varname>synchronous_standby_names</varname> on the primary, the reply
    messages from that standby will be considered along with those from other
    synchronous standbys to decide when to release transactions waiting for
    confirmation that the commit record has been received. These parameters
    allow the administrator to specify which standby servers should be
    synchronous standbys. Note that the configuration of synchronous
    replication is mainly on the primary. Named standbys must be directly
    connected to the primary; the primary knows nothing about downstream
    standby servers using cascaded replication.
   </para>

   <para>
    Setting <varname>synchronous_commit</varname> to <literal>remote_write</literal> will
    cause each commit to wait for confirmation that the standby has received
    the commit record and written it out to its own operating system, but not
    for the data to be flushed to disk on the standby.  This
    setting provides a weaker guarantee of durability than <literal>on</literal>
    does: the standby could lose the data in the event of an operating system
    crash, though not a <productname>PostgreSQL</productname> crash.
    However, it's a useful setting in practice
    because it can decrease the response time for the transaction.
    Data loss could only occur if both the primary and the standby crash and
    the database of the primary gets corrupted at the same time.
   </para>

   <para>
    Setting <varname>synchronous_commit</varname> to <literal>remote_apply</literal> will
    cause each commit to wait until the current synchronous standbys report
    that they have replayed the transaction, making it visible to user
    queries.  In simple cases, this allows for load balancing with causal
    consistency.
   </para>

   <para>
    Users will stop waiting if a fast shutdown is requested.  However, as
    when using asynchronous replication, the server will not fully
    shutdown until all outstanding WAL records are transferred to the currently
    connected standby servers.
   </para>

   </sect3>

   <sect3 id="synchronous-replication-multiple-standbys">
    <title>Multiple Synchronous Standbys</title>

   <para>
    Synchronous replication supports one or more synchronous standby servers;
    transactions will wait until all the standby servers which are considered
    as synchronous confirm receipt of their data. The number of synchronous
    standbys that transactions must wait for replies from is specified in
    <varname>synchronous_standby_names</varname>. This parameter also specifies
    a list of standby names and the method (<literal>FIRST</literal> and
    <literal>ANY</literal>) to choose synchronous standbys from the listed ones.
   </para>
   <para>
    The method <literal>FIRST</literal> specifies a priority-based synchronous
    replication and makes transaction commits wait until their WAL records are
    replicated to the requested number of synchronous standbys chosen based on
    their priorities. The standbys whose names appear earlier in the list are
    given higher priority and will be considered as synchronous. Other standby
    servers appearing later in this list represent potential synchronous
    standbys. If any of the current synchronous standbys disconnects for
    whatever reason, it will be replaced immediately with the
    next-highest-priority standby.
   </para>
   <para>
    An example of <varname>synchronous_standby_names</varname> for
    a priority-based multiple synchronous standbys is:
<programlisting>
synchronous_standby_names = 'FIRST 2 (s1, s2, s3)'
</programlisting>
    In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>,
    <literal>s3</literal> and <literal>s4</literal> are running, the two standbys
    <literal>s1</literal> and <literal>s2</literal> will be chosen as synchronous standbys
    because their names appear early in the list of standby names.
    <literal>s3</literal> is a potential synchronous standby and will take over
    the role of synchronous standby when either of <literal>s1</literal> or
    <literal>s2</literal> fails. <literal>s4</literal> is an asynchronous standby since
    its name is not in the list.
   </para>
   <para>
    The method <literal>ANY</literal> specifies a quorum-based synchronous
    replication and makes transaction commits wait until their WAL records
    are replicated to <emphasis>at least</emphasis> the requested number of
    synchronous standbys in the list.
   </para>
   <para>
    An example of <varname>synchronous_standby_names</varname> for
    a quorum-based multiple synchronous standbys is:
<programlisting>
synchronous_standby_names = 'ANY 2 (s1, s2, s3)'
</programlisting>
    In this example, if four standby servers <literal>s1</literal>, <literal>s2</literal>,
    <literal>s3</literal> and <literal>s4</literal> are running, transaction commits will
    wait for replies from at least any two standbys of <literal>s1</literal>,
    <literal>s2</literal> and <literal>s3</literal>. <literal>s4</literal> is an asynchronous
    standby since its name is not in the list.
   </para>
   <para>
    The synchronous states of standby servers can be viewed using
    the <structname>pg_stat_replication</structname> view.
   </para>
   </sect3>

   <sect3 id="synchronous-replication-performance">
    <title>Planning for Performance</title>

   <para>
    Synchronous replication usually requires carefully planned and placed
    standby servers to ensure applications perform acceptably. Waiting
    doesn't utilize system resources, but transaction locks continue to be
    held until the transfer is confirmed. As a result, incautious use of
    synchronous replication will reduce performance for database
    applications because of increased response times and higher contention.
   </para>

   <para>
    <productname>PostgreSQL</productname> allows the application developer
    to specify the durability level required via replication. This can be
    specified for the system overall, though it can also be specified for
    specific users or connections, or even individual transactions.
   </para>

   <para>
    For example, an application workload might consist of:
    10% of changes are important customer details, while
    90% of changes are less important data that the business can more
    easily survive if it is lost, such as chat messages between users.
   </para>

   <para>
    With synchronous replication options specified at the application level
    (on the primary) we can offer synchronous replication for the most
    important changes, without slowing down the bulk of the total workload.
    Application level options are an important and practical tool for allowing
    the benefits of synchronous replication for high performance applications.
   </para>

   <para>
    You should consider that the network bandwidth must be higher than
    the rate of generation of WAL data.
   </para>

   </sect3>

   <sect3 id="synchronous-replication-ha">
    <title>Planning for High Availability</title>

   <para>
    <varname>synchronous_standby_names</varname> specifies the number and
    names of synchronous standbys that transaction commits made when
    <varname>synchronous_commit</varname> is set to <literal>on</literal>,
    <literal>remote_apply</literal> or <literal>remote_write</literal> will wait for
    responses from. Such transaction commits may never be completed
    if any one of synchronous standbys should crash.
   </para>

   <para>
    The best solution for high availability is to ensure you keep as many
    synchronous standbys as requested. This can be achieved by naming multiple
    potential synchronous standbys using <varname>synchronous_standby_names</varname>.
   </para>

   <para>
    In a priority-based synchronous replication, the standbys whose names
    appear earlier in the list will be used as synchronous standbys.
    Standbys listed after these will take over the role of synchronous standby
    if one of current ones should fail.
   </para>

   <para>
    In a quorum-based synchronous replication, all the standbys appearing
    in the list will be used as candidates for synchronous standbys.
    Even if one of them should fail, the other standbys will keep performing
    the role of candidates of synchronous standby.
   </para>

   <para>
    When a standby first attaches to the primary, it will not yet be properly
    synchronized. This is described as <literal>catchup</literal> mode. Once
    the lag between standby and primary reaches zero for the first time
    we move to real-time <literal>streaming</literal> state.
    The catch-up duration may be long immediately after the standby has
    been created. If the standby is shut down, then the catch-up period
    will increase according to the length of time the standby has been down.
    The standby is only able to become a synchronous standby
    once it has reached <literal>streaming</literal> state.
    This state can be viewed using
    the <structname>pg_stat_replication</structname> view.
   </para>

   <para>
    If primary restarts while commits are waiting for acknowledgment, those
    waiting transactions will be marked fully committed once the primary
    database recovers.
    There is no way to be certain that all standbys have received all
    outstanding WAL data at time of the crash of the primary. Some
    transactions may not show as committed on the standby, even though
    they show as committed on the primary. The guarantee we offer is that
    the application will not receive explicit acknowledgment of the
    successful commit of a transaction until the WAL data is known to be
    safely received by all the synchronous standbys.
   </para>

   <para>
    If you really cannot keep as many synchronous standbys as requested
    then you should decrease the number of synchronous standbys that
    transaction commits must wait for responses from
    in <varname>synchronous_standby_names</varname> (or disable it) and
    reload the configuration file on the primary server.
   </para>

   <para>
    If the primary is isolated from remaining standby servers you should
    fail over to the best candidate of those other remaining standby servers.
   </para>

   <para>
    If you need to re-create a standby server while transactions are
    waiting, make sure that the commands pg_backup_start() and
    pg_backup_stop() are run in a session with
    <varname>synchronous_commit</varname> = <literal>off</literal>, otherwise those
    requests will wait forever for the standby to appear.
   </para>

   </sect3>
  </sect2>

  <sect2 id="continuous-archiving-in-standby">
   <title>Continuous Archiving in Standby</title>

   <indexterm>
     <primary>continuous archiving</primary>
     <secondary>in standby</secondary>
   </indexterm>

   <para>
     When continuous WAL archiving is used in a standby, there are two
     different scenarios: the WAL archive can be shared between the primary
     and the standby, or the standby can have its own WAL archive. When
     the standby has its own WAL archive, set <varname>archive_mode</varname>
     to <literal>always</literal>, and the standby will call the archive
     command for every WAL segment it receives, whether it's by restoring
     from the archive or by streaming replication. The shared archive can
     be handled similarly, but the <varname>archive_command</varname> or <varname>archive_library</varname> must
     test if the file being archived exists already, and if the existing file
     has identical contents. This requires more care in the
     <varname>archive_command</varname> or <varname>archive_library</varname>, as it must
     be careful to not overwrite an existing file with different contents,
     but return success if the exactly same file is archived twice. And
     all that must be done free of race conditions, if two servers attempt
     to archive the same file at the same time.
   </para>

   <para>
     If <varname>archive_mode</varname> is set to <literal>on</literal>, the
     archiver is not enabled during recovery or standby mode. If the standby
     server is promoted, it will start archiving after the promotion, but
     will not archive any WAL or timeline history files that
     it did not generate itself. To get a complete
     series of WAL files in the archive, you must ensure that all WAL is
     archived, before it reaches the standby. This is inherently true with
     file-based log shipping, as the standby can only restore files that
     are found in the archive, but not if streaming replication is enabled.
     When a server is not in recovery mode, there is no difference between
     <literal>on</literal> and <literal>always</literal> modes.
   </para>
  </sect2>
  </sect1>

  <sect1 id="warm-standby-failover">
   <title>Failover</title>

   <para>
    If the primary server fails then the standby server should begin
    failover procedures.
   </para>

   <para>
    If the standby server fails then no failover need take place. If the
    standby server can be restarted, even some time later, then the recovery
    process can also be restarted immediately, taking advantage of
    restartable recovery. If the standby server cannot be restarted, then a
    full new standby server instance should be created.
   </para>

   <para>
    If the primary server fails and the standby server becomes the
    new primary, and then the old primary restarts, you must have
    a mechanism for informing the old primary that it is no longer the primary. This is
    sometimes known as <acronym>STONITH</acronym> (Shoot The Other Node In The Head), which is
    necessary to avoid situations where both systems think they are the
    primary, which will lead to confusion and ultimately data loss.
   </para>

   <para>
    Many failover systems use just two systems, the primary and the standby,
    connected by some kind of heartbeat mechanism to continually verify the
    connectivity between the two and the viability of the primary. It is
    also possible to use a third system (called a witness server) to prevent
    some cases of inappropriate failover, but the additional complexity
    might not be worthwhile unless it is set up with sufficient care and
    rigorous testing.
   </para>

   <para>
    <productname>PostgreSQL</productname> does not provide the system
    software required to identify a failure on the primary and notify
    the standby database server.  Many such tools exist and are well
    integrated with the operating system facilities required for
    successful failover, such as IP address migration.
   </para>

   <para>
    Once failover to the standby occurs, there is only a
    single server in operation. This is known as a degenerate state.
    The former standby is now the primary, but the former primary is down
    and might stay down.  To return to normal operation, a standby server
    must be recreated,
    either on the former primary system when it comes up, or on a third,
    possibly new, system. The <xref linkend="app-pgrewind"/> utility can be
    used to speed up this process on large clusters.
    Once complete, the primary and standby can be
    considered to have switched roles. Some people choose to use a third
    server to provide backup for the new primary until the new standby
    server is recreated,
    though clearly this complicates the system configuration and
    operational processes.
   </para>

   <para>
    So, switching from primary to standby server can be fast but requires
    some time to re-prepare the failover cluster. Regular switching from
    primary to standby is useful, since it allows regular downtime on
    each system for maintenance. This also serves as a test of the
    failover mechanism to ensure that it will really work when you need it.
    Written administration procedures are advised.
   </para>

   <para>
    To trigger failover of a log-shipping standby server, run
    <command>pg_ctl promote</command> or call <function>pg_promote()</function>.
    If you're setting up reporting servers that are only used to offload
    read-only queries from the primary, not for high availability purposes,
    you don't need to promote.
   </para>
  </sect1>

 <sect1 id="hot-standby">
  <title>Hot Standby</title>

  <indexterm zone="high-availability">
   <primary>hot standby</primary>
  </indexterm>

   <para>
    Hot standby is the term used to describe the ability to connect to
    the server and run read-only queries while the server is in archive
    recovery or standby mode. This
    is useful both for replication purposes and for restoring a backup
    to a desired state with great precision.
    The term hot standby also refers to the ability of the server to move
    from recovery through to normal operation while users continue running
    queries and/or keep their connections open.
   </para>

   <para>
    Running queries in hot standby mode is similar to normal query operation,
    though there are several usage and administrative differences
    explained below.
   </para>

  <sect2 id="hot-standby-users">
   <title>User's Overview</title>

   <para>
    When the <xref linkend="guc-hot-standby"/> parameter is set to true on a
    standby server, it will begin accepting connections once the recovery has
    brought the system to a consistent state.  All such connections are
    strictly read-only; not even temporary tables may be written.
   </para>

   <para>
    The data on the standby takes some time to arrive from the primary server
    so there will be a measurable delay between primary and standby. Running the
    same query nearly simultaneously on both primary and standby might therefore
    return differing results. We say that data on the standby is
    <firstterm>eventually consistent</firstterm> with the primary.  Once the
    commit record for a transaction is replayed on the standby, the changes
    made by that transaction will be visible to any new snapshots taken on
    the standby.  Snapshots may be taken at the start of each query or at the
    start of each transaction, depending on the current transaction isolation
    level.  For more details, see <xref linkend="transaction-iso"/>.
   </para>

   <para>
    Transactions started during hot standby may issue the following commands:

    <itemizedlist>
     <listitem>
      <para>
       Query access: <command>SELECT</command>, <command>COPY TO</command>
      </para>
     </listitem>
     <listitem>
      <para>
       Cursor commands: <command>DECLARE</command>, <command>FETCH</command>, <command>CLOSE</command>
      </para>
     </listitem>
     <listitem>
      <para>
       Settings: <command>SHOW</command>, <command>SET</command>, <command>RESET</command>
      </para>
     </listitem>
     <listitem>
      <para>
       Transaction management commands:
        <itemizedlist>
         <listitem>
          <para>
           <command>BEGIN</command>, <command>END</command>, <command>ABORT</command>, <command>START TRANSACTION</command>
          </para>
         </listitem>
         <listitem>
          <para>
           <command>SAVEPOINT</command>, <command>RELEASE</command>, <command>ROLLBACK TO SAVEPOINT</command>
          </para>
         </listitem>
         <listitem>
          <para>
           <command>EXCEPTION</command> blocks and other internal subtransactions
          </para>
         </listitem>
        </itemizedlist>
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LOCK TABLE</command>, though only when explicitly in one of these modes:
       <literal>ACCESS SHARE</literal>, <literal>ROW SHARE</literal> or <literal>ROW EXCLUSIVE</literal>.
      </para>
     </listitem>
     <listitem>
      <para>
       Plans and resources: <command>PREPARE</command>, <command>EXECUTE</command>,
       <command>DEALLOCATE</command>, <command>DISCARD</command>
      </para>
     </listitem>
     <listitem>
      <para>
       Plugins and extensions: <command>LOAD</command>
      </para>
     </listitem>
     <listitem>
      <para>
       <command>UNLISTEN</command>
      </para>
     </listitem>
    </itemizedlist>
   </para>

   <para>
    Transactions started during hot standby will never be assigned a
    transaction ID and cannot write to the system write-ahead log.
    Therefore, the following actions will produce error messages:

    <itemizedlist>
     <listitem>
      <para>
       Data Manipulation Language (DML): <command>INSERT</command>,
       <command>UPDATE</command>, <command>DELETE</command>,
       <command>MERGE</command>, <command>COPY FROM</command>,
       <command>TRUNCATE</command>.
       Note that there are no allowed actions that result in a trigger
       being executed during recovery.  This restriction applies even to
       temporary tables, because table rows cannot be read or written without
       assigning a transaction ID, which is currently not possible in a
       hot standby environment.
      </para>
     </listitem>
     <listitem>
      <para>
       Data Definition Language (DDL): <command>CREATE</command>,
       <command>DROP</command>, <command>ALTER</command>, <command>COMMENT</command>.
       This restriction applies even to temporary tables, because carrying
       out these operations would require updating the system catalog tables.
      </para>
     </listitem>
     <listitem>
      <para>
       <command>SELECT ... FOR SHARE | UPDATE</command>, because row locks cannot be
       taken without updating the underlying data files.
      </para>
     </listitem>
     <listitem>
      <para>
       Rules on <command>SELECT</command> statements that generate DML commands.
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LOCK</command> that explicitly requests a mode higher than <literal>ROW EXCLUSIVE MODE</literal>.
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LOCK</command> in short default form, since it requests <literal>ACCESS EXCLUSIVE MODE</literal>.
      </para>
     </listitem>
     <listitem>
      <para>
       Transaction management commands that explicitly set non-read-only state:
        <itemizedlist>
         <listitem>
          <para>
            <command>BEGIN READ WRITE</command>,
            <command>START TRANSACTION READ WRITE</command>
          </para>
         </listitem>
         <listitem>
          <para>
            <command>SET TRANSACTION READ WRITE</command>,
            <command>SET SESSION CHARACTERISTICS AS TRANSACTION READ WRITE</command>
          </para>
         </listitem>
         <listitem>
          <para>
           <command>SET transaction_read_only = off</command>
          </para>
         </listitem>
        </itemizedlist>
      </para>
     </listitem>
     <listitem>
      <para>
       Two-phase commit commands: <command>PREPARE TRANSACTION</command>,
       <command>COMMIT PREPARED</command>, <command>ROLLBACK PREPARED</command>
       because even read-only transactions need to write WAL in the
       prepare phase (the first phase of two phase commit).
      </para>
     </listitem>
     <listitem>
      <para>
       Sequence updates: <function>nextval()</function>, <function>setval()</function>
      </para>
     </listitem>
     <listitem>
      <para>
       <command>LISTEN</command>, <command>NOTIFY</command>
      </para>
     </listitem>
    </itemizedlist>
   </para>

   <para>
    In normal operation, <quote>read-only</quote> transactions are allowed to
    use <command>LISTEN</command> and <command>NOTIFY</command>,
    so hot standby sessions operate under slightly tighter
    restrictions than ordinary read-only sessions.  It is possible that some
    of these restrictions might be loosened in a future release.
   </para>

   <para>
    During hot standby, the parameter <varname>transaction_read_only</varname> is always
    true and may not be changed.  But as long as no attempt is made to modify
    the database, connections during hot standby will act much like any other
    database connection.  If failover or switchover occurs, the database will
    switch to normal processing mode.  Sessions will remain connected while the
    server changes mode.  Once hot standby finishes, it will be possible to
    initiate read-write transactions (even from a session begun during
    hot standby).
   </para>

   <para>
    Users can determine whether hot standby is currently active for their
    session by issuing <command>SHOW in_hot_standby</command>.
    (In server versions before 14, the <varname>in_hot_standby</varname>
    parameter did not exist; a workable substitute method for older servers
    is <command>SHOW transaction_read_only</command>.)  In addition, a set of
    functions (<xref linkend="functions-recovery-info-table"/>) allow users to
    access information about the standby server. These allow you to write
    programs that are aware of the current state of the database. These
    can be used to monitor the progress of recovery, or to allow you to
    write complex programs that restore the database to particular states.
   </para>
  </sect2>

  <sect2 id="hot-standby-conflict">
   <title>Handling Query Conflicts</title>

   <para>
    The primary and standby servers are in many ways loosely connected. Actions
    on the primary will have an effect on the standby. As a result, there is
    potential for negative interactions or conflicts between them. The easiest
    conflict to understand is performance: if a huge data load is taking place
    on the primary then this will generate a similar stream of WAL records on the
    standby, so standby queries may contend for system resources, such as I/O.
   </para>

   <para>
    There are also additional types of conflict that can occur with hot standby.
    These conflicts are <emphasis>hard conflicts</emphasis> in the sense that queries
    might need to be canceled and, in some cases, sessions disconnected to resolve them.
    The user is provided with several ways to handle these
    conflicts. Conflict cases include:

      <itemizedlist>
       <listitem>
        <para>
         Access Exclusive locks taken on the primary server, including both
         explicit <command>LOCK</command> commands and various <acronym>DDL</acronym>
         actions, conflict with table accesses in standby queries.
        </para>
       </listitem>
       <listitem>
        <para>
         Dropping a tablespace on the primary conflicts with standby queries
         using that tablespace for temporary work files.
        </para>
       </listitem>
       <listitem>
        <para>
         Dropping a database on the primary conflicts with sessions connected
         to that database on the standby.
        </para>
       </listitem>
       <listitem>
        <para>
         Application of a vacuum cleanup record from WAL conflicts with
         standby transactions whose snapshots can still <quote>see</quote> any of
         the rows to be removed.
        </para>
       </listitem>
       <listitem>
        <para>
         Application of a vacuum cleanup record from WAL conflicts with
         queries accessing the target page on the standby, whether or not
         the data to be removed is visible.
        </para>
       </listitem>
      </itemizedlist>
   </para>

   <para>
    On the primary server, these cases simply result in waiting; and the
    user might choose to cancel either of the conflicting actions.  However,
    on the standby there is no choice: the WAL-logged action already occurred
    on the primary so the standby must not fail to apply it.  Furthermore,
    allowing WAL application to wait indefinitely may be very undesirable,
    because the standby's state will become increasingly far behind the
    primary's.  Therefore, a mechanism is provided to forcibly cancel standby
    queries that conflict with to-be-applied WAL records.
   </para>

   <para>
    An example of the problem situation is an administrator on the primary
    server running <command>DROP TABLE</command> on a table that is currently being
    queried on the standby server.  Clearly the standby query cannot continue
    if the <command>DROP TABLE</command> is applied on the standby. If this situation
    occurred on the primary, the <command>DROP TABLE</command> would wait until the
    other query had finished. But when <command>DROP TABLE</command> is run on the
    primary, the primary doesn't have information about what queries are
    running on the standby, so it will not wait for any such standby
    queries. The WAL change records come through to the standby while the
    standby query is still running, causing a conflict.  The standby server
    must either delay application of the WAL records (and everything after
    them, too) or else cancel the conflicting query so that the <command>DROP
    TABLE</command> can be applied.
   </para>

   <para>
    When a conflicting query is short, it's typically desirable to allow it to
    complete by delaying WAL application for a little bit; but a long delay in
    WAL application is usually not desirable.  So the cancel mechanism has
    parameters, <xref linkend="guc-max-standby-archive-delay"/> and <xref
    linkend="guc-max-standby-streaming-delay"/>, that define the maximum
    allowed delay in WAL application.  Conflicting queries will be canceled
    once it has taken longer than the relevant delay setting to apply any
    newly-received WAL data.  There are two parameters so that different delay
    values can be specified for the case of reading WAL data from an archive
    (i.e., initial recovery from a base backup or <quote>catching up</quote> a
    standby server that has fallen far behind) versus reading WAL data via
    streaming replication.
   </para>

   <para>
    In a standby server that exists primarily for high availability, it's
    best to set the delay parameters relatively short, so that the server
    cannot fall far behind the primary due to delays caused by standby
    queries.  However, if the standby server is meant for executing
    long-running queries, then a high or even infinite delay value may be
    preferable.  Keep in mind however that a long-running query could
    cause other sessions on the standby server to not see recent changes
    on the primary, if it delays application of WAL records.
   </para>

   <para>
    Once the delay specified by <varname>max_standby_archive_delay</varname> or
    <varname>max_standby_streaming_delay</varname> has been exceeded, conflicting
    queries will be canceled.  This usually results just in a cancellation
    error, although in the case of replaying a <command>DROP DATABASE</command>
    the entire conflicting session will be terminated.  Also, if the conflict
    is over a lock held by an idle transaction, the conflicting session is
    terminated (this behavior might change in the future).
   </para>

   <para>
    Canceled queries may be retried immediately (after beginning a new
    transaction, of course).  Since query cancellation depends on
    the nature of the WAL records being replayed, a query that was
    canceled may well succeed if it is executed again.
   </para>

   <para>
    Keep in mind that the delay parameters are compared to the elapsed time
    since the WAL data was received by the standby server.  Thus, the grace
    period allowed to any one query on the standby is never more than the
    delay parameter, and could be considerably less if the standby has already
    fallen behind as a result of waiting for previous queries to complete, or
    as a result of being unable to keep up with a heavy update load.
   </para>

   <para>
    The most common reason for conflict between standby queries and WAL replay
    is <quote>early cleanup</quote>.  Normally, <productname>PostgreSQL</productname> allows
    cleanup of old row versions when there are no transactions that need to
    see them to ensure correct visibility of data according to MVCC rules.
    However, this rule can only be applied for transactions executing on the
    primary.  So it is possible that cleanup on the primary will remove row
    versions that are still visible to a transaction on the standby.
   </para>

   <para>
    Row version cleanup isn't the only potential cause of conflicts with
    standby queries.  All index-only scans (including those that run on
    standbys) must use an <acronym>MVCC</acronym> snapshot that
    <quote>agrees</quote> with the visibility map.  Conflicts are therefore
    required whenever <command>VACUUM</command> <link
     linkend="vacuum-for-visibility-map">sets a page as all-visible in the
     visibility map</link> containing one or more rows
    <emphasis>not</emphasis> visible to all standby queries.  So even running
    <command>VACUUM</command> against a table with no updated or deleted rows
    requiring cleanup might lead to conflicts.
   </para>

   <para>
    Users should be clear that tables that are regularly and heavily updated
    on the primary server will quickly cause cancellation of longer running
    queries on the standby. In such cases the setting of a finite value for
    <varname>max_standby_archive_delay</varname> or
    <varname>max_standby_streaming_delay</varname> can be considered similar to
    setting <varname>statement_timeout</varname>.
   </para>

   <para>
    Remedial possibilities exist if the number of standby-query cancellations
    is found to be unacceptable.  The first option is to set the parameter
    <varname>hot_standby_feedback</varname>, which prevents <command>VACUUM</command> from
    removing recently-dead rows and so cleanup conflicts do not occur.
    If you do this, you
    should note that this will delay cleanup of dead rows on the primary,
    which may result in undesirable table bloat. However, the cleanup
    situation will be no worse than if the standby queries were running
    directly on the primary server, and you are still getting the benefit of
    off-loading execution onto the standby.
    If standby servers connect and disconnect frequently, you
    might want to make adjustments to handle the period when
    <varname>hot_standby_feedback</varname> feedback is not being provided.
    For example, consider increasing <varname>max_standby_archive_delay</varname>
    so that queries are not rapidly canceled by conflicts in WAL archive
    files during disconnected periods.  You should also consider increasing
    <varname>max_standby_streaming_delay</varname> to avoid rapid cancellations
    by newly-arrived streaming WAL entries after reconnection.
   </para>

   <para>
    The number of query cancels and the reason for them can be viewed using
    the <structname>pg_stat_database_conflicts</structname> system view on the standby
    server. The <structname>pg_stat_database</structname> system view also contains
    summary information.
   </para>

   <para>
    Users can control whether a log message is produced when WAL replay is waiting
    longer than <varname>deadlock_timeout</varname> for conflicts. This
    is controlled by the <xref linkend="guc-log-recovery-conflict-waits"/> parameter.
   </para>
  </sect2>

  <sect2 id="hot-standby-admin">
   <title>Administrator's Overview</title>

   <para>
    If <varname>hot_standby</varname> is <literal>on</literal> in <filename>postgresql.conf</filename>
    (the default value) and there is a
    <link linkend="file-standby-signal"><filename>standby.signal</filename></link><indexterm><primary>standby.signal</primary><secondary>for hot standby</secondary></indexterm>
    file present, the server will run in hot standby mode.
    However, it may take some time for hot standby connections to be allowed,
    because the server will not accept connections until it has completed
    sufficient recovery to provide a consistent state against which queries
    can run.  During this period,
    clients that attempt to connect will be refused with an error message.
    To confirm the server has come up, either loop trying to connect from
    the application, or look for these messages in the server logs:

<programlisting>
LOG:  entering standby mode

... then some time later ...

LOG:  consistent recovery state reached
LOG:  database system is ready to accept read-only connections
</programlisting>

    Consistency information is recorded once per checkpoint on the primary.
    It is not possible to enable hot standby when reading WAL
    written during a period when <varname>wal_level</varname> was not set to
    <literal>replica</literal> or <literal>logical</literal> on the primary.  Reaching
    a consistent state can also be delayed in the presence of both of these
    conditions:

      <itemizedlist>
       <listitem>
        <para>
         A write transaction has more than 64 subtransactions
        </para>
       </listitem>
       <listitem>
        <para>
         Very long-lived write transactions
        </para>
       </listitem>
      </itemizedlist>

    If you are running file-based log shipping ("warm standby"), you might need
    to wait until the next WAL file arrives, which could be as long as the
    <varname>archive_timeout</varname> setting on the primary.
   </para>

   <para>
    The settings of some parameters determine the size of shared memory for
    tracking transaction IDs, locks, and prepared transactions.  These shared
    memory structures must be no smaller on a standby than on the primary in
    order to ensure that the standby does not run out of shared memory during
    recovery.  For example, if the primary had used a prepared transaction but
    the standby had not allocated any shared memory for tracking prepared
    transactions, then recovery could not continue until the standby's
    configuration is changed.  The parameters affected are:

      <itemizedlist>
       <listitem>
        <para>
         <varname>max_connections</varname>
        </para>
       </listitem>
       <listitem>
        <para>
         <varname>max_prepared_transactions</varname>
        </para>
       </listitem>
       <listitem>
        <para>
         <varname>max_locks_per_transaction</varname>
        </para>
       </listitem>
       <listitem>
        <para>
         <varname>max_wal_senders</varname>
        </para>
       </listitem>
       <listitem>
        <para>
         <varname>max_worker_processes</varname>
        </para>
       </listitem>
      </itemizedlist>

    The easiest way to ensure this does not become a problem is to have these
    parameters set on the standbys to values equal to or greater than on the
    primary.  Therefore, if you want to increase these values, you should do
    so on all standby servers first, before applying the changes to the
    primary server.  Conversely, if you want to decrease these values, you
    should do so on the primary server first, before applying the changes to
    all standby servers.  Keep in mind that when a standby is promoted, it
    becomes the new reference for the required parameter settings for the
    standbys that follow it.  Therefore, to avoid this becoming a problem
    during a switchover or failover, it is recommended to keep these settings
    the same on all standby servers.
   </para>

   <para>
    The WAL tracks changes to these parameters on the
    primary.  If a hot standby processes WAL that indicates that the current
    value on the primary is higher than its own value, it will log a warning
    and pause recovery, for example:
<screen>
WARNING:  hot standby is not possible because of insufficient parameter settings
DETAIL:  max_connections = 80 is a lower setting than on the primary server, where its value was 100.
LOG:  recovery has paused
DETAIL:  If recovery is unpaused, the server will shut down.
HINT:  You can then restart the server after making the necessary configuration changes.
</screen>
    At that point, the settings on the standby need to be updated and the
    instance restarted before recovery can continue.  If the standby is not a
    hot standby, then when it encounters the incompatible parameter change, it
    will shut down immediately without pausing, since there is then no value
    in keeping it up.
   </para>

   <para>
    It is important that the administrator select appropriate settings for
    <xref linkend="guc-max-standby-archive-delay"/> and <xref
    linkend="guc-max-standby-streaming-delay"/>.  The best choices vary
    depending on business priorities.  For example if the server is primarily
    tasked as a High Availability server, then you will want low delay
    settings, perhaps even zero, though that is a very aggressive setting. If
    the standby server is tasked as an additional server for decision support
    queries then it might be acceptable to set the maximum delay values to
    many hours, or even -1 which means wait forever for queries to complete.
   </para>

   <para>
    Transaction status "hint bits" written on the primary are not WAL-logged,
    so data on the standby will likely re-write the hints again on the standby.
    Thus, the standby server will still perform disk writes even though
    all users are read-only; no changes occur to the data values
    themselves.  Users will still write large sort temporary files and
    re-generate relcache info files, so no part of the database
    is truly read-only during hot standby mode.
    Note also that writes to remote databases using
    <application>dblink</application> module, and other operations outside the
    database using PL functions will still be possible, even though the
    transaction is read-only locally.
   </para>

   <para>
    The following types of administration commands are not accepted
    during recovery mode:

      <itemizedlist>
       <listitem>
        <para>
         Data Definition Language (DDL): e.g., <command>CREATE INDEX</command>
        </para>
       </listitem>
       <listitem>
        <para>
         Privilege and Ownership: <command>GRANT</command>, <command>REVOKE</command>,
         <command>REASSIGN</command>
        </para>
       </listitem>
       <listitem>
        <para>
         Maintenance commands: <command>ANALYZE</command>, <command>VACUUM</command>,
         <command>CLUSTER</command>, <command>REINDEX</command>
        </para>
       </listitem>
      </itemizedlist>
   </para>

   <para>
    Again, note that some of these commands are actually allowed during
    "read only" mode transactions on the primary.
   </para>

   <para>
    As a result, you cannot create additional indexes that exist solely
    on the standby, nor statistics that exist solely on the standby.
    If these administration commands are needed, they should be executed
    on the primary, and eventually those changes will propagate to the
    standby.
   </para>

   <para>
    <function>pg_cancel_backend()</function>
    and <function>pg_terminate_backend()</function> will work on user backends,
    but not the startup process, which performs
    recovery. <structname>pg_stat_activity</structname> does not show
    recovering transactions as active. As a result,
    <structname>pg_prepared_xacts</structname> is always empty during
    recovery. If you wish to resolve in-doubt prepared transactions, view
    <literal>pg_prepared_xacts</literal> on the primary and issue commands to
    resolve transactions there or resolve them after the end of recovery.
   </para>

   <para>
    <structname>pg_locks</structname> will show locks held by backends,
    as normal. <structname>pg_locks</structname> also shows
    a virtual transaction managed by the startup process that owns all
    <literal>AccessExclusiveLocks</literal> held by transactions being replayed by recovery.
    Note that the startup process does not acquire locks to
    make database changes, and thus locks other than <literal>AccessExclusiveLocks</literal>
    do not show in <structname>pg_locks</structname> for the Startup
    process; they are just presumed to exist.
   </para>

   <para>
    The <productname>Nagios</productname> plugin <productname>check_pgsql</productname> will
    work, because the simple information it checks for exists.
    The <productname>check_postgres</productname> monitoring script will also work,
    though some reported values could give different or confusing results.
    For example, last vacuum time will not be maintained, since no
    vacuum occurs on the standby.  Vacuums running on the primary
    do still send their changes to the standby.
   </para>

   <para>
    WAL file control commands will not work during recovery,
    e.g., <function>pg_backup_start</function>, <function>pg_switch_wal</function> etc.
   </para>

   <para>
    Dynamically loadable modules work, including <structname>pg_stat_statements</structname>.
   </para>

   <para>
    Advisory locks work normally in recovery, including deadlock detection.
    Note that advisory locks are never WAL logged, so it is impossible for
    an advisory lock on either the primary or the standby to conflict with WAL
    replay. Nor is it possible to acquire an advisory lock on the primary
    and have it initiate a similar advisory lock on the standby. Advisory
    locks relate only to the server on which they are acquired.
   </para>

   <para>
    Trigger-based replication systems such as <productname>Slony</productname>,
    <productname>Londiste</productname> and <productname>Bucardo</productname> won't run on the
    standby at all, though they will run happily on the primary server as
    long as the changes are not sent to standby servers to be applied.
    WAL replay is not trigger-based so you cannot relay from the
    standby to any system that requires additional database writes or
    relies on the use of triggers.
   </para>

   <para>
    New OIDs cannot be assigned, though some <acronym>UUID</acronym> generators may still
    work as long as they do not rely on writing new status to the database.
   </para>

   <para>
    Currently, temporary table creation is not allowed during read-only
    transactions, so in some cases existing scripts will not run correctly.
    This restriction might be relaxed in a later release. This is
    both an SQL standard compliance issue and a technical issue.
   </para>

   <para>
    <command>DROP TABLESPACE</command> can only succeed if the tablespace is empty.
    Some standby users may be actively using the tablespace via their
    <varname>temp_tablespaces</varname> parameter. If there are temporary files in the
    tablespace, all active queries are canceled to ensure that temporary
    files are removed, so the tablespace can be removed and WAL replay
    can continue.
   </para>

   <para>
    Running <command>DROP DATABASE</command> or <command>ALTER DATABASE ... SET
    TABLESPACE</command> on the primary
    will generate a WAL entry that will cause all users connected to that
    database on the standby to be forcibly disconnected. This action occurs
    immediately, whatever the setting of
    <varname>max_standby_streaming_delay</varname>. Note that
    <command>ALTER DATABASE ... RENAME</command> does not disconnect users, which
    in most cases will go unnoticed, though might in some cases cause a
    program confusion if it depends in some way upon database name.
   </para>

   <para>
    In normal (non-recovery) mode, if you issue <command>DROP USER</command> or <command>DROP ROLE</command>
    for a role with login capability while that user is still connected then
    nothing happens to the connected user &mdash; they remain connected. The user cannot
    reconnect however. This behavior applies in recovery also, so a
    <command>DROP USER</command> on the primary does not disconnect that user on the standby.
   </para>

   <para>
    The cumulative statistics system is active during recovery. All scans,
    reads, blocks, index usage, etc., will be recorded normally on the
    standby. However, WAL replay will not increment relation and database
    specific counters. I.e. replay will not increment pg_stat_all_tables
    columns (like n_tup_ins), nor will reads or writes performed by the
    startup process be tracked in the pg_statio views, nor will associated
    pg_stat_database columns be incremented.
   </para>

   <para>
    Autovacuum is not active during recovery.  It will start normally at the
    end of recovery.
   </para>

   <para>
    The checkpointer process and the background writer process are active during
    recovery. The checkpointer process will perform restartpoints (similar to
    checkpoints on the primary) and the background writer process will perform
    normal block cleaning activities. This can include updates of the hint bit
    information stored on the standby server.
    The <command>CHECKPOINT</command> command is accepted during recovery,
    though it performs a restartpoint rather than a new checkpoint.
   </para>
  </sect2>

  <sect2 id="hot-standby-parameters">
   <title>Hot Standby Parameter Reference</title>

   <para>
    Various parameters have been mentioned above in
    <xref linkend="hot-standby-conflict"/> and
    <xref linkend="hot-standby-admin"/>.
   </para>

   <para>
    On the primary, the <xref linkend="guc-wal-level"/> parameter can be used.
    <xref linkend="guc-max-standby-archive-delay"/> and
    <xref linkend="guc-max-standby-streaming-delay"/> have no effect if set on
    the primary.
   </para>

   <para>
    On the standby, parameters <xref linkend="guc-hot-standby"/>,
    <xref linkend="guc-max-standby-archive-delay"/> and
    <xref linkend="guc-max-standby-streaming-delay"/> can be used.
   </para>
  </sect2>

  <sect2 id="hot-standby-caveats">
   <title>Caveats</title>

   <para>
    There are several limitations of hot standby.
    These can and probably will be fixed in future releases:

  <itemizedlist>
   <listitem>
    <para>
     Full knowledge of running transactions is required before snapshots
     can be taken. Transactions that use large numbers of subtransactions
     (currently greater than 64) will delay the start of read-only
     connections until the completion of the longest running write transaction.
     If this situation occurs, explanatory messages will be sent to the server log.
    </para>
   </listitem>
   <listitem>
    <para>
     Valid starting points for standby queries are generated at each
     checkpoint on the primary. If the standby is shut down while the primary
     is in a shutdown state, it might not be possible to re-enter hot standby
     until the primary is started up, so that it generates further starting
     points in the WAL logs.  This situation isn't a problem in the most
     common situations where it might happen. Generally, if the primary is
     shut down and not available anymore, that's likely due to a serious
     failure that requires the standby being converted to operate as
     the new primary anyway.  And in situations where the primary is
     being intentionally taken down, coordinating to make sure the standby
     becomes the new primary smoothly is also standard procedure.
    </para>
   </listitem>
   <listitem>
    <para>
     At the end of recovery, <literal>AccessExclusiveLocks</literal> held by prepared transactions
     will require twice the normal number of lock table entries. If you plan
     on running either a large number of concurrent prepared transactions
     that normally take <literal>AccessExclusiveLocks</literal>, or you plan on having one
     large transaction that takes many <literal>AccessExclusiveLocks</literal>, you are
     advised to select a larger value of <varname>max_locks_per_transaction</varname>,
     perhaps as much as twice the value of the parameter on
     the primary server. You need not consider this at all if
     your setting of <varname>max_prepared_transactions</varname> is 0.
    </para>
   </listitem>
   <listitem>
    <para>
     The Serializable transaction isolation level is not yet available in hot
     standby.  (See <xref linkend="xact-serializable"/> and
     <xref linkend="serializable-consistency"/> for details.)
     An attempt to set a transaction to the serializable isolation level in
     hot standby mode will generate an error.
    </para>
   </listitem>
  </itemizedlist>

   </para>
  </sect2>

 </sect1>

</chapter>