summaryrefslogtreecommitdiff
path: root/storage/innobase/fil/fil0fil.cc
Commit message (Collapse)AuthorAgeFilesLines
* Merge 10.11 into 11.0Marko Mäkelä2023-03-171-11/+32
|\
| * Merge 10.6 into 10.8Marko Mäkelä2023-03-161-11/+32
| |\
| | * Merge 10.5 into 10.6Marko Mäkelä2023-03-161-11/+32
| | |\
| | | * MDEV-30775 Performance regression in fil_space_t::try_to_close() introduced ↵Vlad Lesin2023-03-101-10/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | in MDEV-23855 fil_node_open_file_low() tries to close files from the top of fil_system.space_list if the number of opened files is exceeded. It invokes fil_space_t::try_to_close(), which iterates the list searching for the first opened space. Then it just closes the space, leaving it in the same position in fil_system.space_list. On heavy files opening, like during 'SHOW TABLE STATUS ...' execution, if the number of opened files limit is reached, fil_space_t::try_to_close() iterates more and more closed spaces before reaching any opened space for each fil_node_open_file_low() call. What causes performance regression if the number of spaces is big enough. The fix is to keep opened spaces at the top of fil_system.space_list, and move closed files at the end of the list. For this purpose fil_space_t::space_list_last_opened pointer is introduced. It points to the last inserted opened space in fil_space_t::space_list. When space is opened, it's inserted to the position just after the pointer points to in fil_space_t::space_list to preserve the logic, inroduced in MDEV-23855. Any closed space is added to the end of fil_space_t::space_list. As opened spaces are located at the top of fil_space_t::space_list, fil_space_t::try_to_close() finds opened space faster. There can be the case when opened and closed spaces are mixed in fil_space_t::space_list if fil_system.freeze_space_list was set during fil_node_open_file_low() execution. But this should not cause any error, as fil_space_t::try_to_close() still iterates spaces in the list. There is no need in any test case for the fix, as it does not change any functionality, but just fixes performance regression.
* | | | Merge 10.11 into 11.0Marko Mäkelä2023-01-251-1/+1
|\ \ \ \ | |/ / /
| * | | Merge 10.7 into 10.8Marko Mäkelä2023-01-241-1/+1
| |\ \ \
| | * \ \ Merge 10.6 into 10.7Marko Mäkelä2023-01-241-1/+1
| | |\ \ \ | | | |/ /
| | | * | MDEV-30400 Assertion height == btr_page_get_level(...) on INSERTMarko Mäkelä2023-01-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This also fixes part of MDEV-29835 Partial server freeze which is caused by violations of the latching order that was defined in https://dev.mysql.com/worklog/task/?id=6326 (WL#6326: InnoDB: fix index->lock contention). Unless the current thread is holding an exclusive dict_index_t::lock, it must acquire page latches in a strict parent-to-child, left-to-right order. Not all cases of MDEV-29835 are fixed yet. Failure to follow the correct latching order will cause deadlocks of threads due to lock order inversion. As part of these changes, the BTR_MODIFY_TREE mode is modified so that an Update latch (U a.k.a. SX) will be acquired on the root page, and eXclusive latches (X) will be acquired on all pages leading to the leaf page, as well as any left and right siblings of the pages along the path. The DEBUG_SYNC test innodb.innodb_wl6326 will be removed, because at the time the DEBUG_SYNC point is hit, the thread is actually holding several page latches that will be blocking a concurrent SELECT statement. We also remove double bookkeeping that was caused due to excessive information hiding in mtr_t::m_memo. We simply let mtr_t::m_memo store information of latched pages, and ensure that mtr_memo_slot_t::object is never a null pointer. The tree_blocks[] and tree_savepoints[] were redundant. buf_page_get_low(): If innodb_change_buffering_debug=1, to avoid a hang, do not try to evict blocks if we are holding a latch on a modified page. The test innodb.innodb-change-buffer-recovery will be removed, because change buffering may no longer be forced by debug injection when the change buffer comprises multiple pages. Remove a debug assertion that could fail when innodb_change_buffering_debug=1 fails to evict a page. For other cases, the assertion is redundant, because we already checked that right after the got_block: label. The test innodb.innodb-change-buffering-recovery will be removed, because due to this change, we will be unable to evict the desired page. mtr_t::lock_register(): Register a change of a page latch on an unmodified buffer-fixed block. mtr_t::x_latch_at_savepoint(), mtr_t::sx_latch_at_savepoint(): Replaced by the use of mtr_t::upgrade_buffer_fix(), which now also handles RW_S_LATCH. mtr_t::set_modified(): For temporary tables, invoke buf_page_t::set_modified() here and not in mtr_t::commit(). We will never set the MTR_MEMO_MODIFY flag on other than persistent data pages, nor set mtr_t::m_modifications when temporary data pages are modified. mtr_t::commit(): Only invoke the buf_flush_note_modification() loop if persistent data pages were modified. mtr_t::get_already_latched(): Look up a latched page in mtr_t::m_memo. This avoids many redundant entries in mtr_t::m_memo, as well as redundant calls to buf_page_get_gen() for blocks that had already been looked up in a mini-transaction. btr_get_latched_root(): Return a pointer to an already latched root page. This replaces btr_root_block_get() in cases where the mini-transaction has already latched the root page. btr_page_get_parent(): Fetch a parent page that was already latched in BTR_MODIFY_TREE, by invoking mtr_t::get_already_latched(). If needed, upgrade the root page U latch to X. This avoids bloating mtr_t::m_memo as well as performing redundant buf_pool.page_hash lookups. For non-QUICK CHECK TABLE as well as for B-tree defragmentation, we will invoke btr_cur_search_to_nth_level(). btr_cur_search_to_nth_level(): This will only be used for non-leaf (level>0) B-tree searches that were formerly named BTR_CONT_SEARCH_TREE or BTR_CONT_MODIFY_TREE. In MDEV-29835, this function could be removed altogether, or retained for the case of CHECK TABLE without QUICK. btr_cur_t::left_block: Remove. btr_pcur_move_backward_from_page() can retrieve the left sibling from the end of mtr_t::m_memo. btr_cur_t::open_leaf(): Some clean-up. btr_cur_t::search_leaf(): Replaces btr_cur_search_to_nth_level() for searches to level=0 (the leaf level). We will never release parent page latches before acquiring leaf page latches. If we need to temporarily release the level=1 page latch in the BTR_SEARCH_PREV or BTR_MODIFY_PREV latch_mode, we will reposition the cursor on the child node pointer so that we will land on the correct leaf page. btr_cur_t::pessimistic_search_leaf(): Implement new BTR_MODIFY_TREE latching logic in the case that page splits or merges will be needed. The parent pages (and their siblings) should already be latched on the first dive to the leaf and be present in mtr_t::m_memo; there should be no need for BTR_CONT_MODIFY_TREE. This pre-latching almost suffices; it must be revised in MDEV-29835 and work-arounds removed for cases where mtr_t::get_already_latched() fails to find a block. rtr_search_to_nth_level(): A SPATIAL INDEX version of btr_search_to_nth_level() that can search to any level (including the leaf level). rtr_search_leaf(), rtr_insert_leaf(): Wrappers for rtr_search_to_nth_level(). rtr_search(): Replaces rtr_pcur_open(). rtr_latch_leaves(): Replaces btr_cur_latch_leaves(). Note that unlike in the B-tree code, there is no error handling in case the sibling pages are corrupted. rtr_cur_restore_position(): Remove an unused constant parameter. btr_pcur_open_on_user_rec(): Remove the constant parameter mode=PAGE_CUR_GE. row_ins_clust_index_entry_low(): Use a new mode=BTR_MODIFY_ROOT_AND_LEAF to gain access to the root page when mode!=BTR_MODIFY_TREE, to write the PAGE_ROOT_AUTO_INC. BTR_SEARCH_TREE, BTR_CONT_SEARCH_TREE: Remove. BTR_CONT_MODIFY_TREE: Note that this is only used by rtr_search_to_nth_level(). btr_pcur_optimistic_latch_leaves(): Replaces btr_cur_optimistic_latch_leaves(). ibuf_delete_rec(): Acquire exclusive ibuf.index->lock in order to avoid a deadlock with ibuf_insert_low(BTR_MODIFY_PREV). btr_blob_log_check_t(): Acquire a U latch on the root page, so that btr_page_alloc() in btr_store_big_rec_extern_fields() will avoid a deadlock. btr_store_big_rec_extern_fields(): Assert that the root page latch is being held. Tested by: Matthias Leich Reviewed by: Vladislav Lesin
* | | | | MDEV-29694 Remove the InnoDB change bufferMarko Mäkelä2023-01-111-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The purpose of the change buffer was to reduce random disk access, which could be useful on rotational storage, but maybe less so on solid-state storage. When we wished to (1) insert a record into a non-unique secondary index, (2) delete-mark a secondary index record, (3) delete a secondary index record as part of purge (but not ROLLBACK), and the B-tree leaf page where the record belongs to is not in the buffer pool, we inserted a record into the change buffer B-tree, indexed by the page identifier. When the page was eventually read into the buffer pool, we looked up the change buffer B-tree for any modifications to the page, applied these upon the completion of the read operation. This was called the insert buffer merge. We remove the change buffer, because it has been the source of various hard-to-reproduce corruption bugs, including those fixed in commit 5b9ee8d8193a8c7a8ebdd35eedcadc3ae78e7fc1 and commit 165564d3c33ae3d677d70644a83afcb744bdbf65 but not limited to them. A downgrade will fail with a clear message starting with commit db14eb16f9977453467ec4765f481bb2f71814ba (MDEV-30106). buf_page_t::state: Merge IBUF_EXIST to UNFIXED and WRITE_FIX_IBUF to WRITE_FIX. buf_pool_t::watch[]: Remove. trx_t: Move isolation_level, check_foreigns, check_unique_secondary, bulk_insert into the same bit-field. The only purpose of trx_t::check_unique_secondary is to enable bulk insert into an empty table. It no longer enables insert buffering for UNIQUE INDEX. btr_cur_t::thr: Remove. This field was originally needed for change buffering. Later, its use was extended to cover SPATIAL INDEX. Much of the time, rtr_info::thr holds this field. When it does not, we will add parameters to SPATIAL INDEX specific functions. ibuf_upgrade_needed(): Check if the change buffer needs to be updated. ibuf_upgrade(): Merge and upgrade the change buffer after all redo log has been applied. Free any pages consumed by the change buffer, and zero out the change buffer root page to mark the upgrade completed, and to prevent a downgrade to an earlier version. dict_load_tablespaces(): Renamed from dict_check_tablespaces_and_store_max_id(). This needs to be invoked before ibuf_upgrade(). btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore. btr_page_alloc(): Renamed from btr_page_alloc_low(). We no longer allocate any change buffer pages. btr_cur_open_at_rnd_pos(): Specialize for use in persistent statistics. The change buffer merge does not need this function anymore. row_search_index_entry(), btr_lift_page_up(): Add a parameter thr for the SPATIAL INDEX case. rtr_page_split_and_insert(): Specialized from btr_page_split_and_insert(). rtr_root_raise_and_insert(): Specialized from btr_root_raise_and_insert(). Note: The support for upgrading from the MySQL 3.23 or MySQL 4.0 change buffer format that predates the MySQL 4.1 introduction of the option innodb_file_per_table was removed in MySQL 5.6.5 as part of mysql/mysql-server@69b6241a79876ae98bb0c9dce7c8d8799d6ad273 and MariaDB 10.0.11 as part of 1d0f70c2f894b27e98773a282871d32802f67964. In the tests innodb.log_upgrade and innodb.log_corruption, we create valid (upgraded) change buffer pages. Tested by: Matthias Leich
* | | | | MDEV-30136: Deprecate innodb_flush_methodMarko Mäkelä2023-01-111-17/+122
|/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We introduce the following settable Boolean global variables: innodb_log_file_write_through: Whether writes to ib_logfile0 are write-through (disabling any caching, as in O_SYNC or O_DSYNC). innodb_data_file_write_through: Whether writes to any InnoDB data files (including the temporary tablespace) are write-through. innodb_data_file_buffering: Whether the file system cache is enabled for InnoDB data files. All these parameters are OFF by default, that is, the file system cache will be disabled, but any hardware caching is enabled, that is, explicit calls to fsync(), fdatasync() or similar functions are needed. On systems that support FUA it may make sense to enable write-through, to avoid extra system calls. If the deprecated read-only start-up parameter is set to one of the following values, then the values of the 4 Boolean flags (the above 3 plus innodb_log_file_buffering) will be set as follows: O_DSYNC: innodb_log_file_write_through=ON, innodb_data_file_write_through=ON, innodb_data_file_buffering=OFF, and (if supported) innodb_log_file_buffering=OFF. fsync, littlesync, nosync, or (Microsoft Windows specific) normal: innodb_log_file_write_through=OFF, innodb_data_file_write_through=OFF, and innodb_data_file_buffering=ON. Note: fsync() or fdatasync() will only be disabled if the separate parameter debug_no_sync (in the code, my_disable_sync) is set. In mariadb-backup, the parameter innodb_flush_method will be ignored. The Boolean parameters can be modified by SET GLOBAL while the server is running. This will require reopening the ib_logfile0 or all currently open InnoDB data files. We will open files straight in O_DSYNC or O_SYNC mode when applicable. Data files we will try to open straight in O_DIRECT mode when the page size is at least 4096 bytes. For atomically creating data files, we will invoke os_file_set_nocache() to enable O_DIRECT afterwards, because O_DIRECT is not supported on some file systems. We will also continue to invoke os_file_set_nocache() on ib_logfile0 when innodb_log_file_buffering=OFF can be fulfilled. For reopening the ib_logfile0, we use the same logic that was developed for online log resizing and reused for updates of innodb_log_file_buffering. Reopening all data files is implemented in the new function fil_space_t::reopen_all(). Reviewed by: Vladislav Vaintroub Tested by: Matthias Leich
* | | | Merge 10.7 into 10.8Marko Mäkelä2022-11-301-4/+1
|\ \ \ \ | |/ / /
| * | | Merge 10.6 into 10.7Marko Mäkelä2022-11-301-4/+1
| |\ \ \ | | |/ /
| | * | MDEV-30132 Crash after recovery, with InnoDB: Tried to read ...Marko Mäkelä2022-11-301-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | os_file_read(): Merged with os_file_read_no_error_handling(). Crashing on a partial page read is as unhelpful as crashing on a corrupted page read (commit 0b47c126e31cddda1e94588799599e138400bcf8). Report the file name if it is available via IORequest.
| | * | MDEV-30132 Crash after recovery, with InnoDB: Tried to read ... bytes at offsetMarko Mäkelä2022-11-301-0/+1
| | | | | | | | | | | | | | | | | | | | fil_space_t::prepare_acquired(): Do not attempt to extend (or shrink) files that will be processed by recv_sys_t::recover_deferred().
* | | | Merge 10.7 into 10.8Marko Mäkelä2022-09-211-4/+4
|\ \ \ \ | |/ / /
| * | | Merge 10.6 into 10.7Marko Mäkelä2022-09-211-4/+4
| |\ \ \ | | |/ /
| | * | Merge 10.5 into 10.6Marko Mäkelä2022-09-201-4/+4
| | |\ \ | | | |/
| | | * Merge 10.4 into 10.5Marko Mäkelä2022-09-201-4/+4
| | | |\
| | | | * Merge 10.3 into 10.4Marko Mäkelä2022-09-201-4/+4
| | | | |\
| | | | * \ Merge 10.3 into 10.4Marko Mäkelä2022-08-301-1/+1
| | | | |\ \ | | | | | |/
| | | | | * MDEV-29258 Failing assertion for name length on RENAME TABLEMarko Mäkelä2022-08-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | trx_undo_page_report_rename(): Use the correct maximum length of a table name. Both the database name and the table name can be up to NAME_CHAR_LEN (64 characters) times 5 bytes per character in the my_charset_filename encoding. They are not encoded in UTF-8! fil_op_write_log(): Reserve the correct amount of log buffer for a rename operation. The file name will be appended by mlog_catenate_string(). rename_file_ext(): Reserve a large enough buffer for the file names.
| | | | * | Merge 10.3 into 10.4Marko Mäkelä2022-07-271-14/+25
| | | | |\ \ | | | | | |/
| | | | | * MDEV-28495 InnoDB corruption due to lack of file lockingMarko Mäkelä2022-07-271-14/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Starting with commit da094188f60bf67e3d90227304a4ea256fe2630f (MDEV-24393), MariaDB will no longer acquire advisory file locks on InnoDB data files by default, because it would create a large number of entries in Linux /proc/locks. The motivation for acquiring the file locks is to prevent accidental concurrent startup of multiple server processes on the same data files. Such mistake still turns out to be relatively common, based on corruption bug reports from the community. To prevent corruption due to concurrent startup attempts, the Aria storage engine would unconditionally acquire an advisory lock on one of its log files. Solution: InnoDB will always lock its system tablespace files. (Ever since commit 685d958e38b825ad9829be311f26729cccf37c46 the InnoDB log file will not necessarily be open while the server is running, because it can be accessed via memory-mapped I/O.) If more protection is desired, then the option --external-locking can be used. The mandatory advisory lock also fixes intermittent failures of some crash recovery tests. It turns out that when the mtr test harness kills and restarts the server, it will not actually ensure that the old process has terminated before starting the new one.
| | | | * | Merge branch '10.3' into 10.4Oleksandr Byelkin2022-07-271-7/+1
| | | | |\ \ | | | | | |/
| | | | | * Correct a bogus comment.Marko Mäkelä2022-07-261-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The logic on Windows was originally simplified in commit 4bca1a786f9986cec7d0487059451e51e2b9479b for storage/xtradb and later in commit 6304c0bfc7242ad54d93168e97e82022765934e3 for storage/innobase.
* | | | | | Merge 10.7 into 10.8Marko Mäkelä2022-07-011-7/+13
|\ \ \ \ \ \ | |/ / / / /
| * | | | | Merge 10.6 into 10.7Marko Mäkelä2022-07-011-7/+13
| |\ \ \ \ \ | | |/ / / /
| | * | | | Merge 10.5 into 10.6Marko Mäkelä2022-06-301-7/+13
| | |\ \ \ \ | | | |/ / /
| | | * | | MDEV-26293 InnoDB: Failing assertion: space->is_ready_to_close() ...Marko Mäkelä2022-06-301-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fil_space_t::acquire_low(): Introduce a parameter that specifies which flags should be avoided. At all times, referenced() must not be incremented if the STOPPING flag is set. When fil_system.mutex is not being held by the current thread, the reference must not be incremented if the CLOSING flag is set (unless NEEDS_FSYNC is set, in fil_space_t::flush()). fil_space_t::acquire(): Invoke acquire_low(STOPPING | CLOSING). In this way, the reference count cannot be incremented after fil_space_t::try_to_close() invoked fil_space_t::set_closing(). If the CLOSING flag was set, we must retry acquire_low() after acquiring fil_system.mutex. fil_space_t::prepare_acquired(): Replaces prepare(true). fil_space_t::acquire_and_prepare(): Replaces prepare(). This basically retries fil_space_t::acquire() after acquiring fil_system.mutex.
* | | | | | Merge 10.7 into 10.8Marko Mäkelä2022-06-211-181/+44
|\ \ \ \ \ \ | |/ / / / /
| * | | | | Merge 10.6 into 10.7Marko Mäkelä2022-06-211-170/+39
| |\ \ \ \ \ | | |/ / / /
| | * | | | MDEV-28870 InnoDB: Missing FILE_CREATE, FILE_DELETE or FILE_MODIFY before ↵Marko Mäkelä2022-06-211-184/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FILE_CHECKPOINT There was a race condition between log_checkpoint_low() and deleting or renaming data files. The scenario is as follows: 1. The buffer pool does not contain dirty pages. 2. A FILE_DELETE or FILE_RENAME record is written. 3. The checkpoint LSN will be moved ahead of the write of the record. 4. The server is killed before the file is actually renamed or deleted. We will prevent this race condition by ensuring that a log checkpoint cannot occur between the durable write and the file system operation: 1. Durably write the FILE_DELETE or FILE_RENAME record. 2. Perform the file system operation. 3. Allow any log checkpoint to proceed. mtr_t::commit_file(): Implement the DELETE or RENAME logic. fil_delete_tablespace(): Delegate some of the logic to mtr_t::commit_file(). fil_space_t::rename(): Delegate some logic to mtr_t::commit_file(). Remove the debug injection point fil_rename_tablespace_failure_2 because we do test RENAME failures without any debug injection. fil_name_write_rename_low(), fil_name_write_rename(): Remove. Tested by Matthias Leich
* | | | | | Merge 10.7 into 10.8Marko Mäkelä2022-06-091-1/+1
|\ \ \ \ \ \ | |/ / / / /
| * | | | | Merge 10.6 into 10.7Marko Mäkelä2022-06-091-1/+1
| |\ \ \ \ \ | | |/ / / /
| | * | | | MDEV-28457 Crash in page_dir_find_owner_slot()Marko Mäkelä2022-06-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A prominent remaining source of crashes on corrupted index pages is page directory corruption. A frequent caller of page_dir_find_owner_slot() is page_rec_get_prev(). Some of those calls can be replaced with simpler logic that is less prone to fail. page_dir_find_owner_slot(), page_rec_get_prev(), page_rec_get_prev_const(), btr_pcur_move_to_prev(), btr_pcur_move_to_prev_on_page(), btr_cur_upd_rec_sys(), page_delete_rec_list_end(), rtr_page_copy_rec_list_end_no_locks(), rtr_page_copy_rec_list_start_no_locks(): Return an error code on failure. fil_space_t::io(), buf_page_get_low(): Use DB_CORRUPTION for out-of-bounds page reads. PageBulk::getSplitRec(), PageBulk::copyOut(): Simplify the code. btr_validate_level(): Prevent some more CHECK TABLE crashes on corrupted pages. btr_block_get(), btr_pcur_move_to_next_page(): Implement some checks that were previously only part of IndexPurge::next(). IndexPurge::next(): Use btr_pcur_move_to_next_page().
| | * | | | MDEV-18519: Assertion failure in btr_page_reorganize_low()Marko Mäkelä2022-06-081-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even after commit 0b47c126e31cddda1e94588799599e138400bcf8 there are a few ib::fatal() calls in non-debug code that can be replaced easily. btr_page_reorganize_low(): On size invariant violation, return an error code instead of crashing. btr_check_blob_fil_page_type(): On an invalid page type, report an error but do not crash. btr_copy_blob_prefix(): Truncate the output if a page type is invalid. dict_load_foreign_cols(): On an error, return DB_CORRUPTION instead of crashing. fil_space_decrypt_full_crc32(), fil_space_decrypt_for_non_full_checksum(): On error, return DB_DECRYPTION_FAILED instead of crashing. fil_set_max_space_id_if_bigger(): Replace ib::fatal() with an equivalent ut_a() assertion.
* | | | | | Merge 10.7 into 10.8Marko Mäkelä2022-06-061-39/+42
|\ \ \ \ \ \ | |/ / / / /
| * | | | | Merge 10.6 into 10.7Marko Mäkelä2022-06-061-41/+42
| |\ \ \ \ \ | | |/ / / /
| | * | | | MDEV-18976 Implement OPT_PAGE_CHECKSUM log record for improved validationMarko Mäkelä2022-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We will introduce an optional log record OPT_PAGE_CHECKSUM for recording page checksums, so that more inconsistencies on crash recovery may be caught. mtr_t::page_checksum(const buf_page_t&): Write OPT_PAGE_CHECKSUM (currently not for ROW_FORMAT=COMPRESSED pages). mtr_t::do_write(): Write OPT_PAGE_CHECKSUM records for all pages (currently, in debug builds only). mtr_t::is_logged(): Return whether log should be written. mtr_t::set_log_mode_sub(const mtr_t&): Set the logging mode of a sub-minitransaction when another mini-transaction is holding latches on some modified pages. When creating or freeing BLOB pages, we may only write OPT_PAGE_CHECKSUM records in the main mini-transaction, after all changes have been written to the log. MTR_LOG_SUB: Log mode for a sub-mini-transaction. mtr_t::free(): Define non-inline, and invoke MarkFreed. MarkFreed: For any matching page in the mini-transaction log, change the first entry to say MTR_MEMO_PAGE_X_MODIFY and any subsequent entries to MTR_MEMO_PAGE_X_FIX. FindModified: Simplify a condition. MTR_MEMO_MODIFY can only be set if MTR_MEMO_PAGE_X_FIX or MTR_MEMO_PAGE_SX_FIX are set. FindBlockX: Consider also MTR_MEMO_PAGE_X_MODIFY. recv_sys_t::parse(): Store OPT_PAGE_CHECKSUM records. log_phys_t::apply(): Validate OPT_PAGE_CHECKSUM records. log_phys_t::page_checksum(): Validate an OPT_PAGE_CHECKSUM record. Tested by: Matthias Leich
| | * | | | MDEV-13542: Implement page read fault injectionMarko Mäkelä2022-06-061-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | --debug-dbug=d,intermittent_read_failure is effective after the database has been started up. --debug-dbug=d,intermittent_recovery_failure is always effective, including during recovery.
| | * | | | MDEV-13542: Crashing on corrupted page is unhelpfulMarko Mäkelä2022-06-061-34/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The approach to handling corruption that was chosen by Oracle in commit 177d8b0c125b841c0650d27d735e3b87509dc286 is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.
| | * | | | Cleanup: Remove fil_space_t::magic_nMarko Mäkelä2022-06-061-6/+1
| | | | | |
| | * | | | Merge 10.5 into 10.6Marko Mäkelä2022-06-021-1/+1
| | |\ \ \ \ | | | |/ / /
| | | * | | Clean up mtr_tMarko Mäkelä2022-06-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mtr_t::is_empty(): Replaces mtr_t::get_log() and mtr_t::get_memo(). mtr_t::get_log_size(): Replaces mtr_t::get_log(). mtr_t::print(): Remove, unused function. ReleaseBlocks::ReleaseBlocks(): Remove an unused parameter.
* | | | | | Merge 10.7 into 10.8Marko Mäkelä2022-04-271-0/+10
|\ \ \ \ \ \ | |/ / / / /
| * | | | | Merge 10.6 into 10.7Marko Mäkelä2022-04-261-2/+13
| |\ \ \ \ \ | | |/ / / /
| | * | | | Merge 10.5 into 10.6Marko Mäkelä2022-04-261-8/+2
| | |\ \ \ \ | | | |/ / /
| | | * | | MDEV-27094 Debug builds include useless InnoDB "disabled" optionsMarko Mäkelä2022-04-221-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a backport of commit 4489a89c71ea78f2562159ca0303fbb83da5baa0 in order to remove the test innodb.redo_log_during_checkpoint that would cause trouble in the DBUG subsystem invoked by safe_mutex_lock() via log_checkpoint(). Before commit 7cffb5f6e8a231a041152447be8980ce35d2c9b8 these mutexes were of different type. The following options were introduced in commit 2e814d4702d71a04388386a9f591d14a35980bfe (mariadb-10.2.2) and have little use: innodb_disable_resize_buffer_pool_debug had no effect even in MariaDB 10.2.2 or MySQL 5.7.9. It was introduced in mysql/mysql-server@5c4094cf4971eebab89da4ee4ae92c71f69cd524 to work around a problem that was fixed in mysql/mysql-server@2957ae4f990bf3aed25822b0ce15d3ccad0b54b6 (but the parameter was not removed). innodb_page_cleaner_disabled_debug and innodb_master_thread_disabled_debug are only used by the test innodb.redo_log_during_checkpoint that will be removed as part of this commit. innodb_dict_stats_disabled_debug is only used by that test, and it is redundant because one could simply use innodb_stats_persistent=OFF or the STATS_PERSISTENT=0 attribute of the table in the test to achieve the same effect.
| | * | | | Merge 10.5 into 10.6Marko Mäkelä2022-04-211-0/+3
| | |\ \ \ \ | | | |/ / /
| | | * | | MDEV-28371 Assertion fold == id.fold() failed in buf_flush_check_neighbor()Marko Mäkelä2022-04-211-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Due to 32-bit arithmetics, SRV_TMP_SPACE_ID page number 0x200002 would be folded to 0, which is incompatible with the assumption that was made in commit 7cffb5f6e8a231a041152447be8980ce35d2c9b8 (MDEV-23399). page_id_t::fold(): Compute in the native word width instead of uint32_t. On 64-bit platforms, an alternative would be to return the 64-bit m_id directly, but that was measured to cause a performance regression. fil_space_t::open(): Invoke fil_node_t::find_metadata() when the tablespace is being created. In this way, we will actually detect that the temporary tablespace resides on SSD. (During database creation, also the system tablespace will correctly be detected as residing on SSD.)