summaryrefslogtreecommitdiff
path: root/storage/innobase/ibuf
Commit message (Collapse)AuthorAgeFilesLines
* MDEV-13542: Crashing on corrupted page is unhelpfulMarko Mäkelä2022-06-061-358/+349
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The approach to handling corruption that was chosen by Oracle in commit 177d8b0c125b841c0650d27d735e3b87509dc286 is not really useful. Not only did it actually fail to prevent InnoDB from crashing, but it is making things worse by blocking attempts to rescue data from or rebuild a partially readable table. We will try to prevent crashes in a different way: by propagating errors up the call stack. We will never mark the clustered index persistently corrupted, so that data recovery may be attempted by reading from the table, or by rebuilding the table. This should also fix MDEV-13680 (crash on btr_page_alloc() failure); it was extensively tested with innodb_file_per_table=0 and a non-autoextend system tablespace. We should now avoid crashes in many cases, such as when a page cannot be read or allocated, or an inconsistency is detected when attempting to update multiple pages. We will not crash on double-free, such as on the recovery of DDL in system tablespace in case something was corrupted. Crashes on corrupted data are still possible. The fault injection mechanism that is introduced in the subsequent commit may help catch more of them. buf_page_import_corrupt_failure: Remove the fault injection, and instead corrupt some pages using Perl code in the tests. btr_cur_pessimistic_insert(): Always reserve extents (except for the change buffer), in order to prevent a subsequent allocation failure. btr_pcur_open_at_rnd_pos(): Merged to the only caller ibuf_merge_pages(). btr_assert_not_corrupted(), btr_corruption_report(): Remove. Similar checks are already part of btr_block_get(). FSEG_MAGIC_N_BYTES: Replaces FSEG_MAGIC_N_VALUE. dict_hdr_get(), trx_rsegf_get_new(), trx_undo_page_get(), trx_undo_page_get_s_latched(): Replaced with error-checking calls. trx_rseg_t::get(mtr_t*): Replaces trx_rsegf_get(). trx_rseg_header_create(): Let the caller update the TRX_SYS page if needed. trx_sys_create_sys_pages(): Merged with trx_sysf_create(). dict_check_tablespaces_and_store_max_id(): Do not access DICT_HDR_MAX_SPACE_ID, because it was already recovered in dict_boot(). Merge dict_check_sys_tables() with this function. dir_pathname(): Replaces os_file_make_new_pathname(). row_undo_ins_remove_sec(): Do not modify the undo page by adding a terminating NUL byte to the record. btr_decryption_failed(): Report decryption failures dict_set_corrupted_by_space(), dict_set_encrypted_by_space(), dict_set_corrupted_index_cache_only(): Remove. dict_set_corrupted(): Remove the constant parameter dict_locked=false. Never flag the clustered index corrupted in SYS_INDEXES, because that would deny further access to the table. It might be possible to repair the table by executing ALTER TABLE or OPTIMIZE TABLE, in case no B-tree leaf page is corrupted. dict_table_skip_corrupt_index(), dict_table_next_uncorrupted_index(), row_purge_skip_uncommitted_virtual_index(): Remove, and refactor the callers to read dict_index_t::type only once. dict_table_is_corrupted(): Remove. dict_index_t::is_btree(): Determine if the index is a valid B-tree. BUF_GET_NO_LATCH, BUF_EVICT_IF_IN_POOL: Remove. UNIV_BTR_DEBUG: Remove. Any inconsistency will no longer trigger assertion failures, but error codes being returned. buf_corrupt_page_release(): Replaced with a direct call to buf_pool.corrupted_evict(). fil_invalid_page_access_msg(): Never crash on an invalid read; let the caller of buf_page_get_gen() decide. btr_pcur_t::restore_position(): Propagate failure status to the caller by returning CORRUPTED. opt_search_plan_for_table(): Simplify the code. row_purge_del_mark(), row_purge_upd_exist_or_extern_func(), row_undo_ins_remove_sec_rec(), row_undo_mod_upd_del_sec(), row_undo_mod_del_mark_sec(): Avoid mem_heap_create()/mem_heap_free() when no secondary indexes exist. row_undo_mod_upd_exist_sec(): Simplify the code. row_upd_clust_step(), dict_load_table_one(): Return DB_TABLE_CORRUPT if the clustered index (and therefore the table) is corrupted, similar to what we do in row_insert_for_mysql(). fut_get_ptr(): Replace with buf_page_get_gen() calls. buf_page_get_gen(): Return nullptr and *err=DB_CORRUPTION if the page is marked as freed. For other modes than BUF_GET_POSSIBLY_FREED or BUF_PEEK_IF_IN_POOL this will trigger a debug assertion failure. For BUF_GET_POSSIBLY_FREED, we will return nullptr for freed pages, so that the callers can be simplified. The purge of transaction history will be a new user of BUF_GET_POSSIBLY_FREED, to avoid crashes on corrupted data. buf_page_get_low(): Never crash on a corrupted page, but simply return nullptr. fseg_page_is_allocated(): Replaces fseg_page_is_free(). fts_drop_common_tables(): Return an error if the transaction was rolled back. fil_space_t::set_corrupted(): Report a tablespace as corrupted if it was not reported already. fil_space_t::io(): Invoke fil_space_t::set_corrupted() to report out-of-bounds page access or other errors. Clean up mtr_t::page_lock() buf_page_get_low(): Validate the page identifier (to check for recently read corrupted pages) after acquiring the page latch. buf_page_t::read_complete(): Flag uninitialized (all-zero) pages with DB_FAIL. Return DB_PAGE_CORRUPTED on page number mismatch. mtr_t::defer_drop_ahi(): Renamed from mtr_defer_drop_ahi(). recv_sys_t::free_corrupted_page(): Only set_corrupt_fs() if any log records exist for the page. We do not mind if read-ahead produces corrupted (or all-zero) pages that were not actually needed during recovery. recv_recover_page(): Return whether the operation succeeded. recv_sys_t::recover_low(): Simplify the logic. Check for recovery error. Thanks to Matthias Leich for testing this extensively and to the authors of https://rr-project.org for making it easy to diagnose and fix any failures that were found during the testing.
* MDEV-28465 Some calls to btr_pcur_close() are unnecessaryMarko Mäkelä2022-05-031-15/+7
| | | | | | | | | | The function btr_pcur_close() is being invoked on local variables even when no cleanup needs to be done. In particular, for B-tree indexes (not SPATIAL INDEX), unless btr_pcur_store_position() was invoked in the past, there is no need to invoke btr_pcur_close(). On purge and rollback, we will retain btr_pcur_close(&pcur) because otherwise some ./mtr --suite=innodb_gis tests would leak memory.
* MDEV-28454 Latching order violation (and hang) in ibuf_insert_lowMarko Mäkelä2022-05-031-7/+8
| | | | | | | | | | | | | | | | | Since commit 2ca112346438611ab7800b70bea6af1fd1169308 (MDEV-26217) the function trx_t::commit(std::vector<pfs_os_file_t>&) holds exclusive lock_sys.latch while invoking fil_delete_tablespace(), which in turn may wait for change buffer tree latches in ibuf_delete_for_discarded_space(). ibuf_insert_low(): If a shared lock_sys.latch cannot be acquired without waiting, refuse to buffer the insert. In this way, a deadlock with a DDL operation will be avoided. ibuf_insert_to_index_page(), ibuf_delete(): Remove redundant calls to record locking. In ibuf_insert_low() we already ensured that no record locks existed on the page. No locks can be added before the buffered changes have been merged.
* MDEV-26217 Failing assertion: list.count > 0 in ut_list_remove or Assertion ↵Marko Mäkelä2022-04-261-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | `lock->trx == this' failed in dberr_t trx_t::drop_table This follows up the previous fix in commit c3c53926c467c95386ae98d61ada87294bd61478 (MDEV-26554). ha_innobase::delete_table(): Work around the insufficient metadata locking (MDL) during DML operations by acquiring exclusive InnoDB table locks on all child tables. Previously, this was only done on TRUNCATE and ALTER. ibuf_delete_rec(), btr_cur_optimistic_delete(): Do not invoke lock_update_delete() during change buffer operations. The revised trx_t::commit(std::vector<pfs_os_file_t>&) will hold exclusive lock_sys.latch while invoking fil_delete_tablespace(), which in turn may invoke ibuf_delete_rec(). dict_index_t::has_locking(): A new predicate, replacing the dummy !dict_table_is_locking_disabled(index->table). Used for skipping lock operations during ibuf_delete_rec(). trx_t::commit(std::vector<pfs_os_file_t>&): Release the locks and remove the table from the cache while holding exclusive lock_sys.latch. trx_t::commit_in_memory(): Skip release_locks() if dict_operation holds. trx_t::commit(): Reset dict_operation before invoking commit_in_memory() via commit_persist(). lock_release_on_drop(): Release locks while lock_sys.latch is exclusively locked. lock_table(): Add a parameter for a pointer to the table. We must not dereference the table before a lock_sys.latch has been acquired. If the pointer to the table does not match the table at that point, the table is invalid and DB_DEADLOCK will be returned. row_ins_foreign_check_on_constraint(): Improve the checks. Remove a bogus DB_LOCK_WAIT_TIMEOUT return that was needed before commit c5fd9aa562fb15e8d6ededceccbec0c9792a3243 (MDEV-25919). row_upd_check_references_constraints(), wsrep_row_upd_check_foreign_constraints(): Simplify checks.
* Merge 10.5 into 10.6Marko Mäkelä2022-04-211-24/+10
|\
| * Merge 10.4 into 10.5Marko Mäkelä2022-04-211-26/+9
| |\
| | * Merge 10.3 into 10.4Marko Mäkelä2022-04-211-25/+8
| | |\
| | | * MDEV-28369 ibuf_bitmap_mutex is an unnecessary contention pointMarko Mäkelä2022-04-211-26/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only purpose of ibuf_bitmap_mutex is to prevent a deadlock between two concurrent invocations of ibuf_update_free_bits_for_two_pages_low() on the same pair of bitmap pages, but in opposite order. The mutex is unnecessarily serializing the execution of the function even when it is being invoked on totally different tablespaces. To avoid deadlocks, it suffices to ensure that the two page latches are being acquired in a deterministic (sorted) order.
| | * | Merge 10.3 into 10.4Marko Mäkelä2022-04-071-1/+10
| | |\ \ | | | |/
| | | * MDEV-28247 : Disable background ibuf merge during Galera SSTJan Lindström2022-04-071-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This failure was caused by MDEV-25975, which removed the parameter innodb_disallow_writes. Added a check for wsrep_sst_disable_writes to the function ibuf_merge_in_background().
| | * | Merge 10.3 into 10.4Vlad Lesin2022-02-211-1/+2
| | |\ \ | | | |/
| | | * MDEV-20605 Awaken transaction can miss inserted by other transaction records ↵Vlad Lesin2022-02-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | due to wrong persistent cursor restoration Backported from 10.5 20e9e804c131c6522bc7c469e4863e8d1eaa3ee0 and 5948d7602ec7f61937c368dcb134e6ec226a2990. sel_restore_position_for_mysql() moves forward persistent cursor position after btr_pcur_restore_position() call if cursor relative position is BTR_PCUR_ON and the cursor points to the record with NOT the same field values as in a stored record(and some other not important for this case conditions). It was done because btr_pcur_restore_position() sets page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON before opening cursor. So we are searching for the record less or equal to stored one. And if the found record is not equal to stored one, then it is less and we need to move cursor forward. But there can be a situation when the stored record was purged, but the new one with the same key but different value was inserted while row_search_mvcc() was suspended. In this case, when the thread is awaken, it will invoke sel_restore_position_for_mysql(), which, in turns, invoke btr_pcur_restore_position(), which will return false because found record don't match stored record, and sel_restore_position_for_mysql() will move forward cursor position. The above can lead to the case when awaken row_search_mvcc() do not see records inserted by other transactions while it slept. The mtr test case shows the example how it can be. The fix is to return special value from persistent cursor restoring function which would notify its caller that uniq fields of restored record and stored record are the same, and in this case sel_restore_position_for_mysql() don't move cursor forward. Delete-marked records are correctly processed in row_search_mvcc(). Non-unique secondary indexes are "uniquified" by adding the PK, the index->n_uniq should then be index->n_fields. So there is no need in additional checks in the fix. If transaction's readview can't see the changes made in secondary index record, it requests clustered index record in row_search_mvcc() to check its transaction id and get the correspondent record version. After this row_search_mvcc() commits mtr to preserve clustered index latching order, and starts mtr. Between those mtr commit and start secondary index pages are unlatched, and purge has the ability to remove stored in the cursor record, what causes rows duplication in result set for non-locking reads, as cursor position is restored to the previously visited record. To solve this the changes are just switched off for non-locking reads, it's quite simple solution, besides the changes don't make sense for non-locking reads. The more complex and effective from performance perspective solution is to create mtr savepoint before clustered record requesting and rolling back to that savepoint after that. See MDEV-27557. One more solution is to have per-record transaction id for secondary indexes. See MDEV-17598. If any of those is implemented, just remove select_lock_type argument in sel_restore_position_for_mysql().
* | | | Merge 10.5 into 10.6Vlad Lesin2022-02-141-1/+2
|\ \ \ \ | |/ / /
| * | | MDEV-20605 Awaken transaction can miss inserted by other transaction records ↵Vlad Lesin2022-02-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | due to wrong persistent cursor restoration sel_restore_position_for_mysql() moves forward persistent cursor position after btr_pcur_restore_position() call if cursor relative position is BTR_PCUR_ON and the cursor points to the record with NOT the same field values as in a stored record(and some other not important for this case conditions). It was done because btr_pcur_restore_position() sets page_cur_mode_t mode to PAGE_CUR_LE for cursor->rel_pos == BTR_PCUR_ON before opening cursor. So we are searching for the record less or equal to stored one. And if the found record is not equal to stored one, then it is less and we need to move cursor forward. But there can be a situation when the stored record was purged, but the new one with the same key but different value was inserted while row_search_mvcc() was suspended. In this case, when the thread is awaken, it will invoke sel_restore_position_for_mysql(), which, in turns, invoke btr_pcur_restore_position(), which will return false because found record don't match stored record, and sel_restore_position_for_mysql() will move forward cursor position. The above can lead to the case when awaken row_search_mvcc() do not see records inserted by other transactions while it slept. The mtr test case shows the example how it can be. The fix is to return special value from persistent cursor restoring function which would notify its caller that uniq fields of restored record and stored record are the same, and in this case sel_restore_position_for_mysql() don't move cursor forward. Delete-marked records are correctly processed in row_search_mvcc(). Non-unique secondary indexes are "uniquified" by adding the PK, the index->n_uniq should then be index->n_fields. So there is no need in additional checks in the fix. If transaction's readview can't see the changes made in secondary index record, it requests clustered index record in row_search_mvcc() to check its transaction id and get the correspondent record version. After this row_search_mvcc() commits mtr to preserve clustered index latching order, and starts mtr. Between those mtr commit and start secondary index pages are unlatched, and purge has the ability to remove stored in the cursor record, what causes rows duplication in result set for non-locking reads, as cursor position is restored to the previously visited record. To solve this the changes are just switched off for non-locking reads, it's quite simple solution, besides the changes don't make sense for non-locking reads. The more complex and effective from performance perspective solution is to create mtr savepoint before clustered record requesting and rolling back to that savepoint after that. See MDEV-27557. One more solution is to have per-record transaction id for secondary indexes. See MDEV-17598. If any of those is implemented, just remove select_lock_type argument in sel_restore_position_for_mysql().
* | | | MDEV-27058: Reduce the size of buf_block_t and buf_page_tMarko Mäkelä2021-11-181-39/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | buf_page_t::frame: Moved from buf_block_t::frame. All 'thin' buf_page_t describing compressed-only ROW_FORMAT=COMPRESSED pages will have frame=nullptr, while all 'fat' buf_block_t will have a non-null frame pointing to aligned innodb_page_size bytes. This eliminates the need for separate states for BUF_BLOCK_FILE_PAGE and BUF_BLOCK_ZIP_PAGE. buf_page_t::lock: Moved from buf_block_t::lock. That is, all block descriptors will have a page latch. The IO_PIN state that was used for discarding or creating the uncompressed page frame of a ROW_FORMAT=COMPRESSED block is replaced by a combination of read-fix and page X-latch. page_zip_des_t::fix: Replaces state_, buf_fix_count_, io_fix_, status of buf_page_t with a single std::atomic<uint32_t>. All modifications will use store(), fetch_add(), fetch_sub(). This space was previously wasted to alignment on 64-bit systems. We will use the following encoding that combines a state (partly read-fix or write-fix) and a buffer-fix count: buf_page_t::NOT_USED=0 (previously BUF_BLOCK_NOT_USED) buf_page_t::MEMORY=1 (previously BUF_BLOCK_MEMORY) buf_page_t::REMOVE_HASH=2 (previously BUF_BLOCK_REMOVE_HASH) buf_page_t::FREED=3 + fix: pages marked as freed in the file buf_page_t::UNFIXED=1U<<29 + fix: normal pages buf_page_t::IBUF_EXIST=2U<<29 + fix: normal pages; may need ibuf merge buf_page_t::REINIT=3U<<29 + fix: reinitialized pages (skip doublewrite) buf_page_t::READ_FIX=4U<<29 + fix: read-fixed pages (also X-latched) buf_page_t::WRITE_FIX=5U<<29 + fix: write-fixed pages (also U-latched) buf_page_t::WRITE_FIX_IBUF=6U<<29 + fix: write-fixed; may have ibuf buf_page_t::WRITE_FIX_REINIT=7U<<29 + fix: write-fixed (no doublewrite) buf_page_t::write_complete(): Change WRITE_FIX or WRITE_FIX_REINIT to UNFIXED, and WRITE_FIX_IBUF to IBUF_EXIST, before releasing the U-latch. buf_page_t::read_complete(): Renamed from buf_page_read_complete(). Change READ_FIX to UNFIXED or IBUF_EXIST, before releasing the X-latch. buf_page_t::can_relocate(): If the page latch is being held or waited for, or the block is buffer-fixed or io-fixed, return false. (The condition on the page latch is new.) Outside buf_page_get_gen(), buf_page_get_low() and buf_page_free(), we will acquire the page latch before fix(), and unfix() before unlocking. buf_page_t::flush(): Replaces buf_flush_page(). Optimize the handling of FREED pages. buf_pool_t::release_freed_page(): Assume that buf_pool.mutex is held by the caller. buf_page_t::is_read_fixed(), buf_page_t::is_write_fixed(): New predicates. buf_page_get_low(): Ignore guesses that are read-fixed because they may not yet be registered in buf_pool.page_hash and buf_pool.LRU. buf_page_optimistic_get(): Acquire latch before buffer-fixing. buf_page_make_young(): Leave read-fixed blocks alone, because they might not be registered in buf_pool.LRU yet. recv_sys_t::recover_deferred(), recv_sys_t::recover_low(): Possibly fix MDEV-26326, by holding a page X-latch instead of only buffer-fixing the page.
* | | | MDEV-26933 InnoDB fails to detect page number mismatchbb-10.6-MDEV-26933Marko Mäkelä2021-11-011-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mtr_t::page_lock(): Validate the page number. ibuf_tree_root_get(): Remove assertions that became redundant. The assertions in btr_validate_level() are kind of redundant as well, but because they are ut_a(), they are also present in release builds, while the ones in mtr_t::page_lock() are only present in debug builds. btr_cur_position(): Do not duplicate an assertion that is part of page_cur_position(). dict_load_tablespace(): Introduce a new option DICT_ERR_IGNORE_TABLESPACE that will suppress loading a tablespace when a table is going to be dropped.
* | | | MDEV-26769 InnoDB does not support hardware lock elisionMarko Mäkelä2021-10-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implements memory transaction support for: * Intel Restricted Transactional Memory (RTM), also known as TSX-NI (Transactional Synchronization Extensions New Instructions) * POWER v2.09 Hardware Trace Monitor (HTM) on GNU/Linux transactional_lock_guard, transactional_shared_lock_guard: RAII lock guards that try to elide the lock acquisition when transactional memory is available. buf_pool.page_hash: Try to elide latches whenever feasible. Related to the InnoDB change buffer and ROW_FORMAT=COMPRESSED tables, this is not always possible. In buf_page_get_low(), memory transactions only work reasonably well for validating a guessed block address. TMLockGuard, TMLockTrxGuard, TMLockMutexGuard: RAII lock guards that try to elide lock_sys.latch and related latches.
* | | | MDEV-26826 Duplicated computations of buf_pool.page_hash addressesMarko Mäkelä2021-10-221-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit bd5a6403cace36c6ed428cde62e35adcd3f7e7d0 (MDEV-26033) we can actually calculate the buf_pool.page_hash cell and latch addresses while not holding buf_pool.mutex. buf_page_alloc_descriptor(): Remove the MEM_UNDEFINED. We now expect buf_page_t::hash to be zero-initialized. buf_pool_t::hash_chain: Dedicated data type for buf_pool.page_hash.array. buf_LRU_free_one_page(): Merged to the only caller buf_pool_t::corrupted_evict().
* | | | Merge 10.5 into 10.6Marko Mäkelä2021-06-091-0/+27
|\ \ \ \ | |/ / /
| * | | MDEV-25783: Change buffer records are lost under heavy loadMarko Mäkelä2021-06-071-0/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ibuf_read_merge_pages(): Disable some code that was added in MDEV-20394 in order to avoid a server hang if the change buffer is corrupted, presumably due to the race condition in recovery that was later fixed in MDEV-24449. The code will still be available in debug builds when the command line option --debug=d,ibuf_merge_corruption is specified. Due to MDEV-19514, the impact of this code is much worse starting with the 10.5 series. In older versions, the code was only enabled during a shutdown with innodb_fast_shutdown=0, but in 10.5 it was active during the normal operation of the server.
* | | | Merge 10.5 into 10.6Marko Mäkelä2021-06-011-5/+6
|\ \ \ \ | |/ / /
| * | | MDEV-19514 fixup: Optimize ibuf_merge_or_delete_for_page() callsMarko Mäkelä2021-05-311-5/+6
| | | |
* | | | MDEV-25743: Unnecessary copying of table names in InnoDB dictionaryMarko Mäkelä2021-05-211-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Many InnoDB data dictionary cache operations require that the table name be copied so that it will be NUL terminated. (For example, SYS_TABLES.NAME is not guaranteed to be NUL-terminated.) dict_table_t::is_garbage_name(): Check if a name belongs to the background drop table queue. dict_check_if_system_table_exists(): Remove. dict_sys_t::load_sys_tables(): Load the non-hard-coded system tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL on startup. dict_sys_t::create_or_check_sys_tables(): Replaces dict_create_or_check_foreign_constraint_tables() and dict_create_or_check_sys_virtual(). dict_sys_t::load_table(): Replaces dict_table_get_low() and dict_load_table(). dict_sys_t::find_table(): Renamed from get_table(). dict_sys_t::sys_tables_exist(): Check whether all the non-hard-coded tables SYS_FOREIGN, SYS_FOREIGN_COLS, SYS_VIRTUAL exist. trx_t::has_stats_table_lock(): Moved to dict0stats.cc. Some error messages will now report table names in the internal databasename/tablename format, instead of `databasename`.`tablename`.
* | | | Merge 10.5 into 10.6Marko Mäkelä2021-04-141-4/+5
|\ \ \ \ | |/ / /
| * | | Merge 10.4 into 10.5Marko Mäkelä2021-04-141-4/+5
| |\ \ \ | | |/ /
| | * | Merge 10.3 into 10.4Marko Mäkelä2021-04-141-10/+13
| | |\ \ | | | |/
| | | * MDEV-24620 ASAN heap-buffer-overflow in btr_pcur_restore_position()bb-10.3-MDEV-24620Marko Mäkelä2021-04-131-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Between btr_pcur_store_position() and btr_pcur_restore_position() it is possible that purge empties a table and enlarges index->n_core_fields and index->n_core_null_bytes. Therefore, we must cache index->n_core_fields in btr_pcur_t::old_n_core_fields so that btr_pcur_t::old_rec can be parsed correctly. Unfortunately, this is a huge change, because we will replace "bool leaf" parameters with "ulint n_core" (passing index->n_core_fields, or 0 for non-leaf pages). For special cases where we know that index->is_instant() cannot hold, we may also pass index->n_fields.
| | | * Merge 10.2 into 10.3Marko Mäkelä2021-04-091-6/+8
| | | |\
| | | | * MDEV-13103 fixup: Actually fix a crash during IMPORT TABLESPACEMarko Mäkelä2021-03-311-6/+8
| | | | |
* | | | | Merge 10.5 into 10.6Marko Mäkelä2021-03-051-16/+23
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-25018 [FATAL] InnoDB: Unable to read page (of a dropped tablespace)bb-10.5-MDEV-25018Thirunarayanan Balathandayuthapani2021-03-031-24/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - This issue is caused by commit deadec4e689c9435e20ebb89fd8f84d3f0f90ff3 (MDEV-24569). InnoDB fails to read the change buffer bitmap page from dropped tablespace. In ibuf_bitmap_get_map_page_func(), InnoDB should fetch the page using BUF_GET_POSSIBLY_FREED mode. Callers of ibuf_bitmap_get_map_page() should be adjusted in that case.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2021-03-021-0/+1
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-24997 Assertion mtr->is_named_space(page_id.space()) in ibuf0ibuf.cc:624Thirunarayanan Balathandayuthapani2021-02-261-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - This is caused by commit deadec4e689c9435e20ebb89fd8f84d3f0f90ff3 (MDEV-24569). InnoDB fails to set the tablespace associated with mini-transacton while resetting the change buffer bitmap bits of the page.
* | | | | MDEV-20612 fixup: Reduce hash table lookupsMarko Mäkelä2021-02-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let us calculate the hash table cell address while we are calculating the latch address, to avoid repeated computations of the address. The latch address can be derived from the cell address with a simple bitmask operation.
* | | | | MDEV-20612: Partition lock_sys.latchMarko Mäkelä2021-02-121-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We replace the old lock_sys.mutex (which was renamed to lock_sys.latch) with a combination of a global lock_sys.latch and table or page hash lock mutexes. The global lock_sys.latch can be acquired in exclusive mode, or it can be acquired in shared mode and another mutex will be acquired to protect the locks for a particular page or a table. This is inspired by mysql/mysql-server@1d259b87a63defa814e19a7534380cb43ee23c48 but the optimization of lock_release() will be done in the next commit. Also, we will interleave mutexes with the hash table elements, similar to how buf_pool.page_hash was optimized in commit 5155a300fab85e97217c75e3ba3c3ce78082dd8a (MDEV-22871). dict_table_t::autoinc_trx: Use Atomic_relaxed. dict_table_t::autoinc_mutex: Use srw_mutex in order to reduce the memory footprint. On 64-bit Linux or OpenBSD, both this and the new dict_table_t::lock_mutex should be 32 bits and be stored in the same 64-bit word. On Microsoft Windows, the underlying SRWLOCK is 32 or 64 bits, and on other systems, sizeof(pthread_mutex_t) can be much larger. ib_lock_t::trx_locks, trx_lock_t::trx_locks: Document the new rules. Writers must assert lock_sys.is_writer() || trx->mutex_is_owner(). LockGuard: A RAII wrapper for acquiring a page hash table lock. LockGGuard: Like LockGuard, but when Galera Write-Set Replication is enabled, we must acquire all shards, for updating arbitrary trx_locks. LockMultiGuard: A RAII wrapper for acquiring two page hash table locks. lock_rec_create_wsrep(), lock_table_create_wsrep(): Special Galera conflict resolution in non-inlined functions in order to keep the common code paths shorter. lock_sys_t::prdt_page_free_from_discard(): Refactored from lock_prdt_page_free_from_discard() and lock_rec_free_all_from_discard_page(). trx_t::commit_tables(): Replaces trx_update_mod_tables_timestamp(). lock_release(): Let trx_t::commit_tables() invalidate the query cache for those tables that were actually modified by the transaction. Merge lock_check_dict_lock() to lock_release(). We must never release lock_sys.latch while holding any lock_sys_t::hash_latch. Failure to do that could lead to memory corruption if the buffer pool is resized between the time lock_sys.latch is released and the hash_latch is released.
* | | | | MDEV-20612: Replace lock_sys.mutex with lock_sys.latchMarko Mäkelä2021-02-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For now, we will acquire the lock_sys.latch only in exclusive mode, that is, use it as a mutex. This is preparation for the next commit where we will introduce a less intrusive alternative, combining a shared lock_sys.latch with dict_table_t::lock_mutex or a mutex embedded in lock_sys.rec_hash, lock_sys.prdt_hash, or lock_sys.prdt_page_hash.
* | | | | MDEV-20612 preparation: LockMutexGuardMarko Mäkelä2021-02-111-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Let us use the RAII wrapper LockMutexGuard for most operations where lock_sys.mutex is acquired.
* | | | | MDEV-20612 preparation: Fewer calls to buf_page_t::id()Marko Mäkelä2021-02-111-1/+2
| | | | |
* | | | | MDEV-24569: Merge 10.5 into 10.6Marko Mäkelä2021-01-141-13/+34
|\ \ \ \ \ | |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fseg_page_is_free(): Because MDEV-24167 changed fil_space_t::latch to a simple non-recursive rw-lock, we must avoid acquiring a shared latch if the current thread already holds an exclusive latch. This affects the test innodb.innodb_bug59733, which is exercising the change buffer. fil_space_t::is_owner(): Make available in non-debug builds.
| * | | | MDEV-24569: Change buffer merge modifies freed pageMarko Mäkelä2021-01-141-13/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Starting with MDEV-15528, we will avoid writing freed pages back to data files. During stress testing, the assertion mach_read_from_4(frame + 4U) == block.page.id().page_no() failed in log_phys_t::apply(), because before the server was killed and restarted, change buffer merge was executed on a previously freed page. During recovery, the assertion would fail when InnoDB would apply a FREE_PAGE and a WRITE record to the page. ibuf_merge_or_delete_for_page(): Before applying any changes, check whether the secondary index leaf page has already been freed according to the allocation bitmap page. If it is freed, then we must reset the bits in change buffer bitmap page. ibuf_reset_bitmap(): Auxiliary function for repeated code. mtr_t::sx_lock_space(): Replaces mtr_t::x_lock_space(). Due to the lazy approach of the change buffer, The function ibuf_merge_or_delete_for_page() may be executed in buf_page_create() while the tablespace is already X-latched. An S-latch must not be acquired while the thread already holds an X-latch, but a redundant SX-latch is fine. The fix was developed by Thirunarayanan Balathandayuthapani.
* | | | | Merge 10.5 into 10.6Marko Mäkelä2020-12-191-1/+1
|\ \ \ \ \ | |/ / / /
| * | | | MDEV-24445 Using innodb_undo_tablespaces corrupts system tablespaceMarko Mäkelä2020-12-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the rewrite of MDEV-8139 (based on MDEV-15528), we introduced a wrong assumption that any persistent tablespace that is not an .ibd file is the system tablespace. This assumption is broken when innodb_undo_tablespaces (files undo001, undo002, ...) are being used. By default, we have innodb_undo_tablespaces=0 (the persistent undo log is being stored in the system tablespace). In MDEV-15528 and MDEV-8139 we rewrote the page scrubbing logic so that it will follow the tried-and-true write-ahead logging protocol, first writing FREE_PAGE records and then in the page flushing, zerofilling or hole-punching freed pages. Unfortunately, the implementation included a wrong assumption that that anything that is not in an .ibd file must be the system tablespace. This wrong assumption would cause overwrites of valid data pages in the system tablespace. mtr_t::m_freed_in_system_tablespace: Remove. mtr_t::m_freed_space: The tablespace associated with m_freed_pages. buf_page_free(): Take the tablespace and page number as a parameter, instead of taking a page identifier.
* | | | | MDEV-21452: Replace ib_mutex_t with mysql_mutex_tMarko Mäkelä2020-12-151-64/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | SHOW ENGINE INNODB MUTEX functionality is completely removed, as are the InnoDB latching order checks. We will enforce innodb_fatal_semaphore_wait_threshold only for dict_sys.mutex and lock_sys.mutex. dict_sys_t::mutex_lock(): A single entry point for dict_sys.mutex. lock_sys_t::mutex_lock(): A single entry point for lock_sys.mutex. FIXME: srv_sys should be removed altogether; it is duplicating tpool functionality. fil_crypt_threads_init(): To prevent SAFE_MUTEX warnings, we must not hold fil_system.mutex. fil_close_all_files(): To prevent SAFE_MUTEX warnings for fil_space_destroy_crypt_data(), we must not hold fil_system.mutex while invoking fil_space_free_low() on a detached tablespace.
* | | | | MDEV-21452: Replace all direct use of os_event_tMarko Mäkelä2020-12-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let us replace os_event_t with mysql_cond_t, and replace the necessary ib_mutex_t with mysql_mutex_t so that they can be used with condition variables. Also, let us replace polling (os_thread_sleep() or timed waits) with plain mysql_cond_wait() wherever possible. Furthermore, we will use the lightweight srw_mutex for trx_t::mutex, to hopefully reduce contention on lock_sys.mutex. FIXME: Add test coverage of mariabackup --backup --kill-long-queries-timeout
* | | | | MDEV-24142: Remove __FILE__,__LINE__ related to buf_block_t::lockMarko Mäkelä2020-12-031-26/+6
| | | | |
* | | | | MDEV-24142: Remove the LatchDebug interface to rw-locksMarko Mäkelä2020-12-031-49/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The latching order checks for rw-locks have not caught many bugs in the past few years and they are greatly complicating the code. Last time the debug checks were useful was in commit 59caf2c3c1fe128d1d2c3a8df9fadd4d25ab7102 (MDEV-13485). The B-tree hang MDEV-14637 was not caught by LatchDebug, because the granularity of the checks is not sufficient to distinguish the levels of non-leaf B-tree pages. The interface was already made dead code by the grandparent commit 03ca6495df31313c96e38834b9a235245e2ae2a8.
* | | | | MDEV-24142: Replace InnoDB rw_lock_t with sux_lockMarko Mäkelä2020-12-031-13/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | InnoDB buffer pool block and index tree latches depend on a special kind of read-update-write lock that allows reentrant (recursive) acquisition of the 'update' and 'write' locks as well as an upgrade from 'update' lock to 'write' lock. The 'update' lock allows any number of reader locks from other threads, but no concurrent 'update' or 'write' lock. If there were no requirement to support an upgrade from 'update' to 'write', we could compose the lock out of two srw_lock (implemented as any type of native rw-lock, such as SRWLOCK on Microsoft Windows). Removing this requirement is very difficult, so in commit f7e7f487d4b06695f91f6fbeb0396b9d87fc7bbf we implemented an 'update' mode to our srw_lock. Re-entrant or recursive locking is mostly needed when writing or freeing BLOB pages, but also in crash recovery or when merging buffered changes to an index page. The re-entrancy allows us to attach a previously acquired page to a sub-mini-transaction that will be committed before whatever else is holding the page latch. The SUX lock supports Shared ('read'), Update, and eXclusive ('write') locking modes. The S latches are not re-entrant, but a single S latch may be acquired even if the thread already holds an U latch. The idea of the U latch is to allow a write of something that concurrent readers do not care about (such as the contents of BTR_SEG_LEAF, BTR_SEG_TOP and other page allocation metadata structures, or the MDEV-6076 PAGE_ROOT_AUTO_INC). (The PAGE_ROOT_AUTO_INC field is only updated when a dict_table_t for the table exists, and only read when a dict_table_t for the table is being added to dict_sys.) block_lock::u_lock_try(bool for_io=true) is used in buf_flush_page() to allow concurrent readers but no concurrent modifications while the page is being written to the data file. That latch will be released by buf_page_write_complete() in a different thread. Hence, we use the special lock owner value FOR_IO. The index_lock::u_lock() improves concurrency on operations that involve non-leaf index pages. The interface has been cleaned up a little. We will use x_lock_recursive() instead of x_lock() when we know that a lock is already held by the current thread. Similarly, a lock upgrade from U to X is only allowed via u_x_upgrade() or x_lock_upgraded() but not via x_lock(). We will disable the LatchDebug and sync_array interfaces to InnoDB rw-locks. The SEMAPHORES section of SHOW ENGINE INNODB STATUS output will no longer include any information about InnoDB rw-locks, only TTASEventMutex (cmake -DMUTEXTYPE=event) waits. This will make a part of the 'innotop' script dead code. The block_lock buf_block_t::lock will not be covered by any PERFORMANCE_SCHEMA instrumentation. SHOW ENGINE INNODB MUTEX and INFORMATION_SCHEMA.INNODB_MUTEXES will no longer output source code file names or line numbers. The dict_index_t::lock will be identified by index and table names, which should be much more useful. PERFORMANCE_SCHEMA is lumping information about all dict_index_t::lock together as event_name='wait/synch/sxlock/innodb/index_tree_rw_lock'. buf_page_free(): Remove the file,line parameters. The sux_lock will not store such diagnostic information. buf_block_dbg_add_level(): Define as empty macro, to be removed in a subsequent commit. Unless the build was configured with cmake -DPLUGIN_PERFSCHEMA=NO the index_lock dict_index_t::lock will be instrumented via PERFORMANCE_SCHEMA. Similar to commit 1669c8890ca2e9092213626e5b047e58ca8b1e77 we will distinguish lock waits by registering shared_lock,exclusive_lock events instead of try_shared_lock,try_exclusive_lock. Actual 'try' operations will not be instrumented at all. rw_lock_list: Remove. After MDEV-24167, this only covered buf_block_t::lock and dict_index_t::lock. We will output their information by traversing buf_pool or dict_sys.
* | | | | MDEV-24167: Replace fil_space::latchMarko Mäkelä2020-11-241-3/+3
|/ / / / | | | | | | | | | | | | | | | | | | | | We must avoid acquiring a latch while we are already holding one. The tablespace latch was being acquired recursively in some operations that allocate or free pages.
* | | | Merge 10.4 into 10.5Marko Mäkelä2020-11-131-52/+31
|\ \ \ \ | |/ / /
| * | | Merge 10.3 into 10.4Marko Mäkelä2020-11-121-65/+40
| |\ \ \ | | |/ /