128962f5ab565a8c3251c7a86a31e072ed288f18
7100 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
f4e39e0ffb |
block: fix conversion of GPT partition name to 7-bit
commit e06472bab2a5393430cc2fbc3211cd3602422c1e upstream. The utf16_le_to_7bit function claims to, naively, convert a UTF-16 string to a 7-bit ASCII string. By naively, we mean that it: * drops the first byte of every character in the original UTF-16 string * checks if all characters are printable, and otherwise replaces them by exclamation mark "!". This means that theoretically, all characters outside the 7-bit ASCII range should be replaced by another character. Examples: * lower-case alpha (ɒ) 0x0252 becomes 0x52 (R) * ligature OE (œ) 0x0153 becomes 0x53 (S) * hangul letter pieup (ㅂ) 0x3142 becomes 0x42 (B) * upper-case gamma (Ɣ) 0x0194 becomes 0x94 (not printable) so gets replaced by "!" The result of this conversion for the GPT partition name is passed to user-space as PARTNAME via udev, which is confusing and feels questionable. However, there is a flaw in the conversion function itself. By dropping one byte of each character and using isprint() to check if the remaining byte corresponds to a printable character, we do not actually guarantee that the resulting character is 7-bit ASCII. This happens because we pass 8-bit characters to isprint(), which in the kernel returns 1 for many values > 0x7f - as defined in ctype.c. This results in many values which should be replaced by "!" to be kept as-is, despite not being valid 7-bit ASCII. Examples: * e with acute accent (é) 0x00E9 becomes 0xE9 - kept as-is because isprint(0xE9) returns 1. * euro sign (€) 0x20AC becomes 0xAC - kept as-is because isprint(0xAC) returns 1. This way has broken pyudev utility[1], fixes it by using a mask of 7 bits instead of 8 bits before calling isprint. Link: https://github.com/pyudev/pyudev/issues/490#issuecomment-2685794648 [1] Link: https://lore.kernel.org/linux-block/4cac90c2-e414-4ebb-ae62-2a4589d9dc6e@canonical.com/ Cc: Mulhern <amulhern@redhat.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: stable@vger.kernel.org Signed-off-by: Olivier Gayot <olivier.gayot@canonical.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20250305022154.3903128-1-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
92527100be |
partitions: mac: fix handling of bogus partition table
commit 80e648042e512d5a767da251d44132553fe04ae0 upstream. Fix several issues in partition probing: - The bailout for a bad partoffset must use put_dev_sector(), since the preceding read_part_sector() succeeded. - If the partition table claims a silly sector size like 0xfff bytes (which results in partition table entries straddling sector boundaries), bail out instead of accessing out-of-bounds memory. - We must not assume that the partition table contains proper NUL termination - use strnlen() and strncmp() instead of strlen() and strcmp(). Cc: stable@vger.kernel.org Signed-off-by: Jann Horn <jannh@google.com> Link: https://lore.kernel.org/r/20250214-partition-mac-v1-1-c1c626dffbd5@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
84671b0630 |
block: don't revert iter for -EIOCBQUEUED
commit b13ee668e8280ca5b07f8ce2846b9957a8a10853 upstream. blkdev_read_iter() has a few odd checks, like gating the position and count adjustment on whether or not the result is bigger-than-or-equal to zero (where bigger than makes more sense), and not checking the return value of blkdev_direct_IO() before doing an iov_iter_revert(). The latter can lead to attempting to revert with a negative value, which when passed to iov_iter_revert() as an unsigned value will lead to throwing a WARN_ON() because unroll is bigger than MAX_RW_COUNT. Be sane and don't revert for -EIOCBQUEUED, like what is done in other spots. Cc: stable@vger.kernel.org Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
993121481b |
blk-cgroup: Fix class @block_class's subsystem refcount leakage
commit d1248436cbef1f924c04255367ff4845ccd9025e upstream.
blkcg_fill_root_iostats() iterates over @block_class's devices by
class_dev_iter_(init|next)(), but does not end iterating with
class_dev_iter_exit(), so causes the class's subsystem refcount leakage.
Fix by ending the iterating with class_dev_iter_exit().
Fixes:
|
||
|
|
a6cfeb1c28 |
partitions: ldm: remove the initial kernel-doc notation
[ Upstream commit e494e451611a3de6ae95f99e8339210c157d70fb ]
Remove the file's first comment describing what the file is.
This comment is not in kernel-doc format so it causes a kernel-doc
warning.
ldm.h:13: warning: expecting prototype for ldm(). Prototype was for _FS_PT_LDM_H_() instead
Fixes:
|
||
|
|
b1e537fa23 |
block: retry call probe after request_module in blk_request_module
[ Upstream commit 457ef47c08d2979f3e59ce66267485c3faed70c8 ] Set kernel config: CONFIG_BLK_DEV_LOOP=m CONFIG_BLK_DEV_LOOP_MIN_COUNT=0 Do latter: mknod loop0 b 7 0 exec 4<> loop0 Before commit |
||
|
|
a99bacb35c |
block: fix integer overflow in BLKSECDISCARD
commit 697ba0b6ec4ae04afb67d3911799b5e2043b4455 upstream.
I independently rediscovered
commit 22d24a544b0d49bbcbd61c8c0eaf77d3c9297155
block: fix overflow in blk_ioctl_discard()
but for secure erase.
Same problem:
uint64_t r[2] = {512, 18446744073709551104ULL};
ioctl(fd, BLKSECDISCARD, r);
will enter near infinite loop inside blkdev_issue_secure_erase():
a.out: attempt to access beyond end of device
loop0: rw=5, sector=3399043073, nr_sectors = 1024 limit=2048
bio_check_eod: 3286214 callbacks suppressed
Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
Link: https://lore.kernel.org/r/9e64057f-650a-46d1-b9f7-34af391536ef@p183
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Rajani Kantha <rajanikantha@engineer.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
|
||
|
|
1364a29b71 |
block: fix uaf for flush rq while iterating tags
commit 3802f73bd80766d70f319658f334754164075bc3 upstream. blk_mq_clear_flush_rq_mapping() is not called during scsi probe, by checking blk_queue_init_done(). However, QUEUE_FLAG_INIT_DONE is cleared in del_gendisk by commit |
||
|
|
be3eed59ac |
block, bfq: fix waker_bfqq UAF after bfq_split_bfqq()
[ Upstream commit fcede1f0a043ccefe9bc6ad57f12718e42f63f1d ] Our syzkaller report a following UAF for v6.6: BUG: KASAN: slab-use-after-free in bfq_init_rq+0x175d/0x17a0 block/bfq-iosched.c:6958 Read of size 8 at addr ffff8881b57147d8 by task fsstress/232726 CPU: 2 PID: 232726 Comm: fsstress Not tainted 6.6.0-g3629d1885222 #39 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106 print_address_description.constprop.0+0x66/0x300 mm/kasan/report.c:364 print_report+0x3e/0x70 mm/kasan/report.c:475 kasan_report+0xb8/0xf0 mm/kasan/report.c:588 hlist_add_head include/linux/list.h:1023 [inline] bfq_init_rq+0x175d/0x17a0 block/bfq-iosched.c:6958 bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271 bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323 blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660 blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143 __submit_bio+0xa0/0x6b0 block/blk-core.c:639 __submit_bio_noacct_mq block/blk-core.c:718 [inline] submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747 submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847 __ext4_read_bh fs/ext4/super.c:205 [inline] ext4_read_bh+0x15e/0x2e0 fs/ext4/super.c:230 __read_extent_tree_block+0x304/0x6f0 fs/ext4/extents.c:567 ext4_find_extent+0x479/0xd20 fs/ext4/extents.c:947 ext4_ext_map_blocks+0x1a3/0x2680 fs/ext4/extents.c:4182 ext4_map_blocks+0x929/0x15a0 fs/ext4/inode.c:660 ext4_iomap_begin_report+0x298/0x480 fs/ext4/inode.c:3569 iomap_iter+0x3dd/0x1010 fs/iomap/iter.c:91 iomap_fiemap+0x1f4/0x360 fs/iomap/fiemap.c:80 ext4_fiemap+0x181/0x210 fs/ext4/extents.c:5051 ioctl_fiemap.isra.0+0x1b4/0x290 fs/ioctl.c:220 do_vfs_ioctl+0x31c/0x11a0 fs/ioctl.c:811 __do_sys_ioctl fs/ioctl.c:869 [inline] __se_sys_ioctl+0xae/0x190 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 Allocated by task 232719: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 __kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook mm/slab.h:768 [inline] slab_alloc_node mm/slub.c:3492 [inline] kmem_cache_alloc_node+0x1b8/0x6f0 mm/slub.c:3537 bfq_get_queue+0x215/0x1f00 block/bfq-iosched.c:5869 bfq_get_bfqq_handle_split+0x167/0x5f0 block/bfq-iosched.c:6776 bfq_init_rq+0x13a4/0x17a0 block/bfq-iosched.c:6938 bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271 bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323 blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660 blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143 __submit_bio+0xa0/0x6b0 block/blk-core.c:639 __submit_bio_noacct_mq block/blk-core.c:718 [inline] submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747 submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847 __ext4_read_bh fs/ext4/super.c:205 [inline] ext4_read_bh_nowait+0x15a/0x240 fs/ext4/super.c:217 ext4_read_bh_lock+0xac/0xd0 fs/ext4/super.c:242 ext4_bread_batch+0x268/0x500 fs/ext4/inode.c:958 __ext4_find_entry+0x448/0x10f0 fs/ext4/namei.c:1671 ext4_lookup_entry fs/ext4/namei.c:1774 [inline] ext4_lookup.part.0+0x359/0x6f0 fs/ext4/namei.c:1842 ext4_lookup+0x72/0x90 fs/ext4/namei.c:1839 __lookup_slow+0x257/0x480 fs/namei.c:1696 lookup_slow fs/namei.c:1713 [inline] walk_component+0x454/0x5c0 fs/namei.c:2004 link_path_walk.part.0+0x773/0xda0 fs/namei.c:2331 link_path_walk fs/namei.c:3826 [inline] path_openat+0x1b9/0x520 fs/namei.c:3826 do_filp_open+0x1b7/0x400 fs/namei.c:3857 do_sys_openat2+0x5dc/0x6e0 fs/open.c:1428 do_sys_open fs/open.c:1443 [inline] __do_sys_openat fs/open.c:1459 [inline] __se_sys_openat fs/open.c:1454 [inline] __x64_sys_openat+0x148/0x200 fs/open.c:1454 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 Freed by task 232726: kasan_save_stack+0x22/0x50 mm/kasan/common.c:45 kasan_set_track+0x25/0x30 mm/kasan/common.c:52 kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522 ____kasan_slab_free mm/kasan/common.c:236 [inline] __kasan_slab_free+0x12a/0x1b0 mm/kasan/common.c:244 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1827 [inline] slab_free_freelist_hook mm/slub.c:1853 [inline] slab_free mm/slub.c:3820 [inline] kmem_cache_free+0x110/0x760 mm/slub.c:3842 bfq_put_queue+0x6a7/0xfb0 block/bfq-iosched.c:5428 bfq_forget_entity block/bfq-wf2q.c:634 [inline] bfq_put_idle_entity+0x142/0x240 block/bfq-wf2q.c:645 bfq_forget_idle+0x189/0x1e0 block/bfq-wf2q.c:671 bfq_update_vtime block/bfq-wf2q.c:1280 [inline] __bfq_lookup_next_entity block/bfq-wf2q.c:1374 [inline] bfq_lookup_next_entity+0x350/0x480 block/bfq-wf2q.c:1433 bfq_update_next_in_service+0x1c0/0x4f0 block/bfq-wf2q.c:128 bfq_deactivate_entity+0x10a/0x240 block/bfq-wf2q.c:1188 bfq_deactivate_bfqq block/bfq-wf2q.c:1592 [inline] bfq_del_bfqq_busy+0x2e8/0xad0 block/bfq-wf2q.c:1659 bfq_release_process_ref+0x1cc/0x220 block/bfq-iosched.c:3139 bfq_split_bfqq+0x481/0xdf0 block/bfq-iosched.c:6754 bfq_init_rq+0xf29/0x17a0 block/bfq-iosched.c:6934 bfq_insert_request.isra.0+0xe8/0xa20 block/bfq-iosched.c:6271 bfq_insert_requests+0x27f/0x390 block/bfq-iosched.c:6323 blk_mq_insert_request+0x290/0x8f0 block/blk-mq.c:2660 blk_mq_submit_bio+0x1021/0x15e0 block/blk-mq.c:3143 __submit_bio+0xa0/0x6b0 block/blk-core.c:639 __submit_bio_noacct_mq block/blk-core.c:718 [inline] submit_bio_noacct_nocheck+0x5b7/0x810 block/blk-core.c:747 submit_bio_noacct+0xca0/0x1990 block/blk-core.c:847 __ext4_read_bh fs/ext4/super.c:205 [inline] ext4_read_bh+0x15e/0x2e0 fs/ext4/super.c:230 __read_extent_tree_block+0x304/0x6f0 fs/ext4/extents.c:567 ext4_find_extent+0x479/0xd20 fs/ext4/extents.c:947 ext4_ext_map_blocks+0x1a3/0x2680 fs/ext4/extents.c:4182 ext4_map_blocks+0x929/0x15a0 fs/ext4/inode.c:660 ext4_iomap_begin_report+0x298/0x480 fs/ext4/inode.c:3569 iomap_iter+0x3dd/0x1010 fs/iomap/iter.c:91 iomap_fiemap+0x1f4/0x360 fs/iomap/fiemap.c:80 ext4_fiemap+0x181/0x210 fs/ext4/extents.c:5051 ioctl_fiemap.isra.0+0x1b4/0x290 fs/ioctl.c:220 do_vfs_ioctl+0x31c/0x11a0 fs/ioctl.c:811 __do_sys_ioctl fs/ioctl.c:869 [inline] __se_sys_ioctl+0xae/0x190 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x70/0x120 arch/x86/entry/common.c:81 entry_SYSCALL_64_after_hwframe+0x78/0xe2 commit 1ba0403ac644 ("block, bfq: fix uaf for accessing waker_bfqq after splitting") fix the problem that if waker_bfqq is in the merge chain, and current is the only procress, waker_bfqq can be freed from bfq_split_bfqq(). However, the case that waker_bfqq is not in the merge chain is missed, and if the procress reference of waker_bfqq is 0, waker_bfqq can be freed as well. Fix the problem by checking procress reference if waker_bfqq is not in the merge_chain. Fixes: 1ba0403ac644 ("block, bfq: fix uaf for accessing waker_bfqq after splitting") Signed-off-by: Hou Tao <houtao1@huawei.com> Signed-off-by: Yu Kuai <yukuai3@huawei.com> Reviewed-by: Jan Kara <jack@suse.cz> Link: https://lore.kernel.org/r/20250108084148.1549973-1-yukuai1@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
ee18012c80 |
block: avoid to reuse hctx not removed from cpuhp callback list
commit 85672ca9ceeaa1dcf2777a7048af5f4aee3fd02b upstream. If the 'hctx' isn't removed from cpuhp callback list, we can't reuse it, otherwise use-after-free may be triggered. Reported-by: kernel test robot <oliver.sang@intel.com> Closes: https://lore.kernel.org/oe-lkp/202412172217.b906db7c-lkp@intel.com Tested-by: kernel test robot <oliver.sang@intel.com> Fixes: 22465bbac53c ("blk-mq: move cpuhp callback registering out of q->sysfs_lock") Signed-off-by: Ming Lei <ming.lei@redhat.com> Link: https://lore.kernel.org/r/20241218101617.3275704-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
58bf93580f |
blk-mq: move cpuhp callback registering out of q->sysfs_lock
[ Upstream commit 22465bbac53c821319089016f268a2437de9b00a ] Registering and unregistering cpuhp callback requires global cpu hotplug lock, which is used everywhere. Meantime q->sysfs_lock is used in block layer almost everywhere. It is easy to trigger lockdep warning[1] by connecting the two locks. Fix the warning by moving blk-mq's cpuhp callback registering out of q->sysfs_lock. Add one dedicated global lock for covering registering & unregistering hctx's cpuhp, and it is safe to do so because hctx is guaranteed to be live if our request_queue is live. [1] https://lore.kernel.org/lkml/Z04pz3AlvI4o0Mr8@agluck-desk3/ Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Reported-by: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-3-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
079fcc926b |
blk-mq: register cpuhp callback after hctx is added to xarray table
[ Upstream commit 4bf485a7db5d82ddd0f3ad2b299893199090375e ] We need to retrieve 'hctx' from xarray table in the cpuhp callback, so the callback should be registered after this 'hctx' is added to xarray table. Cc: Reinette Chatre <reinette.chatre@intel.com> Cc: Fenghua Yu <fenghua.yu@intel.com> Cc: Peter Newman <peternewman@google.com> Cc: Babu Moger <babu.moger@amd.com> Cc: Luck Tony <tony.luck@intel.com> Signed-off-by: Ming Lei <ming.lei@redhat.com> Tested-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/r/20241206111611.978870-2-ming.lei@redhat.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
4a54211845 |
blk-iocost: Avoid using clamp() on inuse in __propagate_weights()
[ Upstream commit 57e420c84f9ab55ba4c5e2ae9c5f6c8e1ea834d2 ]
After a recent change to clamp() and its variants [1] that increases the
coverage of the check that high is greater than low because it can be
done through inlining, certain build configurations (such as s390
defconfig) fail to build with clang with:
block/blk-iocost.c:1101:11: error: call to '__compiletime_assert_557' declared with 'error' attribute: clamp() low limit 1 greater than high limit active
1101 | inuse = clamp_t(u32, inuse, 1, active);
| ^
include/linux/minmax.h:218:36: note: expanded from macro 'clamp_t'
218 | #define clamp_t(type, val, lo, hi) __careful_clamp(type, val, lo, hi)
| ^
include/linux/minmax.h:195:2: note: expanded from macro '__careful_clamp'
195 | __clamp_once(type, val, lo, hi, __UNIQUE_ID(v_), __UNIQUE_ID(l_), __UNIQUE_ID(h_))
| ^
include/linux/minmax.h:188:2: note: expanded from macro '__clamp_once'
188 | BUILD_BUG_ON_MSG(statically_true(ulo > uhi), \
| ^
__propagate_weights() is called with an active value of zero in
ioc_check_iocgs(), which results in the high value being less than the
low value, which is undefined because the value returned depends on the
order of the comparisons.
The purpose of this expression is to ensure inuse is not more than
active and at least 1. This could be written more simply with a ternary
expression that uses min(inuse, active) as the condition so that the
value of that condition can be used if it is not zero and one if it is.
Do this conversion to resolve the error and add a comment to deter
people from turning this back into clamp().
Fixes:
|
||
|
|
5baa28569c |
blk-cgroup: Fix UAF in blkcg_unpin_online()
commit 86e6ca55b83c575ab0f2e105cf08f98e58d3d7af upstream.
blkcg_unpin_online() walks up the blkcg hierarchy putting the online pin. To
walk up, it uses blkcg_parent(blkcg) but it was calling that after
blkcg_destroy_blkgs(blkcg) which could free the blkcg, leading to the
following UAF:
==================================================================
BUG: KASAN: slab-use-after-free in blkcg_unpin_online+0x15a/0x270
Read of size 8 at addr ffff8881057678c0 by task kworker/9:1/117
CPU: 9 UID: 0 PID: 117 Comm: kworker/9:1 Not tainted 6.13.0-rc1-work-00182-gb8f52214c61a-dirty #48
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS unknown 02/02/2022
Workqueue: cgwb_release cgwb_release_workfn
Call Trace:
<TASK>
dump_stack_lvl+0x27/0x80
print_report+0x151/0x710
kasan_report+0xc0/0x100
blkcg_unpin_online+0x15a/0x270
cgwb_release_workfn+0x194/0x480
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
</TASK>
...
Freed by task 1944:
kasan_save_track+0x2b/0x70
kasan_save_free_info+0x3c/0x50
__kasan_slab_free+0x33/0x50
kfree+0x10c/0x330
css_free_rwork_fn+0xe6/0xb30
process_scheduled_works+0x71b/0xe20
worker_thread+0x82a/0xbd0
kthread+0x242/0x2c0
ret_from_fork+0x33/0x70
ret_from_fork_asm+0x1a/0x30
Note that the UAF is not easy to trigger as the free path is indirected
behind a couple RCU grace periods and a work item execution. I could only
trigger it with artifical msleep() injected in blkcg_unpin_online().
Fix it by reading the parent pointer before destroying the blkcg's blkg's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: Abagail ren <renzezhongucas@gmail.com>
Suggested-by: Linus Torvalds <torvalds@linuxfoundation.org>
Fixes:
|
||
|
|
906cdbdd3b |
block, bfq: fix bfqq uaf in bfq_limit_depth()
[ Upstream commit e8b8344de3980709080d86c157d24e7de07d70ad ]
Set new allocated bfqq to bic or remove freed bfqq from bic are both
protected by bfqd->lock, however bfq_limit_depth() is deferencing bfqq
from bic without the lock, this can lead to UAF if the io_context is
shared by multiple tasks.
For example, test bfq with io_uring can trigger following UAF in v6.6:
==================================================================
BUG: KASAN: slab-use-after-free in bfqq_group+0x15/0x50
Call Trace:
<TASK>
dump_stack_lvl+0x47/0x80
print_address_description.constprop.0+0x66/0x300
print_report+0x3e/0x70
kasan_report+0xb4/0xf0
bfqq_group+0x15/0x50
bfqq_request_over_limit+0x130/0x9a0
bfq_limit_depth+0x1b5/0x480
__blk_mq_alloc_requests+0x2b5/0xa00
blk_mq_get_new_requests+0x11d/0x1d0
blk_mq_submit_bio+0x286/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__block_write_full_folio+0x3d0/0x640
writepage_cb+0x3b/0xc0
write_cache_pages+0x254/0x6c0
write_cache_pages+0x254/0x6c0
do_writepages+0x192/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork_asm+0x1b/0x30
</TASK>
Allocated by task 808602:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
__kasan_slab_alloc+0x83/0x90
kmem_cache_alloc_node+0x1b1/0x6d0
bfq_get_queue+0x138/0xfa0
bfq_get_bfqq_handle_split+0xe3/0x2c0
bfq_init_rq+0x196/0xbb0
bfq_insert_request.isra.0+0xb5/0x480
bfq_insert_requests+0x156/0x180
blk_mq_insert_request+0x15d/0x440
blk_mq_submit_bio+0x8a4/0xb00
submit_bio_noacct_nocheck+0x331/0x400
__blkdev_direct_IO_async+0x2dd/0x330
blkdev_write_iter+0x39a/0x450
io_write+0x22a/0x840
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Freed by task 808589:
kasan_save_stack+0x1e/0x40
kasan_set_track+0x21/0x30
kasan_save_free_info+0x27/0x40
__kasan_slab_free+0x126/0x1b0
kmem_cache_free+0x10c/0x750
bfq_put_queue+0x2dd/0x770
__bfq_insert_request.isra.0+0x155/0x7a0
bfq_insert_request.isra.0+0x122/0x480
bfq_insert_requests+0x156/0x180
blk_mq_dispatch_plug_list+0x528/0x7e0
blk_mq_flush_plug_list.part.0+0xe5/0x590
__blk_flush_plug+0x3b/0x90
blk_finish_plug+0x40/0x60
do_writepages+0x19d/0x310
filemap_fdatawrite_wbc+0x95/0xc0
__filemap_fdatawrite_range+0x99/0xd0
filemap_write_and_wait_range.part.0+0x4d/0xa0
blkdev_read_iter+0xef/0x1e0
io_read+0x1b6/0x8a0
io_issue_sqe+0x87/0x300
io_wq_submit_work+0xeb/0x390
io_worker_handle_work+0x24d/0x550
io_wq_worker+0x27f/0x6c0
ret_from_fork+0x2d/0x50
ret_from_fork_asm+0x1b/0x30
Fix the problem by protecting bic_to_bfqq() with bfqd->lock.
CC: Jan Kara <jack@suse.cz>
Fixes:
|
||
|
|
68a69ed52a |
blk-mq: Make blk_mq_quiesce_tagset() hold the tag list mutex less long
commit ccd9e252c515ac5a3ed04a414c95d1307d17f159 upstream.
Make sure that the tag_list_lock mutex is not held any longer than
necessary. This change reduces latency if e.g. blk_mq_quiesce_tagset()
is called concurrently from more than one thread. This function is used
by the NVMe core and also by the UFS driver.
Reported-by: Peter Wang <peter.wang@mediatek.com>
Cc: Chao Leng <lengchao@huawei.com>
Cc: Ming Lei <ming.lei@redhat.com>
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
e95080fba1 |
block: fix ordering between checking BLK_MQ_S_STOPPED request adding
commit 96a9fe64bfd486ebeeacf1e6011801ffe89dae18 upstream.
Supposing first scenario with a virtio_blk driver.
CPU0 CPU1
blk_mq_try_issue_directly()
__blk_mq_issue_directly()
q->mq_ops->queue_rq()
virtio_queue_rq()
blk_mq_stop_hw_queue()
virtblk_done()
blk_mq_request_bypass_insert() 1) store
blk_mq_start_stopped_hw_queue()
clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending())
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
return
__blk_mq_sched_dispatch_requests()
Supposing another scenario.
CPU0 CPU1
blk_mq_requeue_work()
blk_mq_insert_request() 1) store
virtblk_done()
blk_mq_start_stopped_hw_queue()
blk_mq_run_hw_queues() clear_bit(BLK_MQ_S_STOPPED) 3) store
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_sched_dispatch_requests()
if (blk_mq_hctx_stopped()) 2) load
continue
blk_mq_run_hw_queue()
Both scenarios are similar, the full memory barrier should be inserted
between 1) and 2), as well as between 3) and 4) to make sure that either
CPU0 sees BLK_MQ_S_STOPPED is cleared or CPU1 sees dispatch list.
Otherwise, either CPU will not rerun the hardware queue causing
starvation of the request.
The easy way to fix it is to add the essential full memory barrier into
helper of blk_mq_hctx_stopped(). In order to not affect the fast path
(hardware queue is not stopped most of the time), we only insert the
barrier into the slow path. Actually, only slow path needs to care about
missing of dispatching the request to the low-level device driver.
Fixes:
|
||
|
|
679b1874eb |
block: fix ordering between checking QUEUE_FLAG_QUIESCED request adding
commit 6bda857bcbb86fb9d0e54fbef93a093d51172acc upstream.
Supposing the following scenario.
CPU0 CPU1
blk_mq_insert_request() 1) store
blk_mq_unquiesce_queue()
blk_queue_flag_clear() 3) store
blk_mq_run_hw_queues()
blk_mq_run_hw_queue()
if (!blk_mq_hctx_has_pending()) 4) load
return
blk_mq_run_hw_queue()
if (blk_queue_quiesced()) 2) load
return
blk_mq_sched_dispatch_requests()
The full memory barrier should be inserted between 1) and 2), as well as
between 3) and 4) to make sure that either CPU0 sees QUEUE_FLAG_QUIESCED
is cleared or CPU1 sees dispatch list or setting of bitmap of software
queue. Otherwise, either CPU will not rerun the hardware queue causing
starvation.
So the first solution is to 1) add a pair of memory barrier to fix the
problem, another solution is to 2) use hctx->queue->queue_lock to
synchronize QUEUE_FLAG_QUIESCED. Here, we chose 2) to fix it since
memory barrier is not easy to be maintained.
Fixes:
|
||
|
|
fe0d9800ea |
block: fix missing dispatching request when queue is started or unquiesced
commit 2003ee8a9aa14d766b06088156978d53c2e9be3d upstream.
Supposing the following scenario with a virtio_blk driver.
CPU0 CPU1 CPU2
blk_mq_try_issue_directly()
__blk_mq_issue_directly()
q->mq_ops->queue_rq()
virtio_queue_rq()
blk_mq_stop_hw_queue()
virtblk_done()
blk_mq_try_issue_directly()
if (blk_mq_hctx_stopped())
blk_mq_request_bypass_insert() blk_mq_run_hw_queue()
blk_mq_run_hw_queue() blk_mq_run_hw_queue()
blk_mq_insert_request()
return
After CPU0 has marked the queue as stopped, CPU1 will see the queue is
stopped. But before CPU1 puts the request on the dispatch list, CPU2
receives the interrupt of completion of request, so it will run the
hardware queue and marks the queue as non-stopped. Meanwhile, CPU1 also
runs the same hardware queue. After both CPU1 and CPU2 complete
blk_mq_run_hw_queue(), CPU1 just puts the request to the same hardware
queue and returns. It misses dispatching a request. Fix it by running
the hardware queue explicitly. And blk_mq_request_issue_directly()
should handle a similar situation. Fix it as well.
Fixes:
|
||
|
|
fad4262bd4 |
block: fix bio_split_rw_at to take zone_write_granularity into account
[ Upstream commit 7ecd2cd4fae3e8410c0a6620f3a83dcdbb254f02 ]
Otherwise it can create unaligned writes on zoned devices.
Fixes:
|
||
|
|
f49a9d86c4 |
block: Fix elevator_get_default() checking for NULL q->tag_set
[ Upstream commit b402328a24ee7193a8ab84277c0c90ae16768126 ] elevator_get_default() and elv_support_iosched() both check for whether or not q->tag_set is non-NULL, however it's not possible for them to be NULL. This messes up some static checkers, as the checking of tag_set isn't consistent. Remove the checks, which both simplifies the logic and avoids checker errors. Signed-off-by: SurajSonawane2415 <surajsonawane0215@gmail.com> Link: https://lore.kernel.org/r/20241007111416.13814-1-surajsonawane0215@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
b3c301b859 |
block: fix sanity checks in blk_rq_map_user_bvec
[ Upstream commit 2ff949441802a8d076d9013c7761f63e8ae5a9bd ]
blk_rq_map_user_bvec contains a check bytes + bv->bv_len > nr_iter which
causes unnecessary failures in NVMe passthrough I/O, reproducible as
follows:
- register a 2 page, page-aligned buffer against a ring
- use that buffer to do a 1 page io_uring NVMe passthrough read
The second (i = 1) iteration of the loop in blk_rq_map_user_bvec will
then have nr_iter == 1 page, bytes == 1 page, bv->bv_len == 1 page, so
the check bytes + bv->bv_len > nr_iter will succeed, causing the I/O to
fail. This failure is unnecessary, as when the check succeeds, it means
we've checked the entire buffer that will be used by the request - i.e.
blk_rq_map_user_bvec should complete successfully. Therefore, terminate
the loop early and return successfully when the check bytes + bv->bv_len
> nr_iter succeeds.
While we're at it, also remove the check that all segments in the bvec
are single-page. While this seems to be true for all users of the
function, it doesn't appear to be required anywhere downstream.
CC: stable@vger.kernel.org
Signed-off-by: Xinyu Zhang <xizhang@purestorage.com>
Co-developed-by: Uday Shankar <ushankar@purestorage.com>
Signed-off-by: Uday Shankar <ushankar@purestorage.com>
Fixes:
|
||
|
|
4c5b123ab2 |
blk-rq-qos: fix crash on rq_qos_wait vs. rq_qos_wake_function race
commit e972b08b91ef48488bae9789f03cfedb148667fb upstream.
We're seeing crashes from rq_qos_wake_function that look like this:
BUG: unable to handle page fault for address: ffffafe180a40084
#PF: supervisor write access in kernel mode
#PF: error_code(0x0002) - not-present page
PGD 100000067 P4D 100000067 PUD 10027c067 PMD 10115d067 PTE 0
Oops: Oops: 0002 [#1] PREEMPT SMP PTI
CPU: 17 UID: 0 PID: 0 Comm: swapper/17 Not tainted 6.12.0-rc3-00013-geca631b8fe80 #11
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
RIP: 0010:_raw_spin_lock_irqsave+0x1d/0x40
Code: 90 90 90 90 90 90 90 90 90 90 90 90 90 f3 0f 1e fa 0f 1f 44 00 00 41 54 9c 41 5c fa 65 ff 05 62 97 30 4c 31 c0 ba 01 00 00 00 <f0> 0f b1 17 75 0a 4c 89 e0 41 5c c3 cc cc cc cc 89 c6 e8 2c 0b 00
RSP: 0018:ffffafe180580ca0 EFLAGS: 00010046
RAX: 0000000000000000 RBX: ffffafe180a3f7a8 RCX: 0000000000000011
RDX: 0000000000000001 RSI: 0000000000000003 RDI: ffffafe180a40084
RBP: 0000000000000000 R08: 00000000001e7240 R09: 0000000000000011
R10: 0000000000000028 R11: 0000000000000888 R12: 0000000000000002
R13: ffffafe180a40084 R14: 0000000000000000 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff9aaf1f280000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: ffffafe180a40084 CR3: 000000010e428002 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<IRQ>
try_to_wake_up+0x5a/0x6a0
rq_qos_wake_function+0x71/0x80
__wake_up_common+0x75/0xa0
__wake_up+0x36/0x60
scale_up.part.0+0x50/0x110
wb_timer_fn+0x227/0x450
...
So rq_qos_wake_function() calls wake_up_process(data->task), which calls
try_to_wake_up(), which faults in raw_spin_lock_irqsave(&p->pi_lock).
p comes from data->task, and data comes from the waitqueue entry, which
is stored on the waiter's stack in rq_qos_wait(). Analyzing the core
dump with drgn, I found that the waiter had already woken up and moved
on to a completely unrelated code path, clobbering what was previously
data->task. Meanwhile, the waker was passing the clobbered garbage in
data->task to wake_up_process(), leading to the crash.
What's happening is that in between rq_qos_wake_function() deleting the
waitqueue entry and calling wake_up_process(), rq_qos_wait() is finding
that it already got a token and returning. The race looks like this:
rq_qos_wait() rq_qos_wake_function()
==============================================================
prepare_to_wait_exclusive()
data->got_token = true;
list_del_init(&curr->entry);
if (data.got_token)
break;
finish_wait(&rqw->wait, &data.wq);
^- returns immediately because
list_empty_careful(&wq_entry->entry)
is true
... return, go do something else ...
wake_up_process(data->task)
(NO LONGER VALID!)-^
Normally, finish_wait() is supposed to synchronize against the waker.
But, as noted above, it is returning immediately because the waitqueue
entry has already been removed from the waitqueue.
The bug is that rq_qos_wake_function() is accessing the waitqueue entry
AFTER deleting it. Note that autoremove_wake_function() wakes the waiter
and THEN deletes the waitqueue entry, which is the proper order.
Fix it by swapping the order. We also need to use
list_del_init_careful() to match the list_empty_careful() in
finish_wait().
Fixes:
|
||
|
|
1ab2cfe197 |
blk_iocost: fix more out of bound shifts
[ Upstream commit 9bce8005ec0dcb23a58300e8522fe4a31da606fa ] Recently running UBSAN caught few out of bound shifts in the ioc_forgive_debts() function: UBSAN: shift-out-of-bounds in block/blk-iocost.c:2142:38 shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... UBSAN: shift-out-of-bounds in block/blk-iocost.c:2144:30 shift exponent 80 is too large for 64-bit type 'u64' (aka 'unsigned long long') ... Call Trace: <IRQ> dump_stack_lvl+0xca/0x130 __ubsan_handle_shift_out_of_bounds+0x22c/0x280 ? __lock_acquire+0x6441/0x7c10 ioc_timer_fn+0x6cec/0x7750 ? blk_iocost_init+0x720/0x720 ? call_timer_fn+0x5d/0x470 call_timer_fn+0xfa/0x470 ? blk_iocost_init+0x720/0x720 __run_timer_base+0x519/0x700 ... Actual impact of this issue was not identified but I propose to fix the undefined behaviour. The proposed fix to prevent those out of bound shifts consist of precalculating exponent before using it the shift operations by taking min value from the actual exponent and maximum possible number of bits. Reported-by: Breno Leitao <leitao@debian.org> Signed-off-by: Konstantin Ovsepian <ovs@ovs.to> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240822154137.2627818-1-ovs@ovs.to Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
80f5bfbb80 |
block: fix potential invalid pointer dereference in blk_add_partition
[ Upstream commit 26e197b7f9240a4ac301dd0ad520c0c697c2ea7d ]
The blk_add_partition() function initially used a single if-condition
(IS_ERR(part)) to check for errors when adding a partition. This was
modified to handle the specific case of -ENXIO separately, allowing the
function to proceed without logging the error in this case. However,
this change unintentionally left a path where md_autodetect_dev()
could be called without confirming that part is a valid pointer.
This commit separates the error handling logic by splitting the
initial if-condition, improving code readability and handling specific
error scenarios explicitly. The function now distinguishes the general
error case from -ENXIO without altering the existing behavior of
md_autodetect_dev() calls.
Fixes:
|
||
|
|
0d7ddfc892 |
block: print symbolic error name instead of error code
[ Upstream commit 25c1772a0493463408489b1fae65cf77fe46cac1 ] Utilize the %pe print specifier to get the symbolic error name as a string (i.e "-ENOMEM") in the log message instead of the error code to increase its readablility. This change was suggested in https://lore.kernel.org/all/92972476-0b1f-4d0a-9951-af3fc8bc6e65@suswa.mountain/ Signed-off-by: Christian Heusel <christian@heusel.eu> Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com> Link: https://lore.kernel.org/r/20240111231521.1596838-1-christian@heusel.eu Signed-off-by: Jens Axboe <axboe@kernel.dk> Stable-dep-of: 26e197b7f924 ("block: fix potential invalid pointer dereference in blk_add_partition") Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
c3eba0a4e9 |
block, bfq: fix procress reference leakage for bfqq in merge chain
[ Upstream commit 73aeab373557fa6ee4ae0b742c6211ccd9859280 ]
Original state:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
After commit 0e456dba86c7 ("block, bfq: choose the last bfqq from merge
chain in bfq_setup_cooperator()"), if P1 issues a new IO:
Without the patch:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\------------------------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
bfqq3 will be used to handle IO from P1, this is not expected, IO
should be redirected to bfqq4;
With the patch:
-------------------------------------------
| |
Process 1 Process 2 Process 3 | Process 4
(BIC1) (BIC2) (BIC3) | (BIC4)
| | | |
\-------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 2 4
IO is redirected to bfqq4, however, procress reference of bfqq3 is still
2, while there is only P2 using it.
Fix the problem by calling bfq_merge_bfqqs() for each bfqq in the merge
chain. Also change bfqq_merge_bfqqs() to return new_bfqq to simplify
code.
Fixes: 0e456dba86c7 ("block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-3-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
0780451f03 |
block, bfq: fix uaf for accessing waker_bfqq after splitting
[ Upstream commit 1ba0403ac6447f2d63914fb760c44a3b19c44eaf ]
After commit 42c306ed7233 ("block, bfq: don't break merge chain in
bfq_split_bfqq()"), if the current procress is the last holder of bfqq,
the bfqq can be freed after bfq_split_bfqq(). Hence recored the bfqq and
then access bfqq->waker_bfqq may trigger UAF. What's more, the waker_bfqq
may in the merge chain of bfqq, hence just recored waker_bfqq is still
not safe.
Fix the problem by adding a helper bfq_waker_bfqq() to check if
bfqq->waker_bfqq is in the merge chain, and current procress is the only
holder.
Fixes: 42c306ed7233 ("block, bfq: don't break merge chain in bfq_split_bfqq()")
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20240909134154.954924-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
|
||
|
|
19f3bec2ac |
block, bfq: don't break merge chain in bfq_split_bfqq()
[ Upstream commit 42c306ed723321af4003b2a41bb73728cab54f85 ]
Consider the following scenario:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\-------------\ \-------------\ \--------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 1 2 4
If Process 1 issue a new IO and bfqq2 is found, and then bfq_init_rq()
decide to spilt bfqq2 by bfq_split_bfqq(). Howerver, procress reference
of bfqq2 is 1 and bfq_split_bfqq() just clear the coop flag, which will
break the merge chain.
Expected result: caller will allocate a new bfqq for BIC1
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
| | |
\-------------\ \--------------\|
V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
ref 0 0 1 3
Since the condition is only used for the last bfqq4 when the previous
bfqq2 and bfqq3 are already splited. Fix the problem by checking if
bfqq is the last one in the merge chain as well.
Fixes:
|
||
|
|
e50c9a3526 |
block, bfq: choose the last bfqq from merge chain in bfq_setup_cooperator()
[ Upstream commit 0e456dba86c7f9a19792204a044835f1ca2c8dbb ]
Consider the following merge chain:
Process 1 Process 2 Process 3 Process 4
(BIC1) (BIC2) (BIC3) (BIC4)
Λ | | |
\--------------\ \-------------\ \-------------\|
V V V
bfqq1--------->bfqq2---------->bfqq3----------->bfqq4
IO from Process 1 will get bfqf2 from BIC1 first, then
bfq_setup_cooperator() will found bfqq2 already merged to bfqq3 and then
handle this IO from bfqq3. However, the merge chain can be much deeper
and bfqq3 can be merged to other bfqq as well.
Fix this problem by iterating to the last bfqq in
bfq_setup_cooperator().
Fixes:
|
||
|
|
7faed2896d |
block, bfq: fix possible UAF for bfqq->bic with merge chain
[ Upstream commit 18ad4df091dd5d067d2faa8fce1180b79f7041a7 ]
1) initial state, three tasks:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
| Λ | Λ | Λ
| | | | | |
V | V | V |
bfqq1 bfqq2 bfqq3
process ref: 1 1 1
2) bfqq1 merged to bfqq2:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
| | | Λ
\--------------\| | |
V V |
bfqq1--------->bfqq2 bfqq3
process ref: 0 2 1
3) bfqq2 merged to bfqq3:
Process 1 Process 2 Process 3
(BIC1) (BIC2) (BIC3)
here -> Λ | |
\--------------\ \-------------\|
V V
bfqq1--------->bfqq2---------->bfqq3
process ref: 0 1 3
In this case, IO from Process 1 will get bfqq2 from BIC1 first, and then
get bfqq3 through merge chain, and finially handle IO by bfqq3.
Howerver, current code will think bfqq2 is owned by BIC1, like initial
state, and set bfqq2->bic to BIC1.
bfq_insert_request
-> by Process 1
bfqq = bfq_init_rq(rq)
bfqq = bfq_get_bfqq_handle_split
bfqq = bic_to_bfqq
-> get bfqq2 from BIC1
bfqq->ref++
rq->elv.priv[0] = bic
rq->elv.priv[1] = bfqq
if (bfqq_process_refs(bfqq) == 1)
bfqq->bic = bic
-> record BIC1 to bfqq2
__bfq_insert_request
new_bfqq = bfq_setup_cooperator
-> get bfqq3 from bfqq2->new_bfqq
bfqq_request_freed(bfqq)
new_bfqq->ref++
rq->elv.priv[1] = new_bfqq
-> handle IO by bfqq3
Fix the problem by checking bfqq is from merge chain fist. And this
might fix a following problem reported by our syzkaller(unreproducible):
==================================================================
BUG: KASAN: slab-use-after-free in bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
BUG: KASAN: slab-use-after-free in bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
BUG: KASAN: slab-use-after-free in bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
Write of size 1 at addr ffff888123839eb8 by task kworker/0:1H/18595
CPU: 0 PID: 18595 Comm: kworker/0:1H Tainted: G L 6.6.0-07439-gba2303cacfda #6
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
Workqueue: kblockd blk_mq_requeue_work
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x91/0xf0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x10d/0x610 mm/kasan/report.c:475
kasan_report+0x8e/0xc0 mm/kasan/report.c:588
bfq_do_early_stable_merge block/bfq-iosched.c:5692 [inline]
bfq_do_or_sched_stable_merge block/bfq-iosched.c:5805 [inline]
bfq_get_queue+0x25b0/0x2610 block/bfq-iosched.c:5889
bfq_get_bfqq_handle_split+0x169/0x5d0 block/bfq-iosched.c:6757
bfq_init_rq block/bfq-iosched.c:6876 [inline]
bfq_insert_request block/bfq-iosched.c:6254 [inline]
bfq_insert_requests+0x1112/0x5cf0 block/bfq-iosched.c:6304
blk_mq_insert_request+0x290/0x8d0 block/blk-mq.c:2593
blk_mq_requeue_work+0x6bc/0xa70 block/blk-mq.c:1502
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
</TASK>
Allocated by task 20776:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
__kasan_slab_alloc+0x87/0x90 mm/kasan/common.c:328
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook mm/slab.h:763 [inline]
slab_alloc_node mm/slub.c:3458 [inline]
kmem_cache_alloc_node+0x1a4/0x6f0 mm/slub.c:3503
ioc_create_icq block/blk-ioc.c:370 [inline]
ioc_find_get_icq+0x180/0xaa0 block/blk-ioc.c:436
bfq_prepare_request+0x39/0xf0 block/bfq-iosched.c:6812
blk_mq_rq_ctx_init.isra.7+0x6ac/0xa00 block/blk-mq.c:403
__blk_mq_alloc_requests+0xcc0/0x1070 block/blk-mq.c:517
blk_mq_get_new_requests block/blk-mq.c:2940 [inline]
blk_mq_submit_bio+0x624/0x27c0 block/blk-mq.c:3042
__submit_bio+0x331/0x6f0 block/blk-core.c:624
__submit_bio_noacct_mq block/blk-core.c:703 [inline]
submit_bio_noacct_nocheck+0x816/0xb40 block/blk-core.c:732
submit_bio_noacct+0x7a6/0x1b50 block/blk-core.c:826
xlog_write_iclog+0x7d5/0xa00 fs/xfs/xfs_log.c:1958
xlog_state_release_iclog+0x3b8/0x720 fs/xfs/xfs_log.c:619
xlog_cil_push_work+0x19c5/0x2270 fs/xfs/xfs_log_cil.c:1330
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
Freed by task 946:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
kasan_set_track+0x25/0x30 mm/kasan/common.c:52
kasan_save_free_info+0x2b/0x50 mm/kasan/generic.c:522
____kasan_slab_free mm/kasan/common.c:236 [inline]
__kasan_slab_free+0x12c/0x1c0 mm/kasan/common.c:244
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1815 [inline]
slab_free_freelist_hook mm/slub.c:1841 [inline]
slab_free mm/slub.c:3786 [inline]
kmem_cache_free+0x118/0x6f0 mm/slub.c:3808
rcu_do_batch+0x35c/0xe30 kernel/rcu/tree.c:2189
rcu_core+0x819/0xd90 kernel/rcu/tree.c:2462
__do_softirq+0x1b0/0x7a2 kernel/softirq.c:553
Last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
Second to last potentially related work creation:
kasan_save_stack+0x20/0x40 mm/kasan/common.c:45
__kasan_record_aux_stack+0xaf/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2712 [inline]
call_rcu+0xce/0x1020 kernel/rcu/tree.c:2826
ioc_destroy_icq+0x54c/0x830 block/blk-ioc.c:105
ioc_release_fn+0xf0/0x360 block/blk-ioc.c:124
process_one_work kernel/workqueue.c:2627 [inline]
process_scheduled_works+0x432/0x13f0 kernel/workqueue.c:2700
worker_thread+0x6f2/0x1160 kernel/workqueue.c:2781
kthread+0x33c/0x440 kernel/kthread.c:388
ret_from_fork+0x4d/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x1b/0x30 arch/x86/entry/entry_64.S:305
The buggy address belongs to the object at ffff888123839d68
which belongs to the cache bfq_io_cq of size 1360
The buggy address is located 336 bytes inside of
freed 1360-byte region [ffff888123839d68, ffff88812383a2b8)
The buggy address belongs to the physical page:
page:ffffea00048e0e00 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88812383f588 pfn:0x123838
head:ffffea00048e0e00 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0x17ffffc0000a40(workingset|slab|head|node=0|zone=2|lastcpupid=0x1fffff)
page_type: 0xffffffff()
raw: 0017ffffc0000a40 ffff88810588c200 ffffea00048ffa10 ffff888105889488
raw: ffff88812383f588 0000000000150006 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
Memory state around the buggy address:
ffff888123839d80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839e00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888123839e80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888123839f00: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888123839f80: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
==================================================================
Fixes:
|
||
|
|
c20e89c96f |
block: Fix where bio IO priority gets set
[ Upstream commit f3c89983cb4fc00be64eb0d5cbcfcdf2cacb965e ] Commit |
||
|
|
2e91ea2962 |
block: remove the blk_flush_integrity call in blk_integrity_unregister
[ Upstream commit e8bc14d116aeac8f0f133ec8d249acf4e0658da7 ]
Now that there are no indirect calls for PI processing there is no
way to dereference a NULL pointer here. Additionally drivers now always
freeze the queue (or in case of stacking drivers use their internal
equivalent) around changing the integrity profile.
This is effectively a revert of commit
|
||
|
|
edb39f621b |
block: Fix lockdep warning in blk_mq_mark_tag_wait
[ Upstream commit b313a8c835516bdda85025500be866ac8a74e022 ]
Lockdep reported a warning in Linux version 6.6:
[ 414.344659] ================================
[ 414.345155] WARNING: inconsistent lock state
[ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[ 414.346221] --------------------------------
[ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.351204] {IN-SOFTIRQ-W} state was registered at:
[ 414.351751] lock_acquire+0x18d/0x460
[ 414.352218] _raw_spin_lock_irqsave+0x39/0x60
[ 414.352769] __wake_up_common_lock+0x22/0x60
[ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0
[ 414.353829] sbitmap_queue_clear+0xdd/0x270
[ 414.354338] blk_mq_put_tag+0xdf/0x170
[ 414.354807] __blk_mq_free_request+0x381/0x4d0
[ 414.355335] blk_mq_free_request+0x28b/0x3e0
[ 414.355847] __blk_mq_end_request+0x242/0xc30
[ 414.356367] scsi_end_request+0x2c1/0x830
[ 414.345155] WARNING: inconsistent lock state
[ 414.345658] 6.6.0-07439-gba2303cacfda #6 Not tainted
[ 414.346221] --------------------------------
[ 414.346712] inconsistent {IN-SOFTIRQ-W} -> {SOFTIRQ-ON-W} usage.
[ 414.347545] kworker/u10:3/1152 [HC0[0]:SC0[0]:HE0:SE1] takes:
[ 414.349245] ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.351204] {IN-SOFTIRQ-W} state was registered at:
[ 414.351751] lock_acquire+0x18d/0x460
[ 414.352218] _raw_spin_lock_irqsave+0x39/0x60
[ 414.352769] __wake_up_common_lock+0x22/0x60
[ 414.353289] sbitmap_queue_wake_up+0x375/0x4f0
[ 414.353829] sbitmap_queue_clear+0xdd/0x270
[ 414.354338] blk_mq_put_tag+0xdf/0x170
[ 414.354807] __blk_mq_free_request+0x381/0x4d0
[ 414.355335] blk_mq_free_request+0x28b/0x3e0
[ 414.355847] __blk_mq_end_request+0x242/0xc30
[ 414.356367] scsi_end_request+0x2c1/0x830
[ 414.356863] scsi_io_completion+0x177/0x1610
[ 414.357379] scsi_complete+0x12f/0x260
[ 414.357856] blk_complete_reqs+0xba/0xf0
[ 414.358338] __do_softirq+0x1b0/0x7a2
[ 414.358796] irq_exit_rcu+0x14b/0x1a0
[ 414.359262] sysvec_call_function_single+0xaf/0xc0
[ 414.359828] asm_sysvec_call_function_single+0x1a/0x20
[ 414.360426] default_idle+0x1e/0x30
[ 414.360873] default_idle_call+0x9b/0x1f0
[ 414.361390] do_idle+0x2d2/0x3e0
[ 414.361819] cpu_startup_entry+0x55/0x60
[ 414.362314] start_secondary+0x235/0x2b0
[ 414.362809] secondary_startup_64_no_verify+0x18f/0x19b
[ 414.363413] irq event stamp: 428794
[ 414.363825] hardirqs last enabled at (428793): [<ffffffff816bfd1c>] ktime_get+0x1dc/0x200
[ 414.364694] hardirqs last disabled at (428794): [<ffffffff85470177>] _raw_spin_lock_irq+0x47/0x50
[ 414.365629] softirqs last enabled at (428444): [<ffffffff85474780>] __do_softirq+0x540/0x7a2
[ 414.366522] softirqs last disabled at (428419): [<ffffffff813f65ab>] irq_exit_rcu+0x14b/0x1a0
[ 414.367425]
other info that might help us debug this:
[ 414.368194] Possible unsafe locking scenario:
[ 414.368900] CPU0
[ 414.369225] ----
[ 414.369548] lock(&sbq->ws[i].wait);
[ 414.370000] <Interrupt>
[ 414.370342] lock(&sbq->ws[i].wait);
[ 414.370802]
*** DEADLOCK ***
[ 414.371569] 5 locks held by kworker/u10:3/1152:
[ 414.372088] #0: ffff88810130e938 ((wq_completion)writeback){+.+.}-{0:0}, at: process_scheduled_works+0x357/0x13f0
[ 414.373180] #1: ffff88810201fdb8 ((work_completion)(&(&wb->dwork)->work)){+.+.}-{0:0}, at: process_scheduled_works+0x3a3/0x13f0
[ 414.374384] #2: ffffffff86ffbdc0 (rcu_read_lock){....}-{1:2}, at: blk_mq_run_hw_queue+0x637/0xa00
[ 414.375342] #3: ffff88810edd1098 (&sbq->ws[i].wait){+.?.}-{2:2}, at: blk_mq_dispatch_rq_list+0x131c/0x1ee0
[ 414.376377] #4: ffff888106205a08 (&hctx->dispatch_wait_lock){+.-.}-{2:2}, at: blk_mq_dispatch_rq_list+0x1337/0x1ee0
[ 414.378607]
stack backtrace:
[ 414.379177] CPU: 0 PID: 1152 Comm: kworker/u10:3 Not tainted 6.6.0-07439-gba2303cacfda #6
[ 414.380032] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[ 414.381177] Workqueue: writeback wb_workfn (flush-253:0)
[ 414.381805] Call Trace:
[ 414.382136] <TASK>
[ 414.382429] dump_stack_lvl+0x91/0xf0
[ 414.382884] mark_lock_irq+0xb3b/0x1260
[ 414.383367] ? __pfx_mark_lock_irq+0x10/0x10
[ 414.383889] ? stack_trace_save+0x8e/0xc0
[ 414.384373] ? __pfx_stack_trace_save+0x10/0x10
[ 414.384903] ? graph_lock+0xcf/0x410
[ 414.385350] ? save_trace+0x3d/0xc70
[ 414.385808] mark_lock.part.20+0x56d/0xa90
[ 414.386317] mark_held_locks+0xb0/0x110
[ 414.386791] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 414.387320] lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.387901] ? _raw_spin_unlock_irq+0x28/0x50
[ 414.388422] trace_hardirqs_on+0x58/0x100
[ 414.388917] _raw_spin_unlock_irq+0x28/0x50
[ 414.389422] __blk_mq_tag_busy+0x1d6/0x2a0
[ 414.389920] __blk_mq_get_driver_tag+0x761/0x9f0
[ 414.390899] blk_mq_dispatch_rq_list+0x1780/0x1ee0
[ 414.391473] ? __pfx_blk_mq_dispatch_rq_list+0x10/0x10
[ 414.392070] ? sbitmap_get+0x2b8/0x450
[ 414.392533] ? __blk_mq_get_driver_tag+0x210/0x9f0
[ 414.393095] __blk_mq_sched_dispatch_requests+0xd99/0x1690
[ 414.393730] ? elv_attempt_insert_merge+0x1b1/0x420
[ 414.394302] ? __pfx___blk_mq_sched_dispatch_requests+0x10/0x10
[ 414.394970] ? lock_acquire+0x18d/0x460
[ 414.395456] ? blk_mq_run_hw_queue+0x637/0xa00
[ 414.395986] ? __pfx_lock_acquire+0x10/0x10
[ 414.396499] blk_mq_sched_dispatch_requests+0x109/0x190
[ 414.397100] blk_mq_run_hw_queue+0x66e/0xa00
[ 414.397616] blk_mq_flush_plug_list.part.17+0x614/0x2030
[ 414.398244] ? __pfx_blk_mq_flush_plug_list.part.17+0x10/0x10
[ 414.398897] ? writeback_sb_inodes+0x241/0xcc0
[ 414.399429] blk_mq_flush_plug_list+0x65/0x80
[ 414.399957] __blk_flush_plug+0x2f1/0x530
[ 414.400458] ? __pfx___blk_flush_plug+0x10/0x10
[ 414.400999] blk_finish_plug+0x59/0xa0
[ 414.401467] wb_writeback+0x7cc/0x920
[ 414.401935] ? __pfx_wb_writeback+0x10/0x10
[ 414.402442] ? mark_held_locks+0xb0/0x110
[ 414.402931] ? __pfx_do_raw_spin_lock+0x10/0x10
[ 414.403462] ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.404062] wb_workfn+0x2b3/0xcf0
[ 414.404500] ? __pfx_wb_workfn+0x10/0x10
[ 414.404989] process_scheduled_works+0x432/0x13f0
[ 414.405546] ? __pfx_process_scheduled_works+0x10/0x10
[ 414.406139] ? do_raw_spin_lock+0x101/0x2a0
[ 414.406641] ? assign_work+0x19b/0x240
[ 414.407106] ? lock_is_held_type+0x9d/0x110
[ 414.407604] worker_thread+0x6f2/0x1160
[ 414.408075] ? __kthread_parkme+0x62/0x210
[ 414.408572] ? lockdep_hardirqs_on_prepare+0x297/0x3f0
[ 414.409168] ? __kthread_parkme+0x13c/0x210
[ 414.409678] ? __pfx_worker_thread+0x10/0x10
[ 414.410191] kthread+0x33c/0x440
[ 414.410602] ? __pfx_kthread+0x10/0x10
[ 414.411068] ret_from_fork+0x4d/0x80
[ 414.411526] ? __pfx_kthread+0x10/0x10
[ 414.411993] ret_from_fork_asm+0x1b/0x30
[ 414.412489] </TASK>
When interrupt is turned on while a lock holding by spin_lock_irq it
throws a warning because of potential deadlock.
blk_mq_prep_dispatch_rq
blk_mq_get_driver_tag
__blk_mq_get_driver_tag
__blk_mq_alloc_driver_tag
blk_mq_tag_busy -> tag is already busy
// failed to get driver tag
blk_mq_mark_tag_wait
spin_lock_irq(&wq->lock) -> lock A (&sbq->ws[i].wait)
__add_wait_queue(wq, wait) -> wait queue active
blk_mq_get_driver_tag
__blk_mq_tag_busy
-> 1) tag must be idle, which means there can't be inflight IO
spin_lock_irq(&tags->lock) -> lock B (hctx->tags)
spin_unlock_irq(&tags->lock) -> unlock B, turn on interrupt accidentally
-> 2) context must be preempt by IO interrupt to trigger deadlock.
As shown above, the deadlock is not possible in theory, but the warning
still need to be fixed.
Fix it by using spin_lock_irqsave to get lockB instead of spin_lock_irq.
Fixes:
|
||
|
|
5a5625a83e |
block: fix deadlock between sd_remove & sd_release
commit 7e04da2dc7013af50ed3a2beb698d5168d1e594b upstream.
Our test report the following hung task:
[ 2538.459400] INFO: task "kworker/0:0":7 blocked for more than 188 seconds.
[ 2538.459427] Call trace:
[ 2538.459430] __switch_to+0x174/0x338
[ 2538.459436] __schedule+0x628/0x9c4
[ 2538.459442] schedule+0x7c/0xe8
[ 2538.459447] schedule_preempt_disabled+0x24/0x40
[ 2538.459453] __mutex_lock+0x3ec/0xf04
[ 2538.459456] __mutex_lock_slowpath+0x14/0x24
[ 2538.459459] mutex_lock+0x30/0xd8
[ 2538.459462] del_gendisk+0xdc/0x350
[ 2538.459466] sd_remove+0x30/0x60
[ 2538.459470] device_release_driver_internal+0x1c4/0x2c4
[ 2538.459474] device_release_driver+0x18/0x28
[ 2538.459478] bus_remove_device+0x15c/0x174
[ 2538.459483] device_del+0x1d0/0x358
[ 2538.459488] __scsi_remove_device+0xa8/0x198
[ 2538.459493] scsi_forget_host+0x50/0x70
[ 2538.459497] scsi_remove_host+0x80/0x180
[ 2538.459502] usb_stor_disconnect+0x68/0xf4
[ 2538.459506] usb_unbind_interface+0xd4/0x280
[ 2538.459510] device_release_driver_internal+0x1c4/0x2c4
[ 2538.459514] device_release_driver+0x18/0x28
[ 2538.459518] bus_remove_device+0x15c/0x174
[ 2538.459523] device_del+0x1d0/0x358
[ 2538.459528] usb_disable_device+0x84/0x194
[ 2538.459532] usb_disconnect+0xec/0x300
[ 2538.459537] hub_event+0xb80/0x1870
[ 2538.459541] process_scheduled_works+0x248/0x4dc
[ 2538.459545] worker_thread+0x244/0x334
[ 2538.459549] kthread+0x114/0x1bc
[ 2538.461001] INFO: task "fsck.":15415 blocked for more than 188 seconds.
[ 2538.461014] Call trace:
[ 2538.461016] __switch_to+0x174/0x338
[ 2538.461021] __schedule+0x628/0x9c4
[ 2538.461025] schedule+0x7c/0xe8
[ 2538.461030] blk_queue_enter+0xc4/0x160
[ 2538.461034] blk_mq_alloc_request+0x120/0x1d4
[ 2538.461037] scsi_execute_cmd+0x7c/0x23c
[ 2538.461040] ioctl_internal_command+0x5c/0x164
[ 2538.461046] scsi_set_medium_removal+0x5c/0xb0
[ 2538.461051] sd_release+0x50/0x94
[ 2538.461054] blkdev_put+0x190/0x28c
[ 2538.461058] blkdev_release+0x28/0x40
[ 2538.461063] __fput+0xf8/0x2a8
[ 2538.461066] __fput_sync+0x28/0x5c
[ 2538.461070] __arm64_sys_close+0x84/0xe8
[ 2538.461073] invoke_syscall+0x58/0x114
[ 2538.461078] el0_svc_common+0xac/0xe0
[ 2538.461082] do_el0_svc+0x1c/0x28
[ 2538.461087] el0_svc+0x38/0x68
[ 2538.461090] el0t_64_sync_handler+0x68/0xbc
[ 2538.461093] el0t_64_sync+0x1a8/0x1ac
T1: T2:
sd_remove
del_gendisk
__blk_mark_disk_dead
blk_freeze_queue_start
++q->mq_freeze_depth
bdev_release
mutex_lock(&disk->open_mutex)
sd_release
scsi_execute_cmd
blk_queue_enter
wait_event(!q->mq_freeze_depth)
mutex_lock(&disk->open_mutex)
SCSI does not set GD_OWNS_QUEUE, so QUEUE_FLAG_DYING is not set in
this scenario. This is a classic ABBA deadlock. To fix the deadlock,
make sure we don't try to acquire disk->open_mutex after freezing
the queue.
Cc: stable@vger.kernel.org
Fixes:
|
||
|
|
2a6849a2b6 |
block/mq-deadline: Fix the tag reservation code
[ Upstream commit 39823b47bbd40502632ffba90ebb34fff7c8b5e8 ]
The current tag reservation code is based on a misunderstanding of the
meaning of data->shallow_depth. Fix the tag reservation code as follows:
* By default, do not reserve any tags for synchronous requests because
for certain use cases reserving tags reduces performance. See also
Harshit Mogalapalli, [bug-report] Performance regression with fio
sequential-write on a multipath setup, 2024-03-07
(https://lore.kernel.org/linux-block/5ce2ae5d-61e2-4ede-ad55-551112602401@oracle.com/)
* Reduce min_shallow_depth to one because min_shallow_depth must be less
than or equal any shallow_depth value.
* Scale dd->async_depth from the range [1, nr_requests] to [1,
bits_per_sbitmap_word].
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Zhiguo Niu <zhiguo.niu@unisoc.com>
Fixes:
|
||
|
|
b2c67e1f80 |
block: Call .limit_depth() after .hctx has been set
[ Upstream commit 6259151c04d4e0085e00d2dcb471ebdd1778e72e ]
Call .limit_depth() after data->hctx has been set such that data->hctx can
be used in .limit_depth() implementations.
Cc: Christoph Hellwig <hch@lst.de>
Cc: Damien Le Moal <dlemoal@kernel.org>
Cc: Zhiguo Niu <zhiguo.niu@unisoc.com>
Fixes:
|
||
|
|
23a19655fb |
block: initialize integrity buffer to zero before writing it to media
[ Upstream commit 899ee2c3829c5ac14bfc7d3c4a5846c0b709b78f ]
Metadata added by bio_integrity_prep is using plain kmalloc, which leads
to random kernel memory being written media. For PI metadata this is
limited to the app tag that isn't used by kernel generated metadata,
but for non-PI metadata the entire buffer leaks kernel memory.
Fix this by adding the __GFP_ZERO flag to allocations for writes.
Fixes:
|
||
|
|
fd841ee01f |
block/ioctl: prefer different overflow check
[ Upstream commit ccb326b5f9e623eb7f130fbbf2505ec0e2dcaff9 ] Running syzkaller with the newly reintroduced signed integer overflow sanitizer shows this report: [ 62.982337] ------------[ cut here ]------------ [ 62.985692] cgroup: Invalid name [ 62.986211] UBSAN: signed-integer-overflow in ../block/ioctl.c:36:46 [ 62.989370] 9pnet_fd: p9_fd_create_tcp (7343): problem connecting socket to 127.0.0.1 [ 62.992992] 9223372036854775807 + 4095 cannot be represented in type 'long long' [ 62.997827] 9pnet_fd: p9_fd_create_tcp (7345): problem connecting socket to 127.0.0.1 [ 62.999369] random: crng reseeded on system resumption [ 63.000634] GUP no longer grows the stack in syz-executor.2 (7353): 20002000-20003000 (20001000) [ 63.000668] CPU: 0 PID: 7353 Comm: syz-executor.2 Not tainted 6.8.0-rc2-00035-gb3ef86b5a957 #1 [ 63.000677] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 [ 63.000682] Call Trace: [ 63.000686] <TASK> [ 63.000731] dump_stack_lvl+0x93/0xd0 [ 63.000919] __get_user_pages+0x903/0xd30 [ 63.001030] __gup_longterm_locked+0x153e/0x1ba0 [ 63.001041] ? _raw_read_unlock_irqrestore+0x17/0x50 [ 63.001072] ? try_get_folio+0x29c/0x2d0 [ 63.001083] internal_get_user_pages_fast+0x1119/0x1530 [ 63.001109] iov_iter_extract_pages+0x23b/0x580 [ 63.001206] bio_iov_iter_get_pages+0x4de/0x1220 [ 63.001235] iomap_dio_bio_iter+0x9b6/0x1410 [ 63.001297] __iomap_dio_rw+0xab4/0x1810 [ 63.001316] iomap_dio_rw+0x45/0xa0 [ 63.001328] ext4_file_write_iter+0xdde/0x1390 [ 63.001372] vfs_write+0x599/0xbd0 [ 63.001394] ksys_write+0xc8/0x190 [ 63.001403] do_syscall_64+0xd4/0x1b0 [ 63.001421] ? arch_exit_to_user_mode_prepare+0x3a/0x60 [ 63.001479] entry_SYSCALL_64_after_hwframe+0x6f/0x77 [ 63.001535] RIP: 0033:0x7f7fd3ebf539 [ 63.001551] Code: 28 00 00 00 75 05 48 83 c4 28 c3 e8 f1 14 00 00 90 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 [ 63.001562] RSP: 002b:00007f7fd32570c8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 [ 63.001584] RAX: ffffffffffffffda RBX: 00007f7fd3ff3f80 RCX: 00007f7fd3ebf539 [ 63.001590] RDX: 4db6d1e4f7e43360 RSI: 0000000020000000 RDI: 0000000000000004 [ 63.001595] RBP: 00007f7fd3f1e496 R08: 0000000000000000 R09: 0000000000000000 [ 63.001599] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 [ 63.001604] R13: 0000000000000006 R14: 00007f7fd3ff3f80 R15: 00007ffd415ad2b8 ... [ 63.018142] ---[ end trace ]--- Historically, the signed integer overflow sanitizer did not work in the kernel due to its interaction with `-fwrapv` but this has since been changed [1] in the newest version of Clang; It was re-enabled in the kernel with Commit 557f8c582a9ba8ab ("ubsan: Reintroduce signed overflow sanitizer"). Let's rework this overflow checking logic to not actually perform an overflow during the check itself, thus avoiding the UBSAN splat. [1]: https://github.com/llvm/llvm-project/pull/82432 Signed-off-by: Justin Stitt <justinstitt@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240507-b4-sio-block-ioctl-v3-1-ba0c2b32275e@google.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |
||
|
|
fe1e395563 |
block: fix request.queuelist usage in flush
[ Upstream commit d0321c812d89c5910d8da8e4b10c891c6b96ff70 ] Friedrich Weber reported a kernel crash problem and bisected to commit |
||
|
|
6b7155458e |
block: sed-opal: avoid possible wrong address reference in read_sed_opal_key()
[ Upstream commit 9b1ebce6a1fded90d4a1c6c57dc6262dac4c4c14 ]
Clang static checker (scan-build) warning:
block/sed-opal.c:line 317, column 3
Value stored to 'ret' is never read.
Fix this problem by returning the error code when keyring_search() failed.
Otherwise, 'key' will have a wrong value when 'kerf' stores the error code.
Fixes:
|
||
|
|
b1bee99312 |
blk-cgroup: Properly propagate the iostat update up the hierarchy
[ Upstream commit 9d230c09964e6e18c8f6e4f0d41ee90eef45ec1c ] During a cgroup_rstat_flush() call, the lowest level of nodes are flushed first before their parents. Since commit |
||
|
|
714e59b545 |
blk-cgroup: fix list corruption from reorder of WRITE ->lqueued
[ Upstream commit d0aac2363549e12cc79b8e285f13d5a9f42fd08e ]
__blkcg_rstat_flush() can be run anytime, especially when blk_cgroup_bio_start
is being executed.
If WRITE of `->lqueued` is re-ordered with READ of 'bisc->lnode.next' in
the loop of __blkcg_rstat_flush(), `next_bisc` can be assigned with one
stat instance being added in blk_cgroup_bio_start(), then the local
list in __blkcg_rstat_flush() could be corrupted.
Fix the issue by adding one barrier.
Cc: Tejun Heo <tj@kernel.org>
Cc: Waiman Long <longman@redhat.com>
Fixes:
|
||
|
|
d4a60298ac |
blk-cgroup: fix list corruption from resetting io stat
[ Upstream commit 6da6680632792709cecf2b006f2fe3ca7857e791 ] Since commit |
||
|
|
e5d98cc331 |
block: support to account io_ticks precisely
[ Upstream commit 99dc422335d8b2bd4d105797241d3e715bae90e9 ] Currently, io_ticks is accounted based on sampling, specifically update_io_ticks() will always account io_ticks by 1 jiffies from bdev_start_io_acct()/blk_account_io_start(), and the result can be inaccurate, for example(HZ is 250): Test script: fio -filename=/dev/sda -bs=4k -rw=write -direct=1 -name=test -thinktime=4ms Test result: util is about 90%, while the disk is really idle. This behaviour is introduced by commit |
||
|
|
99bbbd9aea |
block: fix and simplify blkdevparts= cmdline parsing
[ Upstream commit bc2e07dfd2c49aaa4b52302cf7b55cf94e025f79 ] Fix the cmdline parsing of the "blkdevparts=" parameter using strsep(), which makes the code simpler. Before commit |
||
|
|
910717920c |
block: refine the EOF check in blkdev_iomap_begin
[ Upstream commit 0c12028aec837f5a002009bbf68d179d506510e8 ]
blkdev_iomap_begin rounds down the offset to the logical block size
before stashing it in iomap->offset and checking that it still is
inside the inode size.
Check the i_size check to the raw pos value so that we don't try a
zero size write if iter->pos is unaligned.
Fixes:
|
||
|
|
3ffef55116 |
block: add a partscan sysfs attribute for disks
commit a4217c6740dc64a3eb6815868a9260825e8c68c6 upstream. Userspace had been unknowingly relying on a non-stable interface of kernel internals to determine if partition scanning is enabled for a given disk. Provide a stable interface for this purpose instead. Cc: stable@vger.kernel.org # 6.3+ Depends-on: 140ce28dd3be ("block: add a disk_has_partscan helper") Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/linux-block/ZhQJf8mzq_wipkBH@gardel-login/ Link: https://lore.kernel.org/r/20240502130033.1958492-3-hch@lst.de [axboe: add links and commit message from Keith] Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
d6b6dfff6c |
block: add a disk_has_partscan helper
commit 140ce28dd3bee8e53acc27f123ae474d69ef66f0 upstream. Add a helper to check if partition scanning is enabled instead of open coding the check in a few places. This now always checks for the hidden flag even if all but one of the callers are never reachable for hidden gendisks. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240502130033.1958492-2-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
||
|
|
1c172ac7af |
blk-iocost: do not WARN if iocg was already offlined
[ Upstream commit 01bc4fda9ea0a6b52f12326486f07a4910666cf6 ] In iocg_pay_debt(), warn is triggered if 'active_list' is empty, which is intended to confirm iocg is active when it has debt. However, warn can be triggered during a blkcg or disk removal, if iocg_waitq_timer_fn() is run at that time: WARNING: CPU: 0 PID: 2344971 at block/blk-iocost.c:1402 iocg_pay_debt+0x14c/0x190 Call trace: iocg_pay_debt+0x14c/0x190 iocg_kick_waitq+0x438/0x4c0 iocg_waitq_timer_fn+0xd8/0x130 __run_hrtimer+0x144/0x45c __hrtimer_run_queues+0x16c/0x244 hrtimer_interrupt+0x2cc/0x7b0 The warn in this situation is meaningless. Since this iocg is being removed, the state of the 'active_list' is irrelevant, and 'waitq_timer' is canceled after removing 'active_list' in ioc_pd_free(), which ensures iocg is freed after iocg_waitq_timer_fn() returns. Therefore, add the check if iocg was already offlined to avoid warn when removing a blkcg or disk. Signed-off-by: Li Nan <linan122@huawei.com> Reviewed-by: Yu Kuai <yukuai3@huawei.com> Acked-by: Tejun Heo <tj@kernel.org> Link: https://lore.kernel.org/r/20240419093257.3004211-1-linan666@huaweicloud.com Signed-off-by: Jens Axboe <axboe@kernel.dk> Signed-off-by: Sasha Levin <sashal@kernel.org> |