[ Upstream commit 8176080d59 ]
Kernel hang in blkg_destroy_all() when total blkg greater than
BLKG_DESTROY_BATCH_SIZE, because of not removing destroyed blkg in
blkg_list. So the size of blkg_list is same after destroying a
batch of blkg, and the infinite 'restart' occurs.
Since blkg should stay on the queue list until blkg_free_workfn(),
skip destroyed blkg when restart a new round, which will solve this
kernel hang issue and satisfy the previous will to restart.
Reported-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Xiangfei Ma <xiangfeix.ma@intel.com>
Tested-by: Farrah Chen <farrah.chen@intel.com>
Signed-off-by: Tao Su <tao1.su@linux.intel.com>
Fixes: f1c006f1c6 ("blk-cgroup: synchronize pd_free_fn() from blkg_free_workfn() and blkcg_deactivate_policy()")
Suggested-and-reviewed-by: Yu Kuai <yukuai3@huawei.com>
Link: https://lore.kernel.org/r/20230428045149.1310073-1-tao1.su@linux.intel.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit f1c006f1c6 ]
Currently parent pd can be freed before child pd:
t1: remove cgroup C1
blkcg_destroy_blkgs
blkg_destroy
list_del_init(&blkg->q_node)
// remove blkg from queue list
percpu_ref_kill(&blkg->refcnt)
blkg_release
call_rcu
t2: from t1
__blkg_release
blkg_free
schedule_work
t4: deactivate policy
blkcg_deactivate_policy
pd_free_fn
// parent of C1 is freed first
t3: from t2
blkg_free_workfn
pd_free_fn
If policy(for example, ioc_timer_fn() from iocost) access parent pd from
child pd after pd_offline_fn(), then UAF can be triggered.
Fix the problem by delaying 'list_del_init(&blkg->q_node)' from
blkg_destroy() to blkg_free_workfn(), and using a new disk level mutex to
synchronize blkg_free_workfn() and blkcg_deactivate_policy().
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-4-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
[ Upstream commit c7241babf0 ]
Some cgroup policies will access parent pd through child pd even
after pd_offline_fn() is done. If pd_free_fn() for parent is called
before child, then UAF can be triggered. Hence it's better to guarantee
the order of pd_free_fn().
Currently refcount of parent blkg is dropped in __blkg_release(), which
is before pd_free_fn() is called in blkg_free_work_fn() while
blkg_free_work_fn() is called asynchronously.
This patch make sure pd_free_fn() called from removing cgroup is ordered
by delaying dropping parent refcount after calling pd_free_fn() for
child.
BTW, pd_free_fn() will also be called from blkcg_deactivate_policy()
from deleting device, and following patches will guarantee the order.
Signed-off-by: Yu Kuai <yukuai3@huawei.com>
Acked-by: Tejun Heo <tj@kernel.org>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20230119110350.2287325-2-yukuai1@huaweicloud.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Sasha Levin <sashal@kernel.org>
commit 813e693023 upstream.
When a gendisk is successfully initialized but add_disk() fails such as when
a loop device has invalid number of minor device numbers specified,
blkcg_init_disk() is called during init and then blkcg_exit_disk() during
error handling. Unfortunately, iolatency gets initialized in the former but
doesn't get cleaned up in the latter.
This is because, in non-error cases, the cleanup is performed by
del_gendisk() calling rq_qos_exit(), the assumption being that rq_qos
policies, iolatency being one of them, can only be activated once the disk
is fully registered and visible. That assumption is true for wbt and iocost,
but not so for iolatency as it gets initialized before add_disk() is called.
It is desirable to lazy-init rq_qos policies because they are optional
features and add to hot path overhead once initialized - each IO has to walk
all the registered rq_qos policies. So, we want to switch iolatency to lazy
init too. However, that's a bigger change. As a fix for the immediate
problem, let's just add an extra call to rq_qos_exit() in blkcg_exit_disk().
This is safe because duplicate calls to rq_qos_exit() become noop's.
Signed-off-by: Tejun Heo <tj@kernel.org>
Reported-by: darklight2357@icloud.com
Cc: Josef Bacik <josef@toxicpanda.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Fixes: d706751215 ("block: introduce blk-iolatency io controller")
Cc: stable@vger.kernel.org # v4.19+
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/Y5TQ5gm3O4HXrXR3@slm.duckdns.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Pass the gendisk to blkcg_init_disk and blkcg_exit_disk as part of moving
the blk-cgroup infrastructure to be gendisk based. Also remove the
rather pointless kerneldoc comments for these internal functions with a
single caller each.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Andreas Herrmann <aherrmann@suse.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220921180501.1539876-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Use atomic_try_cmpxchg instead of atomic_cmpxchg (*ptr, old, new) == old
in blkcg_unuse_delay, blkcg_set_delay and blkcg_clear_delay and
atomic64_try_cmpxchg in blkcg_scale_delay. x86 CMPXCHG instruction
returns success in ZF flag, so this change saves a compare after cmpxchg
(and related move instruction in front of cmpxchg).
Also, atomic_try_cmpxchg implicitly assigns old *ptr value to "old" when
cmpxchg fails, enabling further code simplifications.
No functional change intended.
Signed-off-by: Uros Bizjak <ubizjak@gmail.com>
Cc: Jens Axboe <axboe@kernel.dk>
Link: https://lore.kernel.org/r/20220712154455.66868-1-ubizjak@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Convert blk-ioprio handling from a rqos policy to a direct call from
blk_mq_submit_bio(). Firstly, blk-ioprio is not much of a rqos policy
anyway, it just needs a hook in bio submission path to set the bio's IO
priority. Secondly, the rqos .track hook gets actually called too late
for blk-ioprio purposes and introducing a special rqos hook just for
blk-ioprio looks even weirder.
Reviewed-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Tested-by: Damien Le Moal <damien.lemoal@opensource.wdc.com>
Signed-off-by: Jan Kara <jack@suse.cz>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Link: https://lore.kernel.org/r/20220623074840.5960-7-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Commit d92c370a16 ("block: really clone the block cgroup in
bio_clone_blkg_association") changed bio_clone_blkg_association() to
just clone bio->bi_blkg reference from source to destination bio. This
is however wrong if the source and destination bios are against
different block devices because struct blkcg_gq is different for each
bdev-blkcg pair. This will result in IOs being accounted (and throttled
as a result) multiple times against the same device (src bdev) while
throttling of the other device (dst bdev) is ignored. In case of BFQ the
inconsistency can even result in crashes in bfq_bic_update_cgroup().
Fix the problem by looking up correct blkcg_gq for the cloned bio.
Reported-by: Logan Gunthorpe <logang@deltatee.com>
Reported-and-tested-by: Donald Buczek <buczek@molgen.mpg.de>
Fixes: d92c370a16 ("block: really clone the block cgroup in bio_clone_blkg_association")
CC: stable@vger.kernel.org
Reviewed-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Jan Kara <jack@suse.cz>
Link: https://lore.kernel.org/r/20220602081242.7731-1-jack@suse.cz
Signed-off-by: Jens Axboe <axboe@kernel.dk>
A previous commit got rid of unnecessary rcu_read_lock() inside the
IRQ disabling queue_lock, but this debug statement was left. It's now
firing since we are indeed not inside a RCU read lock, but we don't
need to be as we're still preempt safe.
Get rid of the check, as we have a lockdep assert for holding the
queue lock right after it anyway.
Link: https://lore.kernel.org/linux-block/46253c48-81cb-0787-20ad-9133afdd9e21@samsung.com/
Reported-by: Marek Szyprowski <m.szyprowski@samsung.com>
Fixes: 77c570a1ea ("blk-cgroup: Remove unnecessary rcu_read_lock/unlock()")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
With the removal of seq_get_buf in blkcg_print_one_stat, we
cannot make adding the newline conditional on there being
relevant stats because the name was already written out
unconditionally.
Otherwise we may end up with multiple device names in one
line which is confusing and doesn't follow the nested-keyed
file format.
Signed-off-by: Wolfgang Bumiller <w.bumiller@proxmox.com>
Fixes: 252c651a4c ("blk-cgroup: stop using seq_get_buf")
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220111083159.42340-1-w.bumiller@proxmox.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
All callers of bio_blkcg actually want the CSS, so replace it with an
interface that does return the CSS. This now allows to move
struct blkcg_gq to block/blk-cgroup.h instead of exposing it in a
public header.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-10-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move these two functions out of line as there is no good reason
to inline them. Also switch to passing a cgroup_subsys_state
instead of doing the conversion in the caller to prepare for making
the blkcg structure private to blk-cgroup.
Signed-off-by: Christoph Hellwig <hch@lst.de>
Acked-by: Tejun Heo <tj@kernel.org>
Link: https://lore.kernel.org/r/20220420042723.1010598-7-hch@lst.de
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Pull block fixes from Jens Axboe:
"Either fixes or a few additions that got missed in the initial merge
window pull. In detail:
- List iterator fix to avoid leaking value post loop (Jakob)
- One-off fix in minor count (Christophe)
- Fix for a regression in how io priority setting works for an
exiting task (Jiri)
- Fix a regression in this merge window with blkg_free() being called
in an inappropriate context (Ming)
- Misc fixes (Ming, Tom)"
* tag 'for-5.18/block-2022-04-01' of git://git.kernel.dk/linux-block:
blk-wbt: remove wbt_track stub
block: use dedicated list iterator variable
block: Fix the maximum minor value is blk_alloc_ext_minor()
block: restore the old set_task_ioprio() behaviour wrt PF_EXITING
block: avoid calling blkg_free() in atomic context
lib/sbitmap: allocate sb->map via kvzalloc_node
Pull ptrace cleanups from Eric Biederman:
"This set of changes removes tracehook.h, moves modification of all of
the ptrace fields inside of siglock to remove races, adds a missing
permission check to ptrace.c
The removal of tracehook.h is quite significant as it has been a major
source of confusion in recent years. Much of that confusion was around
task_work and TIF_NOTIFY_SIGNAL (which I have now decoupled making the
semantics clearer).
For people who don't know tracehook.h is a vestiage of an attempt to
implement uprobes like functionality that was never fully merged, and
was later superseeded by uprobes when uprobes was merged. For many
years now we have been removing what tracehook functionaly a little
bit at a time. To the point where anything left in tracehook.h was
some weird strange thing that was difficult to understand"
* tag 'ptrace-cleanups-for-v5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/ebiederm/user-namespace:
ptrace: Remove duplicated include in ptrace.c
ptrace: Check PTRACE_O_SUSPEND_SECCOMP permission on PTRACE_SEIZE
ptrace: Return the signal to continue with from ptrace_stop
ptrace: Move setting/clearing ptrace_message into ptrace_stop
tracehook: Remove tracehook.h
resume_user_mode: Move to resume_user_mode.h
resume_user_mode: Remove #ifdef TIF_NOTIFY_RESUME in set_notify_resume
signal: Move set_notify_signal and clear_notify_signal into sched/signal.h
task_work: Decouple TIF_NOTIFY_SIGNAL and task_work
task_work: Call tracehook_notify_signal from get_signal on all architectures
task_work: Introduce task_work_pending
task_work: Remove unnecessary include from posix_timers.h
ptrace: Remove tracehook_signal_handler
ptrace: Remove arch_syscall_{enter,exit}_tracehook
ptrace: Create ptrace_report_syscall_{entry,exit} in ptrace.h
ptrace/arm: Rename tracehook_report_syscall report_syscall
ptrace: Move ptrace_report_syscall into ptrace.h
In the whole lifetime of blkcg_gq instance, ->q will be referred, such
as, ->pd_free_fn() is called in blkg_free, and throtl_pd_free() still
may touch the request queue via &tg->service_queue.pending_timer which
is handled by throtl_pending_timer_fn(), so it is reasonable to grab
request queue's refcnt by blkcg_gq instance.
Previously blkcg_exit_queue() is called from blk_release_queue, and it
is hard to avoid the use-after-free. But recently commit 1059699f87 ("block:
move blkcg initialization/destroy into disk allocation/release handler")
is merged to for-5.18/block, it becomes simple to fix the issue by simply
grabbing request queue's refcnt.
Reported-by: Christoph Hellwig <hch@lst.de>
Signed-off-by: Ming Lei <ming.lei@redhat.com>
Link: https://lore.kernel.org/r/20220318130144.1066064-3-ming.lei@redhat.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
Move set_notify_resume and tracehook_notify_resume into resume_user_mode.h.
While doing that rename tracehook_notify_resume to resume_user_mode_work.
Update all of the places that included tracehook.h for these functions to
include resume_user_mode.h instead.
Update all of the callers of tracehook_notify_resume to call
resume_user_mode_work.
Reviewed-by: Kees Cook <keescook@chromium.org>
Link: https://lkml.kernel.org/r/20220309162454.123006-12-ebiederm@xmission.com
Signed-off-by: "Eric W. Biederman" <ebiederm@xmission.com>