How to traverse and dequeue all tasks on the cpu runqueue?
by sebastianjiang from LinuxQuestions.org on (#6KY2F)
Hello people:
I am trying to add some functionality to the kernel scheduler.
Basically, I want each CPU to check some conditions every time it calls __schedule() and decide whether it should go "offline" or "online".
CPU 0 always stays online and is in charge of bringing offline CPUs back online when some condition is met.
I'm looking for a way to traverse the rq of a CPU (say CPU x), dequeue all tasks on it (or at least those on the cfs and rt run queues), and enqueue them on some other CPU (for example, CPU 0). I also want to keep any tasks from getting onto the run queue of CPU x until CPU 0 has been told that some condition is met and CPU x can go online again.
I am facing difficulties with both of the approaches I can think of:
1. Reuse functions in kernel/sched/core.c, for example sched_cpu_activate() and sched_cpu_deactivate(). In my experiments, however, sched_cpu_deactivate() sometimes causes the kernel to panic (dmesg attached at the end).
2. Manually dequeue all tasks on the cfs_rq and the rt_rq. They are implemented as rb_tree, though, and I do not have a clear idea of how to do it properly.
I would really appreciate any kind of advice on this. Thank you all!
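To make the intended behaviour concrete, here is a small userspace toy model of it. This is only a sketch and not kernel code: every name in it (toy_rq, toy_enqueue(), toy_drain(), toy_reopen()) is hypothetical, the queues are plain arrays rather than the real cfs_rq/rt_rq structures, and there is no locking at all. It only illustrates the order of operations I am after: first stop CPU x from accepting tasks, then move everything already queued there to another CPU, and redirect later enqueues to CPU 0 until CPU x is reopened.

```c
/* Userspace toy model only: none of these names exist in the kernel. */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_CPUS   4
#define TOY_MAX_TASKS 16

struct toy_task {
	int pid;
	int cpu;	/* CPU whose queue holds the task, -1 if none */
};

struct toy_rq {
	struct toy_task *tasks[TOY_MAX_TASKS];
	size_t nr_running;
	bool accepts_tasks;	/* false while the CPU is "offline" */
};

static struct toy_rq toy_rqs[TOY_NR_CPUS];

/* Reset all queues; every CPU starts out accepting tasks. */
static void toy_init(void)
{
	for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++) {
		toy_rqs[cpu].nr_running = 0;
		toy_rqs[cpu].accepts_tasks = true;
	}
}

/*
 * Queue a task on the requested CPU, or on CPU 0 if that CPU is closed.
 * Returns the CPU actually used, or -1 if the chosen queue is full.
 */
static int toy_enqueue(struct toy_task *t, int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	if (!toy_rqs[cpu].accepts_tasks)
		cpu = 0;	/* CPU 0 is never closed in this model */

	struct toy_rq *rq = &toy_rqs[cpu];
	if (rq->nr_running == TOY_MAX_TASKS)
		return -1;
	rq->tasks[rq->nr_running++] = t;
	t->cpu = cpu;
	return cpu;
}

/*
 * Close src to new tasks, then move everything queued there to dst.
 * Returns the number of tasks moved; stops early if dst fills up.
 */
static size_t toy_drain(int src, int dst)
{
	assert(src > 0 && src < TOY_NR_CPUS);
	assert(dst >= 0 && dst < TOY_NR_CPUS && dst != src);

	struct toy_rq *from = &toy_rqs[src];
	size_t moved = 0;

	from->accepts_tasks = false;	/* close the gate before draining */
	while (from->nr_running > 0) {
		struct toy_task *t = from->tasks[--from->nr_running];

		if (toy_enqueue(t, dst) < 0) {
			/* Destination full: put the task back and give up. */
			from->tasks[from->nr_running++] = t;
			break;
		}
		moved++;
	}
	return moved;
}

/* Let a CPU take tasks again (the "go online" step). */
static void toy_reopen(int cpu)
{
	assert(cpu >= 0 && cpu < TOY_NR_CPUS);
	toy_rqs[cpu].accepts_tasks = true;
}
```

For example, after toy_init(), enqueueing two tasks on CPU 2 and calling toy_drain(2, 0) leaves both tasks on CPU 0, and a later toy_enqueue(&t, 2) is redirected to CPU 0 until toy_reopen(2) is called. What I do not know is how to do the equivalent safely on the real run queues.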
Code snippet for using sched_cpu_(de)activate() (in __schedule()):
Code:
static void __sched notrace __schedule(bool preempt)
{
	struct task_struct *prev, *next;
	unsigned long *switch_count;
	unsigned long prev_state;
	struct rq_flags rf;
	struct rq *rq;
	int cpu;
	unsigned long magic_val;
	unsigned long long int start_time, end_time;
	unsigned long long int delta_ns;

	cpu = smp_processor_id();
	rq = cpu_rq(cpu);
	prev = rq->curr;

	if (cpu == 0) {
		/* CPU 0 is always online and decides how many CPUs stay online. */
		if (some condition) {
			unsigned int i;
			unsigned int prev_cpu_num = ...; // saved globally
			unsigned int curr_cpu_num = ...; // calculated from condition

			if (curr_cpu_num == prev_cpu_num) {
				// Do nothing
			} else if (curr_cpu_num > prev_cpu_num) {
				// Increase CPU num
				for (i = prev_cpu_num; i < curr_cpu_num; i++) {
					sched_cpu_activate(i);
					mark_cpu_online(i);
				}
			} else {
				// Decrease CPU num
				for (i = curr_cpu_num; i < prev_cpu_num; i++) {
					mark_cpu_offline(i);
				}
			}
			update_cpu_num(curr_cpu_num); // Update globally
		}
	} else { // not CPU 0
		/* A CPU that CPU 0 has marked offline deactivates itself here. */
		if (!CPU_marked_online(cpu)) {
			sched_cpu_deactivate(cpu);
		}
	}
	... // The rest of __schedule()

The dmesg with sched_cpu_deactivate():
Code:
[ 109.084087] Deactivating cpu 2, elapsed time: 40615543 ns
[ 109.085247] bad: scheduling from the idle thread!
[ 109.086310] bad: scheduling from the idle thread!
[ 109.087336] bad: scheduling from the idle thread!
[ 109.088529] Deactivating cpu 3, elapsed time: 44167798 ns
[ 109.089682] bad: scheduling from the idle thread!
[ 109.090736] bad: scheduling from the idle thread!
[ 109.091788] bad: scheduling from the idle thread!
[ 109.092838] Deactivating cpu 4, elapsed time: 47313416 ns
[ 109.093098] Deactivating cpu 5, elapsed time: 47255525 ns
[ 109.099778] BUG: unable to handle page fault for address: 0000001959d04839
[ 109.103577] #PF: supervisor instruction fetch in kernel mode
[ 109.106777] #PF: error_code(0x0010) - not-present page
[ 109.109567] PGD 0 P4D 0
[ 109.110601] Oops: 0010 [#1] SMP PTI
[ 109.111960] CPU: 0 PID: 0 Comm: swapper/5 Tainted: G W 5.11.0-ghost-guest+ #19
[ 109.114921] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-1ubuntu1.1 04/01/2014
[ 109.117975] RIP: 0010:0x1959d04839
[ 109.119280] Code: Unable to access opcode bytes at RIP 0x1959d0480f.
[ 109.120802] RSP: 0018:ffffa466000abe10 EFLAGS: 00010086
[ 109.122095] RAX: ffff983381acddc0 RBX: ffffa466000abe28 RCX: 0000000000000048
[ 109.123799] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000048
[ 109.125496] RBP: ffffffffaab42bac R08: 0000001966d9f5b9 R09: 0000000000000000
[ 109.127184] R10: ffffffffac46b180 R11: 0000000000000000 R12: 00000000ffff4543
[ 109.128859] R13: 000000195f892039 R14: ffff9833fbd5cdc0 R15: ffffffffaab43fe6
[ 109.130506] FS: 0000000000000000(0000) GS:ffff9833fbc00000(0000) knlGS:0000000000000000
[ 109.132067] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.133215] CR2: 0000001959d04839 CR3: 00000001109a8003 CR4: 0000000000770ef0
[ 109.134618] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 109.136003] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 109.137389] PKRU: 55555554
[ 109.137990] Call Trace:
[ 109.138545] ? tick_nohz_next_event+0x90/0x180
[ 109.139465] ? tick_nohz_idle_stop_tick+0x164/0x280
[ 109.140467] ? default_idle+0xe/0x20
[ 109.141237] ? arch_cpu_idle+0x15/0x20
[ 109.142034] ? default_idle_call+0x38/0xc0
[ 109.142903] ? do_idle+0x1fd/0x260
[ 109.143632] ? complete+0x3f/0x50
[ 109.144346] ? cpu_startup_entry+0x20/0x30
[ 109.145205] ? start_secondary+0x11f/0x160
[ 109.146054] ? secondary_startup_64_no_verify+0xb0/0xbb
[ 109.147097] Modules linked in: nls_iso8859_1 dm_multipath scsi_dh_rdac scsi_dh_emc scsi_dh_alua binfmt_misc kvm_intel kvm joydev input_leds serio_raw sch_fq_codel drm msr efi_pstore virtio_rng ip_tables x_tables autofs4 btrfs blake2b_generic zstd_compress raid10 raid456 async_raid6_recov async_memcpy async_pq async_xor async_tx libcrc32c xor raid6_pq raid1 raid0 multipath linear crct10dif_pclmul crc32_pclmul psmouse ghash_clmulni_intel virtio_net aesni_intel glue_helper net_failover crypto_simd ahci cryptd virtio_blk libahci failover
[ 109.155703] CR2: 0000001959d04839
[ 109.156417] ---[ end trace fca5170169a8dc51 ]---
[ 109.157358] RIP: 0010:0x1959d04839
[ 109.158086] Code: Unable to access opcode bytes at RIP 0x1959d0480f.
[ 109.159339] RSP: 0018:ffffa466000abe10 EFLAGS: 00010086
[ 109.160390] RAX: ffff983381acddc0 RBX: ffffa466000abe28 RCX: 0000000000000048
[ 109.161778] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000048
[ 109.163155] RBP: ffffffffaab42bac R08: 0000001966d9f5b9 R09: 0000000000000000
[ 109.164530] R10: ffffffffac46b180 R11: 0000000000000000 R12: 00000000ffff4543
[ 109.165921] R13: 000000195f892039 R14: ffff9833fbd5cdc0 R15: ffffffffaab43fe6
[ 109.167303] FS: 0000000000000000(0000) GS:ffff9833fbc00000(0000) knlGS:0000000000000000
[ 109.168860] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 109.170005] CR2: 0000001959d04839 CR3: 00000001109a8003 CR4: 0000000000770ef0
[ 109.171390] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 109.172767] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 109.174145] PKRU: 55555554
[ 109.174749] Kernel panic - not syncing: Attempted to kill the idle task!
[ 110.237772] Shutting down cpus with NMI
[ 110.238668] Kernel Offset: 0x29a00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 110.240707] ---[ end Kernel panic - not syncing: Attempted to kill the idle task! ]---