[Figure: call graph of the kernel initialization flow that starts at start_kernel()]
start_kernel
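start_kernel() is the "official" entry point of the Linux kernel, but it usually does not run immediately after the kernel is loaded; some preparation has to finish first. Once the kernel is loaded, the first thing to execute is usually the bootstrap code placed at the very beginning of the Linux kernel image. It handles hardware preparation such as disabling interrupts and setting up memory, and it also decompresses the compressed kernel. This bootstrap code is architecture-dependent and usually lives as assembly code under arch/xxx/boot/ (where xxx can be x86, arm, and so on). Strictly speaking, the bootstrap code is not part of the Linux kernel proper: it is no longer needed once it has finished loading the kernel. Only after that does start_kernel() run and kernel-level initialization begin.
In short, the boot flow after the machine is powered on usually looks like this:
- The bootloader loads the kernel image into memory.
- The bootstrap code at the front of the kernel image performs hardware initialization and other preparation, and decompresses the kernel.
- Finally, start_kernel() is called, which begins a series of initialization steps that truly belong to the kernel level.
start_kernel() itself is a very large function. The main OS data structures, infrastructure, and subsystems are all initialized here, so tracing through start_kernel() gives a view of the whole OS. For that reason I think start_kernel() is a very good starting point for understanding the Linux kernel.
This article does not go into the details of start_kernel() yet; it first gives an overview of the kernel initialization flow. The call graph at the beginning of this article gives a rough picture of what happens from start_kernel() onward:
Call setup_arch() for architecture-specific initialization.
setup_arch() is provided by each architecture, usually in arch/xxx/kernel/setup.c.
Set up the interrupt vectors and initialize memory management, the scheduler, the virtual file system (VFS), and so on.
At the end of start_kernel(), rest_init() is called. By this point the most central parts of the OS have been initialized and the OS is basically able to run.
rest_init() literally means "the rest of the initialization". It calls kernel_thread() twice to create two more kernel threads, kernel_init and kthreadd, and finally enters cpu_idle_loop() to become the idle process with pid = 0.
kernel_init, with pid = 1, continues with higher-level initialization such as initializing drivers and opening the console. Finally, depending on the system configuration, it runs the corresponding init script (possibly /linuxrc or /init) or one of the following default init programs to complete system initialization: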
- /sbin/init
- /etc/init
- /bin/init
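- /bin/sh (this last shell is there for system recovery when something has gone wrong)
kthreadd, with pid = 2, is a kernel daemon thread. It is the parent of all other kernel daemon threads and handles their creation requests.
Next, let's look at the actual code (kernel version 4.1.15). To keep the explanation simple, the kernel code below is slightly simplified.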
linux/init/main.c : start_kernel()
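After the bootloader has loaded the kernel image into memory, and the bootstrap code has decompressed it, completed the necessary hardware setup, and set up the initial page tables, start_kernel() is called and kernel-level initialization begins.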
    asmlinkage __visible void __init start_kernel(void)
    {
        char * command_line;
        extern const struct kernel_param __start___param[], __stop___param[];

        ... init procedures, omitted ...

        boot_cpu_init();
        setup_arch(&command_line);          // architecture-specific setup
        setup_command_line(command_line);   // store the untouched command line
        trap_init();                        // architecture-specific, interrupt vector table, handle hardware traps, exceptions and faults.
        mm_init();                          // memory management

        /*
         * Set up the scheduler prior starting any interrupts (such as the
         * timer interrupt). Full topology setup happens at smp_init()
         * time - but meanwhile we still have a functioning scheduler.
         */
        sched_init();

        init_IRQ();
        tick_init();
        init_timers();

        ... init procedures, omitted ...

        /*
         * HACK ALERT! This is early. We're enabling the console before
         * we've done PCI setups etc, and console_init() must be aware of
         * this. But we do want output early, in case something goes wrong.
         */
        console_init();

        ... init procedures, omitted ...

        sched_clock_init();

        ... init procedures, omitted ...

        vfs_caches_init(totalram_pages);    // file system, including kernfs, sysfs, rootfs, mount tree
        proc_root_init();                   // /proc, /proc/fs, /proc/driver, ...
        nsfs_init();
        cpuset_init();
        cgroup_init();

        ... init procedures, omitted ...

        /* Do the rest non-__init'ed, we're now alive */
        rest_init();
    }
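At this point the core infrastructure of the OS has been initialized and the OS is basically able to operate. The last thing start_kernel() does is call rest_init(), which creates another kernel thread, kernel_init, to continue with higher-level system initialization.
rest_init() does four main things:
- Create the kernel thread kernel_init
- Create the kernel thread kthreadd
- Call schedule() at least once so that the newly created kernel threads can start running
- Enter cpu_idle_loop() and become the idle process (pid=0), which handles the idle task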
    static noinline void __init_refok rest_init(void)
    {
        int pid;

        rcu_scheduler_starting();
        smpboot_thread_init();
        /*
         * We need to spawn init first so that it obtains pid 1, however
         * the init task will end up wanting to create kthreads, which, if
         * we schedule it before we create kthreadd, will OOPS.
         */
        kernel_thread(kernel_init, NULL, CLONE_FS);
        numa_default_policy();
        pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
        rcu_read_lock();
        kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
        rcu_read_unlock();
        complete(&kthreadd_done);           // for synchronization. kernel_init_freeable() will wait for this signal

        /*
         * The boot idle thread must execute schedule()
         * at least once to get things moving:
         */
        init_idle_bootup_task(current);     // set its scheduling class to idle_sched_class
        schedule_preempt_disabled();        // this function will call schedule()
        /* Call into cpu_idle with preempt disabled */
        cpu_startup_entry(CPUHP_ONLINE);
    }

    /**
     * schedule_preempt_disabled - called with preemption disabled
     *
     * Returns with preemption disabled. Note: preempt_count must be 1
     */
    void __sched schedule_preempt_disabled(void)
    {
        sched_preempt_enable_no_resched();  // enables kernel preemption but does not check for any pending reschedules
        schedule();
        preempt_disable();                  // disables kernel preemption by incrementing the preemption counter
    }

    void cpu_startup_entry(enum cpuhp_state state)
    {
        ... omitted ...
        arch_cpu_idle_prepare();
        cpu_idle_loop();
    }
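The comment in rest_init() about spawning init first so that it obtains pid 1 can be illustrated with a toy model. This is only a sketch in userspace C: the toy_* names are hypothetical, and it assumes nothing more than a counter that hands out PIDs in creation order, which is far simpler than the kernel's real PID allocation.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model only: every toy_* name is hypothetical and does not
 * exist in the kernel. The boot task is assumed to own pid 0. */
struct toy_pid_alloc {
    int next_pid;   /* next PID to hand out */
};

void toy_pid_alloc_init(struct toy_pid_alloc *a)
{
    assert(a != NULL);
    a->next_pid = 1;    /* pid 0 is already taken by the boot/idle task */
}

/* "Create" a task and return the PID it would receive. */
int toy_spawn(struct toy_pid_alloc *a)
{
    assert(a != NULL);
    return a->next_pid++;
}
```

Calling toy_spawn() first for kernel_init and then for kthreadd yields 1 and 2 in this model, which is why the order of the two kernel_thread() calls matters.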
idle loop (pid=0)
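At its end, rest_init() enters cpu_idle_loop() and becomes the idle process with pid = 0. By then it has finished its part of system initialization.
The idle process has the lowest priority and only runs when the CPU truly has nothing else to do. On x86 it executes the CPU hlt instruction; on ARM it executes the wfi (wait for interrupt) instruction. Either way the CPU goes to sleep.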
kernel/sched/idle.c: cpu_idle_loop()
    static void cpu_idle_loop(void)
    {
        while (1) {
            /*
             * If the arch has a polling bit, we maintain an invariant:
             *
             * Our polling bit is clear if we're not scheduled (i.e. if
             * rq->curr != rq->idle). This means that, if rq->idle has
             * the polling bit set, then setting need_resched is
             * guaranteed to cause the cpu to reschedule.
             */
            __current_set_polling();
            tick_nohz_idle_enter();         // stop the idle tick from the idle task

            while (!need_resched()) {
                check_pgt_cache();
                rmb();  // read memory barrier.
                        // It ensures that no loads are reordered across the rmb() call.
                        // no loads prior to the call will be reordered to after the call
                        // and no loads after the call will be reordered to before the call.
                        // http://www.makelinux.net/books/lkd2/ch09lev1sec10

                if (cpu_is_offline(smp_processor_id())) {
                    rcu_cpu_notify(NULL, CPU_DYING_IDLE,
                                   (void *)(long)smp_processor_id());
                    smp_mb(); /* all activity before dead. */
                    this_cpu_write(cpu_dead_idle, true);
                    arch_cpu_idle_dead();
                }

                local_irq_disable();
                arch_cpu_idle_enter();

                /*
                 * In poll mode we reenable interrupts and spin.
                 *
                 * Also if we detected in the wakeup from idle
                 * path that the tick broadcast device expired
                 * for us, we don't want to go deep idle as we
                 * know that the IPI is going to arrive right
                 * away
                 */
                if (cpu_idle_force_poll || tick_check_broadcast_expired())
                    cpu_idle_poll();
                else
                    cpuidle_idle_call();

                arch_cpu_idle_exit();
            }

            /*
             * Since we fell out of the loop above, we know
             * TIF_NEED_RESCHED must be set, propagate it into
             * PREEMPT_NEED_RESCHED.
             *
             * This is required because for polling idle loops we will
             * not have had an IPI to fold the state for us.
             */
            preempt_set_need_resched();
            tick_nohz_idle_exit();          // restart the idle tick from the idle task
            __current_clr_polling();

            /*
             * We promise to call sched_ttwu_pending and reschedule
             * if need_resched is set while polling is set. That
             * means that clearing polling needs to be visible
             * before doing these things.
             */
            smp_mb__after_atomic();

            sched_ttwu_pending();
            schedule_preempt_disabled();
        }
    }

    /**
     * schedule_preempt_disabled - called with preemption disabled
     *
     * Returns with preemption disabled. Note: preempt_count must be 1
     */
    void __sched schedule_preempt_disabled(void)
    {
        sched_preempt_enable_no_resched();  // enables kernel preemption but does not check for any pending reschedules
        schedule();
        preempt_disable();                  // disables kernel preemption by incrementing the preemption counter
    }
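The two nested loops in cpu_idle_loop() boil down to "stay idle until need_resched is set, then hand over to the scheduler". The following userspace sketch mimics only that control flow. The toy_cpu structure, the simulated tick counter, and all toy_* functions are assumptions made for illustration and are not kernel code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy CPU state; not a kernel structure. */
struct toy_cpu {
    bool need_resched;      /* set when a runnable task is waiting */
    int  ticks_until_work;  /* simulated time until work arrives */
    int  idle_entries;      /* times the low-power state was entered */
};

/* Stand-in for cpuidle_idle_call(): "sleep" for one tick. The simulated
 * wakeup sets need_resched once the countdown reaches zero. */
void toy_idle_once(struct toy_cpu *cpu)
{
    cpu->idle_entries++;
    if (cpu->ticks_until_work > 0)
        cpu->ticks_until_work--;
    if (cpu->ticks_until_work == 0)
        cpu->need_resched = true;
}

/* One pass of the outer loop: idle while nothing needs the CPU, then
 * return so the caller can run the scheduler. Returns the number of
 * times the idle state was entered during this pass. */
int toy_idle_until_resched(struct toy_cpu *cpu)
{
    assert(cpu != NULL);
    cpu->idle_entries = 0;
    while (!cpu->need_resched)
        toy_idle_once(cpu);
    cpu->need_resched = false;  /* the scheduler would consume the flag */
    return cpu->idle_entries;
}
```

In the real loop the flag is set by another CPU or by an interrupt handler, and the inner step is an architecture-specific sleep rather than a countdown.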
kernel_init (pid=1)
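kernel_init takes over the system-level initialization work. Besides core hardware such as the CPU and memory, a system has many I/O peripherals that need OS support, and beyond hardware there are software middle layers such as file systems and network protocol handling that need OS support as well. kernel_init() is responsible for initializing these parts.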
init/main.c : kernel_init()
    static int __ref kernel_init(void *unused)
    {
        int ret;

        kernel_init_freeable();     // init drivers, modules, and open /dev/console
        /* need to finish all async __init code before freeing the memory */
        async_synchronize_full();   // waits until all asynchronous function calls have been done
        free_initmem();             // free .init section from memory
        mark_rodata_ro();           // mark rodata read-only
        system_state = SYSTEM_RUNNING;
        numa_default_policy();

        flush_delayed_fput();

        if (ramdisk_execute_command) {
            ret = run_init_process(ramdisk_execute_command);
            if (!ret)
                return 0;
            pr_err("Failed to execute %s (error %d)\n",
                   ramdisk_execute_command, ret);
        }

        /*
         * We try each of these until one succeeds.
         *
         * The Bourne shell can be used instead of init if we are
         * trying to recover a really broken machine.
         */
        if (execute_command) {
            ret = run_init_process(execute_command);
            if (!ret)
                return 0;
    #ifndef CONFIG_INIT_FALLBACK
            panic("Requested init %s failed (error %d).",
                  execute_command, ret);
    #else
            pr_err("Failed to execute %s (error %d). Attempting defaults...\n",
                   execute_command, ret);
    #endif
        }
        if (!try_to_run_init_process("/sbin/init") ||
            !try_to_run_init_process("/etc/init") ||
            !try_to_run_init_process("/bin/init") ||
            !try_to_run_init_process("/bin/sh"))
            return 0;

        panic("No working init found. Try passing init= option to kernel. "
              "See Linux Documentation/init.txt for guidance.");
    }
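The tail of kernel_init() is a "try each candidate in order until one succeeds" chain. A minimal userspace sketch of that pattern follows. toy_pick_init() and the two fake exec callbacks are hypothetical helpers; the real kernel replaces the current thread with the init program instead of returning a path.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical callback type: returns 0 if the program could be started. */
typedef int (*toy_exec_fn)(const char *path);

/* Return the first candidate for which try_exec() succeeds, or NULL if
 * none works (the point at which the kernel would panic). */
const char *toy_pick_init(const char *const candidates[], size_t n,
                          toy_exec_fn try_exec)
{
    assert(candidates != NULL && try_exec != NULL);
    for (size_t i = 0; i < n; i++) {
        if (try_exec(candidates[i]) == 0)
            return candidates[i];
    }
    return NULL;
}

/* Fake system on which only the recovery shell exists. */
int toy_only_sh_exists(const char *path)
{
    return strcmp(path, "/bin/sh") == 0 ? 0 : -1;
}

/* Fake system on which no init program exists at all. */
int toy_nothing_exists(const char *path)
{
    (void)path;
    return -1;
}
```

With the default list /sbin/init, /etc/init, /bin/init, /bin/sh and the first fake system, the sketch falls through to /bin/sh, which matches the recovery role described for the shell.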
init/main.c: kernel_init_freeable()
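Literally, "freeable" means that it can be freed: the function is marked __init, so its memory is released later by free_initmem(). Its main job is to hook system peripherals and software middle layers into the OS and initialize them. kernel_init_freeable() covers a very wide range of initialization work. As the code below shows, it initializes devices, drivers, and rootfs, mounts virtual file system directories such as /dev and /sys, and opens /dev/console for message output. Most of this work is done by do_basic_setup(). Going deeper takes more time, so here we step back to a higher-level view of the overall initialization flow.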
    static noinline void __init kernel_init_freeable(void)
    {
        /*
         * Wait until kthreadd is all set-up.
         */
        wait_for_completion(&kthreadd_done);

        /* Now the scheduler is fully set up and can do blocking allocations */
        gfp_allowed_mask = __GFP_BITS_MASK;

        /*
         * init can allocate pages on any node
         */
        set_mems_allowed(node_states[N_MEMORY]);
        /*
         * init can run on any cpu.
         */
        set_cpus_allowed_ptr(current, cpu_all_mask);

        cad_pid = task_pid(current);

        smp_prepare_cpus(setup_max_cpus);

        do_pre_smp_initcalls();
        lockup_detector_init();

        smp_init();
        sched_init_smp();

        do_basic_setup();

        /* Open the /dev/console on the rootfs, this should never fail */
        if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
            pr_err("Warning: unable to open an initial console.\n");

        (void) sys_dup(0);
        (void) sys_dup(0);
        /*
         * check if there is an early userspace init. If yes, let it do all
         * the work
         */
        if (!ramdisk_execute_command)
            ramdisk_execute_command = "/init";

        if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
            ramdisk_execute_command = NULL;
            prepare_namespace();
        }

        /*
         * Ok, we have completed the initial bootup, and
         * we're essentially up and running. Get rid of the
         * initmem segments and start the user-mode stuff..
         *
         * rootfs is available now, try loading the public keys
         * and default modules
         */
        integrity_load_keys();
        load_default_modules();
    }
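The sys_open() call followed by two sys_dup(0) calls is what gives the future init program its file descriptors 0, 1, and 2, because a new descriptor always takes the lowest free number. The toy descriptor table below is a sketch of that rule only. toy_fd_table and the toy_* functions are hypothetical and much simpler than the kernel's file table.

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_FDS 8

/* Hypothetical descriptor table: each slot names the open file, and
 * NULL marks a free slot. */
struct toy_fd_table {
    const char *file[TOY_MAX_FDS];
};

/* Install a file in the lowest free slot; return its number or -1. */
int toy_alloc_fd(struct toy_fd_table *t, const char *file)
{
    assert(t != NULL && file != NULL);
    for (int fd = 0; fd < TOY_MAX_FDS; fd++) {
        if (t->file[fd] == NULL) {
            t->file[fd] = file;
            return fd;
        }
    }
    return -1;  /* table full */
}

/* Stand-in for open(): the new descriptor is the lowest free one. */
int toy_open(struct toy_fd_table *t, const char *path)
{
    return toy_alloc_fd(t, path);
}

/* Stand-in for dup(): a second descriptor that refers to the same file. */
int toy_dup(struct toy_fd_table *t, int oldfd)
{
    assert(t != NULL);
    if (oldfd < 0 || oldfd >= TOY_MAX_FDS || t->file[oldfd] == NULL)
        return -1;
    return toy_alloc_fd(t, t->file[oldfd]);
}
```

Starting from an empty table, toy_open() on /dev/console returns 0 and the two toy_dup(0) calls return 1 and 2, so all three standard descriptors end up pointing at the console.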
init/main.c: do_basic_setup()
In fact, most of the initialization work is done inside this function, so do_basic_setup() is anything but basic.
    /*
     * Ok, the machine is now initialized. None of the devices
     * have been touched yet, but the CPU subsystem is up and
     * running, and memory and process management works.
     *
     * Now we can finally start doing some real work..
     */
    static void __init do_basic_setup(void)
    {
        cpuset_init_smp();
        usermodehelper_init();
        shmem_init();
        driver_init();              // init driver model. (kobject, kset)
        init_irq_proc();
        do_ctors();                 // call constructor functions in .ctors section
        usermodehelper_enable();
        do_initcalls();             // call init functions in .initcall[0~9].init sections
        random_int_secret_init();
    }

    /**
     * driver_init - initialize driver model.
     *
     * Call the driver model init functions to initialize their
     * subsystems. Called early from init/main.c.
     */
    void __init driver_init(void)
    {
        /* These are the core pieces */
        devtmpfs_init();            // mount root node: "/"
        devices_init();
        buses_init();
        classes_init();
        firmware_init();
        hypervisor_init();

        /* These are also core pieces, but must come after the
         * core core pieces.
         */
        platform_bus_init();
        cpu_dev_init();
        memory_dev_init();
        container_dev_init();
        of_core_init();
    }
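do_initcalls() runs registered init functions level by level, so that lower-level subsystems are ready before the code that depends on them. The sketch below imitates that ordering with an ordinary table. The kernel collects these functions through linker sections instead, and every toy_* name here is hypothetical. The level numbers 1, 6, and 7 are chosen to mirror core_initcall, device_initcall, and late_initcall.

```c
#include <assert.h>
#include <stddef.h>

typedef int (*toy_initcall_t)(void);

/* Hypothetical table entry: a function plus the level it belongs to. */
struct toy_initcall {
    int level;              /* lower levels run first */
    toy_initcall_t fn;
};

/* Run every entry level by level; within a level, keep table order.
 * Returns how many init functions reported failure (nonzero). */
int toy_do_initcalls(const struct toy_initcall *calls, size_t n, int max_level)
{
    int failures = 0;

    assert(calls != NULL);
    for (int level = 0; level <= max_level; level++) {
        for (size_t i = 0; i < n; i++) {
            if (calls[i].level == level && calls[i].fn() != 0)
                failures++;
        }
    }
    return failures;
}

/* Demo callbacks that record the order in which they ran. */
char toy_trace[8];
size_t toy_trace_len;

static void toy_record(char c)
{
    if (toy_trace_len < sizeof(toy_trace) - 1)
        toy_trace[toy_trace_len++] = c;
}

int toy_core_init(void)   { toy_record('c'); return 0; }
int toy_device_init(void) { toy_record('d'); return 0; }
int toy_late_init(void)   { toy_record('l'); return 0; }
```

Even if the table lists the late entry first, the core entry runs first and the late entry last, because the outer loop walks the levels in ascending order.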
kthreadd (pid=2)
kernel/kthread.c: kthreadd()
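- kthreadd is a kernel daemon thread and the parent of all other kernel threads. It services the kernel thread creation requests that kthread_create_on_node() records in kthread_create_list.
- When kthread_create_list is empty, kthreadd sets its own state to TASK_INTERRUPTIBLE and yields the CPU.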
    struct kthread_create_info
    {
        /* Information passed to kthread() from kthreadd. */
        int (*threadfn)(void *data);
        void *data;
        int node;

        /* Result passed back to kthread_create() from kthreadd. */
        struct task_struct *result;
        struct completion *done;

        struct list_head list;
    };

    int kthreadd(void *unused)
    {
        struct task_struct *tsk = current;

        /* Setup a clean context for our children to inherit. */
        set_task_comm(tsk, "kthreadd");
        ignore_signals(tsk);
        set_cpus_allowed_ptr(tsk, cpu_all_mask);
        set_mems_allowed(node_states[N_MEMORY]);

        current->flags |= PF_NOFREEZE;

        for (;;) {
            set_current_state(TASK_INTERRUPTIBLE);
            if (list_empty(&kthread_create_list))   // if no kthread create request
                schedule();                         // yield CPU
            __set_current_state(TASK_RUNNING);

            spin_lock(&kthread_create_lock);
            while (!list_empty(&kthread_create_list)) {     // handle all kthread create requests
                struct kthread_create_info *create;

                create = list_entry(kthread_create_list.next,
                                    struct kthread_create_info, list);
                list_del_init(&create->list);       // remove the entry from list
                spin_unlock(&kthread_create_lock);

                create_kthread(create);

                spin_lock(&kthread_create_lock);
            }
            spin_unlock(&kthread_create_lock);
        }

        return 0;
    }
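The inner while loop of kthreadd() is a plain "drain the request list" pattern: take the first request off the list, service it, and repeat until the list is empty. The single-threaded sketch below shows only that pattern. It assumes no locking and no sleeping, the toy_* types and functions are hypothetical, and it "creates" a thread by calling threadfn directly, which the real kthreadd does not do.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical request record, loosely shaped like kthread_create_info. */
struct toy_create_req {
    int (*threadfn)(void *data);
    void *data;
    int result;                     /* filled in when the request is serviced */
    struct toy_create_req *next;
};

/* Hypothetical FIFO list of pending requests. */
struct toy_create_list {
    struct toy_create_req *head;
    struct toy_create_req *tail;
};

/* Append a request to the end of the list. */
void toy_queue_request(struct toy_create_list *l, struct toy_create_req *r)
{
    assert(l != NULL && r != NULL);
    r->next = NULL;
    if (l->tail != NULL)
        l->tail->next = r;
    else
        l->head = r;
    l->tail = r;
}

/* Drain the list in FIFO order. Returns the number of requests serviced. */
int toy_service_requests(struct toy_create_list *l)
{
    int serviced = 0;

    assert(l != NULL);
    while (l->head != NULL) {
        struct toy_create_req *r = l->head;

        l->head = r->next;          /* unlink before servicing */
        if (l->head == NULL)
            l->tail = NULL;
        r->next = NULL;

        r->result = r->threadfn(r->data);   /* stand-in for create_kthread() */
        serviced++;
    }
    return serviced;
}

/* Example thread function for the demo: doubles the int it is given. */
int toy_double(void *data)
{
    return *(int *)data * 2;
}
```

Unlinking the request before servicing it mirrors the list_del_init() call in the kernel loop, which lets the kernel drop the spinlock while create_kthread() runs.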
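Summary
This article outlined the rough flow of the Linux kernel from start_kernel() onward. After power-on, booting is essentially a chain of initialization steps that moves from low level to high level, and the high-level part is often adjusted to suit each system's use case.
To understand a system, I like to start from the macro-level program flow and then dig into the micro-level implementation details; for me this is the better order. When reading implementation details, I want to know where I am in the overall architecture and what the upstream and downstream data relationships are. That helps me understand how the program works and what was considered when it was implemented. This article therefore covered the overall kernel boot flow first; later articles will explore each part of the Linux system initialization flow in more depth.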