The kernel's entry point: start_kernel()

[Figure: call graph of start_kernel()]

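start_kernel() is the "official" entry point of the Linux kernel, but it usually does not run the moment the kernel is loaded; some preparation has to finish first. Once the kernel image is loaded, the first thing to run is usually the bootstrap code at the very beginning of the image. It disables interrupts, sets up memory, performs other hardware initialization, and may also decompress a compressed kernel. This bootstrap code is architecture-dependent and usually lives as assembly code under arch/xxx/boot/ (where xxx is x86, arm, and so on). Strictly speaking, the bootstrap code is not really part of the Linux kernel: once it has finished getting the kernel loaded, it is no longer needed. Only after that does start_kernel() run and kernel-level initialization begin.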

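In short, the boot sequence after power-on usually looks like this:

- The bootloader loads the kernel image into memory.
- The bootstrap code at the front of the kernel image performs hardware initialization and other preparation, and decompresses the kernel.
- Finally, start_kernel() is called, which begins the long series of initialization steps that truly belong to the kernel level.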

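start_kernel() itself is a very large function. The main OS data structures, infrastructure and subsystems are all initialized from here, so tracing through start_kernel() gives a view of the whole OS. That is why I think start_kernel() is a very good starting point for anyone who wants to understand the Linux kernel.

This article does not go into the details of start_kernel() yet. It first gives an overview of the kernel initialization flow. The call graph at the top of this article gives a rough picture of what happens after start_kernel():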

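- Call setup_arch() for architecture-specific initialization.
  - setup_arch() is provided by each architecture, usually in arch/xxx/kernel/setup.c.
- Set up the interrupt vectors and initialize memory management, the scheduler, the virtual file system (VFS), and so on.
- At the end of start_kernel(), rest_init() is called. By this point the core of the OS has been initialized and the OS is basically able to run.
- rest_init() literally means "the rest of the initialization". It calls kernel_thread() twice to create two more kernel threads, kernel_init and kthreadd, and finally enters cpu_idle_loop() to become the idle process with pid = 0.
- kernel_init (pid = 1) continues with higher-level initialization, such as initializing drivers and opening the console. Finally, depending on the system configuration, it runs the corresponding init program (possibly /linuxrc or /init), or else one of the following default init programs, to finish bringing up the system:
  - /sbin/init
  - /etc/init
  - /bin/init
  - /bin/sh (this last one is there so that a broken system can be repaired)
- kthreadd (pid = 2) is a kernel daemon thread. It is the parent of all other kernel threads and handles requests to create them.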

Next, let's look at the actual code (kernel version 4.1.15). To keep the explanation manageable, the kernel code shown below is somewhat simplified.

init/main.c: start_kernel()

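Once the bootloader has loaded the kernel image into memory, and the image has been decompressed, the necessary hardware configured and the initial page tables set up, start_kernel() is called and kernel-level initialization begins.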

asmlinkage __visible void __init start_kernel(void)
{
    char * command_line;
    extern const struct kernel_param __start___param[], __stop___param[];

    /* ... init procedures, omitted ... */

    boot_cpu_init();

    setup_arch(&command_line);         // architecture-specific setup
    setup_command_line(command_line);  // store the untouched command line

    trap_init();  // architecture-specific, interrupt vector table, handle hardware traps, exceptions and faults.
    mm_init();    // memory management


    /*
     * Set up the scheduler prior starting any interrupts (such as the
     * timer interrupt). Full topology setup happens at smp_init()
     * time - but meanwhile we still have a functioning scheduler.
     */
    sched_init();

    init_IRQ();
    tick_init();
    init_timers();

    /* ... init procedures, omitted ... */


    /*
     * HACK ALERT! This is early. We're enabling the console before
     * we've done PCI setups etc, and console_init() must be aware of
     * this. But we do want output early, in case something goes wrong.
     */
    console_init();

    /* ... init procedures, omitted ... */


    sched_clock_init();

    /* ... init procedures, omitted ... */


    vfs_caches_init(totalram_pages);  // file system, including kernfs, sysfs, rootfs, mount tree

    proc_root_init();  // /proc, /proc/fs, /proc/driver, ...
    nsfs_init();
    cpuset_init();
    cgroup_init();

    /* ... init procedures, omitted ... */


    /* Do the rest non-__init'ed, we're now alive */
    rest_init();
}

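At this point the core infrastructure of the OS has been initialized and the OS is basically functional. As its very last step, start_kernel() calls rest_init(), which spawns another kernel thread, kernel_init, to carry on with higher-level system initialization.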

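rest_init() does four main things:

- Create the kernel thread kernel_init
- Create the kernel thread kthreadd
- Call schedule() at least once so that the newly created kernel threads get a chance to run
- Enter cpu_idle_loop() and become the idle process (pid = 0), which runs the idle task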

static noinline void __init_refok rest_init(void)
{
    int pid;

    rcu_scheduler_starting();
    smpboot_thread_init();
    /*
     * We need to spawn init first so that it obtains pid 1, however
     * the init task will end up wanting to create kthreads, which, if
     * we schedule it before we create kthreadd, will OOPS.
     */
    kernel_thread(kernel_init, NULL, CLONE_FS);
    numa_default_policy();
    pid = kernel_thread(kthreadd, NULL, CLONE_FS | CLONE_FILES);
    rcu_read_lock();
    kthreadd_task = find_task_by_pid_ns(pid, &init_pid_ns);
    rcu_read_unlock();
    complete(&kthreadd_done); // for synchronization. kernel_init_freeable() will wait for this signal

    /*
     * The boot idle thread must execute schedule()
     * at least once to get things moving:
     */
    init_idle_bootup_task(current); // set its scheduling class to idle_sched_class
    schedule_preempt_disabled();    // this function will call schedule()

    /* Call into cpu_idle with preempt disabled */
    cpu_startup_entry(CPUHP_ONLINE);
}

/**
 * schedule_preempt_disabled - called with preemption disabled
 *
 * Returns with preemption disabled. Note: preempt_count must be 1
 */
void __sched schedule_preempt_disabled(void)
{
    sched_preempt_enable_no_resched();  // Enable kernel preemption without checking for pending reschedules
    schedule();
    preempt_disable();  // Disables kernel preemption by incrementing the preemption counter
}

void cpu_startup_entry(enum cpuhp_state state)
{
    /* ... omitted ... */

    arch_cpu_idle_prepare();
    cpu_idle_loop();
}
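
The ordering in rest_init() can be modelled with a few lines of user-space C. This is only a toy sketch: every toy_* name is made up for illustration and does not exist in the kernel, and the "completion" is reduced to a plain flag. It shows why kernel_init ends up with pid 1 and kthreadd with pid 2 when they are created in that order by the boot task (pid 0).

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Toy model only: none of these toy_* names exist in the kernel. */
#define TOY_MAX_TASKS 8

struct toy_task {
    int pid;
    const char *name;
};

/* Slot 0 is the boot task, which later becomes the idle task. */
static struct toy_task toy_tasks[TOY_MAX_TASKS] = { { 0, "idle" } };
static int toy_nr_tasks = 1;
static bool toy_kthreadd_done;   /* stands in for the kthreadd_done completion */

/* Hand out the next free pid, like successive kernel_thread() calls. */
int toy_kernel_thread(const char *name)
{
    assert(toy_nr_tasks < TOY_MAX_TASKS);
    toy_tasks[toy_nr_tasks].pid = toy_nr_tasks;
    toy_tasks[toy_nr_tasks].name = name;
    return toy_nr_tasks++;
}

/* Mirror the order of the calls in rest_init(). */
void toy_rest_init(void)
{
    toy_kernel_thread("kernel_init");  /* created first, so it gets pid 1 */
    toy_kernel_thread("kthreadd");     /* created second, so it gets pid 2 */
    toy_kthreadd_done = true;          /* complete(&kthreadd_done) */
    /* the boot task would now call schedule() and enter the idle loop */
}

/* Look up a task by name; returns its pid, or -1 if it does not exist. */
int toy_find_pid(const char *name)
{
    for (int i = 0; i < toy_nr_tasks; i++)
        if (strcmp(toy_tasks[i].name, name) == 0)
            return toy_tasks[i].pid;
    return -1;
}

bool toy_kthreadd_is_done(void)
{
    return toy_kthreadd_done;
}
```

After toy_rest_init(), toy_find_pid("kernel_init") is 1 and toy_find_pid("kthreadd") is 2. In the real kernel the two threads run concurrently, which is why kernel_init_freeable() has to wait on kthreadd_done before it does anything that needs kthreadd.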

idle loop (pid=0)

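At the end of rest_init(), the boot task enters cpu_idle_loop() and becomes the idle process with pid = 0. Its job of initializing the system is done at this point.

The idle process has the lowest priority and only runs when the CPU truly has nothing else to do. It then puts the CPU to sleep: on x86 this is typically the hlt instruction, and on ARM the wfi (wait for interrupt) instruction.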

kernel/sched/idle.c: cpu_idle_loop()

static void cpu_idle_loop(void)
{
    while (1) {
        /*
         * If the arch has a polling bit, we maintain an invariant:
         *
         * Our polling bit is clear if we're not scheduled (i.e. if
         * rq->curr != rq->idle).  This means that, if rq->idle has
         * the polling bit set, then setting need_resched is
         * guaranteed to cause the cpu to reschedule.
         */

        __current_set_polling();
        tick_nohz_idle_enter();   // stop the idle tick from the idle task

        while (!need_resched()) {
            check_pgt_cache();
            rmb();      // read memory barrier.
                        // It ensures that no loads are reordered across the rmb() call.
                        // no loads prior to the call will be reordered to after the call
                        // and no loads after the call will be reordered to before the call.
                        // http://www.makelinux.net/books/lkd2/ch09lev1sec10

            if (cpu_is_offline(smp_processor_id())) {
                rcu_cpu_notify(NULL, CPU_DYING_IDLE,
                           (void *)(long)smp_processor_id());
                smp_mb(); /* all activity before dead. */
                this_cpu_write(cpu_dead_idle, true);
                arch_cpu_idle_dead();
            }

            local_irq_disable();
            arch_cpu_idle_enter();

            /*
             * In poll mode we reenable interrupts and spin.
             *
             * Also if we detected in the wakeup from idle
             * path that the tick broadcast device expired
             * for us, we don't want to go deep idle as we
             * know that the IPI is going to arrive right
             * away
             */
            if (cpu_idle_force_poll || tick_check_broadcast_expired())
                cpu_idle_poll();
            else
                cpuidle_idle_call();

            arch_cpu_idle_exit();
        }

        /*
         * Since we fell out of the loop above, we know
         * TIF_NEED_RESCHED must be set, propagate it into
         * PREEMPT_NEED_RESCHED.
         *
         * This is required because for polling idle loops we will
         * not have had an IPI to fold the state for us.
         */
        preempt_set_need_resched();
        tick_nohz_idle_exit();   // restart the idle tick from the idle task
        __current_clr_polling();

        /*
         * We promise to call sched_ttwu_pending and reschedule
         * if need_resched is set while polling is set.  That
         * means that clearing polling needs to be visible
         * before doing these things.
         */
        smp_mb__after_atomic();

        sched_ttwu_pending();
        schedule_preempt_disabled();
    }
}
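
Stripped of the tick, polling and hotplug details, the loop above has a simple shape: stay idle while nothing needs to be scheduled, then hand the CPU over. The sketch below is a toy user-space model of that shape only. The toy_* names and the idea of counting "ticks until a wakeup" are assumptions made for illustration, and the outer loop is bounded so that the example terminates.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: the toy_* names are hypothetical and are not kernel APIs. */
struct toy_cpu {
    int pending_ticks;     /* ticks left until a task becomes runnable */
    int idle_iterations;   /* how many times the CPU "slept" */
    int schedule_calls;    /* how many times the idle task yielded */
};

/* Stands in for need_resched(): true once something is runnable. */
bool toy_need_resched(const struct toy_cpu *cpu)
{
    return cpu->pending_ticks == 0;
}

/* Stands in for cpuidle_idle_call(): sleep until the next tick. */
void toy_idle_once(struct toy_cpu *cpu)
{
    cpu->idle_iterations++;
    cpu->pending_ticks--;
}

/* Same shape as cpu_idle_loop(), but bounded by `rounds`. */
void toy_idle_loop(struct toy_cpu *cpu, int rounds, int ticks_between_wakeups)
{
    assert(ticks_between_wakeups >= 0);
    for (int i = 0; i < rounds; i++) {
        cpu->pending_ticks = ticks_between_wakeups;
        while (!toy_need_resched(cpu))
            toy_idle_once(cpu);
        cpu->schedule_calls++;   /* schedule_preempt_disabled() in the kernel */
    }
}
```

With three rounds and four ticks between wakeups, the model idles twelve times and yields three times, one yield per wakeup.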

kernel_init (pid=1)

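kernel_init takes over the system-level initialization. Besides core hardware such as the CPU and memory, a system has many I/O peripherals that need OS support, and beyond the hardware there are software middle layers, such as file systems and network protocol handling, that need OS support too. kernel_init() is responsible for initializing all of these.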

init/main.c : kernel_init()

static int __ref kernel_init(void *unused)
{
    int ret;

    kernel_init_freeable();    // init drivers, modules, and open /dev/console
    /* need to finish all async __init code before freeing the memory */
    async_synchronize_full();  // waits until all asynchronous function calls have been done
    free_initmem();            // free .init section from memory
    mark_rodata_ro();          // mark rodata read-only
    system_state = SYSTEM_RUNNING;
    numa_default_policy();

    flush_delayed_fput();

    if (ramdisk_execute_command) {
            ret = run_init_process(ramdisk_execute_command);
            if (!ret)
                    return 0;
            pr_err("Failed to execute %s (error %d)\n",
                   ramdisk_execute_command, ret);
    }

    /*
     * We try each of these until one succeeds.
     *
     * The Bourne shell can be used instead of init if we are
     * trying to recover a really broken machine.
     */
    if (execute_command) {
            ret = run_init_process(execute_command);
            if (!ret)
                    return 0;
#ifndef CONFIG_INIT_FALLBACK
            panic("Requested init %s failed (error %d).",
                  execute_command, ret);
#else
            pr_err("Failed to execute %s (error %d).  Attempting defaults...\n",
                   execute_command, ret);
#endif
    }
    if (!try_to_run_init_process("/sbin/init") ||
        !try_to_run_init_process("/etc/init") ||
        !try_to_run_init_process("/bin/init") ||
        !try_to_run_init_process("/bin/sh"))
            return 0;

    panic("No working init found.  Try passing init= option to kernel. "
          "See Linux Documentation/init.txt for guidance.");
}
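
The fallback order at the end of kernel_init() can be sketched in user space as follows. This is an illustration under assumptions, not the kernel's code: pick_init() and pick_init_demo() are hypothetical helpers, and instead of exec'ing a candidate the sketch only checks with access() whether it is executable. It always falls through to the defaults when the init= candidate fails, which corresponds to the CONFIG_INIT_FALLBACK branch above.

```c
#include <assert.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: return the first usable init candidate, or NULL.
 * ramdisk_cmd plays the role of ramdisk_execute_command and init_arg the
 * role of execute_command; either may be NULL.
 */
const char *pick_init(const char *ramdisk_cmd, const char *init_arg)
{
    static const char *const defaults[] = {
        "/sbin/init", "/etc/init", "/bin/init", "/bin/sh",
    };

    /* 1. early userspace init found on the initial ramdisk */
    if (ramdisk_cmd && access(ramdisk_cmd, X_OK) == 0)
        return ramdisk_cmd;

    /* 2. explicitly requested init; on failure, try the defaults */
    if (init_arg && access(init_arg, X_OK) == 0)
        return init_arg;

    /* 3. built-in defaults, tried in order until one works */
    for (size_t i = 0; i < sizeof(defaults) / sizeof(defaults[0]); i++)
        if (access(defaults[i], X_OK) == 0)
            return defaults[i];

    return NULL;   /* the kernel panics with "No working init found." */
}

/* Usage example: whatever is picked must be executable. */
void pick_init_demo(void)
{
    const char *chosen = pick_init(NULL, NULL);

    assert(chosen == NULL || access(chosen, X_OK) == 0);
}
```

The real kernel_init() differs in one important way: run_init_process() replaces the calling thread with the new program, so on success it never returns and the later candidates are never looked at.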

init/main.c: kernel_init_freeable()

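"Freeable" literally means that it can be freed: the function is marked __init, so its memory can be released by free_initmem() once kernel_init() has finished with it. Its main job is to hook the system's peripherals and software middle layers into the OS and initialize them. The range of work kernel_init_freeable() covers is very broad. As the code below shows, it includes initializing devices, drivers and the rootfs, mounting virtual file systems such as /dev and /sys, and opening /dev/console for message output. Most of this work is done by do_basic_setup(). Digging into it takes quite some time, so for now we stay at the higher-level view of the overall initialization flow.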

static noinline void __init kernel_init_freeable(void)
{
    /*
     * Wait until kthreadd is all set-up.
     */
    wait_for_completion(&kthreadd_done);

    /* Now the scheduler is fully set up and can do blocking allocations */
    gfp_allowed_mask = __GFP_BITS_MASK;

    /*
     * init can allocate pages on any node
     */
    set_mems_allowed(node_states[N_MEMORY]);
    /*
     * init can run on any cpu.
     */
    set_cpus_allowed_ptr(current, cpu_all_mask);

    cad_pid = task_pid(current);

    smp_prepare_cpus(setup_max_cpus);

    do_pre_smp_initcalls();
    lockup_detector_init();

    smp_init();
    sched_init_smp();

    do_basic_setup();

    /* Open the /dev/console on the rootfs, this should never fail */
    if (sys_open((const char __user *) "/dev/console", O_RDWR, 0) < 0)
        pr_err("Warning: unable to open an initial console.\n");

    (void) sys_dup(0);
    (void) sys_dup(0);
    /*
     * check if there is an early userspace init.  If yes, let it do all
     * the work
     */

    if (!ramdisk_execute_command)
        ramdisk_execute_command = "/init";

    if (sys_access((const char __user *) ramdisk_execute_command, 0) != 0) {
        ramdisk_execute_command = NULL;
        prepare_namespace();
    }

    /*
     * Ok, we have completed the initial bootup, and
     * we're essentially up and running. Get rid of the
     * initmem segments and start the user-mode stuff..
     *
     * rootfs is available now, try loading the public keys
     * and default modules
     */

    integrity_load_keys();
    load_default_modules();
}
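
The sys_open() plus two sys_dup(0) calls above are what give the future init process its standard input, output and error: at that point the thread has no open files, so the open returns descriptor 0 and the two duplicates become 1 and 2. The user-space sketch below shows the same open-once, duplicate-twice pattern. The helper names are made up for illustration, and in an ordinary process the descriptor numbers will be higher because 0 to 2 are already in use.

```c
#include <assert.h>
#include <fcntl.h>
#include <stddef.h>
#include <sys/stat.h>
#include <unistd.h>

/*
 * Hypothetical helper: open `path` once and duplicate it twice.
 * Returns 0 on success and -1 on failure; fds[] receives the descriptors.
 */
int open_console_like(const char *path, int fds[3])
{
    assert(path != NULL && fds != NULL);

    fds[0] = open(path, O_RDWR);
    if (fds[0] < 0)
        return -1;

    fds[1] = dup(fds[0]);
    fds[2] = dup(fds[0]);
    if (fds[1] < 0 || fds[2] < 0) {
        /* undo whatever succeeded so that no descriptor leaks */
        if (fds[1] >= 0)
            close(fds[1]);
        if (fds[2] >= 0)
            close(fds[2]);
        close(fds[0]);
        return -1;
    }
    return 0;
}

/* Returns 1 if both descriptors refer to the same file, 0 otherwise. */
int same_file(int a, int b)
{
    struct stat sa, sb;

    if (fstat(a, &sa) != 0 || fstat(b, &sb) != 0)
        return 0;
    return sa.st_dev == sb.st_dev && sa.st_ino == sb.st_ino;
}
```

Calling open_console_like("/dev/null", fds) yields three distinct descriptors that all refer to the same file, which is exactly the relationship between stdin, stdout and stderr of the first user-space process.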

init/main.c: do_basic_setup()

In fact, most of the initialization work is done inside this function; do_basic_setup() is anything but simple.

/*
 * Ok, the machine is now initialized. None of the devices
 * have been touched yet, but the CPU subsystem is up and
 * running, and memory and process management works.
 *
 * Now we can finally start doing some real work..
 */
static void __init do_basic_setup(void)
{
    cpuset_init_smp();
    usermodehelper_init();
    shmem_init();
    driver_init();           // init driver model. (kobject, kset)
    init_irq_proc();
    do_ctors();              // call constructor functions in .ctors section
    usermodehelper_enable();
    do_initcalls();          // call init functions in the .initcall0.init ... .initcall7.init sections, in level order
    random_int_secret_init();
}

/**
 * driver_init - initialize driver model.
 *
 * Call the driver model init functions to initialize their
 * subsystems. Called early from init/main.c.
 */
void __init driver_init(void)
{
    /* These are the core pieces */
    devtmpfs_init();  // register devtmpfs; its kdevtmpfs thread mounts it at "/" in a private namespace
    devices_init();
    buses_init();
    classes_init();
    firmware_init();
    hypervisor_init();

    /* These are also core pieces, but must come after the
     * core core pieces.
     */
    platform_bus_init();
    cpu_dev_init();
    memory_dev_init();
    container_dev_init();
    of_core_init();
}
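
do_initcalls() is where most drivers and subsystems actually get initialized: it walks the init functions level by level and calls each one. The sketch below is a toy user-space model of that idea. It is an assumption-laden illustration: the kernel collects the function pointers through linker sections, while this sketch uses hand-written NULL-terminated tables, and all toy_* names and the four level names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy model of level-ordered init calls; the toy_* names are hypothetical. */
typedef int (*toy_initcall_t)(void);

#define TOY_LOG_MAX 16
static const char *toy_log[TOY_LOG_MAX];
static size_t toy_log_len;

/* Remember which init function ran, in call order. */
static int toy_record(const char *what)
{
    assert(toy_log_len < TOY_LOG_MAX);
    toy_log[toy_log_len++] = what;
    return 0;
}

static int toy_core_init(void)   { return toy_record("core"); }
static int toy_subsys_init(void) { return toy_record("subsys"); }
static int toy_device_init(void) { return toy_record("device"); }
static int toy_late_init(void)   { return toy_record("late"); }

/* One NULL-terminated table per level. */
static toy_initcall_t toy_level_core[]   = { toy_core_init, NULL };
static toy_initcall_t toy_level_subsys[] = { toy_subsys_init, NULL };
static toy_initcall_t toy_level_device[] = { toy_device_init, NULL };
static toy_initcall_t toy_level_late[]   = { toy_late_init, NULL };

static toy_initcall_t *toy_levels[] = {
    toy_level_core, toy_level_subsys, toy_level_device, toy_level_late,
};

/* Walk the levels in order and call every function; returns the call count. */
size_t toy_do_initcalls(void)
{
    size_t calls = 0;

    for (size_t lvl = 0; lvl < sizeof(toy_levels) / sizeof(toy_levels[0]); lvl++) {
        for (toy_initcall_t *fn = toy_levels[lvl]; *fn; fn++) {
            (*fn)();
            calls++;
        }
    }
    return calls;
}

/* Returns 1 if the i-th recorded call was `what`, 0 otherwise. */
int toy_log_is(size_t i, const char *what)
{
    return i < toy_log_len && strcmp(toy_log[i], what) == 0;
}
```

The point of the levels is ordering: a device driver registered at a later level can rely on the bus and subsystem code from earlier levels having already run, without the two knowing about each other.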

kthreadd (pid=2)

kernel/kthread.c: kthreadd()

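kthreadd is a kernel daemon thread and the parent of all other kernel threads. It services the kernel-thread creation requests that kthread_create_on_node() queues on kthread_create_list.
When kthread_create_list is empty, kthreadd sets its own state to TASK_INTERRUPTIBLE and yields the CPU.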

struct kthread_create_info
{
    /* Information passed to kthread() from kthreadd. */
    int (*threadfn)(void *data);
    void *data;
    int node;

    /* Result passed back to kthread_create() from kthreadd. */
    struct task_struct *result;
    struct completion *done;

    struct list_head list;
};

int kthreadd(void *unused)
{
    struct task_struct *tsk = current;

    /* Setup a clean context for our children to inherit. */
    set_task_comm(tsk, "kthreadd");
    ignore_signals(tsk);
    set_cpus_allowed_ptr(tsk, cpu_all_mask);
    set_mems_allowed(node_states[N_MEMORY]);

    current->flags |= PF_NOFREEZE;

    for (;;) {
        set_current_state(TASK_INTERRUPTIBLE);
        if (list_empty(&kthread_create_list))  // if no kthread create request
            schedule();                        // yield CPU
        __set_current_state(TASK_RUNNING);

        spin_lock(&kthread_create_lock);
        while (!list_empty(&kthread_create_list)) {  // handle all kthread create requests
            struct kthread_create_info *create;

            create = list_entry(kthread_create_list.next,
                        struct kthread_create_info, list);
            list_del_init(&create->list);   // remove the entry from list
            spin_unlock(&kthread_create_lock);

            create_kthread(create);

            spin_lock(&kthread_create_lock);
        }
        spin_unlock(&kthread_create_lock);
    }

    return 0;
}
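
The heart of kthreadd() is a producer/consumer queue: callers append creation requests and one service loop drains them in order. The following single-threaded sketch models just that queue. It is a simplification under stated assumptions: the toy_* names are hypothetical, there is no locking or sleeping, and instead of forking a new thread the consumer simply calls the requested function and stores its return value.

```c
#include <assert.h>
#include <stddef.h>

/* Toy model: a FIFO of creation requests drained by one service routine. */
struct toy_create_info {
    int (*threadfn)(void *data);
    void *data;
    int result;                      /* filled in by the service routine */
    struct toy_create_info *next;
};

static struct toy_create_info *toy_head, *toy_tail;

/* Producer side: append a request to the tail of the queue. */
void toy_queue_request(struct toy_create_info *req)
{
    assert(req != NULL && req->threadfn != NULL);
    req->next = NULL;
    if (toy_tail)
        toy_tail->next = req;
    else
        toy_head = req;
    toy_tail = req;
}

/* Consumer side: drain the queue; returns the number of requests handled. */
int toy_service_requests(void)
{
    int handled = 0;

    while (toy_head) {
        struct toy_create_info *req = toy_head;

        toy_head = req->next;        /* unlink, like list_del_init() */
        if (!toy_head)
            toy_tail = NULL;
        req->next = NULL;

        /* the kernel creates a thread here; the toy just calls the function */
        req->result = req->threadfn(req->data);
        handled++;
    }
    return handled;
}

/* Example "thread function" used to exercise the queue. */
int toy_double(void *data)
{
    return *(int *)data * 2;
}
```

Queueing two toy_double requests and calling toy_service_requests() handles both in FIFO order, and a second call returns 0 because the queue is empty, which is the situation in which the real kthreadd goes to sleep.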

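Summary

This article gave an outline of what the Linux kernel does after start_kernel(). After power-on, booting is essentially one long chain of initialization steps, from low level to high level, and the high-level part is often adjusted to suit the use case of each particular system.

When I want to understand a system, I like to start from the big-picture program flow and only then dig into the implementation details; for me, that is the better order. When reading the details of an implementation, I want to know where I am in the overall architecture and what is upstream and downstream of the data, because that makes it easier to understand how the program works and why it was implemented that way. That is why this article covers the overall flow of kernel startup first. Later articles will look at each part of the Linux initialization flow in more depth.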

Reference

Kernel doc:

x86 booting: https://www.kernel.org/doc/Documentation/x86/boot.txt
arm booting: https://www.kernel.org/doc/Documentation/arm/Booting

Daniel Jslin
