start_kernel() in Depth, Made Simple

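After the bootloader loads the kernel image (zImage or bzImage), the bootstrap code at the very start of the image disables interrupts, performs hardware initialization such as memory setup, and finally decompresses the kernel. The details differ from one hardware architecture to another; the general flow is shown in the figure below:

The main job of the bootstrap code that runs before start_kernel() is to prepare the environment so that it meets start_kernel()'s requirements. It then jumps to start_kernel(), which begins the generic, architecture-independent initialization.

start_kernel() is the kernel's main entry point; from here the kernel walks through the initialization functions of each subsystem. Below is start_kernel() from version 4.1.15, which calls roughly a hundred functions. If you can understand every one of them, I think you understand the Linux kernel fairly well. Each call below is annotated with a comment to make its role easier to follow.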

asmlinkage __visible void __init start_kernel(void)
{
    char *command_line;  // a pointer to the kernel command line
    char *after_dashes;  // a pointer to the kernel command line after "--", which will be passed to the init process

    /*
     * Need to run as early as possible, to initialize the
     * lockdep hash:
     */
    lockdep_init();             // lock dependency validator, https://www.kernel.org/doc/Documentation/locking/lockdep-design.txt
    set_task_stack_end_magic(&init_task);  // put a magic number at the end of init_task's stack for overflow detection
    smp_setup_processor_id();   // assign SMP CPU id. archs can override it.
    debug_objects_early_init(); // infrastructure for lifetime debugging of objects, https://lwn.net/Articles/271614/

    /*
     * Set up the the initial canary ASAP:
     */
    boot_init_stack_canary();   // stack smashing protector, http://wiki.osdev.org/Stack_Smashing_Protector

    cgroup_init_early();        // initialize cgroup subsystems, https://en.wikipedia.org/wiki/Cgroups

    local_irq_disable();        // disable IRQ first because interrupt vector table has not been setup yet
    early_boot_irqs_disabled = true;

/*
 * Interrupts are still disabled. Do necessary setups, then
 * enable them
 */
    boot_cpu_init();                  // activate the first processor. mark the boot cpu "present", "online" etc for SMP and UP case
    page_address_init();              // initializes page_address_htable
    pr_notice("%s", linux_banner);
    setup_arch(&command_line);        // architecture-specific setup
    mm_init_cpumask(&init_mm);        // => cpumask_clear(mm->cpu_vm_mask_var), for lazy TLB switches
    setup_command_line(command_line); // store the untouched command line
    setup_nr_cpu_ids();               // set "nr_cpu_ids" according to the last bit in possible mask
    setup_per_cpu_areas();            // per cpu memory allocator
    smp_prepare_boot_cpu(); /* arch-specific boot-cpu hooks */

    build_all_zonelists(NULL, NULL);  // memory zones, https://www.kernel.org/doc/gorman/html/understand/understand005.html
    page_alloc_init();                // add a handler for CPU hotplug

    pr_notice("Kernel command line: %s\n", boot_command_line);
    parse_early_param();              // parse options for early_param()
    after_dashes = parse_args("Booting kernel",
                  static_command_line, __start___param,  // parse options for module_param(), module_param_named(), core_param()
                  __stop___param - __start___param,
                  -1, -1, &unknown_bootoption);          // parse options for __setup()
    if (!IS_ERR_OR_NULL(after_dashes))
        parse_args("Setting init args", after_dashes, NULL, 0, -1, -1,  // after_dashes will be passed to the init process as argv
               set_init_arg);

    jump_label_init();         // Jump label: https://lwn.net/Articles/412072/

    /*
     * These use large bootmem allocations and must precede
     * kmem_cache_init()
     */
    setup_log_buf(0);          // buf for printk
    pidhash_init();            // pid hash table
    vfs_caches_init_early();   // allocate and initialize the dcache and inode hash tables
    sort_main_extable();       // sort the kernel's built-in exception table (for page faults)
    trap_init();               // architecture-specific, interrupt vector table, handle hardware traps, exceptions and faults.
    mm_init();                 // memory management

    /*
     * Set up the scheduler prior starting any interrupts (such as the
     * timer interrupt). Full topology setup happens at smp_init()
     * time - but meanwhile we still have a functioning scheduler.
     */
    sched_init();
    /*
     * Disable preemption - early bootup scheduling is extremely
     * fragile until we cpu_idle() for the first time.
     */
    preempt_disable();
    if (WARN(!irqs_disabled(), "Interrupts were enabled *very* early, fixing it\n"))
        local_irq_disable();
    idr_init_cache();          // idr: ID radix, sparse array indexed by the id to obtain the pointer
    rcu_init();                // Read-Copy Update mechanism, https://www.kernel.org/doc/Documentation/RCU/

    /* trace_printk() and trace points may be used after this */
    trace_init();              // https://www.kernel.org/doc/Documentation/trace/

    context_tracking_init();   // prepare for using a static key in the context tracking subsystem
    radix_tree_init();         // allocate a cache for radix_tree. [LWN] radix_tree: https://lwn.net/Articles/175432/
    /* init some links before init_ISA_irqs() */
    early_irq_init();          // allocate caches for irq_desc, interrupt descriptor
    init_IRQ();                // architecture-specific, initialize kernel's interrupt subsystem and the interrupt controllers.
    tick_init();               // initialize the tick control
    rcu_init_nohz();
    init_timers();             // init timer stats, register cpu notifier, and open softirq for timer
    hrtimers_init();           // high-resolution timer, https://www.kernel.org/doc/Documentation/timers/hrtimers.txt
    softirq_init();            // initialize tasklet_vec and open softirq for tasklet
    timekeeping_init();        // https://www.kernel.org/doc/Documentation/timers/timekeeping.txt
    time_init();               // architecture-specific, timer initialization
    sched_clock_postinit();    // start the high-resolution timer to keep sched_clock() properly updated and sets the initial epoch
    perf_event_init();         // perf is a profiler tool for Linux, https://perf.wiki.kernel.org/index.php/Tutorial
    profile_init();            // initializes basic kernel profiler
    call_function_init();      // SMP initializes call_single_queue and register notifier
    WARN(!irqs_disabled(), "Interrupts were enabled early\n");
    early_boot_irqs_disabled = false;
    local_irq_enable();        // after this point, interrupts are enabled

    kmem_cache_init_late();    // post-initialization of cache (slab)

    /*
     * HACK ALERT! This is early. We're enabling the console before
     * we've done PCI setups etc, and console_init() must be aware of
     * this. But we do want output early, in case something goes wrong.
     */
    console_init();            // call console initcalls to initialize the console device, usually it's tty device.
    if (panic_later)
        panic("Too many boot %s vars at `%s'", panic_later,
              panic_param);

    lockdep_info();            // print lockdep information

    /*
     * Need to run this when irqs are enabled, because it wants
     * to self-test [hard/soft]-irqs on/off lock inversion bugs
     * too:
     */
    locking_selftest();        // test various locking APIs: spinlocks, rwlocks, mutexes, and rwsemaphores

#ifdef CONFIG_BLK_DEV_INITRD
    if (initrd_start && !initrd_below_start_ok &&
        page_to_pfn(virt_to_page((void *)initrd_start)) < min_low_pfn) {
        pr_crit("initrd overwritten (0x%08lx < 0x%08lx) - disabling it.\n",
            page_to_pfn(virt_to_page((void *)initrd_start)),
            min_low_pfn);
        initrd_start = 0;
    }
#endif
    page_ext_init();           // memory page extension, allocates memory for extended data per page
    debug_objects_mem_init();  // allocate a dedicated cache pool for debug objects
    kmemleak_init();           // initialize kmemleak (memory leak check facility)
    setup_per_cpu_pageset();   // allocate and initialize per cpu pagesets
    numa_policy_init();        // allocate caches and do initialization for NUMA memory policy
    if (late_time_init)        // default late_time_init is NULL. archs can override it
        late_time_init();      // architecture-specific
    sched_clock_init();        // set the time info for scheduler and make sched clock running
    calibrate_delay();         // calibrate the delay loop
    pidmap_init();             // initialize PID map for initial PID namespace
    anon_vma_init();           // allocate a cache for "anon_vma" (anonymous memory), http://lwn.net/Kernel/Index/#anon_vma
    acpi_early_init();         // initialize ACPI subsystem and populate the ACPI namespace
#ifdef CONFIG_X86
    if (efi_enabled(EFI_RUNTIME_SERVICES))  // Extensible Firmware Interface
        efi_enter_virtual_mode();           // switch EFI to virtual mode, if possible
#endif
#ifdef CONFIG_X86_ESPFIX64
    /* Should be run before the first non-init thread is created */
    init_espfix_bsp();         // workaround to prevent leaking of 31:16 bits of the esp register, https://github.com/torvalds/linux/commit/3891a04aafd668686239349ea58f3314ea2af86b
#endif
    thread_info_cache_init();  // allocate cache for thread_info if THREAD_SIZE < PAGE_SIZE
    cred_init();               // credential
    fork_init();               // allocate a cache for task_struct
    proc_caches_init();        // allocate caches for sighand_struct, signal_struct, files_struct, fs_struct, mm_struct, and vm_area_struct
    buffer_init();             // allocate a cache for buffer_head
    key_init();                // initialize the authentication token and access key management
    security_init();           // initialize the security framework, do_security_initcalls
    dbg_late_init();           // late init for kgdb
    vfs_caches_init(totalram_pages);  // file system, including kernfs, sysfs, rootfs, mount tree
    signals_init();            // allocate a cache for sigqueue
    /* rootfs populating might need page-writeback */
    page_writeback_init();     // set the ratio limits for the dirty pages
    proc_root_init();          // initializes /proc filesystem, and creates several standard entries like /proc/fs and /proc/driver
    nsfs_init();               // mount pseudo-filesystem: nsfs
    cpuset_init();             // initialize top_cpuset and the cpuset internal file system
    cgroup_init();             // initialize the rest of cgroups
    taskstats_init_early();    // allocate a cache and initialize per-task statistics
    delayacct_init();          // per-task delay accounting

    check_bugs();              // check for architecture-dependent bugs

    acpi_subsystem_init();     // enable ACPI subsystem
    sfi_init_late();           // SFI: Simple Firmware Interface. Map SFI tables again by using ioremap

    if (efi_enabled(EFI_RUNTIME_SERVICES)) {  // Extensible Firmware Interface
        efi_late_init();
        efi_free_boot_services();
    }

    ftrace_init();             // function trace, https://www.kernel.org/doc/Documentation/trace/ftrace.txt

    /* Do the rest non-__init'ed, we're now alive */
    rest_init();
}

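Notes on start_kernel()

The following briefly introduces the kernel features that start_kernel() touches. I am only a hobbyist interested in the Linux kernel; I do not do kernel development for a living, and my knowledge of many kernel topics is shallow. The notes below therefore focus on concepts and do not go deep into details.

At the very beginning of start_kernel(), the kernel initializes its debugging facilities. Apart from smp_setup_processor_id(), the other four calls all set up mechanisms the kernel uses for debugging.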

lockdep_init();
set_task_stack_end_magic(&init_task);
smp_setup_processor_id();
debug_objects_early_init();
boot_init_stack_canary();

lockdep

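lockdep is the kernel's mechanism for detecting deadlocks. It checks for locking patterns that can lead to deadlock, such as recursive locking (AA) and inconsistent lock ordering (AB-BA). When it detects a problem it prints a warning like this: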

modprobe/2287 is trying to acquire lock:
 (&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24

but task is already holding lock:
 (&sio_locks[i].lock){-.-...}, at: [<c02867fd>] mutex_lock+0x21/0x24

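Deadlocks are usually hard to reproduce, which makes them hard to debug. The strength of lockdep is that it does not wait for a deadlock to actually happen: it is designed to warn as soon as it sees a locking sequence that could lead to one.

However, because the kernel takes and releases locks extremely often, lockdep's checks inevitably slow the system down. Production kernels do not enable them; lockdep is normally turned on only for internal testing, by setting CONFIG_LOCKDEP (usually selected through CONFIG_PROVE_LOCKING).

lockdep proved so useful that it was later also made available as a standalone tool under tools/lib/lockdep/, so that user-space applications can use it to check for deadlocks too.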
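To make the AA and AB-BA checks concrete, here is a minimal user-space sketch of the idea. It is not how lockdep is implemented: the names toy_lock_acquire() and toy_lock_release(), the fixed-size tables, and the single implicit thread are all assumptions made for illustration, and it only catches direct order inversions, not longer dependency chains.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_LOCK_CLASSES 8   /* hypothetical limits for this sketch */
#define MAX_HELD         8

/* seen_before[a][b] is true once lock a was held while b was acquired. */
static bool seen_before[MAX_LOCK_CLASSES][MAX_LOCK_CLASSES];
static int held[MAX_HELD];
static int nr_held;

/* Returns false if acquiring lock_id could deadlock (AA or AB-BA). */
bool toy_lock_acquire(int lock_id)
{
    assert(lock_id >= 0 && lock_id < MAX_LOCK_CLASSES);
    assert(nr_held < MAX_HELD);

    for (int i = 0; i < nr_held; i++) {
        if (held[i] == lock_id)
            return false;               /* AA: lock is already held */
        if (seen_before[lock_id][held[i]])
            return false;               /* AB-BA: opposite order seen earlier */
    }
    for (int i = 0; i < nr_held; i++)
        seen_before[held[i]][lock_id] = true;   /* remember held -> new order */
    held[nr_held++] = lock_id;
    return true;
}

void toy_lock_release(int lock_id)
{
    for (int i = 0; i < nr_held; i++) {
        if (held[i] == lock_id) {
            held[i] = held[--nr_held];  /* order of the held set is irrelevant */
            return;
        }
    }
}
```

Note that the warning fires on the second ordering even though no deadlock has happened: taking lock 0 then lock 1 once, releasing both, and later taking lock 1 then lock 0 is enough for toy_lock_acquire() to return false.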

set_task_stack_end_magic()

set_task_stack_end_magic() writes a magic value at the end of a task's stack. If this value ever changes, something has written past the stack, which means a stack overflow has occurred.

#define STACK_END_MAGIC         0x57AC6E9D

void set_task_stack_end_magic(struct task_struct *tsk)
{
    unsigned long *stackend;

    stackend = end_of_stack(tsk);
    *stackend = STACK_END_MAGIC;    /* for overflow detection */
}
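The check that goes with this magic value can be sketched in user space as follows. The toy_stack type, its size, and the helper names are hypothetical; the sketch only models a stack that grows downward towards word 0, where the magic value sits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define STACK_END_MAGIC 0x57AC6E9DUL
#define TOY_STACK_WORDS 64   /* hypothetical stack size for this sketch */

/* A toy downward-growing stack: word 0 is the end (lowest address). */
struct toy_stack {
    unsigned long words[TOY_STACK_WORDS];
};

void toy_set_stack_end_magic(struct toy_stack *s)
{
    s->words[0] = STACK_END_MAGIC;      /* for overflow detection */
}

bool toy_stack_end_corrupted(const struct toy_stack *s)
{
    return s->words[0] != STACK_END_MAGIC;
}

/* Simulate pushing n words from the top; usage grows towards word 0. */
void toy_push_words(struct toy_stack *s, size_t n)
{
    assert(n <= TOY_STACK_WORDS);
    for (size_t i = 0; i < n; i++)
        s->words[TOY_STACK_WORDS - 1 - i] = 0;
}
```

As long as fewer than TOY_STACK_WORDS words are pushed, the magic value survives; pushing all of them overwrites word 0 and toy_stack_end_corrupted() reports the overflow.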

debug objects

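debug objects is a facility for debugging the lifetime management of kernel objects. According to its original author, Thomas Gleixner, kernel development frequently runs into object lifetime bugs such as:

Using an object that has already been freed
Reinitializing an object that is still active

Such problems are hard to debug because the place where the failure shows up is often not the root cause. With a use-after-free, for example, the real problem may be that the object was freed at the wrong time.

debug objects is designed to track exactly this class of problems. A subsystem only needs to add debug objects calls to the corresponding functions to have the lifetime of its objects tracked. It was originally designed to track timer lifetimes; besides timers, workqueues now use it as well.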
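The core idea, tracking a state per object and flagging illegal transitions, can be sketched like this. The states, error codes, and toy_debug_* names are hypothetical and much simpler than the kernel's actual debug objects interface.

```c
#include <assert.h>

enum toy_obj_state { TOY_NONE, TOY_INIT, TOY_ACTIVE, TOY_FREED };

enum toy_obj_error {
    TOY_OK,
    TOY_ERR_INIT_ACTIVE,      /* reinitializing an object that is active */
    TOY_ERR_USE_AFTER_FREE,   /* using an object that was already freed */
    TOY_ERR_NOT_INIT          /* using an object that was never initialized */
};

/* Tracking record kept alongside (not inside) the real object. */
struct toy_tracked {
    enum toy_obj_state state;
};

enum toy_obj_error toy_debug_init(struct toy_tracked *t)
{
    assert(t);
    if (t->state == TOY_ACTIVE)
        return TOY_ERR_INIT_ACTIVE;
    t->state = TOY_INIT;
    return TOY_OK;
}

enum toy_obj_error toy_debug_activate(struct toy_tracked *t)
{
    assert(t);
    if (t->state == TOY_FREED)
        return TOY_ERR_USE_AFTER_FREE;
    if (t->state == TOY_NONE)
        return TOY_ERR_NOT_INIT;
    t->state = TOY_ACTIVE;
    return TOY_OK;
}

void toy_debug_free(struct toy_tracked *t)
{
    assert(t);
    t->state = TOY_FREED;
}
```

Because the error is reported at the moment of the illegal transition, the report points at the offending call rather than at a crash that happens much later.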

boot_init_stack_canary()

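The stack canary here serves a purpose similar to the stack end magic above: it is a value placed on the stack so that an overwrite can be detected. boot_init_stack_canary() sets a non-fixed canary value to guard against stack overflow attacks. The kernel only sets this value, though; the actual checking is implemented by gcc. With the -fstack-protector option, gcc places the canary in the stack frame of protected functions and inserts code that checks, before the function returns, whether the value has been overwritten. For a detailed explanation, I recommend IBM developerWorks: GCC 中的编译器堆栈保护技术.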
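The prologue and epilogue checks that the compiler generates can be imitated by hand. The sketch below models a stack frame as a plain byte array with the layout [local buffer][canary][return address slot]; the layout, sizes, and toy_* names are assumptions for illustration, not what gcc actually emits.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define TOY_BUF_SIZE   16
#define TOY_FRAME_SIZE (TOY_BUF_SIZE + 2 * sizeof(unsigned long))

/* Function prologue: store the canary right after the local buffer. */
void toy_frame_enter(unsigned char *frame, unsigned long canary)
{
    memcpy(frame + TOY_BUF_SIZE, &canary, sizeof(canary));
}

/* An unchecked copy into the local buffer, like a careless strcpy(). */
void toy_unsafe_copy(unsigned char *frame, const void *src, size_t len)
{
    assert(len <= TOY_FRAME_SIZE);   /* keep the sketch itself in bounds */
    memcpy(frame, src, len);
}

/* Function epilogue: the frame is intact only if the canary is unchanged. */
bool toy_frame_exit_ok(const unsigned char *frame, unsigned long canary)
{
    return memcmp(frame + TOY_BUF_SIZE, &canary, sizeof(canary)) == 0;
}
```

A copy of exactly TOY_BUF_SIZE bytes leaves the canary alone; a copy that is even a few bytes longer has to pass through the canary before it can reach the return address slot, which is why the check catches it. Using a value that differs on every boot makes it hard for an attacker to write the correct canary back.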

Up to this point the kernel has only been initializing debugging features. After boot_init_stack_canary() comes a somewhat odd one, cgroup_init_early(), which is less familiar to most of us.

cgroup (control group)

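cgroup is short for control group. Its main purpose is to limit the system resources that processes may use, for example restricting a process to 20% of the CPU or to 256MB of RAM.

The cgroup subsystem consists of the cgroup core and a set of controllers. Common cgroup controllers include:

cpu: limits the CPU usage of tasks in the control group.
cpuset: assigns the CPUs and memory nodes used by the control group.
cpuacct: CPU accounting controller; reports the CPU usage of tasks in the control group.
memory: limits the memory available to tasks in the control group and reports memory usage.
devices: allows or denies access to specific devices for tasks in the control group.
freezer: suspends or resumes tasks in the control group.
blkio: sets I/O limits on block devices, restricting access or bandwidth.
net_cls: net class controller; tags network packets with a classid so that the traffic controller (tc) can identify packets originating from a particular control group.
ns: namespace; isolates tasks in the control group using namespaces.

cgroups are organized as a hierarchy. The resources available to a lower level are inherited from the level above, so a parent limits what its children can use. In the diagram below, top_cgroup (root) owns 100% of the system resources; if group A is limited to 50% of the CPU, then group B and group C cannot use more than 50% of the CPU either.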

            top_cgroup (root)
               /      \
        group A        group D
       /       \
group B         group C
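The way a parent caps its children can be sketched as a walk from a group up to the root, taking the tightest limit on the way. The toy_cgroup structure and the percentage field are hypothetical; real controllers each have their own accounting and are considerably more involved.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node in a cgroup hierarchy with a CPU limit in percent. */
struct toy_cgroup {
    const char *name;
    int cpu_limit_pct;                 /* this group's own limit */
    const struct toy_cgroup *parent;   /* NULL for the root */
};

/* The effective limit is the smallest limit on the path to the root. */
int toy_effective_cpu_limit(const struct toy_cgroup *cg)
{
    int limit = 100;

    assert(cg);
    for (; cg; cg = cg->parent) {
        if (cg->cpu_limit_pct < limit)
            limit = cg->cpu_limit_pct;
    }
    return limit;
}
```

With the hierarchy from the diagram and group A set to 50, a group B configured with 80 still ends up with an effective limit of 50, while group D, which hangs directly off the root, is unaffected.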

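cgroup first entered the kernel in version 2.6.24 in 2008, but the design of this first version was widely criticized as too complex. Tejun Heo therefore started a discussion in 2012 proposing a redesign, and the cgroup v2 work he led was merged into kernel versions 3.15 and 3.16 in 2014.

The new version is called cgroup v2. Unlike the old cgroup v1, it allows only a single hierarchy, which simplifies the design.

In start_kernel(), cgroup initialization is split into two steps: cgroup_init_early() and cgroup_init(). cgroup_init_early() sets up the root cgroup and initializes the subsystems that need early initialization; cgroup_init() later completes the initialization of the rest of the cgroup subsystem.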

/**
 * cgroup_init_early - cgroup initialization at system boot
 *
 * Initialize cgroups at system boot, and initialize any
 * subsystems that request early init.
 */
int __init cgroup_init_early(void)
{
    static struct cgroup_sb_opts __initdata opts;
    struct cgroup_subsys *ss;
    int i;

    init_cgroup_root(&cgrp_dfl_root, &opts);
    cgrp_dfl_root.cgrp.self.flags |= CSS_NO_REF;

    RCU_INIT_POINTER(init_task.cgroups, &init_css_set);

    for_each_subsys(ss, i) {

        ... omitted ...

        ss->id = i;
        ss->name = cgroup_subsys_name[i];

        if (ss->early_init)
            cgroup_init_subsys(ss, /* early= */ true);
    }
    return 0;
}

CPU mask

static void __init boot_cpu_init(void)
{
    int cpu = smp_processor_id();
    /* Mark the boot cpu "present", "online" etc for SMP and UP case */
    set_cpu_online(cpu, true);   // this CPU is available for scheduling
    set_cpu_active(cpu, true);   // this CPU is available for task migration (load balancing)
    set_cpu_present(cpu, true);  // this CPU is currently plugged in
    set_cpu_possible(cpu, true); // this CPU can be plugged in
}

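The Linux kernel keeps four main CPU masks that record the state of the CPUs:

cpu_possible_mask: CPUs that the hardware can provide (that can be plugged in); fixed at boot time
cpu_present_mask: CPUs currently assigned for use (currently plugged in)
cpu_online_mask: CPUs available for scheduling (CPUs other than the boot CPU are set online after smp_init() has finished initializing them)
cpu_active_mask: CPUs available for domain/group based scheduling (load balancing)

If CONFIG_HOTPLUG_CPU is set, the kernel enables its CPU hotplug support. Hotplug here does not mean physically inserting or removing hardware; it means the system can decide dynamically which CPUs to use. The system can choose the CPUs in use by setting cpu_present_mask; for example, it can switch off some CPUs under low load to save power. Apart from cpu_present_mask, which can be set from outside, the other three masks are read-only and maintained by the kernel.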
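The relationship between the four masks can be sketched with plain bitmasks. The toy_cpu_masks structure, the 8-CPU limit, and the toy_cpu_up()/toy_cpu_down() helpers are assumptions for illustration; the kernel uses struct cpumask and a much longer bring-up sequence.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 8   /* hypothetical; the kernel uses NR_CPUS */

struct toy_cpu_masks {
    unsigned long possible;  /* can be plugged in */
    unsigned long present;   /* currently plugged in */
    unsigned long online;    /* available for scheduling */
    unsigned long active;    /* available for task migration */
};

static unsigned long toy_bit(int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    return 1UL << cpu;
}

/* Mirrors boot_cpu_init(): the boot CPU is set in all four masks. */
void toy_boot_cpu_init(struct toy_cpu_masks *m, int cpu)
{
    m->online   |= toy_bit(cpu);
    m->active   |= toy_bit(cpu);
    m->present  |= toy_bit(cpu);
    m->possible |= toy_bit(cpu);
}

/* Only a present CPU can be brought online. */
bool toy_cpu_up(struct toy_cpu_masks *m, int cpu)
{
    if (!(m->present & toy_bit(cpu)))
        return false;
    m->online |= toy_bit(cpu);
    m->active |= toy_bit(cpu);
    return true;
}

/* Taking a CPU down leaves it present, so it can come back later. */
void toy_cpu_down(struct toy_cpu_masks *m, int cpu)
{
    m->active &= ~toy_bit(cpu);
    m->online &= ~toy_bit(cpu);
}
```

The sketch keeps the usual nesting: every online CPU is present, and every present CPU is possible.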

For more on CPU masks, see:

Memory zones

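A 32-bit processor can address at most 4GB of memory. In a typical Linux configuration this 4GB is split into 3GB of user space and 1GB of kernel space. If a machine has more memory than the kernel's 1GB can map directly, or even more than 4GB, how can Linux make use of it?

To solve this, Linux introduced the high memory mechanism, which maps the memory that cannot be mapped permanently into the 32-bit address space on demand. Linux divides physical memory (Physical Address, PA) into the following zones:

ZONE_DMA (0-16 MB): physical memory reserved for x86 ISA/PCI DMA. Some old x86 ISA DMA devices can only address 0-16MB.
ZONE_NORMAL (16-896 MB): the region that normal kernel code can access directly.
ZONE_HIGHMEM (above 896 MB): dynamically mapped memory.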
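The boundaries in this list can be written down as a small classifier. This is only a sketch of the classic 32-bit x86 layout described above; the toy_zone_of() name and enum are hypothetical, and the kernel derives the real boundaries from the memory map at boot.

```c
#include <assert.h>

enum toy_zone { TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM };

/* Classify a physical address using the 16MB and 896MB boundaries. */
enum toy_zone toy_zone_of(unsigned long long phys_addr)
{
    const unsigned long long mb = 1024ULL * 1024ULL;

    if (phys_addr < 16 * mb)
        return TOY_ZONE_DMA;
    if (phys_addr < 896 * mb)
        return TOY_ZONE_NORMAL;
    return TOY_ZONE_HIGHMEM;    /* must be mapped before the kernel can use it */
}
```

For example, a page at physical address 1MB falls in ZONE_DMA, one at 512MB in ZONE_NORMAL, and one at 2GB in ZONE_HIGHMEM.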

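Two newer zone types have also been added to the kernel:

ZONE_DMA32 (16MB-4G): a zone added for 64-bit systems (e.g. x86_64) that extends ZONE_DMA to a 4G address range.
- LWN: Add 4GB DMA32 zone
ZONE_MOVABLE: a pseudo zone; pages marked ZONE_MOVABLE can be migrated, which helps avoid memory fragmentation.
- LWN: Create ZONE_MOVABLE to partition memory between movable and non-movable pages

Note that all the zones above refer to physical memory. Also, the kernel itself is placed at low physical addresses and is only mapped to high virtual addresses (Virtual Address, VA) once the memory mapping has been set up.

Within the kernel's 1GB of address space, the 128MB after the first 896MB is reserved for dynamically mapping high memory (ZONE_HIGHMEM). Data in this zone is therefore accessed indirectly: it must be mapped before it can be accessed.

ZONE_DMA and ZONE_HIGHMEM need not exist; x86_64, for example, has no ZONE_HIGHMEM. Apart from x86_32, most systems have only ZONE_NORMAL: most current devices have no DMA addressing restrictions, so ZONE_DMA is unnecessary, and 64-bit systems have no 4GB addressing limit, so ZONE_HIGHMEM is unnecessary. Even a 32-bit system does not need high memory if it will never be fitted with more memory than the kernel can map directly.

If the kernel is configured with high memory (CONFIG_HIGHMEM), page_address_init() initializes the high memory mapping table; without it, page_address_init() is just an empty function.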

#if defined(HASHED_PAGE_VIRTUAL)
void *page_address(const struct page *page);
void set_page_address(struct page *page, void *virtual);
void page_address_init(void);
#endif

#if !defined(HASHED_PAGE_VIRTUAL) && !defined(WANT_PAGE_VIRTUAL)
#define page_address(page) lowmem_page_address(page)
#define set_page_address(page, address)  do { } while(0)
#define page_address_init()  do { } while(0)
#endif

#if defined(HASHED_PAGE_VIRTUAL)

... omitted ...

void __init page_address_init(void)
{
    int i;

    for (i = 0; i < ARRAY_SIZE(page_address_htable); i++) {
        INIT_LIST_HEAD(&page_address_htable[i].lh);
        spin_lock_init(&page_address_htable[i].lock);
    }
}

#endif  /* defined(CONFIG_HIGHMEM) && !defined(WANT_PAGE_VIRTUAL) */

Memory architecture

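In modern multiprocessor systems, memory access falls into two architectures: UMA (Uniform Memory Access) and NUMA (Non-Uniform Memory Access).

Under UMA, the CPUs share a single bus to memory, so only one CPU can access memory at a time. Memory access is clearly a performance bottleneck of the UMA architecture.

NUMA was proposed to relieve this bottleneck. Memory is distributed among the CPUs and becomes each CPU's local memory; Linux calls this a node. CPUs that belong to a node can access its memory directly without going over the bus; only access to another node's memory has to go over the bus. It follows from the architecture that accessing another node's memory takes longer than accessing local memory. Under UMA every memory access costs the same, whereas here access times differ, hence the name Non-Uniform Memory Access (NUMA).

A memory node is represented in the kernel by pg_data_t. Each node contains one or more struct zone, and they are initialized by the setup_arch() that each architecture provides.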

setup_arch()

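setup_arch() is usually defined in arch/xxx/kernel/setup.c for each architecture and handles architecture-specific initialization such as CPU, memory, interrupts, I/O, and DMA. Because hardware differs in design, characteristics, and use, what has to be initialized and how varies from one architecture to another; understanding this function requires a good understanding of the target hardware.

setup_arch() differs considerably between architectures. x86 supports many features and has relatively complex hardware, so its setup_arch() is correspondingly complex at over 400 lines, longer than start_kernel() itself. The arm setup_arch() is only about 70 lines and is much simpler.

Going deeper into setup_arch() would first require describing the target hardware, which needs an article of its own. This article is about start_kernel() and does not go into the individual hardware architectures.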

mm_init_cpumask()

This function is simple: it clears cpu_vm_mask_var in the mm_struct to 0. This mask records the CPUs associated with the mm_struct.

static inline void mm_init_cpumask(struct mm_struct *mm)
{
#ifdef CONFIG_CPUMASK_OFFSTACK
    mm->cpu_vm_mask_var = &mm->cpumask_allocation;
#endif
    cpumask_clear(mm->cpu_vm_mask_var);
}

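The OS keeps a memory mapping table (the page table) that maps virtual pages to physical pages. Every access to a virtual address has to consult this table to find the corresponding physical address. The table lives in memory, but memory cannot keep up with the speed of the CPU, so to speed up lookups the CPU keeps a cache called the TLB (Translation Lookaside Buffer). The CPU first checks whether the TLB holds the address it is looking for, and only reads the table in memory if it does not.

When the mapping table changes, the TLBs of the affected CPUs must be flushed as well. cpu_vm_mask_var records which CPUs need to be notified. The variable is mainly accessed through the inline function mm_cpumask():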

/* Future-safe accessor for struct mm_struct's cpu_vm_mask. */
static inline cpumask_t *mm_cpumask(struct mm_struct *mm)
{
    return mm->cpu_vm_mask_var;
}
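Why a stale TLB is a problem, and why a flush fixes it, can be seen in a small single-CPU simulation. Everything here is a simplification made for illustration: the toy_tlb is direct-mapped with four entries, the page table is a flat array indexed by virtual page number, and the toy_* names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_PAGE_SHIFT  12
#define TOY_TLB_ENTRIES 4

struct toy_tlb_entry {
    bool valid;
    unsigned long vpn;   /* virtual page number */
    unsigned long pfn;   /* physical frame number */
};

struct toy_tlb {
    struct toy_tlb_entry entries[TOY_TLB_ENTRIES];
    unsigned long hits, misses;
};

/* Translate vaddr, consulting the TLB first and the page table on a miss. */
unsigned long toy_translate(struct toy_tlb *tlb, const unsigned long *page_table,
                            size_t nr_pages, unsigned long vaddr)
{
    unsigned long vpn = vaddr >> TOY_PAGE_SHIFT;
    unsigned long offset = vaddr & ((1UL << TOY_PAGE_SHIFT) - 1);
    struct toy_tlb_entry *e = &tlb->entries[vpn % TOY_TLB_ENTRIES];

    assert(vpn < nr_pages);
    if (e->valid && e->vpn == vpn) {
        tlb->hits++;
    } else {
        tlb->misses++;               /* slow path: read the table in memory */
        e->valid = true;
        e->vpn = vpn;
        e->pfn = page_table[vpn];
    }
    return (e->pfn << TOY_PAGE_SHIFT) | offset;
}

/* Drop all cached translations, e.g. after the page table has changed. */
void toy_tlb_flush(struct toy_tlb *tlb)
{
    for (int i = 0; i < TOY_TLB_ENTRIES; i++)
        tlb->entries[i].valid = false;
}
```

If the page table entry for a page is changed after it has been cached, toy_translate() keeps returning the old physical address until toy_tlb_flush() is called. On a multiprocessor system every CPU has its own TLB, which is why the kernel needs a mask of the CPUs to notify.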

setup_command_line()

/*
 * We need to store the untouched command line for future reference.
 * We also need to store the touched command line since the parameter
 * parsing is performed in place, and we should allow a component to
 * store reference of name/value for future reference.
 */
static void __init setup_command_line(char *command_line)
{
    saved_command_line =
        memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
    initcall_command_line =
        memblock_virt_alloc(strlen(boot_command_line) + 1, 0);
    static_command_line = memblock_virt_alloc(strlen(command_line) + 1, 0);
    strcpy(saved_command_line, boot_command_line);
    strcpy(static_command_line, command_line);
}

setup_command_line() is simple: it allocates memory with memblock_virt_alloc() and then copies boot_command_line and command_line into it for safekeeping.
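The comment above the function says the untouched copy is needed because parameter parsing is performed in place. A minimal sketch of what in-place parsing does to a buffer is shown below; toy_next_arg() is a hypothetical tokenizer written for this illustration and handles only space-separated parameters, with none of the quoting rules of the kernel's real parser.

```c
#include <assert.h>
#include <string.h>

/*
 * Hypothetical in-place tokenizer: terminates the next space-separated
 * parameter with '\0', stores its start in *param, and returns the rest.
 */
char *toy_next_arg(char *args, char **param)
{
    assert(args && param);
    args += strspn(args, " ");       /* skip leading spaces */
    *param = args;
    args += strcspn(args, " ");      /* find the end of this parameter */
    if (*args)
        *args++ = '\0';              /* this write modifies the buffer */
    return args;
}
```

After parsing "root=/dev/sda1 quiet" this way, the working buffer reads as just "root=/dev/sda1" when treated as a string, because the separators have been replaced by terminators. Anything that later wants to show the original command line has to rely on a copy made before parsing.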

memblock is a simple memory allocator used mainly during kernel boot, to satisfy allocation requests before the full memory management infrastructure has been set up.

setup_nr_cpu_ids()

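Computes the number of CPU ids (nr = number) from the position of the last set bit in the possible cpumask, that is, the highest possible CPU id plus one.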

/* An arch may set nr_cpu_ids earlier if needed, so this would be redundant */
void __init setup_nr_cpu_ids(void)
{
    nr_cpu_ids = find_last_bit(cpumask_bits(cpu_possible_mask),NR_CPUS) + 1;
}
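The same calculation on a single machine word looks like this. The sketch assumes the whole mask fits in one unsigned long and is not empty; toy_nr_cpu_ids() is a hypothetical stand-in for the find_last_bit() based code above.

```c
#include <assert.h>

/* Position of the highest set bit, plus one. */
unsigned int toy_nr_cpu_ids(unsigned long possible_mask)
{
    unsigned int nr = 0;

    assert(possible_mask != 0);      /* at least the boot CPU is possible */
    while (possible_mask) {
        nr++;
        possible_mask >>= 1;
    }
    return nr;
}
```

Note that the result is not a count of set bits: a sparse mask with only CPU 0 and CPU 4 possible (binary 10001) still gives 5, because CPU ids up to 4 are in use.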

setup_per_cpu_areas()

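As the name suggests, a per-CPU area is a memory region private to each CPU, and setup_per_cpu_areas() sets one up for every CPU. Why give each CPU its own region?

The kernel has many counters, for example network packet statistics. The most intuitive implementation is a shared variable that is incremented whenever a packet arrives. Such a variable is usually an int updated with atomic operations to reduce synchronization overhead. On a single-CPU system (uniprocessor, UP) this is a reasonable design, but on a multi-CPU system (symmetric multiprocessing, SMP) it causes two problems:

An atomic operation has to lock a region of memory or the bus to keep other CPUs from interfering with the read and write.
Because the variable changes frequently, the CPUs' caches have to be updated frequently (cache line bouncing).

To address both problems the kernel uses per-CPU variables. A variable declared as per-CPU is placed in each CPU's private region, and each CPU accesses only its own copy. In the packet counting example, each CPU counts only the packets it sees; to get the total, you sum the counts of all CPUs.

This design reduces the need for synchronization: each CPU has its own copy of the data and the CPUs do not interfere with one another. And because each CPU has its own copy, there is no cache line bouncing to hurt cache efficiency.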
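The packet counter example can be sketched as an array with one padded slot per CPU. The structure, the four-CPU limit, and the 64-byte cache line are assumptions for illustration; the kernel's real per-CPU variables are declared with its own macros and live in the areas set up by setup_per_cpu_areas().

```c
#include <assert.h>

#define TOY_NR_CPUS    4
#define TOY_CACHE_LINE 64   /* assumed cache line size in bytes */

/* One slot per CPU, padded so that two slots never share a cache line. */
struct toy_percpu_slot {
    unsigned long count;
    char pad[TOY_CACHE_LINE - sizeof(unsigned long)];
};

struct toy_percpu_counter {
    struct toy_percpu_slot cpu[TOY_NR_CPUS];
};

/* Fast path: each CPU touches only its own slot, so no atomics are needed. */
void toy_counter_inc(struct toy_percpu_counter *c, int cpu)
{
    assert(cpu >= 0 && cpu < TOY_NR_CPUS);
    c->cpu[cpu].count++;
}

/* Slow path: a reader sums all slots to get the total. */
unsigned long toy_counter_sum(const struct toy_percpu_counter *c)
{
    unsigned long sum = 0;

    for (int cpu = 0; cpu < TOY_NR_CPUS; cpu++)
        sum += c->cpu[cpu].count;
    return sum;
}
```

The trade-off is that updates become cheap while reading the total becomes more expensive and only approximate while other CPUs keep counting, which suits statistics that are written far more often than they are read.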

build_all_zonelists()

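A zonelist is a priority list giving the order in which zones are tried for a memory allocation. When the current zone has no free memory, the allocator falls back to the next zone in the zonelist that has memory available. build_all_zonelists() sets up node_zonelists in pg_data_t.

There are two choices for the zonelist order:

ZONELIST_ORDER_NODE: ordered by node first, favoring memory locality.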

e.g. node(0).ZONE_MOVABLE, node(0).ZONE_HIGHMEM, node(0).ZONE_NORMAL, node(0).ZONE_DMA,
     node(1).ZONE_MOVABLE, node(1).ZONE_HIGHMEM, node(1).ZONE_NORMAL, node(1).ZONE_DMA,
     ...

ZONELIST_ORDER_ZONE: ordered by zone first, preferring zones of the same type.

e.g. node(0).ZONE_MOVABLE, node(1).ZONE_MOVABLE, ...,
     node(0).ZONE_HIGHMEM, node(1).ZONE_HIGHMEM, ...,
     node(0).ZONE_NORMAL, node(1).ZONE_NORMAL, ...,
     node(0).ZONE_DMA, node(1).ZONE_DMA, ...

numa_zonelist_order can be passed to the kernel by the bootloader (it is an early_param), or the zonelist order can be changed at run time through /proc/sys/vm/numa_zonelist_order.
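The two orderings are simply the two ways of nesting a loop over nodes and a loop over zones. The sketch below generates both lists as seen from node 0; the toy_zoneref type and function names are hypothetical, and it assumes nodes are tried in index order, whereas the kernel orders the other nodes by distance from the local one.

```c
#include <assert.h>
#include <stddef.h>

enum toy_zone_type {
    TOY_ZONE_DMA, TOY_ZONE_NORMAL, TOY_ZONE_HIGHMEM, TOY_ZONE_MOVABLE,
    TOY_NR_ZONES
};

struct toy_zoneref {
    int node;
    enum toy_zone_type zone;
};

/* Node order: all zones of node 0 (highest zone first), then node 1, ... */
size_t toy_build_node_order(struct toy_zoneref *out, int nr_nodes)
{
    size_t n = 0;

    assert(out && nr_nodes > 0);
    for (int node = 0; node < nr_nodes; node++)
        for (int zone = TOY_NR_ZONES - 1; zone >= 0; zone--)
            out[n++] = (struct toy_zoneref){ node, (enum toy_zone_type)zone };
    return n;
}

/* Zone order: the highest zone of every node, then the next zone, ... */
size_t toy_build_zone_order(struct toy_zoneref *out, int nr_nodes)
{
    size_t n = 0;

    assert(out && nr_nodes > 0);
    for (int zone = TOY_NR_ZONES - 1; zone >= 0; zone--)
        for (int node = 0; node < nr_nodes; node++)
            out[n++] = (struct toy_zoneref){ node, (enum toy_zone_type)zone };
    return n;
}
```

The caller provides an array of at least nr_nodes * TOY_NR_ZONES entries. Node order exhausts local memory, including the scarce ZONE_DMA, before going to a remote node; zone order protects the low zones at the cost of going remote earlier.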

page_alloc_init()

page_alloc_init() simply registers a callback function for CPU hotplug events. Judging from the page_alloc_cpu_notify() that it registers, the work is mainly cleanup after a CPU has been dynamically removed.

void __init page_alloc_init(void)
{
    hotcpu_notifier(page_alloc_cpu_notify, 0);
}

static int page_alloc_cpu_notify(struct notifier_block *self,
                 unsigned long action, void *hcpu)
{
    int cpu = (unsigned long)hcpu;

    if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
        lru_add_drain_cpu(cpu);
        drain_pages(cpu);

        /*
         * Spill the event counters of the dead processor
         * into the current processors event counters.
         * This artificially elevates the count of the current
         * processor.
         */
        vm_events_fold_cpu(cpu);

        /*
         * Zero the differential counters of the dead processor
         * so that the vm statistics are consistent.
         *
         * This is only okay since the processor is dead and cannot
         * race with what we are doing.
         */
        cpu_vm_stats_fold(cpu);
    }
    return NOTIFY_OK;
}

Reference

http://www.slideshare.net/shimosawa/linux-initialization-process-2
gitbook: Linux Insides: Kernel initialization process
init/main.c: start_kernel() (git blame shows the change history of the code and the reasons behind it)

Kernel Documentation:

x86 boot protocol: https://www.kernel.org/doc/Documentation/x86/boot.txt
arm booting: https://www.kernel.org/doc/Documentation/arm/Booting

Daniel Jslin

May the source be with you