Linux i386 Boot Code HOWTO

Feiyun Wang

2004-01-23

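Revision History
Revision 1.0      2004-02-19    Revised by: FW
Initial release, reviewed by LDP.
Revision 0.3.3    2004-01-23    Revised by: fyw
Add decompress_kernel() details; fix bugs reported in TLDP final review.
Revision 0.3      2003-12-07    Revised by: fyw
Add contents on SMP, GRUB and LILO; fixes and enhancements.
Revision 0.2      2003-08-17    Revised by: fyw
Adapt to Linux 2.4.20.
Revision 0.1      2003-04-20    Revised by: fyw
Change to DocBook XML format.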

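This document describes Linux i386 boot code, serving as a study guide and source commentary. In addition to C-like pseudocode source commentary, it also presents keynotes of the toolchains and specs related to kernel development. It is designed to help

  • kernel newbies catch up with Linux i386 boot code, and

  • kernel veterans recall the Linux boot procedure.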


Table of Contents
1. Introduction
1.1. Copyright and License
1.2. Disclaimer
1.3. Credits / Contributors
1.4. Feedback
1.5. Translations
2. Linux Makefiles
2.1. linux/Makefile
2.2. linux/arch/i386/vmlinux.lds
2.3. linux/arch/i386/Makefile
2.4. linux/arch/i386/boot/Makefile
2.5. linux/arch/i386/boot/compressed/Makefile
2.6. linux/arch/i386/boot/tools/build.c
2.7. Reference
3. linux/arch/i386/boot/bootsect.S
3.1. Move Bootsect
3.2. Get Disk Parameters
3.3. Load Setup Code
3.4. Load Compressed Image
3.5. Go Setup
3.6. Read Disk
3.7. Bootsect Helper
3.8. Miscellaneous
3.9. Reference
4. linux/arch/i386/boot/setup.S
4.1. Header
4.2. Check Code Integrity
4.3. Check Loader Type
4.4. Get Memory Size
4.5. Hardware Support
4.6. APM Support
4.7. Prepare for Protected Mode
4.8. Enable A20
4.9. Switch to Protected Mode
4.10. Miscellaneous
4.11. Reference
5. linux/arch/i386/boot/compressed/head.S
5.1. Decompress Kernel
5.2. gunzip()
5.3. inflate()
5.4. Reference
6. linux/arch/i386/kernel/head.S
6.1. Enable Paging
6.2. Get Kernel Parameters
6.3. Check CPU Type
6.4. Go Start Kernel
6.5. Miscellaneous
6.6. Reference
7. linux/init/main.c
7.1. start_kernel()
7.2. init()
7.3. cpu_idle()
7.4. Reference
8. SMP Boot
8.1. Before smp_init()
8.2. smp_init()
8.3. linux/arch/i386/kernel/trampoline.S
8.4. initialize_secondary()
8.5. start_secondary()
8.6. Reference
A. Kernel Build Example
B. Internal Linker Script
C. GRUB and LILO
C.1. GNU GRUB
C.2. LILO
C.3. Reference
D. FAQ

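1. Introduction

This document serves as a study guide and source commentary for Linux i386 boot code. In addition to C-like pseudocode source commentary, it also presents keynotes of the toolchains and specs related to kernel development. It is designed to help kernel newbies catch up with Linux i386 boot code, and kernel veterans recall the Linux boot procedure.

The current release is based on Linux 2.4.20.

The project homepage for this document is hosted by China Linux Forum. Working documents may also be found at the author's personal webpage at Yahoo! GeoCities.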


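1.1. Copyright and License

This document, Linux i386 Boot Code HOWTO, is copyrighted (c) 2003, 2004 by Feiyun Wang. Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts. A copy of the license is available at https://gnu.ac.cn/copyleft/fdl.html.

Linux is a registered trademark of Linus Torvalds.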


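1.2. Disclaimer

No liability for the contents of this document can be accepted. Use the concepts, examples and information at your own risk. There may be errors and inaccuracies that could be damaging to your system. Proceed with caution; although damage is highly unlikely, the author does not take any responsibility.

All copyrights are held by their respective owners, unless specifically noted otherwise. Use of a term in this document should not be regarded as affecting the validity of any trademark or service mark. Naming of particular products or brands should not be seen as endorsements.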


1.3. Credits / Contributors

In this document, I have the pleasure of acknowledging:

Names will remain on this list for a year.


1.4. Feedback

Feedback is most certainly welcome for this document. Send your additions, comments and criticisms to the following email address:


1.5. Translations

Only the English version is available right now.


2. Linux Makefiles

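Before perusing Linux code, we should get a basic idea of how Linux is composed, compiled and linked. A straightforward way to achieve this is to understand the Linux makefiles. Check Cross-Referencing Linux if you prefer online source browsing.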


2.1. linux/Makefile

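Here are some well-known targets in this top-level makefile:

  • xconfig, menuconfig, config, oldconfig: generate the kernel configuration file linux/.config;

  • depend, dep: generate dependency files, like linux/.depend, linux/.hdepend and .depend in subdirectories;

  • vmlinux: generate the resident kernel image linux/vmlinux, the most important target;

  • modules, modules_install: generate and install modules in /lib/modules/$(KERNELRELEASE);

  • tags: generate the tag file linux/tags, for source browsing with vim.

linux/Makefile is outlined below: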
include .depend
include .config
include arch/i386/Makefile

vmlinux: generate linux/vmlinux
        /* entry point "stext" defined in arch/i386/kernel/head.S */
        $(LD) -T $(TOPDIR)/arch/i386/vmlinux.lds -e stext
        /* $(HEAD) */
        + from arch/i386/Makefile
                arch/i386/kernel/head.o
                arch/i386/kernel/init_task.o
        init/main.o
        init/version.o
        init/do_mounts.o
        --start-group
        /* $(CORE_FILES) */
        + from arch/i386/Makefile
                arch/i386/kernel/kernel.o
                arch/i386/mm/mm.o
        kernel/kernel.o
        mm/mm.o
        fs/fs.o
        ipc/ipc.o
        /* $(DRIVERS) */
        drivers/...
                char/char.o
                block/block.o
                misc/misc.o
                net/net.o
                media/media.o
                cdrom/driver.o
                and other static linked drivers
                + from arch/i386/Makefile
                        arch/i386/math-emu/math.o (ifdef CONFIG_MATH_EMULATION)
        /* $(NETWORKS) */
        net/network.o
        /* $(LIBS) */
        + from arch/i386/Makefile
                arch/i386/lib/lib.a
        lib/lib.a
        --end-group
        -o vmlinux
        $(NM) vmlinux | grep ... | sort > System.map
tags: generate linux/tags for vim
modules: generate modules
modules_install: install modules
clean mrproper distclean: clean up build directory
psdocs pdfdocs htmldocs mandocs: generate kernel documents

include Rules.make

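rpm: generate an rpm
"--start-group" and "--end-group" are ld command line options that make ld search the enclosed archives repeatedly until no new undefined symbol references remain. Refer to Using LD, the GNU linker: Command Line Options for details.

Rules.make contains rules which are shared between multiple Makefiles.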


2.2. linux/arch/i386/vmlinux.lds

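After compilation, ld combines a number of object and archive files, relocates their data and ties up symbol references. linux/arch/i386/vmlinux.lds is designated by linux/Makefile as the linker script used in linking the resident kernel image linux/vmlinux.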

/* ld script to make i386 Linux kernel
 * Written by Martin Mares <mj@atrey.karlin.mff.cuni.cz>;
 */
OUTPUT_FORMAT("elf32-i386", "elf32-i386", "elf32-i386")
OUTPUT_ARCH(i386)
/* "ENTRY" is overridden by command line option "-e stext" in linux/Makefile */
ENTRY(_start)
/* Output file (linux/vmlinux) layout.
 * Refer to Using LD, the GNU linker: Specifying Output Sections */
SECTIONS
{
/* Output section .text starts at address 3G+1M.
 * Refer to Using LD, the GNU linker: The Location Counter */
  . = 0xC0000000 + 0x100000;
  _text = .;                    /* Text and read-only data */
  .text : {
        *(.text)
        *(.fixup)
        *(.gnu.warning)
        } = 0x9090
/* Unallocated holes filled with 0x9090, i.e. opcode for "NOP NOP".
 * Refer to Using LD, the GNU linker: Optional Section Attributes */

  _etext = .;                   /* End of text section */

  .rodata : { *(.rodata) *(.rodata.*) }
  .kstrtab : { *(.kstrtab) }

/* Aligned to next 16-bytes boundary.
 * Refer to Using LD, the GNU linker: Arithmetic Functions */
  . = ALIGN(16);                /* Exception table */
  __start___ex_table = .;
  __ex_table : { *(__ex_table) }
  __stop___ex_table = .;

  __start___ksymtab = .;        /* Kernel symbol table */
  __ksymtab : { *(__ksymtab) }
  __stop___ksymtab = .;

  .data : {                     /* Data */
        *(.data)
        CONSTRUCTORS
        }
/* For "CONSTRUCTORS", refer to
 * Using LD, the GNU linker: Option Commands */

  _edata = .;                   /* End of data section */

  . = ALIGN(8192);              /* init_task */
  .data.init_task : { *(.data.init_task) }

  . = ALIGN(4096);              /* Init code and data */
  __init_begin = .;
  .text.init : { *(.text.init) }
  .data.init : { *(.data.init) }
  . = ALIGN(16);
  __setup_start = .;
  .setup.init : { *(.setup.init) }
  __setup_end = .;
  __initcall_start = .;
  .initcall.init : { *(.initcall.init) }
  __initcall_end = .;
  . = ALIGN(4096);
  __init_end = .;

  . = ALIGN(4096);
  .data.page_aligned : { *(.data.idt) }

  . = ALIGN(32);
  .data.cacheline_aligned : { *(.data.cacheline_aligned) }

  __bss_start = .;              /* BSS */
  .bss : {
        *(.bss)
        }
  _end = . ;

/* Output section /DISCARD/ will not be included in the final link output.
 * Refer to Using LD, the GNU linker: Section Definitions */
  /* Sections to be discarded */
  /DISCARD/ : {
        *(.text.exit)
        *(.data.exit)
        *(.exitcall.exit)
        }

/* The following output sections are addressed at memory location 0.
 * Refer to Using LD, the GNU linker: Optional Section Attributes */
  /* Stabs debugging sections.  */
  .stab 0 : { *(.stab) }
  .stabstr 0 : { *(.stabstr) }
  .stab.excl 0 : { *(.stab.excl) }
  .stab.exclstr 0 : { *(.stab.exclstr) }
  .stab.index 0 : { *(.stab.index) }
  .stab.indexstr 0 : { *(.stab.indexstr) }
  .comment 0 : { *(.comment) }
}
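
The location counter arithmetic in this script can be checked with a few lines of C. This is only a sketch: the helper name align_up() and the macro KERNEL_TEXT_START are illustrative assumptions, not kernel code. It mirrors what ". = ALIGN(n)" does for a power-of-two n.

```c
#include <assert.h>
#include <stdint.h>

/* Link address of _text in vmlinux.lds: 3G + 1M. */
#define KERNEL_TEXT_START (0xC0000000u + 0x100000u)

/* Round addr up to the next multiple of align, as ". = ALIGN(align)"
 * does in the linker script. align must be a power of two.
 * (align_up is a hypothetical helper, not part of the kernel.) */
uint32_t align_up(uint32_t addr, uint32_t align)
{
        assert(align != 0 && (align & (align - 1)) == 0);
        return (addr + align - 1) & ~(align - 1);
}
```

For example, align_up(addr, 8192) gives the start of .data.init_task for a location counter at addr, and align_up(addr, 4096) gives a page boundary such as __init_begin.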


2.3. linux/arch/i386/Makefile

linux/arch/i386/Makefile is included by linux/Makefile to provide i386 specific items and terms.

All the following targets depend on target vmlinux of linux/Makefile. They are accomplished by making the corresponding targets in linux/arch/i386/boot/Makefile with some options.

Table 1. Targets in linux/arch/i386/Makefile

Target       Command
zImage [a]   @$(MAKE) -C arch/i386/boot zImage [b]
bzImage      @$(MAKE) -C arch/i386/boot bzImage
zlilo        @$(MAKE) -C arch/i386/boot BOOTIMAGE=zImage zlilo
bzlilo       @$(MAKE) -C arch/i386/boot BOOTIMAGE=bzImage zlilo
zdisk        @$(MAKE) -C arch/i386/boot BOOTIMAGE=zImage zdisk
bzdisk       @$(MAKE) -C arch/i386/boot BOOTIMAGE=bzImage zdisk
install      @$(MAKE) -C arch/i386/boot BOOTIMAGE=bzImage install
Notes:
a. zImage alias: compressed
b. "-C" is a MAKE command line option to change directory before reading makefiles;
   refer to GNU make: Summary of Options and GNU make: Recursive Use of make.

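It is worth noting that this makefile redefines some environment variables which are exported by linux/Makefile, specifically:
OBJCOPY=$(CROSS_COMPILE)objcopy -O binary -R .note -R .comment -S
The effect is passed on to the subdirectory makefiles and changes the tool's behavior. Refer to GNU Binary Utilities: objcopy for objcopy command line option details.

It is not clear why $(LIBS) includes "$(TOPDIR)/arch/i386/lib/lib.a" twice:
LIBS := $(TOPDIR)/arch/i386/lib/lib.a $(LIBS) $(TOPDIR)/arch/i386/lib/lib.a
It may be there to work around linking problems with some toolchains.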


2.4. linux/arch/i386/boot/Makefile

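linux/arch/i386/boot/Makefile is somewhat independent, as it is included by neither linux/arch/i386/Makefile nor linux/Makefile.

However, they do have some relationship:

  • linux/Makefile: provides the resident kernel image linux/vmlinux;

  • linux/arch/i386/boot/Makefile: provides the bootstrap;

  • linux/arch/i386/Makefile: makes sure linux/vmlinux is ready before the bootstrap is constructed, and exports targets (like bzImage) to linux/Makefile.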

The $(BOOTIMAGE) value, which is used by targets zdisk, zlilo and install, comes from linux/arch/i386/Makefile.

Table 2. Targets in linux/arch/i386/boot/Makefile

Target
        Command
zImage
        $(OBJCOPY) compressed/vmlinux compressed/vmlinux.out
        tools/build bootsect setup compressed/vmlinux.out $(ROOT_DEV) > zImage
bzImage
        $(OBJCOPY) compressed/bvmlinux compressed/bvmlinux.out
        tools/build -b bbootsect bsetup compressed/bvmlinux.out $(ROOT_DEV) \
                > bzImage
zdisk
        dd bs=8192 if=$(BOOTIMAGE) of=/dev/fd0
zlilo
        if [ -f $(INSTALL_PATH)/vmlinuz ]; then mv $(INSTALL_PATH)/vmlinuz \
                $(INSTALL_PATH)/vmlinuz.old; fi
        if [ -f $(INSTALL_PATH)/System.map ]; then mv $(INSTALL_PATH)/System.map \
                $(INSTALL_PATH)/System.old; fi
        cat $(BOOTIMAGE) > $(INSTALL_PATH)/vmlinuz
        cp $(TOPDIR)/System.map $(INSTALL_PATH)/
        if [ -x /sbin/lilo ]; then /sbin/lilo; else /etc/lilo/install; fi
install
        sh -x ./install.sh $(KERNELRELEASE) $(BOOTIMAGE) $(TOPDIR)/System.map \
                "$(INSTALL_PATH)"

tools/build builds the boot image zImage from {bootsect, setup, compressed/vmlinux.out}, or bzImage from {bbootsect, bsetup, compressed/bvmlinux.out}. $(ROOT_DEV) is set in linux/Makefile by "export ROOT_DEV = CURRENT". Note that $(OBJCOPY) has been redefined in linux/arch/i386/Makefile, as described in Section 2.3.

Table 3. Supporting targets in linux/arch/i386/boot/Makefile

Target: Prerequisites
        Command
compressed/vmlinux: linux/vmlinux
        @$(MAKE) -C compressed vmlinux
compressed/bvmlinux: linux/vmlinux
        @$(MAKE) -C compressed bvmlinux
tools/build: tools/build.c
        $(HOSTCC) $(HOSTCFLAGS) -o $@ $< -I$(TOPDIR)/include [a]
bootsect: bootsect.o
        $(LD) -Ttext 0x0 -s --oformat binary bootsect.o [b]
bootsect.o: bootsect.s
        $(AS) -o $@ $<
bootsect.s: bootsect.S ...
        $(CPP) $(CPPFLAGS) -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
bbootsect: bbootsect.o
        $(LD) -Ttext 0x0 -s --oformat binary $< -o $@
bbootsect.o: bbootsect.s
        $(AS) -o $@ $<
bbootsect.s: bootsect.S ...
        $(CPP) $(CPPFLAGS) -D__BIG_KERNEL__ -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
setup: setup.o
        $(LD) -Ttext 0x0 -s --oformat binary -e begtext -o $@ $<
setup.o: setup.s
        $(AS) -o $@ $<
setup.s: setup.S video.S ...
        $(CPP) $(CPPFLAGS) -D__ASSEMBLY__ -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
bsetup: bsetup.o
        $(LD) -Ttext 0x0 -s --oformat binary -e begtext -o $@ $<
bsetup.o: bsetup.s
        $(AS) -o $@ $<
bsetup.s: setup.S video.S ...
        $(CPP) $(CPPFLAGS) -D__BIG_KERNEL__ -D__ASSEMBLY__ -traditional $(SVGA_MODE) $(RAMDISK) $< -o $@
Notes:
a. "$@" means the target file, "$<" means the first prerequisite; refer to GNU make: Automatic Variables
b. "--oformat binary" asks for raw binary output, which is identical to the memory dump of the executable; refer to Using LD, the GNU linker: Command Line Options

Note that "-D__BIG_KERNEL__" is used when preprocessing bootsect.S into bbootsect.s, and setup.S into bsetup.s. They must be Position Independent Code (PIC), so the value given to the "-Ttext" option does not matter.


2.5. linux/arch/i386/boot/compressed/Makefile

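This Makefile handles the image (de)compression mechanism.

It is good to separate (de)compression from the bootstrap. This divide-and-conquer solution allows us to easily improve the (de)compression mechanism or to adopt a new bootstrap method.

Directory linux/arch/i386/boot/compressed/ contains two source files: head.S and misc.c.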

Table 4. Targets in linux/arch/i386/boot/compressed/Makefile

Target
        Command
vmlinux [a]
        $(LD) -Ttext 0x1000 -e startup_32 -o vmlinux head.o misc.o piggy.o
bvmlinux
        $(LD) -Ttext 0x100000 -e startup_32 -o bvmlinux head.o misc.o piggy.o
head.o
        $(CC) $(AFLAGS) -traditional -c head.S
misc.o
        $(CC) $(CFLAGS) -DKBUILD_BASENAME=$(subst $(comma),_,$(subst -,_,$(*F))) \
                -c misc.c [b]
piggy.o
        tmppiggy=_tmp_$$$$piggy; \
        rm -f $$tmppiggy $$tmppiggy.gz $$tmppiggy.lnk; \
        $(OBJCOPY) $(SYSTEM) $$tmppiggy; \
        gzip -f -9 < $$tmppiggy > $$tmppiggy.gz; \
        echo "SECTIONS { .data : { input_len = .; \
                LONG(input_data_end - input_data) input_data = .; \
                *(.data) input_data_end = .; }}" > $$tmppiggy.lnk; \
        $(LD) -r -o piggy.o -b binary $$tmppiggy.gz -b elf32-i386 \
                -T $$tmppiggy.lnk; \
        rm -f $$tmppiggy $$tmppiggy.gz $$tmppiggy.lnk
Notes:
a. Target vmlinux here is different from the one defined in linux/Makefile;
b. "subst" is a MAKE function; refer to GNU make: Functions for String Substitution and Analysis

piggy.o contains the variable input_len and the gzipped linux/vmlinux. input_len is at the beginning of piggy.o, and it is equal to the size of piggy.o excluding input_len itself. Refer to Using LD, the GNU linker: Section Data Expressions for "LONG(expression)" in the piggy.o linker script.
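
The layout produced by the piggy.o linker script can be pictured as a length prefix followed by the compressed bytes. The following C sketch splits such a buffer; struct piggy_view and piggy_parse() are hypothetical names introduced only for illustration, and the sketch assumes the 4-byte LONG() value is stored little-endian, as on i386.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* View of the piggy.o .data section: a 32-bit little-endian length
 * (input_len) followed immediately by the gzipped image (input_data). */
struct piggy_view {
        uint32_t input_len;             /* size of gzipped data in bytes */
        const uint8_t *input_data;      /* first byte of gzipped data */
};

/* Split a raw section image into input_len and input_data.
 * Returns 0 on success, -1 if the buffer is inconsistent. */
int piggy_parse(const uint8_t *section, size_t size, struct piggy_view *out)
{
        assert(section != NULL && out != NULL);
        if (size < 4)
                return -1;
        out->input_len = (uint32_t)section[0]
                       | (uint32_t)section[1] << 8
                       | (uint32_t)section[2] << 16
                       | (uint32_t)section[3] << 24;
        out->input_data = section + 4;
        /* LONG(input_data_end - input_data) must match what follows. */
        return out->input_len == size - 4 ? 0 : -1;
}
```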

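To be exact, it is not linux/vmlinux itself (in ELF format) that is gzipped but its binary image, which is generated by the objcopy command. Note that $(OBJCOPY) has been redefined by linux/arch/i386/Makefile (see Section 2.3) to output raw binary using the "-O binary" option.

When linking {bootsect, setup} or {bbootsect, bsetup}, $(LD) specifies the "--oformat binary" option to output them in binary format. When making zImage (or bzImage), $(OBJCOPY) generates an intermediate binary output from compressed/vmlinux (or compressed/bvmlinux) too. It is vital that all components in zImage or bzImage are in raw binary format, so that the image can run by itself without asking a loader to load and relocate it.

Both vmlinux and bvmlinux place head.o and misc.o before piggy.o, but they are linked against different start addresses (0x1000 vs 0x100000).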


2.6. linux/arch/i386/boot/tools/build.c

linux/arch/i386/boot/tools/build.c is a host utility to generate zImage or bzImage.

In linux/arch/i386/boot/Makefile:
tools/build bootsect setup compressed/vmlinux.out $(ROOT_DEV) > zImage
tools/build -b bbootsect bsetup compressed/bvmlinux.out $(ROOT_DEV) > bzImage
"-b" means is_big_kernel, which is used when checking whether the system image is too big.

tools/build outputs the following components to stdout, which is redirected to zImage or bzImage:

  1. bootsect or bbootsect: from linux/arch/i386/boot/bootsect.S, 512 bytes;

  2. setup or bsetup: from linux/arch/i386/boot/setup.S, 4 sectors or more, sector aligned;

  3. compressed/vmlinux.out or compressed/bvmlinux.out, including:

    1. head.o: from linux/arch/i386/boot/compressed/head.S;

    2. misc.o: from linux/arch/i386/boot/compressed/misc.c;

    3. piggy.o: from input_len and the gzipped linux/vmlinux.

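tools/build changes some contents of bootsect or bbootsect while writing to stdout:

Table 5. Modification made by tools/build

Offset      Bytes  Variable       Comment
1F1 (497)   1      setup_sectors  number of setup sectors, >=4
1F4 (500)   2      sys_size       system size in 16-byte units, little-endian
1FC (508)   1      minor_root     root device minor number
1FD (509)   1      major_root     root device major number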
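
As an illustration of Table 5, the patching step can be sketched in C as below. The function patch_bootsect() and the enum names are hypothetical, not taken from build.c; rounding the byte size up to 16-byte units and requiring the result to fit the 2-byte field are assumptions of this sketch.

```c
#include <assert.h>
#include <stdint.h>

enum {
        OFF_SETUP_SECTS = 497,  /* 0x1F1, 1 byte */
        OFF_SYS_SIZE    = 500,  /* 0x1F4, 2 bytes, little-endian */
        OFF_MINOR_ROOT  = 508,  /* 0x1FC, 1 byte */
        OFF_MAJOR_ROOT  = 509   /* 0x1FD, 1 byte */
};

/* Patch a 512-byte boot sector the way Table 5 describes.
 * sys_bytes is the size of the compressed system image in bytes;
 * rounding it up to 16-byte units is an assumption of this sketch. */
void patch_bootsect(uint8_t *sector, unsigned setup_sectors,
                    uint32_t sys_bytes, uint8_t major, uint8_t minor)
{
        uint32_t sys_size = (sys_bytes + 15) / 16;

        assert(setup_sectors >= 4 && setup_sectors <= 255);
        assert(sys_size <= 0xFFFF);     /* must fit the 2-byte field */

        sector[OFF_SETUP_SECTS]  = (uint8_t)setup_sectors;
        sector[OFF_SYS_SIZE]     = (uint8_t)(sys_size & 0xFF);
        sector[OFF_SYS_SIZE + 1] = (uint8_t)(sys_size >> 8);
        sector[OFF_MINOR_ROOT]   = minor;
        sector[OFF_MAJOR_ROOT]   = major;
}
```

Offsets 508 and 509 together form the 16-bit root_dev word, with the minor number in the low byte and the major number in the high byte.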

In the following chapters, compressed/vmlinux will be referred to as vmlinux and compressed/bvmlinux as bvmlinux, if not confusing.


2.7. Reference


3. linux/arch/i386/boot/bootsect.S

Suppose we are booting bzImage, which is composed of bbootsect, bsetup and bvmlinux (head.o, misc.o, piggy.o). The first floppy sector, bbootsect (512 bytes), which is compiled from linux/arch/i386/boot/bootsect.S, is loaded by the BIOS to 07C0:0. The rest of bzImage (bsetup and bvmlinux) has not been loaded yet.


3.1. Move Bootsect

SETUPSECTS      = 4                     /* default nr of setup-sectors */
BOOTSEG         = 0x07C0                /* original address of boot-sector */
INITSEG         = DEF_INITSEG  (0x9000) /* we move boot here - out of the way */
SETUPSEG        = DEF_SETUPSEG (0x9020) /* setup starts here */
SYSSEG          = DEF_SYSSEG   (0x1000) /* system loaded at 0x10000 (65536) */
SYSSIZE         = DEF_SYSSIZE  (0x7F00) /* system size: # of 16-byte clicks */
                                        /* to be loaded */
ROOT_DEV        = 0                     /* ROOT_DEV is now written by "build" */
SWAP_DEV        = 0                     /* SWAP_DEV is now written by "build" */

.code16
.text

///////////////////////////////////////////////////////////////////////////////
_start:
{
        // move ourself from 0x7C00 to 0x90000 and jump there.
        move BOOTSEG:0 to INITSEG:0 (512 bytes);
        goto INITSEG:go;
}
bbootsect has been moved to INITSEG:0 (0x9000:0). Now we can forget BOOTSEG.
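
The segment:offset pairs used throughout this chapter translate to linear addresses as segment * 16 + offset. A minimal C sketch, where the helper name linear() is an assumption made for illustration:

```c
#include <assert.h>
#include <stdint.h>

#define BOOTSEG  0x07C0         /* original address of boot-sector */
#define INITSEG  0x9000         /* boot sector is moved here */
#define SETUPSEG 0x9020         /* setup starts here */

/* Real-mode address translation: linear = segment * 16 + offset.
 * (linear is a hypothetical helper, not kernel code.) */
uint32_t linear(uint16_t seg, uint16_t off)
{
        return ((uint32_t)seg << 4) + off;
}
```

With this helper, linear(BOOTSEG, 0) is 0x7C00 and linear(INITSEG, 0) is 0x90000, which matches the comment "move ourself from 0x7C00 to 0x90000"; linear(INITSEG, 0x0200) and linear(SETUPSEG, 0) name the same byte.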


3.2. Get Disk Parameters

///////////////////////////////////////////////////////////////////////////////
// prepare stack and disk parameter table
go:
{
        SS:SP = INITSEG:3FF4;   // put stack at INITSEG:0x4000-12
        /* 0x4000 is an arbitrary value >=
         *   length of bootsect + length of setup + room for stack;
         * 12 is disk parm size. */
        copy disk parameter (pointer in 0:0078) to INITSEG:3FF4 (12 bytes);
        // int1E: SYSTEM DATA - DISKETTE PARAMETERS
        patch sector count to 36 (offset 4 in parameter table, 1 byte);
        set disk parameter table pointer (0:0078, int1E) to INITSEG:3FF4;
}
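Make sure SP is initialized immediately after the SS register. According to the IA-32 Intel Architecture Software Developer's Manual (Vol. 3, Ch. 5.8.3, Masking Exceptions and Interrupts When Switching Stacks), the recommended method of modifying SS is to use the "lss" instruction.

Stack operations, such as push and pop, will work now. The first 12 bytes of the disk parameter table have been copied to INITSEG:3FF4.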

///////////////////////////////////////////////////////////////////////////////
// get disk drive parameters, specifically number of sectors/track.
        char disksizes[] = {36, 18, 15, 9};
        int sectors;
{
        SI = disksizes;                         // i = 0;
        do {
probe_loop:
                sectors = DS:[SI++];            // sectors = disksizes[i++];
                if (SI>=disksizes+4) break;     // if (i>=4) break;
                int13/AH=02h(AL=1, ES:BX=INITSEG:0200, CX=sectors, DX=0);
                // int13/AH=02h: DISK - READ SECTOR(S) INTO MEMORY
        } while (failed to read sectors);
}
"lodsb" loads a byte from DS:[SI] into AL and increases SI automatically.

The number of sectors per track has been saved in variable sectors.
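
The probing order can be exercised outside the BIOS with a small C model. In this sketch the callback type read_sector_fn and the drive model fake_drive_read() are assumptions standing in for int13/AH=02h; only the loop structure follows the pseudocode above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Callback standing in for int13/AH=02h: returns true if sector number
 * `sector` on track 0, head 0 could be read. (Hypothetical interface.) */
typedef bool (*read_sector_fn)(int sector, void *ctx);

/* Probe sectors per track the way bootsect.S does: try the largest
 * geometry first; the last entry (9) is taken without probing. */
int probe_sectors(read_sector_fn try_read, void *ctx)
{
        static const char disksizes[] = {36, 18, 15, 9};
        size_t i = 0;
        int sectors;

        assert(try_read != NULL);
        do {
                sectors = disksizes[i++];
                if (i >= sizeof disksizes)
                        break;          /* fall back to 9 sectors */
        } while (!try_read(sectors, ctx));
        return sectors;
}

/* Example drive model: ctx points to the real sectors-per-track count. */
bool fake_drive_read(int sector, void *ctx)
{
        return sector <= *(const int *)ctx;
}
```

Reading the highest-numbered sector of a candidate geometry succeeds only if the track really has that many sectors, which is why the table is ordered from 36 down to 9.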


3.3. Load Setup Code

bsetup (setup_sects sectors) will be loaded right after bbootsect, i.e. at SETUPSEG:0. Note that INITSEG:0200==SETUPSEG:0, and that setup_sects has been changed by tools/build to match the bsetup size, as described in Section 2.6.

///////////////////////////////////////////////////////////////////////////////
got_sectors:
        word sread;             // sectors read for current track
        char setup_sects;       // overwritten by tools/build
{
        print out "Loading";
        /* int10/AH=03h(BH=0): VIDEO - GET CURSOR POSITION AND SIZE
         * int10/AH=13h(AL=1, BH=0, BL=7, CX=9, DH=DL=0, ES:BP=INITSEG:$msg1):
         *   VIDEO - WRITE STRING */

        // load setup-sectors directly after the moved bootblock (at 0x90200).
        SI = &sread;            // using SI to index sread, head and track
        sread = 1;              // the boot sector has already been read

        int13/AH=00h(DL=0);     // reset FDC

        BX = 0x0200;            // read bsetup right after bbootsect (512 bytes)
        do {
next_step:
                /* to prevent cylinder crossing reading,
                 *   calculate how many sectors to read this time */
                uint16 pushw_ax = AX = MIN(sectors-sread, setup_sects);
no_cyl_crossing:
                read_track(AL, ES:BX);          // AX is not modified
                // set ES:BX, sread, head and track for next read_track()
                set_next(AX);
                setup_sects -= pushw_ax;        // rest - for next step
        } while (setup_sects);
}
SI is set to the address of sread in order to index variables sread, head and track, as they are contiguous in memory. Check Section 3.6 for read_track() and set_next() details.


3.4. Load Compressed Image

bvmlinux (head.o, misc.o, piggy.o) will be loaded at 0x100000, syssize*16 bytes.

///////////////////////////////////////////////////////////////////////////////
// load vmlinux/bvmlinux (head.o, misc.o, piggy.o)
{
        read_it(ES=SYSSEG);
        kill_motor();                           // turn off floppy drive motor
        print_nl();                             // print CR LF
}
Check Section 3.6 for read_it() details. If we are booting zImage, vmlinux is loaded at 0x10000 (SYSSEG:0).

bzImage (bbootsect, bsetup, bvmlinux) is in memory as a whole now.


3.5. Go Setup

///////////////////////////////////////////////////////////////////////////////
// check which root-device to use and jump to setup.S
        int root_dev;                           // overwritten by tools/build
{
        if (!root_dev) {
                switch (sectors) {
                case 15: root_dev = 0x0208;     // /dev/ps0 - 1.2Mb
                        break;
                case 18: root_dev = 0x021C;     // /dev/PS0 - 1.44Mb
                        break;
                case 36: root_dev = 0x0220;     // /dev/fd0H2880 - 2.88Mb
                        break;
                default: root_dev = 0x0200;     // /dev/fd0 - auto detect
                        break;
                }
        }

        // jump to the setup-routine loaded directly after the bootblock
        goto SETUPSEG:0;
}
It passes control to bsetup. See linux/arch/i386/boot/setup.S:start in Section 4.
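
The root device selection is a plain table lookup, and each root_dev value packs the major number in the high byte and the minor number in the low byte (compare offsets 1FC and 1FD in Table 5). A C sketch, with default_root_dev(), dev_major() and dev_minor() being hypothetical helper names:

```c
#include <assert.h>
#include <stdint.h>

/* Choose a default floppy root device from sectors per track,
 * following the switch in bootsect.S. */
uint16_t default_root_dev(unsigned sectors)
{
        switch (sectors) {
        case 15: return 0x0208;         /* 1.2MB */
        case 18: return 0x021C;         /* 1.44MB */
        case 36: return 0x0220;         /* 2.88MB */
        default: return 0x0200;         /* auto detect */
        }
}

/* Old-style 16-bit device number: high byte major, low byte minor. */
unsigned dev_major(uint16_t dev) { return dev >> 8; }
unsigned dev_minor(uint16_t dev) { return dev & 0xFF; }
```

All four values share major number 2, the floppy driver; only the minor number changes with the detected geometry.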


3.6. Read Disk

The following functions are used to load bsetup and bvmlinux from disk. Note that syssize has been changed by tools/build too, as described in Section 2.6.
sread:  .word 0                         # sectors read of current track
head:   .word 0                         # current head
track:  .word 0                         # current track
///////////////////////////////////////////////////////////////////////////////
// load the system image at address SYSSEG:0
read_it(ES=SYSSEG)
        int syssize;                    /* system size in 16-bytes,
                                         *   overwritten by tools/build */
{
        if (ES & 0x0fff) die;           // not 64KB aligned

        BX = 0;
        for (;;) {
rp_read:
#ifdef __BIG_KERNEL__
                bootsect_helper(ES:BX);
                /* INITSEG:0220==SETUPSEG:0020 is bootsect_kludge,
                 *   which contains pointer SETUPSEG:bootsect_helper().
                 * This function initializes some data structures
                 *   when it is called for the first time,
                 *   and moves SYSSEG:0 to 0x100000, 64KB each time,
                 *   in the following calls.
                 * See Section 3.7. */
#else
                AX = ES - SYSSEG + ( BX >> 4);  // how many 16-bytes read
#endif
                if (AX > syssize) return;       // everything loaded
ok1_read:
                /* Get proper AL (sectors to read) for this time
                 *   to prevent cylinder crossing reading and BX overflow. */
                AX = sectors - sread;
                CX = BX + (AX << 9);            // 1 sector = 2^9 bytes
                if (CX overflow && CX!=0) {     // > 64KB
                        AX = (-BX) >> 9;
                }
ok2_read:
                read_track(AL, ES:BX);
                set_next(AX);
        }
}

///////////////////////////////////////////////////////////////////////////////
// read disk with parameters (sread, track, head)
read_track(AL sectors, ES:BX destination)
{
        for (;;) {
                printf(".");
                // int10/AH=0Eh: VIDEO - TELETYPE OUTPUT

                // set CX, DX according to (sread, track, head)
                DX = track;
                CX = sread + 1;
                CH = DL;

                DX = head;
                DH = DL;
                DX &= 0x0100;

                int13/AH=02h(AL, ES:BX, CX, DX);
                // int13/AH=02h: DISK - READ SECTOR(S) INTO MEMORY
                if (read disk success) return;
                // "addw $8, %sp" is to cancel previous 4 "pushw" operations.
bad_rt:
                print_all();            // print error code, AX, BX, CX and DX
                int13/AH=00h(DL=0);     // reset FDC
        }
}

///////////////////////////////////////////////////////////////////////////////
// set ES:BX, sread, head and track for next read_track()
set_next(AX sectors_read)
{
        CX = AX;                        // sectors read
        AX += sread;
        if (AX==sectors) {
                head = 1 ^ head;        // flip head between 0 and 1
                if (head==0) track++;
ok4_set:
                AX = 0;
        }
ok3_set:
        sread = AX;
        BX += CX << 9;                  // 1 sector = 2^9 bytes
        if (BX overflow) {              // > 64KB
                ES += 0x1000;
                BX = 0;
        }
set_next_fn:
}
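
The two pieces of bookkeeping in read_it() and set_next(), namely clipping a read at the 64KB segment boundary and advancing sread/head/track afterwards, can be modelled in C as follows. This is a sketch under assumptions: struct disk_pos and next_count() are names invented here, and 32-bit arithmetic replaces the carry flag tests of the assembly.

```c
#include <assert.h>
#include <stdint.h>

struct disk_pos {
        uint16_t sectors;       /* sectors per track */
        uint16_t sread;         /* sectors already read on this track */
        uint16_t head;          /* current head, 0 or 1 */
        uint16_t track;         /* current track */
        uint16_t es, bx;        /* destination ES:BX */
};

/* Sectors to read next: the rest of the track, clipped so that
 * BX does not run past the end of the 64KB segment. */
uint16_t next_count(const struct disk_pos *p)
{
        uint16_t ax = (uint16_t)(p->sectors - p->sread);
        uint32_t end = (uint32_t)p->bx + ((uint32_t)ax << 9);

        if (end > 0x10000)              /* carry set and CX != 0 */
                ax = (uint16_t)((0x10000u - p->bx) >> 9);
        return ax;
}

/* Advance the position after `count` sectors were read (set_next). */
void set_next(struct disk_pos *p, uint16_t count)
{
        uint16_t ax = (uint16_t)(count + p->sread);
        uint32_t bx = (uint32_t)p->bx + ((uint32_t)count << 9);

        assert(ax <= p->sectors);
        if (ax == p->sectors) {         /* this side of the track is done */
                p->head ^= 1;
                if (p->head == 0)
                        p->track++;
                ax = 0;
        }
        p->sread = ax;
        if (bx > 0xFFFF) {              /* segment full: next 64KB */
                p->es += 0x1000;
                bx = 0;
        }
        p->bx = (uint16_t)bx;
}
```

Because next_count() never lets a read cross the segment end, the overflow case in set_next() is always an exact wrap to BX = 0.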


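3.7. Bootsect Helper

setup.S:bootsect_helper() is only used by bootsect.S:read_it().

Because bbootsect and bsetup are linked separately, they use offsets relative to their own code/data segments. We have to "call far" (lcall) bootsect_helper() in a different segment, and it must "return far" (lret) afterwards. The far call changes CS, which makes CS!=DS, so we have to use a segment modifier to address variables in setup.S.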

///////////////////////////////////////////////////////////////////////////////
// called by bootsect loader when loading bzImage
bootsect_helper(ES:BX)
        bootsect_es = 0;                // defined in setup.S
        type_of_loader = 0;             // defined in setup.S
{
        if (!bootsect_es) {             // called for the first time
                type_of_loader = 0x20;  // bootsect-loader, version 0
                AX = ES >> 4;
                *(byte*)(&bootsect_src_base+2) = AH;
                bootsect_es = ES;
                AX = ES - SYSSEG;
                return;
        }
bootsect_second:
        if (!BX) {                      // 64KB full
                // move from SYSSEG:0 to destination, 64KB each time
                int15/AH=87h(CX=0x8000, ES:SI=CS:bootsect_gdt);
                // int15/AH=87h: SYSTEM - COPY EXTENDED MEMORY
                if (failed to copy) {
                        bootsect_panic() {
                                prtstr("INT15 refuses to access high mem, "
                                        "giving up.");
bootsect_panic_loop:            goto bootsect_panic_loop;   // never return
                        }
                }
                ES = bootsect_es;       // reset ES to always point to 0x10000
                *(byte*)(&bootsect_dst_base+2)++;
        }
bootsect_ex:
        // have the number of moved frames (16-bytes) in AX
        AH = *(byte*)(&bootsect_dst_base+2) << 4;
        AL = 0;
}

///////////////////////////////////////////////////////////////////////////////
// data used by bootsect_helper()
bootsect_gdt:
        .word   0, 0, 0, 0
        .word   0, 0, 0, 0

bootsect_src:
        .word   0xffff

bootsect_src_base:
        .byte   0x00, 0x00, 0x01                # base = 0x010000
        .byte   0x93                            # typbyte
        .word   0                               # limit16,base24 =0

bootsect_dst:
        .word   0xffff

bootsect_dst_base:
        .byte   0x00, 0x00, 0x10                # base = 0x100000
        .byte   0x93                            # typbyte
        .word   0                               # limit16,base24 =0
        .word   0, 0, 0, 0                      # BIOS CS
        .word   0, 0, 0, 0                      # BIOS DS

bootsect_es:
        .word   0

bootsect_panic_mess:
        .string "INT15 refuses to access high mem, giving up."
Note that the type_of_loader value is changed. It will be referenced in Section 4.3.
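
bootsect_helper() manipulates only the third byte of the 24-bit base fields in bootsect_gdt. The C sketch below decodes such a base and reproduces the AX value returned at bootsect_ex; desc_base() and moved_clicks() are hypothetical names used only for this illustration.

```c
#include <assert.h>
#include <stdint.h>

/* 24-bit segment base as stored in the int15/AH=87h descriptor:
 * three bytes, least significant first. */
uint32_t desc_base(const uint8_t *base)
{
        return (uint32_t)base[0]
             | (uint32_t)base[1] << 8
             | (uint32_t)base[2] << 16;
}

/* AX returned by bootsect_helper(): 16-byte units moved so far.
 * Only the low 4 bits of base byte 2 survive the 8-bit shift. */
uint16_t moved_clicks(uint8_t dst_base_byte2)
{
        uint8_t ah = (uint8_t)(dst_base_byte2 << 4);
        return (uint16_t)(ah << 8);
}
```

Incrementing byte 2 of bootsect_dst_base moves the destination up by 0x10000, so each call with BX == 0 targets the next 64KB above 0x100000, and moved_clicks() grows by 0x1000 clicks (64KB) per move.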


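3.8. Miscellaneous

The rest are supporting functions, variables and part of the "real-mode kernel header". Note that data is placed in the .text segment along with code, so that it is properly initialized when loaded.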
///////////////////////////////////////////////////////////////////////////////
// some small functions
print_all();  /* print error code, AX, BX, CX and DX */
print_nl();   /* print CR LF */
print_hex();  /* print the word pointed to by SS:BP in hexadecimal */
kill_motor()  /* turn off floppy drive motor */
{
#if 1
        int13/AH=00h(DL=0);     // reset FDC
#else
        outb(0, 0x3F2);         // outb(val, port)
#endif
}

///////////////////////////////////////////////////////////////////////////////
sectors:        .word 0
disksizes:      .byte 36, 18, 15, 9
msg1:           .byte 13, 10
                .ascii "Loading"

The bootsect trailer, which is a part of the "real-mode kernel header", begins at offset 497.
.org 497
setup_sects:    .byte SETUPSECTS        // overwritten by tools/build
root_flags:     .word ROOT_RDONLY
syssize:        .word SYSSIZE           // overwritten by tools/build
swap_dev:       .word SWAP_DEV
ram_size:       .word RAMDISK
vid_mode:       .word SVGA_MODE
root_dev:       .word ROOT_DEV          // overwritten by tools/build
boot_flag:      .word 0xAA55

This "header" must conform to the layout pattern in linux/Documentation/i386/boot.txt:
Offset  Proto   Name            Meaning
/Size
01F1/1  ALL     setup_sects     The size of the setup in sectors
01F2/2  ALL     root_flags      If set, the root is mounted readonly
01F4/2  ALL     syssize         DO NOT USE - for bootsect.S use only
01F6/2  ALL     swap_dev        DO NOT USE - obsolete
01F8/2  ALL     ram_size        DO NOT USE - for bootsect.S use only
01FA/2  ALL     vid_mode        Video mode control
01FC/2  ALL     root_dev        Default root device number
01FE/2  ALL     boot_flag       0xAA55 magic number
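
A loader or an inspection tool can decode this trailer from a raw 512-byte sector. The following C sketch reads the fields byte by byte at the offsets listed above; struct bootsect_trailer and read_trailer() are hypothetical names, and all multi-byte fields are assumed little-endian as on i386.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fields of the bootsect trailer, offsets 0x1F1..0x1FF. */
struct bootsect_trailer {
        uint8_t  setup_sects;   /* 0x1F1 */
        uint16_t root_flags;    /* 0x1F2 */
        uint16_t syssize;       /* 0x1F4 */
        uint16_t swap_dev;      /* 0x1F6 */
        uint16_t ram_size;      /* 0x1F8 */
        uint16_t vid_mode;      /* 0x1FA */
        uint16_t root_dev;      /* 0x1FC */
        uint16_t boot_flag;     /* 0x1FE */
};

/* Read a 16-bit little-endian value. */
static uint16_t le16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

/* Decode the trailer byte by byte, so struct padding does not matter.
 * Returns 0 if boot_flag holds the 0xAA55 magic number, -1 otherwise. */
int read_trailer(const uint8_t *sector, struct bootsect_trailer *t)
{
        assert(sector != NULL && t != NULL);
        t->setup_sects = sector[0x1F1];
        t->root_flags  = le16(sector + 0x1F2);
        t->syssize     = le16(sector + 0x1F4);
        t->swap_dev    = le16(sector + 0x1F6);
        t->ram_size    = le16(sector + 0x1F8);
        t->vid_mode    = le16(sector + 0x1FA);
        t->root_dev    = le16(sector + 0x1FC);
        t->boot_flag   = le16(sector + 0x1FE);
        return t->boot_flag == 0xAA55 ? 0 : -1;
}
```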


3.9. Reference

As <IA-32 Intel Architecture Software Developer's Manual> is widely referenced in this document, I will call it "IA-32 Manual" for short.


4. linux/arch/i386/boot/setup.S

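setup.S is responsible for getting the system data from the BIOS and putting them into appropriate places in system memory.

Other boot loaders, like GNU GRUB and LILO, can load bzImage too. Such boot loaders should load bzImage into memory and set up the "real-mode kernel header", especially type_of_loader, then pass control to bsetup directly. setup.S assumes:


4.1. Header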

/* Signature words to ensure LILO loaded us right */
#define SIG1    0xAA55
#define SIG2    0x5A5A

INITSEG  = DEF_INITSEG          # 0x9000, we move boot here, out of the way
SYSSEG   = DEF_SYSSEG           # 0x1000, system loaded at 0x10000 (65536).
SETUPSEG = DEF_SETUPSEG         # 0x9020, this is the current segment
                                # ... and the former contents of CS

DELTA_INITSEG = SETUPSEG - INITSEG      # 0x0020

.code16
.text

///////////////////////////////////////////////////////////////////////////////
start:
{
        goto trampoline();              // skip the following header
}

# This is the setup header, and it must start at %cs:2 (old 0x9020:2)
                .ascii  "HdrS"          # header signature
                .word   0x0203          # header version number (>= 0x0105)
                                        # or else old loadlin-1.5 will fail)
realmode_swtch: .word   0, 0            # default_switch, SETUPSEG
start_sys_seg:  .word   SYSSEG
                .word   kernel_version  # pointing to kernel version string
                                        # above section of header is compatible
                                        # with loadlin-1.5 (header v1.5). Don't
                                        # change it.
// kernel_version defined below
type_of_loader: .byte   0               # = 0, old one (LILO, Loadlin,
                                        #      Bootlin, SYSLX, bootsect...)
                                        # See Documentation/i386/boot.txt for
                                        # assigned ids
# flags, unused bits must be zero (RFU) bit within loadflags
loadflags:
LOADED_HIGH     = 1                     # If set, the kernel is loaded high
CAN_USE_HEAP    = 0x80                  # If set, the loader also has set
                                        # heap_end_ptr to tell how much
                                        # space behind setup.S can be used for
                                        # heap purposes.
                                        # Only the loader knows what is free
#ifndef __BIG_KERNEL__
                .byte   0
#else
                .byte   LOADED_HIGH
#endif
setup_move_size: .word  0x8000          # size to move, when setup is not
                                        # loaded at 0x90000. We will move setup
                                        # to 0x90000 then just before jumping
                                        # into the kernel. However, only the
                                        # loader knows how much data behind
                                        # us also needs to be loaded.
code32_start:                           # here loaders can put a different
                                        # start address for 32-bit code.
#ifndef __BIG_KERNEL__
                .long   0x1000          #   0x1000 = default for zImage
#else
                .long   0x100000        # 0x100000 = default for big kernel
#endif
ramdisk_image:  .long   0               # address of loaded ramdisk image
                                        # Here the loader puts the 32-bit
                                        # address where it loaded the image.
                                        # This only will be read by the kernel.
ramdisk_size:   .long   0               # its size in bytes
bootsect_kludge:
                .word  bootsect_helper, SETUPSEG
heap_end_ptr:   .word   modelist+1024   # (Header version 0x0201 or later)
                                        # space from here (exclusive) down to
                                        # end of setup code can be used by setup
                                        # for local heap purposes.
// modelist is at the end of .text section
pad1:           .word   0
cmd_line_ptr:   .long 0                 # (Header version 0x0202 or later)
                                        # If nonzero, a 32-bit pointer
                                        # to the kernel command line.
                                        # The command line should be
                                        # located between the start of
                                        # setup and the end of low
                                        # memory (0xa0000), or it may
                                        # get overwritten before it
                                        # gets read.  If this field is
                                        # used, there is no longer
                                        # anything magical about the
                                        # 0x90000 segment; the setup
                                        # can be located anywhere in
                                        # low memory 0x10000 or higher.
ramdisk_max:    .long __MAXMEM-1        # (Header version 0x0203 or later)
                                        # The highest safe address for
                                        # the contents of an initrd

The __MAXMEM definition is in linux/asm-i386/page.h:
/*
 * A __PAGE_OFFSET of 0xC0000000 means that the kernel has
 * a virtual address space of one gigabyte, which limits the
 * amount of physical memory you can use to about 950MB.
 */
#define __PAGE_OFFSET           (0xC0000000)

/*
 * This much address space is reserved for vmalloc() and iomap()
 * as well as fixmap mappings.
 */
#define __VMALLOC_RESERVE       (128 << 20)

#define __MAXMEM                (-__PAGE_OFFSET-__VMALLOC_RESERVE)
This gives *__MAXMEM* = 1G - 128M.
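
The arithmetic can be checked with a minimal sketch. It assumes 32-bit unsigned wrap-around, which is how -__PAGE_OFFSET becomes 1G; the function names maxmem() and ramdisk_max_value() are hypothetical and not part of the kernel source.

```c
#include <stdint.h>

#define PAGE_OFFSET      UINT32_C(0xC0000000)
#define VMALLOC_RESERVE  (UINT32_C(128) << 20)

/* Unsigned 32-bit negation wraps modulo 2^32, so -0xC0000000 == 0x40000000 (1G). */
uint32_t maxmem(void)
{
    return (uint32_t)(UINT32_C(0) - PAGE_OFFSET - VMALLOC_RESERVE);
}

/* The ramdisk_max field of the setup header is initialized to __MAXMEM-1. */
uint32_t ramdisk_max_value(void)
{
    return maxmem() - UINT32_C(1);
}
```

Under these assumptions maxmem() evaluates to 0x38000000, i.e. 896M, so the highest safe initrd address is 0x37FFFFFF.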

The setup header must follow a certain layout pattern. Refer to linux/Documentation/i386/boot.txt:
Offset  Proto   Name            Meaning
/Size
0200/2  2.00+   jump            Jump instruction
0202/4  2.00+   header          Magic signature "HdrS"
0206/2  2.00+   version         Boot protocol version supported
0208/4  2.00+   realmode_swtch  Boot loader hook
020C/2  2.00+   start_sys       The load-low segment (0x1000) (obsolete)
020E/2  2.00+   kernel_version  Pointer to kernel version string
0210/1  2.00+   type_of_loader  Boot loader identifier
0211/1  2.00+   loadflags       Boot protocol option flags
0212/2  2.00+   setup_move_size Move to high memory size (used with hooks)
0214/4  2.00+   code32_start    Boot loader hook
0218/4  2.00+   ramdisk_image   initrd load address (set by boot loader)
021C/4  2.00+   ramdisk_size    initrd size (set by boot loader)
0220/4  2.00+   bootsect_kludge DO NOT USE - for bootsect.S use only
0224/2  2.01+   heap_end_ptr    Free memory after setup end
0226/2  N/A     pad1            Unused
0228/4  2.02+   cmd_line_ptr    32-bit pointer to the kernel command line
022C/4  2.03+   initrd_addr_max Highest legal initrd address
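
As an illustration of how a boot loader might consume this layout, the following sketch reads a few fields from a raw image buffer by offset. The offsets come from the table above; the function and constant names (setup_header_version(), setup_cmd_line_ptr(), HDR_*) are hypothetical, and fields are read byte by byte so that the sketch does not depend on host endianness or structure packing.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Offsets taken from the boot.txt table above. */
enum {
    HDR_MAGIC        = 0x202,  /* "HdrS" */
    HDR_VERSION      = 0x206,
    HDR_CMD_LINE_PTR = 0x228,
    HDR_END          = 0x230   /* first byte past initrd_addr_max */
};

/* Read little-endian fields byte by byte. */
uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Return the boot protocol version (e.g. 0x0203), or 0 if the image is too
 * short or carries no "HdrS" signature. */
uint16_t setup_header_version(const uint8_t *image, size_t len)
{
    if (len < (size_t)HDR_END || memcmp(image + HDR_MAGIC, "HdrS", 4) != 0)
        return 0;
    return rd16(image + HDR_VERSION);
}

/* cmd_line_ptr only exists from protocol 2.02 on; report 0 for older headers. */
uint32_t setup_cmd_line_ptr(const uint8_t *image, size_t len)
{
    if (setup_header_version(image, len) < 0x0202)
        return 0;
    return rd32(image + HDR_CMD_LINE_PTR);
}
```

Checking the version before touching a field mirrors the "Proto" column: a field is only meaningful when the header version is at least the one listed for it.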


4.2. Check Code Integrity

As *setup* code may not be contiguous, we should check code integrity first.
///////////////////////////////////////////////////////////////////////////////
trampoline()
{
        start_of_setup();       // never return
        .space 1024;
}

///////////////////////////////////////////////////////////////////////////////
// check signature to see if all code loaded
start_of_setup()
{
        // Bootlin depends on this being done early, check bootlin:technic.doc
        int13/AH=15h(AL=0, DL=0x81);
        // int13/AH=15h: DISK - GET DISK TYPE

#ifdef SAFE_RESET_DISK_CONTROLLER
        int13/AH=0(AL=0, DL=0x80);
        // int13/AH=00h: DISK - RESET DISK SYSTEM
#endif

        DS = CS;
        // check signature at end of setup
        if (setup_sig1!=SIG1 || setup_sig2!=SIG2) {
                goto bad_sig;
        }
        goto good_sig1;
}

///////////////////////////////////////////////////////////////////////////////
// some small functions
prtstr();  /* print asciiz string at DS:SI */
prtsp2();  /* print double space */
prtspc();  /* print single space */
prtchr();  /* print ascii in AL */
beep();    /* print CTRL-G, i.e. beep */
Check the signature to verify code integrity.

If the signature is not found, the rest of the *setup* code may precede *vmlinux* at SYSSEG:0.
no_sig_mess: .string "No setup signature found ..."

good_sig1:
        goto good_sig;                          // make near jump

///////////////////////////////////////////////////////////////////////////////
// move the rest setup code from SYSSEG:0 to CS:0800
bad_sig()
        DELTA_INITSEG = 0x0020 (= SETUPSEG - INITSEG)
        SYSSEG = 0x1000
        word start_sys_seg = SYSSEG;            // defined in setup header
{
        DS = CS - DELTA_INITSEG;                // aka INITSEG
        BX = (byte)(DS:[497]);                  // i.e. setup_sects

        // first 4 sectors already loaded
        CX = (BX - 4) << 8;                     // rest code in word (2-bytes)
        start_sys_seg = (CX >> 3) + SYSSEG;     // real system code start
        move SYSSEG:0 to CS:0800 (CX*2 bytes);

        if (setup_sig1!=SIG1 || setup_sig2!=SIG2) {
no_sig:
                prtstr("No setup signature found ...");
no_sig_loop:
                hlt;
                goto no_sig_loop;
        }
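}
The "hlt" instruction stops instruction execution and places the processor in the halt state. The processor generates a special bus cycle to indicate that halt mode has been entered. When an enabled interrupt (including NMI) is issued, the processor resumes execution after the "hlt" instruction, and the instruction pointer (CS:EIP), pointing to the instruction following the "hlt", is saved to the stack before the interrupt handler is called. Thus we need a "jmp" instruction after the "hlt" to put the processor back into the halt state.

The *setup* code has been moved to the correct place. The variable *start_sys_seg* points to where the real system code starts. If "bad_sig" does not happen, *start_sys_seg* remains SYSSEG.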
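
The shifts in bad_sig() are unit conversions: one 512-byte sector holds 2^8 words, and one segment unit (16 bytes) holds 2^3 words. A minimal sketch of that arithmetic, with hypothetical function names and assuming setup_sects >= 4:

```c
#include <stdint.h>

#define SYSSEG 0x1000u

/* Words (2 bytes each) of setup code that still sit at SYSSEG:0.
 * The first 4 sectors are already in place. */
uint16_t rest_setup_words(uint8_t setup_sects)
{
    return (uint16_t)((setup_sects - 4u) << 8);
}

/* Segment where the real system code starts once the rest of setup is
 * skipped: 8 words per 16-byte segment unit, hence the shift by 3. */
uint16_t real_start_sys_seg(uint8_t setup_sects)
{
    return (uint16_t)((rest_setup_words(setup_sects) >> 3) + SYSSEG);
}
```

For example, with setup_sects = 10 there are 6 sectors (1536 words, 0xC00 bytes) left at SYSSEG:0, so the system code starts at segment 0x10C0.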


4.3. Check Loader Type

Check if the loader is compatible with the image.
///////////////////////////////////////////////////////////////////////////////
good_sig()
        char loadflags;                 // in setup header
        char type_of_loader;            // in setup header
        LOADED_HIGH = 1
{
        DS = CS - DELTA_INITSEG;        // aka INITSEG
        if ( (loadflags & LOADED_HIGH) && !type_of_loader ) {
                // Nope, old loader tries to load big-kernel
                prtstr("Wrong loader, giving up...");
                goto no_sig_loop;       // defined above in bad_sig()
        }
}

loader_panic_mess: .string "Wrong loader, giving up..."
Note that *type_of_loader* has been changed to 0x20 by *bootsect_helper()* when it loads *bvmlinux*.
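
The check itself is a single predicate over two header bytes. A minimal sketch follows; the function name loader_is_compatible() is hypothetical, and the flag value 1 is the one used in the pseudocode above.

```c
#include <stdint.h>

#define LOADED_HIGH 0x01u

/* A high-loaded (big) kernel needs a loader that identified itself;
 * type_of_loader == 0 means the loader never filled in the field. */
int loader_is_compatible(uint8_t loadflags, uint8_t type_of_loader)
{
    return !((loadflags & LOADED_HIGH) && type_of_loader == 0);
}
```

A zImage (LOADED_HIGH clear) passes with any loader, while a bzImage passes only when type_of_loader is nonzero, for instance the 0x20 written by bootsect_helper().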


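4.4. Get Memory Size

Try three different memory detection schemes to get the extended memory size (above 1M) in KB.

First, try e820h, which lets us assemble a memory map; then try e801h, which returns a 32-bit memory size; and finally 88h, which returns 0-64M.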
///////////////////////////////////////////////////////////////////////////////
// get memory size
loader_ok()
        E820NR  = 0x1E8
        E820MAP = 0x2D0
{
        // when entering this function, DS = CS-DELTA_INITSEG, aka INITSEG
        (long)DS:[0x1E0] = 0;

#ifndef STANDARD_MEMORY_BIOS_CALL
        (byte)DS:[0x1E8] = 0;                   // E820NR

        /* method E820H: see ACPI spec
         * the memory map from hell.  e820h returns memory classified into
         * a whole bunch of different types, and allows memory holes and
         * everything.  We scan through this memory map and build a list
         * of the first 32 memory areas, which we return at [E820MAP]. */
meme820:
        EBX = 0;
        DI = 0x02D0;                            // E820MAP
        do {
jmpe820:
                int15/EAX=E820h(EDX='SMAP', EBX, ECX=20, ES:DI=DS:DI);
                // int15/AX=E820h: GET SYSTEM MEMORY MAP
                if (failed || 'SMAP'!=EAX) break;
                // if (1!=DS:[DI+16]) continue; // not usable
good820:
                if (DS:[0x1E8]>=32) break;      // entry# >= E820MAX
                DS:[0x1E8]++;                   // entry# ++;
                DI += 20;                       // adjust buffer for next
again820:
        } while (EBX);                          // EBX != 0: not finished
bail820:

        /* method E801H:
         * memory size is in 1k chunksizes, to avoid confusing loadlin.
         * we store the 0xe801 memory size in a completely different place,
         * because it will most likely be longer than 16 bits.
         * (use 1e0 because that's what Larry Augustine uses in his
         * alternative new memory detection scheme, and it's sensible
         * to write everything into the same place.) */
meme801:
        stc;            // to work around buggy BIOSes
        CX = DX = 0;
        int15/AX=E801h;
        /* int15/AX=E801h: GET MEMORY SIZE FOR >64M CONFIGURATIONS
         *   AX = extended memory between 1M and 16M, in K (max 3C00 = 15MB)
         *   BX = extended memory above 16M, in 64K blocks
         *   CX = configured memory 1M to 16M, in K
         *   DX = configured memory above 16M, in 64K blocks */
        if (failed) goto mem88;
        if (!CX && !DX) {
                CX = AX;
                DX = BX;
        }
e801usecxdx:
        (long)DS:[0x1E0] = ((EDX & 0xFFFF) << 6) + (ECX & 0xFFFF);      // in K
#endif

mem88:  // old traditional method
        int15/AH=88h;
        /* int15/AH=88h: SYSTEM - GET EXTENDED MEMORY SIZE
         *   AX = number of contiguous KB starting at absolute address 100000h */
        DS:[2] = AX;
}
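
The e801h result is combined from two 16-bit registers with different units. A minimal sketch of that computation, using a hypothetical function name and taking the register values as plain arguments:

```c
#include <stdint.h>

/* Extended memory size in KB from the int15/AX=E801h register values.
 * ax/cx: KB between 1M and 16M; bx/dx: 64K blocks above 16M. */
uint32_t e801_ext_mem_kb(uint16_t ax, uint16_t bx, uint16_t cx, uint16_t dx)
{
    if (cx == 0 && dx == 0) {   /* fall back to AX/BX when CX/DX are empty */
        cx = ax;
        dx = bx;
    }
    /* 64K blocks -> KB (shift by 6), plus the part below 16M */
    return ((uint32_t)dx << 6) + cx;
}
```

The result may need more than 16 bits, which is why setup.S stores it as a long at DS:[0x1E0] rather than next to the 88h value at DS:[2].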


4.5. Hardware Support

Check hardware support, such as keyboard, video adapter, hard disk, MCA bus and pointing device.
{
        // set the keyboard repeat rate to the max
        int16/AX=0305h(BX=0);
        // int16/AH=03h: KEYBOARD - SET TYPEMATIC RATE AND DELAY

        /* Check for video adapter and its parameters and
         *   allow the user to browse video modes. */
        video();                        // see video.S

        // get hd0 and hd1 data
        copy hd0 data (*int41) to CS-DELTA_INITSEG:0080 (16 bytes);
        // int41: SYSTEM DATA - HARD DISK 0 PARAMETER TABLE ADDRESS
        copy hd1 data (*int46) to CS-DELTA_INITSEG:0090 (16 bytes);
        // int46: SYSTEM DATA - HARD DISK 1 PARAMETER TABLE ADDRESS
        // check if hd1 exists
        int13/AH=15h(AL=0, DL=0x81);
        // int13/AH=15h: DISK - GET DISK TYPE
        if (failed || AH!=03h) {        // AH==03h if it is a hard disk
no_disk1:
                clear CS-DELTA_INITSEG:0090 (16 bytes);
        }
is_disk1:

        // check for Micro Channel (MCA) bus
        CS-DELTA_INITSEG:[0xA0] = 0;    // set table length to 0
        int15/AH=C0h;
        /* int15/AH=C0h: SYSTEM - GET CONFIGURATION
         *   ES:BX = ROM configuration table */
        if (failed) goto no_mca;
        move ROM configuration table (ES:BX) to CS-DELTA_INITSEG:00A0;
        // CX = (table length<14)? CX:16;    first 16 bytes only
no_mca:

        // check for PS/2 pointing device
        CS-DELTA_INITSEG:[0x1FF] = 0;   // default is no pointing device
        int11h();
        // int11h: BIOS - GET EQUIPMENT LIST
        if (AL & 0x04) {                // mouse installed
                DS:[0x1FF] = 0xAA;
        }
}


4.6. APM Support

Check BIOS APM support.
#if defined(CONFIG_APM) || defined(CONFIG_APM_MODULE)
{
        DS:[0x40] = 0;                  // version = 0 means no APM BIOS
        int15/AX=5300h(BX=0);
        // int15/AX=5300h: Advanced Power Management v1.0+ - INSTALLATION CHECK
        if (failed || 'PM'!=BX || !(CX & 0x02)) goto done_apm_bios;
        // (CX & 0x02) means 32 bit is supported
        int15/AX=5304h(BX=0);
        // int15/AX=5304h: Advanced Power Management v1.0+ - DISCONNECT INTERFACE
        EBX = CX = DX = ESI = DI = 0;
        int15/AX=5303h(BX=0);
        /* int15/AX=5303h: Advanced Power Management v1.0+
         *   - CONNECT 32-BIT PROTMODE INTERFACE */
        if (failed) {
no_32_apm_bios:                         // I moved label no_32_apm_bios here
                DS:[0x4C] &= ~0x0002;   // remove 32 bit support bit
                goto done_apm_bios;
        }
        DS:[0x42] = AX, 32-bit code segment base address;
        DS:[0x44] = EBX, offset of entry point;
        DS:[0x48] = CX, 16-bit code segment base address;
        DS:[0x4A] = DX, 16-bit data segment base address;
        DS:[0x4E] = ESI, APM BIOS code segment length;
        DS:[0x52] = DI, APM BIOS data segment length;
        int15/AX=5300h(BX=0);           // check again
        // int15/AX=5300h: Advanced Power Management v1.0+ - INSTALLATION CHECK
        if (success &&  'PM'==BX) {
                DS:[0x40] = AX, APM version;
                DS:[0x4C] = CX, APM flags;
        } else {
apm_disconnect:
                int15/AX=5304h(BX=0);
                /* int15/AX=5304h: Advanced Power Management v1.0+
                 *   - DISCONNECT INTERFACE */
        }
done_apm_bios:
}
#endif


4.7. Prepare for Protected Mode

// call mode switch
{
        if (realmode_swtch) {
                realmode_swtch();               // mode switch hook
        } else {
rmodeswtch_normal:
                default_switch() {
                        cli;                    // no interrupts allowed
                        outb(0x80, 0x70);       // disable NMI
                }
        }
rmodeswtch_end:
}

// relocate code if necessary
{
        (long)code32 = code32_start;
        if (!(loadflags & LOADED_HIGH)) {       // low loaded zImage
                // 0x0100 <= start_sys_seg < CS-DELTA_INITSEG
do_move0:
                AX = 0x100;
                BP = CS - DELTA_INITSEG;        // aka INITSEG
                BX = start_sys_seg;
do_move:
                move system image from (start_sys_seg:0 .. CS-DELTA_INITSEG:0)
                        to 0100:0;              // move 0x1000 bytes each time
        }
end_move:
Note that code32_start is initialized to 0x1000 for zImage, or 0x100000 for bzImage. The code32 value will be used when passing control to linux/arch/i386/boot/compressed/head.S in Section 4.9. If we boot zImage, it relocates vmlinux to 0100:0; if we boot bzImage, bvmlinux remains at start_sys_seg:0. The relocation address must match the "-Ttext" option in linux/arch/i386/boot/compressed/Makefile. See Section 2.5.

Then it will relocate code from CS-DELTA_INITSEG:0 (bbootsect and bsetup) to INITSEG:0, if necessary.
        DS = CS;                // aka SETUPSEG
        // Check whether we need to be downward compatible with version <=201
        if (!cmd_line_ptr && 0x20!=type_of_loader && SETUPSEG!=CS) {
                cli;            // as interrupt may use stack when we are moving
                // store new SS in DX
                AX = CS - DELTA_INITSEG;
                DX = SS;
                if (DX>=AX) {   // stack frame will be moved together
                        DX = DX + INITSEG - AX; // i.e. SS-CS+SETUPSEG
                }
move_self_1:
                /* move CS-DELTA_INITSEG:0 to INITSEG:0 (setup_move_size bytes)
                 *   in two steps in order not to overwrite code on CS:IP
                 * move up (src < dest) but downward ("std") */
                move CS-DELTA_INITSEG:move_self_here+0x200
                  to INITSEG:move_self_here+0x200,
                  setup_move_size-(move_self_here+0x200) bytes;
                // INITSEG:move_self_here+0x200 == SETUPSEG:move_self_here
                goto SETUPSEG:move_self_here;   // CS=SETUPSEG now
move_self_here:
                move CS-DELTA_INITSEG:0 to INITSEG:0,
                  move_self_here+0x200 bytes;   // I mean old CS before goto
                DS = SETUPSEG;
                SS = DX;
        }
end_move_self:
}
Note again that type_of_loader has been changed to 0x20 by bootsect_helper() when it loads bvmlinux.


4.8. Enable A20

For the A20 problem and its solutions, refer to A20 - a pain from the past.
        A20_TEST_LOOPS          =  32   # Iterations per wait
        A20_ENABLE_LOOPS        = 255   # Total loops to try
{
#if defined(CONFIG_MELAN)
        // Enable A20. AMD Elan bug fix.
        outb(0x02, 0x92);               // outb(val, port)
a20_elan_wait:
        while (!a20_test());            // test not passed
        goto a20_done;
#endif

a20_try_loop:
        // First, see if we are on a system with no A20 gate.
a20_none:
        if (a20_test()) goto a20_done;  // test passed

        // Next, try the BIOS (INT 0x15, AX=0x2401)
a20_bios:
        int15/AX=2401h;
        // Int15/AX=2401h: SYSTEM - later PS/2s - ENABLE A20 GATE
        if (a20_test()) goto a20_done;  // test passed

        // Try enabling A20 through the keyboard controller
a20_kbc:
        empty_8042();
        if (a20_test()) goto a20_done;  // test again in case BIOS delayed
        outb(0xD1, 0x64);               // command write
        empty_8042();
        outb(0xDF, 0x60);               // A20 on
        empty_8042();
        // wait until a20 really *is* enabled
a20_kbc_wait:
        CX = 0;
a20_kbc_wait_loop:
        do {
                if (a20_test()) goto a20_done;  // test passed
        } while (--CX)

        // Final attempt: use "configuration port A"
        outb((inb(0x92) | 0x02) & 0xFE, 0x92);
        // wait for configuration port A to take effect
a20_fast_wait:
        CX = 0;
a20_fast_wait_loop:
        do {
                if (a20_test()) goto a20_done;  // test passed
        } while (--CX)

        // A20 is still not responding. Try frobbing it again.
        if (--a20_tries) goto a20_try_loop;
        prtstr("linux: fatal error: A20 gate not responding!");
a20_die:
        hlt;
        goto a20_die;
}

a20_tries:
        .byte   A20_ENABLE_LOOPS                // i.e. 255
a20_err_msg:
        .ascii  "linux: fatal error: A20 gate not responding!"
        .byte   13, 10, 0
For I/O port operations, take a look at the related reference materials in Section 4.11.
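
The reason a20_test() (see Section 4.10) compares FS:[0x200] with GS:[0x210], where FS=0 and GS=0xFFFF, can be seen from the address arithmetic. The sketch below models the A20 gate as a mask on address bit 20; the function name is hypothetical.

```c
#include <stdint.h>

/* Linear address seen by memory for a real-mode segment:offset pair.
 * With the A20 gate disabled, address line 20 is forced low, so
 * addresses wrap around at 1M. */
uint32_t real_mode_linear(uint16_t seg, uint16_t off, int a20_enabled)
{
    uint32_t addr = ((uint32_t)seg << 4) + off;
    return a20_enabled ? addr : (addr & ~(UINT32_C(1) << 20));
}
```

With A20 disabled, FFFF:0210 wraps to the same byte as 0000:0200, so a value written through FS is read back through GS. Once A20 is enabled the two addresses differ (0x100200 versus 0x200) and the comparison in a20_test() fails, which is the "test passed" case.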


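4.9. Switch to Protected Mode

To ensure code compatibility with all 32-bit IA-32 processors, perform the following steps to switch to protected mode:

  1. Prepare the GDT with a null descriptor in the first GDT entry, one code segment descriptor and one data segment descriptor;

  2. Disable interrupts, including maskable hardware interrupts and NMI;

  3. Load the base address and limit of the GDT into the GDTR register, using the "lgdt" instruction;

  4. Set the PE flag in the CR0 register, using the "mov cr0" (Intel 386 and up) or "lmsw" instruction (for compatibility with Intel 286);

  5. Immediately execute a far "jmp" or a far "call" instruction.

The stack can be placed in a normal read/write data segment, so no dedicated descriptor is required.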

a20_done:
{
        lidt    idt_48;         // load idt with 0, 0;

        // convert DS:gdt to a linear ptr
        *(long*)(gdt_48+2) = (DS << 4) + &gdt;
        lgdt    gdt_48;

        // reset coprocessor
        outb(0, 0xF0);
        delay();
        outb(0, 0xF1);
        delay();

        // reprogram the interrupts
        outb(0xFF, 0xA1);       // mask all interrupts
        delay();
        outb(0xFB, 0x21);       // mask all irq's but irq2 which is cascaded

        // protected mode!
        AX = 1;
        lmsw ax;                // machine status word, bit 0 thru 15 of CR0
                                // only affects PE, MP, EM & TS flags
        goto flush_instr;

flush_instr:
        BX = 0;                                 // flag to indicate a boot
        ESI = (CS - DELTA_INITSEG) << 4;        // pointer to real-mode code
        /* NOTE: For high loaded big kernels we need a
         * jmpi    0x100000,__KERNEL_CS
         *
         * but we yet haven't reloaded the CS register, so the default size
         * of the target offset still is 16 bit.
         * However, using an operand prefix (0x66), the CPU will properly
         * take our 48 bit far pointer. (INTeL 80386 Programmer's Reference
         * Manual, Mixing 16-bit and 32-bit code, page 16-6) */

        // goto __KERNEL_CS:[(uint32*)code32];
        .byte   0x66, 0xea
code32: .long   0x1000          // overwritten in Section 4.7
        .word   __KERNEL_CS     // segment 0x10
        // see linux/arch/i386/boot/compressed/head.S:startup_32
}
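The far "jmp" instruction (0xea) updates the CS register. The contents of the remaining segment registers (DS, SS, ES, FS and GS) should be reloaded later. The operand-size prefix (0x66) is used to force "jmp" to be executed upon the 32-bit operand code32. For operand-size prefix details, check the IA-32 Manual (Vol.1. Ch.3.6. Operand-size and Address-size Attributes, and Vol.3. Ch.17. Mixing 16-bit and 32-bit Code).

Control is passed to linux/arch/i386/boot/compressed/head.S:startup_32. For zImage, it is at address 0x1000; for bzImage, it is at 0x100000. See Section 5.

ESI points to the memory area of collected system data. It is used to pass parameters from the 16-bit real mode code of the kernel to the 32-bit part. See linux/Documentation/i386/zero-page.txt for details.

For mode switching details, refer to the IA-32 Manual Vol.3. (Ch.9.8. Software Initialization for Protected-Mode Operation, Ch.9.9.1. Switching to Protected Mode, and Ch.17.4. Transferring Control Among Mixed-Size Code Segments).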
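
The two descriptors that the GDT provides for __KERNEL_CS (0x10) and __KERNEL_DS (0x18) are written in setup.S as four 16-bit words each (see Section 4.10). The sketch below decodes such a word group according to the IA-32 segment descriptor layout; the struct and function names are hypothetical.

```c
#include <stdint.h>

/* One 8-byte segment descriptor as the four .word values used in setup.S. */
struct gdt_words { uint16_t w0, w1, w2, w3; };

uint32_t desc_base(struct gdt_words d)
{
    return (uint32_t)d.w1                      /* base 15..0  */
         | ((uint32_t)(d.w2 & 0x00FF) << 16)   /* base 23..16 */
         | ((uint32_t)(d.w3 & 0xFF00) << 16);  /* base 31..24 */
}

uint32_t desc_limit(struct gdt_words d)
{
    return (uint32_t)d.w0                      /* limit 15..0  */
         | ((uint32_t)(d.w3 & 0x000F) << 16);  /* limit 19..16 */
}

/* Access byte: 0x9A = present, ring 0, code, readable;
 *              0x92 = present, ring 0, data, writable. */
uint8_t desc_access(struct gdt_words d)
{
    return (uint8_t)(d.w2 >> 8);
}

/* Segment size in bytes; the G bit (0x0080 in w3) scales the limit by 4K. */
uint64_t desc_size(struct gdt_words d)
{
    uint64_t units = (uint64_t)desc_limit(d) + 1;
    return (d.w3 & 0x0080) ? units << 12 : units;
}
```

Decoding { 0xFFFF, 0, 0x9A00, 0x00CF } this way gives base 0, limit 0xFFFFF and a 4K granularity, i.e. a flat 4G code segment, which matches the "0x100000*0x1000 = 4Gb" comment in the source.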


4.10. Miscellaneous

The rest are supporting functions and variables.
/* macros created by linux/Makefile targets:
 *   include/linux/compile.h and include/linux/version.h */
kernel_version: .ascii  UTS_RELEASE
                .ascii  " ("
                .ascii  LINUX_COMPILE_BY
                .ascii  "@"
                .ascii  LINUX_COMPILE_HOST
                .ascii  ") "
                .ascii  UTS_VERSION
                .byte   0

///////////////////////////////////////////////////////////////////////////////
default_switch() { cli; outb(0x80, 0x70); } /* disable interrupts and NMI */
bootsect_helper(ES:BX); /* see Section 3.7 */

///////////////////////////////////////////////////////////////////////////////
a20_test()
{
        FS = 0;
        GS = 0xFFFF;
        CX = A20_TEST_LOOPS;                    // i.e. 32
        AX = FS:[0x200];
        do {
a20_test_wait:
                FS:[0x200] = ++AX;
                delay();
        } while (AX==GS:[0x210] && --CX);
        return (AX!=GS:[0x210]);
        // ZF==0 (i.e. NZ/NE, a20_test!=0) means test passed
}

///////////////////////////////////////////////////////////////////////////////
// check that the keyboard command queue is empty
empty_8042()
{
        int timeout = 100000;

        for (;;) {
empty_8042_loop:
                if (!--timeout) return;
                delay();
                inb(0x64, &AL);                 // 8042 status port
                if (AL & 1) {                   // has output
                        delay();
                        inb(0x60, &AL);         // read it
no_output:      } else if (!(AL & 2)) return;   // no input either
        }
}

///////////////////////////////////////////////////////////////////////////////
// read the CMOS clock, return the seconds in AL, used in video.S
gettime()
{
        int1A/AH=02h();
        /* int1A/AH=02h: TIME - GET REAL-TIME CLOCK TIME
         * DH = seconds in BCD */
        AL = DH & 0x0F;
        AH = DH >> 4;
        aad;
}

///////////////////////////////////////////////////////////////////////////////
delay() { outb(AL, 0x80); }                     // needed after doing I/O

// Descriptor table
gdt:
        .word   0, 0, 0, 0                      # dummy
        .word   0, 0, 0, 0                      # unused
        // segment 0x10, __KERNEL_CS
        .word   0xFFFF                          # 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0                               # base address = 0
        .word   0x9A00                          # code read/exec
        .word   0x00CF                          # granularity = 4096, 386
                                                #  (+5th nibble of limit)
        // segment 0x18, __KERNEL_DS
        .word   0xFFFF                          # 4Gb - (0x100000*0x1000 = 4Gb)
        .word   0                               # base address = 0
        .word   0x9200                          # data read/write
        .word   0x00CF                          # granularity = 4096, 386
                                                #  (+5th nibble of limit)
idt_48:
        .word   0                               # idt limit = 0
        .word   0, 0                            # idt base = 0L
/* [gdt_48] should be 0x0800 (2048) to match the comment,
 *   like what Linux 2.2.22 does. */
gdt_48:
        .word   0x8000                          # gdt limit=2048,
                                                #  256 GDT entries
        .word   0, 0                            # gdt base (filled in later)

#include "video.S"

// signature at the end of setup.S:
{
setup_sig1:     .word   SIG1                    // 0xAA55
setup_sig2:     .word   SIG2                    // 0x5A5A
modelist:
}

Video setup and detection code is in video.S:
ASK_VGA = 0xFFFD  // defined in linux/include/asm-i386/boot.h
///////////////////////////////////////////////////////////////////////////////
video()
{
        pushw DS;               // use different segments
        FS = DS;
        DS = ES = CS;
        GS = 0;
        cld;
        basic_detect();         // basic adapter type testing (EGA/VGA/MDA/CGA)
#ifdef CONFIG_VIDEO_SELECT
        if (FS:[0x01FA]!=ASK_VGA) {     // user selected video mode
                mode_set();
                if (failed) {
                        prtstr("You passed an undefined mode number.\n");
                        mode_menu();
                }
        } else {
vid2:           mode_menu();
        }
vid1:
#ifdef CONFIG_VIDEO_RETAIN
        restore_screen();               // restore screen contents
#endif /* CONFIG_VIDEO_RETAIN */
#endif /* CONFIG_VIDEO_SELECT */
        mode_params();                  // store mode parameters
        popw ds;                        // restore original DS
}
/* TODO: video() details */


4.11. Reference


5. linux/arch/i386/boot/compressed/head.S

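We are in bvmlinux now! With the help of misc.c:decompress_kernel(), we are going to decompress piggy.o to get the resident kernel image linux/vmlinux.

This file is pure 32-bit startup code. Unlike the previous two files, it has no ".code16" statement in the source file. Refer to Using as: Writing 16-bit Code for details.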


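5.1. Decompress Kernel

The segment base addresses in the segment descriptors (which correspond to segment selectors __KERNEL_CS and __KERNEL_DS) are equal to 0; therefore, the logical address offset (in segment:offset format) is equal to its linear address if either of these segment selectors is used. For zImage, CS:EIP is now at logical address 10:1000 (linear address 0x1000); for bzImage, at 10:100000 (linear address 0x100000).

As paging is not enabled, the linear address is identical to the physical address. Check the IA-32 Manual (Vol.1. Ch.3.3. Memory Organization, and Vol.3. Ch.3. Protected-Mode Memory Management) and Linux Device Drivers: Memory Management in Linux for address issues.

We come from setup.S with BX=0 and ESI=INITSEG<<4.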

.text
///////////////////////////////////////////////////////////////////////////////
startup_32()
{
        cld;
        cli;
        DS = ES = FS = GS = __KERNEL_DS;
        SS:ESP = *stack_start;  // end of user_stack[], defined in misc.c
        // all segment registers are reloaded after protected mode is enabled

        // check that A20 really IS enabled
        EAX = 0;
        do {
1:              DS:[0] = ++EAX;
        } while (DS:[0x100000]==EAX);

        EFLAGS = 0;
        clear BSS;                              // from _edata to _end

        struct moveparams mp;                   // subl $16,%esp
        if (!decompress_kernel(&mp, ESI)) {     // return value in EAX
                restore ESI from stack;
                EBX = 0;
                goto __KERNEL_CS:100000;
                // see linux/arch/i386/kernel/head.S:startup_32
        }

        /*
         * We come here, if we were loaded high.
         * We need to move the move-in-place routine down to 0x1000
         * and then start it with the buffer addresses in registers,
         * which we got from the stack.
         */
3:      move move_routine_start..move_routine_end to 0x1000;
        // move_routine_start & move_routine_end are defined below

        // prepare move_routine_start() parameters
        EBX = real mode pointer;        // ESI value passed from setup.S
        ESI = mp.low_buffer_start;
        ECX = mp.lcount;
        EDX = mp.high_buffer_start;
        EAX = mp.hcount;
        EDI = 0x100000;
        cli;                    // make sure we don't get interrupted
        goto __KERNEL_CS:1000;  // move_routine_start();
}

/* Routine (template) for moving the decompressed kernel in place,
 * if we were high loaded. This _must_ PIC-code ! */
///////////////////////////////////////////////////////////////////////////////
move_routine_start()
{
        move mp.low_buffer_start to 0x100000, mp.lcount bytes,
          in two steps: (lcount >> 2) words + (lcount & 3) bytes;
        move/append mp.high_buffer_start, ((mp.hcount + 3) >> 2) words
        // 1 word == 4 bytes, as I mean 32-bit code/data.

        ESI = EBX;              // real mode pointer, as that from setup.S
        EBX = 0;
        goto __KERNEL_CS:100000;
        // see linux/arch/i386/kernel/head.S:startup_32()
move_routine_end:
}
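For the meaning of "je 1b" and "jnz 3f", refer to Using as: Local Symbol Names.

Didn't find the _edata and _end definitions? No problem, they are defined in the "internal linker script". Without the -T (--script=) option specified, ld uses this builtin script to link compressed/bvmlinux. Use "ld --verbose" to display this script, or check Appendix B. Internal Linker Script.

Refer to Using LD, the GNU linker: Command Line Options for the description of the -T (--script=), -L (--library-path=) and --verbose options. "man ld" and "info ld" may also help.

piggy.o has been decompressed and control is passed to __KERNEL_CS:100000, i.e. linux/arch/i386/kernel/head.S:startup_32(). See Section 6.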
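
The copy pattern of move_routine_start(), whole 32-bit words first and then the remaining 0 to 3 bytes, can be sketched in C as follows. This is only an illustration with hypothetical function names; memcpy() stands in for the string-move instructions of the real routine.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy n bytes the way the low buffer is moved: (n >> 2) 32-bit words,
 * then the (n & 3) bytes that are left over. */
void copy_words_then_bytes(uint8_t *dst, const uint8_t *src, size_t n)
{
    size_t words = n >> 2;     /* lcount >> 2 */
    size_t tail  = n & 3;      /* lcount & 3  */

    memcpy(dst, src, words * 4);
    memcpy(dst + words * 4, src + words * 4, tail);
}

/* The high buffer is appended in whole words only, so its byte count is
 * rounded up to the next multiple of 4. */
size_t rounded_up_words(size_t hcount)
{
    return (hcount + 3) >> 2;
}
```

Rounding the high part up may copy up to 3 bytes past the end of the decompressed image, which is harmless because nothing else lives there yet.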

#define LOW_BUFFER_START      0x2000
#define LOW_BUFFER_MAX       0x90000
#define HEAP_SIZE             0x3000
///////////////////////////////////////////////////////////////////////////////
asmlinkage int decompress_kernel(struct moveparams *mv, void *rmode)
|-- setup real_mode(=rmode), vidmem, vidport, lines and cols;
|-- if (is_zImage) setup_normal_output_buffer() {
|       output_data      = 0x100000;
|       free_mem_end_ptr = real_mode;
|   } else (is_bzImage) setup_output_buffer_if_we_run_high(mv) {
|       output_data      = LOW_BUFFER_START;
|       low_buffer_end   = MIN(real_mode, LOW_BUFFER_MAX) & ~0xfff;
|       low_buffer_size  = low_buffer_end - LOW_BUFFER_START;
|       free_mem_end_ptr = &end + HEAP_SIZE;
|       // get mv->low_buffer_start and mv->high_buffer_start
|       mv->low_buffer_start = LOW_BUFFER_START;
|       /* To make this program work, we must have
|        *   high_buffer_start > &end+HEAP_SIZE;
|        * As we will move low_buffer from LOW_BUFFER_START to 0x100000
|        *   (max low_buffer_size bytes) finally, we should have
|        *   high_buffer_start > 0x100000+low_buffer_size; */
|       mv->high_buffer_start = high_buffer_start
|           = MAX(&end+HEAP_SIZE, 0x100000+low_buffer_size);
|       mv->hcount =  0 if (0x100000+low_buffer_size >  &end+HEAP_SIZE);
|                  = -1 if (0x100000+low_buffer_size <= &end+HEAP_SIZE);
|       /* mv->hcount==0 : we need not move high_buffer later,
|        *   as it is already at 0x100000+low_buffer_size.
|        * Used by close_output_buffer_if_we_run_high() below. */
|   }
|-- makecrc();          // create crc_32_tab[]
|   puts("Uncompressing Linux... ");
|-- gunzip();
|   puts("Ok, booting the kernel.\n");
|-- if (is_bzImage) close_output_buffer_if_we_run_high(mv) {
|       // get mv->lcount and mv->hcount
|       if (bytes_out > low_buffer_size) {
|           mv->lcount = low_buffer_size;
|           if (mv->hcount)
|               mv->hcount = bytes_out - low_buffer_size;
|       } else {
|           mv->lcount = bytes_out;
|           mv->hcount = 0;
|       }
|   }
`-- return is_bzImage;  // return value in EAX
end is also defined in the "internal linker script".
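
The buffer bookkeeping for bzImage can be followed with concrete numbers using the model below. It is a sketch, not the code from misc.c: addresses are plain 32-bit integers instead of pointers, end_addr stands for &end, and the struct and function names are hypothetical.

```c
#include <stdint.h>

#define LOW_BUFFER_START 0x2000u
#define LOW_BUFFER_MAX   0x90000u
#define HEAP_SIZE        0x3000u

struct moveparams_model {
    uint32_t low_buffer_start;
    int32_t  lcount;
    uint32_t high_buffer_start;
    int32_t  hcount;
};

/* Models setup_output_buffer_if_we_run_high(); returns low_buffer_size. */
uint32_t setup_high(struct moveparams_model *mv,
                    uint32_t real_mode, uint32_t end_addr)
{
    uint32_t low_end  = (real_mode < LOW_BUFFER_MAX ? real_mode
                                                    : LOW_BUFFER_MAX) & ~0xfffu;
    uint32_t low_size = low_end - LOW_BUFFER_START;
    uint32_t heap_end = end_addr + HEAP_SIZE;

    mv->low_buffer_start = LOW_BUFFER_START;
    mv->lcount = 0;
    if (0x100000u + low_size > heap_end) {
        mv->high_buffer_start = 0x100000u + low_size;
        mv->hcount = 0;     /* high part lands directly in its final place */
    } else {
        mv->high_buffer_start = heap_end;
        mv->hcount = -1;    /* high part must be moved later */
    }
    return low_size;
}

/* Models close_output_buffer_if_we_run_high(). */
void close_high(struct moveparams_model *mv,
                uint32_t low_size, uint32_t bytes_out)
{
    if (bytes_out > low_size) {
        mv->lcount = (int32_t)low_size;
        if (mv->hcount)
            mv->hcount = (int32_t)(bytes_out - low_size);
    } else {
        mv->lcount = (int32_t)bytes_out;
        mv->hcount = 0;
    }
}
```

For instance, with real_mode = 0x90000 the low buffer holds 0x8E000 bytes. If &end + HEAP_SIZE is below 0x18E000, the high buffer starts exactly at 0x100000 + 0x8E000 and hcount stays 0; otherwise the high part is decompressed above the heap and hcount records how many bytes must be appended later.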

decompress_kernel() has an "asmlinkage" modifier. In linux/include/linux/linkage.h:
#ifdef __cplusplus
#define CPP_ASMLINKAGE extern "C"
#else
#define CPP_ASMLINKAGE
#endif

#if defined __i386__
#define asmlinkage CPP_ASMLINKAGE __attribute__((regparm(0)))
#elif defined __ia64__
#define asmlinkage CPP_ASMLINKAGE __attribute__((syscall_linkage))
#else
#define asmlinkage CPP_ASMLINKAGE
#endif
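The macro "asmlinkage" forces the compiler to pass all function arguments on the stack, in case some optimization method tries to change this convention. Check Using the GNU Compiler Collection (GCC): Declaring Attributes of Functions (regparm) and Kernelnewbies FAQ: What is asmlinkage for more details.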


5.2. gunzip()

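decompress_kernel() calls gunzip() -> inflate(), which are defined in linux/lib/inflate.c, to decompress the resident kernel image to the low buffer (pointed to by output_data) and the high buffer (pointed to by high_buffer_start, for bzImage only).

The gzip file format is specified in RFC 1952.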

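Table 6. gzip file format

Component          Meaning             Byte  Comment
ID1                IDentification 1    1     31 (0x1f, \037)
ID2                IDentification 2    1     139 (0x8b, \213) [a]
CM                 Compression Method  1     8 - denotes the "deflate" compression method
FLG                FLaGs               1     0 for most cases
MTIME              Modification TIME   4     modification time of the original file
XFL                eXtra FLags         1     2 - compressor used maximum compression, slowest algorithm [b]
OS                 Operating System    1     3 - Unix
extra fields       -                   -     variable length, fields indicated by FLG [c]
compressed blocks  -                   -     variable length
CRC32              -                   4     CRC value of the uncompressed data
ISIZE              Input SIZE          4     size of the uncompressed input data modulo 2^32
Notes:
a. The ID2 value can be 158 (0x9e, \236) for gzip 0.5;
b. XFL value 4 - compressor used fastest algorithm;
c. FLG bit 0, FTEXT, does not indicate any "extra field".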

We can use this file format knowledge to find the beginning of the gzipped linux/vmlinux.
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | grep '1f 8b 08 00'
00004c50  1f 8b 08 00 01 f6 e1 3f  02 03 ec 5d 7d 74 14 55  |.......?...]}t.U|
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 -s 0x4c40 -n 64
00004c40  00 80 0b 00 00 fc 21 00  68 00 00 00 1e 01 11 00  |......!.h.......|
00004c50  1f 8b 08 00 01 f6 e1 3f  02 03 ec 5d 7d 74 14 55  |.......?...]}t.U|
00004c60  96 7f d5 a9 d0 1d 4d ac  56 93 35 ac 01 3a 9c 6a  |......M.V.5..:.j|
00004c70  4d 46 5c d3 7b f8 48 36  c9 6c 84 f0 25 88 20 9f  |MF\.{.H6.l..%. .|
00004c80
[root@localhost boot]# hexdump -C /boot/vmlinuz-2.4.20-28.9 | tail -n 4
00114d40  bd 77 66 da ce 6f 3d d6  33 5c 14 a2 9f 7e fa e9  |.wf..o=.3\...~..|
00114d50  a7 9f 7e fa ff 57 3f 00  00 00 00 00 d8 bc ab ea  |..~..W?.........|
00114d60  44 5d 76 d1 fd 03 33 58  c2 f0 00 51 27 00        |D]v...3X...Q'.|
00114d6e
As we can see in the above example, the gzipped file begins at 0x4c50. The four bytes before "1f 8b 08 00" are input_len (0x0011011e, in little endian), and 0x4c50+0x0011011e=0x114d6e equals the size of bzImage (/boot/vmlinuz-2.4.20-28.9).
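
The same search can be done programmatically on a buffer holding the image. The following sketch looks for the ID1, ID2 and CM bytes and reads the little-endian length stored just before them; the function names are hypothetical, and a real tool would also have to handle a false match inside the setup code.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset of the first "1f 8b 08" (ID1, ID2, CM=deflate) in buf, or -1. */
long find_gzip_start(const uint8_t *buf, size_t len)
{
    for (size_t i = 0; i + 3 <= len; i++)
        if (buf[i] == 0x1f && buf[i + 1] == 0x8b && buf[i + 2] == 0x08)
            return (long)i;
    return -1;
}

/* Little-endian 32-bit value stored in the 4 bytes just before offset off.
 * The caller must make sure that off >= 4. */
uint32_t le32_before(const uint8_t *buf, size_t off)
{
    const uint8_t *p = buf + off - 4;
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}
```

Applied to the bytes shown at 0x4c40 in the hexdump above, find_gzip_start() returns 16 (file offset 0x4c50) and le32_before() returns 0x0011011e.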

static uch *inbuf;           /* input buffer */
static unsigned insize = 0;  /* valid bytes in inbuf */
static unsigned inptr = 0;   /* index of next byte to be processed in inbuf */
///////////////////////////////////////////////////////////////////////////////
static int gunzip(void)
{
        Check input buffer for {ID1, ID2, CM}, must be
                {0x1f, 0x8b, 0x08} (normal case), or
                {0x1f, 0x9e, 0x08} (for gzip 0.5);
        Check FLG (flag byte), must not set bit 1, 5, 6 and 7;
        Ignore {MTIME, XFL, OS};
        Handle optional structures, which correspond to FLG bit 2, 3 and 4;
        inflate();              // handle compressed blocks
        Validate {CRC32, ISIZE};
}
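When get_byte(), defined in linux/arch/i386/boot/compressed/misc.c, is called for the first time, it calls fill_inbuf() to set up the input buffer: inbuf=input_data and insize=input_len. Symbols input_data and input_len are defined in the piggy.o linker script. See Section 2.5.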


5.3. inflate()

// some important definitions in misc.c
#define WSIZE 0x8000            /* Window size must be at least 32k,
                                 * and a power of two */
static uch window[WSIZE];       /* Sliding window buffer */
static unsigned outcnt = 0;     /* bytes in output buffer */

// linux/lib/inflate.c
#define wp outcnt
#define flush_output(w) (wp=(w),flush_window())
STATIC unsigned long bb;        /* bit buffer */
STATIC unsigned bk;             /* bits in bit buffer */
STATIC unsigned hufts;          /* track memory usage */
static long free_mem_ptr = (long)&end;
///////////////////////////////////////////////////////////////////////////////
STATIC int inflate()
{
        int e;                  /* last block flag */
        int r;                  /* result code */
        unsigned h;             /* maximum struct huft's malloc'ed */
        void *ptr;

        wp = bb = bk = 0;
        h = 0;                  // must start at 0 for the "hufts > h" test below

        // inflate compressed blocks one by one
        do {
                hufts = 0;
                gzip_mark() { ptr = free_mem_ptr; };
                if ((r = inflate_block(&e)) != 0) {
                        gzip_release() { free_mem_ptr = ptr; };
                        return r;
                }
                gzip_release() { free_mem_ptr = ptr; };
                if (hufts > h)
                        h = hufts;
        } while (!e);

        /* Undo too much lookahead. The next read will be byte aligned so we
         * can discard unused bits in the last meaningful byte. */
        while (bk >= 8) {
                bk -= 8;
                inptr--;
        }

        /* write the output window window[0..outcnt-1] to output_data,
         * update output_ptr/output_data, crc and bytes_out accordingly, and
         * reset outcnt to 0. */
        flush_output(wp);

        /* return success */
        return 0;
}
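free_mem_ptr is used in misc.c:malloc() for dynamic memory allocation. Before inflating each compressed block, gzip_mark() saves the value of free_mem_ptr; after inflation, gzip_release() restores this value. That is how the memory allocated in inflate_block() gets "free()"d.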

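Gzip uses Lempel-Ziv coding (LZ77) to compress files. The compressed data format is specified in RFC 1951. inflate_block() inflates the compressed blocks, which can be considered as a bit sequence.

The data structure of each compressed block is outlined below: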
BFINAL (1 bit)
    0  - not the last block
    1  - the last block
BTYPE  (2 bits)
    00 - no compression
        remaining bits until the byte boundary;
        LEN      (2 bytes);
        NLEN     (2 bytes, the one's complement of LEN);
        data     (LEN bytes);
    01 - compressed with fixed Huffman codes
        {
        literal  (7-9 bits, represent code 0..287, excluding 256);
                     // See RFC 1951, table in Paragraph 3.2.6.
        length   (0-5 bits if literal > 256, represent length 3..258);
                     // See RFC 1951, 1st alphabet table in Paragraph 3.2.5.
        data     (of literal bytes if literal < 256);
        distance (5 plus 0-13 extra bits if literal == 257..285, represent
                         distance 1..32768);
                     /* See RFC 1951, 2nd alphabet table in Paragraph 3.2.5,
                      *   but statement in Paragraph 3.2.6. */
                     /* Move backward "distance" bytes in the output stream,
                      * and copy "length" bytes */
        }*           // can be of multiple instances
        literal  (7 bits, all 0, literal == 256, means end of block);
    10 - compressed with dynamic Huffman codes
        HLIT     (5 bits, # of Literal/Length codes - 257, 257-286);
        HDIST    (5 bits, # of Distance codes - 1,         1-32);
        HCLEN    (4 bits, # of Code Length codes - 4,      4 - 19);
        Code Length sequence    ((HCLEN+4)*3 bits)
        /* The following two alphabet tables will be decoded using
         *   the Huffman decoding table which is generated from
         *   the preceding Code Length sequence. */
        Literal/Length alphabet (HLIT+257 codes)
        Distance alphabet       (HDIST+1 codes)
        // Decoding tables will be built from these alphabet tables.
        /* The following is similar to that of fixed Huffman codes portion,
         *   except that they use different decoding tables. */
        {
        literal/length
                 (variable length, depending on Literal/Length alphabet);
        data     (of literal bytes if literal < 256);
        distance (variable length if literal == 257..285, depending on
                         Distance alphabet);
        }*           // can be of multiple instances
        literal  (literal value 256, which means end of block);
    11 - reserved (error)
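Note that data elements are packed into bytes starting from the Least-Significant Bit (LSB) to the Most-Significant Bit (MSB), while Huffman codes are packed starting with the MSB. Also note that literal values 286-287 and distance codes 30-31 will never actually occur.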

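With the above data structure and RFC 1951 in mind, inflate_block() is not hard to understand. Refer to the related paragraphs in RFC 1951 for Huffman coding and alphabet table generation.

For more details, refer to linux/lib/inflate.c, the gzip source code (which has many inline comments) and related reference materials.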


6. linux/arch/i386/kernel/head.S

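Resident kernel image linux/vmlinux is in place finally! It requires two inputs:

  • ESI points to the parameter area set up by the 16-bit real-mode code, which will be copied to empty_zero_page later. ESI is valid only for BSP.

  • BX indicates which CPU is running: 0 means BSP, other values mean AP (see the BX tests in the code below).

BSP (BootStrap Processor) and AP (Application Processor) are Intel terms. Check the IA-32 Manual (Vol.3. Ch.7.5. Multiple-Processor (MP) Initialization) and the MultiProcessor Specification for MP initialization issues.

From a software point of view, in a multiprocessor system BSP and APs share the physical memory but use their own register sets. BSP runs the kernel code first, sets up the OS execution environment and triggers APs to run over it too. An AP sleeps until BSP kicks it.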


6.1. Enable Paging

.text
///////////////////////////////////////////////////////////////////////////////
startup_32()
{
        /* set segments to known values */
        cld;
        DS = ES = FS = GS = __KERNEL_DS;

#ifdef CONFIG_SMP
#define cr4_bits mmu_cr4_features-__PAGE_OFFSET
        /* long mmu_cr4_features defined in linux/arch/i386/kernel/setup.c
         * __PAGE_OFFSET = 0xC0000000, i.e. 3G */

        // AP with CR4 support (> Intel 486) will copy CR4 from BSP
        if (BX && cr4_bits) {
                // turn on paging options (PSE, PAE, ...)
                CR4 |= cr4_bits;
        } else
#endif
        {
                /* only BSP initializes page tables (pg0..empty_zero_page-1)
                 *   pg0 at .org 0x2000
                 *   empty_zero_page at .org 0x4000
                 *   total (0x4000-0x2000)/4 = 0x0800 entries */
                pg0 = {
                        0x00000007,             // 7 = PRESENT + RW + USER
                        0x00001007,             // 0x1000 = 4096 = 4K
                        0x00002007,
                        ...
                pg1:    0x00400007,
                        ...
                        0x007FF007              // total 8M
                empty_zero_page:
                };
        }
Why do we have to add "-__PAGE_OFFSET" when referring to a kernel symbol, for example pg0?

In linux/arch/i386/vmlinux.lds, we have:
  . = 0xC0000000 + 0x100000;
  _text = .;                    /* Text and read-only data */
  .text : {
        *(.text)
...
As pg0 is at offset 0x2000 of section .text in linux/arch/i386/kernel/head.o, which is the first file to be linked for linux/vmlinux, it will be at offset 0x2000 in the output section .text. Thus it will be located at address 0xC0000000+0x100000+0x2000 after linking.
[root@localhost boot]# nm --defined /boot/vmlinux-2.4.20-28.9 | grep 'startup_32\|mmu_cr4_features\|pg0\|\<empty_zero_page\>' | sort
c0100000 t startup_32
c0102000 T pg0
c0104000 T empty_zero_page
c0376404 B mmu_cr4_features
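In protected mode without paging enabled, a linear address is mapped directly to the same physical address. "movl $pg0-__PAGE_OFFSET,%edi" sets EDI=0x102000, which is equal to the physical address of pg0 (as linux/vmlinux is relocated to 0x100000). Without this "-__PAGE_OFFSET" scheme, the code would access physical address 0xC0102000, which would be wrong and probably beyond the RAM space.

mmu_cr4_features is in the .bss section and is located at physical address 0x376404 in the above example.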

After the page tables are initialized, paging can be enabled.
        // set page directory base pointer, physical address
        CR3 = swapper_pg_dir - __PAGE_OFFSET;
        // paging enabled!
        CR0 |= 0x80000000;      // set PG bit
        goto 1f;                // flush prefetch-queue
1:
        EAX = &1f;              // address following the next instruction
        goto *(EAX);            // relocate EIP
1:
        SS:ESP = *stack_start;
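Page directory swapper_pg_dir (see definition in Section 6.5), together with page tables pg0 and pg1, defines that both linear addresses 0..8M-1 and 3G..3G+8M-1 are mapped to physical addresses 0..8M-1. From now on we can access kernel symbols without "-__PAGE_OFFSET", because kernel space (which resides at linear addresses >=3G) is correctly mapped to its physical addresses once paging is enabled.

"lss stack_start,%esp" (SS:ESP = *stack_start) is the first example of referencing a symbol without "-__PAGE_OFFSET", and it sets up a new stack. For BSP, the stack is at the end of init_task_union. For AP, stack_start.esp has been redefined by linux/arch/i386/kernel/smpboot.c:do_boot_cpu() to be "(void *) (1024 + PAGE_SIZE + (char *)idle)", as shown in Section 8.2.

For the paging mechanism and data structures, refer to the IA-32 Manual Vol.3. (Ch.3.7. Page Translation Using 32-Bit Physical Addressing, Ch.9.8.3. Initializing Paging, Ch.9.9.1. Switching to Protected Mode, and Ch.18.26.3. Enabling and Disabling Paging).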


6.2. Get Kernel Parameters

#define OLD_CL_MAGIC_ADDR       0x90020
#define OLD_CL_MAGIC            0xA33F
#define OLD_CL_BASE_ADDR        0x90000
#define OLD_CL_OFFSET           0x90022
#define NEW_CL_POINTER          0x228   /* Relative to real mode data */

#ifdef CONFIG_SMP
        if (BX) {
                EFLAGS = 0;             // AP clears EFLAGS
        } else
#endif
        {
                // Initial CPU cleans BSS
                clear BSS;              // i.e. __bss_start .. _end
                setup_idt() {
                        /* idt_table[256] defined in arch/i386/kernel/traps.c
                         *   located in section .data.idt */
                        EDX = ignore_int;
                        EAX = (__KERNEL_CS << 16) + DX; // selector : offset 15..0
                        DX = 0x8E00;    // interrupt gate, dpl = 0, present
                        idt_table[0..255] = {EAX, EDX};
                }
                EFLAGS = 0;
                /*
                 * Copy bootup parameters out of the way. First 2kB of
                 * _empty_zero_page is for boot parameters, second 2kB
                 * is for the command line.
                 */
                move *ESI (real-mode header) to empty_zero_page, 2KB;
                clear empty_zero_page+2K, 2KB;
                ESI = empty_zero_page[NEW_CL_POINTER];
                if (!ESI) {             // 32-bit command line pointer
                        if (OLD_CL_MAGIC==(uint16)[OLD_CL_MAGIC_ADDR]) {
                                ESI = [OLD_CL_BASE_ADDR]
                                      + (uint16)[OLD_CL_OFFSET];
                                move *ESI to empty_zero_page+2K, 2KB;
                        }
                } else {                // valid in 2.02+
                        move *ESI to empty_zero_page+2K, 2KB;
                }
        }
}
For BSP, kernel parameters are copied from the memory pointed to by ESI to empty_zero_page. The kernel command line is copied to empty_zero_page+2K if applicable.


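6.3. Check CPU Type

Refer to the IA-32 Manual Vol.1. (Ch.13. Processor Identification and Feature Determination) on how to identify the processor type and processor features.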

struct cpuinfo_x86;     // see include/asm-i386/processor.h
struct cpuinfo_x86 boot_cpu_data;       // see arch/i386/kernel/setup.c

#define CPU_PARAMS      SYMBOL_NAME(boot_cpu_data)
#define X86             CPU_PARAMS+0
#define X86_VENDOR      CPU_PARAMS+1
#define X86_MODEL       CPU_PARAMS+2
#define X86_MASK        CPU_PARAMS+3
#define X86_HARD_MATH   CPU_PARAMS+6
#define X86_CPUID       CPU_PARAMS+8
#define X86_CAPABILITY  CPU_PARAMS+12
#define X86_VENDOR_ID   CPU_PARAMS+28

checkCPUtype:
{
        X86_CPUID = -1;                 // no CPUID

        X86 = 3;                        // at least 386
        save original EFLAGS to ECX;
        flip AC bit (0x40000) in EFLAGS;
        if (AC bit not changed) goto is386;

        X86 = 4;                        // at least 486
        flip ID bit (0x200000) in EFLAGS;
        restore original EFLAGS;        // for AC & ID flags
        if (ID bit can not be changed) goto is486;

        // get CPU info
        CPUID(EAX=0);
        X86_CPUID = EAX;
        X86_VENDOR_ID = {EBX, EDX, ECX};
        if (!EAX) goto is486;

        CPUID(EAX=1);
        CL = AL;
        X86 = AH & 0x0f;                // family
        X86_MODEL = (AL & 0xf0) >> 4;   // model
        X86_MASK = CL & 0x0f;           // stepping id
        X86_CAPABILITY = EDX;           // feature

Refer to the IA-32 Manual Vol.3. (Ch.9.2. x87 FPU Initialization, and Ch.18.14. x87 FPU) on how to set up the x87 FPU.

is486:
        // save PG, PE, ET and set AM, WP, NE, MP
        EAX = (CR0 & 0x80000011) | 0x50022;
        goto 2f;                        // skip "is386:" processing
is386:
        restore original EFLAGS from ECX;
        // save PG, PE, ET and set MP
        EAX = (CR0 & 0x80000011) | 0x02;

        /* ET: Extension Type (bit 4 of CR0).
         * In the Intel 386 and Intel 486 processors, this flag indicates
         * support of Intel 387 DX math coprocessor instructions when set.
         * In the Pentium 4, Intel Xeon, and P6 family processors,
         * this flag is hardcoded to 1.
         *     -- IA-32 Manual Vol.3. Ch.2.5. Control Registers (p.2-14) */

2:      CR0 = EAX;
        check_x87() {
                /* We depend on ET to be correct.
                 * This checks for 287/387. */
                X86_HARD_MATH = 0;
                clts;                   // CR0.TS = 0;
                fninit;                 // Init FPU;
                fstsw AX;               // AX = FPU status word;
                if (AL) {
                        CR0 ^= 0x04;    // no coprocessor, set EM
                } else {
                        ALIGN
1:                      X86_HARD_MATH = 1;
                        /* IA-32 Manual Vol.3. Ch.18.14.7.14. FSETPM Instruction
                         * inform 287 that processor is in protected mode
                         * 287 only, ignored by 387 */
                        fsetpm;
                }
        }
}
Macro ALIGN, defined in linux/include/linux/linkage.h, specifies 16-byte alignment and fill value 0x90 (the opcode for NOP). See also Using as: Assembler Directives for the meaning of directive .align.


6.4. Go Start Kernel

        ready:  .byte 0;        // global variable
{
        ready++;                // how many CPUs are ready
        lgdt gdt_descr;         // use new descriptor table in safe place
        lidt idt_descr;
        goto __KERNEL_CS:$1f;   // reload segment registers after "lgdt"
1:      DS = ES = FS = GS = __KERNEL_DS;
#ifdef CONFIG_SMP
        SS = __KERNEL_DS;       // reload segment only
#else
        SS:ESP = *stack_start;  /* end of init_task_union, defined
                                 *   in linux/arch/i386/kernel/init_task.c */
#endif
        EAX = 0;
        lldt AX;
        cld;

#ifdef CONFIG_SMP
        if (1!=ready) {         // not first CPU
                initialize_secondary();
                // see linux/arch/i386/kernel/smpboot.c
        } else
#endif
        {
                start_kernel(); // see linux/init/main.c
        }
L6:     goto L6;
}
The first CPU (BSP) calls linux/init/main.c:start_kernel() and the other CPUs (APs) call linux/arch/i386/kernel/smpboot.c:initialize_secondary(). See start_kernel() in Section 7 and initialize_secondary() in Section 8.4.

init_task_union happens to be the task struct for the first process, the "idle" process (pid=0), whose stack grows from the tail of init_task_union. The following is the code related to init_task_union:
ENTRY(stack_start)
        .long init_task_union+8192;
        .long __KERNEL_DS;

#ifndef INIT_TASK_SIZE
# define INIT_TASK_SIZE 2048*sizeof(long)
#endif

union task_union {
        struct task_struct task;
        unsigned long stack[INIT_TASK_SIZE/sizeof(long)];
};

/* INIT_TASK is used to set up the first task table, touch at
 * your own risk! Base=0, limit=0x1fffff (=2MB) */
union task_union init_task_union
        __attribute__((__section__(".data.init_task"))) =
                { INIT_TASK(init_task_union.task) };

init_task_union is for the BSP "idle" process. Do not confuse it with the "init" process, which will be mentioned in Section 7.2.


6.5. Miscellaneous

///////////////////////////////////////////////////////////////////////////////
// default interrupt "handler"
ignore_int() { printk("Unknown interrupt\n"); iret; }

/*
 * The interrupt descriptor table has room for 256 idt's,
 * the global descriptor table is dependent on the number
 * of tasks we can have..
 */
#define IDT_ENTRIES     256
#define GDT_ENTRIES     (__TSS(NR_CPUS))

.globl SYMBOL_NAME(idt)
.globl SYMBOL_NAME(gdt)

        ALIGN
        .word 0
idt_descr:
        .word IDT_ENTRIES*8-1           # idt contains 256 entries
SYMBOL_NAME(idt):
        .long SYMBOL_NAME(idt_table)

        .word 0
gdt_descr:
        .word GDT_ENTRIES*8-1
SYMBOL_NAME(gdt):
        .long SYMBOL_NAME(gdt_table)

/*
 * This is initialized to create an identity-mapping at 0-8M (for bootup
 * purposes) and another mapping of the 0-8M area at virtual address
 * PAGE_OFFSET.
 */
.org 0x1000
ENTRY(swapper_pg_dir)   // "ENTRY" defined in linux/include/linux/linkage.h
        .long 0x00102007
        .long 0x00103007
        .fill BOOT_USER_PGD_PTRS-2,4,0
        /* default: 766 entries */
        .long 0x00102007
        .long 0x00103007
        /* default: 254 entries */
        .fill BOOT_KERNEL_PGD_PTRS-2,4,0

/*
 * The page tables are initialized to only 8MB here - the final page
 * tables are set up later depending on memory size.
 */
.org 0x2000
ENTRY(pg0)

.org 0x3000
ENTRY(pg1)

/*
 * empty_zero_page must immediately follow the page tables ! (The
 * initialization loop counts until empty_zero_page)
 */
.org 0x4000
ENTRY(empty_zero_page)

/*
 * Real beginning of normal "text" segment
 */
.org 0x5000
ENTRY(stext)
ENTRY(_stext)

///////////////////////////////////////////////////////////////////////////////
/*
 * This starts the data section. Note that the above is all
 * in the text section because it has alignment requirements
 * that we cannot fulfill any other way.
 */
.data

ALIGN
/*
 * This contains typically 140 quadwords, depending on NR_CPUS.
 *
 * NOTE! Make sure the gdt descriptor in head.S matches this if you
 * change anything.
 */
ENTRY(gdt_table)
        .quad 0x0000000000000000        /* NULL descriptor */
        .quad 0x0000000000000000        /* not used */
        .quad 0x00cf9a000000ffff        /* 0x10 kernel 4GB code at 0x00000000 */
        .quad 0x00cf92000000ffff        /* 0x18 kernel 4GB data at 0x00000000 */
        .quad 0x00cffa000000ffff        /* 0x23 user   4GB code at 0x00000000 */
        .quad 0x00cff2000000ffff        /* 0x2b user   4GB data at 0x00000000 */
        .quad 0x0000000000000000        /* not used */
        .quad 0x0000000000000000        /* not used */
        /*
         * The APM segments have byte granularity and their bases
         * and limits are set at run time.
         */
        .quad 0x0040920000000000        /* 0x40 APM set up for bad BIOS's */
        .quad 0x00409a0000000000        /* 0x48 APM CS    code */
        .quad 0x00009a0000000000        /* 0x50 APM CS 16 code (16 bit) */
        .quad 0x0040920000000000        /* 0x58 APM DS    data */
        .fill NR_CPUS*4,8,0             /* space for TSS's and LDT's */
Macro ALIGN, placed before idt_descr and gdt_table, is there for performance reasons.


7. linux/init/main.c

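I felt guilty writing this chapter, as there are already plenty of documents about it, perhaps more than enough. The functions supporting start_kernel() change from version to version, as they depend on OS component internals, which are being improved all the time. I may not have the time for frequent document updates, so I decided to keep this chapter as simple as possible.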


7.1. start_kernel()

///////////////////////////////////////////////////////////////////////////////
asmlinkage void __init start_kernel(void)
{
        char * command_line;
        extern char saved_command_line[];
/*
 * Interrupts are still disabled. Do necessary setups, then enable them
 */
        lock_kernel();
        printk(linux_banner);

        /* Memory Management in Linux, esp. for setup_arch()
         * Linux-2.4.4 MM Initialization */
        setup_arch(&command_line);
        printk("Kernel command line: %s\n", saved_command_line);

        /* linux/Documentation/kernel-parameters.txt
         * The Linux BootPrompt-HowTo */
        parse_options(command_line);

        trap_init() {
#ifdef CONFIG_EISA
                if (isa_readl(0x0FFFD9) == 'E'+('I'<<8)+('S'<<16)+('A'<<24))
                        EISA_bus = 1;
#endif
#ifdef CONFIG_X86_LOCAL_APIC
                init_apic_mappings();
#endif
                set_xxxx_gate(x, &func);    // setup gates
                cpu_init();
        }
        init_IRQ();
        sched_init();
        softirq_init() {
                for (int i=0; i<32; i++)
                        tasklet_init(bh_task_vec+i, bh_action, i);
                open_softirq(TASKLET_SOFTIRQ, tasklet_action, NULL);
                open_softirq(HI_SOFTIRQ, tasklet_hi_action, NULL);
        }
        time_init();

        /*
         * HACK ALERT! This is early. We're enabling the console before
         * we've done PCI setups etc, and console_init() must be aware of
         * this. But we do want output early, in case something goes wrong.
         */
        console_init();
#ifdef CONFIG_MODULES
        init_modules();
#endif
        if (prof_shift) {
                unsigned int size;
                /* only text is profiled */
                prof_len = (unsigned long) &_etext - (unsigned long) &_stext;
                prof_len >>= prof_shift;
                size = prof_len * sizeof(unsigned int) + PAGE_SIZE-1;
                prof_buffer = (unsigned int *) alloc_bootmem(size);
        }

        kmem_cache_init();
        sti();

        // BogoMips mini-Howto
        calibrate_delay();

        // linux/Documentation/initrd.txt
#ifdef CONFIG_BLK_DEV_INITRD
        if (initrd_start && !initrd_below_start_ok &&
                        initrd_start < min_low_pfn << PAGE_SHIFT) {
                printk(KERN_CRIT "initrd overwritten (0x%08lx < 0x%08lx) - "
                    "disabling it.\n",initrd_start,min_low_pfn << PAGE_SHIFT);
                initrd_start = 0;
        }
#endif

        mem_init();
        kmem_cache_sizes_init();
        pgtable_cache_init();

        /*
         * For architectures that have highmem, num_mappedpages represents
         * the amount of memory the kernel can use.  For other architectures
         * it's the same as the total pages.  We need both numbers because
         * some subsystems need to initialize based on how much memory the
         * kernel can use.
         */
        if (num_mappedpages == 0)
                num_mappedpages = num_physpages;

        fork_init(num_mappedpages);
        proc_caches_init();
        vfs_caches_init(num_physpages);
        buffer_init(num_physpages);
        page_cache_init(num_physpages);
#if defined(CONFIG_ARCH_S390)
        ccwcache_init();
#endif
        signals_init();
#ifdef CONFIG_PROC_FS
        proc_root_init();
#endif
#if defined(CONFIG_SYSVIPC)
        ipc_init();
#endif
        check_bugs();
        printk("POSIX conformance testing by UNIFIX\n");

        /*
         *      We count on the initial thread going ok
         *      Like idlers init is an unlocked kernel thread, which will
         *      make syscalls (and thus be locked).
         */
        smp_init() {
#ifndef CONFIG_SMP
#     ifdef CONFIG_X86_LOCAL_APIC
                APIC_init_uniprocessor();
#     else
                do { } while (0);
#     endif
#else
                /* Check Section 8.2. */
#endif
        }

        rest_init() {
                // init process, pid = 1
                kernel_thread(init, NULL, CLONE_FS | CLONE_FILES | CLONE_SIGNAL);
                unlock_kernel();
                current->need_resched = 1;
                // idle process, pid = 0
                cpu_idle();     // never return
        }
}
start_kernel() calls rest_init() to spawn an "init" process and then becomes the "idle" process itself.


7.2. init()

"Init" process:
///////////////////////////////////////////////////////////////////////////////
static int init(void * unused)
{
        lock_kernel();
        do_basic_setup();

        prepare_namespace();

        /*
         * Ok, we have completed the initial bootup, and
         * we're essentially up and running. Get rid of the
         * initmem segments and start the user-mode stuff..
         */
        free_initmem();
        unlock_kernel();

        if (open("/dev/console", O_RDWR, 0) < 0)        // stdin
                printk("Warning: unable to open an initial console.\n");

        (void) dup(0);                                  // stdout
        (void) dup(0);                                  // stderr

        /*
         * We try each of these until one succeeds.
         *
         * The Bourne shell can be used instead of init if we are
         * trying to recover a really broken machine.
         */

        if (execute_command)
                execve(execute_command,argv_init,envp_init);
        execve("/sbin/init",argv_init,envp_init);
        execve("/etc/init",argv_init,envp_init);
        execve("/bin/init",argv_init,envp_init);
        execve("/bin/sh",argv_init,envp_init);
        panic("No init found.  Try passing init= option to kernel.");
}
Refer to "man init" or SysVinit for further information about the user-mode "init" process.


7.3. cpu_idle()

"Idle" process:
/*
 * The idle thread. There's no useful work to be
 * done, so just try to conserve power and have a
 * low exit latency (ie sit in a loop waiting for
 * somebody to say that they'd like to reschedule)
 */
void cpu_idle (void)
{
        /* endless idle loop with no priority at all */
        init_idle();
        current->nice = 20;
        current->counter = -100;

        while (1) {
                void (*idle)(void) = pm_idle;
                if (!idle)
                        idle = default_idle;
                while (!current->need_resched)
                        idle();
                schedule();
                check_pgt_cache();
        }
}

///////////////////////////////////////////////////////////////////////////////
void __init init_idle(void)
{
        struct schedule_data * sched_data;
        sched_data = &aligned_data[smp_processor_id()].schedule_data;

        if (current != &init_task && task_on_runqueue(current)) {
                printk("UGH! (%d:%d) was on the runqueue, removing.\n",
                        smp_processor_id(), current->pid);
                del_from_runqueue(current);
        }
        sched_data->curr = current;
        sched_data->last_schedule = get_cycles();
        clear_bit(current->processor, &wait_init_idle);
}

///////////////////////////////////////////////////////////////////////////////
void default_idle(void)
{
        if (current_cpu_data.hlt_works_ok && !hlt_counter) {
                __cli();
                if (!current->need_resched)
                        safe_halt();
                else
                        __sti();
        }
}

/* defined in linux/include/asm-i386/system.h */
#define __cli()                 __asm__ __volatile__("cli": : :"memory")
#define __sti()                 __asm__ __volatile__("sti": : :"memory")

/* used in the idle loop; sti takes one instruction cycle to complete */
#define safe_halt()             __asm__ __volatile__("sti; hlt": : :"memory")
On return from an interrupt handler, the CPU resumes execution with the instruction following "hlt".


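8. SMP Boot

There are a few SMP related macros, like CONFIG_SMP, CONFIG_X86_LOCAL_APIC, CONFIG_X86_IO_APIC, CONFIG_MULTIQUAD and CONFIG_VISWS. I will ignore code that requires CONFIG_MULTIQUAD or CONFIG_VISWS, which most people do not care about (unless they use an IBM high-end multiprocessor server or an SGI Visual Workstation).

BSP executes start_kernel() -> smp_init() -> smp_boot_cpus() -> do_boot_cpu() -> wakeup_secondary_via_INIT() to trigger APs. Check the MultiProcessor Specification and the IA-32 Manual Vol.3 (Ch.7. Multiple-Processor Management, and Ch.8. Advanced Programmable Interrupt Controller) for technical details.


8.1. Before smp_init()

Before calling smp_init(), start_kernel() has already done some work to set up the SMP environment: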
start_kernel()
|-- setup_arch()
|   |-- parse_cmdline_early();  // SMP looks for "noht" and "acpismp=force"
|   |   `-- /* "noht" disables HyperThreading (2 logical cpus per Xeon) */
|   |       if (!memcmp(from, "noht", 4)) {
|   |           disable_x86_ht = 1;
|   |           set_bit(X86_FEATURE_HT, disabled_x86_caps);
|   |       }
|   |       /* "acpismp=force" forces parsing and use of the ACPI SMP table */
|   |       else if (!memcmp(from, "acpismp=force", 13))
|   |           enable_acpi_smp_table = 1;
|   |-- setup_memory();         // reserve memory for MP configuration table
|   |   |-- reserve_bootmem(PAGE_SIZE, PAGE_SIZE);
|   |   `-- find_smp_config();
|   |       `-- find_intel_smp();
|   |           `-- smp_scan_config();
|   |               |-- set flag smp_found_config
|   |               |-- set MP floating pointer mpf_found
|   |               `-- reserve_bootmem(mpf_found, PAGE_SIZE);
|   |-- if (disable_x86_ht) {   // if HyperThreading feature disabled
|   |       clear_bit(X86_FEATURE_HT, &boot_cpu_data.x86_capability[0]);
|   |       set_bit(X86_FEATURE_HT, disabled_x86_caps);
|   |       enable_acpi_smp_table = 0;
|   |   }
|   |-- if (test_bit(X86_FEATURE_HT, &boot_cpu_data.x86_capability[0]))
|   |       enable_acpi_smp_table = 1;
|   |-- smp_alloc_memory();
|   |   `-- /* reserve AP processor's real-mode code space in low memory */
|   |       trampoline_base = (void *) alloc_bootmem_low_pages(PAGE_SIZE);
|   `-- get_smp_config();     /* get boot-time MP configuration */
|       |-- config_acpi_tables();
|       |   |-- memset(&acpi_boot_ops, 0, sizeof(acpi_boot_ops));
|       |   |-- acpi_boot_ops[ACPI_APIC] = acpi_parse_madt;
|       |   `-- /* Set have_acpi_tables to indicate using
|       |        * MADT in the ACPI tables; Use MPS tables if failed. */
|       |       if (enable_acpi_smp_table && !acpi_tables_init())
|       |           have_acpi_tables = 1;
|       |-- set pic_mode
|       |   /* =1, if the IMCR is present and PIC Mode is implemented;
|       |    * =0, otherwise Virtual Wire Mode is implemented. */
|       |-- save local APIC address in mp_lapic_addr
|       `-- scan for MP configuration table entries, like
|             MP_PROCESSOR, MP_BUS, MP_IOAPIC, MP_INTSRC and MP_LINTSRC.
|-- trap_init();
|   `-- init_apic_mappings();   // setup PTE for APIC
|       |-- /* If no local APIC can be found then set up a fake all
|       |    * zeroes page to simulate the local APIC and another
|       |    * one for the IO-APIC. */
|       |   if (!smp_found_config && detect_init_APIC()) {
|       |       apic_phys = (unsigned long) alloc_bootmem_pages(PAGE_SIZE);
|       |       apic_phys = __pa(apic_phys);
|       |   } else
|       |       apic_phys = mp_lapic_addr;
|       |-- /* map local APIC address,
|       |    *   mp_lapic_addr (0xfee00000) in most case,
|       |    *   to linear address FIXADDR_TOP (0xffffe000) */
|       |   set_fixmap_nocache(FIX_APIC_BASE, apic_phys);
|       |-- /* Fetch the APIC ID of the BSP in case we have a
|       |    * default configuration (or the MP table is broken). */
|       |   if (boot_cpu_physical_apicid == -1U)
|       |       boot_cpu_physical_apicid = GET_APIC_ID(apic_read(APIC_ID));
|       `-- // map IOAPIC address to uncacheable linear address
|           set_fixmap_nocache(idx, ioapic_phys);
|       // Now we can use linear address to access APIC space.
|-- init_IRQ();
|   |-- init_ISA_irqs();
|   |   |-- /* An initial setup of the virtual wire mode. */
|   |   |   init_bsp_APIC();
|   |   `-- init_8259A(auto_eoi=0);
|   `-- setup SMP/APIC interrupt handlers, esp. IPI.
`-- mem_init();
    `-- /* delay zapping low mapping entries for SMP: zap_low_mappings() */

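IPI (InterProcessor Interrupt), a CPU-to-CPU interrupt sent through the local APIC, is the mechanism used by BSP to trigger APs.

Be aware that "one local APIC per CPU is required" in an MP-compliant system. Processors do not share the APIC local unit address space (physical address 0xFEE00000 - 0xFEEFFFFF), but they do share the APIC I/O units (0xFEC00000 - 0xFECFFFFF). Both address spaces are uncacheable.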


8.2. smp_init()

BSP calls start_kernel() -> smp_init() -> smp_boot_cpus() to set up data structures for each CPU and activate the rest of the APs.
///////////////////////////////////////////////////////////////////////////////
static void __init smp_init(void)
{
        /* Get other processors into their bootup holding patterns. */
        smp_boot_cpus();
        wait_init_idle = cpu_online_map;
        clear_bit(current->processor, &wait_init_idle); /* Don't wait on me! */

        smp_threads_ready=1;
        smp_commence() {
                /* Lets the callins below out of their loop. */
                Dprintk("Setting commenced=1, go go go\n");
                wmb();
                atomic_set(&smp_commenced,1);
        }

        /* Wait for the other cpus to set up their idle processes */
        printk("Waiting on wait_init_idle (map = 0x%lx)\n", wait_init_idle);
        while (wait_init_idle) {
                cpu_relax();    // i.e. "rep;nop"
                barrier();
        }
        printk("All processors have done init_idle\n");
}

///////////////////////////////////////////////////////////////////////////////
void __init smp_boot_cpus(void)
{
        // ... something not very interesting :-)

        /* Initialize the logical to physical CPU number mapping
         * and the per-CPU profiling counter/multiplier */
        prof_counter[0..NR_CPUS-1] = 0;
        prof_old_multiplier[0..NR_CPUS-1] = 0;
        prof_multiplier[0..NR_CPUS-1] = 0;

        init_cpu_to_apicid() {
                physical_apicid_2_cpu[0..MAX_APICID-1] = -1;
                logical_apicid_2_cpu[0..MAX_APICID-1] = -1;
                cpu_2_physical_apicid[0..NR_CPUS-1] = 0;
                cpu_2_logical_apicid[0..NR_CPUS-1] = 0;
        }

        /* Setup boot CPU information */
        smp_store_cpu_info(0); /* Final full version of the data */
        printk("CPU%d: ", 0);
        print_cpu_info(&cpu_data[0]);

        /* We have the boot CPU online for sure. */
        set_bit(0, &cpu_online_map);
        boot_cpu_logical_apicid = logical_smp_processor_id() {
                GET_APIC_LOGICAL_ID(*(unsigned long *)(APIC_BASE+APIC_LDR));
        }
        map_cpu_to_boot_apicid(0, boot_cpu_apicid) {
               physical_apicid_2_cpu[boot_cpu_apicid] = 0;
               cpu_2_physical_apicid[0] = boot_cpu_apicid;
        }

        global_irq_holder = 0;
        current->processor = 0;
        init_idle();    // will clear corresponding bit in wait_init_idle
        smp_tune_scheduling();

        // ... some conditions checked

        connect_bsp_APIC();     // enable APIC mode if used to be PIC mode
        setup_local_APIC();

        if (GET_APIC_ID(apic_read(APIC_ID)) != boot_cpu_physical_apicid)
                BUG();

        /* Scan the CPU present map and fire up the other CPUs
         *   via do_boot_cpu() */
        Dprintk("CPU present map: %lx\n", phys_cpu_present_map);
        for (bit = 0; bit < NR_CPUS; bit++) {
                apicid = cpu_present_to_apicid(bit);
                /* Don't even attempt to start the boot CPU! */
                if (apicid == boot_cpu_apicid)
                        continue;
                if (!(phys_cpu_present_map & (1 << bit)))
                        continue;
                if ((max_cpus >= 0) && (max_cpus <= cpucount+1))
                        continue;
                do_boot_cpu(apicid);
                /* Make sure we unmap all failed CPUs */
                if ((boot_apicid_to_cpu(apicid) == -1) &&
                                (phys_cpu_present_map & (1 << bit)))
                        printk("CPU #%d not responding - cannot use it.\n",
                                                                apicid);
        }

        // ... SMP BogoMIPS
        // ... B stepping processor warning
        // ... HyperThreading handling

        /* Set up all local APIC timers in the system */
        setup_APIC_clocks();

        /* Synchronize the TSC with the AP */
        if (cpu_has_tsc && cpucount)
                synchronize_tsc_bp();

smp_done:
        zap_low_mappings();
}

///////////////////////////////////////////////////////////////////////////////
static void __init do_boot_cpu (int apicid)
{
        cpu = ++cpucount;

        // 1. prepare "idle process" task struct for next AP

        /* We can't use kernel_thread since we must avoid to
         * reschedule the child. */
        if (fork_by_hand() < 0)
                panic("failed fork for CPU %d", cpu);
        /* We remove it from the pidhash and the runqueue
         * once we got the process: */
        idle = init_task.prev_task;
        if (!idle)
                panic("No idle process for CPU %d", cpu);

        /* we schedule the first task manually */
        idle->processor = cpu;
        idle->cpus_runnable = 1 << cpu; // only on this AP!

        map_cpu_to_boot_apicid(cpu, apicid) {
                physical_apicid_2_cpu[apicid] = cpu;
                cpu_2_physical_apicid[cpu] = apicid;
        }

        idle->thread.eip = (unsigned long) start_secondary;

        del_from_runqueue(idle);
        unhash_process(idle);
        init_tasks[cpu] = idle;

        // 2. prepare stack and code (CS:IP) for next AP

        /* start_eip had better be page-aligned! */
        start_eip = setup_trampoline() {
                memcpy(trampoline_base, trampoline_data,
                        trampoline_end - trampoline_data);
                /* trampoline_base was reserved in
                 * start_kernel() -> setup_arch() -> smp_alloc_memory(),
                 * and will be shared by all APs (one by one) */
                return virt_to_phys(trampoline_base);
        }

        /* So we see what's up */
        printk("Booting processor %d/%d eip %lx\n", cpu, apicid, start_eip);
        stack_start.esp = (void *) (1024 + PAGE_SIZE + (char *)idle);
        /* this value is used by next AP when it executes
         *   "lss stack_start,%esp" in
         *   linux/arch/i386/kernel/head.S:startup_32(). */

        /* This grunge runs the startup process for
         * the targeted processor. */
        atomic_set(&init_deasserted, 0);
        Dprintk("Setting warm reset code and vector.\n");

        CMOS_WRITE(0xa, 0xf);
        local_flush_tlb();
        Dprintk("1.\n");
        *((volatile unsigned short *) TRAMPOLINE_HIGH) = start_eip >> 4;
        Dprintk("2.\n");
        *((volatile unsigned short *) TRAMPOLINE_LOW) = start_eip & 0xf;
        Dprintk("3.\n");
        // we have setup 0:467 to start_eip (trampoline_base)

        // 3. kick AP to run (AP gets CS:IP from 0:467)

        // Starting actual IPI sequence...
        boot_error = wakeup_secondary_via_INIT(apicid, start_eip);
        if (!boot_error) {      // looks OK
                /* allow APs to start initializing. */
                set_bit(cpu, &cpu_callout_map);

                /* ... Wait 5s total for a response */

                // bit cpu in cpu_callin_map is set by AP in smp_callin()
                if (test_bit(cpu, &cpu_callin_map)) {
                        print_cpu_info(&cpu_data[cpu]);
                } else {
                        boot_error= 1;
                        // marker 0xA5 set by AP in trampoline_data()
                        if (*((volatile unsigned char *)phys_to_virt(8192))
                                        == 0xA5)
                                /* trampoline started but... */
                                printk("Stuck ??\n");
                        else
                                /* trampoline code not run */
                                printk("Not responding.\n");
                }
        }
        if (boot_error) {
                /* Try to put things back the way they were before ... */
                unmap_cpu_to_boot_apicid(cpu, apicid);
                clear_bit(cpu, &cpu_callout_map); /* set in do_boot_cpu() */
                clear_bit(cpu, &cpu_initialized); /* set in cpu_init() */
                clear_bit(cpu, &cpu_online_map);  /* set in smp_callin() */
                cpucount--;
        }

        /* mark "stuck" area as not stuck */
        *((volatile unsigned long *)phys_to_virt(8192)) = 0;
}
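Do not confuse start_secondary() with trampoline_data(). The former is the EIP value stored in the task struct of the AP's "idle" process; the latter is the real-mode code that the AP runs after the BSP kicks it (using wakeup_secondary_via_INIT()).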
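To make the kick itself more concrete, the sketch below builds the interrupt command register (ICR) values that an INIT IPI and a STARTUP IPI would carry. It is an illustration under assumptions, not the body of wakeup_secondary_via_INIT(): the helper names are hypothetical, and the constant values reflect the usual local APIC ICR layout rather than anything quoted in this document. The point it shows is that the STARTUP vector is start_eip >> 12, which is why start_eip must be page-aligned and lie below 1 MB.

```c
#include <stdint.h>

/* Assumed ICR field values (usual local APIC layout; not from the text). */
#define DM_INIT        0x00500u /* delivery mode: INIT */
#define DM_STARTUP     0x00600u /* delivery mode: STARTUP */
#define INT_ASSERT     0x04000u /* level: assert */
#define INT_LEVELTRIG  0x08000u /* trigger mode: level */

/* High ICR word: destination APIC ID goes in bits 24..31. */
uint32_t icr_high_for(uint8_t apicid)
{
        return (uint32_t)apicid << 24;
}

/* Low ICR word for the INIT IPI that resets the target AP. */
uint32_t init_ipi_low(void)
{
        return INT_LEVELTRIG | INT_ASSERT | DM_INIT;
}

/* Low ICR word for a STARTUP IPI, or 0 if start_eip is unusable.
 * The 8-bit vector is the 4 KB page number of the real-mode entry. */
uint32_t startup_ipi_low(uint32_t start_eip)
{
        if ((start_eip & 0xFFFu) != 0 || start_eip >= 0x100000u)
                return 0;       /* not page-aligned, or above 1 MB */
        return DM_STARTUP | (start_eip >> 12);
}
```

With a trampoline at physical 0x9F000 (an example value), startup_ipi_low() gives 0x69F, so the AP would begin executing in real mode at 9F00:0000.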


8.3. linux/arch/i386/kernel/trampoline.S

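This file contains the 16-bit real-mode AP startup code. The BSP reserved the memory space trampoline_base in start_kernel() -> setup_arch() -> smp_alloc_memory(). Before the BSP triggers an AP, it copies the trampoline code (between trampoline_data and trampoline_end) to trampoline_base, in do_boot_cpu() -> setup_trampoline(). The BSP also sets up 0:467 to point to trampoline_base, so that the AP starts running from there.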
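The vector stored at 0:467 is an ordinary real-mode far pointer: an offset word followed by a segment word. The sketch below mirrors the two stores in do_boot_cpu() (start_eip & 0xf and start_eip >> 4) and shows that a real-mode CPU recombines them into the original physical address. The struct and function names are hypothetical.

```c
#include <stdint.h>

/* Real-mode far pointer as stored at 0:467 (offset word, then segment word). */
struct real_mode_vector {
        uint16_t offset;
        uint16_t segment;
};

/* Split a physical address below 1 MB the way do_boot_cpu() does. */
struct real_mode_vector make_warm_reset_vector(uint32_t start_eip)
{
        struct real_mode_vector v;

        v.segment = (uint16_t)(start_eip >> 4);  /* TRAMPOLINE_HIGH */
        v.offset  = (uint16_t)(start_eip & 0xf); /* TRAMPOLINE_LOW  */
        return v;
}

/* Address a real-mode CPU reaches through the vector: segment*16 + offset. */
uint32_t real_mode_target(struct real_mode_vector v)
{
        return ((uint32_t)v.segment << 4) + v.offset;
}
```

For a page-aligned start_eip such as 0x9F000 (an example value), the offset is always 0 and the segment is 0x9F00.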

///////////////////////////////////////////////////////////////////////////////
trampoline_data()
{
r_base:
        wbinvd;         // Needed for NUMA-Q should be harmless for other
        DS = CS;
        BX = 1;         // Flag an SMP trampoline
        cli;

        // write marker for master knows we're running
        trampoline_base = 0xA5A5A5A5;

        lidt idt_48;
        lgdt gdt_48;

        AX = 1;
        lmsw AX;        // protected mode!
        goto flush_instr;
flush_instr:
        goto CS:100000; // see linux/arch/i386/kernel/head.S:startup_32()
}

idt_48:
        .word   0                       # idt limit = 0
        .word   0, 0                    # idt base = 0L

gdt_48:
        .word   0x0800                  # gdt limit = 2048, 256 GDT entries
        .long   gdt_table-__PAGE_OFFSET # gdt base = gdt (first SMP CPU)

.globl SYMBOL_NAME(trampoline_end)
SYMBOL_NAME_LABEL(trampoline_end)
Note that BX=1 when the AP jumps to linux/arch/i386/kernel/head.S:startup_32(), whereas BX=0 for the BSP. See Section 6.
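The gdt_48 operand above deserves a short worked example. The trampoline runs before paging is enabled, so the GDT base must be a physical address, which is why it is written as gdt_table-__PAGE_OFFSET. The sketch below redoes that arithmetic and the "2048 bytes, 256 entries" figure from the comment. It assumes the default i386 value 0xC0000000 for __PAGE_OFFSET, and the helper names are hypothetical.

```c
#include <stdint.h>

#define PAGE_OFFSET_ASSUMED 0xC0000000u /* assumption: default 3 GB/1 GB split */
#define GDT_ENTRY_SIZE      8u          /* each i386 segment descriptor is 8 bytes */

/* Kernel virtual address -> physical address for directly mapped memory. */
uint32_t virt_to_phys_sketch(uint32_t vaddr)
{
        return vaddr - PAGE_OFFSET_ASSUMED;
}

/* Number of descriptors covered by the size written in gdt_48. */
uint32_t gdt_entries(uint16_t size_in_bytes)
{
        return size_in_bytes / GDT_ENTRY_SIZE;
}
```

So gdt_entries(0x0800) is 256, and a gdt_table linked at, say, 0xC0100000 (an illustrative address only) would be loaded with base 0x00100000.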


8.4. initialize_secondary()

Unlike the BSP, at the end of linux/arch/i386/kernel/head.S:startup_32() (see Section 6.4) an AP calls initialize_secondary() instead of start_kernel().

/* Everything has been set up for the secondary
 * CPUs - they just need to reload everything
 * from the task structure
 * This function must not return. */
void __init initialize_secondary(void)
{
        /* We don't actually need to load the full TSS,
         * basically just the stack pointer and the eip. */
        asm volatile(
                "movl %0,%%esp\n\t"
                "jmp *%1"
                :
                :"r" (current->thread.esp),"r" (current->thread.eip));
}
Because the BSP set thread.eip to start_secondary() in do_boot_cpu(), control of the AP passes to that function. The AP uses a new stack frame, which the BSP set up in do_boot_cpu() -> fork_by_hand() -> do_fork().


8.5. start_secondary()

All APs wait for the signal smp_commenced from the BSP, which is raised in smp_init() -> smp_commence() (see Section 8.2). After receiving this signal, they run their "idle" processes.
///////////////////////////////////////////////////////////////////////////////
int __init start_secondary(void *unused)
{
        /* Dont put anything before smp_callin(), SMP
         * booting is too fragile that we want to limit the
         * things done here to the most necessary things. */
        cpu_init();
        smp_callin();
        while (!atomic_read(&smp_commenced))
                rep_nop();
        /* low-memory mappings have been cleared, flush them from
         * the local TLBs too. */
        local_flush_tlb();
        return cpu_idle();      // never return, see Section 7.3
}
cpu_idle() -> init_idle() clears the corresponding bit in wait_init_idle. Once every bit is clear, the BSP finishes smp_init() and continues with the next function in start_kernel(), i.e. rest_init().
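The wait_init_idle handshake is plain bitmap bookkeeping. The sketch below replays it in a single thread to show the order of events; it deliberately omits the atomic bit operations and memory barriers that the real kernel needs, and the function names are hypothetical.

```c
/* BSP side: start from the online map and drop our own bit
 * ("Don't wait on me!" in smp_init()). */
unsigned long start_waiting(unsigned long cpu_online_map, int self)
{
        return cpu_online_map & ~(1UL << self);
}

/* AP side: what init_idle() does to the map for one CPU. */
unsigned long mark_idle_ready(unsigned long wait_map, int cpu)
{
        return wait_map & ~(1UL << cpu);
}

/* BSP loop condition: smp_init() spins until this becomes true. */
int all_cpus_ready(unsigned long wait_map)
{
        return wait_map == 0;
}
```

With three CPUs online (map 0x7) and CPU 0 as the BSP, the map starts at 0x6, drops to 0x4 when CPU 1 reaches init_idle(), and reaches 0 when CPU 2 does, at which point the BSP leaves its wait loop.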


A. Kernel Build Example

Here is a kernel build example (in Redhat 9.0). Statements between "/*" and "*/" are in-line comments, not console output.
[root@localhost root]# ln -s /usr/src/linux-2.4.20 /usr/src/linux
[root@localhost root]# cd /usr/src/linux
[root@localhost linux]# make xconfig
        /* Create .config
         *   1. "Load Configuration from File" ->
         *        /boot/config-2.4.20-28.9, or whatever you like
         *   2. Modify kernel configuration parameters
         *   3. "Save and Exit" */
[root@localhost linux]# make oldconfig
        /* Re-check .config, optional */
[root@localhost linux]# vi Makefile
        /* Modify EXTRAVERSION in linux/Makefile, optional */
[root@localhost linux]# make dep
        /* Create .depend and more */
[root@localhost linux]# make bzImage
        /* ... Some output omitted */
ld -m elf_i386 -T /usr/src/linux-2.4.20/arch/i386/vmlinux.lds -e stext arch/i386
/kernel/head.o arch/i386/kernel/init_task.o init/main.o init/version.o init/do_m
ounts.o \
        --start-group \
        arch/i386/kernel/kernel.o arch/i386/mm/mm.o kernel/kernel.o mm/mm.o fs/f
s.o ipc/ipc.o \
         drivers/char/char.o drivers/block/block.o drivers/misc/misc.o drivers/n
et/net.o drivers/media/media.o drivers/char/drm/drm.o drivers/net/fc/fc.o driver
s/net/appletalk/appletalk.o drivers/net/tokenring/tr.o drivers/net/wan/wan.o dri
vers/atm/atm.o drivers/ide/idedriver.o drivers/cdrom/driver.o drivers/pci/driver
.o drivers/net/pcmcia/pcmcia_net.o drivers/net/wireless/wireless_net.o drivers/p
np/pnp.o drivers/video/video.o drivers/net/hamradio/hamradio.o drivers/md/mddev.
o drivers/isdn/vmlinux-obj.o \
        net/network.o \
        /usr/src/linux-2.4.20/arch/i386/lib/lib.a /usr/src/linux-2.4.20/lib/lib.
a /usr/src/linux-2.4.20/arch/i386/lib/lib.a \
        --end-group \
        -o vmlinux
nm vmlinux | grep -v '\(compiled\)\|\(\.o$\)\|\( [aUw] \)\|\(\.\.ng$\)\|\(LASH[R
L]DI\)' | sort > System.map
make[1]: Entering directory `/usr/src/linux-2.4.20/arch/i386/boot'
gcc -E -D__KERNEL__ -I/usr/src/linux-2.4.20/include -D__BIG_KERNEL__ -traditiona
l -DSVGA_MODE=NORMAL_VGA  bootsect.S -o bbootsect.s
as -o bbootsect.o bbootsect.s
bootsect.S: Assembler messages:
bootsect.S:239: Warning: indirect lcall without `*'
ld -m elf_i386 -Ttext 0x0 -s --oformat binary bbootsect.o -o bbootsect
gcc -E -D__KERNEL__ -I/usr/src/linux-2.4.20/include -D__BIG_KERNEL__ -D__ASSEMBL
Y__ -traditional -DSVGA_MODE=NORMAL_VGA  setup.S -o bsetup.s
as -o bsetup.o bsetup.s
setup.S: Assembler messages:
setup.S:230: Warning: indirect lcall without `*'
ld -m elf_i386 -Ttext 0x0 -s --oformat binary -e begtext -o bsetup bsetup.o
make[2]: Entering directory `/usr/src/linux-2.4.20/arch/i386/boot/compressed'
tmppiggy=_tmp_$$piggy; \
rm -f $tmppiggy $tmppiggy.gz $tmppiggy.lnk; \
objcopy -O binary -R .note -R .comment -S /usr/src/linux-2.4.20/vmlinux $tmppigg
y; \
gzip -f -9 < $tmppiggy > $tmppiggy.gz; \
echo "SECTIONS { .data : { input_len = .; LONG(input_data_end - input_data) inpu
t_data = .; *(.data) input_data_end = .; }}" > $tmppiggy.lnk; \
ld -m elf_i386 -r -o piggy.o -b binary $tmppiggy.gz -b elf32-i386 -T $tmppiggy.l
nk; \
rm -f $tmppiggy $tmppiggy.gz $tmppiggy.lnk
gcc -D__ASSEMBLY__ -D__KERNEL__ -I/usr/src/linux-2.4.20/include -traditional -c
head.S
gcc -D__KERNEL__ -I/usr/src/linux-2.4.20/include -Wall -Wstrict-prototypes -Wno-
trigraphs -O2 -fno-strict-aliasing -fno-common -fomit-frame-pointer -pipe -mpref
erred-stack-boundary=2 -march=i686 -DKBUILD_BASENAME=misc -c misc.c
ld -m elf_i386 -Ttext 0x100000 -e startup_32 -o bvmlinux head.o misc.o piggy.o
make[2]: Leaving directory `/usr/src/linux-2.4.20/arch/i386/boot/compressed'
gcc -Wall -Wstrict-prototypes -O2 -fomit-frame-pointer -o tools/build tools/buil
d.c -I/usr/src/linux-2.4.20/include
objcopy -O binary -R .note -R .comment -S compressed/bvmlinux compressed/bvmlinu
x.out
tools/build -b bbootsect bsetup compressed/bvmlinux.out CURRENT > bzImage
Root device is (3, 67)
Boot sector 512 bytes.
Setup is 4780 bytes.
System is 852 kB
make[1]: Leaving directory `/usr/src/linux-2.4.20/arch/i386/boot'
[root@localhost linux]# make modules modules_install
        /* ... Some output omitted */
cd /lib/modules/2.4.20; \
mkdir -p pcmcia; \
find kernel -path '*/pcmcia/*' -name '*.o' | xargs -i -r ln -sf ../{} pcmcia
if [ -r System.map ]; then /sbin/depmod -ae -F System.map  2.4.20; fi
[root@localhost linux]# cp arch/i386/boot/bzImage /boot/vmlinuz-2.4.20
[root@localhost linux]# cp vmlinux /boot/vmlinux-2.4.20
[root@localhost linux]# cp System.map /boot/System.map-2.4.20
[root@localhost linux]# cp .config /boot/config-2.4.20
[root@localhost linux]# mkinitrd /boot/initrd-2.4.20.img 2.4.20
[root@localhost linux]# vi /boot/grub/grub.conf
        /* Add the following lines to grub.conf:
title Linux (2.4.20)
        kernel /vmlinuz-2.4.20 ro root=LABEL=/
        initrd /initrd-2.4.20.img
         */

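Refer to Kernelnewbies FAQ: How do I compile a kernel and Kernel Rebuild Procedure for more details.

To build the kernel in Debian, also refer to Debian Installation Manual: Compiling a New Kernel, The Debian GNU/Linux FAQ: Debian and the kernel, and Debian Reference: The Linux kernel under Debian. Check "zless /usr/share/doc/kernel-package/Problems.gz" if you encounter problems.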
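The sizes reported by tools/build in the listing above ("Setup is 4780 bytes", "System is 852 kB") map onto the header fields setup_sects and syssize by simple rounding. The sketch below shows that arithmetic as I understand it from tools/build.c (Section 2.6). The minimum of 4 setup sectors and the helper names are assumptions to be checked against the source.

```c
#define SECTOR_SIZE      512u
#define MIN_SETUP_SECTS  4u     /* assumption: minimum enforced by tools/build */

/* Setup code size in bytes -> number of 512-byte sectors, rounded up. */
unsigned int setup_sectors_for(unsigned int setup_bytes)
{
        unsigned int sectors = (setup_bytes + SECTOR_SIZE - 1) / SECTOR_SIZE;

        return sectors < MIN_SETUP_SECTS ? MIN_SETUP_SECTS : sectors;
}

/* System image size in bytes -> size in 16-byte paragraphs, rounded up. */
unsigned int sys_size_paragraphs(unsigned int system_bytes)
{
        return (system_bytes + 15u) / 16u;
}
```

For the build above, setup_sectors_for(4780) is 10. A system image of exactly 852 kB (872448 bytes, an illustrative figure since the listing rounds to kB) would give a syssize of 54528 paragraphs.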


B. Internal Linker Script

If the -T (--script=) option is not specified, ld uses this built-in script to link targets:
[root@localhost linux]# ld --verbose
GNU ld version 2.13.90.0.18 20030206
  Supported emulations:
   elf_i386
   i386linux
using internal linker script:
==================================================
/* Script for -z combreloc: combine and sort reloc sections */
OUTPUT_FORMAT("elf32-i386", "elf32-i386",
              "elf32-i386")
OUTPUT_ARCH(i386)
ENTRY(_start)
SEARCH_DIR("/usr/i386-redhat-linux/lib"); SEARCH_DIR("/usr/lib"); SEARCH_DIR("/u
sr/local/lib"); SEARCH_DIR("/lib");
/* Do we need any of these for elf?
   __DYNAMIC = 0;    */
SECTIONS
{
  /* Read-only sections, merged into text segment: */
  . = 0x08048000 + SIZEOF_HEADERS;
  .interp         : { *(.interp) }
  .hash           : { *(.hash) }
  .dynsym         : { *(.dynsym) }
  .dynstr         : { *(.dynstr) }
  .gnu.version    : { *(.gnu.version) }
  .gnu.version_d  : { *(.gnu.version_d) }
  .gnu.version_r  : { *(.gnu.version_r) }
  .rel.dyn        :
    {
      *(.rel.init)
      *(.rel.text .rel.text.* .rel.gnu.linkonce.t.*)
      *(.rel.fini)
      *(.rel.rodata .rel.rodata.* .rel.gnu.linkonce.r.*)
      *(.rel.data .rel.data.* .rel.gnu.linkonce.d.*)
      *(.rel.tdata .rel.tdata.* .rel.gnu.linkonce.td.*)
      *(.rel.tbss .rel.tbss.* .rel.gnu.linkonce.tb.*)
      *(.rel.ctors)
      *(.rel.dtors)
      *(.rel.got)
      *(.rel.bss .rel.bss.* .rel.gnu.linkonce.b.*)
    }
  .rela.dyn       :
    {
      *(.rela.init)
      *(.rela.text .rela.text.* .rela.gnu.linkonce.t.*)
      *(.rela.fini)
      *(.rela.rodata .rela.rodata.* .rela.gnu.linkonce.r.*)
      *(.rela.data .rela.data.* .rela.gnu.linkonce.d.*)
      *(.rela.tdata .rela.tdata.* .rela.gnu.linkonce.td.*)
      *(.rela.tbss .rela.tbss.* .rela.gnu.linkonce.tb.*)
      *(.rela.ctors)
      *(.rela.dtors)
      *(.rela.got)
      *(.rela.bss .rela.bss.* .rela.gnu.linkonce.b.*)
    }
  .rel.plt        : { *(.rel.plt) }
  .rela.plt       : { *(.rela.plt) }
  .init           :
  {
    KEEP (*(.init))
  } =0x90909090
  .plt            : { *(.plt) }
  .text           :
  {
    *(.text .stub .text.* .gnu.linkonce.t.*)
    /* .gnu.warning sections are handled specially by elf32.em.  */
    *(.gnu.warning)
  } =0x90909090
  .fini           :
  {
    KEEP (*(.fini))
  } =0x90909090
  PROVIDE (__etext = .);
  PROVIDE (_etext = .);
  PROVIDE (etext = .);
  .rodata         : { *(.rodata .rodata.* .gnu.linkonce.r.*) }
  .rodata1        : { *(.rodata1) }
  .eh_frame_hdr : { *(.eh_frame_hdr) }
  .eh_frame       : ONLY_IF_RO { KEEP (*(.eh_frame)) }
  .gcc_except_table   : ONLY_IF_RO { *(.gcc_except_table) }
  /* Adjust the address for the data segment.  We want to adjust up to
     the same address within the page on the next page up.  */
  . = ALIGN (0x1000) - ((0x1000 - .) & (0x1000 - 1)); . = DATA_SEGMENT_ALIGN (0x
1000, 0x1000);
  /* For backward-compatibility with tools that don't support the
     *_array_* sections below, our glibc's crt files contain weak
     definitions of symbols that they reference.  We don't want to use
     them, though, unless they're strictly necessary, because they'd
     bring us empty sections, unlike PROVIDE below, so we drop the
     sections from the crt files here.  */
  /DISCARD/ : {
      */crti.o(.init_array .fini_array .preinit_array)
      */crtn.o(.init_array .fini_array .preinit_array)
  }
  /* Ensure the __preinit_array_start label is properly aligned.  We
     could instead move the label definition inside the section, but
     the linker would then create the section even if it turns out to
     be empty, which isn't pretty.  */
  . = ALIGN(32 / 8);
  PROVIDE (__preinit_array_start = .);
  .preinit_array     : { *(.preinit_array) }
  PROVIDE (__preinit_array_end = .);
  PROVIDE (__init_array_start = .);
  .init_array     : { *(.init_array) }
  PROVIDE (__init_array_end = .);
  PROVIDE (__fini_array_start = .);
  .fini_array     : { *(.fini_array) }
  PROVIDE (__fini_array_end = .);
  .data           :
  {
    *(.data .data.* .gnu.linkonce.d.*)
    SORT(CONSTRUCTORS)
  }
  .data1          : { *(.data1) }
  .tdata          : { *(.tdata .tdata.* .gnu.linkonce.td.*) }
  .tbss           : { *(.tbss .tbss.* .gnu.linkonce.tb.*) *(.tcommon) }
  .eh_frame       : ONLY_IF_RW { KEEP (*(.eh_frame)) }
  .gcc_except_table   : ONLY_IF_RW { *(.gcc_except_table) }
  .dynamic        : { *(.dynamic) }
  .ctors          :
  {
    /* gcc uses crtbegin.o to find the start of
       the constructors, so we make sure it is
       first.  Because this is a wildcard, it
       doesn't matter if the user does not
       actually link against crtbegin.o; the
       linker won't look for a file to match a
       wildcard.  The wildcard also means that it
       doesn't matter which directory crtbegin.o
       is in.  */
    KEEP (*crtbegin.o(.ctors))
    /* We don't want to include the .ctor section from
       from the crtend.o file until after the sorted ctors.
       The .ctor section from the crtend file contains the
       end of ctors marker and it must be last */
    KEEP (*(EXCLUDE_FILE (*crtend.o ) .ctors))
    KEEP (*(SORT(.ctors.*)))
    KEEP (*(.ctors))
  }
  .dtors          :
  {
    KEEP (*crtbegin.o(.dtors))
    KEEP (*(EXCLUDE_FILE (*crtend.o ) .dtors))
    KEEP (*(SORT(.dtors.*)))
    KEEP (*(.dtors))
  }
  .jcr            : { KEEP (*(.jcr)) }
  .got            : { *(.got.plt) *(.got) }
  _edata = .;
  PROVIDE (edata = .);
  __bss_start = .;
  .bss            :
  {
   *(.dynbss)
   *(.bss .bss.* .gnu.linkonce.b.*)
   *(COMMON)
   /* Align here to ensure that the .bss section occupies space up to
      _end.  Align after .bss to ensure correct alignment even if the
      .bss section disappears because there are no input sections.  */
   . = ALIGN(32 / 8);
  }
  . = ALIGN(32 / 8);
  _end = .;
  PROVIDE (end = .);
  . = DATA_SEGMENT_END (.);
  /* Stabs debugging sections.  */
  .stab          0 : { *(.stab) }
  .stabstr       0 : { *(.stabstr) }
  .stab.excl     0 : { *(.stab.excl) }
  .stab.exclstr  0 : { *(.stab.exclstr) }
  .stab.index    0 : { *(.stab.index) }
  .stab.indexstr 0 : { *(.stab.indexstr) }
  .comment       0 : { *(.comment) }
  /* DWARF debug sections.
     Symbols in the DWARF debugging sections are relative to the beginning
     of the section so we begin them at 0.  */
  /* DWARF 1 */
  .debug          0 : { *(.debug) }
  .line           0 : { *(.line) }
  /* GNU DWARF 1 extensions */
  .debug_srcinfo  0 : { *(.debug_srcinfo) }
  .debug_sfnames  0 : { *(.debug_sfnames) }
  /* DWARF 1.1 and DWARF 2 */
  .debug_aranges  0 : { *(.debug_aranges) }
  .debug_pubnames 0 : { *(.debug_pubnames) }
  /* DWARF 2 */
  .debug_info     0 : { *(.debug_info .gnu.linkonce.wi.*) }
  .debug_abbrev   0 : { *(.debug_abbrev) }
  .debug_line     0 : { *(.debug_line) }
  .debug_frame    0 : { *(.debug_frame) }
  .debug_str      0 : { *(.debug_str) }
  .debug_loc      0 : { *(.debug_loc) }
  .debug_macinfo  0 : { *(.debug_macinfo) }
  /* SGI/MIPS DWARF 2 extensions */
  .debug_weaknames 0 : { *(.debug_weaknames) }
  .debug_funcnames 0 : { *(.debug_funcnames) }
  .debug_typenames 0 : { *(.debug_typenames) }
  .debug_varnames  0 : { *(.debug_varnames) }
}


==================================================
[root@localhost linux]# 


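C. GRUB and LILO

Both GNU GRUB and LILO understand the real-mode kernel header format and load the bootsect (one sector), the setup code (setup_sects sectors) and the compressed kernel image (syssize*16 bytes) into memory. They fill in the loader identifier (type_of_loader) and try to pass appropriate parameters and options to the kernel. When they have finished, control passes to the setup code.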
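As a rough illustration of what "understanding the header" involves, the sketch below reads setup_sects and syssize from an in-memory copy of the first sectors of a kernel image, computes how many bytes a loader would need to fetch, and writes type_of_loader. The field offsets follow the Linux/i386 boot protocol as I recall it for this kernel generation (16-bit syssize); treat them as assumptions and check them against Documentation/i386/boot.txt. All names are hypothetical, and neither GRUB nor LILO is structured this way.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed header offsets from the start of the boot sector. */
#define OFF_SETUP_SECTS     0x1F1       /* 1 byte */
#define OFF_SYSSIZE         0x1F4       /* 2 bytes, in 16-byte units */
#define OFF_BOOT_FLAG       0x1FE       /* 2 bytes, 0xAA55 */
#define OFF_TYPE_OF_LOADER  0x210       /* 1 byte, inside the setup code */

struct load_plan {
        uint32_t setup_bytes;   /* setup_sects * 512 */
        uint32_t system_bytes;  /* syssize * 16 */
};

static uint16_t read_le16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

/* Fill *plan from the header; return 0 on success, -1 on a bad image. */
int plan_kernel_load(const uint8_t *image, size_t len, struct load_plan *plan)
{
        uint8_t setup_sects;

        if (len < 512 || read_le16(image + OFF_BOOT_FLAG) != 0xAA55)
                return -1;
        setup_sects = image[OFF_SETUP_SECTS];
        if (setup_sects == 0)
                setup_sects = 4;        /* protocol: 0 is treated as 4 */
        plan->setup_bytes = setup_sects * 512u;
        plan->system_bytes = read_le16(image + OFF_SYSSIZE) * 16u;
        return 0;
}

/* Record the loader identifier; return -1 if the buffer is too short. */
int set_loader_type(uint8_t *image, size_t len, uint8_t loader_id)
{
        if (len <= OFF_TYPE_OF_LOADER)
                return -1;
        image[OFF_TYPE_OF_LOADER] = loader_id;
        return 0;
}
```

A loader would call plan_kernel_load() on the first sector, read setup_bytes and system_bytes more data, and then call set_loader_type() with its own identifier before jumping to the setup code.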


C.1. GNU GRUB

The following GNU GRUB program outline is based on grub-0.93.
stage2/stage2.c:cmain()
`-- run_menu()
    `-- run_script();
        |-- builtin = find_command(heap);
        |-- kernel_func();              // builtin->func() for command "kernel"
        |   `-- load_image();           // search BOOTSEC_SIGNATURE in boot.c
        |   /* memory from 0x100000 is populated by and in the order of
        |    *   (bvmlinux, bbootsect, bsetup) or (vmlinux, bootsect, setup) */
        |-- initrd_func();              // for command "initrd"
        |   `-- load_initrd();
        `-- boot_func();                // for implicit command "boot"
            `-- linux_boot();           // defined in stage2/asm.S
                or big_linux_boot();    //   not in grub/asmstub.c!

// In stage2/asm.S
linux_boot:
        /* copy kernel */
        move system code from 0x100000 to 0x10000 (linux_text_len bytes);
big_linux_boot:
        /* copy the real mode part */
        EBX = linux_data_real_addr;
        move setup code from linux_data_tmp_addr (0x100000+text_len)
            to linux_data_real_addr (0x9100 bytes);
        /* change %ebx to the segment address */
        linux_setup_seg = (EBX >> 4) + 0x20;
        /* XXX new stack pointer in safe area for calling functions */
        ESP = 0x4000;
        stop_floppy();
        /* final setup for linux boot */
        prot_to_real();
        cli;
        SS:ESP = BX:9000;
        DS = ES = FS = GS = BX;
        /* jump to start, i.e. ljmp linux_setup_seg:0
         * Note that linux_setup_seg is just changed to BX. */
        .byte   0xea
        .word   0
linux_setup_seg:
        .word   0

Refer to "info grub" for the GRUB manual.

If you are porting grub-0.93 and making changes to bsetup, be aware of a reported GNU GRUB bug.
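The line "linux_setup_seg = (EBX >> 4) + 0x20" in the outline above is worth a worked example. EBX holds the physical address where the real-mode part (bootsect followed by setup) was copied; shifting by 4 turns it into a segment, and adding 0x20 paragraphs (512 bytes) skips the boot sector, so the far jump lands on the first byte of the setup code. The helper names below are hypothetical, and 0x90000 is only an example address.

```c
#include <stdint.h>

#define BOOTSECT_PARAGRAPHS 0x20u       /* 512-byte boot sector / 16 */

/* Segment value patched into the "ljmp linux_setup_seg:0" instruction. */
uint16_t setup_segment(uint32_t linux_data_real_addr)
{
        return (uint16_t)((linux_data_real_addr >> 4) + BOOTSECT_PARAGRAPHS);
}

/* Physical address reached by jumping to setup_segment:0. */
uint32_t setup_entry_phys(uint32_t linux_data_real_addr)
{
        return (uint32_t)setup_segment(linux_data_real_addr) << 4;
}
```

With the real-mode data at 0x90000, the segment is 0x9020 and execution continues at physical address 0x90200, one sector past the start of the copied data.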


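C.2. LILO

Unlike GRUB, LILO does not read its configuration file when booting the system. The work is done beforehand, when lilo is invoked from a terminal.

The following LILO program outline is based on lilo-22.5.8.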
lilo.c:main()
|-- cfg_open(config_file);
|-- cfg_parse(cf_options);
|-- bsect_open(boot_dev, map_file, install, delay, timeout);
|   |-- open_bsect(boot_dev);
|   `-- map_create(map_file);
|-- cfg_parse(cf_top)
|   `-- cfg_do_set();
|       `-- do_image();             // walk->action for "image=" section
|           |-- cfg_parse(cf_image) -> cfg_do_set();
|           |-- bsect_common(&descr, 1);
|           |   |-- map_begin_section();
|           |   |-- map_add_sector(fallback_buf);
|           |   `-- map_add_sector(options);
|           |-- boot_image(name, &descr) or boot_device(name, range, &descr);
|           |   |-- int fd = geo_open(&descr, name, O_RDONLY);
|           |   |   read(fd, &buff, SECTOR_SIZE);
|           |   |   map_add(&geo, 0, image_sectors);
|           |   |   map_end_section(&descr->start, setup_sects+2+1);
|           |   |       /* two sectors created in bsect_common(),
|           |   |        *   another one sector for bootsect */
|           |   |   geo_close(&geo);
|           |   `-- fd = geo_open(&descr, initrd, O_RDONLY);
|           |       map_begin_section();
|           |       map_add(&geo, 0, initrd_sectors);
|           |       map_end_section(&descr->initrd,0);
|           |       geo_close(&geo);
|           `-- bsect_done(name, &descr);
`-- bsect_update(backup_file, force_backup, 0); // update boot sector
    |-- make_backup();
    |-- map_begin_section();
    |   map_add_sector(table);
    |   map_write(&param2, keytab, 0, 0);
    |   map_close(&param2, here2);
    |-- // ... perform the relocation of the boot sector
    |-- // ... setup bsect_wr to correct place
    |-- write(fd, bsect_wr, SECTOR_SIZE);
    `-- close(fd);
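map_add(), map_add_sector() and map_add_zero() may call map_register() to complete their jobs, while map_register() keeps a list of all the (CX, DX, AL) triplets (data structure SECTOR_ADDR) that identify the registered sectors.

LILO runs first.S and second.S to boot a system. It calls second.S:doboot() to load the map file, the bootsect and the setup code. It then calls lfile() to load the system code, calls launch2() -> launch() -> cl_wait() -> start_setup() -> start_setup2(), and finally executes the "jmpi 0,SETUPSEG" instruction to run the setup code.

Refer to "man lilo" and "man lilo.conf" for LILO details.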
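The (CX, DX, AL) triplet mentioned above matches the register layout of the BIOS INT 13h read call. The sketch below packs a cylinder/head/sector address into that layout. It assumes plain CHS addressing only, it does not reproduce LILO's SECTOR_ADDR structure, and the names are hypothetical.

```c
#include <stdint.h>

/* Register image for a BIOS INT 13h, AH=02h read (CHS addressing). */
struct bios_chs {
        int      ok;    /* 1 if the inputs were in range */
        uint16_t cx;    /* CH = cylinder bits 0..7, CL = sector | cyl bits 8..9 << 6 */
        uint16_t dx;    /* DH = head, DL = drive number */
        uint8_t  al;    /* number of sectors to read */
};

struct bios_chs pack_chs(unsigned cyl, unsigned head, unsigned sector,
                         unsigned drive, unsigned count)
{
        struct bios_chs r = { 0, 0, 0, 0 };
        uint8_t ch, cl;

        /* Sectors are numbered from 1; cylinders are limited to 10 bits. */
        if (cyl > 1023 || head > 255 || sector < 1 || sector > 63 ||
            drive > 255 || count < 1 || count > 255)
                return r;

        ch = (uint8_t)(cyl & 0xFFu);
        cl = (uint8_t)((sector & 0x3Fu) | (((cyl >> 8) & 0x3u) << 6));
        r.cx = (uint16_t)((ch << 8) | cl);
        r.dx = (uint16_t)((head << 8) | drive);
        r.al = (uint8_t)count;
        r.ok = 1;
        return r;
}
```

For cylinder 0x123, head 5, sector 17 on drive 0x80, one sector, this gives CX=0x2351, DX=0x0580 and AL=1.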


D. FAQ

For questions that do not fit in the corresponding sections, or that otherwise belong here. /* TODO: */