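The Linux Kernel Module Programming Guide

Ori Pomerantz would like to thank Yoav Weiss for many helpful ideas and discussions, as well as for finding mistakes in this document before its publication. Ori would also like to thank Frodo Looijaard from the Netherlands, Stephen Judd from New Zealand, Magnus Ahltorp from Sweden and Emmanuel Papirakis from Quebec, Canada.

I'd like to thank Ori Pomerantz for authoring this guide in the first place and then letting me maintain it. It was a tremendous effort on his part. I hope he likes what I've done with this document.

I would also like to thank Jeff Newmiller and Rhonda Bailey for teaching me. They were patient with me and lent me their experience, regardless of how busy they were. David Porter had the unenviable job of converting the original LaTeX source into docbook. It was a long, boring and dirty job. But someone had to do it. Thanks, David.

Thanks also go to the fine people at www.kernelnewbies.org. In particular, Mark McLoughlin and John Levon, who I'm sure have much better things to do than hang out on kernelnewbies.org and teach the newbies. If this guide teaches you anything, they are partially to blame.

Both Ori and I would like to thank Richard M. Stallman and Linus Torvalds for giving us the opportunity not only to run a high-quality operating system, but to take a close look at how it works. I've never met Linus, and probably never will, but he has made a profound difference in my life.

The following people have written to me with corrections or good suggestions: Ignacio Martin, David Porter and Dimo Velev.

2. Authorship And Copyright

The Linux Kernel Module Programming Guide (lkmpg) was originally written by Ori Pomerantz. It became very popular as the best free way to learn how to write Linux kernel modules. Life got busy, and Ori no longer had the time or inclination to maintain the document. After all, the Linux kernel is a fast moving target. Peter Jay Salzman (that's me) offered to take over maintainership so that at least bug fixes and occasional updates could be made.

3. Nota Bene

Ori's original document was good about supporting earlier versions of Linux, going all the way back to the 2.0 days. I had originally intended to keep with the program, but after thinking about it, opted out. My main reason for keeping compatibility was GNU/Linux distributions like LEAF, which tended to use older kernels. However, even LEAF uses 2.2 and 2.4 kernels these days.

Both Ori and I use the x86 platform. For the most part, the source code and discussions should apply to other architectures, but I can't promise anything. One exception is Chapter 12, Interrupt Handlers, which should not work on any architecture except x86.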

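Chapter 1. Introduction

1.1. What Is A Kernel Module?

So, you want to write a kernel module. You know C, you've written a few normal programs to run as processes, and now you want to get to where the real action is, to where a single wild pointer can wipe out your file system and a core dump means a reboot.

What exactly is a kernel module? Modules are pieces of code that can be loaded into and unloaded from the kernel on demand. They extend the functionality of the kernel without the need to reboot the system. For example, one type of module is the device driver, which allows the kernel to access hardware connected to the system. Without modules, we would have to build monolithic kernels and add new functionality directly into the kernel image. Besides producing larger kernels, this has the disadvantage of requiring us to rebuild and reboot the kernel every time we want new functionality.

1.2. How Do Modules Get Into The Kernel?

You can see which modules are already loaded into the kernel by running lsmod, which gets its information by reading the file /proc/modules.

How do these modules find their way into the kernel? When the kernel needs a feature that is not resident in the kernel, the kernel module daemon kmod[1] execs modprobe to load the module in. modprobe is passed a string in one of two forms:

A module name like softdog or ppp.
A more generic identifier like char-major-10-30.

If modprobe is handed a generic identifier, it first looks for that string in the file /etc/modules.conf. If it finds an alias line like

   alias char-major-10-30 softdog

it knows that the generic identifier refers to the module softdog.o.

Next, modprobe looks through the file /lib/modules/version/modules.dep to see if other modules must be loaded before the requested module may be loaded. This file is created by depmod -a and contains module dependencies. For example, msdos.o requires the fat.o module to be already loaded into the kernel. The requested module has a dependency on another module if the other module defines symbols (variables or functions) that the requested module uses.

Lastly, modprobe uses insmod to first load any prerequisite modules into the kernel, and then the requested module. modprobe directs insmod to /lib/modules/version/[2], the standard directory for modules. insmod is intended to be fairly dumb about the location of modules, whereas modprobe is aware of the default location of modules. So, for example, if you wanted to load the msdos module, you'd have to either run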

    insmod /lib/modules/2.5.1/kernel/fs/fat/fat.o
    insmod /lib/modules/2.5.1/kernel/fs/msdos/msdos.o

or just run "modprobe -a msdos".

Linux distributions provide modprobe, insmod and depmod as a package called modutils or mod-utils.

Before finishing this chapter, let's take a quick look at a piece of /etc/modules.conf:

    #This file is automatically generated by update-modules
    path[misc]=/lib/modules/2.4.?/local
    keep
    path[net]=~p/mymodules
    options mydriver irq=10
    alias eth0 eepro

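Lines beginning with a '#' are comments. Blank lines are ignored.

The path[misc] line tells modprobe to replace the search path for misc modules with the directory /lib/modules/2.4.?/local. As you can see, shell meta characters are honored.

The path[net] line tells modprobe to look for net modules in the directory ~p/mymodules. However, the "keep" directive preceding the path[net] directive tells modprobe to add this directory to the standard search path of net modules, as opposed to replacing the standard search path as we did for the misc modules.

The alias line says to load eepro.o whenever kmod requests that the generic identifier "eth0" be loaded.

You won't see lines like "alias block-major-2 floppy" in /etc/modules.conf because modprobe already knows about the standard drivers which will be used on most systems.

Now you know how modules get into the kernel. There's a bit more to the story if you want to write your own modules which depend on other modules (we call this "stacking modules"). But this will have to wait for a later chapter. We have a lot to cover before addressing this relatively advanced issue.

1.2.1. Before We Begin

Before we delve into code, there are a few issues we need to cover. Everyone's system is different and everyone has their own groove. Getting your first "hello world" program to compile and load correctly can sometimes be a trick. Rest assured, after you get over the initial hurdle of doing it for the first time, it will be smooth sailing thereafter.

1.2.1.1. Modversioning

A module compiled for one kernel won't load if you boot a different kernel unless you enable CONFIG_MODVERSIONS in the kernel. We won't go into module versioning until later in this guide. Until we cover modversions, the examples in the guide may not work if you're running a kernel with modversioning turned on. However, most stock Linux distribution kernels come with it turned on. If you're having trouble loading the modules because of versioning errors, compile a kernel with modversioning turned off.

1.2.1.2. Using X

It is highly recommended that you type in, compile and load all the examples this guide discusses. It's also highly recommended that you do this from a console. You should not be working on this stuff in X.

Modules can't print to the screen like printf() can, but they can log information and warnings, which end up being printed on your screen, but only on a console. If you insmod a module from an xterm, the information and warnings will be logged, but only to your log file. You won't see them unless you look through your log files. To have immediate access to this information, do all your work from the console.

1.2.1.3. Compiling Issues and Kernel Version

Very often, Linux distributions will distribute kernel source that has been patched in various non-standard ways, which may cause trouble.

A more common problem is that some Linux distributions distribute incomplete kernel headers. You'll need to compile your code using various header files from the Linux kernel. Murphy's Law states that the headers that are missing are exactly the ones that you'll need for your module work.

To avoid these two problems, I highly recommend that you download, compile and boot into a fresh, stock Linux kernel which can be downloaded from any of the Linux kernel mirror sites. See the Linux Kernel HOWTO for more details.

Ironically, this can also cause a problem. By default, gcc on your system may look for the kernel headers in their default location rather than where you installed the new copy of the kernel (usually in /usr/src/). This can be fixed by using gcc's -I switch.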

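Chapter 2. Hello World

2.1. Hello, World (part 1): The Simplest Module

When the first caveman programmer chiseled the first program on the walls of the first cave computer, it was a program to paint the string "Hello, world" in Antelope pictures. Roman programming textbooks began with the "Salut, Mundi" program. I don't know what happens to people who break with this tradition, but I think it's safer not to find out. We'll start with a series of hello world programs that demonstrate the different aspects of the basics of writing a kernel module.

Here's the simplest module possible. Don't compile it yet; we'll cover module compilation in the next section.

Example 2-1. hello-1.c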

/*  hello-1.c - The simplest kernel module.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/module.h>  /* Needed by all modules */
#include <linux/kernel.h>  /* Needed for KERN_ALERT */


int init_module(void)
{
   printk("<1>Hello world 1.\n");
	
   // A non 0 return means init_module failed; module can't be loaded.
   return 0;
}


void cleanup_module(void)
{
  printk(KERN_ALERT "Goodbye world 1.\n");
}  

MODULE_LICENSE("GPL");

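Kernel modules must have at least two functions: a "start" (initialization) function called init_module(), which is called when the module is insmoded into the kernel, and an "end" (cleanup) function called cleanup_module(), which is called just before it is rmmoded. Actually, things have changed starting with kernel 2.3.13. You can now use whatever name you like for the start and end functions of a module, and you'll learn how to do this in Section 2.3. In fact, the new method is the preferred method. However, many people still use init_module() and cleanup_module() for their start and end functions.

Typically, init_module() either registers a handler for something with the kernel, or it replaces one of the kernel functions with its own code (usually code to do something and then call the original function). The cleanup_module() function is supposed to undo whatever init_module() did, so the module can be unloaded safely.

Lastly, every kernel module needs to include linux/module.h. We needed to include linux/kernel.h only for the macro expansion for the printk() log level, KERN_ALERT, which you'll learn about in Section 2.1.1.

2.1.1. Introducing printk()

Despite what you might think, printk() was not meant to communicate information to the user, even though we used it for exactly this purpose in hello-1! It happens to be a logging mechanism for the kernel, and is used to log information or give warnings. Therefore, each printk() statement comes with a priority, which is the <1> and KERN_ALERT you see. There are 8 priorities and the kernel has macros for them, so you don't have to use cryptic numbers, and you can view them (and their meanings) in linux/kernel.h. If you don't specify a priority level, the default priority, DEFAULT_MESSAGE_LOGLEVEL, will be used.

Take time to read through the priority macros. The header file also describes what each priority means. In practice, don't use a number, like <4>. Always use the macro, like KERN_WARNING.

If the priority is less than int console_loglevel, the message is printed on your current terminal. If both syslogd and klogd are running, then the message will also get appended to /var/log/messages, whether it got printed to the console or not. We use a high priority, like KERN_ALERT, to make sure the printk() messages get printed to your console rather than just logged to your logfile. When you write real modules, you'll want to use priorities that are meaningful for the situation at hand.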

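2.2. Compiling Kernel Modules

Kernel modules need to be compiled with certain gcc options to make them work. In addition, they also need to be compiled with certain symbols defined. This is because the kernel header files need to behave differently, depending on whether we're compiling a kernel module or an executable. You can define symbols using gcc's -D option, or with the #define preprocessor command. We'll cover what you need to do in order to compile kernel modules in this section.

-c: A kernel module is not an independent executable, but an object file which will be linked into the kernel during runtime using insmod. As a result, modules should be compiled with the -c flag.
-O2: The kernel makes extensive use of inline functions, so modules must be compiled with the optimization flag turned on. Without optimization, some of the assembler macro calls will be mistaken by the compiler for function calls. This will cause loading the module to fail, since insmod won't find those functions in the kernel.
-W -Wall: A programming mistake can take your system down. You should always turn on compiler warnings, and this applies to all your compiling endeavors, not just module compilation.
-isystem /lib/modules/`uname -r`/build/include: You must use the kernel headers of the kernel you're compiling against. Using the default /usr/include/linux won't work.
-D__KERNEL__: Defining this symbol tells the header files that the code will be run in kernel mode, not as a user process.
-DMODULE: This symbol tells the header files to give the appropriate definitions for a kernel module.

We use gcc's -isystem option instead of -I because it tells gcc to suppress some "unused variable" warnings that -W -Wall causes when you include module.h. By using -isystem under gcc-3.0, the kernel header files are treated specially, and the warnings are suppressed. If you instead use -I (or even -isystem under gcc 2.9x), the "unused variable" warnings will be printed. Just ignore them if they are.

So, let's look at a simple Makefile for compiling a module named hello-1.c: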

TARGET  := hello-1
WARN    := -W -Wall -Wstrict-prototypes -Wmissing-prototypes
INCLUDE := -isystem /lib/modules/`uname -r`/build/include
CFLAGS  := -O2 -DMODULE -D__KERNEL__ ${WARN} ${INCLUDE}
CC      := gcc-3.0
	
${TARGET}.o: ${TARGET}.c

.PHONY: clean

clean:
	rm -rf ${TARGET}.o

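As an exercise to the reader, compile hello-1.c and insert it into the kernel with insmod ./hello-1.o (ignore anything you see about tainted kernels; we'll cover that shortly). Neat, eh? All modules loaded into the kernel are listed in /proc/modules. Go ahead and cat that file to see that your module is really a part of the kernel. Congratulations, you are now the author of Linux kernel code! When the novelty wears off, remove your module from the kernel by using rmmod hello-1. Take a look at /var/log/messages just to see that it got logged to your system logfile.

Here's another exercise to the reader. See that comment above the return statement in init_module()? Change the return value to something non-zero, recompile and load the module again. What happens?

2.3. Hello World (part 2)

As of Linux 2.4, you can rename the init and cleanup functions of your modules; they no longer have to be called init_module() and cleanup_module() respectively. This is done with the module_init() and module_exit() macros. These macros are defined in linux/init.h. The only caveat is that your init and cleanup functions must be defined before calling the macros, otherwise you'll get compilation errors. Here's an example of this technique:

Example 2-3. hello-2.c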

/*  hello-2.c - Demonstrating the module_init() and module_exit() macros.  This is
 *     preferred over using init_module() and cleanup_module().
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/module.h>   // Needed by all modules
#include <linux/kernel.h>   // Needed for KERN_ALERT
#include <linux/init.h>     // Needed for the macros


static int hello_2_init(void)
{
   printk(KERN_ALERT "Hello, world 2\n");
   return 0;
}


static void hello_2_exit(void)
{
   printk(KERN_ALERT "Goodbye, world 2\n");
}


module_init(hello_2_init);
module_exit(hello_2_exit);

MODULE_LICENSE("GPL");

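So now we have two real kernel modules under our belt. With productivity as high as ours, we should have a powerful Makefile. Here's a more advanced Makefile which will compile both our modules at the same time. It's optimized for brevity and scalability. If you don't understand it, I urge you to read the makefile info pages or the GNU Make Manual.

Example 2-4. Makefile for both our modules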

WARN    := -W -Wall -Wstrict-prototypes -Wmissing-prototypes
INCLUDE := -isystem /lib/modules/`uname -r`/build/include
CFLAGS  := -O2 -DMODULE -D__KERNEL__ ${WARN} ${INCLUDE}
CC      := gcc-3.0
OBJS    := ${patsubst %.c, %.o, ${wildcard *.c}}

all: ${OBJS}

.PHONY: clean

clean:
	rm -rf *.o

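As an exercise to the reader, if we had another module in the same directory, say hello-3.c, how would you modify this Makefile to automatically compile that module?

2.4. Hello World (part 3): The __init and __exit Macros

This demonstrates a feature of kernel 2.2 and later. Notice the change in the definitions of the init and cleanup functions. The __init macro causes the init function to be discarded and its memory freed once the init function finishes for built-in drivers, but not for loadable modules. If you think about when the init function is invoked, this makes perfect sense.

There is also an __initdata which works similarly to __init but for init variables rather than functions.

The __exit macro causes the function to be omitted when the module is built into the kernel and, like __init, has no effect for loadable modules. Again, if you consider when the cleanup function runs, this makes complete sense; built-in drivers don't need a cleanup function, while loadable modules do.

These macros are defined in linux/init.h and serve to free up kernel memory. When you boot your kernel and see something like Freeing unused kernel memory: 236k freed, this is precisely what the kernel is freeing.

Example 2-5. hello-3.c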

/*  hello-3.c - Illustrating the __init, __initdata and __exit macros.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/module.h>      /* Needed by all modules */
#include <linux/kernel.h>      /* Needed for KERN_ALERT */
#include <linux/init.h>        /* Needed for the macros */

static int hello3_data __initdata = 3;


static int __init hello_3_init(void)
{
   printk(KERN_ALERT "Hello, world %d\n", hello3_data);
   return 0;
}


static void __exit hello_3_exit(void)
{
   printk(KERN_ALERT "Goodbye, world 3\n");
}


module_init(hello_3_init);
module_exit(hello_3_exit);

MODULE_LICENSE("GPL");

By the way, you may see the directive "__initfunction()" in drivers written for Linux 2.2 kernels:

 __initfunction(int init_module(void))
{
   printk(KERN_ALERT "Hi there.\n");
   return 0;
}

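This macro served the same purpose as __init, but is now very deprecated in favor of __init. I only mention it because you might see it in modern kernels. As of 2.4.18, there are 38 references to __initfunction(), and as of 2.4.20, 37 references. However, don't use it in your own code.

2.5. Hello World (part 4): Licensing and Module Documentation

If you're running kernel 2.4 or later, you might have noticed something like this when you loaded the previous example modules: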

# insmod hello-3.o
Warning: loading hello-3.o will taint the kernel: no license
  See http://www.tux.org/lkml/#export-tainted for information about tainted modules
Hello, world 3
Module hello-3 loaded, with warnings

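In kernel 2.4 and later, a mechanism was devised to identify code licensed under the GPL (and friends) so people can be warned when code is not open source. This is accomplished by the MODULE_LICENSE() macro, which is demonstrated in the next piece of code. By setting the license to GPL, you can keep the warning from being printed. This license mechanism is defined and documented in linux/module.h.

Similarly, MODULE_DESCRIPTION() is used to describe what the module does, MODULE_AUTHOR() declares the module's author, and MODULE_SUPPORTED_DEVICE() declares what types of devices the module supports.

These macros are all defined in linux/module.h and aren't used by the kernel itself. They're for documentation only and can be viewed by a tool like objdump. As an exercise to the reader, try grepping through linux/drivers to see how module authors use these macros to document their modules.

Example 2-6. hello-4.c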

/*  hello-4.c - Demonstrates module documentation.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>
#define DRIVER_AUTHOR "Peter Jay Salzman <p@dirac.org>"
#define DRIVER_DESC   "A sample driver"

static int init_hello_4(void);
static void cleanup_hello_4(void);


static int init_hello_4(void)
{
   printk(KERN_ALERT "Hello, world 4\n");
   return 0;
}


static void cleanup_hello_4(void)
{
   printk(KERN_ALERT "Goodbye, world 4\n");
}


module_init(init_hello_4);
module_exit(cleanup_hello_4);


/*  You can use strings, like this:
 */
MODULE_LICENSE("GPL");           // Get rid of taint message by declaring code as GPL.

/*  Or with defines, like this:
 */
MODULE_AUTHOR(DRIVER_AUTHOR);    // Who wrote this module?
MODULE_DESCRIPTION(DRIVER_DESC); // What does this module do?

/*  This module uses /dev/testdevice.  The MODULE_SUPPORTED_DEVICE macro might be used in
 *  the future to help automatic configuration of modules, but is currently unused other
 *  than for documentation purposes.
 */
MODULE_SUPPORTED_DEVICE("testdevice");

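2.6. Passing Command Line Arguments to a Module

Modules can take command line arguments, but not with the argc/argv you might be used to.

To allow arguments to be passed to your module, declare the variables that will take the values of the command line arguments as global and then use the MODULE_PARM() macro (defined in linux/module.h) to set the mechanism up. At runtime, insmod will fill the variables with any command line arguments that are given, like ./insmod mymodule.o myvariable=5. The variable declarations and macros should be placed at the beginning of the module for clarity. The example code should clear up my admittedly lousy explanation.

The MODULE_PARM() macro takes 2 arguments: the name of the variable and its type. The supported variable types are "b": single byte, "h": short int, "i": integer, "l": long int and "s": string, and the integer types can be signed as usual or unsigned. Strings should be declared as "char *" and insmod will allocate memory for them. You should always try to give the variables an initial default value. This is kernel code, and you should program defensively. For example: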

    int myint = 3;
    char *mystr;

    MODULE_PARM(myint, "i");
    MODULE_PARM(mystr, "s");

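Arrays are supported too. An integer value preceding the type in MODULE_PARM will indicate an array of some maximum length. Two numbers separated by a '-' will give the minimum and maximum number of values. For example, an array of shorts with at least 2 and no more than 4 values could be declared as:

    short myshortArray[4];
    MODULE_PARM(myshortArray, "2-4h");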

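A good use for this is to have the module variable's default values set, like a port or IO address. If the variables contain the default values, then perform autodetection (explained elsewhere). Otherwise, keep the current value. This will be made clear later on.

Lastly, there's a macro function, MODULE_PARM_DESC(), that is used to document arguments that the module can take. It takes two parameters: a variable name and a free form string describing that variable.

Example 2-7. hello-5.c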

/*  hello-5.c - Demonstrates command line argument passing to a module.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/init.h>

MODULE_LICENSE("GPL");
MODULE_AUTHOR("Peter Jay Salzman");

// These global variables can be set with command line arguments when you insmod
// the module in. 
//
static u8             mybyte = 'A';
static unsigned short myshort = 1;
static int            myint = 20;
static long           mylong = 9999;
static char           *mystring = "blah";
static int            myintArray[2] = { 0, 420 };

/*  Now we're actually setting the mechanism up -- making the variables command
 *  line arguments rather than just a bunch of global variables.
 */
MODULE_PARM(mybyte, "b");
MODULE_PARM(myshort, "h");
MODULE_PARM(myint, "i");
MODULE_PARM(mylong, "l");
MODULE_PARM(mystring, "s");
MODULE_PARM(myintArray, "1-2i");

MODULE_PARM_DESC(mybyte, "This byte really does nothing at all.");
MODULE_PARM_DESC(myshort, "This short is *extremely* important.");
// You get the picture.  Always use a MODULE_PARM_DESC() for each MODULE_PARM().


static int __init hello_5_init(void)
{
   printk(KERN_ALERT "mybyte is an 8 bit integer: %i\n", mybyte);
   printk(KERN_ALERT "myshort is a short integer: %hi\n", myshort);
   printk(KERN_ALERT "myint is an integer: %i\n", myint);
   printk(KERN_ALERT "mylong is a long integer: %li\n", mylong);
   printk(KERN_ALERT "mystring is a string: %s\n", mystring);
   printk(KERN_ALERT "myintArray is %i and %i\n", myintArray[0], myintArray[1]);
   return 0;
}


static void __exit hello_5_exit(void)
{
   printk(KERN_ALERT "Goodbye, world 5\n");
}


module_init(hello_5_init);
module_exit(hello_5_exit);

I would recommend playing around with this code:

    satan# insmod hello-5.o mystring="bebop" mybyte=255 myintArray=-1
    mybyte is an 8 bit integer: 255
    myshort is a short integer: 1
    myint is an integer: 20
    mylong is a long integer: 9999
    mystring is a string: bebop
    myintArray is -1 and 420
    
    satan# rmmod hello-5
    Goodbye, world 5
    
    satan# insmod hello-5.o mystring="supercalifragilisticexpialidocious" \
    > mybyte=256 myintArray=-1,-1
    mybyte is an 8 bit integer: 0
    myshort is a short integer: 1
    myint is an integer: 20
    mylong is a long integer: 9999
    mystring is a string: supercalifragilisticexpialidocious
    myintArray is -1 and -1
    
    satan# rmmod hello-5
    Goodbye, world 5
    
    satan# insmod hello-5.o mylong=hello
    hello-5.o: invalid argument syntax for mylong: 'h'

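2.7. Modules Spanning Multiple Files

Sometimes it makes sense to divide a kernel module between several source files. In this case, you need to:

In all the source files but one, add the line #define __NO_VERSION__. This is important because module.h normally includes the definition of kernel_version, a global variable with the kernel version the module is compiled for. If you need version.h, you need to include it yourself, because module.h won't do it for you with __NO_VERSION__.
Compile all the source files as usual.
Combine all the object files into a single one. Under x86, use ld -m elf_i386 -r -o <module name.o> <1st src file.o> <2nd src file.o>.

Here's an example of such a kernel module.

Example 2-8. start.c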

/*  start.c - Illustration of multi filed modules
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/kernel.h>       /* We're doing kernel work */
#include <linux/module.h>       /* Specifically, a module */

int init_module(void)
{
  printk("Hello, world - this is the kernel speaking\n");
  return 0;
}

MODULE_LICENSE("GPL");

The next file:

Example 2-9. stop.c

/*  stop.c - Illustration of multi filed modules
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#if defined(CONFIG_MODVERSIONS) && ! defined(MODVERSIONS)
   #include <linux/modversions.h> /* Will be explained later */
   #define MODVERSIONS
#endif        

#define __NO_VERSION__     /* It's not THE file of the kernel module;
                              must come before module.h to take effect */
#include <linux/kernel.h>  /* We're doing kernel work */
#include <linux/module.h>  /* Specifically, a module  */
#include <linux/version.h> /* Not included by module.h because of
                              __NO_VERSION__ */

void cleanup_module(void)
{
   printk("<1>Short is the life of a kernel module\n");
}

And finally, the makefile:

Example 2-10. Makefile for a multi-filed module

CC=gcc
MODCFLAGS := -O -Wall -DMODULE -D__KERNEL__

hello.o: start.o stop.o
	ld -m elf_i386 -r -o hello.o start.o stop.o

start.o: start.c
	${CC} ${MODCFLAGS} -c start.c

stop.o: stop.c
	${CC} ${MODCFLAGS} -c stop.c

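Chapter 3. Preliminaries

3.1. Modules vs Programs

3.1.1. How modules begin and end

A program usually begins with a main() function, executes a bunch of instructions and terminates upon completion of those instructions. Kernel modules work a bit differently. A module always begins with either init_module or the function you specify with the module_init call. This is the entry function for modules; it tells the kernel what functionality the module provides and sets up the kernel to run the module's functions when they're needed. Once it does this, the entry function returns and the module does nothing until the kernel wants to do something with the code that the module provides.

All modules end by calling either cleanup_module or the function you specify with the module_exit call. This is the exit function for modules; it undoes whatever the entry function did. It unregisters the functionality that the entry function registered.

Every module must have an entry function and an exit function. Since there's more than one way to specify entry and exit functions, I'll try my best to use the terms "entry function" and "exit function", but if I slip and simply refer to them as init_module and cleanup_module, I think you'll know what I mean.

3.1.2. Functions available to modules

Programmers use functions they don't define all the time. A prime example of this is printf(). You use these library functions which are provided by the standard C library, libc. The definitions for these functions don't actually enter your program until the linking stage, which ensures that the code (for printf() for example) is available, and fixes the call instruction to point to that code.

Kernel modules are different here, too. In the hello world example, you might have noticed that we used a function, printk(), but didn't include a standard I/O library. That's because modules are object files whose symbols get resolved upon insmod'ing. The definition for the symbols comes from the kernel itself; the only external functions you can use are the ones provided by the kernel. If you're curious about what symbols have been exported by your kernel, take a look at /proc/ksyms.

One point to keep in mind is the difference between library functions and system calls. Library functions are higher level, run completely in user space and provide a more convenient interface for the programmer to the functions that do the real work: system calls. System calls run in kernel mode on the user's behalf and are provided by the kernel itself. The library function printf() may look like a very general printing function, but all it really does is format the data into strings and write the string data using the low-level system call write(), which then sends the data to standard output.

Would you like to see what system calls are made by printf()? It's easy! Compile the following program: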

    #include <stdio.h>
    int main(void)
    { printf("hello"); return 0; }

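with gcc -Wall -o hello hello.c. Run the executable with strace ./hello. Are you impressed? Every line you see corresponds to a system call. strace[3] is a handy program that gives you details about what system calls a program is making, including which call is made, what its arguments are and what it returns. It's an invaluable tool for figuring out things like what files a program is trying to access. Towards the end, you'll see a line which looks like write(1, "hello", 5hello). There it is. The face behind the printf() mask. You may not be familiar with write, since most people use library functions for file I/O (like fopen, fputs, fclose). If that's the case, try looking at man 2 write. The 2nd man section is devoted to system calls (like kill() and read()). The 3rd man section is devoted to library calls, which you would probably be more familiar with (like cosh() and random()).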

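You can even write modules to replace the kernel's system calls, which we'll do shortly. Crackers often make use of this sort of thing for backdoors or trojans, but you can write your own modules to do more benign things, like have the kernel write "Tee hee, that tickles!" every time someone tries to delete a file on your system.

3.1.3. User Space vs Kernel Space

A kernel is all about access to resources, whether the resource in question happens to be a video card, a hard drive or even memory. Programs often compete for the same resource. As I just saved this document, updatedb started updating the locate database. My vim session and updatedb are both using the hard drive concurrently. The kernel needs to keep things orderly, and not give users access to resources whenever they feel like it. To this end, a CPU can run in different modes. Each mode gives a different level of freedom to do what you want on the system. The Intel 80386 architecture has 4 of these modes, which are called rings. Unix uses only two rings: the highest ring (ring 0, also known as "supervisor mode", where everything is allowed to happen) and the lowest ring, which is called "user mode".

Recall the discussion about library functions vs system calls. Typically, you use a library function in user mode. The library function calls one or more system calls, and these system calls execute on the library function's behalf, but do so in supervisor mode since they are part of the kernel itself. Once the system call completes its task, it returns and execution gets transferred back to user mode.

3.1.4. Name Space

When you write a small C program, you use variables which are convenient and make sense to the reader. If, on the other hand, you're writing routines which will be part of a bigger problem, any global variables you have are part of a community of other people's global variables; some of the variable names can clash. When a program has lots of global variables which aren't meaningful enough to be distinguished, you get namespace pollution. In large projects, effort must be made to remember reserved names, and to find ways to develop a scheme for naming unique variable names and symbols.

When writing kernel code, even the smallest module will be linked against the entire kernel, so this is definitely an issue. The best way to deal with this is to declare all your variables as static and to use a well-defined prefix for your symbols. By convention, all kernel prefixes are lowercase. If you don't want to declare everything as static, another option is to declare a symbol table and register it with the kernel. We'll get to this later.

The file /proc/ksyms holds all the symbols that the kernel knows about and which are therefore accessible to your modules since they share the kernel's codespace.

3.1.5. Code space

Memory management is a very complicated subject; the majority of O'Reilly's "Understanding The Linux Kernel" is just on memory management! We're not setting out to be experts on memory management, but we do need to know a couple of facts to even begin worrying about writing real modules.

If you haven't thought about what a segfault really means, you may be surprised to hear that pointers don't actually point to memory locations. Not real ones, anyway. When a process is created, the kernel sets aside a portion of real physical memory and hands it to the process to use for its executing code, variables, stack, heap and other things which a computer scientist would know about[4]. This memory begins with 0 and extends up to whatever it needs to be. Since the memory spaces of any two processes don't overlap, every process that can access a memory address, say 0xbffff978, would be accessing a different location in real physical memory! The processes would be accessing an index named 0xbffff978 which points to some kind of offset into the region of memory set aside for that particular process. For the most part, a process like our Hello, World program can't access the space of another process, although there are ways which we'll talk about later.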

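The kernel has its own space of memory as well. Since a module is code which can be dynamically inserted and removed in the kernel (as opposed to a semi-autonomous object), it shares the kernel's codespace rather than having its own. Therefore, if your module segfaults, the kernel segfaults. And if you start writing over data because of an off-by-one error, then you're trampling on kernel code. This is even worse than it sounds, so try your best to be careful.

By the way, I would like to point out that the above discussion is true for any operating system which uses a monolithic kernel[5]. There are things called microkernels which have modules which get their own codespace. The GNU Hurd and QNX Neutrino are two examples of a microkernel.

3.1.6. Device Drivers

One class of module is the device driver, which provides functionality for hardware like a TV card or a serial port. On unix, each piece of hardware is represented by a file located in /dev named a device file, which provides the means to communicate with the hardware. The device driver provides the communication on behalf of a user program. So the es1370.o sound card device driver might connect the /dev/sound device file to the Ensoniq ES1370 sound card. A userspace program like mp3blaster can use /dev/sound without ever knowing what kind of sound card is installed.

3.1.6.1. Major and Minor Numbers

Let's look at some device files. Here are device files which represent the first three partitions on the primary master IDE hard drive: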

    # ls -l /dev/hda[1-3]
    brw-rw----  1 root  disk  3, 1 Jul  5  2000 /dev/hda1
    brw-rw----  1 root  disk  3, 2 Jul  5  2000 /dev/hda2
    brw-rw----  1 root  disk  3, 3 Jul  5  2000 /dev/hda3

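Notice the column of numbers separated by a comma? The first number is called the device's major number. The second number is the minor number. The major number tells you which driver is used to access the hardware. Each driver is assigned a unique major number; all device files with the same major number are controlled by the same driver. All the above major numbers are 3, because they're all controlled by the same driver.

The minor number is used by the driver to distinguish between the various hardware it controls. Returning to the example above, although all three devices are handled by the same driver they have unique minor numbers because the driver sees them as being different pieces of hardware.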

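Devices are divided into two types: character devices and block devices. The difference is that block devices have a buffer for requests, so they can choose the best order in which to respond to the requests. This is important in the case of storage devices, where it's faster to read or write sectors which are close to each other, rather than those which are further apart. Another difference is that block devices can only accept input and return output in blocks (whose size can vary according to the device), whereas character devices are allowed to use as many or as few bytes as they like. Most devices in the world are character devices, because they don't need this type of buffering, and they don't operate with a fixed block size. You can tell whether a device file is for a block device or a character device by looking at the first character in the output of ls -l. If it's 'b' then it's a block device, and if it's 'c' then it's a character device. The devices you see above are block devices. Here are some character devices (the serial ports):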

    crw-rw----  1 root  dial 4, 64 Feb 18 23:34 /dev/ttyS0
    crw-r-----  1 root  dial 4, 65 Nov 17 10:26 /dev/ttyS1
    crw-rw----  1 root  dial 4, 66 Jul  5  2000 /dev/ttyS2
    crw-rw----  1 root  dial 4, 67 Jul  5  2000 /dev/ttyS3

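If you want to see which major numbers have been assigned, you can look at /usr/src/linux/Documentation/devices.txt.

When the system was installed, all of those device files were created by the mknod command. To create a new char device named "coffee" with major/minor number 12 and 2, simply do mknod /dev/coffee c 12 2. You don't have to put your device files into /dev, but it's done by convention. Linus put his device files in /dev, and so should you. However, when creating a device file for testing purposes, it's probably OK to place it in your working directory where you compile the kernel module. Just be sure to put it in the right place when you're done writing the device driver.

I would like to make a few last points which are implicit in the above discussion, but I'd like to make them explicit just in case. When a device file is accessed, the kernel uses the major number of the file to determine which driver should be used to handle the access. This means that the kernel doesn't really need to use or even know about the minor number. The driver itself is the only thing that cares about the minor number. It uses the minor number to distinguish between different pieces of hardware.

By the way, when I say "hardware", I mean something a bit more abstract than a PCI card that you can hold in your hand. Look at these two device files: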

    % ls -l /dev/fd0 /dev/fd0u1680
    brwxrwxrwx   1 root  floppy   2,  0 Jul  5  2000 /dev/fd0
    brw-rw----   1 root  floppy   2, 44 Jul  5  2000 /dev/fd0u1680

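By now you can look at these two device files and know instantly that they are block devices and are handled by the same driver (block major 2). You might even be aware that these both represent your floppy drive, even if you only have one floppy drive. Why two files? One represents the floppy drive with 1.44 MB of storage. The other one is the same floppy drive with 1.68 MB of storage, and corresponds to what some people call a "superformatted" disk, one that holds more data than a standard formatted floppy. So here's a case where two device files with different minor numbers actually represent the same piece of physical hardware. Just be aware that the word "hardware" in our discussion can mean something very abstract.

Chapter 4. Character Device Files

4.1. Character Device Drivers

4.1.1. The file_operations Structure

The file_operations structure is defined in linux/fs.h, and holds pointers to functions defined by the driver that perform various operations on the device. Each field of the structure corresponds to the address of some function defined by the driver to handle a requested operation.

For example, every character driver needs to define a function that reads from the device. The file_operations structure holds the address of the module's function that performs that operation. Here is what the definition looks like for kernel 2.4.2: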

    struct file_operations {
       struct module *owner;
       loff_t (*llseek) (struct file *, loff_t, int);
       ssize_t (*read) (struct file *, char *, size_t, loff_t *);
       ssize_t (*write) (struct file *, const char *, size_t, loff_t *);
       int (*readdir) (struct file *, void *, filldir_t);
       unsigned int (*poll) (struct file *, struct poll_table_struct *);
       int (*ioctl) (struct inode *, struct file *, unsigned int, unsigned long);
       int (*mmap) (struct file *, struct vm_area_struct *);
       int (*open) (struct inode *, struct file *);
       int (*flush) (struct file *);
       int (*release) (struct inode *, struct file *);
       int (*fsync) (struct file *, struct dentry *, int datasync);
       int (*fasync) (int, struct file *, int);
       int (*lock) (struct file *, int, struct file_lock *);
       ssize_t (*readv) (struct file *, const struct iovec *, unsigned long,
          loff_t *);
       ssize_t (*writev) (struct file *, const struct iovec *, unsigned long,
          loff_t *);
    };

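Some operations are not implemented by a driver. For example, a driver that handles a video card won't need to read from a directory structure. The corresponding entries in the file_operations structure should be set to NULL.

There is a gcc extension that makes assigning to this structure more convenient. You'll see it in modern drivers, and it may catch you by surprise. This is what the new way of assigning to the structure looks like: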

    struct file_operations fops = {
       read: device_read,
       write: device_write,
       open: device_open,
       release: device_release
    };

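However, there's also a C99 way of assigning to elements of a structure, and this is definitely preferred over using the GNU extension. The version of gcc I'm currently using, 2.95, supports the new C99 syntax. You should use this syntax in case someone wants to port your driver. It will help with compatibility.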

    struct file_operations fops = {
       .read = device_read,
       .write = device_write,
       .open = device_open,
       .release = device_release
    };

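The meaning is clear, and you should be aware that any member of the structure which you don't explicitly assign will be initialized to NULL by gcc.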

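A pointer to a struct file_operations is commonly named fops.

4.1.2. The file structure

Each device is represented in the kernel by a file structure, which is defined in linux/fs.h. Be aware that a file is a kernel level structure and never appears in a user space program. It's not the same thing as a FILE, which is defined by glibc and would never appear in a kernel space function. Also, its name is a bit misleading; it represents an abstract open "file", not a file on a disk, which is represented by a structure named inode.

A pointer to a struct file is commonly named filp. You'll also see it referred to as struct file file. Resist the temptation.

Go ahead and look at the definition of file. Most of the entries you see, like struct dentry, aren't used by device drivers, and you can ignore them. This is because drivers don't fill file directly; they only use structures contained in file which are created elsewhere.

4.1.3. Registering A Device

As discussed earlier, char devices are accessed through device files, usually located in /dev[6]. The major number tells you which driver handles which device file. The minor number is used only by the driver itself to differentiate which device it's operating on, just in case the driver handles more than one device.

Adding a driver to your system means registering it with the kernel. This is synonymous with assigning it a major number during the module's initialization. You do this by using the register_chrdev function, defined by linux/fs.h.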

    int register_chrdev(unsigned int major, const char *name,
       struct file_operations *fops);

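where unsigned int major is the major number you want to request, const char *name is the name of the device as it'll appear in /proc/devices and struct file_operations *fops is a pointer to the file_operations table for your driver. A negative return value means the registration failed. Note that we didn't pass the minor number to register_chrdev. That's because the kernel doesn't care about the minor number; only our driver uses it.

Now the question is, how do you get a major number without hijacking one that's already in use? The easiest way would be to look through Documentation/devices.txt and pick an unused one. That's a bad way of doing things because you'll never be sure if the number you picked will be assigned later. The answer is that you can ask the kernel to assign you a dynamic major number.

If you pass a major number of 0 to register_chrdev, the return value will be the dynamically allocated major number. The downside is that you can't make a device file in advance, since you don't know what the major number will be. There are a couple of ways to deal with this. First, the driver itself can print the newly assigned number and we can make the device file by hand. Second, the newly registered device will have an entry in /proc/devices, and we can either make the device file by hand or write a shell script to read the file in and make the device file. The third method is to have our driver make the device file using the mknod system call after a successful registration and remove it during the call to cleanup_module.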

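4.1.4. Unregistering A Device

We can't allow the kernel module to be rmmod'ed whenever root feels like it. If the device file is opened by a process and then we remove the kernel module, using the file would cause a call to the memory location where the appropriate function (read/write) used to be. If we're lucky, no other code was loaded there, and we'll get an ugly error message. If we're unlucky, another kernel module was loaded into the same location, which means a jump into the middle of another function within the kernel. The results of this would be impossible to predict, but they can't be very positive.

Normally, when you don't want to allow something, you return an error code (a negative number) from the function which is supposed to do it. With cleanup_module that's impossible because it's a void function. However, there's a counter which keeps track of how many processes are using your module. You can see what its value is by looking at the 3rd field of /proc/modules. If this number isn't zero, rmmod will fail. Note that you don't have to check the counter from within cleanup_module because the check will be performed for you by the system call sys_delete_module, defined in kernel/module.c. You shouldn't use this counter directly, but there are macros defined in linux/module.h which let you increase, decrease and display this counter:

MOD_INC_USE_COUNT: Increment the use count.
MOD_DEC_USE_COUNT: Decrement the use count.
MOD_IN_USE: Display the use count.

It's important to keep the counter accurate; if you ever do lose track of the correct usage count, you'll never be able to unload the module; it's now reboot time, boys and girls. This is bound to happen to you sooner or later during a module's development.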

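4.1.5. chardev.c

The next code sample creates a char driver named chardev. You can cat its device file (or open the file with a program) and the driver will put the number of times the device file has been read from into the file. We don't support writing to the file (like echo "hi" > /dev/hello), but catch these attempts and tell the user that the operation isn't supported. Don't worry if you don't see what we do with the data we read into the buffer; we don't do much with it. We simply read in the data and print a message acknowledging that we received it.

Example 4-1. chardev.c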

/*  chardev.c: Creates a read-only char device that says how many times
 *  you've read from the dev file
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#if defined(CONFIG_MODVERSIONS) && ! defined(MODVERSIONS)
   #include <linux/modversions.h>
   #define MODVERSIONS
#endif
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/fs.h>
#include <asm/uaccess.h>  /* for put_user */
#include <asm/errno.h>

/*  Prototypes - this would normally go in a .h file */
int init_module(void);
void cleanup_module(void);
static int device_open(struct inode *, struct file *);
static int device_release(struct inode *, struct file *);
static ssize_t device_read(struct file *, char *, size_t, loff_t *);
static ssize_t device_write(struct file *, const char *, size_t, loff_t *);

#define SUCCESS 0
#define DEVICE_NAME "chardev" /* Dev name as it appears in /proc/devices   */
#define BUF_LEN 80            /* Max length of the message from the device */


/* Global variables are declared as static, so are global within the file. */

static int Major;            /* Major number assigned to our device driver */
static int Device_Open = 0;  /* Is device open?  Used to prevent multiple
                                        access to the device */
static char msg[BUF_LEN];    /* The msg the device will give when asked    */
static char *msg_Ptr;

static struct file_operations fops = {
  .read = device_read,
  .write = device_write,
  .open = device_open,
  .release = device_release
};


/* Functions */

int init_module(void)
{
   Major = register_chrdev(0, DEVICE_NAME, &fops);

   if (Major < 0) {
     printk ("Registering the character device failed with %d\n", Major);
     return Major;
   }

   printk("<1>I was assigned major number %d.  To talk to\n", Major);
   printk("<1>the driver, create a dev file with\n");
   printk("'mknod /dev/hello c %d 0'.\n", Major);
   printk("<1>Try various minor numbers.  Try to cat and echo to\n");
   printk("the device file.\n");
   printk("<1>Remove the device file and module when done.\n");

   return 0;
}


void cleanup_module(void)
{
   /* Unregister the device */
   int ret = unregister_chrdev(Major, DEVICE_NAME);
   if (ret < 0) printk("Error in unregister_chrdev: %d\n", ret);
}


/* Methods */

/* Called when a process tries to open the device file, like
 * "cat /dev/mycharfile"
 */
static int device_open(struct inode *inode, struct file *file)
{
   static int counter = 0;
   if (Device_Open) return -EBUSY;

   Device_Open++;
   sprintf(msg,"I already told you %d times Hello world!\n", counter++);
   msg_Ptr = msg;
   MOD_INC_USE_COUNT;

   return SUCCESS;
}


/* Called when a process closes the device file */
static int device_release(struct inode *inode, struct file *file)
{
   Device_Open --;     /* We're now ready for our next caller */

   /* Decrement the usage count, or else once you opened the file, you'll
      never get rid of the module. */
   MOD_DEC_USE_COUNT;

   return 0;
}


/* Called when a process, which already opened the dev file, attempts to
   read from it.
*/
static ssize_t device_read(struct file *filp,
   char *buffer,    /* The buffer to fill with data */
   size_t length,   /* The length of the buffer     */
   loff_t *offset)  /* Our offset in the file       */
{
   /* Number of bytes actually written to the buffer */
   int bytes_read = 0;

   /* If we're at the end of the message, return 0 signifying end of file */
   if (*msg_Ptr == 0) return 0;

   /* Actually put the data into the buffer */
   while (length && *msg_Ptr)  {

        /* The buffer is in the user data segment, not the kernel segment;
         * assignment won't work.  We have to use put_user which copies data from
         * the kernel data segment to the user data segment. */
         put_user(*(msg_Ptr++), buffer++);

         length--;
         bytes_read++;
   }

   /* Most read functions return the number of bytes put into the buffer */
   return bytes_read;
}


/*  Called when a process writes to dev file: echo "hi" > /dev/hello */
static ssize_t device_write(struct file *filp,
   const char *buff,
   size_t len,
   loff_t *off)
{
   printk ("<1>Sorry, this operation isn't supported.\n");
   return -EINVAL;
}

MODULE_LICENSE("GPL");

4.1.6. 为多个内核版本编写模块

系统调用是内核向进程显示的主要接口，通常在不同版本之间保持不变。可能会添加新的系统调用，但通常旧的系统调用会像以前一样运行。这对于向后兼容性是必要的——新的内核版本不应该破坏常规进程。在大多数情况下，设备文件也将保持不变。另一方面，内核内部的接口在不同版本之间可能会发生变化，并且确实会发生变化。

Linux 内核版本分为稳定版本（n.$<$偶数$>$.m）和开发版本（n.$<$奇数$>$.m）。开发版本包含所有很酷的新想法，包括那些将在下一个版本中被认为是错误或重新实现的想法。因此，你不能相信这些版本中的接口会保持不变（这就是为什么我不打算在本书中支持它们的原因，这太费力了，而且很快就会过时）。另一方面，在稳定版本中，我们可以期望接口保持不变，而与错误修复版本（m 数字）无关。

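There are differences between kernel versions, and if you want to support multiple kernel versions, you'll find yourself having to code conditional compilation directives. The way to do this is to compare the macro LINUX_VERSION_CODE to the macro KERNEL_VERSION. In version a.b.c of the kernel, the value of LINUX_VERSION_CODE is $2^{16}a+2^{8}b+c$, which is exactly what KERNEL_VERSION(a,b,c) computes. Be aware that the KERNEL_VERSION macro is not defined for kernel 2.0.35 and earlier, so if you want to write modules that support really old kernels, you'll have to define it yourself, like:

Example 4-2. Defining KERNEL_VERSION for old kernels

    #ifndef KERNEL_VERSION
        #define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
    #endif

Because these are macros, the guard tests for the existence of KERNEL_VERSION with #ifndef instead of comparing kernel versions. A version comparison could not work here, since it would need the very macro that is missing.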
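To make the arithmetic concrete, here is a small user-space sketch of the version encoding. The DEMO_ and demo_ names are made up for this illustration and are not part of any kernel header; the only thing taken from the text is the formula itself.

```c
#include <assert.h>

/* Same arithmetic as the KERNEL_VERSION macro in the text. */
#define DEMO_KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))

/* Split a version code back into a.b.c (hypothetical helpers). */
static int demo_major(long code) { return (int)(code >> 16); }
static int demo_minor(long code) { return (int)((code >> 8) & 0xff); }
static int demo_patch(long code) { return (int)(code & 0xff); }

/* Non-zero when a module would take its ">= 2.2.0" branch. */
static int demo_is_at_least_2_2(long running_code)
{
    assert(running_code >= 0);
    return running_code >= DEMO_KERNEL_VERSION(2,2,0);
}
```

For kernel 2.4.0 the code is 2*65536 + 4*256 + 0 = 132096, and because the major number sits in the highest bits, an ordinary integer comparison orders versions correctly as long as each component stays below 256.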

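Chapter 5. The /proc File System

5.1. The /proc File System

In Linux there is an additional mechanism for the kernel and kernel modules to send information to processes: the /proc file system. Originally designed to allow easy access to information about processes (hence the name), it is now used by every bit of the kernel which has something interesting to report, such as /proc/modules, which has the list of modules, and /proc/meminfo, which has memory usage statistics.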
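Seen from the process side, a /proc file is read like any other text file. The sketch below assumes the usual "Key: value kB" shape of /proc/meminfo lines; the demo_ function names are invented for this example, and the exact set of keys depends on the kernel you run.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Parse one "Key:   value kB" line. Returns 1 on success, 0 if the
 * line does not have that shape. */
static int demo_parse_meminfo(const char *line, char key[64], long *kb)
{
    assert(line != NULL && key != NULL && kb != NULL);
    return sscanf(line, "%63[^:]: %ld", key, kb) == 2;
}

/* Scan a meminfo-style file for one key. Returns its value, or -1 if
 * the file cannot be opened or the key is not present. */
static long demo_find_kb(const char *path, const char *wanted)
{
    char line[256], key[64];
    long kb, result = -1;
    FILE *f = fopen(path, "r");

    if (f == NULL)
        return -1;
    while (fgets(line, sizeof line, f) != NULL) {
        if (demo_parse_meminfo(line, key, &kb) && strcmp(key, wanted) == 0) {
            result = kb;
            break;
        }
    }
    fclose(f);
    return result;
}
```

A call such as demo_find_kb("/proc/meminfo", "MemTotal") would be the typical use on a running system.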

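The method to use the proc file system is very similar to the one used with device drivers: you create a structure with all the information needed for the /proc file, including pointers to any handler functions (in our case there is only one, the one called when somebody attempts to read from the /proc file). Then, init_module registers the structure with the kernel and cleanup_module unregisters it.

The reason we use proc_register_dynamic[7] is that we don't want to determine the inode number used for our file in advance, but to allow the kernel to determine it to prevent clashes. Normal file systems are located on a disk, rather than just in memory (which is where /proc is), and in that case the inode number is a pointer to a disk location where the file's index-node (inode for short) is located. The inode contains information about the file, for example the file's permissions, together with a pointer to the disk location or locations where the file's data can be found.

Because we don't get called when the file is opened or closed, there's nowhere for us to put MOD_INC_USE_COUNT and MOD_DEC_USE_COUNT in this module, and if the file is opened and then the module is removed, there's no way to avoid the consequences. In the next chapter we'll see a harder to implement, but more flexible, way of dealing with /proc files which will allow us to protect against this problem as well.

Example 5-1. procfs.c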

/*  procfs.c -  create a "file" in /proc 
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  08/02/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif        


/* Necessary because we use the proc fs */
#include <linux/proc_fs.h>



/* In 2.2.3 /usr/include/linux/version.h includes a 
 * macro for this, but 2.0.35 doesn't - so I add it 
 * here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif



/* Put data into the proc fs file.

   Arguments
   =========
   1. The buffer where the data is to be inserted, if 
      you decide to use it.
   2. A pointer to a pointer to characters. This is 
      useful if you don't want to use the buffer 
      allocated by the kernel.
   3. The current position in the file. 
   4. The size of the buffer in the first argument.  
   5. Zero (for future use?).


   Usage and Return Value
   ======================
   If you use your own buffer, like I do, put its 
   location in the second argument and return the 
   number of bytes used in the buffer.

   A return value of zero means you have no further 
   information at this time (end of file). A negative 
   return value is an error condition.
   

   For More Information
   ==================== 
   The way I discovered what to do with this function 
   wasn't by reading documentation, but by reading the 
   code which used it. I just looked to see what uses 
   the get_info field of proc_dir_entry struct (I used a 
   combination of find and grep, if you're interested), 
   and I saw that  it is used in <kernel source 
   directory>/fs/proc/array.c.

   If something is unknown about the kernel, this is 
   usually the way to go. In Linux we have the great 
   advantage of having the kernel source code for 
   free - use it.
 */
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,4,0)
int procfile_read(char *buffer,
                  char **buffer_location, off_t offset,
                  int buffer_length, int *eof, void *data)
#else
int procfile_read(char *buffer, 
		  char **buffer_location, 
		  off_t offset, 
		  int buffer_length, 
		  int zero)
#endif
{
  int len;  /* The number of bytes actually used */

  /* This is static so it will still be in memory 
   * when we leave this function */
  static char my_buffer[80];  

  static int count = 1;

  /* We give all of our information in one go, so if the 
   * user asks us if we have more information the 
   * answer should always be no. 
   *
   * This is important because the standard read 
   * function from the library would continue to issue 
   * the read system call until the kernel replies
   * that it has no more information, or until its 
   * buffer is filled.
   */
  if (offset > 0)
    return 0;

  /* Fill the buffer and get its length */
  len = sprintf(my_buffer, 
    "For the %d%s time, go away!\n", count,
    (count % 100 > 10 && count % 100 < 14) ? "th" : 
      (count % 10 == 1) ? "st" :
        (count % 10 == 2) ? "nd" :
          (count % 10 == 3) ? "rd" : "th" );
  count++;

  /* Tell the function which called us where the 
   * buffer is */
  *buffer_location = my_buffer;

  /* Return the length */
  return len;
}

#if LINUX_VERSION_CODE > KERNEL_VERSION(2,4,0)
struct proc_dir_entry *Our_Proc_File;
#else
struct proc_dir_entry Our_Proc_File = 
  {
    0, /* Inode number - ignore, it will be filled by 
        * proc_register[_dynamic] */
    4, /* Length of the file name */
    "test", /* The file name */
    S_IFREG | S_IRUGO, /* File mode - this is a regular 
                        * file which can be read by its 
                        * owner, its group, and everybody
                        * else */
    1,	/* Number of links (directories where the 
         * file is referenced) */
    0, 0,  /* The uid and gid for the file - we give it 
            * to root */
    80, /* The size of the file reported by ls. */
    NULL, /* functions which can be done on the inode 
           * (linking, removing, etc.) - we don't 
           * support any. */
    (struct file_operations *) procfile_read, /* The read function for this file, 
                    * the function called when somebody 
                    * tries to read something from it. */
    NULL /* We could have here a function to fill the 
          * file's inode, to enable us to play with 
          * permissions, ownership, etc. */
  }; 
#endif





/* Initialize the module - register the proc file */
int init_module()
{
  /* Success if proc_register[_dynamic] is a success, 
   * failure otherwise. */
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
  /* In version 2.2, proc_register assigns a dynamic 
   * inode number automatically if it is zero in the 
   * structure, so there's no more need for 
   * proc_register_dynamic
   */
  #if LINUX_VERSION_CODE > KERNEL_VERSION(2,4,0)
	Our_Proc_File=create_proc_read_entry("test", 0444, NULL, procfile_read, NULL);

	if ( Our_Proc_File == NULL )
		return -ENOMEM;
	else
		return 0;
  #else
  	return proc_register(&proc_root, &Our_Proc_File);
  #endif
#else
  return proc_register_dynamic(&proc_root, &Our_Proc_File);
#endif
 
  /* proc_root is the root directory for the proc 
   * fs (/proc). This is where we want our file to be 
   * located. 
   */
}


/* Cleanup - unregister our file from /proc */
void cleanup_module()
{
  #if LINUX_VERSION_CODE > KERNEL_VERSION(2,4,0)
	remove_proc_entry("test", NULL);
  #else
  	proc_unregister(&proc_root, Our_Proc_File.low_ino);
  #endif
}  

MODULE_LICENSE("GPL");
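The comment in procfile_read about returning 0 when offset is greater than zero is easier to follow with the caller in view. What follows is a user-space imitation only: demo_read stands in for procfile_read, and drain plays the role of a library read loop. Both names and the message text are invented for this sketch.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Stand-in for procfile_read: the whole message is produced on the
 * first call (offset 0); any later offset reports end of file. */
static int demo_read(char **buffer_location, long offset)
{
    static char my_buffer[80];
    static int count = 1;
    int len;

    if (offset > 0)
        return 0;               /* everything was delivered already */

    len = snprintf(my_buffer, sizeof my_buffer,
                   "Call number %d\n", count++);
    *buffer_location = my_buffer;
    return len;
}

/* Keeps asking until 0 comes back, the way a reader of the file would.
 * Returns the number of bytes collected into out. */
static size_t drain(char *out, size_t out_size)
{
    size_t total = 0;
    long offset = 0;
    char *chunk;
    int n;

    assert(out_size > 0);
    while ((n = demo_read(&chunk, offset)) > 0) {
        if (total + (size_t)n >= out_size)
            break;              /* keep room for the terminator */
        memcpy(out + total, chunk, (size_t)n);
        total += (size_t)n;
        offset += n;
    }
    out[total] = '\0';
    return total;
}
```

Without the offset check, demo_read would hand out a fresh message on every call and drain would only stop when its buffer filled up, which is the endless read the comment warns about.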

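Chapter 6. Using /proc For Input

6.1. Using /proc For Input

So far we have two ways to generate output from kernel modules: we can register a device driver and mknod a device file, or we can create a /proc file. This allows the kernel module to tell us anything it likes. The only problem is that there is no way for us to talk back. The first way we'll send input to kernel modules will be by writing back to the /proc file.

Because the proc filesystem was written mainly to allow the kernel to report its situation to processes, there are no special provisions for input. The struct proc_dir_entry doesn't include a pointer to an input function, the way it includes a pointer to an output function. Instead, to write into a /proc file, we need to use the standard filesystem mechanism.

In Linux there is a standard mechanism for file system registration. Since every file system has to have its own functions to handle inode and file operations[8], there is a special structure to hold pointers to all those functions, struct inode_operations, which includes a pointer to struct file_operations. In /proc, whenever we register a new file, we're allowed to specify which struct inode_operations will be used for access to it. This is the mechanism we use: a struct inode_operations which includes a pointer to a struct file_operations, which in turn includes pointers to our module_input and module_output functions.

It's important to note that the standard roles of read and write are reversed in the kernel. Read functions are used for output, whereas write functions are used for input. The reason is that read and write refer to the user's point of view: if a process reads something from the kernel, then the kernel needs to output it, and if a process writes something to the kernel, then the kernel receives it as input.

Another interesting point here is the module_permission function. This function is called whenever a process tries to do something with the /proc file, and it can decide whether to allow access or not. Right now it is only based on the operation and the uid of the current user (as available in current, a pointer to a structure which includes information on the currently running process), but it could be based on anything we like, such as what other processes are doing with the same file, the time of day, or the last input we received.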
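The decision rule is small enough to try out in user space. In this sketch the effective uid is passed in as a parameter, whereas the module reads it from current; the OP_ constants are my own names for the mask values 1, 2 and 4 (execute, write, read), and demo_permission is a hypothetical function.

```c
#include <assert.h>
#include <errno.h>

#define OP_EXEC  1
#define OP_WRITE 2
#define OP_READ  4

/* Returns 0 to allow the operation, -EACCES to refuse it.
 * This sketch only handles one operation per call. */
static int demo_permission(int op, unsigned int euid)
{
    assert(op == OP_EXEC || op == OP_WRITE || op == OP_READ);

    /* Everybody may read; only root (uid 0) may write. */
    if (op == OP_READ || (op == OP_WRITE && euid == 0))
        return 0;

    /* Anything else, including execute, is denied. */
    return -EACCES;
}
```

Basing the answer on something else, such as the time of day, would only mean adding another condition before the final return.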

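The reason for put_user and get_user is that Linux memory (under Intel architecture, it may be different under some other processors) is segmented. This means that a pointer, by itself, does not reference a unique location in memory, only a location in a memory segment, and you need to know which memory segment it is to be able to use it. There is one memory segment for the kernel, and one for each of the processes.

The only memory segment accessible to a process is its own, so when writing regular programs to run as processes, there's no need to worry about segments. When you write a kernel module, normally you want to access the kernel memory segment, which is handled automatically by the system. However, when the content of a memory buffer needs to be passed between the currently running process and the kernel, the kernel function receives a pointer to the memory buffer which is in the process segment. The put_user and get_user macros allow you to access that memory.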
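The copying pattern that goes with these macros is a bounded, byte-by-byte loop. The sketch below imitates it in user space: demo_get_user is a plain assignment standing in for get_user, so it shows only the shape of the loop (length limit plus terminator), not the segment crossing itself. All names are made up for the illustration.

```c
#include <assert.h>
#include <stddef.h>

#define MESSAGE_LENGTH 80

/* Stand-in for get_user(): fetch one byte from the "user" buffer. In the
 * kernel this is the step that reaches into the process's memory. */
static void demo_get_user(char *dst, const char *user_src)
{
    *dst = *user_src;
}

/* Copy at most MESSAGE_LENGTH-1 bytes and always terminate the result.
 * Returns the number of input bytes consumed. */
static size_t demo_input(char *message, const char *user_buf, size_t length)
{
    size_t i;

    assert(message != NULL && user_buf != NULL);
    for (i = 0; i < MESSAGE_LENGTH - 1 && i < length; i++)
        demo_get_user(&message[i], user_buf + i);
    message[i] = '\0';
    return i;
}
```

The return value tells the caller how much was accepted, so an input longer than the buffer is truncated rather than written past the end.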

Example 6-1. procfs.c

/*  procfs.c -  create a "file" in /proc, which allows both input and output.
 */

#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */

/* Necessary because we use proc fs */
#include <linux/proc_fs.h>


/* In 2.2.3 /usr/include/linux/version.h includes a 
 * macro for this, but 2.0.35 doesn't - so I add it 
 * here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif



#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>  /* for get_user and put_user */
#endif

/* The module's file functions ********************** */


/* Here we keep the last message received, to prove 
 * that we can process our input */
#define MESSAGE_LENGTH 80
static char Message[MESSAGE_LENGTH];


/* Since we use the file operations struct, we can't 
 * use the special proc output provisions - we have to 
 * use a standard read function, which is this function */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t module_output(
    struct file *file,   /* The file read */
    char *buf, /* The buffer to put data to (in the
                * user segment) */
    size_t len,  /* The length of the buffer */
    loff_t *offset) /* Offset in the file - ignore */
#else
static int module_output(
    struct inode *inode, /* The inode read */
    struct file *file,   /* The file read */
    char *buf, /* The buffer to put data to (in the
                * user segment) */
    int len)  /* The length of the buffer */
#endif
{
  static int finished = 0;
  int i;
  char message[MESSAGE_LENGTH+30];

  /* We return 0 to indicate end of file, that we have 
   * no more information. Otherwise, processes will 
   * continue to read from us in an endless loop. */
  if (finished) {
    finished = 0;
    return 0;
  }

  /* We use put_user to copy the string from the kernel's 
   * memory segment to the memory segment of the process 
   * that called us. get_user, BTW, is
   * used for the reverse. */
  sprintf(message, "Last input:%s", Message);
  for(i=0; i<len && message[i]; i++) 
    put_user(message[i], buf+i);


  /* Notice, we assume here that the size of the message 
   * is below len, or it will be received cut. In a real 
   * life situation, if the size of the message is more 
   * than len then we'd return len and on the second call 
   * start filling the buffer with the len+1'th byte of 
   * the message. */
  finished = 1; 

  return i;  /* Return the number of bytes "read" */
}


/* This function receives input from the user when the 
 * user writes to the /proc file. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t module_input(
    struct file *file,   /* The file itself */
    const char *buf,     /* The buffer with input */
    size_t length,       /* The buffer's length */
    loff_t *offset)      /* offset to file - ignore */
#else
static int module_input(
    struct inode *inode, /* The file's inode */
    struct file *file,   /* The file itself */
    const char *buf,     /* The buffer with the input */
    int length)          /* The buffer's length */
#endif
{
  int i;

  /* Put the input into Message, where module_output 
   * will later be able to use it */
  for(i=0; i<MESSAGE_LENGTH-1 && i<length; i++)
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    get_user(Message[i], buf+i);
  /* In version 2.2 the semantics of get_user changed, 
   * it no longer returns a character, but expects a 
   * variable to fill up as its first argument and a 
   * user segment pointer to fill it from as its 
   * second.
   *
   * The reason for this change is that the version 2.2 
   * get_user can also read a short or an int. The way 
   * it knows the type of the variable it should read 
   * is by using sizeof, and for that it needs the 
   * variable itself.
   */ 
#else 
    Message[i] = get_user(buf+i);
#endif
  Message[i] = '\0';  /* we want a standard, zero 
                       * terminated string */
  
  /* We need to return the number of input characters 
   * used */
  return i;
}



/* This function decides whether to allow an operation 
 * (return zero) or not allow it (return a non-zero 
 * which indicates why it is not allowed).
 *
 * The operation can be one of the following values:
 * 1 - Execute (run the "file" - meaningless in our case)
 * 2 - Write (input to the kernel module)
 * 4 - Read (output from the kernel module)
 *
 * This is the real function that checks file 
 * permissions. The permissions returned by ls -l are 
 * for reference only, and can be overridden here. 
 */
static int module_permission(struct inode *inode, int op)
{
  /* We allow everybody to read from our module, but 
   * only root (uid 0) may write to it */ 
  if (op == 4 || (op == 2 && current->euid == 0))
    return 0; 

  /* If it's anything else, access is denied */
  return -EACCES;
}




/* The file is opened - we don't really care about 
 * that, but it does mean we need to increment the 
 * module's reference count. */
int module_open(struct inode *inode, struct file *file)
{
  MOD_INC_USE_COUNT;
 
  return 0;
}


/* The file is closed - again, interesting only because 
 * of the reference count. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
int module_close(struct inode *inode, struct file *file)
#else
void module_close(struct inode *inode, struct file *file)
#endif
{
  MOD_DEC_USE_COUNT;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
  return 0;  /* success */
#endif
}


/* Structures to register as the /proc file, with 
 * pointers to all the relevant functions. ********** */



/* File operations for our proc file. This is where we 
 * place pointers to all the functions called when 
 * somebody tries to do something to our file. NULL 
 * means we don't want to deal with something. */
static struct file_operations File_Ops_4_Our_Proc_File =
  {
    NULL,  /* lseek */
    module_output,  /* "read" from the file */
    module_input,   /* "write" to the file */
    NULL,  /* readdir */
    NULL,  /* select */
    NULL,  /* ioctl */
    NULL,  /* mmap */
    module_open,    /* Somebody opened the file */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    NULL,   /* flush, added here in version 2.2 */
#endif
    module_close,    /* Somebody closed the file */
    /* etc. etc. etc. (they are all given in 
     * /usr/include/linux/fs.h). Since we don't put 
     * anything here, the system will keep the default
     * data, which in Unix is zeros (NULLs when taken as 
     * pointers). */
  };



/* Inode operations for our proc file. We need it so 
 * we'll have some place to specify the file operations 
 * structure we want to use, and the function we use for 
 * permissions. It's also possible to specify functions 
 * to be called for anything else which could be done to 
 * an inode (although we don't bother, we just put 
 * NULL). */
static struct inode_operations Inode_Ops_4_Our_Proc_File =
  {
    &File_Ops_4_Our_Proc_File,
    NULL, /* create */
    NULL, /* lookup */
    NULL, /* link */
    NULL, /* unlink */
    NULL, /* symlink */
    NULL, /* mkdir */
    NULL, /* rmdir */
    NULL, /* mknod */
    NULL, /* rename */
    NULL, /* readlink */
    NULL, /* follow_link */
    NULL, /* readpage */
    NULL, /* writepage */
    NULL, /* bmap */
    NULL, /* truncate */
    module_permission /* check for permissions */
  };


/* Directory entry */
static struct proc_dir_entry Our_Proc_File = 
  {
    0, /* Inode number - ignore, it will be filled by 
        * proc_register[_dynamic] */
    7, /* Length of the file name */
    "rw_test", /* The file name */
    S_IFREG | S_IRUGO | S_IWUSR, 
    /* File mode - this is a regular file which 
     * can be read by its owner, its group, and everybody
     * else. Also, its owner can write to it.
     *
     * Actually, this field is just for reference, it's
     * module_permission that does the actual check. It 
     * could use this field, but in our implementation it
     * doesn't, for simplicity. */
    1,  /* Number of links (directories where the 
         * file is referenced) */
    0, 0,  /* The uid and gid for the file - 
            * we give it to root */
    80, /* The size of the file reported by ls. */
    &Inode_Ops_4_Our_Proc_File, 
    /* A pointer to the inode structure for
     * the file, if we need it. In our case we
     * do, because we need a write function. */
    NULL  
    /* The read function for the file. Irrelevant, 
     * because we put it in the inode structure above */
  }; 



/* Module initialization and cleanup ******************* */

/* Initialize the module - register the proc file */
int init_module()
{
  /* Success if proc_register[_dynamic] is a success, 
   * failure otherwise */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
  /* In version 2.2, proc_register assigns a dynamic 
   * inode number automatically if it is zero in the 
   * structure, so there's no more need for 
   * proc_register_dynamic
   */
  return proc_register(&proc_root, &Our_Proc_File);
#else
  return proc_register_dynamic(&proc_root, &Our_Proc_File);
#endif
}


/* Cleanup - unregister our file from /proc */
void cleanup_module()
{
  proc_unregister(&proc_root, Our_Proc_File.low_ino);
}
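The comment in module_output mentions what a real read function would do when the message does not fit into the caller's buffer: hand out as much as fits and continue from there on the next call. A minimal user-space sketch of that idea follows, assuming the position is tracked through an offset the caller passes in. demo_read_chunk is a hypothetical name, and memcpy stands in for the put_user loop a module would need.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Copy up to len bytes of message into buf, starting at *offset, and
 * advance *offset. Returns the number of bytes copied; 0 means end of
 * file. */
static size_t demo_read_chunk(const char *message, char *buf,
                              size_t len, size_t *offset)
{
    size_t total = strlen(message);
    size_t n;

    assert(offset != NULL);
    if (*offset >= total)
        return 0;               /* nothing left to deliver */

    n = total - *offset;
    if (n > len)
        n = len;                /* never write beyond the caller's buffer */
    memcpy(buf, message + *offset, n);
    *offset += n;
    return n;
}
```

Compared with the static finished flag in the example, the offset keeps the state with the reader instead of inside the function, so two readers would not disturb each other.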

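Chapter 7. Talking to Device Files

7.1. Talking to Device Files (writes and IOCTLs)

Device files are supposed to represent physical devices. Most physical devices are used for output as well as input, so there has to be some mechanism for device drivers in the kernel to get the output to send to the device from processes. This is done by opening the device file for output and writing to it, just like writing to a file. In the following example, this is implemented by device_write.

This is not always enough. Imagine you had a serial port connected to a modem (even if you have an internal modem, it is still implemented from the CPU's perspective as a serial port connected to a modem, so you don't have to tax your imagination too hard). The natural thing to do would be to use the device file to write things to the modem (either modem commands or data to be sent through the phone line) and read things from the modem (either responses for commands or the data received through the phone line). However, this leaves open the question of what to do when you need to talk to the serial port itself, for example to set the rate at which data is sent and received.

The answer in Unix is to use a special function called ioctl (short for Input Output ConTroL). Every device can have its own ioctl commands, which can be read ioctls (to send information from a process to the kernel), write ioctls (to return information to a process),[9] both or neither. The ioctl function is called with three parameters: the file descriptor of the appropriate device file, the ioctl number, and a parameter, which is of type long so you can use a cast to use it to pass anything.[10]

The ioctl number encodes the major device number, the type of the ioctl, the command, and the type of the parameter. This ioctl number is usually created by a macro call (_IO, _IOR, _IOW or _IOWR, depending on the type) in a header file. This header file should then be included both by the programs which will use ioctl (so they can generate the appropriate ioctls) and by the kernel module (so it can understand them). In the example below, the header file is chardev.h and the program which uses it is ioctl.c.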
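To show what "encodes" means here, the following sketch packs and unpacks the four fields by hand. The field widths (8 bits command number, 8 bits type, 14 bits parameter size, 2 bits direction) are the common generic layout as far as I know; treat them as an assumption, since some architectures use different widths. The DEMO_ and demo_ names are invented; real code should use the macros from the ioctl header.

```c
#include <assert.h>

#define DEMO_NRBITS    8
#define DEMO_TYPEBITS  8
#define DEMO_SIZEBITS  14

#define DEMO_NRSHIFT   0
#define DEMO_TYPESHIFT (DEMO_NRSHIFT + DEMO_NRBITS)
#define DEMO_SIZESHIFT (DEMO_TYPESHIFT + DEMO_TYPEBITS)
#define DEMO_DIRSHIFT  (DEMO_SIZESHIFT + DEMO_SIZEBITS)

/* Direction bits, named from the process's point of view. */
#define DEMO_NONE  0u
#define DEMO_WRITE 1u
#define DEMO_READ  2u

/* Pack direction, type (the major number in this guide), command
 * number and parameter size into one ioctl number. */
static unsigned int demo_ioc(unsigned int dir, unsigned int type,
                             unsigned int nr, unsigned int size)
{
    assert(dir < 4u && type < 256u && nr < 256u && size < 16384u);
    return (dir << DEMO_DIRSHIFT) | (size << DEMO_SIZESHIFT) |
           (type << DEMO_TYPESHIFT) | (nr << DEMO_NRSHIFT);
}

/* Unpack the individual fields again. */
static unsigned int demo_ioc_dir(unsigned int cmd)  { return (cmd >> DEMO_DIRSHIFT) & 0x3u; }
static unsigned int demo_ioc_size(unsigned int cmd) { return (cmd >> DEMO_SIZESHIFT) & 0x3fffu; }
static unsigned int demo_ioc_type(unsigned int cmd) { return (cmd >> DEMO_TYPESHIFT) & 0xffu; }
static unsigned int demo_ioc_nr(unsigned int cmd)   { return (cmd >> DEMO_NRSHIFT) & 0xffu; }
```

Because the kernel module and the user program compute the number from the same header, they agree on every field. In the kernel headers the direction letters follow the process's side: R for commands where the process reads data back, W for commands where it writes data to the driver.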

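If you want to use ioctls in your own kernel modules, it is best to receive an official ioctl assignment, so if you accidentally get somebody else's ioctls, or if they get yours, you'll know something is wrong. For more information, consult the kernel source tree at Documentation/ioctl-number.txt.

Example 7-1. chardev.c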

/*  chardev.c - Create an input/output character device
 */

#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif        

/* For character devices */

/* The character device definitions are here */
#include <linux/fs.h>

/* A wrapper which does next to nothing
 * at present, but may help for compatibility
 * with future versions of Linux */
#include <linux/wrapper.h>

			     
/* Our own ioctl numbers */
#include "chardev.h"


/* In 2.2.3 /usr/include/linux/version.h includes a 
 * macro for this, but 2.0.35 doesn't - so I add it 
 * here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif



#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>  /* for get_user and put_user */
#endif



#define SUCCESS 0


/* Device Declarations ******************************** */


/* The name for our device, as it will appear in 
 * /proc/devices */
#define DEVICE_NAME "char_dev"


/* The maximum length of the message for the device */
#define BUF_LEN 80

/* Is the device open right now? Used to prevent 
 * concurrent access into the same device */
static int Device_Open = 0;

/* The message the device will give when asked */
static char Message[BUF_LEN];

/* How far did the process reading the message get? 
 * Useful if the message is larger than the size of the 
 * buffer we get to fill in device_read. */
static char *Message_Ptr;


/* This function is called whenever a process attempts 
 * to open the device file */
static int device_open(struct inode *inode, 
                       struct file *file)
{
#ifdef DEBUG
  printk ("device_open(%p)\n", file);
#endif

  /* We don't want to talk to two processes at the 
   * same time */
  if (Device_Open)
    return -EBUSY;

  /* If this was a process, we would have had to be 
   * more careful here, because one process might have 
   * checked Device_Open right before the other one 
   * tried to increment it. However, we're in the 
   * kernel, so we're protected against context switches.
   *
   * This is NOT the right attitude to take, because we
   * might be running on an SMP box, but we'll deal with
   * SMP in a later chapter.
   */ 

  Device_Open++;

  /* Initialize the message */
  Message_Ptr = Message;

  MOD_INC_USE_COUNT;

  return SUCCESS;
}


/* This function is called when a process closes the 
 * device file. In 2.0 it doesn't have a return value 
 * because it cannot fail: regardless of what else 
 * happens, you should always be able to close a device. 
 * In 2.2 it returns an int, so a device file could be 
 * impossible to close.
 */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static int device_release(struct inode *inode, 
                          struct file *file)
#else
static void device_release(struct inode *inode, 
                           struct file *file)
#endif
{
#ifdef DEBUG
  printk ("device_release(%p,%p)\n", inode, file);
#endif
 
  /* We're now ready for our next caller */
  Device_Open --;

  MOD_DEC_USE_COUNT;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
  return 0;
#endif
}



/* This function is called whenever a process which 
 * has already opened the device file attempts to 
 * read from it. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t device_read(
    struct file *file,
    char *buffer, /* The buffer to fill with the data */   
    size_t length,     /* The length of the buffer */
    loff_t *offset) /* offset to the file */
#else
static int device_read(
    struct inode *inode,
    struct file *file,
    char *buffer,   /* The buffer to fill with the data */ 
    int length)     /* The length of the buffer 
                     * (mustn't write beyond that!) */
#endif
{
  /* Number of bytes actually written to the buffer */
  int bytes_read = 0;

#ifdef DEBUG
  printk("device_read(%p,%p,%d)\n", file, buffer, length);
#endif

  /* If we're at the end of the message, return 0 
   * (which signifies end of file) */
  if (*Message_Ptr == 0)
    return 0;

  /* Actually put the data into the buffer */
  while (length && *Message_Ptr)  {

    /* Because the buffer is in the user data segment, 
     * not the kernel data segment, assignment wouldn't 
     * work. Instead, we have to use put_user which 
     * copies data from the kernel data segment to the 
     * user data segment. */
    put_user(*(Message_Ptr++), buffer++);
    length --;
    bytes_read ++;
  }

#ifdef DEBUG
   printk ("Read %d bytes, %d left\n", bytes_read, length);
#endif

   /* Read functions are supposed to return the number 
    * of bytes actually inserted into the buffer */
  return bytes_read;
}


/* This function is called when somebody tries to 
 * write into our device file. */ 
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t device_write(struct file *file,
                            const char *buffer,
                            size_t length,
                            loff_t *offset)
#else
static int device_write(struct inode *inode,
                        struct file *file,
                        const char *buffer,
                        int length)
#endif
{
  int i;

#ifdef DEBUG
  /* buffer points into user space, so print the pointer, 
   * not the string behind it */
  printk ("device_write(%p,%p,%d)\n",
    file, buffer, length);
#endif

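  /* Leave room for the terminating zero */
  for(i=0; i<length && i<BUF_LEN-1; i++)
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
    get_user(Message[i], buffer+i);
#else
    Message[i] = get_user(buffer+i);
#endif  

  /* Terminate the message, so device_read knows where 
   * it ends */
  Message[i] = '\0';

  Message_Ptr = Message;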

  /* Again, return the number of input characters used */
  return i;
}


/* This function is called whenever a process tries to 
 * do an ioctl on our device file. We get two extra 
 * parameters (additional to the inode and file 
 * structures, which all device functions get): the number
 * of the ioctl called and the parameter given to the 
 * ioctl function.
 *
 * If the ioctl is write or read/write (meaning output 
 * is returned to the calling process), the ioctl call 
 * returns the output of this function.
 */
int device_ioctl(
    struct inode *inode,
    struct file *file,
    unsigned int ioctl_num,/* The number of the ioctl */
    unsigned long ioctl_param) /* The parameter to it */
{
  int i;
  char *temp;
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
  char ch;
#endif

  /* Switch according to the ioctl called */
  switch (ioctl_num) {
    case IOCTL_SET_MSG:
      /* Receive a pointer to a message (in user space) 
       * and set that to be the device's message. */ 

      /* Get the parameter given to ioctl by the process */
      temp = (char *) ioctl_param;
   
      /* Find the length of the message */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      get_user(ch, temp);
      for (i=0; ch && i<BUF_LEN; i++, temp++)
        get_user(ch, temp);
#else
      for (i=0; get_user(temp) && i<BUF_LEN; i++, temp++)
	;
#endif

      /* Don't reinvent the wheel - call device_write */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      device_write(file, (char *) ioctl_param, i, 0);
#else
      device_write(inode, file, (char *) ioctl_param, i);
#endif
      break;

    case IOCTL_GET_MSG:
      /* Give the current message to the calling 
       * process - the parameter we got is a pointer, 
       * fill it. */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      i = device_read(file, (char *) ioctl_param, 99, 0); 
#else
      i = device_read(inode, file, (char *) ioctl_param, 99); 
#endif
      /* Warning - we assume here the buffer length is 
       * 100. If it's less than that we might overflow 
       * the buffer, causing the process to core dump. 
       *
       * The reason we only allow up to 99 characters is 
       * that the NULL which terminates the string also 
       * needs room. */

      /* Put a zero at the end of the buffer, so it 
       * will be properly terminated */
      put_user('\0', (char *) ioctl_param+i);
      break;

    case IOCTL_GET_NTH_BYTE:
      /* This ioctl is both input (ioctl_param) and 
       * output (the return value of this function) */

      /* Don't read beyond the end of the message buffer */
      if (ioctl_param >= BUF_LEN)
        return -EINVAL;

      return Message[ioctl_param];
  }

  return SUCCESS;
}


/* Module Declarations *************************** */


/* This structure will hold the functions to be called 
 * when a process does something to the device we 
 * created. Since a pointer to this structure is kept in 
 * the devices table, it can't be local to
 * init_module. NULL is for unimplemented functions. */
struct file_operations Fops = {
  NULL,   /* seek */
  device_read, 
  device_write,
  NULL,   /* readdir */
  NULL,   /* select */
  device_ioctl,   /* ioctl */
  NULL,   /* mmap */
  device_open,
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
  NULL,  /* flush */
#endif
  device_release  /* a.k.a. close */
};


/* Initialize the module - Register the character device */
int init_module()
{
  int ret_val;

  /* Register the character device (at least try) */
  ret_val = module_register_chrdev(MAJOR_NUM, 
                                 DEVICE_NAME,
                                 &Fops);

  /* Negative values signify an error */
  if (ret_val < 0) {
    printk ("%s failed with %d\n",
            "Sorry, registering the character device ",
            ret_val);
    return ret_val;
  }

  printk ("%s The major device number is %d.\n",
          "Registration is a success.", 
          MAJOR_NUM);
  printk ("If you want to talk to the device driver,\n");
  printk ("you'll have to create a device file. \n");
  printk ("We suggest you use:\n");
  printk ("mknod %s c %d 0\n", DEVICE_FILE_NAME, 
          MAJOR_NUM);
  printk ("The device file name is important, because\n");
  printk ("the ioctl program assumes that's the\n");
  printk ("file you'll use.\n");

  return 0;
}


/* Cleanup - unregister the character device */
void cleanup_module()
{
  int ret;

  /* Unregister the device */
  ret = module_unregister_chrdev(MAJOR_NUM, DEVICE_NAME);
 
  /* If there's an error, report it */ 
  if (ret < 0)
    printk("Error in module_unregister_chrdev: %d\n", ret);
}
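The comment in device_open admits that checking Device_Open and then incrementing it is not safe on an SMP machine, and defers the topic to a later chapter. As a preview of the idea only, here is a user-space analogue that claims the device in a single indivisible step using C11 atomics. It is not the kernel's own API, and the demo_ names are invented for this sketch.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>

/* 0 = free, 1 = open. User-space counterpart of Device_Open. */
static atomic_int demo_device_open = 0;

/* The compare-and-swap succeeds only for the caller that sees 0 and
 * replaces it with 1, so the check and the update cannot be separated
 * by another caller. */
static int demo_open(void)
{
    int expected = 0;

    if (!atomic_compare_exchange_strong(&demo_device_open, &expected, 1))
        return -EBUSY;
    return 0;
}

/* Mark the device as free again. */
static void demo_release(void)
{
    int was_open = atomic_exchange(&demo_device_open, 0);

    assert(was_open == 1);      /* releasing a closed device is a bug */
    (void)was_open;
}
```

With the separate "if (Device_Open)" test and "Device_Open++" of the example, two processors could both pass the test before either one increments; the single atomic operation removes that window.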

Example 7-2. chardev.h

/*  chardev.h - the header file with the ioctl definitions.
 *
 *  The declarations here have to be in a header file, because
 *  they need to be known both to the kernel module
 *  (in chardev.c) and the process calling ioctl (ioctl.c)
 */

#ifndef CHARDEV_H
#define CHARDEV_H

#include <linux/ioctl.h>



/* The major device number. We can't rely on dynamic 
 * registration any more, because ioctls need to know 
 * it. */
#define MAJOR_NUM 100


/* Set the message of the device driver */
#define IOCTL_SET_MSG _IOW(MAJOR_NUM, 0, char *)
/* _IOW means that we're creating an ioctl command 
 * number for passing information from a user process
 * to the kernel module. The macro is named from the 
 * process's point of view: the process writes, the 
 * kernel reads.
 *
 * The first argument, MAJOR_NUM, is the major device 
 * number we're using.
 *
 * The second argument is the number of the command 
 * (there could be several with different meanings).
 *
 * The third argument is the type we want to get from 
 * the process to the kernel.
 */

/* Get the message of the device driver */
#define IOCTL_GET_MSG _IOR(MAJOR_NUM, 1, char *)
 /* This IOCTL is used for output, to get the message 
  * of the device driver. However, we still need the 
  * buffer to place the message in to be input, 
  * as it is allocated by the process.
  */


/* Get the n'th byte of the message */
#define IOCTL_GET_NTH_BYTE _IOWR(MAJOR_NUM, 2, int)
 /* The IOCTL is used for both input and output. It 
  * receives from the user a number, n, and returns 
  * Message[n]. */


/* The name of the device file */
#define DEVICE_FILE_NAME "char_dev"


#endif

Example 7-3. ioctl.c

/*  ioctl.c - the process to use ioctl's to control the kernel module
 *
 *  Until now we could have used cat for input and output.  But now
 *  we need to do ioctl's, which require writing our own process. 
 */

/* device specifics, such as ioctl numbers and the 
 * major device file. */
#include "chardev.h"    


#include <stdio.h>      /* printf, putchar */
#include <stdlib.h>     /* exit */
#include <fcntl.h>      /* open */ 
#include <unistd.h>     /* close */
#include <sys/ioctl.h>  /* ioctl */



/* Functions for the ioctl calls */

void ioctl_set_msg(int file_desc, char *message)
{
  int ret_val;

  ret_val = ioctl(file_desc, IOCTL_SET_MSG, message);

  if (ret_val < 0) {
    printf ("ioctl_set_msg failed:%d\n", ret_val);
    exit(-1);
  }
}



void ioctl_get_msg(int file_desc)
{
  int ret_val;
  char message[100]; 

  /* Warning - this is dangerous because we don't tell 
   * the kernel how far it's allowed to write, so it 
   * might overflow the buffer. In a real production 
   * program, we would have used two ioctls - one to tell
   * the kernel the buffer length and another to give 
   * it the buffer to fill
   */
  ret_val = ioctl(file_desc, IOCTL_GET_MSG, message);

  if (ret_val < 0) {
    printf ("ioctl_get_msg failed:%d\n", ret_val);
    exit(-1);
  }

  printf("get_msg message:%s\n", message);
}



void ioctl_get_nth_byte(int file_desc)
{
  int i;
  int c;   /* int, so a negative error value is not lost */

  printf("get_nth_byte message:");

  i = 0;
  do {
    c = ioctl(file_desc, IOCTL_GET_NTH_BYTE, i++);

    if (c < 0) {
      printf(
      "ioctl_get_nth_byte failed at the %d'th byte:\n", i);
      exit(-1);
    }

    /* Don't print the terminating zero */
    if (c != 0)
      putchar(c);
  } while (c != 0);
  putchar('\n');
}




/* Main - Call the ioctl functions */
main()
{
  int file_desc, ret_val;
  char *msg = "Message passed by ioctl\n";

  file_desc = open(DEVICE_FILE_NAME, 0);
  if (file_desc < 0) {
    printf ("Can't open device file: %s\n", 
            DEVICE_FILE_NAME);
    exit(-1);
  }

  ioctl_get_nth_byte(file_desc);
  ioctl_get_msg(file_desc);
  ioctl_set_msg(file_desc, msg);

  close(file_desc); 
  return 0;
}

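Chapter 8. System Calls

8.1. System Calls

So far, the only thing we've done was to use well defined kernel mechanisms to register /proc files and device handlers. This is fine if you want to do something the kernel programmers thought you'd want, such as write a device driver. But what if you want to do something unusual, to change the behavior of the system in some way? Then, you're mostly on your own.

This is where kernel programming gets dangerous. While writing the example below, I killed the open() system call. This meant I couldn't open any files, I couldn't run any programs, and I couldn't shut down the computer. I had to pull the power switch. Luckily, no files were damaged. To make sure you won't lose any files either, please run sync right before you do the insmod and the rmmod.

Forget about /proc files, forget about device files. They're just minor details. The real process to kernel communication mechanism, the one used by all processes, is system calls. When a process requests a service from the kernel (such as opening a file, forking to a new process, or requesting more memory), this is the mechanism used. If you want to change the behavior of the kernel in interesting ways, this is the place to do it. By the way, if you want to see which system calls a program uses, run strace <command> <arguments>.

In general, a process is not supposed to be able to access the kernel. It can't access kernel memory and it can't call kernel functions. The hardware of the CPU enforces this (that's the reason why it's called "protected mode").

System calls are an exception to this general rule. What happens is that the process fills the registers with the appropriate values and then calls a special instruction which jumps to a previously defined location in the kernel (of course, that location is readable by user processes, but not writable by them). Under Intel CPUs, this is done by means of interrupt 0x80. The hardware knows that once you jump to this location, you are no longer running in restricted user mode, but as the operating system kernel, and therefore you're allowed to do whatever you want.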
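To see the "fill in a number, then trap into the kernel" step from the process side, here is a minimal user-space sketch. It assumes a Linux system with glibc, whose syscall() function takes a system call number and performs the trap for you; which instruction the library actually uses (interrupt 0x80 or something newer) depends on the CPU and the library, so treat this as an illustration rather than a description of any particular kernel version. The name raw_getpid is made up for this example.

```c
#define _GNU_SOURCE          /* make sure <unistd.h> declares syscall() */
#include <sys/syscall.h>     /* SYS_getpid, the system call number */
#include <unistd.h>          /* syscall(), getpid() */

/* Ask the kernel for our process ID by system call number,
 * instead of going through the usual getpid() wrapper. */
long raw_getpid(void)
{
    return syscall(SYS_getpid);
}
```

Both routes end up in the same kernel function, so raw_getpid() and getpid() should agree. The number SYS_getpid is what the kernel uses as the index into its table of system calls.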

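The location in the kernel a process can jump to is called system_call. The procedure at that location checks the system call number, which tells the kernel what service the process requested. Then, it looks at the table of system calls (sys_call_table) to see the address of the kernel function to call. Then it calls the function, and after it returns, does a few system checks and then returns back to the process (or to a different process, if the process time ran out). If you want to read this code, it's in the source file arch/<architecture>/kernel/entry.S, after the line ENTRY(system_call).

So, if we want to change the way a certain system call works, what we need to do is to write our own function to implement it (usually by adding a bit of our own code, and then calling the original function) and then change the pointer at sys_call_table to point to our function. Because we might be removed later and we don't want to leave the system in an unstable state, it's important for cleanup_module to restore the table to its original state.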
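The dispatch-and-replace idea can be tried safely in an ordinary process before you try it in the kernel. The sketch below is a user-space simulation, not kernel code: demo_table stands in for sys_call_table, demo_sys_open for the real sys_open, and every name in it (NR_DEMO_OPEN, install_hook, remove_hook, dispatch and so on) is invented for this illustration.

```c
#include <stddef.h>

#define NR_DEMO_OPEN  0           /* hypothetical system call number */
#define NR_DEMO_CALLS 1

typedef int (*syscall_fn)(const char *path);

static int opens_seen;            /* how many calls went through the hook */

/* Stands in for the kernel's own implementation. */
static int demo_sys_open(const char *path)
{
    return path != NULL ? 3 : -1; /* pretend file descriptor */
}

/* Stands in for sys_call_table. */
static syscall_fn demo_table[NR_DEMO_CALLS] = { demo_sys_open };

static syscall_fn saved_open;     /* original pointer, kept for the restore */

/* Our replacement: do something extra, then call the original. */
static int hooked_open(const char *path)
{
    opens_seen++;
    return saved_open(path);
}

/* What init_module would do: save the old pointer, then replace it. */
void install_hook(void)
{
    saved_open = demo_table[NR_DEMO_OPEN];
    demo_table[NR_DEMO_OPEN] = hooked_open;
}

/* What cleanup_module would do: put the old pointer back. */
void remove_hook(void)
{
    demo_table[NR_DEMO_OPEN] = saved_open;
}

/* What the system_call entry point does: index the table by number. */
int dispatch(int nr, const char *path)
{
    return demo_table[nr](path);
}

int hook_count(void)
{
    return opens_seen;
}
```

Callers of dispatch() never notice the difference: they get the same result with or without the hook, and only the counter shows that our code ran in between.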

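The source code here is an example of such a kernel module. We want to "spy" on a certain user, and to printk() a message whenever that user opens a file. Towards this end, we replace the system call to open a file with our own function, called our_sys_open. This function checks the uid (user's id) of the current process, and if it's equal to the uid we spy on, it calls printk() to display the name of the file to be opened. Then, either way, it calls the original open() function with the same parameters, to actually open the file.

The init_module function replaces the appropriate location in sys_call_table and keeps the original pointer in a variable. The cleanup_module function uses that variable to restore everything back to normal. This approach is dangerous, because of the possibility of two kernel modules changing the same system call. Imagine we have two kernel modules, A and B. A's open system call will be A_open and B's will be B_open. Now, when A is inserted into the kernel, the system call is replaced with A_open, which will call the original sys_open when it's done. Next, B is inserted into the kernel, which replaces the system call with B_open, which will call what it thinks is the original system call, A_open, when it's done.

Now, if B is removed first, everything will be well: it will simply restore the system call to A_open, which calls the original. However, if A is removed and then B is removed, the system will crash. A's removal will restore the system call to the original, sys_open, cutting B out of the loop. Then, when B is removed, it will restore the system call to what it thinks is the original, A_open, which is no longer in memory. At first glance, it appears we could solve this particular problem by checking if the system call is equal to our open function and if so not changing it at all (so that B won't change the system call when it's removed), but that will cause an even worse problem. When A is removed, it sees that the system call was changed to B_open so that it is no longer pointing to A_open, so it won't restore it to sys_open before it is removed from memory. Unfortunately, B_open will still try to call A_open which is no longer there, so that even without removing B the system would crash.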
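The A and B scenario is easier to follow if you can step through it. The following user-space simulation tracks a single table entry and two pretend modules. It never calls through a stale pointer; it only reports whether the entry ends up pointing at a hook whose module is gone. All names (slot, module_sim, load, unload, slot_is_dangling) are hypothetical and exist only in this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

typedef int (*open_fn)(void);

static int sys_open_sim(void) { return 0; }   /* the "real" call */
static int A_open(void)       { return 1; }   /* module A's hook */
static int B_open(void)       { return 2; }   /* module B's hook */

static open_fn slot = sys_open_sim;           /* one table entry */

struct module_sim {
    open_fn hook;     /* function this module installs */
    open_fn saved;    /* what it found in the slot at load time */
    bool loaded;      /* is the hook's code still "in memory"? */
};

static struct module_sim A = { A_open, NULL, false };
static struct module_sim B = { B_open, NULL, false };

/* Like init_module: remember the current entry, install our hook. */
void load(struct module_sim *m)
{
    m->saved = slot;
    slot = m->hook;
    m->loaded = true;
}

/* Like cleanup_module: blindly restore what we saved. */
void unload(struct module_sim *m)
{
    slot = m->saved;
    m->loaded = false;
}

/* True if the slot points at a hook whose module has been removed. */
bool slot_is_dangling(void)
{
    return (slot == A.hook && !A.loaded) ||
           (slot == B.hook && !B.loaded);
}
```

Loading A then B and unloading in reverse order (B, then A) leaves slot at sys_open_sim. Unloading A first and then B leaves slot at A_open with A no longer loaded, which is the crash described above.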

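I can think of two ways to prevent this problem. The first is to restore the call to the original value, sys_open. Unfortunately, sys_open is not part of the kernel system table in /proc/ksyms, so we can't access it. The other solution is to use the reference count to prevent root from rmmod'ing the module once it is loaded. This is good for production modules, but bad for an educational sample, which is why I didn't do it here.

Example 8-1. syscall.c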

/*  syscall.c 
 * 
 *  System call "stealing" sample.
 */


/* Copyright (C) 2001 by Peter Jay Salzman */


/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>   /* We're doing kernel work */
#include <linux/module.h>   /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif        

#include <sys/syscall.h>  /* The list of system calls */

/* For the current (process) structure, we need
 * this to know who the current user is. */
#include <linux/sched.h>




/* In 2.2.3 /usr/include/linux/version.h includes a 
 * macro for this, but 2.0.35 doesn't - so I add it 
 * here if necessary. */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif



#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>
#endif



/* The system call table (a table of functions). We 
 * just define this as external, and the kernel will 
 * fill it up for us when we are insmod'ed 
 */
extern void *sys_call_table[];


/* UID we want to spy on - will be filled from the 
 * command line */
int uid;  

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
MODULE_PARM(uid, "i");
#endif

/* A pointer to the original system call. The reason 
 * we keep this, rather than call the original function 
 * (sys_open), is because somebody else might have 
 * replaced the system call before us. Note that this 
 * is not 100% safe, because if another module 
 * replaced sys_open before us, then when we're inserted 
 * we'll call the function in that module - and it 
 * might be removed before we are.
 *
 * Another reason for this is that we can't get sys_open.
 * It's a static variable, so it is not exported. */
asmlinkage int (*original_call)(const char *, int, int);



/* For some reason, in 2.2.3 current->uid gave me 
 * zero, not the real user ID. I tried to find what went 
 * wrong, but I couldn't do it in a short time, and 
 * I'm lazy - so I'll just use the system call to get the 
 * uid, the way a process would. 
 *
 * For some reason, after I recompiled the kernel this 
 * problem went away. 
 */
asmlinkage int (*getuid_call)();



/* The function we'll replace sys_open (the function 
 * called when you call the open system call) with. To 
 * find the exact prototype, with the number and type 
 * of arguments, we find the original function first 
 * (it's at fs/open.c). 
 *
 * In theory, this means that we're tied to the 
 * current version of the kernel. In practice, the 
 * system calls almost never change (it would wreak havoc 
 * and require programs to be recompiled, since the system
 * calls are the interface between the kernel and the 
 * processes).
 */
asmlinkage int our_sys_open(const char *filename, 
                            int flags, 
                            int mode)
{
  int i = 0;
  char ch;

  /* Check if this is the user we're spying on */
  if (uid == getuid_call()) {  
   /* getuid_call is the getuid system call, 
    * which gives the uid of the user who
    * ran the process which called the system
    * call we got */

    /* Report the file, if relevant */
    printk("Opened file by %d: ", uid); 
    do {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      get_user(ch, filename+i);
#else
      ch = get_user(filename+i);
#endif
      i++;
      printk("%c", ch);
    } while (ch != 0);
    printk("\n");
  }

  /* Call the original sys_open - otherwise, we lose 
   * the ability to open files */
  return original_call(filename, flags, mode);
}



/* Initialize the module - replace the system call */
int init_module()
{
  /* Warning - too late for it now, but maybe for 
   * next time... */
  printk("I'm dangerous. I hope you did a ");
  printk("sync before you insmod'ed me.\n");
  printk("My counterpart, cleanup_module(), is even "); 
  printk("more dangerous. If\n");
  printk("you value your file system, it will ");
  printk("be \"sync; rmmod\" \n");
  printk("when you remove this module.\n");

  /* Get the system call for getuid first: our_sys_open 
   * uses getuid_call, and it may be called as soon as 
   * it is in the table. To get the address of the 
   * function for system call foo, go to 
   * sys_call_table[__NR_foo]. */
  getuid_call = sys_call_table[__NR_getuid];

  /* Keep a pointer to the original function in 
   * original_call, and then replace the system call 
   * in the system call table with our_sys_open */
  original_call = sys_call_table[__NR_open];
  sys_call_table[__NR_open] = our_sys_open;

  printk("Spying on UID:%d\n", uid);

  return 0;
}


/* Cleanup - restore the original system call */
void cleanup_module()
{
  /* Return the system call back to normal */
  if (sys_call_table[__NR_open] != our_sys_open) {
    printk("Somebody else also played with the ");
    printk("open system call\n");
    printk("The system may be left in ");
    printk("an unstable state.\n");
  }

  sys_call_table[__NR_open] = original_call;
}

Chapter 9. Blocking Processes

9.1. Blocking Processes

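What do you do when somebody asks you for something you can't do right away? If you're a human being and you're bothered by a human being, the only thing you can say is: "Not right now, I'm busy. Go away!". But if you're a kernel module and you're bothered by a process, you have another possibility. You can put the process to sleep until you can service it. After all, processes are being put to sleep by the kernel and woken up all the time (that's the way multiple processes appear to run at the same time on a single CPU).

This kernel module is an example of this. The file (called /proc/sleep) can only be opened by a single process at a time. If the file is already open, the kernel module calls module_interruptible_sleep_on[11]. This function changes the status of the task (a task is the kernel data structure which holds information about a process and the system call it's in, if any) to TASK_INTERRUPTIBLE, which means that the task will not run until it is woken up somehow, and adds it to WaitQ, the queue of tasks waiting to access the file. Then, the function calls the scheduler to context switch to a different process, one which has some use for the CPU.

When a process is done with the file, it closes it, and module_close is called. That function wakes up all the processes in the queue (there's no mechanism to only wake up one of them). It then returns and the process which just closed the file can continue to run. In time, the scheduler decides that that process has had enough and gives control of the CPU to another process. Eventually, one of the processes which was in the queue will be given control of the CPU by the scheduler. It starts at the point right after the call to module_interruptible_sleep_on[12]. It can then proceed to set a global variable to tell all the other processes that the file is still open and go on with its life. When the other processes get a piece of the CPU, they'll see that global variable and go back to sleep.

To make our life more interesting, module_close doesn't have a monopoly on waking up the processes which wait to access the file. A signal, such as Ctrl+c (SIGINT), can also wake up a process.[13] In such a case, we want to return with -EINTR immediately. This is important so users can, for example, kill the process before it receives the file.

There is one more point to remember. Sometimes processes don't want to sleep; they want either to get what they want immediately, or to be told it cannot be done. Such processes use the O_NONBLOCK flag when opening the file. The kernel is supposed to respond by returning with the error code -EAGAIN from operations which would otherwise block, such as opening the file in this example. The program cat_noblock, available in the source directory for this chapter, can be used to open a file with O_NONBLOCK.

Example 9-1. sleep.c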
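From the process side, asking not to be put to sleep looks like the sketch below. This is not the cat_noblock program from the source directory, only a small user-space helper written for this illustration; the name open_noblock and its convention of returning a negative errno value are assumptions of the sketch. It uses only the standard open() call and the O_NONBLOCK flag.

```c
#include <errno.h>
#include <fcntl.h>

/* Try to open path for reading without blocking.
 * Returns a file descriptor (>= 0) on success, or the
 * negated errno value on failure, for example -EAGAIN
 * when the driver would otherwise have put us to sleep. */
int open_noblock(const char *path)
{
    int fd = open(path, O_RDONLY | O_NONBLOCK);

    return fd >= 0 ? fd : -errno;
}
```

With the module in this chapter loaded, calling open_noblock("/proc/sleep") while another process holds the file open should come back with -EAGAIN instead of waiting. The caller can then retry later or give up.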

/*  sleep.c - create a /proc file, and if several processes try to open it at
 *  the same time, put all but one to sleep
 */

#include <linux/kernel.h>                   /* We're doing kernel work */
#include <linux/module.h>                   /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif        

/* Necessary because we use proc fs */
#include <linux/proc_fs.h>

/* For putting processes to sleep and waking them up */
#include <linux/sched.h>
#include <linux/wrapper.h>

/* In 2.2.3 /usr/include/linux/version.h includes a macro for this, but 2.0.35
 * doesn't - so I add it here if necessary.
 */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
#include <asm/uaccess.h>                    /* for get_user and put_user */
#endif

/* The module's file functions */

/* Here we keep the last message received, to prove that we can process our
 * input
 */
#define MESSAGE_LENGTH 80
static char Message[MESSAGE_LENGTH];

/* Since we use the file operations struct, we can't use the special proc
 * output provisions - we have to use a standard read function, which is this
 * function
 */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t module_output (
   struct file *file,                      /* The file read */
   char *buf,           /* The buffer to put data to (in the user segment) */
   size_t len,                             /* The length of the buffer */
   loff_t *offset)                         /* Offset in the file - ignore */
#else
static int module_output (
   struct inode *inode,                    /* The inode read */
   struct file *file,                      /* The file read */
   char *buf,           /* The buffer to put data to (in the user segment) */
   int len)                                /* The length of the buffer */
#endif
{
   static int finished = 0;
   int i;
   char message[MESSAGE_LENGTH+30];

   /* Return 0 to signify end of file - that we have nothing more to say at this
    * point.
    */
   if (finished) {
      finished = 0;
      return 0;
   }

   /* If you don't understand this by now, you're hopeless as a kernel
    * programmer.
    */
   sprintf(message, "Last input:%s\n", Message);
   for (i = 0; i < len && message[i]; i++) 
      put_user(message[i], buf+i);

   finished = 1;
   return i;                            /* Return the number of bytes "read" */
}

/* This function receives input from the user when the user writes to the /proc
 * file.
 */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
static ssize_t module_input (
   struct file *file,                     /* The file itself */
   const char *buf,                       /* The buffer with input */
   size_t length,                         /* The buffer's length */
   loff_t *offset)                        /* offset to file - ignore */
#else
static int module_input (
   struct inode *inode,                   /* The file's inode */
   struct file *file,                     /* The file itself */
   const char *buf,                       /* The buffer with the input */
   int length)                            /* The buffer's length */
#endif
{
   int i;

   /* Put the input into Message, where module_output will later be able to use
    * it
    */
   for(i = 0; i < MESSAGE_LENGTH-1 && i < length; i++)
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      get_user(Message[i], buf+i);
#else
      Message[i] = get_user(buf+i);
#endif
   /* we want a standard, zero terminated string */
   Message[i] = '\0';  
  
   /* We need to return the number of input characters used */
   return i;
}

/* 1 if the file is currently open by somebody */
int Already_Open = 0;

/* Queue of processes who want our file */
static struct wait_queue *WaitQ = NULL;

/* Called when the /proc file is opened */
static int module_open(struct inode *inode, struct file *file)
{
   /* If the file's flags include O_NONBLOCK, it means the process doesn't want
    * to wait for the file.  In this case, if the file is already open, we
    * should fail with -EAGAIN, meaning "you'll have to try again", instead of
    * blocking a process which would rather stay awake.
    */
   if ((file->f_flags & O_NONBLOCK) && Already_Open) 
      return -EAGAIN;

   /* This is the correct place for MOD_INC_USE_COUNT because if a process is
    * in the loop, which is within the kernel module, the kernel module must
    * not be removed.
    */
   MOD_INC_USE_COUNT;

   /* If the file is already open, wait until it isn't */
   while (Already_Open) 
   {
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      int i, is_sig = 0;
#endif

      /* This function puts the current process, including any system calls,
       * such as us, to sleep.  Execution will be resumed right after the
       * function call, either because somebody called wake_up(&WaitQ) (only
       * module_close does that, when the file is closed) or when a signal,
       * such as Ctrl-C, is sent to the process
       */
      module_interruptible_sleep_on(&WaitQ);
 
      /* If we woke up because we got a signal we're not blocking, return
       * -EINTR (fail the system call).  This allows processes to be killed or
       * stopped.
       */

/*
 * Emmanuel Papirakis:
 *
 * This is a little update to work with 2.2.*.  Signals now are contained in
 * two words (64 bits) and are stored in a structure that contains an array of
 * two unsigned longs.  We now have to make 2 checks in our if.
 *
 * Ori Pomerantz:
 *
 * Nobody promised me they'll never use more than 64 bits, or that this book
 * won't be used for a version of Linux with a word size of 16 bits.  This code
 * would work in any case.
 */	  
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
      for (i = 0; i < _NSIG_WORDS && !is_sig; i++)
         is_sig = current->signal.sig[i] & ~current->blocked.sig[i];

      if (is_sig) {
#else
      if (current->signal & ~current->blocked) {
#endif
         /* It's important to put MOD_DEC_USE_COUNT here, because for processes
          * where the open is interrupted there will never be a corresponding
          * close. If we don't decrement the usage count here, we will be left
          * with a positive usage count which we'll have no way to bring down
          * to zero, giving us an immortal module, which can only be killed by
          * rebooting the machine.
          */
         MOD_DEC_USE_COUNT;
         return -EINTR;
      }
   }

   /* If we got here, Already_Open must be zero */

   /* Open the file */
   Already_Open = 1;
   return 0;                                 /* Allow the access */
}

/* Called when the /proc file is closed */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
int module_close(struct inode *inode, struct file *file)
#else
void module_close(struct inode *inode, struct file *file)
#endif
{
   /* Set Already_Open to zero, so one of the processes in the WaitQ will be
    * able to set Already_Open back to one and to open the file.  All the other
    * processes will be called when Already_Open is back to one, so they'll go
    * back to sleep.
    */
   Already_Open = 0;

   /* Wake up all the processes in WaitQ, so if anybody is waiting for the
    * file, they can have it.
    */
   module_wake_up(&WaitQ);

   MOD_DEC_USE_COUNT;

#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   return 0;                                 /* success */
#endif
}

/* This function decides whether to allow an operation (return zero) or not
 * allow it (return a non-zero which indicates why it is not allowed).
 *
 * The operation can be one of the following values:
 * 0 - Execute (run the "file" - meaningless in our case)
 * 2 - Write (input to the kernel module)
 * 4 - Read (output from the kernel module)
 *
 * This is the real function that checks file permissions. The permissions
 * returned by ls -l are for reference only, and can be overridden here. 
 */
static int module_permission(struct inode *inode, int op)
{
   /* We allow everybody to read from our module, but only root (uid 0) may
    * write to it
    */ 
   if (op == 4 || (op == 2 && current->euid == 0))
      return 0; 

   /* If it's anything else, access is denied */
   return -EACCES;
}

/* Structures to register as the /proc file, with pointers to all the relevant
 * functions. 
 */

/* File operations for our proc file. This is where we place pointers to all
 * the functions called when somebody tries to do something to our file. NULL
 * means we don't want to deal with something.
 */
static struct file_operations File_Ops_4_Our_Proc_File = {
   NULL,                                   /* lseek */
   module_output,                          /* "read" from the file */
   module_input,                           /* "write" to the file */
   NULL,                                   /* readdir */
   NULL,                                   /* select */
   NULL,                                   /* ioctl */
   NULL,                                   /* mmap */
   module_open,                    /* called when the /proc file is opened */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   NULL,                                   /* flush */
#endif
   module_close};                          /* called when it's closed */

/* Inode operations for our proc file.  We need it so we'll have somewhere to
 * specify the file operations structure we want to use, and the function we
 * use for permissions. It's also possible to specify functions to be called
 * for anything else which could be done to an inode (although we don't bother,
 * we just put NULL).
 */
static struct inode_operations Inode_Ops_4_Our_Proc_File = {
   &File_Ops_4_Our_Proc_File,
   NULL,                                   /* create */
   NULL,                                   /* lookup */
   NULL,                                   /* link */
   NULL,                                   /* unlink */
   NULL,                                   /* symlink */
   NULL,                                   /* mkdir */
   NULL,                                   /* rmdir */
   NULL,                                   /* mknod */
   NULL,                                   /* rename */
   NULL,                                   /* readlink */
   NULL,                                   /* follow_link */
   NULL,                                   /* readpage */
   NULL,                                   /* writepage */
   NULL,                                   /* bmap */
   NULL,                                   /* truncate */
   module_permission};                     /* check for permissions */

/* Directory entry */
static struct proc_dir_entry Our_Proc_File = {
   0,                 /* Inode number - ignore, it will be filled by 
                       * proc_register[_dynamic]
                       */
   5,                                      /* Length of the file name */
   "sleep",                                /* The file name */

   /* File mode - this is a regular file which can be read by its owner, its
    * group, and everybody else. Also, its owner can write to it.
    *
    * Actually, this field is just for reference, it's module_permission that
    * does the actual check. It could use this field, but in our
    * implementation it doesn't, for simplicity.
    */
   S_IFREG | S_IRUGO | S_IWUSR, 
   1,        /* Number of links (directories where the file is referenced) */
   0, 0,     /* The uid and gid for the file - we give it to root */
   80,       /* The size of the file reported by ls. */

   /* A pointer to the inode structure for the file, if we need it. In our
    * case we do, because we need a write function.
    */
   &Inode_Ops_4_Our_Proc_File, 

   /* The read function for the file.  Irrelevant, because we put it in the
    * inode structure above
    */
   NULL}; 

/* Module initialization and cleanup */

/* Initialize the module - register the proc file */
int init_module()
{
   /* Success if proc_register_dynamic is a success, failure otherwise */
#if LINUX_VERSION_CODE >= KERNEL_VERSION(2,2,0)
   return proc_register(&proc_root, &Our_Proc_File);
#else
   return proc_register_dynamic(&proc_root, &Our_Proc_File);
#endif 

   /* proc_root is the root directory for the proc fs (/proc).  This is where
    * we want our file to be located. 
    */
}

/* Cleanup - unregister our file from /proc.  This could get dangerous if
 * there are still processes waiting in WaitQ, because they are inside our
 * open function, which will get unloaded. I'll explain how to avoid removal
 * of a kernel module in such a case in chapter 10.
 */
void cleanup_module()
{
   proc_unregister(&proc_root, Our_Proc_File.low_ino);
}

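Chapter 10. Replacing Printks

10.1. Replacing `printk`

In Section 1.2.1.2, I said that X and kernel module programming don't mix. That's true for developing kernel modules, but in actual use, you want to be able to send messages to whichever tty[14] the command to load the module came from.

The way this is done is by using current, a pointer to the currently running task, to get the current task's tty structure. Then, we look inside that tty structure to find a pointer to a string write function, which we use to write a string to the tty.

Example 10-1. print_string.c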

/*  print_string.c - Send output to the tty you're running on, regardless of whether it's
 *     through X11, telnet, etc.  We do this by printing the string to the tty associated
 *     with the current task.
 */
#include <linux/kernel.h>
#include <linux/module.h>
#include <linux/init.h>     // For module_init and module_exit
#include <linux/sched.h>    // For current
#include <linux/string.h>   // For strlen
#include <linux/tty.h>      // For the tty declarations
MODULE_LICENSE("GPL");
MODULE_AUTHOR("Peter Jay Salzman");


void print_string(char *str)
{
   struct tty_struct *my_tty;
   my_tty = current->tty;           // The tty for the current task

   /* If my_tty is NULL, the current task has no tty you can print to (this is possible,
    * for example, if it's a daemon).  If so, there's nothing we can do.
    */
   if (my_tty != NULL) { 

      /* my_tty->driver is a struct which holds the tty's functions, one of which (write)
       * is used to write strings to the tty.  It can be used to take a string either
       * from the user's memory segment or the kernel's memory segment.
       *
       * The function's 1st parameter is the tty to write to, because the same function
       * would normally be used for all tty's of a certain type.  The 2nd parameter
       * controls whether the function receives a string from kernel memory (false, 0) or
       * from user memory (true, non zero).  The 3rd parameter is a pointer to a string.
       * The 4th parameter is the length of the string.
       */
      (*(my_tty->driver).write)(
         my_tty,                 // The tty itself
         0,                      // We don't take the string from user space
         str,                    // String
         strlen(str));           // Length

      /* ttys were originally hardware devices, which (usually) strictly followed the
       * ASCII standard.  In ASCII, to move to a new line you need two characters, a
       * carriage return and a line feed.  On Unix, the ASCII line feed is used for both
       * purposes - so we can't just use \n, because it wouldn't have a carriage return
       * and the next line will start at the column right after the line feed. 
       *
       * BTW, this is why text files are different between Unix and MS Windows.  In CP/M
       * and its derivatives, like MS-DOS and MS Windows, the ASCII standard was strictly
       * adhered to, and therefore a newline requires both a LF and a CR.
       */
      (*(my_tty->driver).write)(my_tty, 0, "\015\012", 2);
   }
}


int print_string_init(void)
{
   print_string("The module has been inserted.  Hello world!");
   return 0;
}


void print_string_exit(void)
{
   print_string("The module has been removed.  Farewell world!");
}  


module_init(print_string_init);
module_exit(print_string_exit);

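Chapter 11. Scheduling Tasks

11.1. Scheduling Tasks

Very often, we have "housekeeping" tasks which have to be done at a certain time, or every so often. If the task is to be done by a process, we do it by putting it in the crontab file. If the task is to be done by a kernel module, we have two possibilities. The first is to put a process in the crontab file which will wake up the module by a system call when necessary, for example by opening a file. This is terribly inefficient, however: we run a new process off of crontab, read a new executable into memory, and all this just to wake up a kernel module which is in memory anyway.

Instead of doing that, we can create a function that will be called once for every timer interrupt. The way we do this is we create a task, held in a tq_struct structure, which will hold a pointer to the function. Then, we use queue_task to put that task on a task list called tq_timer, which is the list of tasks to be executed on the next timer interrupt. Because we want the function to keep on being executed, we need to put it back on tq_timer whenever it is called, for the next timer interrupt.

There's one more point we need to remember here. When a module is removed by rmmod, first its reference count is checked. If it is zero, cleanup_module is called. Then, the module is removed from memory with all its functions. Nobody checks to see if the timer's task list happens to contain a pointer to one of those functions, which will no longer be available. Ages later (from the computer's perspective; from a human perspective it's nothing, less than a hundredth of a second), the kernel has a timer interrupt and tries to call the function on the task list. Unfortunately, the function is no longer there. In most cases, the memory page where it sat is unused, and you get an ugly error message. But if some other code is now sitting at the same memory location, things could get very ugly. Unfortunately, we don't have an easy way to unregister a task from a task list.

Since cleanup_module can't return with an error code (it's a void function), the solution is to not let it return at all. Instead, it calls sleep_on or module_sleep_on[15] to put the rmmod process to sleep. Before that, it informs the function called on the timer interrupt to stop attaching itself, by setting a global variable. Then, on the next timer interrupt, the function runs one last time without putting itself back and wakes up the rmmod process; at that point our function is no longer in the queue and it's safe to remove the module.

Example 11-1. sched.c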
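The re-queueing scheme and the stop flag described above can be tried in a plain user-space program before risking it in the kernel. The sketch below is only a simulation: timer_queue stands in for tq_timer, queue_task_sim for queue_task, and timer_tick_sim for one timer interrupt. These names, and the fixed queue size, are assumptions made for the illustration and are not kernel interfaces.

```c
#include <stddef.h>

struct task_sim {
    void (*routine)(void *);   /* function to run */
    void *data;                /* argument passed to it */
};

#define QUEUE_MAX 8
static struct task_sim *timer_queue[QUEUE_MAX];  /* stands in for tq_timer */
static size_t queued;

/* Stands in for queue_task(): append a task to the list. */
int queue_task_sim(struct task_sim *t)
{
    if (queued == QUEUE_MAX)
        return -1;
    timer_queue[queued++] = t;
    return 0;
}

/* Stands in for one timer interrupt: run each queued task once.
 * Tasks queued while we run wait for the next tick. */
void timer_tick_sim(void)
{
    struct task_sim *batch[QUEUE_MAX];
    size_t n = queued, i;

    for (i = 0; i < n; i++)
        batch[i] = timer_queue[i];
    queued = 0;
    for (i = 0; i < n; i++)
        batch[i]->routine(batch[i]->data);
}

static int ticks_seen;
static int stop_requested;         /* set before "removing the module" */
static struct task_sim counter_task;

/* Counts ticks and puts itself back unless asked to stop. */
static void count_tick(void *unused)
{
    (void) unused;
    ticks_seen++;
    if (!stop_requested)
        queue_task_sim(&counter_task);
}

/* What init_module would do: queue the task for the first time. */
void start_counter(void)
{
    counter_task.routine = count_tick;
    counter_task.data = NULL;
    queue_task_sim(&counter_task);
}
```

Once stop_requested is set, the task runs one more time and then the queue is empty. That last run is the moment at which the real module wakes up the sleeping rmmod process.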

/*  sched.c - schedule a function to be called on every timer interrupt.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 *
 *  06/20/2006 - Updated by Rodrigo Rubira Branco <rodrigo@kernelhacking.com>
 */

/* Kernel Programming */
#define MODULE
#define LINUX
#define __KERNEL__

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>                   /* We're doing kernel work */
#include <linux/module.h>                   /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif        

/* Necessary because we use the proc fs */
#include <linux/proc_fs.h>

/* We schedule tasks here */
#include <linux/tqueue.h>

/* We also need the ability to put ourselves to sleep and wake up later */
#include <linux/sched.h>

/* In 2.2.3 /usr/include/linux/version.h includes a macro for this, but
 * 2.0.35 doesn't - so I add it here if necessary.
 */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

/* The number of times the timer interrupt has been called so far */
static int TimerIntrpt = 0;

/* This is used by cleanup, to prevent the module from being unloaded while
 * intrpt_routine is still in the task queue
 */
static DECLARE_WAIT_QUEUE_HEAD(WaitQ);
int waitq=0;

static void intrpt_routine(void *);

/* The task queue structure for this task, from tqueue.h */
static struct tq_struct Task = {
   routine: (void (*)(void *)) intrpt_routine, /* The function to run */
   data: NULL            /* The void* parameter for that function */
};

/* This function will be called on every timer interrupt. Notice the void*
 * pointer - task functions can be used for more than one purpose, each time 
 * getting a different parameter.
 */
static void intrpt_routine(void *irrelevant)
{
   /* Increment the counter */
   TimerIntrpt++;

   /* If cleanup wants us to die */
   if (waitq) 
      wake_up(&WaitQ);               /* Now cleanup_module can return */
   else
      /* Put ourselves back in the task queue */
      queue_task(&Task, &tq_timer);  
}

/* Put data into the proc fs file. */
int procfile_read(char *buffer, 
                  char **buffer_location, off_t offset, 
                  int buffer_length, int *eof, void *data)
{
   int len;  /* The number of bytes actually used */

   /* It's static so it will still be in memory when we leave this function
    */
   static char my_buffer[80];  

   static int count = 1;

   /* We give all of our information in one go, so if anybody asks us
    * if we have more information the answer should always be no. 
    */
   if (offset > 0)
      return 0;

   /* Fill the buffer and get its length */
   len = sprintf(my_buffer, "Timer called %d times so far\n", TimerIntrpt);
   count++;

   /* Tell the function which called us where the buffer is */
   *buffer_location = my_buffer;

   /* Return the length */
   return len;
}

/* Proc structure pointer */
struct proc_dir_entry *Our_Proc_File;

/* Initialize the module - register the proc file */
int init_module()
{
   /* Create the /proc entry first. If this fails we return an error, and
    * the task must not be left on tq_timer pointing into a module that
    * never finished loading.
    */
   Our_Proc_File=create_proc_read_entry("sched", 0444, NULL, procfile_read, NULL);

   if ( Our_Proc_File == NULL )
      return -ENOMEM;

   /* Put the task in the tq_timer task queue, so it will be executed at
    * next timer interrupt
    */
   queue_task(&Task, &tq_timer);

   return 0;

}

/* Cleanup */
void cleanup_module()
{
   /* Unregister our /proc file */
   remove_proc_entry("sched", NULL);
  
   /* Sleep until intrpt_routine is called one last time. This is necessary,
    * because otherwise we'll deallocate the memory holding intrpt_routine
    * and Task while tq_timer still references them.  Notice that here we
    * don't allow signals to interrupt us. 
    *
    * Setting waitq to 1 tells the interrupt routine it's time to die: on
    * its next run it wakes us up instead of putting itself back in the queue.
    */
   waitq=1;
   sleep_on(&WaitQ);
}  

MODULE_LICENSE("GPL");

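Chapter 12. Interrupt Handlers

12.1. Interrupt Handlers

12.1.1. Interrupt Handlers

Except for the last chapter, everything we did in the kernel so far we've done as a response to a process asking for it, either by dealing with a special file, sending an ioctl(), or issuing a system call. But the job of the kernel isn't just to respond to process requests. Another job, which is every bit as important, is to speak to the hardware connected to the machine.

There are two types of interaction between the CPU and the rest of the computer's hardware. The first type is when the CPU gives orders to the hardware, the other is when the hardware needs to tell the CPU something. The second, called interrupts, is much harder to implement because it has to be dealt with when convenient for the hardware, not the CPU. Hardware devices typically have a very small amount of RAM, and if you don't read their information when available, it is lost.

Under Linux, hardware interrupts are called IRQs (Interrupt Requests)[16]. There are two types of IRQs, short and long. A short IRQ is one which is expected to take a very short period of time, during which the rest of the machine will be blocked and no other interrupts will be handled. A long IRQ is one which can take longer, and during which other interrupts may occur (but not interrupts from the same device). If at all possible, it's better to declare an interrupt handler to be long.

When the CPU receives an interrupt, it stops whatever it's doing (unless it's processing a more important interrupt, in which case it will deal with this one only when the more important one is done), saves certain parameters on the stack and calls the interrupt handler. This means that certain things are not allowed in the interrupt handler itself, because the system is in an unknown state. The solution to this problem is for the interrupt handler to do what needs to be done immediately, usually read something from the hardware or send something to the hardware, and then schedule the handling of the new information at a later time (this is called the "bottom half") and return. The kernel is then guaranteed to call the bottom half as soon as possible, and when it does, everything allowed in kernel modules will be allowed.

The way to implement this is to call request_irq() to get your interrupt handler called when the relevant IRQ is received (there are 15 of them, plus 1 which is used to cascade the interrupt controllers, on Intel platforms). This function receives the IRQ number, the name of the function, flags, a name for /proc/interrupts and a parameter to pass to the interrupt handler. The flags can include SA_SHIRQ to indicate you're willing to share the IRQ with other interrupt handlers (usually because a number of hardware devices sit on the same IRQ) and SA_INTERRUPT to indicate this is a fast interrupt. This function will only succeed if there isn't already a handler on this IRQ, or if you're both willing to share.

Then, from within the interrupt handler, we communicate with the hardware and then use queue_task_irq() with tq_immediate and mark_bh(IMMEDIATE_BH) to schedule the bottom half. The reason we can't use the standard queue_task in version 2.0 is that the interrupt might happen right in the middle of somebody else's queue_task[17]. We need mark_bh because earlier versions of Linux only had an array of 32 bottom halves, and now one of them (IMMEDIATE_BH) is used for the linked list of bottom halves for drivers which didn't get a bottom half entry assigned to them.

12.1.2. Keyboards on the Intel Architecture

The rest of this chapter is completely Intel specific. If you're not running on an Intel platform, it will not work. Don't even try to compile the code here.

I had a problem with writing the sample code for this chapter. On one hand, for an example to be useful it has to run on everybody's computer with meaningful results. On the other hand, the kernel already includes device drivers for all of the common devices, and those device drivers won't coexist with what I'm going to write. The solution I've found was to write something for the keyboard interrupt, and disable the regular keyboard interrupt handler first. Since it is defined as a static symbol in the kernel source files (specifically, drivers/char/keyboard.c), there is no way to restore it. Before insmod'ing this code, do sleep 120 ; reboot on another terminal if you value your file system.

This code binds itself to IRQ 1, which is the IRQ of the keyboard controller under Intel architectures. Then, when it receives a keyboard interrupt, it reads the keyboard's status (that's the purpose of the inb(0x64)) and the scan code, which is the value returned by the keyboard. Then, as soon as the kernel thinks it's feasible, it runs got_char, which gives the code of the key used (the low seven bits of the scan code) and whether it has been pressed (if the 8th bit is zero) or released (if it's one).

Example 12-1. intrpt.c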
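The bit layout of the scan code is simple enough to check in user space. The helper below is a sketch written for this illustration (the names key_event and decode_scancode are made up); it applies the same two masks that got_char uses, 0x7F for the key code and 0x80 for the pressed/released bit.

```c
#include <stdbool.h>

struct key_event {
    unsigned char code;   /* low seven bits: which key */
    bool released;        /* eighth bit: set on release, clear on press */
};

/* Split a raw scan code into its key code and its press/release bit. */
struct key_event decode_scancode(unsigned char scancode)
{
    struct key_event ev;

    ev.code = scancode & 0x7F;
    ev.released = (scancode & 0x80) != 0;
    return ev;
}
```

For example, the raw values 0x1E and 0x9E differ only in the eighth bit, so they decode to the same key code 0x1E, first as a press and then as a release.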

/*  intrpt.c - An interrupt handler.
 *
 *  Copyright (C) 2001 by Peter Jay Salzman
 */

/* The necessary header files */

/* Standard in kernel modules */
#include <linux/kernel.h>               /* We're doing kernel work */
#include <linux/module.h>               /* Specifically, a module */

/* Deal with CONFIG_MODVERSIONS */
#if CONFIG_MODVERSIONS==1
#define MODVERSIONS
#include <linux/modversions.h>
#endif        

#include <linux/sched.h>
#include <linux/tqueue.h>

/* We want an interrupt */
#include <linux/interrupt.h>

#include <asm/io.h>

/* In 2.2.3 /usr/include/linux/version.h includes a macro for this, but
 * 2.0.35 doesn't - so I add it here if necessary.
 */
#ifndef KERNEL_VERSION
#define KERNEL_VERSION(a,b,c) ((a)*65536+(b)*256+(c))
#endif

/* Bottom Half - this will get called by the kernel as soon as it's safe
 * to do everything normally allowed by kernel modules.
 */
static void got_char(void *scancode)
{
   printk("Scan Code %x %s.\n",
          (int) *((char *) scancode) & 0x7F,
          *((char *) scancode) & 0x80 ? "Released" : "Pressed");
}

/* This function services keyboard interrupts. It reads the relevant
 * information from the keyboard and then schedules the bottom half
 * to run when the kernel considers it safe.
 */
void irq_handler(int irq, void *dev_id, struct pt_regs *regs)
{
   /* This variables are static because they need to be 
    * accessible (through pointers) to the bottom half routine.
    */
   static unsigned char scancode;
   static struct tq_struct task = {NULL, 0, got_char, &scancode};
   unsigned char status;

   /* Read keyboard status */
   status = inb(0x64);
   scancode = inb(0x60);
  
   /* Scheduale bottom half to run */
#if LINUX_VERSION_CODE > KERNEL_VERSION(2,2,0)
   queue_task(&task, &tq_immediate);
#else
   queue_task_irq(&task, &tq_immediate);
#endif
   mark_bh(IMMEDIATE_BH);
}

/* Initialize the module - register the IRQ handler */
int init_module(void)
{
   /* Since the keyboard handler won't co-exist with another handler,
    * such as us, we have to disable it (free its IRQ) before we do
    * anything.  Since we don't know where it is, there's no way to
    * reinstate it later - so the computer will have to be rebooted
    * when we're done.
    */
   free_irq(1, NULL);

   /* Request IRQ 1, the keyboard IRQ, to go to our irq_handler.
    * SA_SHIRQ means we're willing to have other handlers on this IRQ.
    * SA_INTERRUPT can be used to make the handler into a fast interrupt.
    */
   return request_irq(1,   /* The number of the keyboard IRQ on PCs */ 
              irq_handler, /* our handler */
              SA_SHIRQ, 
              "test_keyboard_irq_handler", NULL);
}

/* Cleanup */
void cleanup_module(void)
{
   /* This is only here for completeness. It's totally irrelevant, since
    * we don't have a way to restore the normal keyboard interrupt so the
    * computer is completely useless and has to be rebooted.
    */
   free_irq(1, NULL);
}
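
The version test inside irq_handler depends on how KERNEL_VERSION packs three numbers into one integer. The user-space sketch below repeats that arithmetic so it can be checked by hand; the SKETCH_ and sketch_ names are hypothetical and the macro body is copied from the fallback definition in the listing.

```c
/* Same packing as the fallback macro in intrpt.c: one byte each for the
 * minor number and patch level, the major number above them. */
#define SKETCH_KERNEL_VERSION(a, b, c) ((a) * 65536 + (b) * 256 + (c))

/* Mirrors the #if in irq_handler: returns 1 where the listing would call
 * queue_task, 0 where it would fall back to queue_task_irq. */
int sketch_uses_queue_task(long version_code)
{
    return version_code > SKETCH_KERNEL_VERSION(2, 2, 0);
}
```

Because the comparison is a strict greater-than, a version code equal to SKETCH_KERNEL_VERSION(2, 2, 0) still takes the queue_task_irq branch in this sketch, just as it would in the listing.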

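Chapter 13. Symmetrical Multi-Processing

13.1. Symmetrical Multi-Processing

One of the easiest and cheapest ways to improve hardware performance is to put more than one CPU on the board. This can be done either by making the different CPUs take on different jobs (asymmetrical multi-processing) or by making them all run in parallel, doing the same job (symmetrical multi-processing, a.k.a. SMP). Doing asymmetrical multi-processing effectively requires specialized knowledge about the tasks the computer should do, which is unavailable in a general purpose operating system such as Linux. Symmetrical multi-processing, on the other hand, is relatively easy to implement.

By relatively easy, I mean exactly that: not that it's really easy. In a symmetrical multi-processing environment the CPUs share the same memory, and as a result code running on one CPU can affect the memory used by another. You can no longer be certain that a variable you set to a certain value in the previous line still has that value; the other CPU might have played with it while you weren't looking. Obviously, it's impossible to program like this.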
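To make the hazard concrete, here is a deterministic user-space sketch that replays one unlucky interleaving of two CPUs incrementing the same variable. Nothing here runs in parallel; the "CPUs" are plain structs and every sketch_ name is hypothetical. It only illustrates why a read-modify-write sequence on shared memory can lose an update.

```c
/* Each simulated CPU has one private register. */
struct sketch_cpu {
    int reg;
};

static void sketch_load(struct sketch_cpu *cpu, const int *mem)
{
    cpu->reg = *mem;
}

static void sketch_add_one(struct sketch_cpu *cpu)
{
    cpu->reg += 1;
}

static void sketch_store(const struct sketch_cpu *cpu, int *mem)
{
    *mem = cpu->reg;
}

/* Both CPUs load before either stores: one increment is lost. */
int sketch_interleaved_increment(int start)
{
    int shared = start;
    struct sketch_cpu a, b;

    sketch_load(&a, &shared);
    sketch_load(&b, &shared);   /* b reads the value a is about to replace */
    sketch_add_one(&a);
    sketch_add_one(&b);
    sketch_store(&a, &shared);
    sketch_store(&b, &shared);  /* overwrites a's result with the same value */
    return shared;
}

/* Each CPU finishes its whole sequence before the other starts. */
int sketch_serialized_increment(int start)
{
    int shared = start;
    struct sketch_cpu a, b;

    sketch_load(&a, &shared);
    sketch_add_one(&a);
    sketch_store(&a, &shared);

    sketch_load(&b, &shared);
    sketch_add_one(&b);
    sketch_store(&b, &shared);
    return shared;
}
```

Starting from 0, the interleaved version ends at 1 while the serialized version ends at 2, even though both performed two increments.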

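In the case of process programming this normally isn't an issue, because a process will normally only run on one CPU at a time[18]. The kernel, on the other hand, could be called by different processes running on different CPUs.

In version 2.0.x, this isn't a problem because the entire kernel is inside one big spinlock. This means that if one CPU is in the kernel and another CPU wants to get in, for example because of a system call, it has to wait until the first CPU is done. This makes Linux SMP safe[19], but inefficient.

In version 2.2.x, several CPUs can be in the kernel at the same time. This is something module writers need to be aware of.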
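For readers who have not met a spinlock before, the idea can be sketched with C11 atomics in user space. This is not the kernel's spinlock API; the sketch_ names are hypothetical and the code assumes a compiler that provides <stdatomic.h>.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A lock is a single flag: set means "somebody is inside". */
struct sketch_spinlock {
    atomic_flag held;
};

void sketch_spin_init(struct sketch_spinlock *lock)
{
    atomic_flag_clear(&lock->held);
}

/* Try once; returns true if we took the lock. */
bool sketch_spin_trylock(struct sketch_spinlock *lock)
{
    return !atomic_flag_test_and_set_explicit(&lock->held,
                                              memory_order_acquire);
}

/* Busy-wait ("spin") until the current holder releases the lock. */
void sketch_spin_lock(struct sketch_spinlock *lock)
{
    while (!sketch_spin_trylock(lock))
        ;
}

void sketch_spin_unlock(struct sketch_spinlock *lock)
{
    atomic_flag_clear_explicit(&lock->held, memory_order_release);
}
```

Guarding the whole kernel with one such lock corresponds to the 2.0.x situation described above: safe, but every other CPU spins while one is inside. Guarding only the data that is actually shared lets several CPUs make progress at once.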

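Chapter 14. Common Pitfalls

14.1. Common Pitfalls

Before I send you on your way to write kernel modules, there are a few things I need to warn you about. If I fail to warn you and something bad happens, please report the problem to me for a full refund of the amount you paid for this book.

Using standard libraries: You can't do that. In a kernel module you can only use kernel functions, which are the functions listed in /proc/ksyms.
Disabling interrupts: You might need to do this for a short time and that is OK, but if you don't enable them afterwards, your system will be stuck and you'll have to power it off.
Sticking your head inside a large carnivore: I probably don't have to warn you about this, but I figured I would anyway, just in case.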

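Appendix A. Changes: 2.0 To 2.2

A.1. Changes between 2.0 and 2.2

A.1.1. Changes between 2.0 and 2.2

I don't know the entire kernel well enough to document all of the changes. In the course of converting the examples (or actually, adapting Emmanuel Papirakis's changes) I came across the following differences. I list all of them here to help module programmers, especially those who learned from previous versions of this book and are most familiar with the techniques I use, convert to the new version.

An additional resource for people who wish to convert to 2.2 is located on Richard Gooch's site.

asm/uaccess.h: If you need put_user or get_user you have to #include it.
get_user: In version 2.2, get_user receives both the pointer into user memory and the variable in kernel memory to fill with the information. The reason for this is that get_user can now read two or four bytes at a time if the variable we read is two or four bytes long.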
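file_operations: This structure now has a flush function between the pointers to the open and close functions.
close in file_operations: In version 2.2, the close function returns an integer, so it's allowed to fail.
read, write in file_operations: The headers for these functions changed. They now return ssize_t instead of an integer, and their parameter list is different. The inode is no longer a parameter; the offset into the file, on the other hand, is.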
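proc_register_dynamic: This function no longer exists. Instead, you call the regular proc_register and put zero in the inode field of the structure.
Signals: The signals in the task structure are no longer a 32 bit integer, but an array of _NSIG_WORDS integers.
queue_task_irq: Even if you want to schedule a task to happen from inside an interrupt handler, you use queue_task, not queue_task_irq.
Module parameters: You no longer just declare module parameters as global variables. In 2.2 you also have to use MODULE_PARM to declare their type. This is a big improvement, because it allows the module to receive string parameters which start with a digit, for example, without getting confused.
Symmetrical Multi-Processing: The kernel is no longer inside one huge spinlock, which means that kernel modules have to be aware of SMP.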
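
The "Signals" item above replaces a single 32 bit integer with an array of words. A user-space sketch of that representation follows. All SKETCH_ and sketch_ names are hypothetical, the limit of 64 signals is an assumption chosen for illustration, and the kernel's own definitions may differ in detail.

```c
#include <limits.h>
#include <string.h>

#define SKETCH_NSIG       64   /* assumed number of signals */
#define SKETCH_BPW        (sizeof(unsigned long) * CHAR_BIT)
#define SKETCH_NSIG_WORDS ((SKETCH_NSIG + SKETCH_BPW - 1) / SKETCH_BPW)

/* One bit per signal, spread over as many words as needed. */
struct sketch_sigset {
    unsigned long sig[SKETCH_NSIG_WORDS];
};

void sketch_sigemptyset(struct sketch_sigset *set)
{
    memset(set, 0, sizeof(*set));
}

/* Signals are numbered from 1, so signal n lives in bit n - 1. */
void sketch_sigaddset(struct sketch_sigset *set, int signo)
{
    unsigned int bit = (unsigned int)signo - 1;

    set->sig[bit / SKETCH_BPW] |= 1UL << (bit % SKETCH_BPW);
}

int sketch_sigismember(const struct sketch_sigset *set, int signo)
{
    unsigned int bit = (unsigned int)signo - 1;

    return (int)((set->sig[bit / SKETCH_BPW] >> (bit % SKETCH_BPW)) & 1UL);
}
```

Code that used to test a signal with a plain mask on one integer has to pick the right word first, which is the practical consequence of this change for module writers.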

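Appendix B. Where to Go from Here

B.1. Where to Go from Here?

I could easily have squeezed a few more chapters into this book. I could have added a chapter about creating new file systems, or about adding new protocol stacks (as if there's a need for that; you'd have to dig underground to find a protocol stack not supported by Linux). I could have added explanations of the kernel mechanisms we haven't touched upon, such as bootstrapping or the disk interface.

However, I chose not to. My purpose in writing this book was to provide an initiation into the mysteries of kernel module programming and to teach the common techniques for that purpose. For people seriously interested in kernel programming, I recommend Juan-Mariano de Goyeneche's list of kernel resources. Also, as Linus said, the best way to learn the kernel is to read the source code.

If you're interested in more examples of short kernel modules, I recommend Phrack magazine. Even if you're not interested in security (and as a programmer you should be), the kernel modules there are good examples of what you can do inside the kernel, and they're short enough not to require too much effort to understand.

I hope I have helped you in your quest to become a better programmer, or at least to have fun through technology. And, if you do write useful kernel modules, I hope you publish them under the GPL, so I can use them too.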

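Index

Symbols

/etc/conf.modules, How Do Modules Get Into The Kernel?
/etc/modules.conf, How Do Modules Get Into The Kernel?
/proc file system, The /proc File System
/proc/interrupts, Interrupt Handlers
/proc/ksyms, Functions available to modules, Name Space, Common Pitfalls
/proc/meminfo, The /proc File System
/proc/modules, How Do Modules Get Into The Kernel?, The /proc File System
2.2 changes, Changes between 2.0 and 2.2
_IO, Talking to Device Files (writes and IOCTLs)
_IOR, Talking to Device Files (writes and IOCTLs)
_IOW, Talking to Device Files (writes and IOCTLs)
_IOWR, Talking to Device Files (writes and IOCTLs)
_NSIG_WORDS, Changes between 2.0 and 2.2
__exit, Hello World (part 3): The __init and __exit Macros
__init, Hello World (part 3): The __init and __exit Macros
__initdata, Hello World (part 3): The __init and __exit Macros
__initfunction(), Hello World (part 3): The __init and __exit Macros
__NO_VERSION__, Modules Spanning Multiple Files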

A

asm
    uaccess.h, Changes between 2.0 and 2.2
asm/uaccess.h, Changes between 2.0 and 2.2

B

BH_IMMEDIATE, Interrupt Handlers
blocking processes, Blocking Processes
blocking, how to avoid, Replacing printk
bottom half, Interrupt Handlers
busy, Replacing printk

C

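carnivore
    large, Common Pitfalls
cleanup_module(), Hello, World (part 1): The Simplest Module
close, Changes between 2.0 and 2.2
code space, Code space
coffee, Major and Minor Numbers
CPU
    multiple, Symmetrical Multi-Processing
crontab, Scheduling Tasks
ctrl-c, Replacing printk
current task, Replacing printk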

D

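DEFAULT_MESSAGE_LOGLEVEL, Introducing printk()
defining ioctls, Talking to Device Files (writes and IOCTLs)
device files
    character, Character Device Drivers
device files
    input to, Talking to Device Files (writes and IOCTLs)
    write to, Talking to Device Files (writes and IOCTLs)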

E

EAGAIN, Replacing printk
EINTR, Replacing printk
elf_i386, Modules Spanning Multiple Files
ENTRY(system call), System Calls
entry.S, System Calls

F

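file, The file structure
file system
    /proc, The /proc File System
    registration, Using /proc For Input
file system registration, Using /proc For Input
file_operations, The file_operations Structure
file_operations structure, Using /proc For Input
flush, Changes between 2.0 and 2.2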

G

get_user, Using /proc For Input, Changes between 2.0 and 2.2

H

handlers
    interrupt, Interrupt Handlers
housekeeping, Scheduling Tasks
Hurd, Code space

I

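inb, Keyboards on the Intel Architecture
init_module(), Hello, World (part 1): The Simplest Module
inode, The file structure, The /proc File System
inode_operations structure, Using /proc For Input
input
    using /proc for, Using /proc For Input
insmod, Compiling Kernel Modules, System Calls
Intel architecture
    keyboard, Keyboards on the Intel Architecture
interrupt 0x80, System Calls
interrupt handlers, Interrupt Handlers
interruptible_sleep_on, Replacing printk
interrupts, Changes between 2.0 and 2.2
    disabling, Common Pitfalls
ioctl, Talking to Device Files (writes and IOCTLs)
    defining, Talking to Device Files (writes and IOCTLs)
    official assignment, Talking to Device Files (writes and IOCTLs)
irqs, Changes between 2.0 and 2.2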

K

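kernel
    versions, Changes between 2.0 and 2.2
kernel versions, Writing Modules for Multiple Kernel Versions
kerneld, How Do Modules Get Into The Kernel?
kernel_version, Modules Spanning Multiple Files
KERNEL_VERSION, Writing Modules for Multiple Kernel Versions
keyboard, Keyboards on the Intel Architecture
kmod, How Do Modules Get Into The Kernel?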

L

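ld, Modules Spanning Multiple Files
libraries
    standard, Common Pitfalls
library function, Functions available to modules
LINUX_VERSION_CODE, Writing Modules for Multiple Kernel Versions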

M

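major number, Major and Minor Numbers
    dynamic allocation, Registering A Device
mark_bh, Interrupt Handlers
memory segments, Using /proc For Input
microkernel, Code space
minor number, Major and Minor Numbers
mknod, Major and Minor Numbers
modem, Talking to Device Files (writes and IOCTLs)
module
    parameters, Changes between 2.0 and 2.2
module parameters, Changes between 2.0 and 2.2
module.h, Modules Spanning Multiple Files
modules.conf
    alias, How Do Modules Get Into The Kernel?
    comment, How Do Modules Get Into The Kernel?
    keep, How Do Modules Get Into The Kernel?
    options, How Do Modules Get Into The Kernel?
    path, How Do Modules Get Into The Kernel?
MODULE_AUTHOR(), Hello World (part 4): Licensing and Module Documentation
module_cleanup, Scheduling Tasks
MODULE_DESCRIPTION(), Hello World (part 4): Licensing and Module Documentation
module_exit, Hello World (part 2)
module_init, Hello World (part 2)
module_interruptible_sleep_on, Replacing printk
MODULE_LICENSE(), Hello World (part 4): Licensing and Module Documentation
MODULE_PARM, Changes between 2.0 and 2.2
module_permissions, Using /proc For Input
module_sleep_on, Replacing printk, Scheduling Tasks
MODULE_SUPPORTED_DEVICE(), Hello World (part 4): Licensing and Module Documentation
module_wake_up, Replacing printk
MOD_DEC_USE_COUNT, Unregistering A Device
MOD_INC_USE_COUNT, Unregistering A Device, System Calls
MOD_IN_USE, Unregistering A Device
monolithic kernel, Code space
multi-processing, Symmetrical Multi-Processing
multi tasking, Replacing printk
multitasking, Replacing printk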

N

namespace pollution, Name Space
Neutrino, Code space
non-blocking, Replacing printk

O

official ioctl assignment, Talking to Device Files (writes and IOCTLs)
O_NONBLOCK, Replacing printk

P

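permission, Using /proc For Input
pointer
    current, Using /proc For Input
printk
    replacing, Replacing printk
printk(), Introducing printk()
proc
    using for input, Using /proc For Input
proc file
    ksyms, Common Pitfalls
processes
    blocking, Blocking Processes
    killing, Replacing printk
    waking up, Replacing printk
processing
    multi, Symmetrical Multi-Processing
proc_dir_entry, Using /proc For Input
proc_register, The /proc File System, Changes between 2.0 and 2.2
proc_register_dynamic, The /proc File System, Changes between 2.0 and 2.2
putting processes to sleep, Replacing printk
put_user, Using /proc For Input, Changes between 2.0 and 2.2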

Q

queue_task, Scheduling Tasks, Interrupt Handlers, Changes between 2.0 and 2.2
queue_task_irq, Interrupt Handlers, Changes between 2.0 and 2.2

R

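read, Changes between 2.0 and 2.2
    in the kernel, Using /proc For Input
reference count, Scheduling Tasks
refund policy, Common Pitfalls
register_chrdev, Registering A Device
request_irq(), Interrupt Handlers
rmmod, System Calls, Scheduling Tasks
    preventing, Unregistering A Device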

S

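SA_INTERRUPT, Interrupt Handlers
SA_SHIRQ, Interrupt Handlers
scheduler, Replacing printk
scheduling tasks, Scheduling Tasks
segment
    memory, Using /proc For Input
serial port, Talking to Device Files (writes and IOCTLs)
shutdown, System Calls
SIGINT, Replacing printk
signal, Replacing printk
signals, Changes between 2.0 and 2.2
sleep
    putting processes to, Replacing printk
sleep_on, Replacing printk, Scheduling Tasks
SMP, Symmetrical Multi-Processing, Changes between 2.0 and 2.2
source files
    chardev.c, Talking to Device Files (writes and IOCTLs)
    chardev.h, Talking to Device Files (writes and IOCTLs)
    hello-1.c, Hello, World (part 1): The Simplest Module
    hello-2.c, Hello World (part 2)
    hello-3.c, Hello World (part 3): The __init and __exit Macros
    hello-4.c, Hello World (part 4): Licensing and Module Documentation
    hello-5.c, Passing Command Line Arguments to a Module
    intrpt.c, Keyboards on the Intel Architecture
    ioctl.c, Talking to Device Files (writes and IOCTLs)
    print_string.c, Replacing printk
    sched.c, Scheduling Tasks
    sleep.c, Replacing printk
    start.c, Modules Spanning Multiple Files
    stop.c, Modules Spanning Multiple Files
    syscall.c, System Calls
source files
    multiple, Modules Spanning Multiple Files
ssize_t, Changes between 2.0 and 2.2
standard libraries, Common Pitfalls
strace, Functions available to modules, System Calls
struct
    tty, Replacing printk
struct file_operations, Using /proc For Input
struct inode_operations, Using /proc For Input
structure
    file_operations, Changes between 2.0 and 2.2
symbol table, Name Space
symmetrical multi-processing, Symmetrical Multi-Processing, Changes between 2.0 and 2.2
sync, System Calls
system call, Functions available to modules, System Calls
    open, System Calls
system calls, System Calls
sys_call_table, System Calls
sys_open, System Calls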

T

task, Scheduling Tasks
    current, Replacing printk
tasks
    scheduling, Scheduling Tasks
TASK_INTERRUPTIBLE, Replacing printk
tq_immediate, Interrupt Handlers
tq_struct, Scheduling Tasks
tq_timer, Scheduling Tasks
tty_structure, Replacing printk

V

version.h, Modules Spanning Multiple Files

W

waking up processes, Replacing printk
write, Changes between 2.0 and 2.2