DataFlowSanitizer Design Document

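This document sets out the design of DataFlowSanitizer, a general dynamic data flow analysis. Unlike other Sanitizer tools, it is not designed to detect a specific class of bugs on its own. Instead, it provides a generic dynamic data flow analysis framework that clients can use to detect application-specific issues in their own code.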

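DataFlowSanitizer is a program instrumentation that can associate a number of taint labels with any data stored in any memory region accessible to the program. The analysis is dynamic: it operates on a running program and tracks how the labels propagate through that program.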

Use Cases

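This instrumentation can be used to monitor how data flows from a program's inputs (sources) to its outputs (sinks). This has applications from a privacy/security perspective: one can audit how a sensitive data item is used within a program and ensure that it does not leave the program anywhere it should not.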

Interface

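A number of functions are provided that attach taint labels to memory regions and extract the set of labels associated with a specific memory region. These functions are declared in the header file sanitizer/dfsan_interface.h.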

/// Sets the label for each address in [addr,addr+size) to \c label.
void dfsan_set_label(dfsan_label label, void *addr, size_t size);

/// Sets the label for each address in [addr,addr+size) to the union of the
/// current label for that address and \c label.
void dfsan_add_label(dfsan_label label, void *addr, size_t size);

/// Retrieves the label associated with the given data.
///
/// The type of 'data' is arbitrary.  The function accepts a value of any type,
/// which can be truncated or extended (implicitly or explicitly) as necessary.
/// The truncation/extension operations will preserve the label of the original
/// value.
dfsan_label dfsan_get_label(long data);

/// Retrieves the label associated with the data at the given address.
dfsan_label dfsan_read_label(const void *addr, size_t size);

/// Returns whether the given label contains the label elem.
int dfsan_has_label(dfsan_label label, dfsan_label elem);

/// Computes the union of \c l1 and \c l2, resulting in a union label.
dfsan_label dfsan_union(dfsan_label l1, dfsan_label l2);

/// Flushes the DFSan shadow, i.e. forgets about all labels currently associated
/// with the application memory.  Use this call to start over the taint tracking
/// within the same process.
///
/// Note: If another thread is working with tainted data during the flush, that
/// taint could still be written to shadow after the flush.
void dfsan_flush(void);
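To make the semantics of these calls concrete, here is a toy model in plain C. It is not the DFSan runtime: it keeps one shadow byte per byte of a small arena and mimics what set, add, read, and flush do to that shadow. The `model_*` names and the arena are hypothetical, and reading a range as the union of its per-byte labels is an assumption taken from the load semantics described later in this document.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

typedef uint8_t model_label;             /* stand-in for dfsan_label */

enum { ARENA_SIZE = 64 };
static model_label arena_shadow[ARENA_SIZE];  /* one label per arena byte */

/* Like dfsan_set_label: overwrite the label of each byte in [off, off+size). */
void model_set_label(model_label label, size_t off, size_t size) {
  assert(off <= (size_t)ARENA_SIZE && size <= (size_t)ARENA_SIZE - off);
  memset(arena_shadow + off, label, size);
}

/* Like dfsan_add_label: union `label` into each byte's current label. */
void model_add_label(model_label label, size_t off, size_t size) {
  assert(off <= (size_t)ARENA_SIZE && size <= (size_t)ARENA_SIZE - off);
  for (size_t i = 0; i < size; i++)
    arena_shadow[off + i] |= label;
}

/* Like dfsan_read_label (assumed): union of the labels of all bytes in range. */
model_label model_read_label(size_t off, size_t size) {
  assert(off <= (size_t)ARENA_SIZE && size <= (size_t)ARENA_SIZE - off);
  model_label out = 0;
  for (size_t i = 0; i < size; i++)
    out |= arena_shadow[off + i];
  return out;
}

/* Like dfsan_flush: forget every label. */
void model_flush(void) {
  memset(arena_shadow, 0, sizeof arena_shadow);
}
```

The real functions take application addresses rather than arena offsets and are only meaningful in a program built with the DataFlowSanitizer instrumentation.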

The following functions are provided to check origin tracking status and results.

/// Retrieves the immediate origin associated with the given data. The returned
/// origin may point to another origin.
///
/// The type of 'data' is arbitrary. The function accepts a value of any type,
/// which can be truncated or extended (implicitly or explicitly) as necessary.
/// The truncation/extension operations will preserve the label of the original
/// value.
dfsan_origin dfsan_get_origin(long data);

/// Retrieves the very first origin associated with the data at the given
/// address.
dfsan_origin dfsan_get_init_origin(const void *addr);

/// Prints the origin trace of the label at the address `addr` to stderr. It also
/// prints description at the beginning of the trace. If origin tracking is not
/// on, or the address is not labeled, it prints nothing.
void dfsan_print_origin_trace(const void *addr, const char *description);

/// Prints the origin trace of the label at the address `addr` to a pre-allocated
/// output buffer. If origin tracking is not on, or the address is
/// not labeled, it prints nothing.
///
/// `addr` is the tainted memory address whose origin we are printing.
/// `description` is a description printed at the beginning of the trace.
/// `out_buf` is the output buffer to write the results to. `out_buf_size` is
/// the size of `out_buf`. The function returns the number of symbols that
/// should have been written to `out_buf` (not including trailing null byte '\0').
/// Thus, the string is truncated iff return value is not less than `out_buf_size`.
size_t dfsan_sprint_origin_trace(const void *addr, const char *description,
                                 char *out_buf, size_t out_buf_size);

/// Returns the value of `-dfsan-track-origins`.
int dfsan_get_track_origins(void);
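The return-value convention documented for dfsan_sprint_origin_trace is the same one the standard snprintf uses, so the same truncation check applies. The sketch below uses snprintf as a stand-in and makes no DFSan call; the helper name `format_fits` is hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Formats `text` into out_buf and reports whether it fit without truncation.
   snprintf returns the length the full string would have had, excluding the
   trailing '\0', so the output was truncated iff that value >= out_buf_size. */
bool format_fits(const char *text, char *out_buf, size_t out_buf_size) {
  int needed = snprintf(out_buf, out_buf_size, "%s", text);
  assert(needed >= 0);  /* a negative result signals an encoding error */
  return (size_t)needed < out_buf_size;
}
```

With the same convention, a caller that sees a truncated result can retry with a buffer of at least the returned size plus one byte for the terminator.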

The following functions are provided to register hooks called by custom wrappers.

/// Sets a callback to be invoked on calls to `write`.  The callback is invoked
/// before the write is done. The write is not guaranteed to succeed when the
/// callback executes. Pass in NULL to remove any callback.
typedef void (*dfsan_write_callback_t)(int fd, const void *buf, size_t count);
void dfsan_set_write_callback(dfsan_write_callback_t labeled_write_callback);

/// Callbacks to be invoked on calls to `memcmp` or `strncmp`.
void dfsan_weak_hook_memcmp(void *caller_pc, const void *s1, const void *s2,
                            size_t n, dfsan_label s1_label,
                            dfsan_label s2_label, dfsan_label n_label);
void dfsan_weak_hook_strncmp(void *caller_pc, const char *s1, const char *s2,
                             size_t n, dfsan_label s1_label,
                             dfsan_label s2_label, dfsan_label n_label);

Taint label representation

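We use an 8-bit unsigned integer to represent a label. The label identifier 0 is special and means that the data item is unlabelled. This representation is chosen for low CPU and code size overhead in the instrumentation. When a label union operation is requested at a join point (any arithmetic or logical operation with two or more operands, such as addition), we can simply bitwise-OR the two labels in O(1).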

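Users are responsible for managing the 8 integer labels (i.e., keeping track of which labels they have used so far, picking one that is not yet used, and so on).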
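Because a union is a bitwise OR on an 8-bit value, the representation can be modelled in a few lines of plain C. This sketch assumes the eight base labels are the single-bit values 1, 2, 4, ..., 128; the type and function names are hypothetical stand-ins, not the DFSan API.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t label_t;   /* stand-in for the 8-bit dfsan_label */

/* The eight base labels are the eight single-bit values; 0 means unlabelled. */
label_t base_label(unsigned index) {
  assert(index < 8);
  return (label_t)(1u << index);
}

/* Union at a join point is a single bitwise OR, hence O(1). */
label_t label_union(label_t a, label_t b) {
  return (label_t)(a | b);
}

/* Nonzero iff every bit of `elem` is present in `label`. */
int label_has(label_t label, label_t elem) {
  return (label & elem) == elem;
}
```

A client would typically reserve one bit per taint source and test the label of a value at a sink against the bits that must not reach it.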

Origin tracking trace representation

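An origin tracking trace is a list of chains. Each chain holds a stack trace, recorded by the DFSan runtime at a label propagation, and a pointer to its previous chain. The very first chain does not point to any chain.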

Every four 4-byte-aligned application bytes share one 4-byte origin trace ID. A 4-byte origin trace ID contains a 4-bit depth and a 28-bit hash ID of a chain.
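The 4-bit depth and 28-bit chain hash can be packed into one 32-bit word as sketched here. The bit placement (depth in the top four bits) and all names are assumptions made for illustration; the text only fixes the field widths.

```c
#include <assert.h>
#include <stdint.h>

enum { DEPTH_BITS = 4, HASH_BITS = 28 };
#define HASH_MASK ((UINT32_C(1) << HASH_BITS) - 1)

/* Assumed layout: depth in the top 4 bits, chain hash ID in the low 28. */
uint32_t origin_pack(uint32_t depth, uint32_t chain_hash) {
  assert(depth < (UINT32_C(1) << DEPTH_BITS));
  return (depth << HASH_BITS) | (chain_hash & HASH_MASK);
}

uint32_t origin_depth(uint32_t id) { return id >> HASH_BITS; }
uint32_t origin_chain_hash(uint32_t id) { return id & HASH_MASK; }

/* All four bytes of a 4-byte-aligned group share one origin trace ID, so the
   group is identified by the address with its two low bits cleared. */
uintptr_t origin_group(uintptr_t app_addr) {
  return app_addr & ~(uintptr_t)3;
}
```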

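A chain ID is calculated as a hash of the chain structure. A chain structure contains a stack ID and the previous chain ID. The chain head has a previous chain ID of zero. A stack ID is a hash of a stack trace. The 4-bit depth limits the maximum length of a path. The environment variable origin_history_size can set the depth limit. Non-positive values mean unlimited. Its default value is 16. When the limit is reached, origin tracking ignores subsequent propagation chains.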
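The chaining and depth limit can be sketched as follows. This is a simplified model under stated assumptions: the mixing function is a hypothetical FNV-style hash (the runtime's actual hash is not specified here), depth is counted from 1 at the chain head, and a chain that hits the limit is simply returned unchanged.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
  uint32_t depth;     /* number of chains in the trace so far */
  uint32_t chain_id;  /* 28-bit hash of (stack ID, previous chain ID) */
} origin_t;

/* Hypothetical mixer, used only to make the model deterministic. */
static uint32_t mix(uint32_t stack_id, uint32_t prev_id) {
  uint32_t h = UINT32_C(2166136261);
  h = (h ^ stack_id) * UINT32_C(16777619);
  h = (h ^ prev_id) * UINT32_C(16777619);
  return h & UINT32_C(0x0FFFFFFF);  /* keep 28 bits */
}

/* Head of a trace: its previous chain ID is zero. */
origin_t origin_start(uint32_t stack_id) {
  origin_t o = { 1, mix(stack_id, 0) };
  return o;
}

/* Appends a chain unless the depth limit is reached.
   A non-positive limit means unlimited. */
origin_t origin_extend(origin_t prev, uint32_t stack_id, int limit) {
  assert(prev.depth >= 1);
  if (limit > 0 && prev.depth >= (uint32_t)limit)
    return prev;  /* limit reached: this propagation is ignored */
  origin_t o = { prev.depth + 1, mix(stack_id, prev.chain_id) };
  return o;
}
```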

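The first chain of a trace is started by a call to dfsan_set_label with a non-zero label. When -dfsan-track-origins is 1, a new chain is appended at the end of a trace at stores and memory transfers. Memory transfers include LLVM memory transfer instructions and glibc memcpy and memmove. When -dfsan-track-origins is 2, a new chain is also appended at loads.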

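Other instructions do not create new chains; they simply propagate origin trace IDs. If an instruction has more than one operand with a non-zero label, the origin trace ID of the last operand with a non-zero label is propagated to the result of the instruction.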
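The operand-selection rule can be written as a short loop. The struct and function names are hypothetical, and returning 0 when no operand is labelled is an assumption of this sketch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
  uint8_t label;    /* taint label of the operand */
  uint32_t origin;  /* origin trace ID of the operand */
} operand_t;

/* Result origin: that of the last operand whose label is non-zero,
   or 0 (assumed) if no operand carries a label. */
uint32_t propagate_origin(const operand_t *ops, size_t n) {
  assert(ops != NULL || n == 0);
  uint32_t origin = 0;
  for (size_t i = 0; i < n; i++) {
    if (ops[i].label != 0)
      origin = ops[i].origin;  /* later labelled operands override earlier ones */
  }
  return origin;
}
```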

Memory layout and label management

The following is the memory layout for Linux/x86_64:

Start	End	Use
0x700000000000	0x800000000000	application 3
0x610000000000	0x700000000000	unused
0x600000000000	0x610000000000	origin 1
0x510000000000	0x600000000000	application 2
0x500000000000	0x510000000000	shadow 1
0x400000000000	0x500000000000	unused
0x300000000000	0x400000000000	origin 3
0x200000000000	0x300000000000	shadow 3
0x110000000000	0x200000000000	origin 2
0x100000000000	0x110000000000	unused
0x010000000000	0x100000000000	shadow 2
0x000000000000	0x010000000000	application 1

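Each byte of application memory corresponds to a single byte of shadow memory, which stores its taint label. We map application memory, shadow, and origin regions to each other using the following mask and offset: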

shadow_addr = memory_addr ^ 0x500000000000
origin_addr = shadow_addr + 0x100000000000
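The two formulas can be checked against the layout table with a few lines of C. The function and macro names are hypothetical; the constants are the ones given above. The sketch applies the formulas as written and does not model the 4-byte granularity of origin IDs.

```c
#include <assert.h>
#include <stdint.h>

#define SHADOW_XOR_MASK UINT64_C(0x500000000000)
#define ORIGIN_OFFSET   UINT64_C(0x100000000000)

/* shadow_addr = memory_addr ^ 0x500000000000 */
uint64_t shadow_for(uint64_t memory_addr) {
  assert(memory_addr < UINT64_C(0x800000000000));
  return memory_addr ^ SHADOW_XOR_MASK;
}

/* origin_addr = shadow_addr + 0x100000000000 */
uint64_t origin_for(uint64_t memory_addr) {
  return shadow_for(memory_addr) + ORIGIN_OFFSET;
}
```

For example, an address in application region 1 maps into shadow 1 and origin 1, and the start of application region 3 maps to the start of shadow 3.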

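As for LLVM SSA registers, we have not found it necessary to associate a label with each byte or bit of data, as some other tools do. Instead, labels are associated directly with registers. A load yields the union of the shadow labels of all bytes loaded, and a store copies the label of the stored value to the shadow of every byte stored to.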
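One consequence of per-register labels is that byte-level precision is lost when a value passes through a register. The following plain C model (hypothetical names, a 16-byte toy memory) shows a word whose second byte alone is tainted: loading the word gives the register that label, and storing it elsewhere taints all four destination bytes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { MEM_SIZE = 16 };
static uint8_t mem_bytes[MEM_SIZE];   /* toy application memory */
static uint8_t mem_shadow[MEM_SIZE];  /* one label per memory byte */

typedef struct {
  uint32_t value;
  uint8_t label;  /* a single label for the whole register */
} reg_t;

/* Load `size` bytes: the register's label is the union of their shadow bytes. */
reg_t model_load(size_t off, size_t size) {
  assert(size <= sizeof(uint32_t));
  assert(off <= (size_t)MEM_SIZE && size <= (size_t)MEM_SIZE - off);
  reg_t r = { 0, 0 };
  memcpy(&r.value, mem_bytes + off, size);
  for (size_t i = 0; i < size; i++)
    r.label |= mem_shadow[off + i];
  return r;
}

/* Store: the register's single label is copied to every byte written. */
void model_store(reg_t r, size_t off, size_t size) {
  assert(size <= sizeof(uint32_t));
  assert(off <= (size_t)MEM_SIZE && size <= (size_t)MEM_SIZE - off);
  memcpy(mem_bytes + off, &r.value, size);
  memset(mem_shadow + off, r.label, size);
}
```

The trade-off is a conservative over-approximation in exchange for not having to carry one label per byte through every SSA value.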

Propagating labels through arguments

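In order to propagate labels through function arguments and return values, DataFlowSanitizer changes the ABI of each function in the translation unit. Two ABIs are currently supported: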

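Args - Argument and return value labels are passed through additional arguments and by modifying the return type.
TLS - Argument and return value labels are passed through the TLS variables __dfsan_arg_tls and __dfsan_retval_tls.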

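The main advantage of the TLS ABI is that it is more tolerant of ABI mismatches: TLS storage is not shared with any other form of storage, whereas extra arguments may be stored in registers which, under the native ABI, are not used for parameter passing and thus could contain arbitrary values. On the other hand, the args ABI is more efficient and makes ABI mismatches easier to identify, by checking for non-zero labels in nominally unlabelled programs.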
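The difference between the two conventions can be modelled by hand for a two-argument function. This is only an illustration of the idea in C, not what the pass emits: the names are hypothetical, and `model_arg_tls` and `model_retval_tls` are stand-ins for __dfsan_arg_tls and __dfsan_retval_tls.

```c
#include <assert.h>
#include <stdint.h>

typedef uint8_t label_t;

typedef struct {
  int value;
  label_t label;
} labelled_int;

/* Args style: labels travel as extra parameters and in a widened return type. */
labelled_int add_args_abi(int a, int b, label_t a_label, label_t b_label) {
  labelled_int r = { a + b, (label_t)(a_label | b_label) };
  return r;
}

/* TLS style: the signature is unchanged; labels travel in thread-local slots. */
static _Thread_local label_t model_arg_tls[2];
static _Thread_local label_t model_retval_tls;

int add_tls_abi(int a, int b) {
  model_retval_tls = (label_t)(model_arg_tls[0] | model_arg_tls[1]);
  return a + b;
}

/* Caller side of the TLS convention: publish labels, call, read the result label. */
labelled_int call_add_tls(int a, label_t a_label, int b, label_t b_label) {
  model_arg_tls[0] = a_label;
  model_arg_tls[1] = b_label;
  int v = add_tls_abi(a, b);
  labelled_int r = { v, model_retval_tls };
  return r;
}
```

In the TLS variant a native caller that knows nothing about labels can still call `add_tls_abi` safely, which is the tolerance to mismatches described above; in the args variant the same call would pass the wrong number of arguments.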

Implementing the ABI list

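The ABI list provides a list of functions which conform to the native ABI, each of which is callable from an instrumented program. This is achieved by replacing each reference to a native ABI function with a reference to a function which uses the instrumented ABI. Such functions are automatically generated wrappers for the native functions. For example, given the ABI list example provided in the user manual, the following wrappers will be generated under the args ABI: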

define linkonce_odr { i8*, i16 } @"dfsw$malloc"(i64 %0, i16 %1) {
entry:
  %2 = call i8* @malloc(i64 %0)
  %3 = insertvalue { i8*, i16 } undef, i8* %2, 0
  %4 = insertvalue { i8*, i16 } %3, i16 0, 1
  ret { i8*, i16 } %4
}

define linkonce_odr { i32, i16 } @"dfsw$tolower"(i32 %0, i16 %1) {
entry:
  %2 = call i32 @tolower(i32 %0)
  %3 = insertvalue { i32, i16 } undef, i32 %2, 0
  %4 = insertvalue { i32, i16 } %3, i16 %1, 1
  ret { i32, i16 } %4
}

define linkonce_odr { i8*, i16 } @"dfsw$memcpy"(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5) {
entry:
  %labelreturn = alloca i16
  %6 = call i8* @__dfsw_memcpy(i8* %0, i8* %1, i64 %2, i16 %3, i16 %4, i16 %5, i16* %labelreturn)
  %7 = load i16, i16* %labelreturn
  %8 = insertvalue { i8*, i16 } undef, i8* %6, 0
  %9 = insertvalue { i8*, i16 } %8, i16 %7, 1
  ret { i8*, i16 } %9
}
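For readers less familiar with LLVM IR, the first two wrappers above correspond roughly to the following C. This is a hand-written paraphrase with hypothetical names (`wrap_malloc`, `wrap_tolower`, and the struct types), not generated output; the 16-bit label type simply follows the i16 used in the IR.

```c
#include <assert.h>
#include <ctype.h>
#include <stdint.h>
#include <stdlib.h>

typedef uint16_t label_t;  /* i16 in the IR above */

typedef struct { void *ptr; label_t label; } labelled_ptr;
typedef struct { int value; label_t label; } labelled_int;

/* Mirrors dfsw$malloc: the size label is ignored and the result label is 0. */
labelled_ptr wrap_malloc(size_t size, label_t size_label) {
  (void)size_label;
  labelled_ptr r = { malloc(size), 0 };
  return r;
}

/* Mirrors dfsw$tolower: the result carries the argument's label. */
labelled_int wrap_tolower(int c, label_t c_label) {
  labelled_int r = { tolower(c), c_label };
  return r;
}
```

The memcpy wrapper differs in that it forwards the labels to a hand-written runtime function and reads the result label back through an out-parameter.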

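As an optimization, direct calls to native ABI functions call the native ABI function directly, and the pass computes the appropriate label internally. This reduces the number of union operations required when the return value label is known to be zero (i.e. discard functions, or functional functions with known unlabelled arguments).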

Checking ABI Consistency

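DFSan changes the ABI of each function in the module. This makes it possible for a function with the native ABI to be called with the instrumented ABI, or vice versa, thus possibly invoking undefined behavior. A simple way of statically detecting instances of this problem is to append the suffix ".dfsan" to the name of each instrumented-ABI function.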

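This will not catch every such problem; in particular, function pointers passed across the instrumented-native barrier cannot be used on the other side. These problems could potentially be caught dynamically.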