【轉】Docker 生產環境之安全性 - 適用於 Docker 的 Seccomp 安全配置檔案
安全計算模式(secure computing mode,seccomp
)是 Linux 核心功能。可以使用它來限制容器內可用的操作。seccomp()
系統呼叫在呼叫程序的 seccomp
狀態下執行。可以使用此功能來限制你的應用程式的訪問許可權。
只有在使用 seccomp
構建 Docker 並且核心配置了 CONFIG_SECCOMP
的情況下,此功能才可用。要檢查你的核心是否支援 seccomp
:
$ cat /boot/config-`uname -r` | grep CONFIG_SECCOMP=
CONFIG_SECCOMP=y
- 1
- 2
注意:
seccomp
配置檔案需要 seccomp 2.2.1,這在 Ubuntu 14.04,Debian Wheezy 或 Debian Jessie 中不可用。要在這些發行版上使用seccomp
,必須下載 最新的靜態 Linux 二進位制檔案(而不是軟體包)。
1. 為容器傳遞配置檔案
預設的 seccomp
配置檔案為使用 seccomp
執行容器提供了一個合理的設定,並禁用了大約 44 個超過 300+ 的系統呼叫。它具有適度的保護性,同時提供廣泛的應用相容性。預設的 Docker 配置檔案可以在
實際上,該配置檔案是白名單,預設情況下阻止訪問所有的系統呼叫,然後將特定的系統呼叫列入白名單。該配置檔案工作時需要定義 SCMP_ACT_ERRNO
的 defaultAction
並僅針對特定的系統呼叫覆蓋該 action
。SCMP_ACT_ERRNO
的影響是觸發 Permission Denied
錯誤。接下來,配置檔案中通過將 action
被覆蓋為 SCMP_ACT_ALLOW
,定義一個完全允許的系統呼叫的特定列表。最後,一些特定規則適用於個別的系統呼叫,如 personality
socket
,socketcall
等,以允許具有特定引數的那些系統呼叫的變體(to allow variants of those system calls with specific arguments)。
seccomp
有助於以最小許可權執行 Docker 容器。不建議更改預設的 seccomp
配置檔案。
執行容器時,如果沒有通過 --security-opt
選項覆蓋容器,則會使用預設配置。例如,以下顯式指定了一個策略:
$ docker run --rm \
-it \
--security-opt seccomp=/path/to/seccomp/profile.json \
hello-world
- 1
- 2
- 3
- 4
1.1 預設配置檔案阻止的重要的系統呼叫
Docker 的預設 seccomp
配置檔案是一個白名單,它指定了允許的呼叫。下表列出了由於不在白名單而被有效阻止的重要(但不是全部)系統呼叫。該表包含每個系統呼叫被阻止的原因。
Syscall | Description |
---|---|
acct | Accounting syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_PACCT. |
add_key | Prevent containers from using the kernel keyring, which is not namespaced. |
adjtimex | Similar to clock_settime and settimeofday, time/date is not namespaced. Also gated by CAP_SYS_TIME. |
bpf | Deny loading potentially persistent bpf programs into kernel, already gated by CAP_SYS_ADMIN. |
clock_adjtime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
clock_settime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
clone | Deny cloning new namespaces. Also gated by CAP_SYS_ADMIN for CLONE_* flags, except CLONE_USERNS. |
create_module | Deny manipulation and functions on kernel modules. Obsolete. Also gated by CAP_SYS_MODULE. |
delete_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
finit_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
get_kernel_syms | Deny retrieval of exported kernel and module symbols. Obsolete. |
get_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
init_module | Deny manipulation and functions on kernel modules. Also gated by CAP_SYS_MODULE. |
ioperm | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
iopl | Prevent containers from modifying kernel I/O privilege levels. Already gated by CAP_SYS_RAWIO. |
kcmp | Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE. |
kexec_file_load | Sister syscall of kexec_load that does the same thing, slightly different arguments. Also gated by CAP_SYS_BOOT. |
kexec_load | Deny loading a new kernel for later execution. Also gated by CAP_SYS_BOOT. |
keyctl | Prevent containers from using the kernel keyring, which is not namespaced. |
lookup_dcookie | Tracing/profiling syscall, which could leak a lot of information on the host. Also gated by CAP_SYS_ADMIN. |
mbind | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
mount | Deny mounting, already gated by CAP_SYS_ADMIN. |
move_pages | Syscall that modifies kernel memory and NUMA settings. |
name_to_handle_at | Sister syscall to open_by_handle_at. Already gated by CAP_SYS_NICE. |
nfsservctl | Deny interaction with the kernel nfs daemon. Obsolete since Linux 3.1. |
open_by_handle_at | Cause of an old container breakout. Also gated by CAP_DAC_READ_SEARCH. |
perf_event_open | Tracing/profiling syscall, which could leak a lot of information on the host. |
personality | Prevent container from enabling BSD emulation. Not inherently dangerous, but poorly tested, potential for a lot of kernel vulns. |
pivot_root | Deny pivot_root, should be privileged operation. |
process_vm_readv | Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE. |
process_vm_writev | Restrict process inspection capabilities, already blocked by dropping CAP_PTRACE. |
ptrace | Tracing/profiling syscall, which could leak a lot of information on the host. Already blocked by dropping CAP_PTRACE. |
query_module | Deny manipulation and functions on kernel modules. Obsolete. |
quotactl | Quota syscall which could let containers disable their own resource limits or process accounting. Also gated by CAP_SYS_ADMIN. |
reboot | Don’t let containers reboot the host. Also gated by CAP_SYS_BOOT. |
request_key | Prevent containers from using the kernel keyring, which is not namespaced. |
set_mempolicy | Syscall that modifies kernel memory and NUMA settings. Already gated by CAP_SYS_NICE. |
setns | Deny associating a thread with a namespace. Also gated by CAP_SYS_ADMIN. |
settimeofday | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
socket, socketcall | Used to send or receive packets and for other socket operations. All socket and socketcall calls are blocked except communication domains AF_UNIX, AF_INET, AF_INET6, AF_NETLINK, and AF_PACKET. |
stime | Time/date is not namespaced. Also gated by CAP_SYS_TIME. |
swapon | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
swapoff | Deny start/stop swapping to file/device. Also gated by CAP_SYS_ADMIN. |
sysfs | Obsolete syscall. |
_sysctl | Obsolete, replaced by /proc/sys. |
umount | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
umount2 | Should be a privileged operation. Also gated by CAP_SYS_ADMIN. |
unshare | Deny cloning new namespaces for processes. Also gated by CAP_SYS_ADMIN, with the exception of unshare –user. |
uselib | Older syscall related to shared libraries, unused for a long time. |
userfaultfd | Userspace page fault handling, largely needed for process migration. |
ustat | Obsolete syscall. |
vm86 | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
vm86old | In kernel x86 real mode virtual machine. Also gated by CAP_SYS_ADMIN. |
2. 不使用預設的 seccomp 配置檔案
可以傳遞 unconfined
以執行沒有預設 seccomp
配置檔案的容器。
$ docker run --rm -it --security-opt seccomp=unconfined debian:jessie \
unshare --map-root-user --user sh -c whoami