KMview kernel module interface
From Virtualsquare
kmview_module interface ----------------------- Renzo Davoli and Andrea Gasparini kmview_module has been designed as a kernel support for view-os on linux but it is effectively an efficient support for any virtualization based on system call interception and transformation. kmview_module could be effectively used also to security tools based on system call interposition. There are two main entities in kmview: tracer and traced processes. A tracer process cannot trace itself but it can be a traced process of another tracer. All the traced processes are in the offspring of their tracer, when a process is traced there is no way to exit from the control of the tracer. A tracer process first open a read only connection to /dev/kmview (major=10,minor=233, officially assigned) fd=open("/dev/kmview",O_RDONLY); Before starting its first traced process, the tracer can set some flags to set some extra features in this way: ioctl(fd, KMVIEW_SET_FLAGS, flags); This ioctl must be called when there are no traced processes otherwise it returns EACCES (to prevent inconsistencies). A "root" traced process is started in this way: if (fork() == 0) { ioctl(fd, KMVIEW_ATTACH); close(fd); ..... code of the traced process, e.g. exec of some program } The root traced process must register itself as a traced process and close the tracing file. If the traced process forks (or clones) other processes they will be traced, too. No further direct interaction will take place between the traced process and their tracer. If a tracer dies (or it closes the fd) all the traced processes will be killed (SIGKILL). A tracer receive all events related to its traced processes using a "read" or by the magicpoll technique (see over). The received data follows the struct kmview_event specification: struct kmview_event { unsigned long tag; union { .... data for specific events ... } } There are four basic events identified by the following tags: KMVIEW_EVENT_NEWTHREAD: a new traced thread/process has just started KMVIEW_EVENT_TERMTHREAD: a new traced thread/process terminated KMVIEW_EVENT_SYSCALL_ENTRY: a traced process started a syscall KMVIEW_EVENT_SYSCALL_EXIT: a syscall for a traced process completed its execution. In order to provide a fast interaction between kernel, module and tracer each layer keeps its own id for processes. The kernel identifies each process by its pid, the module has its own identifier named kmpid and the tracer can use its own identified, the umpid. (km stands for kernel-mode, um stands for user-mode). In this way each layer can use its identifier as an index within an array: there is not any waste of time to scan into tables or waste of code to keep hash tables. Technically speaking the whole system scales as O(1) (no extra costs related to the number of processes). All the events reported by the module to the tracer carry the umpid, (except KMVIEW_EVENT_NEWTHREAD). All the requests sent (through ioctl) from the tracer to the module carry the kmpid. If a tracer tries to send an ioctl for a process handled by another tracer it gets an error (EPERM). (a process handled by several nested tracers has a different kmpid for each tracer). Basic Events: ------------------------------------------------------------------------------ KMVIEW_EVENT_NEWTHREAD: struct kmview_event_newthread{ pid_t kmpid; pid_t pid; pid_t umppid; unsigned long flags; } newthread; A new thread has just started. The tracer must store its kmpid. umppid is the umpid of the parent (forking/cloning) process: the tracer can use this field to keep trace of the hierarchy in its data structures. flags are the cloning flags (as described in clone(2)). Before reading other events the tracer must send the umpid of this new thread to the module in this way: struct kmview_ioctl_umpid { pid_t kmpid; pid_t umpid; }; ioctl(fd, KMVIEW_UMPID, & {struct kmview_ioctl_umpid var} ); If a tracer wants to use pid or kmpid instead of having its own identifiers it should copy pid or kmpid respectively to the umpid field. ------------------------------------------------------------------------------ KMVIEW_EVENT_TERMTHREAD: struct kmview_event_termthread{ pid_t umpid; unsigned long remaining; } termthread; The process/thread identified by umpid terminated. No further event will be reported for that process/thread. The field remaining contains the overall number of processes handled by this tracer. Many tracers shut down when remaining==0; ------------------------------------------------------------------------------ KMVIEW_EVENT_SYSCALL_ENTRY: struct kmview_event_ioctl_syscall{ union { pid_t umpid; pid_t kmpid; unsigned long just_for_64bit_alignment; } x; unsigned long scno; unsigned long args[6]; unsigned long pc; unsigned long sp; } syscall; (forget the just_for_64bit_alignment field, it is not used.) x.umpid is the umpid identifier, scno is the syscall number, args are the arguments, pc is the program counter and sp the stack pointer. When the tracer receive the event the traced process is quiescent (as defined in utrace): it is waiting in a state very close to the user state. While a kmview traced process is quiescent the process can be restarted by its tracer or killed. There are three different ways to restart a quiescent process for a KMVIEW_EVENT_SYSCALL_ENTRY event: 1- KMVIEW_SYSRESUME: ioctl(fd,KMVIEW_SYSRESUME,kmpid). the syscall gets retarted as is. The tracer will not receive any KMVIEW_EVENT_SYSCALL_EXIT event for this call. 2- KMVIEW_SYSVIRTUALIZED: ioctl(fd,KMVIEW_SYSVIRTUALIZED,& {struct kmview_event_ioctl_sysreturn var}) struct kmview_event_ioctl_sysreturn{ union { pid_t umpid; pid_t kmpid; unsigned long just_for_64bit_alignment; } x; long retval; long errno; } sysreturn; the call has been virtualized. This system call will not be executed by the linux kernel. Kmpid must be set and the return value, errno will be those specified here. The tracer will not receive any KMVIEW_EVENT_SYSCALL_EXIT event for this call. 3- KMVIEW_SYSMODIFIED: ioctl(fd,KMVIEW_SYSMODIFIED,& {struct kmview_event_ioctl_syscall var}) The call may have been modified. The kernel will execute the syscall (maybe a different one) as stated by the registers. Registers, scno, will be changed. This call cause a KMVIEW_EVENT_SYSCALL_EXIT event after the syscall execution (to restore the original values of registers if needed). ------------------------------------------------------------------------------ KMVIEW_EVENT_SYSCALL_EXIT: struct kmview_event_ioctl_sysreturn syscall; With this event the tracer can get the result (return value or error) of the syscall executed by the linux kernel. The traced process is quiescent when the tracer receives the event. To restart the process the tracer can use KMVIEW_SYSRESUME or with the same syntax described above for KMVIEW_EVENT_SYSCALL_ENTRY or KMVIEW_SYSRETURN: ioctl(fd,KMVIEW_SYSRETURN,& {struct kmview_event_ioctl_sysreturn var}) by this latter call, return value and errno can be changed. ------------------------------------------------------------------------------ A minimal tracer appear as follows: #include <kmview.h> void dowait(int signal) { int w; wait(&w); } main(int argc, char *argv[]) { int fd; struct kmview_event event; fd=open("/dev/kmview",O_RDONLY); signal(SIGCHLD,dowait); if (fork()) { while (1) { read(fd,&event,sizeof(event)); switch (event.tag) { case KMVIEW_EVENT_NEWTHREAD: { struct kmview_ioctl_umpid ump; printf("new process %d\n",event.x.newthread.pid); ump.kmpid=event.x.newthread.kmpid; /* we use umpid == kmpid */ ump.umpid=event.x.newthread.kmpid; ioctl(fd, KMVIEW_UMPID, &ump); break; } case KMVIEW_EVENT_TERMTHREAD: printf("Terminated proc %d (%d left)\n", event.x.termthread.umpid, event.x.termthread.remaining); if (event.x.termthread.remaining == 0) exit (0); break; case KMVIEW_EVENT_SYSCALL_ENTRY: printf("Syscall %d->%d\n", event.x.syscall.x.umpid, event.x.syscall.scno); ioctl(fd, KMVIEW_SYSRESUME, event.x.syscall.x.umpid); break; } } } else { /* traced root process*/ ioctl(fd, KMVIEW_ATTACH); close(fd); argv++; execvp(argv[0],argv); } } ------------------------------------------------------------------------------ MAGICPOLL When a tracer is a virtual machine monitor (an hypervisor, leaning the word from xen), often it does not keep waiting on a read as there are many source of events, file descriptors or signals. If the hypervisor uses a ppoll it can wake up as soon as something happens. Unfortunately this means that for the standard virtualization cycle it needs two context switches to get the system call (or other event) from the kmview module. It the tracer sends the address of a buffer by the KMVIEW_MAGICPOLL ioctl, any select/poll like call will direcly tranfer the event to the buffer, thus when the return value of poll/select returns the availability of data for reading (e.g. POLLIN for poll), the data is already in the buffer. This reduces the number of context switches as there is no need to call read. In order to further decrease the number of context switches per system call, it is possible to use an array of struct kmview_event as a magicpoll buffer. If there are several pending events (at most one per traced process), the kmview module fills in several elements of the array. The magic poll ioctl is the following one: struct kmview_magicpoll { long magicpoll_addr; long magicpoll_cnt; }; ioctl(fd,KMVIEW_MAGICPOLL,& {struct kmview_magicpoll var} ); magicpoll_addr is the address of the buffer, magicpoll_cnt is the number of elements in the array. When poll returns either the array is full of pending events or the array element after the last significant pending event is tagged as KMVIEW_EVENT_NONE (0). ------------------------------------------------------------------------------ KMVIEW_FLAG_SOCKETCALL Linux supports the Berkeley socket interface, but in many architectures instead of defining several different system calls (e.g. one for socket(2), one for connect, listen, accept etc.) is has just one system call (__NR_socketcall) with two parameters: the number of the call (as defined in /usr/include/linux/net.h) and a pointer to the array of parameters. It is the case of several widely used architectures like i386 or powerpc. Other architectures like x86_64, ia64 or alpha, has several system calls one for each Berkeley socket call. To speed up the virtualization on architectures with __NR_socketcall, kmview provides the KMVIEW_FLAG_SOCKETCALL option. When KMVIEW_FLAG_SOCKETCALL flag is set by KMVIEW_SET_FLAGS, the tracer receives a KMVIEW_EVENT_SOCKETCALL_ENTRY instead of the event KMVIEW_EVENT_SYSCALL_ENTRY when the system call __NR_socketcall is starting. The kmview_event_socketcall structure is the following one: struct kmview_event_socketcall{ union { pid_t umpid; unsigned long just_for_64bit_alignment; } x; unsigned long scno; unsigned long args[6]; unsigned long pc; unsigned long sp; unsigned long addr; } socketcall; scno is the number of the socket call (the number of the call listed in /usr/include/linux/net.h), args are the socket call args, pc and sp as usual, the final addr is the address of the argument array. This prevents the hypervisor from spending one more context switch to grab the parameters. The system call can be restarted by KMVIEW_SYSRESUME, KMVIEW_SYSVIRTUALIZED. Currently the module does not provide any KMVIEW_FLAG_SOCKETMODIFIED call: it is possible to use KMVIEW_FLAG_SYSMODIFIED to start another system call instead of a socket call, parameters of the same socket call can be changed by rewriting them using addr. Changing a socket call with another is rare, and must be done by hand carefully as the buffer pointed by addr can have not enough space to fit the new arguments. ------------------------------------------------------------------------------ KMVIEW_FLAG_FDSET There are several system calls that have a file descriptor in the first argument. It is for example the case of read, write, fstat etc. When a virtual machine monitor virtualize just some files it is useless to notify the tracer/monitor for fd relatied to non virtualized files. When KMVIEW_FLAG_FDSET is set by KMVIEW_SET_FLAGS ioctl, all fd related calls are not notified to the tracer by default. The tracer can add and delete file descriptors to the set of traced fd by using the following calls: struct kmview_fd { pid_t kmpid; int fd; }; ioctl(fd,KMVIEW_ADDFD,& {struct kmview_fd var}); ioctl(fd,KMVIEW_DELFD,& {struct kmview_fd var}); The tracer receives system call (and socket call) events only when the first argument file descriptor belongs to the set of traced fd. The set of traced fd is automatically inherited during a clone/fork of a traced process. There are two flags that can be specified together with KMVIEW_FLAG_FDSET: KMVIEW_FLAG_EXCEPT_CLOSE and KMVIEW_FLAG_EXCEPT_FCHDIR. These are notable exception that the tracer may specify. When KMVIEW_FLAG_EXCEPT_CLOSE, the close system call (as well as shutdown) always cause an event to the tracer. KMVIEW_FLAG_EXCEPT_FCHDIR has the same effect for fchdir. These two calls cause a change of the system state: some virtual machine monitors (like our kmview) need to trace these changes.