KMview kernel module interface

From Virtualsquare
Jump to: navigation, search
kmview_module interface
Renzo Davoli and Andrea Gasparini

kmview_module has been designed as a kernel support for view-os on linux but
it is effectively an efficient support for any virtualization based on
system call interception and transformation.

kmview_module could be effectively used also to security tools based on
system call interposition.

There are two main entities in kmview: tracer and traced processes.
A tracer process cannot trace itself but it can be a traced process of
another tracer.

All the traced processes are in the offspring of their tracer,
when a process is traced there is no way to exit from the control
of the tracer.

A tracer process first open a read only connection to /dev/kmview
(major=10,minor=233, officially assigned)

Before starting its first traced process, the tracer can set some flags
to set some extra features in this way:

ioctl(fd, KMVIEW_SET_FLAGS, flags);

This ioctl must be called when there are no traced processes otherwise
it returns EACCES (to prevent inconsistencies).

A "root" traced process is started in this way:

  if (fork() == 0) {
    ioctl(fd, KMVIEW_ATTACH);
    ..... code of the traced process, e.g. exec of some program 

The root traced process must register itself as a traced process and close
the tracing file. If the traced process forks (or clones) other processes
they will be traced, too. No further direct interaction will take place
between the traced process and their tracer. 

If a tracer dies (or it closes the fd) all the traced processes will be
killed (SIGKILL).

A tracer receive all events related to its traced processes using
a "read" or by the magicpoll technique (see over).
The received data follows the struct kmview_event specification:

  struct kmview_event {
    unsigned long tag;
    union {
      .... data for specific events ...

There are four basic events identified by the following tags:
KMVIEW_EVENT_NEWTHREAD: a new traced thread/process has just started
KMVIEW_EVENT_TERMTHREAD: a new traced thread/process terminated
KMVIEW_EVENT_SYSCALL_ENTRY: a traced process started a syscall
KMVIEW_EVENT_SYSCALL_EXIT: a syscall for a traced process completed its

In order to provide a fast interaction between kernel, module and tracer
each layer keeps its own id for processes.
The kernel identifies each process by its pid, the module has its own
identifier named kmpid and the tracer can use its own identified, the
umpid. (km stands for kernel-mode, um stands for user-mode).
In this way each layer can use its identifier as an index within an array:
there is not any waste of time to scan into tables or waste of code to
keep hash tables. Technically speaking the whole system scales as O(1)
(no extra costs related to the number of processes).

All the events reported by the module to the tracer carry the umpid, (except
KMVIEW_EVENT_NEWTHREAD). All the requests sent (through ioctl) from the
tracer to the module carry the kmpid. If a tracer tries to send an ioctl
for a process handled by another tracer it gets an error (EPERM).
(a process handled by several nested tracers has a different kmpid for
each tracer).

Basic Events:
struct kmview_event_newthread{
  pid_t kmpid;
  pid_t pid;
  pid_t umppid;
  unsigned long flags;
} newthread;

A new thread has just started. The tracer must store its kmpid.
umppid is the umpid of the parent (forking/cloning) process: the tracer can
use this field to keep trace of the hierarchy in its data structures.
flags are the cloning flags (as described in clone(2)).
Before reading other events the tracer must send the umpid of this new
thread to the module in this way:

struct kmview_ioctl_umpid {
    pid_t kmpid;
    pid_t umpid;
ioctl(fd, KMVIEW_UMPID, & {struct kmview_ioctl_umpid var} );

If a tracer wants to use pid or kmpid instead of having its own identifiers
it should copy pid or kmpid respectively to the umpid field.

struct kmview_event_termthread{
  pid_t umpid;
  unsigned long remaining;
} termthread;

The process/thread identified by umpid terminated. No further event will be
reported for that process/thread.
The field remaining contains the overall number of processes handled by
this tracer. Many tracers shut down when remaining==0;

struct kmview_event_ioctl_syscall{
  union {
    pid_t umpid;
    pid_t kmpid;
    unsigned long just_for_64bit_alignment;
  } x;
  unsigned long scno;
  unsigned long args[6];
  unsigned long pc;
  unsigned long sp;
} syscall;

(forget the just_for_64bit_alignment field, it is not used.)
x.umpid is the umpid identifier, scno is the syscall number, args are
the arguments, pc is the program counter and sp the stack pointer.
When the tracer receive the event the traced process is
quiescent (as defined in utrace): it is waiting in a state very close to
the user state. While a kmview traced process is quiescent the process
can be restarted by its tracer or killed.

There are three different ways to restart a quiescent process for

the syscall gets retarted as is. The tracer will not receive any
KMVIEW_EVENT_SYSCALL_EXIT event for this call.

ioctl(fd,KMVIEW_SYSVIRTUALIZED,& {struct kmview_event_ioctl_sysreturn var})
struct kmview_event_ioctl_sysreturn{
  union {
    pid_t umpid;
    pid_t kmpid;
    unsigned long just_for_64bit_alignment;
  } x;
  long retval;
  long errno;
} sysreturn;

the call has been virtualized. This system call will not be executed by
the linux kernel. Kmpid must be set and the return value, errno will be
those specified here.
The tracer will not receive any KMVIEW_EVENT_SYSCALL_EXIT event for this call.

ioctl(fd,KMVIEW_SYSMODIFIED,& {struct kmview_event_ioctl_syscall var})
The call may have been modified. The kernel will execute the syscall
(maybe a different one) as stated by the registers.
Registers, scno, will be changed.
This call cause a KMVIEW_EVENT_SYSCALL_EXIT event after the syscall
execution (to restore the original values of registers if needed).

struct kmview_event_ioctl_sysreturn syscall;

With this event the tracer can get the result (return value or error)
of the syscall executed by the linux kernel.
The traced process is quiescent when the tracer receives the event.
To restart the process the tracer can use KMVIEW_SYSRESUME or
with the same syntax described above for KMVIEW_EVENT_SYSCALL_ENTRY or
ioctl(fd,KMVIEW_SYSRETURN,& {struct kmview_event_ioctl_sysreturn var})
by this latter call, return value and errno can be changed.

A minimal tracer appear as follows:
#include <kmview.h>

void dowait(int signal)
    int w;

main(int argc, char *argv[])
  int fd;
  struct kmview_event event;
  if (fork()) {
    while (1) {
      switch (event.tag) {
            struct kmview_ioctl_umpid ump;
            printf("new process %d\n",;
            /* we use umpid == kmpid */
            ioctl(fd, KMVIEW_UMPID, &ump);
          printf("Terminated proc %d (%d left)\n",
          if (event.x.termthread.remaining == 0)
            exit (0);
          printf("Syscall %d->%d\n",
          ioctl(fd, KMVIEW_SYSRESUME, event.x.syscall.x.umpid);
  } else { /* traced root process*/
    ioctl(fd, KMVIEW_ATTACH);


When a tracer is a virtual machine monitor (an hypervisor, leaning the word
from xen), often it does not keep waiting on a read as there are many
source of events, file descriptors or signals.

If the hypervisor uses a ppoll it can wake up as soon as something happens.
Unfortunately this means that for the standard virtualization cycle it
needs two context switches to get the system call (or other event) from
the kmview module.

It the tracer sends the address of a buffer by the KMVIEW_MAGICPOLL ioctl,
any select/poll like call will direcly tranfer the event to the buffer,
thus when the return value of poll/select returns the availability of data
for reading (e.g. POLLIN for poll), the data is already in the buffer.
This reduces the number of context switches as there is no need to call

In order to further decrease the number of context switches per system
call, it is possible to use an array of struct kmview_event as a
magicpoll buffer.
If there are several pending events (at most one per traced process),
the kmview module fills in several elements of the array.

The magic poll ioctl is the following one:
struct kmview_magicpoll {
  long magicpoll_addr;
  long magicpoll_cnt;

ioctl(fd,KMVIEW_MAGICPOLL,& {struct kmview_magicpoll var} );

magicpoll_addr is the address of the buffer, magicpoll_cnt is the number of
elements in the array.
When poll returns either the array is full of pending events or
the array element after the last significant pending event is tagged as

Linux supports the Berkeley socket interface, but in many architectures
instead of defining several different system calls (e.g. one for socket(2),
one for connect, listen, accept etc.) is has just one system call
(__NR_socketcall) with two parameters: the number of the call
(as defined in /usr/include/linux/net.h) and a pointer to the
array of parameters.
It is the case of several widely used architectures like i386 or powerpc.
Other architectures like x86_64, ia64 or alpha, has several system calls
one for each Berkeley socket call.

To speed up the virtualization on architectures with __NR_socketcall,
kmview provides the KMVIEW_FLAG_SOCKETCALL option.

When KMVIEW_FLAG_SOCKETCALL flag is set by KMVIEW_SET_FLAGS, the tracer
receives a KMVIEW_EVENT_SOCKETCALL_ENTRY instead of the event
KMVIEW_EVENT_SYSCALL_ENTRY when the system call __NR_socketcall is

The kmview_event_socketcall structure is the following one:
struct kmview_event_socketcall{
  union {
    pid_t umpid;
    unsigned long just_for_64bit_alignment;
  } x;
  unsigned long scno;
  unsigned long args[6];
  unsigned long pc;
  unsigned long sp;
  unsigned long addr;
} socketcall;

scno is the number of the socket call (the number of the call listed in
/usr/include/linux/net.h), args are the socket call args, pc and sp as
usual, the final addr is the address of the argument array.

This prevents the hypervisor from spending one more context switch to grab
the parameters.

The system call can be restarted by KMVIEW_SYSRESUME, KMVIEW_SYSVIRTUALIZED.
Currently the module does not provide any KMVIEW_FLAG_SOCKETMODIFIED call:
it is possible to use KMVIEW_FLAG_SYSMODIFIED to start another system call
instead of a socket call, parameters of the same socket call can be changed
by rewriting them using addr.
Changing a socket call with another is rare, and must be done by hand
carefully as the buffer pointed by addr can have not enough space to fit
the new arguments.


There are several system calls that have a file descriptor in the
first argument. It is for example the case of read, write, fstat etc.
When a virtual machine monitor virtualize just some files it is useless
to notify the tracer/monitor for fd relatied to non virtualized files.

all fd related calls are not notified to the tracer by default.

The tracer can add and delete file descriptors to the set of
traced fd by using the following calls:
  struct kmview_fd {
    pid_t kmpid;
    int fd;
ioctl(fd,KMVIEW_ADDFD,& {struct kmview_fd var});
ioctl(fd,KMVIEW_DELFD,& {struct kmview_fd var});

The tracer receives system call (and socket call) events only when
the first argument file descriptor belongs to the set of traced fd.

The set of traced fd is automatically inherited during a clone/fork of
a traced process.

There are two flags that can be specified together with KMVIEW_FLAG_FDSET:
These are notable exception that the tracer may specify.
When KMVIEW_FLAG_EXCEPT_CLOSE, the close system call (as well as shutdown)
always cause an event to the tracer. KMVIEW_FLAG_EXCEPT_FCHDIR has the
same effect for fchdir.
These two calls cause a change of the system state: some virtual machine
monitors (like our kmview) need to trace these changes.
Personal tools