Experimental 0.8 View-OS

From Virtualsquare
Revision as of 18:42, 27 December 2012 by Renzo

The 0.8 experimental version is a major update of the View-OS code. (Some notes refer to this version as 0.9; it was later renumbered to 0.8.)

It is currently available in our Sourceforge svn repository under the branch rd235.

The experimental version has been merged into the mainstream code. All the features described here are already in View-OS source code.

New features for users

The concepts of service module code and position do not exist any more. Modules have names and their position is not relevant.

um_mov_service and um_lock_service have been eliminated. The new syntax of um_add_service is:

 um_add_service [-p] service_name_or_path

e.g.

 $ um_add_service umfuse

-p means permanent: a permanent module cannot be unloaded. The module (which is a shared library) can be specified by pathname or by name; in the latter case the file name.so is searched for in a standard sequence of directories.
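
The lookup rule above can be sketched as follows. This is only an illustration: the function name module_candidate, the rule "an argument containing a slash is a pathname" and the directory list passed by the caller are assumptions, since the text does not say how um_add_service tells a path from a name nor which directories it searches.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes into buf the i-th candidate file for the argument of um_add_service.
 * Returns 1 if a candidate was produced, 0 when the candidates are exhausted.
 * Assumption: an argument containing '/' is a pathname, anything else is a
 * module name to be completed as DIR/name.so. */
int module_candidate(const char *arg, const char *const *dirs, size_t ndirs,
                     size_t i, char *buf, size_t buflen)
{
    assert(arg != NULL && buf != NULL && buflen > 0);
    if (strchr(arg, '/') != NULL) {
        /* A pathname has exactly one candidate: the path itself. */
        if (i > 0)
            return 0;
        snprintf(buf, buflen, "%s", arg);
        return 1;
    }
    if (i >= ndirs)
        return 0;
    /* A bare name: try name.so in the i-th directory of the sequence. */
    snprintf(buf, buflen, "%s/%s.so", dirs[i], arg);
    return 1;
}
```

A caller would try the candidates in order (i = 0, 1, ...) and load the first file that exists.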

um_ls_service has a new output:

 $ um_ls_service 
 umproc: /proc virtualization
 umfuse: virtual file systems (user level FUSE)
 umnet: virtual (multi-stack) networking
 umdev: virtual devices

um_del_service has a very simple syntax:

 $ um_del_service umfuse

The command needs exactly one argument: the name of the module to delete (i.e. the name before the colon ':' in the output of um_ls_service).

There are new commands:

  • viewsu: the viewos counterpart of su.
 
  $ viewsu
  # whoami
  root
  # exit
  $ viewsu bin
  $ whoami
  bin
  $

viewsu does not ask for a password, as the new identity is effective in the virtual environment only.

  • viewsudo: sudo for viewos.

  $ viewsudo whoami
  root
  $ viewsudo -u bin whoami
  bin
  $

  • viewmount/viewumount: almost the same syntax as mount/umount. mount and umount run setuid root outside View-OS and perform some extra checks that can sometimes cause problems inside View-OS; viewmount and viewumount are not setuid and skip those checks.

The new module umproc creates virtual files in /proc. Currently it redefines /proc/mounts. When umproc is loaded, /proc/mounts lists the current virtual mount table. If [uk]mview runs several nested virtual machines (e.g. umview in umview), /proc/mounts provides each one with the current status of that machine.

Example:

  $ um_add_service umproc
  $ um_add_service unreal
  $ um_add_service umfuse
  $ mount -t umfuseext2 -o ro unreal/tmp/linux.img /mnt
  $ cat /proc/mounts
  rootfs / rootfs rw 0 0
  /dev/root / ext3 rw,errors=remount-ro,data=ordered 0 0
  tmpfs /lib/init/rw tmpfs rw,nosuid,mode=755 0 0
  ..
  /dev/sda6 /home ext3 rw,errors=continue,data=ordered 0 0
  none /proc/mounts proc ro 0 14
  / /unreal unreal rw 0 16
  / /unreal unreal rw 0 17
  /unreal/tmp/linux.img /mnt umfuseext2 ro 0 21
  $

The last four lines of the example are the virtual mounts: umproc itself, the two layers of the unreal module (which duplicates the file system in /unreal and /unreal/unreal), and the mounted umfuse file system.

Modules can be loaded and unloaded in each submachine. The modules already loaded when a submachine starts are inherited; any module loaded after the activation of a submachine affects that submachine only. Two submachines can independently load the same module. um_ls_service lists the modules loaded in the current submachine, so its output may differ from one submachine to another.

Example:

 $ umview xterm

in the new xterm

 $ um_add_service umproc
 $ umview xterm &
 $ um_add_service umnet
 $ um_ls_service
 umproc: /proc virtualization
 umnet: virtual (multi-stack) networking

in the other xterm (activated by the second command)

 $ um_ls_service
 umproc: /proc virtualization
 $ um_add_service umfuse
 $ um_ls_service
 umproc: /proc virtualization
 umfuse: virtual file systems (user level FUSE)
 $ um_add_service umnet
 $ um_ls_service
 umproc: /proc virtualization
 umfuse: virtual file systems (user level FUSE)
 umnet: virtual (multi-stack) networking

umproc gets inherited while the other modules have been loaded independently in the two submachines.

New core-module interface

The new version of View-OS includes a major update of the core-to-module API. Here core means the partial virtual machine monitor, i.e. umview or kmview.

This change is related to the new module selection algorithm. Each module registers its range of control with the core. Each system call gets dispatched to the right module with almost no interaction between the core and the module.

A module must define a global variable of type struct service. This structure must be named viewos_service or redefined by the VIEWOS_SERVICE macro. Either

 struct service viewos_service;

or

 static struct service s;
 VIEWOS_SERVICE(s);

Each module is identified by a name and has a description (fields .name and .description respectively). The name is a short identifier, used in um_ls_service and um_del_service. Two modules having the same name cannot be loaded in a viewos machine at the same time. It is a good idea to give the module and its file the same name, i.e. the module in umfuse.so should be named umfuse, so that the module can be loaded and unloaded using the same identifier.

Former fields like the module code and check function have been eliminated.

The following code is a hello world module for ViewOS 0.8.

#include <stdio.h>
#include <stdlib.h>
#include <config.h>
#include "module.h"

static struct service s;
VIEWOS_SERVICE(s)

static void
__attribute__ ((constructor))
init (void)
{
  printk("hello world init");
  s.name="hwtest";
  s.description="hello world test";
  s.syscall=(sysfun *)calloc(scmap_scmapsize,sizeof(sysfun));
  s.socket=(sysfun *)calloc(scmap_sockmapsize,sizeof(sysfun));
}

static void
__attribute__ ((destructor))
fini (void)
{
  free(s.syscall);
  free(s.socket);
  printk("hello world fini");
}

printk is the function for error, debug or warning output. printk bypasses any virtualization and prints a line on the console, i.e. on the terminal where umview was started. (In the former version this function was named fprint2, but the name printk is more convenient.)

When a module starts, it registers the module itself and the module name as file system type prefix. It means that if a process tries to mount:

 $ mount -t hwtestfoo a b

or

 $ mount -t hwtestbar c d

the request gets forwarded to the mount function of the hwtest module above.
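
The prefix rule can be illustrated with a few lines of C. This is a sketch of the rule as stated above, not the code used by the core; the function name fstype_has_module_prefix is hypothetical.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 when the file system type starts with the module name,
 * 0 otherwise. An empty module name never matches. */
int fstype_has_module_prefix(const char *module_name, const char *fstype)
{
    assert(module_name != NULL && fstype != NULL);
    size_t n = strlen(module_name);
    return n > 0 && strncmp(fstype, module_name, n) == 0;
}
```

With this rule both hwtestfoo and hwtestbar select the module hwtest, while a type such as ext2 does not.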

A module is loaded when a machine adds the service (by um_add_service). If several submachines add the same service, the module is loaded once and unloaded when the last machine deletes the service. It means that the constructor and the destructor are each called once, when the module is loaded and unloaded respectively, no matter how many submachines load the same service.

A module registers its range of operation (or simply range) by adding elements to the global hash table. Several kinds of objects can be added to the hash table:

CHECKPATH: file system subtree
CHECKSOCKET: address family
CHECKCHRDEVICE: dev_t of a chr device
CHECKBLKDEVICE: dev_t of a blk device
CHECKSC: scno of a system call
CHECKBINFMT: interpreter for executables

CHECKPATH is the implementation of the concept of mount. When a module adds a CHECKPATH element to the hash table, that module will handle all the files in the subtree. ViewOS can mount single files, i.e. the subtree can be composed by one node.

Modules can add a path to the hash table with the following function:

struct ht_elem *ht_tab_pathadd(unsigned char type, const char *source,
    const char *path, const char *fstype,
    unsigned long mountflags, const char *flags,
    struct service *service, unsigned char trailingnumbers,
    confirmfun_t confirmfun, void *private_data);
  • type: CHECKPATH (in the future maybe we'll register different path)
  • source, path, fstype, mountflags, flags: these fields correspond to the fields source, target, filesystemtype, mountflags, data of the system call mount(2). path (i.e. target) will be registered in the hash table while the other fields are used to update the (virtual) mount table.
  • service: the address of the global struct service variable of this module.
  • trailingnumbers: boolean, if true this mountpoint is extended to all the files with trailing numbers, e.g. registering /dev/hda (trailingnumbers==1) will also match /dev/hda1, /dev/hda2, ...
  • confirmfun: sometimes modules want to register a subset of the subtree. When confirmfun is a valid function pointer, the function gets called to confirm the management of a file/path/object. If the function returns a non-zero value the module will handle the file. See below for the structure and parameters of this function.
  • private_data is an opaque pointer for module management. confirmfun can retrieve the private data using:
 void *ht_get_private_data(struct ht_elem *hte);

while all the other functions can use:

 void *um_mod_get_private_data(void);
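
The trailingnumbers rule described in the list above can be sketched as a string check. This covers only the "registered path followed by digits" case of the example, not the subtree matching done by the hash table; the function name is hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <string.h>

/* Returns 1 if path is the registered path followed by zero or more
 * decimal digits (e.g. /dev/hda, /dev/hda1, /dev/hda12), 0 otherwise. */
int matches_with_trailing_numbers(const char *registered, const char *path)
{
    assert(registered != NULL && path != NULL);
    size_t n = strlen(registered);
    if (strncmp(path, registered, n) != 0)
        return 0;
    for (const char *p = path + n; *p != '\0'; p++)
        if (!isdigit((unsigned char)*p))
            return 0;
    return 1;
}
```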

Other objects (i.e. CHECKSOCKET, CHECKCHRDEVICE, ...; every type except CHECKPATH) can be added to the hash table by calling:

struct ht_elem *ht_tab_add(unsigned char type,void *obj,int objlen,
    struct service *service, confirmfun_t confirmfun, void *private_data);

All the arguments have the same meaning explained above for ht_tab_pathadd.

  • obj is the object
  • objlen is the length of the object. A request matches when its first objlen bytes are equal to obj, thus the requested object must be at least objlen bytes wide, but can be longer. A module can ask to confirm all the requests of a specific kind by registering an object with obj==NULL and objlen==0.
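
The matching rule for these objects can be sketched as a bounded memory comparison. The function object_matches is a hypothetical helper written for this page; it only restates the rule above (compare the first objlen bytes, a zero-length object matches everything).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* registered/reglen: the object stored in the hash table.
 * request/requestlen: the object of the current request.
 * Returns 1 when the request matches the registered object. */
int object_matches(const void *registered, int reglen,
                   const void *request, int requestlen)
{
    assert(reglen >= 0 && requestlen >= 0);
    if (registered == NULL || reglen == 0)
        return 1;   /* zero-length object: every request of this kind matches */
    if (request == NULL || requestlen < reglen)
        return 0;   /* the request is shorter than the registered object */
    return memcmp(registered, request, (size_t)reglen) == 0;
}
```

For instance, a module that registers a single integer (an address family) matches any request whose first integer has the same value, whatever follows it.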

A confirmation function has the following type:

typedef int (* confirmfun_t)(int type, void *arg, int arglen,
    struct ht_elem *ht);

type is the kind of match, arg is the object to check, arglen is the length of the match, and ht is the current item of the hash table. The search process needs to try partial matches (e.g. each directory in the path). The confirmation function always gets the entire object but the match must be limited to arglen bytes.

Modules should never add or delete their hash table elements in their constructors or destructors: a module doing so cannot work properly when the core of ViewOS (e.g. umview) supports several virtual machines at the same time. As explained above, the constructor and the destructor are called once, while the service can be added and deleted several times.

There are two scenarios for adding elements to the hash table: service bound and mount bound.

  • An element is service bound if it must be added when the service is added, and removed when the service is deleted.
  • An element is mount bound if it gets added as a result of a mount operation and deleted by the corresponding umount.

A module defines global functions viewos_init and viewos_fini to add/delete service bound elements:

void *viewos_init(char *args);
void viewos_fini(void *data);

These functions get called each time a service is added/deleted in a virtual machine, i.e. each time a user calls um_add_service or um_del_service. viewos_init has one argument args: the optional arguments that can be appended as a suffix when adding a service. Example: when a service named test (of a module test.so) is loaded by

 $ um_add_service test,a,b,c

args is a,b,c. viewos_init returns an opaque pointer that will be passed to viewos_fini. viewos_init must transfer to viewos_fini the pointers to all the hash table elements created, so that viewos_fini can safely delete all the elements. (If those elements stay in the hash table and the module gets unloaded, the core ViewOS monitor will call non-existent functions, causing the whole virtual system to abort.)
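
The split between the service name and its arguments can be sketched as follows. This is not the parser used by um_add_service: the function name split_service_args is hypothetical, and returning NULL when no arguments are given is an assumption (the text does not say what args contains in that case).

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Splits "test,a,b,c" in place at the first comma.
 * Returns the service name ("test") and stores in *args a pointer to the
 * remaining text ("a,b,c"), or NULL when there is no comma. */
char *split_service_args(char *spec, char **args)
{
    assert(spec != NULL && args != NULL);
    char *comma = strchr(spec, ',');
    if (comma == NULL) {
        *args = NULL;
    } else {
        *comma = '\0';      /* terminate the service name */
        *args = comma + 1;  /* everything after the first comma */
    }
    return spec;
}
```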

The function to delete hash table elements is the following one:

 int ht_tab_del(struct ht_elem *mp);

A complete module (service bound)

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <module.h>

static struct service s;
VIEWOS_SERVICE(s)

char *contents="hello world\n";

struct fileinfo {
        loff_t pos;
        loff_t size;
        char *buf;
};

static long file_open(char *path, int flags, mode_t mode)
{
        /* private data registered by viewos_init: the file contents */
        char *contents=um_mod_get_private_data();
        /* allocate the per-open-file state and get its descriptor id */
        int fd = addfiletab(sizeof(struct fileinfo));
        struct fileinfo *ft=getfiletab(fd);
        ft->pos = 0;
        ft->size = strlen(contents);
        ft->buf=contents;
        return fd;
}

static long file_close(int fd)
{
        /* release the per-open-file state */
        delfiletab(fd);
        return 0;
}

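static long file_read(int fd, char *buf, size_t count)
{
        struct fileinfo *ft=getfiletab(fd);
        int len=ft->size - ft->pos;
        /* nothing left to read: end of file */
        if (len <= 0)
                return 0;
        /* len is positive here, so the comparison with count is safe */
        if ((size_t)len > count) len=count;
        /* copy by length: the data does not need a string terminator */
        memcpy(buf,ft->buf + ft->pos,len);
        ft->pos += len;
        return len;
}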

static long file_stat64(char *path, struct stat64 *buf64)
{
        memset(buf64,0,sizeof(struct stat64));
        buf64->st_mode=S_IFREG | 0444;
        return 0;
}

void *viewos_init(char *args)
{
        return ht_tab_pathadd(CHECKPATH,"none","/test","test",0,"ro",&s,0,NULL,contents);
}

void viewos_fini(void *data)
{
        /* data is the value returned by viewos_init */
        struct ht_elem *proc_ht=data;
        ht_tab_del(proc_ht);
}

static void
__attribute__ ((constructor))
init (void)
{
        s.name="file1";
        s.description="hello world file test 1";
        s.syscall=(sysfun *)calloc(scmap_scmapsize,sizeof(sysfun));
        s.socket=(sysfun *)calloc(scmap_sockmapsize,sizeof(sysfun));
        SERVICESYSCALL(s, open, file_open);
        SERVICESYSCALL(s, read, file_read);
        SERVICESYSCALL(s, close, file_close);
        SERVICESYSCALL(s, stat64, file_stat64);
        SERVICESYSCALL(s, lstat64, file_stat64);
}

static void
__attribute__ ((destructor))
fini (void)
{
        free(s.syscall);
        free(s.socket);
}

This module creates a virtual file named /test. It is a read-only file containing hello world. file1.c can be compiled in this way (INCLUDEDIR stands for the directory that contains module.h):

 gcc -shared -fPIC -o file1.so -I INCLUDEDIR file1.c

then in a viewos virtual machine it is possible to run:

 $ um_add_service ./file1.so
 $ cat /test
 hello world
 $

This example uses the filetab core library: a set of helper functions provided by the core View-OS monitor for modules to handle files. The interface is simple:

int addfiletab(int size);
void delfiletab(int i);
void *getfiletab(int i);
  • addfiletab allocates a data structure for an open file. The data structure is size bytes wide. The return value is a file descriptor identifier for it.
  • getfiletab retrieves the data structure given the file descriptor id.
  • delfiletab frees a file descriptor id and its data structure area.

Modules use the filetab interface to keep the information about open files. The example above tracks the current position in the file, so that several processes can open the virtual file concurrently, each with its own offset.
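
To make the behaviour of the three functions concrete, here is a minimal stand-alone sketch of such a table. It is not the implementation in the View-OS core: the sketch_ prefix, the fixed table size and the "first free slot" policy are all assumptions made for the illustration.

```c
#include <assert.h>
#include <stdlib.h>

#define SKETCH_MAXFILES 16   /* hypothetical fixed size */

static void *sketch_tab[SKETCH_MAXFILES];

/* Allocates a zeroed area of size bytes in the first free slot.
 * Returns the slot index (the descriptor id) or -1 on failure. */
int sketch_addfiletab(int size)
{
    assert(size > 0);
    for (int i = 0; i < SKETCH_MAXFILES; i++) {
        if (sketch_tab[i] == NULL) {
            sketch_tab[i] = calloc(1, (size_t)size);
            return sketch_tab[i] != NULL ? i : -1;
        }
    }
    return -1;   /* table full */
}

/* Returns the area of descriptor id i, or NULL if i is not in use. */
void *sketch_getfiletab(int i)
{
    return (i >= 0 && i < SKETCH_MAXFILES) ? sketch_tab[i] : NULL;
}

/* Frees the area of descriptor id i and makes the slot available again. */
void sketch_delfiletab(int i)
{
    if (i >= 0 && i < SKETCH_MAXFILES) {
        free(sketch_tab[i]);
        sketch_tab[i] = NULL;
    }
}
```

Each open file owns a separate area, which is why two processes reading the same virtual file keep independent offsets.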

viewos_init registers just the virtual file. This is a very efficient implementation: the core ViewOS monitor calls the module functions only when a system call refers to /test.

Sometimes it is not possible to register each pathname, and a module needs a more complex check to verify whether a file must be virtualized or not. In this case the module can use a confirmation function.

In the example above viewos_init can be replaced by the following one:

static int file_confirm(int type, void *arg, int arglen,
        struct ht_elem *ht)
{
        /* for pathnames, accept only the path /test */
        if (type == CHECKPATH) {
                if (arglen == 5 && strncmp(arg,"/test",5)==0)
                        return 1;
                else
                        return 0;
        }
        return 1;
}

void *viewos_init(char *args)
{
        return ht_tab_pathadd(CHECKPATH,"none","/","test",0,"ro",&s,0,file_confirm,contents);
}

This implementation is less efficient than the previous one, but it makes it possible to handle more complex cases where virtual and real files are mixed in a subtree of the file system.

A complete module (mount bound)

The following example uses mount to create a virtual "hello world" file at each mount target.

#define _GNU_SOURCE
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <module.h>

static struct service s;
VIEWOS_SERVICE(s)

char *contents="hello world\n";

struct filemount {
        char *path;
};

struct fileinfo {
        loff_t pos;
        loff_t size;
        char *buf;
};

static long file_open(char *path, int flags, mode_t mode)
{
        /* private data registered by file_mount for this mount point */
        struct filemount *fm=um_mod_get_private_data();
        int fd = addfiletab(sizeof(struct fileinfo));
        struct fileinfo *ft=getfiletab(fd);
        ft->pos = 0;
        asprintf(&(ft->buf),"%s: hello world\n",fm->path);
        ft->size = strlen(ft->buf);
        return fd;
}

static long file_close(int fd)
{
        struct fileinfo *ft=getfiletab(fd);
        /* the buffer was allocated by asprintf in file_open */
        free(ft->buf);
        delfiletab(fd);
        return 0;
}

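static long file_read(int fd, char *buf, size_t count)
{
        struct fileinfo *ft=getfiletab(fd);
        int len=ft->size - ft->pos;
        /* nothing left to read: end of file */
        if (len <= 0)
                return 0;
        /* len is positive here, so the comparison with count is safe */
        if ((size_t)len > count) len=count;
        /* copy by length: the data does not need a string terminator */
        memcpy(buf,ft->buf + ft->pos,len);
        ft->pos += len;
        return len;
}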

static long file_stat64(char *path, struct stat64 *buf64)
{
        memset(buf64,0,sizeof(struct stat64));
        buf64->st_mode=S_IFREG | 0444;
        return 0;
}


static long file_mount(char *source, char *target, char *filesystemtype,
                    unsigned long mountflags, void *data) {
        struct filemount *new=(struct filemount *)malloc(sizeof(struct filemount));
        new->path=strdup(target);
        ht_tab_pathadd(CHECKPATH,source,target,filesystemtype,mountflags,data,&s,0,NULL,new);
        return 0;
}

static long file_umount2(char *target, int flags)
{
        struct filemount *fm = um_mod_get_private_data();
        free(fm->path);
        free(fm);
        /* remove the hash table element of the current mount point */
        ht_tab_del(um_mod_get_hte());
        return 0;
}

static void
__attribute__ ((constructor))
init (void)
{
        s.name="file3";
        s.description="hello world file test";
        s.syscall=(sysfun *)calloc(scmap_scmapsize,sizeof(sysfun));
        s.socket=(sysfun *)calloc(scmap_sockmapsize,sizeof(sysfun));
        SERVICESYSCALL(s, open, file_open);
        SERVICESYSCALL(s, read, file_read);
        SERVICESYSCALL(s, close, file_close);
        SERVICESYSCALL(s, stat64, file_stat64);
        SERVICESYSCALL(s, lstat64, file_stat64);
        SERVICESYSCALL(s, mount, file_mount);
        SERVICESYSCALL(s, umount2, file_umount2);
}

static void
__attribute__ ((destructor))
fini (void)
{
        free(s.syscall);
        free(s.socket);
}

The example has a minimal set of information for the mounted partition (struct filemount has just a path field).

Internals

The core structure of the new implementation (ViewOS 0.8) is the global hash table (hashtable.[ch]).

This data structure stores all the services provided by the modules (and their sub-modules) and allows a fast and scalable dispatch of all the system calls to the right module.

Several kinds of objects can be stored in the hash table: modules, pathnames, address families, char/block devices, system calls, interpreters for executables.

Each object has its own hash sum, a one-word (long) integer. The hash key is the hash sum modulo the number of elements of the hash table.
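
The relation between hash sum and hash key can be sketched as follows. The mixing function, the table size and the sketch_ names are assumptions: the hash function actually used in hashtable.c is not described on this page. The sketch uses unsigned long to avoid signed overflow, and seeds the sum with the object type so that equal objects of different types get different sums.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_HT_SIZE 128   /* hypothetical number of elements of the table */

/* Hypothetical hash sum: starts from the type tag and mixes in the
 * bytes of the object one by one. */
unsigned long sketch_hashsum(unsigned char type, const void *obj, size_t objlen)
{
    const unsigned char *p = obj;
    unsigned long sum = type;
    assert(obj != NULL || objlen == 0);
    for (size_t i = 0; i < objlen; i++)
        sum = sum * 31UL + p[i];
    return sum;
}

/* The hash key selects the collision list: sum modulo the table size. */
unsigned long sketch_hashkey(unsigned long sum)
{
    return sum % SKETCH_HT_SIZE;
}
```

The full sum is kept next to each stored object, so a lookup can discard most entries of a collision list by comparing one integer before comparing the objects themselves.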

Each object is stored in the hash table in a collision list corresponding to its hash key. The data structure associated to each object follows:

struct ht_elem {
  void *obj;
  char *mtabline;
  struct timestamp tst;
  unsigned char type;
  unsigned char trailingnumbers;
  unsigned char invalid;
  struct service *service;
  struct ht_elem *service_hte;
  void *private_data;
  int objlen;
  long hashsum;
  int count;
  confirmfun_t confirmfun;
  struct ht_elem *prev,*next,**pprevhash,*nexthash;
};
  • obj is the object (whose length is objlen bytes)
  • type is the tag of the object type
  • hashsum is the hash sum. It allows a quick selection within the collision list: if the hash sum does not coincide, the object is not the wanted one.
  • tst is the timestamp as defined by the treepoch module.
  • service and service_hte are quick links to the service (module) owning this element.
  • private_data is an opaque pointer where the module can store its information about this object.
  • count is the number of instances currently in use (needed for garbage collection).
  • confirmfun is the confirmation function to manage exceptions.
  • mtabline is the mount tab line (the one shown by umproc in /proc/mounts).
  • prev, next, pprevhash, nexthash are the links for the collision list and for the linear scan of all the elements of the same type.

Each kind of object has its own search policy. Some object types have more than one policy.

  • CHECKPATH (pathnames): the search traverses the tree from the root to the leaf. Step by step each component of the pathname is appended and the resulting partial path is searched in the hash table. The search returns the most recent match among those found. To be more precise, the first scan provides the sequence of the most recent matches that have a non-null confirmation function (and so may have exceptions) plus, if present, the first match without exceptions. This sequence is named carrot, see below.
  • CHECKPATHEXACT: used for umount; only a complete match is permitted.
  • CHECKSOCKET, CHECKCHRDEVICE, CHECKBLKDEVICE, CHECKSC: these objects are integers or sequences of integers. All the objects stored in the hash table having a common prefix (integer by integer) can match.
  • CHECKMODULE: search a module by its module name (complete match).
  • CHECKFSTYPE: the name of the file system (for mount) must have a module name as a prefix.
  • CHECKFSALIAS: standard string match.
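
The step-by-step traversal used for CHECKPATH can be sketched as an enumeration of partial paths. The function next_prefix_len is hypothetical, and the sketch assumes an absolute, already normalized path (no repeated or trailing slashes) and that the root "/" is the first partial path; the real search may differ in these details.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Given the length cur of the partial path already examined (0 at the
 * beginning), returns the length of the next partial path to look up,
 * or 0 when the whole path has been visited.
 * For "/a/b/c" the successive lengths are 1 ("/"), 2 ("/a"),
 * 4 ("/a/b") and 6 ("/a/b/c"). */
size_t next_prefix_len(const char *path, size_t cur)
{
    assert(path != NULL && path[0] == '/');
    size_t len = strlen(path);
    if (cur >= len)
        return 0;                         /* nothing left to examine */
    if (cur == 0)
        return 1;                         /* first step: the root */
    size_t i = (cur == 1) ? 1 : cur + 1;  /* skip the separator, if any */
    while (i < len && path[i] != '/')
        i++;
    return i;
}
```

Each returned length identifies one partial path to look up in the hash table.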

The first part of the search process generates the list of possible matches, i.e. the most recent matches (in terms of timestamp) having a confirmation function, plus the first one with confirmfun==NULL. This list is named carrot. A mount in View-OS can be thought of as a layer that changes the view. A layer without exceptions is completely opaque, while it is semi-transparent when exceptions may occur. A carrot is a probe obtained by digging through all the possibly transparent layers down to the first opaque one. The search algorithm then calls the confirmation functions in order and returns the first confirmed match.
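
The resolution of a carrot can be sketched as a scan over the layers, from the most recent to the oldest. The types and names below (sketch_layer, sketch_resolve, sketch_only_test) are hypothetical and much simpler than struct ht_elem; the sketch also merges the two phases described above (building the carrot, then calling the confirmation functions) into a single loop.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

typedef int (*sketch_confirm_t)(const char *path);

struct sketch_layer {
    const char *owner;          /* module that registered the layer */
    sketch_confirm_t confirm;   /* NULL: opaque layer, no exceptions */
};

/* Example confirmation function: accepts only the path /test. */
int sketch_only_test(const char *path)
{
    return strcmp(path, "/test") == 0;
}

/* layers[] is ordered from the most recent match to the oldest.
 * Returns the layer that manages path, or NULL if none does. */
const struct sketch_layer *sketch_resolve(const struct sketch_layer *layers,
                                          size_t n, const char *path)
{
    assert(layers != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (layers[i].confirm == NULL)
            return &layers[i];   /* opaque layer: the carrot ends here */
        if (layers[i].confirm(path) != 0)
            return &layers[i];   /* semi-transparent layer accepted the path */
    }
    return NULL;                 /* every layer declined */
}
```

A semi-transparent layer that declines a path lets the request fall through to the older layers below it; layers older than the first opaque one are never consulted.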

The object type is used in the computation of the hash sum and key, thus objects having the same value but different types are stored independently. Sometimes modules register null (zero-length) objects. All the null objects of the same type obviously have the same hash sum and key. Zero-length objects are kept in a separate collision list (to prevent collisions with objects of different types).

When a user process issues a system call, View-OS searches the hash table for the object responsible for handling it. The hash table element (often referred to in the code as hte) is used by the whole virtual machine monitor (umview/kmview) and by the modules as a key to find the virtualization that applies. View-OS modules no longer need to implement their own search methods or mount tables (as happened in View-OS 0.6). The implementation of the system calls in the modules can access the private data of the virtualization for the current request using um_mod_get_private_data.
