Do no harm: Xenopsd should never touch domains/VMs which it hasn’t been
asked to manage. This means that it can co-exist with other VM managers
such as ‘xl’ and ‘libvirt’.
Be independent: Xenopsd should be able to work in isolation. In particular
the loss of some other component (e.g. the network) should not by itself
prevent VMs being managed locally (including shutdown and reboot).
Asynchronous by default: Xenopsd exposes task monitoring and offers
cancellation for all operations. Xenopsd ensures that the system is always
in a manageable state after an operation has been cancelled.
Avoid state duplication: where another component owns some state, Xenopsd
will always defer to it. We will avoid creating out-of-sync caches of
this state.
Be debuggable: Xenopsd will expose diagnostic APIs and tools to allow
its internal state to be inspected and modified.
Xenopsd Architecture
Xenopsd instances run on a host and manage VMs on behalf of clients. This
picture shows 3 different Xenopsd instances: 2 named “xenopsd-xc” and 1 named
“xenopsd-xenlight”.
Each instance is responsible for managing a disjoint set of VMs. Clients should
never ask more than one Xenopsd to manage the same VM.
Managing a VM means:
allowing devices (disks, nics, PCI cards, vCPUs etc) to be manipulated
providing updates to clients when things change (reboots, console becomes
available, guest agent says something etc).
For a full list of features, consult the feature list.
Each Xenopsd instance has a unique name on the host. A typical name is
org.xen.xcp.xenops.classic
org.xen.xcp.xenops.xenlight
A higher-level tool, such as xapi
will associate VMs with individual Xenopsd names.
Running multiple Xenopsds is necessary because
The virtual hardware supported by different technologies (libxc, libxl, qemu)
is expected to be different. We can guarantee the virtual hardware is stable
across a rolling upgrade by running the VM on the old Xenopsd. We can then switch
Xenopsds later over a VM reboot when the VM admin is happy with it. If the
VM admin is unhappy then we can reboot back to the original Xenopsd again.
The suspend/resume/migrate image formats will differ across technologies
(again libxc vs libxl) and it will be more reliable to avoid switching
technology over a migrate.
In the future different security domains may have different Xenopsd instances
providing even stronger isolation guarantees between domains than is possible
today.
Communication with Xenopsd is handled through a Xapi-global library:
xcp-idl. This library supports
message framing: by default using HTTP but a binary framing format is
available
message encoding: by default we use JSON but XML is also available
RPCs over Unix domain sockets and persistent queues.
This library allows the communication details to be changed without having to
change all the Xapi clients and servers.
Xenopsd has a number of “backends” which perform the low-level VM operations
such as (on Xen) “create domain” “hotplug disk” “destroy domain”. These backends
contain all the hypervisor-specific code including
connecting to Xenstore
opening the libxc /proc/xen/privcmd interface
initialising libxl contexts
The following diagram shows the internal structure of Xenopsd:
At the top of the diagram two client RPCs have been sent: one to start a VM
and the other to fetch the latest events. The RPCs are all defined in
xcp-idl/xen/xenops_interface.ml.
The RPCs are received by the Xenops_server module and decomposed into
“micro-ops” (labelled “μ op”). These micro ops represent actions like
create a Xen domain (recall a Xen domain is an empty shell with no memory)
build a Xen domain: this is where the kernel or hvmloader is copied in
launch a device model: this is where a qemu instance is started (if one is
required)
hotplug a device: this involves writing the frontend and backend trees to
Xenstore
unpause a domain (recall a Xen domain is created in the paused state)
Each of these micro-ops is represented by a function call in a “backend plugin”
interface. The micro-ops are enqueued in queues, one queue per VM. There is a
thread pool (whose size can be changed dynamically by the admin) which pulls
micro-ops from the VM queues and calls the corresponding backend function.
The active backend (there can only be one backend per Xenopsd instance)
executes the micro-ops. The Xenops_server_xen backend in the picture above
talks to libxc, libxl and qemu to create and destroy domains. The backend
also talks to other Xapi services, in particular
it registers datasources with xcp-rrdd, telling xcp-rrdd to measure I/O
throughput and vCPU utilisation
it reserves memory for new domains by talking to squeezed
it makes disks available by calling SMAPIv2 VDI.{at,de}tach, VDI.{,de}activate
it launches subprocesses by talking to forkexecd (avoiding problems with
accidental fd capture)
Xenopsd backends are also responsible for monitoring running VMs. In the
Xenops_server_xen backend this is done by watching Xenstore for
@releaseDomain watch events
device hotplug status changes
When such an event happens (for example: @releaseDomain sent when a domain
requests a reboot) the corresponding operation does not happen inline. Instead
the event is rebroadcast upwards to Xenops_server as a signal (for example:
“VM id needs some attention”) and a “VM_stat” micro-op is queued in the
appropriate queue. Xenopsd does not allow operations to run on the same VM
in parallel and enforces this by:
pushing all operations pertaining to a VM to the same queue
associating each VM queue to at-most-one worker pool thread
The event takes the form “VM id needs some attention” and not “VM id needs
to be rebooted” because, by the time the queue is flushed, the VM may well now
be in a different state. Perhaps rather than being rebooted it now needs to
be shutdown; or perhaps the domain is now in a good state because the reboot
has already happened. The signals sent by the backend to the Xenops_server are
a bit like event channel notifications in the Xen ring protocols: they are
requests to ask someone to perform work, they don’t themselves describe the work
that needs to be done.
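The "needs attention" idea can be sketched as follows. This is a minimal model, not Xenopsd's code: the types `observed` and `action`, the table `current` and the function `handle_attention` are all hypothetical names. The point it illustrates is that the signal carries only a VM id, and the action is derived from whatever state is observed when the queued item is finally processed.

```ocaml
(* Hypothetical model: none of these names come from Xenopsd itself. *)
type observed = Running | Wants_reboot | Wants_shutdown

type action = Nothing | Reboot | Shutdown

(* The work to do is a function of the state seen *now*, not of the
   event that originally raised the signal. *)
let action_for (state : observed) : action =
  match state with
  | Running -> Nothing
  | Wants_reboot -> Reboot
  | Wants_shutdown -> Shutdown

(* Simulated "current state" lookup, keyed by VM id. *)
let current : (string, observed) Hashtbl.t = Hashtbl.create 8

(* Called when the queued "VM id needs some attention" item is processed. *)
let handle_attention (vm_id : string) : action =
  match Hashtbl.find_opt current vm_id with
  | Some s -> action_for s
  | None -> Nothing
```

If the VM's state changes between the signal being raised and the queue being flushed, the handler simply acts on the newer state.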
An implication of this design is that it should always be possible to answer
the question, “what operation should be performed to get the VM into a valid state?”.
If an operation is cancelled half-way through or if Xenopsd is suddenly restarted,
it will ask the question about all the VMs and perform the necessary operations.
The operations must be designed carefully to make this work. For example if Xenopsd
is restarted half-way through starting a VM, it must be obvious on restart that
the VM should either be forcibly shut down or rebooted to return it to a valid state
again. Note: we don’t demand that operations are performed as transactions;
we only demand that the state in which they leave the system be “sensible” in the sense
that the admin will recognise it and be able to continue their work.
Sometimes this can be achieved through careful ordering of side-effects
within the operations, taking advantage of artifacts of the system such as:
a domain which has not been fully created will have total vCPU time = 0 and
will be paused. If we see one of these we should reboot it because it may
not be fully intact.
In the absence of “tells” from the system, operations are expected to journal
their intentions and support restart after failure.
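The "tell" described above (zero vCPU time plus paused) could be turned into a restart-time decision along these lines. This is only a sketch: the record `domain_info`, its field names and the `recovery` type are assumptions made for illustration.

```ocaml
(* Illustrative only: field and constructor names are assumptions. *)
type domain_info = { paused : bool; total_vcpu_time : int64 }

type recovery = Leave_alone | Reboot

(* A domain that is paused and has never consumed vCPU time was probably
   interrupted part-way through creation, so it is rebuilt via a reboot. *)
let recovery_for (d : domain_info) : recovery =
  if d.paused && d.total_vcpu_time = 0L then Reboot else Leave_alone
```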
There are three categories of metadata associated with VMs:
system metadata: this is created as a side-effect of starting VMs. This
includes all the information about active disks and nics stored in Xenstore
and the list of running domains according to Xen.
VM: this is the configuration to use when the VM is started or rebooted.
This is like a “config file” for the VM.
VmExtra: this is the runtime configuration of the VM. When VM configuration
is changed it often cannot be applied immediately; instead the VM continues
to run with the previous configuration. We need to track the runtime
configuration of the VM in order for suspend/resume and migrate to work. It
is also useful to be able to tell a client, “on next reboot this value will
be x but currently it is x-1”.
VM and VmExtra metadata is stored by Xenopsd in the domain 0 filesystem, in
a simple directory hierarchy.
There are a number of hook points at which xenopsd may execute certain scripts. These scripts are found in hook-specific directories of the form /etc/xapi.d/<hookname>/. All executable scripts in these directories are run with the following arguments:
<script.sh> -reason <reason> -vmuuid <uuid of VM>
The scripts are executed in filename-order. By convention, the filenames are usually of the form 10resetvdis. The reason argument is one of the following:
clean-shutdown
hard-shutdown
clean-reboot
hard-reboot
suspend
source -- passed to pre-migrate hook on source host
destination -- passed to post-migrate hook on destination (Dundee only)
none
For example, in order to execute a script on VM shutdown, it would be sufficient to create the script in the post-destroy hook point:
/etc/xapi.d/vm-post-destroy/01myscript.sh
containing
#!/bin/bash
echo I was passed $@ > /tmp/output
And when, for example, VM e30d0050-8f15-e10d-7613-cb2d045c8505 is shut down, the script is executed:
[vagrant@localhost ~]$ sudo xe vm-shutdown --force uuid=e30d0050-8f15-e10d-7613-cb2d045c8505
[vagrant@localhost ~]$ cat /tmp/output
I was passed -vmuuid e30d0050-8f15-e10d-7613-cb2d045c8505 -reason hard-shutdown
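Note that the usage line and the sample output list the two flags in different orders, so a hook is safer parsing them by name than by position. The sketch below shows that, together with the filename ordering. The helper names `in_run_order` and `parse_hook_args` are hypothetical and are not part of xenopsd.

```ocaml
(* Scripts run in filename order, so a numeric prefix controls sequencing. *)
let in_run_order (scripts : string list) : string list =
  List.sort String.compare scripts

(* Extract (reason, vm uuid) from the argument list, whatever the order. *)
let parse_hook_args (args : string list) : string option * string option =
  let rec go reason vm_uuid = function
    | "-reason" :: r :: rest -> go (Some r) vm_uuid rest
    | "-vmuuid" :: u :: rest -> go reason (Some u) rest
    | _ :: rest -> go reason vm_uuid rest
    | [] -> (reason, vm_uuid)
  in
  go None None args
```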
PVS Proxy OVS Rules
Rule Design
The Open vSwitch (OVS) daemon implements a programmable switch.
XenServer uses it to re-direct traffic between three entities:
PVS server - identified by its IP address
a local VM - identified by its MAC address
a local Proxy - identified by its MAC address
VM and PVS server are unaware of the Proxy; xapi configures OVS to
redirect traffic between PVS and VM to pass through the proxy.
OVS uses rules that match packets. Rules are organised in sets called
tables. A rule can be used to match a packet and to inject it into
another rule set/table such that a packet can be matched again.
Furthermore, a rule can set registers associated with a packet which
can be matched in subsequent rules. In that way, a packet can be tagged
such that it will only match specific rules downstream that match the
tag.
Xapi configures 3 rule sets:
Table 0 - Entry Rules
Rules match UDP traffic between VM/PVS, Proxy/VM, and PVS/VM where the
PVS server is identified by its IP and all other components by their MAC
address. All packets are tagged with the direction they are going and
re-submitted into Table 101 which handles ports.
Table 101 - Port Rules
Rules match UDP traffic going to a specific port of the PVS server and
re-submit it into Table 102.
Table 102 - Exit Rules
These rules implement the redirection:
Packets coming from VM to PVS are directed to the Proxy.
Packets coming from PVS to VM are directed to the Proxy.
Packets coming from the Proxy are already addressed properly (to the VM)
and are handled normally.
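The three-table pipeline can be modelled as three small functions. This is a toy model of the matching logic only, not OVS flow syntax and not what xapi emits. The types, the `pvs_port` field (the PVS-side UDP port of the packet) and the port number used in any call are assumptions made for the sketch.

```ocaml
type endpoint = Vm | Pvs | Proxy

type packet = { src : endpoint; dst : endpoint; udp : bool; pvs_port : int }

(* The "tag" that Table 0 attaches to a packet. *)
type direction = Vm_to_pvs | Pvs_to_vm | From_proxy

type verdict = To_proxy | Normal

(* Table 0: tag UDP traffic with its direction; anything else falls through. *)
let table0 p =
  if not p.udp then None
  else
    match (p.src, p.dst) with
    | Vm, Pvs -> Some Vm_to_pvs
    | Pvs, Vm -> Some Pvs_to_vm
    | Proxy, Vm -> Some From_proxy
    | _ -> None

(* Table 101: only traffic on a proxied PVS port continues to Table 102. *)
let table101 ~ports p = List.mem p.pvs_port ports

(* Table 102: redirect or handle normally, based on the tag. *)
let table102 = function
  | Vm_to_pvs | Pvs_to_vm -> To_proxy
  | From_proxy -> Normal

let classify ~ports p =
  match table0 p with
  | Some tag when table101 ~ports p -> table102 tag
  | _ -> Normal
```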
Requirements for suspend image framing
We are currently (Dec 2013) undergoing a transition from the ‘classic’ xenopsd
backend (built upon calls to libxc) to the ‘xenlight’ backend built on top of
the officially supported libxl API.
During this work, we have come across an incompatibility between the suspend
images created using the ‘classic’ backend and those created using the new
libxl-based backend. This needed to be fixed to enable RPU to any new version
of XenServer.
Historic ‘classic’ stack
Prior to this work, xenopsd was involved in the construction of the suspend
image and we ended up with an image with the following format:
+-----------------------------+
| "XenSavedDomain\n" | <-- added by xenopsd-classic
|-----------------------------|
| Memory image dump | <-- libxc
|-----------------------------|
| "QemuDeviceModelRecord\n" |
| <size of following record> | <-- added by xenopsd-classic
| (a 32-bit big-endian int) |
|-----------------------------|
| "QEVM" | <-- libxc/qemu
| Qemu device record |
+-----------------------------+
We have also been carrying a patch in the Xen patchqueue against
xc_domain_restore. This patch (revert_qemu_tail.patch) stopped
xc_domain_restore from attempting to read past the memory image dump, at which
point xenopsd-classic would take over and restore what it had put there.
Requirements for new stack
For xenopsd-xenlight to work, we need to operate without the
revert_qemu_tail.patch since libxl assumes it is operating on top of an
upstream libxc.
We need suspend images created on one backend to be restorable on another
backend, where the backends are old-classic (OC), new-classic (NC) and
xenlight (XL). Obviously all suspend images created on any backend must be
restorable on the same backend:
It turns out this was not so simple. After removing the patch against
xc_domain_restore and allowing libxc to restore the hvm_buffer_tail, we found
that suspend images created with OC (detailed in the previous section) are not
of a valid format for two reasons:
i. The "XenSavedDomain\n" was extraneous;
ii. The Qemu signature section (prior to the record) is not of valid form.
It turns out that the section with the Qemu signature can be one of the
following:
a. "QemuDeviceModelRecord" (NB. no newline) followed by the record to EOF;
b. "DeviceModelRecord0002" then a uint32_t length followed by record;
c. "RemusDeviceModelState" then a uint32_t length followed by record;
The old-classic (OC) backend not only uses an invalid signature (since it
contains a trailing newline) but it also includes a length, and the length is
big-endian whereas the uint32_t is expected to be little-endian.
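To see why the byte order matters, the following sketch decodes the same four bytes both ways using the standard library. The byte values are made up for illustration and do not come from a real suspend image.

```ocaml
(* The same four bytes, decoded with both byte orders. *)
let raw = Bytes.of_string "\x00\x00\x1c\x00"

(* What old-classic meant when it wrote the length. *)
let as_big_endian : int32 = Bytes.get_int32_be raw 0

(* What a reader expecting a little-endian uint32_t would see. *)
let as_little_endian : int32 = Bytes.get_int32_le raw 0
```

Here the big-endian reading is 7168 while the little-endian reading is 1835008, so a reader using the wrong order would expect a record hundreds of times longer than the one actually present.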
We considered creating a proxy for the fd in the incompatible cases but since
this would need to be a 22-lookahead byte-by-byte proxy this was deemed
impractical. Instead we have patched libxc with a much simpler patch to
understand this legacy format.
Because peek-ahead is not possible on pipes, the patch for (ii) needed to be
applied at a point where the hvm tail had been read completely. We piggy-backed
on the point after (a) had been detected. At this point the remainder of the fd
is buffered (only around 7k) and the magic “QEVM” is expected at the head of
this buffer. So we simply added a patch to check if there was a pesky newline
and the buffer[5:8] was “QEVM” and if it was we could discard the first
5 bytes:
0 1 2 3 4 5 6 7 8
Legacy format from OC: [...| \n | \x | \x | \x | \x | Q | E | V | M |...]
Required at this point: [...| Q | E | V | M |...]
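The check described above can be expressed as a small pure function. The real change is a patch to libxc; this OCaml version is only a model of the logic, and the name `strip_legacy_prefix` is hypothetical.

```ocaml
(* If the buffer starts with the legacy newline and 4-byte length, and
   "QEVM" sits at offsets 5..8, drop the first 5 bytes. Otherwise leave
   the buffer untouched. *)
let strip_legacy_prefix (buf : string) : string =
  let magic = "QEVM" in
  if String.length buf >= 9
     && buf.[0] = '\n'
     && String.sub buf 5 4 = magic
  then String.sub buf 5 (String.length buf - 5)
  else buf
```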
Changes made
To make the above use-cases work, we have made the following changes:
1. Make new-classic (NC) not restore Qemu tail (let libxc do it)
xenopsd.git:ef3bf4b
2. Make new-classic use valid signature (b) for future restore images
xenopsd.git:9ccef3e
3. Make xc_domain_restore in libxc understand legacy xenopsd (OC) format
xen-4.3.pq.hg:libxc-restore-legacy-image.patch
4. Remove revert-qemu-tail.patch from Xen patchqueue
xen-4.3.pq.hg:3f0e16f2141e
5. Make xenlight (XL) use "XenSavedDomain\n" start-of-image signature
xenopsd.git:dcda545
This has made the required use-cases work as follows:
A suspend image is now constructed as a series of header-record pairs. The
initial signature (1.) is used to determine whether we are dealing with the
unstructured, “legacy” suspend image or the new, structured format.
Each header is two 64-bit integers: the first identifies the header type and
the second is the length of the record that follows in bytes. The following
types have been defined (the ones marked with a (*) have yet to be
implemented):
* Xenops : Metadata for the suspend image
* Libxc : The result of a xc_domain_save
* Libxl* : Not implemented
* Libxc_legacy : Marked as a libxc record saved using pre-Xen-4.5
* Qemu_trad : The qemu save file for the Qemu used in XenServer
* Qemu_xen* : Not implemented
* Demu* : Not implemented
* End_of_image : A footer marker to denote the end of the suspend image
Some of the above types do not have the notion of a length, since it cannot be
known upfront before saving and restoring them is delegated to other layers of
the stack. Specifically these are the memory image sections, libxc and
libxl.
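A header made of two 64-bit integers could be written and read like this. The text does not state the byte order or the numeric value of each header type, so both are assumptions here: big-endian is chosen arbitrarily, and `kind` is left as a plain integer.

```ocaml
(* Hypothetical representation: the real type identifiers are not given. *)
type header = { kind : int64; length : int64 }

let header_size = 16

(* Byte order is an assumption (big-endian chosen for this sketch). *)
let encode (h : header) : Bytes.t =
  let b = Bytes.create header_size in
  Bytes.set_int64_be b 0 h.kind;
  Bytes.set_int64_be b 8 h.length;
  b

(* Returns None when fewer than 16 bytes are available. *)
let decode (b : Bytes.t) : header option =
  if Bytes.length b < header_size then None
  else Some { kind = Bytes.get_int64_be b 0; length = Bytes.get_int64_be b 8 }
```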
Tasks
Some operations performed by Xenopsd are blocking, for example:
suspend/resume/migration
attaching disks (where the SMAPI VDI.attach/activate calls can perform network
I/O)
We want to be able to
present the user with an idea of progress (perhaps via a “progress bar”)
allow the user to cancel a blocked operation that is taking too long
associate logging with the user/client-initiated actions that spawned them
Principles
all operations which may block (the vast majority) should be written in an
asynchronous style i.e. the operations should immediately return a Task id
all operations should guarantee to respond to a cancellation request in a
bounded amount of time (30s)
when cancelled, the system should always be left in a valid state
clients are responsible for destroying Tasks when they are finished with the
results
Types
A task has a state, which may be Pending, Completed or Failed:
type async_result = unit

type completion_t = {
  duration : float;
  result : async_result option
}

type state =
  | Pending of float
  | Completed of completion_t
  | Failed of Rpc.t
When a task is Failed, we associate it with a marshalled exception (a value of type
Rpc.t). This exception must be one from the set defined in the
Xenops_interface.
To see how they are marshalled, see
Xenops_server.
From the point of view of a client, a Task has the immutable type (which can be
queried with a Task.stat):
type t = {
  id : id;
  dbg : string;
  ctime : float;
  state : state;
  subtasks : (string * state) list;
  debug_info : (string * string) list;
}
where
id is a unique (integer) id generated by Xenopsd. This is how a Task is
represented to clients
dbg is a client-provided debug key which will be used in log lines, allowing
lines from the same Task to be associated together
ctime is the creation time
state is the current state (Pending/Completed/Failed)
subtasks lists logical internal sub-operations for debugging
debug_info includes miscellaneous key/value pairs used for debugging
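A client might render the state for a user roughly as follows. To keep the sketch self-contained, `rpc` is a string stand-in for `Rpc.t`, and the float carried by `Pending` is assumed to be a progress fraction between 0 and 1; neither detail is confirmed by the text above.

```ocaml
(* Stand-in for Rpc.t so the sketch compiles without the rpc library. *)
type rpc = string

type async_result = unit

type completion_t = { duration : float; result : async_result option }

type state = Pending of float | Completed of completion_t | Failed of rpc

(* Assumes the Pending payload is a fraction in [0, 1]. *)
let describe = function
  | Pending progress -> Printf.sprintf "pending (%.0f%%)" (progress *. 100.)
  | Completed c -> Printf.sprintf "completed in %.1fs" c.duration
  | Failed e -> "failed: " ^ e
```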
Internally, Xenopsd uses a
mutable record type
to track Task state. This is broadly similar to the interface type except
the state is mutable: this allows Tasks to complete
the task contains a “do this now” thunk
there is a “cancelling” boolean which is toggled to request a cancellation.
there is a list of cancel callbacks
there are some fields related to “cancel points”
Persistence
The Tasks are intended to represent activities associated with in-memory queues
and threads. Therefore the active Tasks are kept in memory in a map, and will
be lost over a process restart. This is desirable since we will also lose the
queued items and the threads, so there is no need to resync on start.
Note that every operation must ensure that the state of the system is recoverable
on restart by not leaving it in an invalid state. It is not necessary to either
guarantee to complete or roll-back a Task. Tasks are not expected to be
transactional.
Lifecycle of a Task
All Tasks returned by API functions are created as part of the enqueue functions:
queue_operation_*.
Even operations which are performed internally are normally wrapped in Tasks by
the function
immediate_operation.
A queued operation will be processed by one of the
queue worker threads.
It will
set the thread-local debug key to the Task.dbg
call task.Xenops_task.run, taking care to catch exceptions and update
the task.Xenops_task.state
unset the thread-local debug key
generate an event on the Task to provoke clients to query the current state.
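The four steps above can be sketched with a simplified task record. This is not the real `Xenops_task` module: the record fields, the `debug_key` reference standing in for the thread-local key, and the `events` list standing in for the event stream are all illustrative assumptions.

```ocaml
type state = Pending | Completed | Failed of string

type task = {
  id : int;
  dbg : string;
  run : unit -> unit;
  mutable state : state;
}

(* Stand-ins for the thread-local debug key and the event stream. *)
let debug_key : string option ref = ref None
let events : int list ref = ref []

let process (t : task) : unit =
  (* 1. set the debug key *)
  debug_key := Some t.dbg;
  (* 2. run the task, recording success or the exception *)
  (try
     t.run ();
     t.state <- Completed
   with e -> t.state <- Failed (Printexc.to_string e));
  (* 3. unset the debug key *)
  debug_key := None;
  (* 4. emit an event so clients re-query the task *)
  events := t.id :: !events
```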
Task implementations must update their progress as they work. For the common
case of a compound operation like VM_start which is decomposed into
multiple “micro-ops” (e.g. VM_create, VM_build) there is a useful
helper function
perform_atomics
which divides the progress ‘bar’ into sections, where each “micro-op” can have
a different size (weight). A progress callback function is passed into
each Xenopsd backend function so it can be updated with fine granularity. For
example note the arguments to
B.VM.save
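The weighted division of the progress bar can be illustrated as follows. This is a sketch of the idea rather than the code of `perform_atomics`; the function names `progress_ranges` and `overall` are hypothetical.

```ocaml
(* Each micro-op gets a slice of [0, 1] proportional to its weight. *)
let progress_ranges (weights : float list) : (float * float) list =
  let total = List.fold_left ( +. ) 0. weights in
  let _, ranges =
    List.fold_left
      (fun (start, acc) w ->
        let stop = start +. (w /. total) in
        (stop, (start, stop) :: acc))
      (0., []) weights
  in
  List.rev ranges

(* Map a micro-op's local progress (0..1) into the overall bar. *)
let overall ((start, stop) : float * float) (local : float) : float =
  start +. (local *. (stop -. start))
```

A progress callback handed to a backend function would then report `overall range local`, so a heavily weighted step such as a memory save moves the bar further than a cheap one.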
Clients are expected to destroy Tasks they are responsible for creating. Xenopsd
cannot do this on their behalf because it does not know if they have successfully
queried the Task status/result.
When Xenopsd is a client of itself, it will take care to destroy the Task
properly, for example see
immediate_operation.
Cancellation
The goal of cancellation is to unstick a blocked operation and to return the
system to some valid state, not any valid state in particular.
Xenopsd does not treat operations as transactions;
when an operation is cancelled it may
fully complete (e.g. if it was about to do this anyway)
fully abort (e.g. if it had made no progress)
enter some other valid state (e.g. if it had gotten half way through)
Xenopsd will never leave the system in an invalid state after cancellation.
Every Xenopsd operation should unblock and return the system to a valid state within
a reasonable amount of time after a cancel request. This should be as quick as possible
but up to 30s may be acceptable.
Bear in mind that a human is probably impatiently watching a UI that says “please wait”
and doesn’t have any notion of progress itself. Keep it quick!
Cancellation is triggered by TASK.cancel which calls
cancel.
This
if about to block: register a suitable cancel callback safely with with_cancel.
Xenopsd’s libxc backend can block in 2 different ways, and therefore has 2 different
types of cancel callback:
cancellable Xenstore watches
cancellable subprocesses
Xenstore watches are used for device hotplug and unplug. Xenopsd has to wait for
the backend or for a udev script to do something. If that blocks, we need
a way to cancel the watch. The easiest way to cancel a watch is to watch an
additional path (a “cancel path”) and delete it, see
cancellable_watch.
The “cancel paths” are placed within the VM’s Xenstore directory to ensure that
cleanup code which does xenstore-rm will automatically “cancel” all outstanding
watches. Note that we trigger a cancel by deleting rather than creating, to avoid
racing with delete and creating orphaned Xenstore entries.
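The cancel-path idea can be modelled with an in-memory set of paths in place of Xenstore. Everything here is hypothetical (the path names, `poll`, `rm_subtree`), and a real watch blocks rather than polls. The sketch only shows why deleting the cancel path, or the whole VM directory, ends the wait.

```ocaml
module Paths = Set.Make (String)

type outcome = Ready | Cancelled | Still_waiting

(* One evaluation of the watch: a missing cancel path means "cancelled". *)
let poll (store : Paths.t) ~(wanted : string) ~(cancel : string) : outcome =
  if not (Paths.mem cancel store) then Cancelled
  else if Paths.mem wanted store then Ready
  else Still_waiting

let has_prefix ~(prefix : string) (s : string) : bool =
  String.length s >= String.length prefix
  && String.sub s 0 (String.length prefix) = prefix

(* Removing the VM's subtree removes the cancel path with it. *)
let rm_subtree (prefix : string) (store : Paths.t) : Paths.t =
  Paths.filter (fun p -> not (has_prefix ~prefix p)) store
```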
Subprocesses are used for suspend/resume/migrate. Xenopsd hands file descriptors
to libxenguest by running a subprocess and passing the fds to it. Xenopsd therefore
gets the process id and can send it a signal to cancel it. See
Cancellable_subprocess.run.
Testing with cancel points
Cancellation is difficult to test, as it is completely asynchronous. Therefore
Xenopsd has some built-in cancellation testing infrastructure known as “cancel points”.
A “cancel point” is a point in the code where a Cancelled exception could
be thrown, either by checking the cancelling boolean or as a side-effect of
a cancel callback. The
check_cancelling
function increments a counter every time it passes one of these points, and
this value is returned to clients in the
Task.debug_info.
A test harness
runs a series of operations. Each operation is first run all the way through to
completion to discover the total number of cancel points. The operation is then
re-run with a
request to cancel at a particular point.
The test then waits for the system to stabilise and verifies that it appears to be
in a valid state.
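The two-pass testing scheme can be sketched with a toy operation. The record fields and the helper functions below are illustrative assumptions, not the actual Xenopsd test harness. Only the counting idea is taken from the description above.

```ocaml
exception Cancelled

type task = { mutable cancel_points_seen : int; cancel_at : int option }

(* Count the cancel point and raise if this is the requested one. *)
let check_cancelling (t : task) : unit =
  t.cancel_points_seen <- t.cancel_points_seen + 1;
  if t.cancel_at = Some t.cancel_points_seen then raise Cancelled

(* A toy operation that passes three cancel points. *)
let operation (t : task) : unit =
  for _i = 1 to 3 do
    check_cancelling t
  done

(* Pass 1: run to completion to discover how many cancel points exist. *)
let count_cancel_points () : int =
  let t = { cancel_points_seen = 0; cancel_at = None } in
  operation t;
  t.cancel_points_seen

(* Pass 2: re-run, asking for cancellation at point n. *)
let cancelled_at (n : int) : bool =
  let t = { cancel_points_seen = 0; cancel_at = Some n } in
  try operation t; false with Cancelled -> true
```

A harness built this way would loop `n` from 1 to `count_cancel_points ()` and check the system state after each cancelled run.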
Preventing Tasks leaking
The client who creates a Task must destroy it when the Task is finished, and
they have processed the result. What if a client like xapi is restarted while
a Task is running?
We assume that, if xapi is talking to a xenopsd, then xapi completely owns it.
Therefore xapi should destroy any completed tasks that it doesn’t recognise.
If a user wishes to manage VMs with xenopsd in parallel with xapi, the user
should run a separate xenopsd.
Features
General
Pluggable backends including
xc: drives Xen via libxc and xenguest
simulator: simulates operations for component-testing
Supports running multiple instances and backends on the same host, looking
after different sets of VMs
Extensive configuration via command-line (see manpage) and config
file
Command-line tool for easy VM administration and troubleshooting
User-settable degree of concurrency to get VMs started quickly
VMs
VM start/shutdown/reboot
VM suspend/resume/checkpoint/migrate
VM pause/unpause
VM s3suspend/s3resume
customisable SMBIOS tables for OEM-locked VMs
hooks for 3rd party extensions:
pre-start
pre-destroy
post-destroy
pre-reboot
per-VM xenguest replacement
suppression of VM reboot loops
live vCPU hotplug and unplug
vCPU to pCPU affinity setting
vCPU QoS settings (weight and cap for the Xen credit2 scheduler)
DMC memory-ballooning support
support for storage driver domains
live update of VM shadow memory
guest-initiated disk/nic hotunplug
guest-initiated disk eject
force disk/nic unplug
support for ‘surprise-removable’ devices
disk QoS configuration
nic QoS configuration
persistent RTC
two-way guest agent communication for monitoring and control
network carrier configuration
port-locking for nics
text and VNC consoles over TCP and Unix domain sockets
PV kernel and ramdisk whitelisting
configurable VM videoram
programmable action-after-crash behaviour including: shutting down
the VM, taking a crash dump or leaving the domain paused for inspection
ability to move nics between bridges/switches
advertises the VM memory footprints
PCI passthrough
support for discrete emulators (e.g. ‘demu’)
PV keyboard and mouse
qemu stub domains
cirrus and stdvga graphics cards
HVM serial console (useful for debugging)
support for vGPU
workaround for ‘spurious page faults’ kernel bug
workaround for ‘machine address size’ kernel bug
Hosts
CPUid masking for heterogenous pools: reports true features and current
features
Host console reading
Hypervisor version and capabilities reporting
Host CPU querying
APIs
versioned JSON-RPC API with feature advertisements
clients can disconnect, reconnect and easily resync with the latest
VM state without losing updates
all operations have task control including
asynchronous cancellation: for both subprocesses and xenstore watches
progress updates
subtasks
per-task debug logs
asynchronous event watching API
advertises VM metrics
memory usage
balloon driver co-operativeness
shadow memory usage
domain ids
channel passing (via sendmsg(2)) for efficient memory image copying
Operation Walk-Throughs
Let’s trace through interesting operations to see how the whole system
works.
Sequence diagram of the process of Live Migration.
Inspiration for other walk-throughs:
Shutting down a VM and waiting for it to happen
A VM wants to reboot itself
A disk is hotplugged
A disk refuses to hotunplug
A VM is suspended
Walkthrough: Starting a VM
A Xenopsd client wishes to start a VM. They must first tell Xenopsd the VM
configuration to use. A VM configuration is broken down into objects:
VM: A device-less Virtual Machine
VBD: A virtual block device for a VM
VIF: A virtual network interface for a VM
PCI: A virtual PCI device for a VM
Treating devices as first-class objects is convenient because we wish to expose
operations on the devices such as hotplug, unplug, eject (for removable media),
carrier manipulation (for network interfaces) etc.
The “add” functions in the Xenopsd interface cause Xenopsd to create the
objects:
the XenAPI has many clients which are updated on long release cycles. The
main property needed is backwards compatibility, so that new releases of xapi
remain compatible with these older clients. Quite often, we will choose to
“grandfather in” some poorly designed interface simply because we wish to
avoid imposing churn on 3rd parties.
the Xenopsd API clients are all open-source and are part of the xapi-project.
These clients can be updated as the API is changed. The main property needed
is to keep the interface clean, so that it properly hides the complexity
of dealing with Xen from other components.
The Xenopsd “VM.add” function has code like this:
let add' x =
  debug "VM.add %s" (Jsonrpc.to_string (rpc_of_t x));
  DB.write x.id x;
  let module B = (val get_backend () : S) in
  B.VM.add x;
  x.id
This function does 2 things:
it stores the VM configuration in the “database”
it tells the “backend” that the VM exists
The Xenopsd database is really a set of config files in the filesystem. All
objects belonging to a VM (recall we only have VMs, VBDs, VIFs, PCIs and not
stand-alone entities like disks) are placed into a subdirectory named after
the VM e.g.:
Xenopsd doesn’t have as persistent a notion of a VM as xapi; it is expected that
all objects are deleted when the host is rebooted. However the objects should
be persisted over a simple Xenopsd restart, which is why the objects are stored
in the filesystem.
Aside: it would probably be more appropriate to store the metadata in Xenstore
since this has the exact object lifetime we need. This will require a more
performant Xenstore to realise.
Every running Xenopsd process is linked with a single backend. Currently backends
exist for:
Xen via libxc, libxenguest and xenstore
Xen via libxl, libxc and xenstore
Xen via libvirt
KVM by direct invocation of qemu
Simulation for testing
From here we shall assume the use of the “Xen via libxc, libxenguest and xenstore” (a.k.a.
“Xenopsd classic”) backend.
The backend VM.add
function checks whether the VM we have to manage already exists – and if it does
then it ensures the Xenstore configuration is intact. This Xenstore configuration
is important because at any time a client can query the state of a VM with
VM.stat
and this relies on certain Xenstore keys being present.
Once the VM metadata has been registered with Xenopsd, the client can call
VM.start.
Like all potentially-blocking Xenopsd APIs, this function returns a Task id.
Please refer to the Task handling design for a general
overview of how tasks are handled.
Clients can poll the state of a task by calling TASK.stat
but most clients will prefer to use the event system instead.
Please refer to the Event handling design for a general
overview of how events are handled.
The event model is similar to the XenAPI: clients call a blocking
UPDATES.get
passing in a token which represents the point in time when the last UPDATES.get
returned. The call blocks until some objects have changed state, and these object
ids are returned (NB in the XenAPI the current object states are returned).
The client must then call the relevant “stat” function, in this
case TASK.stat
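The token-based model can be sketched in memory as follows. This is not the `UPDATES.get` implementation: the `updates` record and its functions are hypothetical, the token is modelled as a plain sequence number, and the sketch returns an empty list where the real call would block.

```ocaml
(* Each change appends (sequence number, object id) to the log. *)
type updates = { mutable next : int; mutable log : (int * string) list }

let create () : updates = { next = 0; log = [] }

let record_change (u : updates) (obj_id : string) : unit =
  u.log <- (u.next, obj_id) :: u.log;
  u.next <- u.next + 1

(* Return the ids changed since [token], plus the token for the next call.
   Only ids are returned; the caller must then "stat" each object. *)
let get (u : updates) (token : int) : string list * int =
  let changed =
    List.filter_map
      (fun (seq, id) -> if seq >= token then Some id else None)
      u.log
    |> List.sort_uniq String.compare
  in
  (changed, u.next)
```

Because only ids are returned, several changes to the same object between two calls collapse into a single entry, and the client always reads the latest state when it calls the "stat" function.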
The client will be able to see the task make progress and use this to – for example –
populate a progress bar in a UI. If the client needs to cancel the task then it
can call TASK.cancel;
again see the Task handling design to understand how this is
implemented.
When the Task has completed successfully, then calls to *.stat will show:
the power state is Paused
exactly one valid Xen domain id
all VBDs have active = plugged = true
all VIFs have active = plugged = true
all PCI devices have plugged = true
at least one active console
a valid start time
valid “targets” for memory and vCPU
Note: before a Task completes, calls to *.stat will show partial updates. E.g.
the power state may be paused, but no disk may have been plugged.
UI clients must choose whether they are happy displaying this in-between state
or whether they wish to hide it and pretend the whole operation has happened
transactionally. In particular, when a client wishes to perform side-effects in
response to xenopsd state changes (for example, to clean up an external resource
when a VIF becomes unplugged), it must be very careful to avoid responding
to these in-between states. Generally, it is safest to passively report these
values without driving things directly from them.
Note: the Xenopsd implementation guarantees that, if it is restarted at any point
during the start operation, on restart the VM state shall be “fixed” by either
(i) shutting down the VM; or (ii) ensuring the VM is intact and running.
In the case of xapi, every Xenopsd
Task id is bound one-to-one with a XenAPI task by the function
sync_with_task.
The function update_task
is called when xapi receives a notification that a Xenopsd Task has changed state,
and updates the corresponding XenAPI task.
Xapi launches exactly one thread per Xenopsd instance (“queue”) to monitor for
background events via the function
events_watch
while each thread performing a XenAPI call waits for its specific Task to complete
via the function
event_wait.
It is the responsibility of the client to call
TASK.destroy
when the Task is no longer needed. Xenopsd won’t destroy the task because it contains
the success/failure result of the operation which is needed by the client.
What happens when a Xenopsd receives a VM.start request?
When Xenopsd receives the request it adds it to the appropriate per-VM queue
via the function
queue_operation.
To understand this and other internal details of Xenopsd, consult the
architecture description.
The queue_operation_int
function looks like this:
let queue_operation_int dbg id op =
  let task = Xenops_task.add tasks dbg (fun t -> perform op t ; None) in
  Redirector.push id (op, task) ;
  task
The “task” is a record containing Task metadata plus a “do it now” function
which will be executed by a thread from the thread pool. The
module Redirector
takes care of:
pushing operations to the right queue
ensuring at most one worker thread is working on a VM’s operations
reducing the queue size by coalescing items together
providing a diagnostics interface
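The first two responsibilities in this list can be modelled with a per-VM queue table and a “busy” set. The sketch below is a single-threaded illustration only: the `redirector` record and the `push`, `take` and `finished` functions are hypothetical, and coalescing and diagnostics are left out.

```ocaml
(* Hypothetical, single-threaded model of per-VM operation queues. *)
type 'op redirector = {
  queues : (string, 'op Queue.t) Hashtbl.t;  (* VM id -> pending operations *)
  busy : (string, unit) Hashtbl.t;           (* VM ids that currently have a worker *)
}

let create () = { queues = Hashtbl.create 16; busy = Hashtbl.create 16 }

(* Append an operation to the queue of the VM it belongs to. *)
let push r vm op =
  let q =
    match Hashtbl.find_opt r.queues vm with
    | Some q -> q
    | None ->
        let q = Queue.create () in
        Hashtbl.replace r.queues vm q ;
        q
  in
  Queue.push op q

(* Hand out the next operation for [vm], unless a worker already holds
   that VM or its queue is empty. *)
let take r vm =
  if Hashtbl.mem r.busy vm then None
  else
    match Hashtbl.find_opt r.queues vm with
    | Some q when not (Queue.is_empty q) ->
        Hashtbl.replace r.busy vm () ;
        Some (Queue.pop q)
    | _ -> None

(* Called by a worker once it has finished the operation it took. *)
let finished r vm = Hashtbl.remove r.busy vm
```

Operations for different VMs can be taken concurrently, while a second `take` for the same VM returns `None` until `finished` is called.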
Once a thread from the worker pool becomes free, it will execute the “do it now”
function. In the example above this is perform op t where op is
VM_start vm and t is the Task. The function
perform_exn
has fragments like this:
| VM_start (id, force) -> (
    debug "VM.start %s (force=%b)" id force ;
    let power = (B.VM.get_state (VM_DB.read_exn id)).Vm.power_state in
    match power with
    | Running ->
        info "VM %s is already running" id
    | _ ->
        perform_atomics (atomics_of_operation op) t ;
        VM_DB.signal id
  )
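The fragment above passes the result of atomics_of_operation to perform_atomics. As a rough illustration of the kind of flat list this could produce for a VM start, the sketch below follows the numbered steps of this walkthrough. The constructor names VM_hook_script, VM_create, VM_build, VBD_epoch_begin, VBD_plug and VM_create_device_model appear in the text; the other constructors, all payloads, the `vm` record and the function name `atomics_of_start` are assumptions made for the example.

```ocaml
type vbd = { vbd_id : string; read_write : bool }

type vm = { id : string; vbds : vbd list; vifs : string list; pcis : string list }

type atomic =
  | VM_hook_script of string
  | VM_create of string
  | VM_build of string
  | VBD_set_active of string * bool   (* hypothetical payload *)
  | VBD_epoch_begin of string
  | VBD_plug of string
  | VIF_set_active of string * bool   (* hypothetical payload *)
  | VIF_plug of string
  | VM_create_device_model of string
  | PCI_plug of string
  | VM_mark_alive of string           (* hypothetical name for step 11 *)

(* Build a flat list of micro-ops: no branching at execution time,
   only list construction driven by the VM configuration. *)
let atomics_of_start vm =
  (* Read/write VBDs are plugged before read-only ones. *)
  let rw, ro = List.partition (fun v -> v.read_write) vm.vbds in
  let vbds = List.map (fun v -> v.vbd_id) (rw @ ro) in
  List.concat
    [
      [VM_hook_script vm.id; VM_create vm.id; VM_build vm.id];
      List.map (fun v -> VBD_set_active (v, true)) vbds;
      List.map (fun v -> VBD_epoch_begin v) vbds;
      List.map (fun v -> VBD_plug v) vbds;
      List.map (fun v -> VIF_set_active (v, true)) vm.vifs;
      List.map (fun v -> VIF_plug v) vm.vifs;
      [VM_create_device_model vm.id];
      List.map (fun p -> PCI_plug p) vm.pcis;
      [VM_mark_alive vm.id];
    ]
```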
Each “operation” (e.g. VM_start vm) is decomposed into “micro-ops” by the
function
atomics_of_operation
where the micro-ops are small building-block actions common to the higher-level
operations. Each operation corresponds to a list of “micro-ops”, where there is
no if/then/else. Some of the “micro-ops” may be a no-op depending on the VM
configuration (for example a PV domain may not need a qemu). In the case of
VM_start vm
the Xenopsd server starts by calling the functions that
decompose
the VM_hook_script, VM_create and VM_build micro-ops:
The VM_hook_script micro-op runs the corresponding “hook” scripts. The
code is all in the
Xenops_hooks
module and looks for scripts in the hardcoded path /etc/xapi.d.
2. create a Xen domain
The VM_create micro-op calls the VM.create function in the backend.
In the classic Xenopsd backend, the
VM.create_exn
function must
check if we’re creating a domain for a fresh VM or resuming an existing one:
if it’s a resume then the domain configuration stored in the VmExtra database
table must be used
ask squeezed to create a memory “reservation” big enough to hold the VM
memory. Unfortunately the domain cannot be created until the memory is free
because domain create often fails in low-memory conditions. This means the
“reservation” is associated with our “session” with squeezed; if Xenopsd
crashes and restarts the reservation will be freed automatically.
create the Domain via the libxc hypercall Xenctrl.domain_create
call generate_create_info()
to store the platform data (vCPUs, etc) in the domain’s Xenstore tree.
xenguest then uses this in the build phase (see below) to build the domain.
“transfer” the squeezed reservation to the domain such that squeezed will
free the memory if the domain is destroyed later
compute and set an initial balloon target depending on the amount of memory
reserved (recall we ask for a range between dynamic_min and dynamic_max)
apply the “suppress spurious page faults” workaround if requested
set the “machine address size”
“hotplug” the vCPUs. This operates a lot like memory ballooning – Xen creates
lots of vCPUs and then the guest is asked to only use some of them. Every VM
therefore starts with the “VCPUs_max” setting and co-operative hotplug is
used to reduce the number. Note there is no enforcement mechanism: a VM which
cheats and uses too many vCPUs would have to be caught by looking at the
performance statistics.
3. build the domain
The build phase waits, if necessary, for the Xen memory scrubber to catch
up reclaiming memory, runs NUMA placement, sets vCPU affinity and invokes
the xenguest program to build the system memory layout of the domain.
See the walk-through of the VM_build μ-op for details.
4. mark each VBD as “active”
VBDs and VIFs are said to be “active” when they are intended to be used by a
particular VM, even if the backend/frontend connection hasn’t been established,
or has been closed. If someone calls VBD.stat or VIF.stat then
the result includes both “active” and “plugged”, where “plugged” is true if
the frontend/backend connection is established.
For example xapi will
set VBD.currently_attached
to “active || plugged”. The “active” flag is conceptually very similar to the
traditional “online” flag (which is not documented in the upstream Xen tree
as of Oct/2014 but really should be) except that on unplug, one would set
the “online” key to “0” (false) first before initiating the hotunplug. By
contrast the “active” flag is set to false after the unplug i.e. “set_active”
calls bracket plug/unplug. If the “active” flag was set to false before the unplug
attempt then as soon as the frontend/backend connection is removed clients
would see the VBD as completely dissociated from the VM – this would be misleading
because Xenopsd will not have had time to use the storage API to release locks
on the disks. By cleaning up before setting “active” to false, clients
can be assured that the disks are now free to be reassigned.
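The bracketing of plug/unplug by the “active” flag can be made concrete with a small state model. This is a sketch only: the `vbd_state` record, its `locks_held` field and the `observe` callback are hypothetical and exist purely to show what a client would see at each step of an unplug.

```ocaml
type vbd_state = {
  mutable active : bool;
  mutable plugged : bool;
  mutable locks_held : bool;  (* stands in for storage-side locks on the disk *)
}

(* What xapi is described as reporting for VBD.currently_attached. *)
let currently_attached s = s.active || s.plugged

let plug s =
  s.active <- true ;       (* set_active comes first ... *)
  s.locks_held <- true ;
  s.plugged <- true

(* ... and on unplug "active" is cleared last, after cleanup.
   [observe] is called after each step to model a client polling stat. *)
let unplug ?(observe = fun _ -> ()) s =
  s.plugged <- false ;     (* frontend/backend connection removed *)
  observe s ;
  s.locks_held <- false ;  (* storage cleanup has finished *)
  observe s ;
  s.active <- false ;
  observe s
```

With this ordering, a client never observes a VBD that is no longer attached while the disk locks are still held.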
5. handle non-persistent disks
A non-persistent disk is one which is reset to a known-good state on every
VM start. The VBD_epoch_begin is the signal to perform any necessary reset.
6. plug VBDs
The VBD_plug micro-op will plug the VBD into the VM. Every VBD is plugged
in a carefully-chosen order.
Generally, plug order is important for all types of devices. For VBDs, we must
work around the deficiency in the storage interface where a VDI, once attached
read/only, cannot be attached read/write. Since it is legal to attach the same
VDI with multiple VBDs, we must plug them in such that the read/write VBDs
come first. From the guest’s point of view the order we plug them doesn’t
matter because they are indexed by the Xenstore device id (e.g. 51712 = xvda).
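The device id quoted above (51712 = xvda) follows the Xen convention for xvd devices with disk and partition numbers up to 15: major number 202 in the upper bits, then four bits each for the disk and the partition. The helper below is a sketch of that arithmetic; the function name is hypothetical and the extended encoding for larger disk numbers is deliberately not handled.

```ocaml
(* Device number for xvd<disk><partition>, covering only disks and
   partitions 0..15: (202 << 8) | (disk << 4) | partition. *)
let xvd_device_id ~disk ~partition =
  if disk < 0 || disk > 15 || partition < 0 || partition > 15 then
    invalid_arg "xvd_device_id: only disks and partitions 0..15 are handled"
  else (202 lsl 8) lor (disk lsl 4) lor partition
```

Here disk 0 corresponds to xvda and disk 1 to xvdb, with partition 0 meaning the whole disk.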
call VDI.attach and VDI.activate in the storage API to make the
devices ready (start the tapdisk processes etc)
add the Xenstore frontend/backend directories containing the block device
info
add the extra xenstore keys returned by the VDI.attach call that are
needed for SCSIid passthrough which is needed to support VSS
write the VBD information to the Xenopsd database so that future calls to
VBD.stat can be told about the associated disk (this is needed so clients
like xapi can cope with CD insert/eject etc)
if the qemu is going to be in a different domain to the storage, a frontend
device in the qemu domain is created.
The Xenstore keys are written by the functions
Device.Vbd.add_async
and
Device.Vbd.add_wait.
In a Linux domain (such as dom0) when the backend directory is created, the kernel
creates a “backend device”. Creating any device will cause a kernel UEVENT to fire
which is picked up by udev. The udev rules run a script whose only job is to
stat(2) the device (from the “params” key in the backend) and write the major
and minor number to Xenstore for blkback to pick up. (Aside: FreeBSD doesn’t do
any of this, instead the FreeBSD kernel module simply opens the device in the
“params” key). The script also writes the backend key “hotplug-status=connected”.
We currently wait for this key to be written so that later calls to VBD.stat
will return with “plugged=true”. If the call returns before this key is written
then sometimes we receive an event, call VBD.stat and conclude erroneously
that a spontaneous VBD unplug occurred.
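Waiting for the “hotplug-status=connected” key can be pictured as a bounded wait on one key. The sketch below polls through a caller-supplied `read` function, which is a hypothetical stand-in for a Xenstore read; a real implementation would more likely use a Xenstore watch than a polling loop.

```ocaml
(* Return true once [read key] yields [expected], giving up after
   [attempts] reads. [read] is a hypothetical stand-in for a Xenstore read. *)
let rec wait_for ~read ~key ~expected ~attempts =
  if attempts <= 0 then false
  else if read key = Some expected then true
  else wait_for ~read ~key ~expected ~attempts:(attempts - 1)
```

Only after such a wait succeeds would it be safe to report “plugged=true” to clients.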
7. mark each VIF as “active”
This is for the same reason as VBDs are marked “active”.
8. plug VIFs
Again, the order matters. Unlike VBDs,
there is no read/write read/only constraint and the devices
have unique indices (0, 1, 2, …) but Linux kernels have often (always?)
ignored the actual index and instead relied on the order of results from the
xenstore-ls listing. The order that xenstored returns the items happens
to be the order the nodes were created so this means that (i) xenstored must
continue to store directories as ordered lists rather than maps (which would
be more efficient); and (ii) Xenopsd must make sure to plug the vifs in
the same order. Note that relying on ethX device numbering has always been a
bad idea but is still common. I bet if you change this, many tests will
suddenly start to fail!
compute the port locking configuration required and write this to a well-known
location in the filesystem where it can be read from the udev scripts. This
really should be written to Xenstore instead, since this scheme doesn’t work
with driver domains.
add the Xenstore frontend/backend directories containing the network device
info
write the VIF information to the Xenopsd database so that future calls to
VIF.stat can be told about the associated network
if the qemu is going to be in a different domain to the storage, a frontend
device in the qemu domain is created.
Similarly to the VBD case, the function
Device.Vif.add
will write the Xenstore keys and wait for the “hotplug-status=connected” key.
We do this because we cannot apply the port locking rules until the backend
device has been created, and we cannot know the rules have been applied
until after the udev script has written the key. If we didn’t wait for it then
the VM might execute without all the port locking properly configured.
9. create the device model
The VM_create_device_model micro-op will create a qemu device model if
the VM is HVM; or
the VM uses a PV keyboard or mouse (since only qemu currently has backend
support for these devices).
(if using a qemu stubdom) it will create and build the qemu domain
compute the necessary qemu arguments and launch it.
Note that qemu (aka the “device model”) is created after the VIFs and VBDs have
been plugged but before the PCI devices have been plugged. Unfortunately qemu
traditional infers the needed emulated hardware by inspecting the Xenstore
VBD and VIF configuration and assuming that we want one emulated device per
PV device, up to the natural limits of the emulated buses (i.e. there can be
at most 4 IDE devices: {primary,secondary}{master,slave}). Not only does this
create an ordering dependency that needn’t exist – and which impacts migration
downtime – but it also completely ignores the plain fact that, on a Xen system,
qemu can be in a different domain than the backend disk and network devices.
This hack only works because we currently run everything in the same domain.
There is an option (off by default) to list the emulated devices explicitly
on the qemu command-line. If we switch to this by default then we ought to be
able to start up qemu early, as soon as the domain has been created (qemu will
need to know the domain id so it can map the I/O request ring).
10. plug PCI devices
PCI devices are treated differently to VBDs and VIFs.
If we are attaching the device to an
HVM guest then instead of relying on the traditional Xenstore frontend/backend
state machine we instead send RPCs to qemu requesting they be hotplugged. Note
the domain is paused at this point, but qemu still supports PCI hotplug/unplug.
The reasons why this doesn’t follow the standard Xenstore model are known only
to the people who contributed this support to qemu.
Again the order matters because it determines the position of the virtual device
in the VM.
Note that Xenopsd doesn’t know anything about the PCI devices; concepts such
as “GPU groups” belong to higher layers, such as xapi.
11. mark the domain as alive
A design principle of Xenopsd is that it should tolerate failures such as being
suddenly restarted. It guarantees to always leave the system in a valid state,
in particular there should never be any “half-created VMs”. We achieve this for
VM start by exploiting the mechanism which is necessary for reboot. When a VM
wishes to reboot it causes the domain to exit (via SCHEDOP_shutdown) with a
“reason code” of “reboot”. When Xenopsd sees this event, a VM_check_state
operation is queued. This operation calls
VM.get_domain_action_request
to ask the question, “what needs to be done to make this VM happy now?”. The
implementation checks the domain state for shutdown codes and also checks a
special Xenopsd Xenstore key. When Xenopsd creates a Xen domain it sets this
key to “reboot” (meaning “please reboot me if you see me”) and when Xenopsd
finishes starting the VM it clears this key. This means that if Xenopsd crashes
while starting a VM, the new Xenopsd will conclude that the VM needs to be rebooted
and will clean up the current domain and create a fresh one.
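The decision made when answering “what needs to be done to make this VM happy now?” can be sketched as a function of the two inputs named above: the domain's shutdown code and the special Xenstore key. The types, field names and the treatment of non-reboot shutdown codes below are assumptions for illustration, not the real VM.get_domain_action_request.

```ocaml
type action = Needs_reboot | Needs_poweroff  (* hypothetical result type *)

type domain = {
  shutdown_reason : string option;     (* Some code if the domain has exited *)
  action_request_key : string option;  (* the special Xenopsd Xenstore key *)
}

let get_domain_action_request d =
  match (d.shutdown_reason, d.action_request_key) with
  | Some "reboot", _ -> Some Needs_reboot
  | Some _, _ -> Some Needs_poweroff    (* assumption: other codes mean power off *)
  | None, Some "reboot" -> Some Needs_reboot  (* start never finished: rebuild *)
  | None, _ -> None                     (* healthy domain, nothing to do *)
```

A domain whose start was interrupted still carries the “reboot” key, so a restarted Xenopsd treats it exactly like a guest that asked to reboot.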
12. unpause the domain
A Xenopsd VM.start will always leave the domain paused, so strictly speaking
this is a separate “operation” queued by the client (such as xapi) after the
VM.start has completed. The function
VM.unpause
is reassuringly simple:
See the walk-through of the Domain.build function
for more details on this phase.
Apply the cpuid configuration
Store the current domain configuration on disk – it’s important to know
the difference between the configuration you started with and the configuration
you would use after a reboot because some properties (such as maximum memory
and vCPUs) are fixed on create.
Call wait_xen_free_mem
to wait (if necessary), for the Xen memory scrubber to catch up reclaiming memory.
It
calls Xenctrl.physinfo which returns:
hostinfo.free_pages - the free and already scrubbed pages (available)
host.scrub_pages - the not yet scrubbed pages (not yet available)
It repeats this until a timeout, as long as free_pages is lower
than the required pages,
unless scrub_pages is 0 (no scrubbing left to do).
Note: free_pages is system-wide memory, not memory specific to a NUMA node.
Because this is not NUMA-aware, in case of temporary node-specific memory shortage,
this check is not sufficient to prevent the VM from being spread over all NUMA nodes.
It is planned to resolve this issue by claiming NUMA node memory during NUMA placement.
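The waiting logic described above can be sketched as a small loop. The record below keeps only the two fields mentioned in the text, and the `physinfo` and `sleep` arguments, the attempt counter standing in for the timeout, and the boolean result are assumptions made for this example rather than the signature of the real function.

```ocaml
(* Only the two fields discussed in the text; the real record has more. *)
type physinfo = { free_pages : int64; scrub_pages : int64 }

(* Wait until enough scrubbed memory is free. Gives up when nothing is
   left to scrub or when the attempts (standing in for a timeout) run out. *)
let rec wait_xen_free_mem ~physinfo ~sleep ~required_pages ~attempts =
  let info = physinfo () in
  if Int64.compare info.free_pages required_pages >= 0 then true
  else if Int64.equal info.scrub_pages 0L then false  (* no scrubbing left to do *)
  else if attempts <= 0 then false                    (* timed out *)
  else begin
    sleep () ;
    wait_xen_free_mem ~physinfo ~sleep ~required_pages ~attempts:(attempts - 1)
  end
```

As the note above says, the check is system-wide, so a `true` result says nothing about memory being free on any particular NUMA node.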
Call the hypercall to set the timer mode
Call the hypercall to set the number of vCPUs
Call the numa_placement function
as described in the NUMA feature description
when the xe configuration option numa_placement is set to Best_effort
(except when the VM has a hard CPU affinity).
match !Xenops_server.numa_placement with
| Any ->
    ()
| Best_effort ->
    log_reraise (Printf.sprintf "NUMA placement") (fun () ->
        if has_hard_affinity then
          D.debug "VM has hard affinity set, skipping NUMA optimization"
        else
          numa_placement domid ~vcpus
            ~memory:(Int64.mul memory.xen_max_mib 1048576L))
NUMA placement
build_pre passes the domid, the number of vCPUs and xen_max_mib to the
numa_placement
function to run the algorithm to find the best NUMA placement.
When it returns a NUMA node to use, it calls the Xen hypercalls
to set the vCPU affinity to this NUMA node:
let vm = NUMARequest.make ~memory ~vcpus in
let nodea =
  match !numa_resources with
  | None ->
      Array.of_list nodes
  | Some a ->
      Array.map2 NUMAResource.min_memory (Array.of_list nodes) a
in
numa_resources := Some nodea ;
Softaffinity.plan ~vm host nodea
By using the default auto_node_affinity feature of Xen,
setting the vCPU affinity causes the Xen hypervisor to activate
NUMA node affinity for memory allocations to be aligned with
the vCPU affinity of the domain.
Summary: This passes the information to the hypervisor that memory
allocation for this domain should preferably be done from this NUMA node.
Invoke the xenguest program
With the preparation in build_pre completed, Domain.build calls
the xenguest function to invoke the xenguest program to build the domain.
This can be used, for example, when there might not be enough memory on the preferred
NUMA node, and there are other NUMA nodes (in the same CPU package) to use
(reference).
xenguest
As part of starting a new domain in VM_build, xenopsd calls xenguest.
When multiple domain build threads run in parallel,
multiple instances of xenguest also run in parallel:
xenguest is called by the xenopsd Domain.build function
to perform the build phase for new VMs, which is part of the xenopsd VM.start operation.
xenguest
was created as a separate program due to issues with
libxenguest:
It wasn’t threadsafe: fixed, but it still uses a per-call global struct
It had an incompatible licence, but it is now licensed under the LGPL.
Those were fixed, but we still shell out to xenguest, which is currently
carried in the patch queue for the Xen hypervisor packages, but could become
an individual package once planned changes to the Xen hypercalls are stabilised.
Over time, xenguest has evolved to build more of the initial domain state.
xenopsd must pass this information to xenguest to build a VM:
The domain type to build for (HVM, PVH or PV).
It is passed using the command line option --mode hvm_build.
The domid of the created empty domain,
The amount of system memory of the domain,
A number of other parameters that are domain-specific.
xenopsd uses the Xenstore to provide platform data:
the vCPU affinity
the vCPU credit2 weight/cap parameters
whether the NX bit is exposed
whether the viridian CPUID leaf is exposed
whether the system has PAE or not
whether the system has ACPI or not
whether the system has nested HVM or not
whether the system has an HPET or not
When called to build a domain, xenguest reads those and builds the VM accordingly.
Walkthrough of the xenguest build mode
[Flowchart: xenguest --mode hvm_build domid. stub_xc_hvm_build() calls get_flags() (reads the Xenstore platform data), configure_vcpus() (issues a Xen hypercall) and setup_mem() (issues Xen hypercalls to set up domain memory).]
Based on the given domain type, the xenguest program calls dedicated
functions for the build process of the given domain type.
These are:
stub_xc_hvm_build() for HVM,
stub_xc_pvh_build() for PVH, and
stub_xc_pv_build() for PV domains.
These domain build functions call these functions:
get_flags() to get the platform data from the Xenstore
configure_vcpus() which uses the platform data from the Xenstore to configure vCPU affinity and the credit scheduler parameters vCPU weight and vCPU cap (max % pCPU time for throttling)
The setup_mem function for the given VM type.
The function hvm_build_setup_mem()
For HVM domains, hvm_build_setup_mem() is responsible for deriving the memory
layout of the new domain, allocating the required memory and populating it for the
new domain. It must:
Derive the e820 memory layout of the system memory of the domain
including memory holes depending on PCI passthrough and vGPU flags.
Load the BIOS/UEFI firmware images
Store the final MMIO hole parameters in the Xenstore
Call the libxenguest function xc_dom_boot_mem_init() (see below)
Call construct_cpuid_policy() to apply the CPUID featureset policy
The function xc_dom_boot_mem_init()
[Flowchart: in xenguest, hvm_build_setup_mem() calls the libxenguest function xc_dom_boot_mem_init(), which passes the vmemranges to meminit_hvm().]
xc_dom_boot_mem_init(): https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_boot.c#L110-L126
meminit_hvm(): https://github.com/xen-project/xen/blob/39c45c/tools/libs/guest/xg_dom_x86.c#L1348-L1648
hvm_build_setup_mem() calls
xc_dom_boot_mem_init()
to allocate and populate the domain’s system memory.
It calls
meminit_hvm()
to loop over the vmemranges of the domain for mapping the system RAM
of the guest from the Xen hypervisor heap. Its goals are:
Attempt to allocate 1GB superpages when possible
Fall back to 2MB pages when 1GB allocation failed
Fall back to 4k pages when both failed
It uses the hypercall
XENMEM_populate_physmap
to perform memory allocation and to map the allocated memory
to the system RAM ranges of the domain.
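The fallback from 1GB to 2MB to 4k pages can be sketched as a loop that always tries the largest extent first. This is an illustration only, not the meminit_hvm() code: the `alloc` callback is a hypothetical stand-in for one populate-physmap request of a given order, and alignment requirements and vmemranges are ignored.

```ocaml
(* Extent orders in units of 4 KiB pages:
   2^18 pages = 1 GiB, 2^9 pages = 2 MiB, 2^0 pages = 4 KiB. *)
let orders = [18; 9; 0]

(* Cover [pages] 4 KiB pages, preferring larger extents. [alloc order]
   returns false when an allocation of that order fails. The result lists
   the order of each extent used, or is None if even a 4 KiB allocation fails. *)
let populate ~alloc pages =
  let rec go remaining acc =
    if remaining = 0 then Some (List.rev acc)
    else
      let fits order = 1 lsl order <= remaining in
      match List.find_opt (fun o -> fits o && alloc o) orders with
      | Some o -> go (remaining - (1 lsl o)) (o :: acc)
      | None -> None
  in
  go pages []
```

Each step retries the larger orders, so a transient 1GB failure does not force the rest of the range onto smaller pages.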
At the end of this walkthrough, a sequence diagram of the overall process is included.
Invocation
The command to migrate the VM is dispatched
by the autogenerated dispatch_call function from xapi/server.ml. For
more information about the generated functions you can have a look at the
XAPI IDL model.
The command triggers the operation
VM_migrate
that uses many low-level atomic operations, which the sections below walk through.
The migrate command has several parameters such as:
Should it be started asynchronously,
Should it be forwarded to another host,
How arguments should be marshalled, and so on.
A new thread is created by xapi/server_helpers.ml
to handle the command asynchronously. The helper thread checks if
the command should be passed to the message forwarding
layer in order to be executed on another host (the destination) or locally (if
it is already at the destination host).
It will finally reach xapi/api_server.ml, which posts
a command to the message broker, message switch.
This is a JSON-RPC HTTP request sent over a Unix socket to communicate between
XAPI daemons. In the case of migration, the message sent by XAPI will be
consumed by the xenopsd
daemon, which does the job of migrating the VM.
Overview
The migration is an asynchronous task and a thread is created to handle this task.
The task reference is returned to the client, which can then check
its status until completion.
As shown in the introduction, xenopsd
fetches the
VM_migrate
operation from the message broker.
The entities that need to be migrated are: VDI, VIF, VGPU and PCI components.
During the migration process, the destination domain will be built with the same
UUID as the original VM, except that the last part of the UUID will be
XXXXXXXX-XXXX-XXXX-XXXX-000000000001. The original domain will be removed using
XXXXXXXX-XXXX-XXXX-XXXX-000000000000.
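The temporary UUIDs above differ from the original only in their last group. A helper that derives them might look like the sketch below; the function name `with_suffix` and the sample UUID in the usage note are made up for illustration.

```ocaml
(* Replace the last dash-separated group of a UUID with [suffix]. *)
let with_suffix uuid suffix =
  match String.rindex_opt uuid '-' with
  | Some i -> String.sub uuid 0 (i + 1) ^ suffix
  | None -> invalid_arg "with_suffix: no '-' found in UUID"
```

For example, `with_suffix "c6a5b1d2-1234-5678-9abc-def012345678" "000000000001"` gives the name used for the domain being built on the destination.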
Preparing VM migration
At specific places, xenopsd can execute hooks to run scripts.
In case a pre-migrate script is in place, a command to run this script
is sent to the original domain.
Likewise, a command is sent to Qemu using the Qemu Machine Protocol (QMP)
to check that the domain can be suspended (see xenopsd/xc/device_common.ml).
After checking with Qemu that the VM can be suspended, the migration can begin.
Importing metadata
As for hooks, commands to the source domain are sent using stunnel, a daemon which
is used as a wrapper to manage SSL-encrypted communication between two hosts in the same
pool. To import the metadata, an XML-RPC command is sent to the original domain.
Once imported, it will give us a reference id and will allow building the new domain
on the destination using the temporary VM uuid XXXXXXXX-XXXX-XXXX-XXXX-000000000001
where XXX... is the reference id of the original VM.
Memory setup
One of the first steps is the setup of the VM’s memory: the backend checks that there
is no ballooning operation in progress. If there is one, the migration could fail.
Once memory has been checked, the daemon will get the state of the VM (running, halted, …) and
the backend retrieves the domain’s platform data (memory, vCPUs, etc.) from the Xenstore.
Once this is complete, we can restore VIF and create the domain.
The synchronisation of the memory is the first point of synchronisation and everything
is ready for VM migration.
Destination VM setup
After receiving memory we can set up the destination domain. If we have a vGPU we need to kick
off its migration process. We will need to wait for the acknowledgement that the
GPU entry has been successfully initialized before starting the main VM migration.
The receiver informs the sender using a handshake protocol
that everything is set up and ready for save/restore.
Destination VM restore
VM restore is a low-level atomic operation, VM.restore.
This operation is represented by a function call to the backend.
It uses Xenguest, a low-level utility from XAPI toolstack, to interact with the Xen hypervisor
and libxc for sending a migration request to the emu-manager.
After sending the request, results coming from emu-manager are collected
by the main thread. It blocks until results are received.
During the live migration, emu-manager helps in ensuring the correct state
transitions for the devices and handling the message passing for the VM as
it’s moved between hosts. This includes making sure that the state of the
VM’s virtual devices, like disks or network interfaces, is correctly moved over.
Destination VM rename
Once all operations are done, xenopsd renames the target VM from its temporary
name to its real UUID. This operation is a low-level atomic
VM.rename
which takes care of updating the Xenstore on the destination host.
Restoring devices
Restoring devices starts by activating VBD using the low level atomic operation
VBD.set_active. It is an update of Xenstore. VBDs that are read-write must
be plugged before read-only ones. Once activated the low level atomic operation
VBD.plug
is called. VDIs are attached and activated.
The next devices are VIFs, which are set as active with VIF.set_active and plugged with VIF.plug.
If there are VGPUs we will set them as active now using the atomic VGPU.set_active.
Creating the device model
create_device_model
configures qemu-dm and starts it. This allows PCI devices to be managed.
PCI plug
PCI.plug
is executed by the backend. It plugs a PCI device and advertises it to QEMU if this option is set. It is
the case for NVIDIA SR-IOV vGPUs.
Unpause
The libxenctrl call
xc_domain_unpause()
unpauses the domain, and it starts running.