KV Scheduler#

This section describes the KV Scheduler.

Package references: kvscheduler, api, graph


The KV Scheduler core plugin supports dependency resolution, and computes the proper programming sequence for multiple interdependent configuration items. It works with VPP and Linux agents on the southbound (SB) side, and external data stores and RPC clients on the northbound (NB) side.


The KV Scheduler addresses several challenges observed in the original VPP agent design. These challenges arose as the variety and complexity of different configuration items increased.

  • VPP and Linux plugins became bloated, complicated, and suffered from race conditions and a lack of visibility.

  • VPP and Linux plugin configurator components, each processing a specific configuration item type, were built from scratch, solving the same set of problems, with frequent code duplication.

  • Plugin configurators would communicate with each other through notifications, and react to changes asynchronously to ensure a proper operational sequence. Dependency resolution processing was distributed across all configurators, making it difficult to understand, predict, and stabilize the system behavior.

  • Re-synchronization (resync) between the desired northbound (NB) configuration state and the southbound (SB) runtime configuration state became unreliable and unpredictable.


Northbound (NB) describes the desired or intended configuration state originating from NB entities such as a KV data store or gRPC.

Southbound (SB) describes the actual runtime configuration state of the data plane. Resync stands for state reconciliation. CRUD stands for create, read, update, and delete. KV Descriptors are referred to as descriptors.

Basic Concepts#

Internally, the KV Scheduler builds a graph to model system state. The vertices represent configuration items; the edges represent the relationships between the configuration items. The KV Scheduler walks the graph to mark dependencies and to compute the sequence of programming actions to perform. In addition, it builds transaction plans, refreshes state, and performs resync.

The KV Scheduler generates a transaction plan that drives CRUD operations to the VPP or Linux agents in the SB direction. It will cache configuration items until outstanding dependencies are resolved. It coordinates and performs partial or full state synchronization for NB and SB.

To abstract away from the details of specific configuration items, graph vertices are “described” to the KV Scheduler using KV Descriptors. You can think of a KV Descriptor as a handler, each supporting a distinct subset of graph vertices. KV Descriptors provide the KV Scheduler with pointers to callbacks that implement CRUD operations.
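The "descriptor as a handler" idea can be pictured with a minimal sketch. All type and function names below are hypothetical simplifications for illustration, not the actual vpp-agent API: a descriptor bundles a key selector with CRUD callbacks, and the scheduler dispatches each key to the descriptor that claims it.

```go
package main

import (
	"fmt"
	"strings"
)

// Descriptor is a hypothetical, simplified stand-in for a KV Descriptor:
// a key selector plus CRUD callbacks that "teach" the scheduler how to
// handle one type of configuration item.
type Descriptor struct {
	Name        string
	KeySelector func(key string) bool // which keys this descriptor handles
	Create      func(key, value string) error
	Delete      func(key string) error
	Update      func(key, oldVal, newVal string) error
}

// descriptorFor picks the (at most one) descriptor responsible for a key.
func descriptorFor(descriptors []*Descriptor, key string) *Descriptor {
	for _, d := range descriptors {
		if d.KeySelector(key) {
			return d
		}
	}
	return nil
}

func main() {
	applied := map[string]string{}
	ifaceDescriptor := &Descriptor{
		Name:        "vpp-interface",
		KeySelector: func(key string) bool { return strings.HasPrefix(key, "config/vpp/interfaces/") },
		Create: func(key, value string) error {
			applied[key] = value // a real descriptor would program VPP here
			return nil
		},
		Delete: func(key string) error { delete(applied, key); return nil },
	}
	d := descriptorFor([]*Descriptor{ifaceDescriptor}, "config/vpp/interfaces/loop0")
	d.Create("config/vpp/interfaces/loop0", `{"name":"loop0"}`)
	fmt.Println(applied["config/vpp/interfaces/loop0"])
}
```

A real KV Descriptor additionally declares dependencies, derived values, and a Retrieve() operation, as described below.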


You can retrieve information from the KV Scheduler using agentctl and REST APIs.


You can use the KV Scheduler system for any application containing an object with other dependent objects.

Mediator Pattern#

KV Descriptors employ a mediator pattern, where plugins are decoupled, and do not directly communicate with each other. Instead, any interactions between plugins occur through the KV Scheduler mediator.

Plugins provide CRUD callbacks, and describe their dependencies on other plugins through one or more KV Descriptors. The KV Scheduler plans operations without knowing what the graph vertices stand for in the system.

Furthermore, the number and variety of configuration items can grow without altering the transaction processing engine, or increasing the complexity of the components in your application. You just need to implement and register new KV Descriptors.


The KV Scheduler’s graph-based representation uses the following terminology:

  • Model describes a single configuration item type such as an interface, route, or bridge domain. For more details, see model. To look over an implementation, see the bridge domain model.

  • Value (proto.Message) is a runtime instance of a given model. For more details, see proto.Message.

  • Key (string) is a unique identifier built from the model specification and value attributes. For more information, see keys.

  • Label (string) uniquely identifies a value among values of the same type, such as an interface name.

  • Value State (enum ValueState) defines the operational state of a value. For example, a value can be successfully CONFIGURED, PENDING due to unmet dependencies, or FAILED after the last CRUD operation returns an error. Value state proto lists the possible value states.

  • Value Status (struct BaseValueStatus) contains details on the executed operation, the last returned error, and the list of unmet dependencies. The KV Scheduler API can read the value status, and watch for updates.

  • Metadata (interface{}) consists of additional runtime information, of an undefined type, assigned to a value. Metadata is updated after a CRUD operation or agent restart. For example, the KV Scheduler graph contains sw_if_index metadata for each interface.

  • Metadata Map, also known as index-map, implements the mapping between a value label and its metadata for a given value type. The KV Scheduler creates and updates metadata maps. Other plugins can read and reference the metadata in read-only mode.

    For example, the interface plugin exposes its interface metadata map containing the sw_if_index for each interface. The ARP plugin, the route plugin, and other plugins use the sw_if_index to reference specific interfaces.

  • Value origin defines the source of the value. The value origin options are NB, SB and unknown.

  • Key-value pairs specify a key and associated value. CRUD operations such as Create() manipulate key-value pairs. Every key-value pair has at most one KV Descriptor assigned.

  • Dependency references another key-value pair that must be created and exist before the dependent value can be created.

    If a dependency is not satisfied, the dependent value remains cached in the PENDING state. A dependent value can have multiple dependencies. All dependencies must be satisfied before creating the dependent value.

  • Derived value is a single field associated with an original value, or a property of it. Custom CRUD operations manipulate the derived value, possibly with its own dependencies. A derived value can serve as a target for dependencies of other key-value pairs.

    For example, every interface assigned to a bridge domain is treated as a separate key-value pair that depends on the target interface existing. This way, a missing interface does not block programming the rest of the bridge domain.

    See the bridge domain control-flow demonstrating the order of operations required to create a bridge domain.

  • Graph of values is the KV Scheduler's internal in-memory representation of all configured and pending key-value pairs. Graph nodes represent the key-value pairs; graph edges represent inter-value relations, such as “depends-on” or “is-derived-from”.

  • Graph Refresh updates the graph content to reflect the real SB state. This process calls the Retrieve() function of every descriptor supporting this operation. It adds or updates graph vertices with the retrieved values. Refresh executes just prior to full or downstream resync, or after a CRUD operation failure impacts any vertices.

  • KV Descriptor implements CRUD operations and defines derived values and dependencies for a single value type.

    To learn more, read how to implement your own KV Descriptor. For an in-depth discussion, see KV Descriptors.
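The metadata map concept from the list above can be sketched in a few lines. The names below are illustrative only (the real vpp-agent index-map API differs): the owning plugin writes label-to-metadata entries after CRUD operations, while other plugins get a read-only lookup.

```go
package main

import "fmt"

// metadataMap is a hypothetical sketch of an index-map: it maps a value's
// label (e.g. an interface name) to its runtime metadata (e.g. sw_if_index).
type metadataMap struct {
	byLabel map[string]uint32
}

func newMetadataMap() *metadataMap {
	return &metadataMap{byLabel: map[string]uint32{}}
}

// update is called by the owning plugin (here: the interface plugin)
// after a CRUD operation assigns new metadata to a value.
func (m *metadataMap) update(label string, swIfIndex uint32) {
	m.byLabel[label] = swIfIndex
}

// lookup is the read-only view other plugins (ARP, routes, ...) use to
// translate an interface name into its sw_if_index.
func (m *metadataMap) lookup(label string) (uint32, bool) {
	idx, ok := m.byLabel[label]
	return idx, ok
}

func main() {
	m := newMetadataMap()
	m.update("loop0", 1) // interface plugin records sw_if_index after Create()
	if idx, ok := m.lookup("loop0"); ok {
		fmt.Println("loop0 sw_if_index:", idx)
	}
}
```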

To retrieve KV Scheduler graph details, use GET /scheduler/graph-snapshot, or agentctl dump.


The KV Scheduler must learn about two types of relationships between values when scheduling configuration updates for your application.

  1. A depends on B:
  • A cannot exist without B.

  • Your request to create A, without the existence of B, must be postponed. A is marked PENDING.

  • If A exists, and you need to remove B, you must first remove A.
    A is marked PENDING in case you later restore B.

  2. B is derived from A:
  • Value B is not created directly by either NB or SB. Rather, it is derived from a base value A using the DerivedValues() method of A's descriptor.

  • A derived value B exists only as long as its base A exists.
    B is removed immediately when the base value A disappears.

  • A derived value's descriptor can differ from the base value descriptor. You might have a base value property that other values depend on, or an extra action to perform when additional dependencies are met.

You will use dependencies to implement descriptors and plugin lookup in your application.
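The "A depends on B" rule can be sketched as a scheduling loop over toy items. This is a hedged illustration of the idea, not the scheduler's actual algorithm: each item is created only once all of its dependencies exist; anything left over stays cached as PENDING.

```go
package main

import "fmt"

// item is a hypothetical configuration item that depends on other keys.
type item struct {
	key  string
	deps []string
}

// schedule creates items whose dependencies are all satisfied, repeating
// until no more progress is possible; the rest remain PENDING.
func schedule(items []item) (created []string, pending []string) {
	done := map[string]bool{}
	remaining := append([]item(nil), items...)
	for progress := true; progress; {
		progress = false
		var next []item
		for _, it := range remaining {
			satisfied := true
			for _, d := range it.deps {
				if !done[d] {
					satisfied = false
					break
				}
			}
			if satisfied {
				done[it.key] = true
				created = append(created, it.key)
				progress = true
			} else {
				next = append(next, it)
			}
		}
		remaining = next
	}
	for _, it := range remaining {
		pending = append(pending, it.key) // unmet dependencies: stays PENDING
	}
	return created, pending
}

func main() {
	created, pending := schedule([]item{
		{key: "route1", deps: []string{"iface1"}}, // depends on the interface
		{key: "iface1"},
		{key: "arp1", deps: []string{"iface2"}}, // iface2 never arrives
	})
	fmt.Println(created, pending) // → [iface1 route1] [arp1]
}
```

Note how route1 is postponed until iface1 exists, while arp1 stays PENDING because its dependency is never created.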


Values obtained from SB via notifications are not checked for dependencies.


The diagram below illustrates the interactions between the KV Scheduler and the layers above and below. Using Bridge Domain as an example, it depicts the dependency and the derivation relationships. It also shows a cached pending value of unspecified type waiting for the system to first create the interface.

KVScheduler diagram

If you wish to re-create the transactions shown in the graph, use a combination of agentctl put, dump and config history. Quick Start, Plugin and agentctl provide examples for configuring interfaces and bridge domains.


You don’t have to implement plugin-specific resync functions. By providing a KV Descriptor with CRUD operation callbacks to the KV Scheduler, your plugin “teaches” the KV Scheduler how to handle the plugin’s configuration items. The KV Scheduler computes and executes the operations to perform a complete resynchronization.

The KV Scheduler further enhances the concept of state reconciliation by defining three resync types: full resync, upstream resync, and downstream resync.

  • Full resync:

    • Re-reads intended configuration from NB.

    • Refreshes the SB view using one or more Retrieve() operations.

    • Resolves inconsistencies using Create()/Delete()/Update() operations.

  • Upstream resync:

    • Partial resync, similar to full resync.

    • DOES NOT refresh the SB view; assumes it is up-to-date or not required for the resync. Upstream resync skips the SB refresh when it is easier to re-calculate the intended state than to determine the minimal difference.

  • Downstream resync:

    • Partial resync, similar to full resync.

    • DOES NOT re-read intended configuration from NB, and assumes it is up-to-date.

    • Used periodically to resync, even without interacting with the NB.

You can initiate a downstream resync using agentctl, or POST /scheduler/downstream-resync.
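At its core, downstream resync is a diff between the cached intended (NB) state and the refreshed runtime (SB) state. The sketch below illustrates that idea with plain string maps; it is a simplified model, not the scheduler's implementation:

```go
package main

import "fmt"

// op is one planned CRUD operation produced by a resync diff.
type op struct{ kind, key string }

// downstreamResync compares the intended (NB) state against the refreshed
// runtime (SB) state and emits the Create/Update/Delete operations that
// remove the differences.
func downstreamResync(intended, runtime map[string]string) []op {
	var plan []op
	for key, want := range intended {
		have, exists := runtime[key]
		switch {
		case !exists:
			plan = append(plan, op{"CREATE", key}) // missing in SB
		case have != want:
			plan = append(plan, op{"UPDATE", key}) // drifted in SB
		}
	}
	for key := range runtime {
		if _, ok := intended[key]; !ok {
			plan = append(plan, op{"DELETE", key}) // obsolete in SB
		}
	}
	return plan
}

func main() {
	plan := downstreamResync(
		map[string]string{"iface1": "up", "route1": "via iface1"}, // intended
		map[string]string{"iface1": "down", "route2": "stale"},    // runtime
	)
	fmt.Println(len(plan), "operations planned")
}
```

A full resync follows the same pattern, except that the intended state is also re-read from NB before the diff is taken.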


The KV Scheduler groups and applies related changes as a transaction. However, not all NB interfaces support this function. For example, changes from the etcd data store are received one at a time.

To leverage transaction support, use the localclient from within the same process, or gRPC for remote access.

The KV Scheduler queues and asynchronously executes transactions. This simplifies the algorithm and avoids concurrency issues.
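The queued, single-worker execution model can be sketched with a channel and one goroutine. This is a hedged, stdlib-only illustration of the design (names are made up): because exactly one worker drains the queue, transactions execute one at a time and the engine avoids concurrency issues.

```go
package main

import "fmt"

// txn is a hypothetical queued transaction: a batch of key-value changes
// plus a channel for delivering the result back to the caller.
type txn struct {
	values map[string]string
	done   chan error
}

// startScheduler launches the single worker goroutine that executes queued
// transactions one at a time.
func startScheduler(state map[string]string) chan<- *txn {
	queue := make(chan *txn, 16)
	go func() {
		for t := range queue {
			for k, v := range t.values { // apply the batch as one unit
				state[k] = v
			}
			t.done <- nil
		}
	}()
	return queue
}

// commit enqueues a transaction and blocks until it has executed.
func commit(queue chan<- *txn, values map[string]string) error {
	t := &txn{values: values, done: make(chan error, 1)}
	queue <- t
	return <-t.done
}

func main() {
	state := map[string]string{}
	queue := startScheduler(state)
	_ = commit(queue, map[string]string{"iface1": "up", "route1": "via iface1"})
	fmt.Println(len(state), "items applied") // both changes applied together
}
```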

The KV Scheduler splits transaction processing into several stages.

The primary functions of the simulation and execution stages consist of the following:

  • Simulation:

    • generates the transaction plan consisting of a set of planned operations.

    • DOES NOT perform CRUD callbacks to descriptors.

    • Assumes no failures.

  • Execution:

    • Executes operations sorted in the correct order.

    • If any operation fails, the KV Scheduler reverts the already applied changes, unless you enable BestEffort mode. In best effort mode, the KV Scheduler attempts to apply the maximum possible set of required changes.

      BestEffort is the default for resync.

Following simulation, the KV Scheduler generates the transaction metadata and the transaction plan. Because this happens before execution, you can see the planned operations even if one of them causes the agent to crash.

After the transaction has executed, the set of completed operations and any errors prints to a log. You can view the transaction logs using agentctl, or GET /scheduler/txn-history.

The example below shows an abbreviated transaction log. Observe that the planned operations match the executed operations. This indicates no errors.

| Transaction #5                                                                                        NB transaction |
  * transaction arguments:
      - seq-num: 5
      - type: NB transaction
      - Description: example transaction
      - values:
          - key: vpp/config/v2/nat44/dnat/default/kubernetes
            value: { label:"default/kubernetes" st_mappings:<external_ip:"" external_port:443 local_ips:<local_ip:"" local_port:6443 > twice_nat:SELF >  }
          - key: vpp/config/v2/ipneigh
            value: { mode:IPv4 scan_interval:1 stale_threshold:4  }
  * planned operations:
      1. ADD:
          - key: vpp/config/v2/nat44/dnat/default/kubernetes
          - value: { label:"default/kubernetes" st_mappings:<external_ip:"" external_port:443 local_ips:<local_ip:"" local_port:6443 > twice_nat:SELF >  }
      2. MODIFY:
          - key: vpp/config/v2/ipneigh
          - prev-value: { scan_interval:1 max_proc_time:20 max_update:10 scan_int_delay:1 stale_threshold:4  }
          - new-value: { mode:IPv4 scan_interval:1 stale_threshold:4  }

// here you would potentially see logs from C(R)UD operations           

  * executed operations (2019-01-21 11:29:27.794325984 +0000 UTC m=+8.270999232 - 2019-01-21 11:29:27.797588466 +0000 UTC m=+8.274261700, duration = 3.262468ms):
      1. ADD:
          - key: vpp/config/v2/nat44/dnat/default/kubernetes
          - value: { label:"default/kubernetes" st_mappings:<external_ip:"" external_port:443 local_ips:<local_ip:"" local_port:6443 > twice_nat:SELF >  }
      2. MODIFY:
          - key: vpp/config/v2/ipneigh
          - prev-value: { scan_interval:1 max_proc_time:20 max_update:10 scan_int_delay:1 stale_threshold:4  }
          - new-value: { mode:IPv4 scan_interval:1 stale_threshold:4  }
x #5                                                                                                          took 3ms x


In some scenarios, you could have configuration updates that depend on a data plane event. Problems can arise if your agent expects a completed configuration update, while the underlying data plane is not fully configured. In other words, the configurator signals config update finished, but transaction key-values are still PENDING.

See Issue #1732 for a description of this problem.

You can tell the configurator to wait until all key-value pairs are non-PENDING with the WaitDone option. The configurator proto defines a wait_done boolean. If you set it to true, the configurator waits for all transaction key-values to reach a non-PENDING state before signaling config update finished, or until the request times out.

The WaitDone option works for update requests and delete requests.


KV Scheduler API#

A separate sub-package “api” of the KV Scheduler plugin defines the KV Scheduler API.

Multiple files describe interfaces that constitute the KV Scheduler API. The table below contains the file name and brief description. For more details on the specific files, click on the file name.

| File | Description |
|------|-------------|
| errors.go | Defines errors returned from the KV Scheduler. The InvalidValueError error wrapper lets plugins specify the reason for a validation error. |
| kv_scheduler_api.go | Used by NB to commit transactions, or to read and watch for value status updates; used by SB to push notifications. |
| kv_descriptor_api.go | Defines the KV Descriptor interface. |
| txn_options.go | Set of available options for transactions. |
| txn_record.go | Type definition for storing processed transaction records. |
| value_status.proto | Operational value status proto. The API reads the current status of one or more values, and watches for updates through a channel. |
| value_status.pb.go | Go code generated from value_status.proto. |
| value_status.go | Extends value_status.pb.go to implement proper (un)marshalling for proto.Messages. |


You can retrieve the KV Scheduler runtime state through REST APIs. For more information, see KV Scheduler REST API.

Agentctl Dump#

You can use the agentctl to dump the KV Scheduler runtime state. For more information, see agentctl dump.