Workflows generation: retrospective and performance analysis

One of Anagolay's strong points is its ability to generate code by combining Operations into a Workflow, given a manifest written in plain JSON.

This task is performed by the publish service through a template that, once fed with the Workflow manifest and all the required Operation information along with their Versions, produces Rust code that can be called natively or from WASM and is preferably no-std (provided all the Operations involved allow it). This matters because it makes the Workflow very versatile:

- Rust is a widespread systems language; if the Workflow is performant and supports no-std, it can also be executed on a Rust-based blockchain, such as one built with Substrate.
- WASM is a standard binary instruction format that runs in the browser and under Node.js, and can potentially be integrated with other environments as well.

In the following paragraphs, I will go through the different phases of implementation of the Workflow generation, outlining the difficulties I faced and the search for performance improvements. I believe my experience could be useful for many (or at least interesting for some).

How it all started

A few months ago I got my hands on a POC of some Operations written in Rust and compiled to WASM: the initial form of what would become automated code generation and execution, a.k.a. a Workflow.

The test scenario for performance was producing a CID v1 out of some bytes, using op_multihash with the Blake3 encoder, followed by op_cid.
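To make the scenario concrete, here is a minimal sketch of the binary layout of a CID v1. It is not the code of op_multihash or op_cid: the function names are hypothetical, the digest is assumed to be computed elsewhere, and the use of the raw content codec is an assumption.

```rust
/// Multicodec code for raw binary content (assumed here as the content codec).
const RAW_CODEC: u8 = 0x55;
/// Multihash code for BLAKE3.
const BLAKE3_CODE: u8 = 0x1e;

/// Wraps an already computed 32-byte digest into a multihash:
/// <hash function code><digest length><digest>.
fn multihash(digest: &[u8; 32]) -> Vec<u8> {
    let mut out = vec![BLAKE3_CODE, digest.len() as u8];
    out.extend_from_slice(digest);
    out
}

/// Builds the binary form of a CID v1: <version><content codec><multihash>.
/// Every code used here is below 0x80, so each varint fits in one byte.
fn cid_v1(multihash: &[u8]) -> Vec<u8> {
    let mut out = vec![0x01, RAW_CODEC];
    out.extend_from_slice(multihash);
    out
}
```

The textual form of the CID is then obtained by encoding these bytes with a multibase encoding, which is left out of the sketch.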

The initial idea, which we have respected so far, is that both Operations and Workflows should have the following characteristics:

- Have a native (Rust) and a WASM interface
- Be capable, where possible, of working in both std and no-std environments
- Have a similar structure, from their manifest to their implementation (which, of course, may vary)
- Be asynchronous and capable of high throughput
- An exceptional category of Operations is represented by those that accept an indefinite number of inputs of any type and produce one output out of them (the FLOWCONTROL group)

Implementing all the above criteria appeared to be a real challenge, pushing the limit of what the Rust language allows, but let’s start from the beginning.

Crossing the WASM boundary

The first problem I focused on was sharing objects between the Javascript interpreter memory and the WASM memory. Once a WASM function returns, the objects it created locally are dropped, so WASM cannot return references to them: such references would be invalid by the time Javascript tries to access them. It is possible, though, to keep those references alive if they belong to a “registry” object that has a WASM-bound implementation and is instantiated in the Javascript scope, as in the Rust WASM tutorial. On the other hand, a common approach is to deserialize the input of the WASM functions and serialize their output, since, aside from the cost of the (de)serialization itself, the transfer across the boundary is very performant. So these were my considerations:

Registry approach:
- PROS: no performance loss due to serialization since it’s possible to return references to WASM memory
- CONS: the Javascript code looks cumbersome since everything must pass through this registry. All Operations have a dependency on this central entry point, which complicates the compilation to WASM. This is what I thought it would look like to invoke an Operation:
```
let opOutput = Anagolay.runOperation('op_cid', inputs, config)
```
Serialization/Deserialization, the chosen approach:
- PROS: the Javascript code is sleek and every Operation is completely independent of the others
```
let opOutput = op_cid(inputs, config)
```
- CONS: method calls carried a huge overhead, which depended on the performance of the serde-serialize feature of wasm-bindgen, which serializes to JSON.
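The difference between the two approaches can be sketched in plain Rust, without any WASM binding. All names below (`Registry`, `store`, `run`, `reverse`) are hypothetical and only illustrate the shape of each design: in the first, the caller holds opaque handles and everything passes through a central object; in the second, an Operation is a free function from bytes to bytes.

```rust
use std::collections::HashMap;

/// Hypothetical registry: it owns every value and hands out integer handles,
/// so values never leave the memory of the module that created them.
struct Registry {
    next_handle: u32,
    values: HashMap<u32, Vec<u8>>,
}

impl Registry {
    fn new() -> Self {
        Registry { next_handle: 0, values: HashMap::new() }
    }

    /// Stores a value and returns the handle the caller keeps instead of the value.
    fn store(&mut self, value: Vec<u8>) -> u32 {
        let handle = self.next_handle;
        self.next_handle += 1;
        self.values.insert(handle, value);
        handle
    }

    /// Runs an operation on a stored value and stores its result.
    /// Returns None when the input handle is unknown.
    fn run(&mut self, operation: fn(&[u8]) -> Vec<u8>, input: u32) -> Option<u32> {
        let output = operation(self.values.get(&input)?);
        Some(self.store(output))
    }

    /// Gives read access to a stored value.
    fn get(&self, handle: u32) -> Option<&[u8]> {
        self.values.get(&handle).map(Vec::as_slice)
    }
}

/// Serialization approach: a free function from bytes to bytes,
/// independent of any registry and of any other operation.
fn reverse(input: &[u8]) -> Vec<u8> {
    input.iter().rev().copied().collect()
}
```

With the registry, every call site depends on the central object; with the free function, the only cost is moving the bytes in and out.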

Introducing near-to-memory serialization

Having also quickly tested [serde-wasm-bindgen](https://crates.io/crates/serde-wasm-bindgen) without obtaining good enough results for bytes input (oh, I will come back to this…), I proceeded to evaluate the fastest (de)serializer I could find, one that almost copies and restores the content of the memory: among other choices, [bincode](https://crates.io/crates/bincode).

Uint8Array can cross the WASM boundary with little overhead; therefore we required an Operation to produce an OperationOutput that provides two methods:

- as_input(), to be called whenever the output must be passed as input to another Operation, since it returns the output itself as a Uint8Array. This call is fast and occurs often, from one Operation to the next.
- decode(), to be called to access the actual value in Javascript (a WASM-bound object or a primitive type). This call has a performance impact and is intended only for the final result.

This approach turned out to be one order of magnitude faster than serializing to JSON but introduced a lot of complications due to this OperationOutput dual representation.
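The dual representation can be sketched in plain Rust as follows. This is a simplified stand-in, not the actual trait: `Vec<u8>` replaces `js_sys::Uint8Array`, and a hand-written length-prefixed encoding (hypothetical, only similar in spirit to what bincode produces) replaces the real serializer. `StringOutput` and `string_from_input` are illustrative names.

```rust
/// Simplified stand-in for the dual representation of an Operation output.
trait OperationOutput<T> {
    /// Cheap byte form, handed directly to the next Operation.
    fn as_input(&self) -> Vec<u8>;
    /// Actual value, meant to be read only for the final result.
    fn decode(&self) -> T;
}

struct StringOutput {
    state: String,
}

impl OperationOutput<String> for StringOutput {
    fn as_input(&self) -> Vec<u8> {
        // Little-endian u64 length followed by the UTF-8 payload.
        let payload = self.state.as_bytes();
        let mut bytes = (payload.len() as u64).to_le_bytes().to_vec();
        bytes.extend_from_slice(payload);
        bytes
    }

    fn decode(&self) -> String {
        self.state.clone()
    }
}

/// What the next Operation would do with the bytes it receives.
/// Returns None if the length prefix or the UTF-8 payload is invalid.
fn string_from_input(bytes: &[u8]) -> Option<String> {
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(bytes.get(..8)?);
    let payload = bytes.get(8..)?;
    if payload.len() as u64 != u64::from_le_bytes(len_bytes) {
        return None;
    }
    String::from_utf8(payload.to_vec()).ok()
}
```

The byte form travels between Operations, while `decode()` is only paid for once, at the end of the chain.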

It was necessary to have this definition as a trait, but traits don’t get along well with wasm-bindgen. However, through a derive macro, it’s possible to have the WASM-bound struct implement a trait (only from the Rust point of view):

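```
// Imports assumed from context: the snippet did not show them.
use proc_macro::TokenStream;
use quote::quote;
use syn::{Ident, ItemStruct};

// Re-emits the annotated struct and implements OperationOutput for it by
// delegating to the inherent as_input() and decode() methods of the struct.
pub fn impl_operation_return_trait(return_type: &Ident, struct_item: &ItemStruct) -> TokenStream {
    let name = &struct_item.ident;
    let gen = quote! {
        #struct_item

        // omitted downcast implementation

        impl an_operation_support::operation::OperationOutput<#return_type> for #name {
            fn as_input(&self) -> js_sys::Uint8Array {
                <#name>::as_input(self)
            }
            fn decode(&self) -> #return_type {
                <#name>::decode(self)
            }
        }
    };
    gen.into()
}
```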

The macro implementation made sure that no compilation was possible without the required methods, and that they were implemented coherently:

```
#[operation_return_type(String)]
pub struct MyOperationOutput {
    state: String
}

impl MyOperationOutput {
    pub fn as_input(&self) -> js_sys::Uint8Array {
        // omitted serialization code
    }
    pub fn decode(&self) -> String {
        self.state.to_string()
    }
}
```

Complexity was growing, but it was an acceptable tradeoff for such good performance: using this approach, we could grind 16KB of data and get its CID in an astonishing 0.27ms from the WASM interface.

Figure: original Workflow benchmark

A side problem: manifest generation

Since one of the objectives is to have all Operations structured similarly, their manifest is generated directly from the code, through a derive macro applied to the execute() function. A problem arose because this function returned a boxed dynamic implementation of the OperationOutput trait, generic over the real return type of the Operation.

```
#[describe([
    groups = [
      "SYS",
    ],
    config = []
])]
pub async fn execute(
    state: &String,
    _: BTreeMap<String, String>,
) -> Result<Box<dyn an_operation_support::operation::OperationOutput<String>>, String> {
    Ok(Box::new(MyOperationOutput { state: state.to_string() }))
}
```

At that time, in the WASM-bound execute function, this dynamic trait needed to be downcast to the actual MyOperationOutput implementation, which was WASM-bound and returned to Javascript:

```
#[wasm_bindgen(js_name=execute)]
pub async fn wasm_execute(
    operation_inputs: Vec<Uint8Array>,
    config: Map,
) -> Result<MyOperationOutput, JsValue> {
    // omitted code to deserialize input and config map
    execute(&deserialized_input, deserialized_config)
        .await
        .map(|operation_output| MyOperationOutput::downcast(operation_output))
        .map_err(|error| JsValue::from_str(&error.to_string()))
}
```

This magic can be done with some unsafe code from the OperationOutput trait implementation (previously omitted in the impl_operation_return_trait() snippet):

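```
impl #name {
    pub fn downcast(trait_impl: Box<dyn an_operation_support::operation::OperationOutput<#return_type>>) -> #name {
        let raw = Box::into_raw(trait_impl) as *mut #name;
        // SAFETY: the caller guarantees that the boxed trait object really is a #name.
        // Rebuilding the Box moves the value out and frees the allocation,
        // whereas raw.read() would copy the value and leak the heap memory.
        unsafe { *Box::from_raw(raw) }
    }
}
```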
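For comparison, the same downcast can be written without unsafe code by going through `std::any::Any`. This is an alternative sketch, not the approach used at the time: the trait below is a simplified stand-in, and the `into_any()` method and the `OtherOutput` type are additions made for illustration.

```rust
use std::any::Any;

/// Simplified stand-in for the OperationOutput trait.
trait OperationOutput<T>: Any {
    fn decode(&self) -> T;
    /// Converts the boxed trait object into Box<dyn Any> so that it can be downcast.
    fn into_any(self: Box<Self>) -> Box<dyn Any>;
}

struct MyOperationOutput {
    state: String,
}

impl OperationOutput<String> for MyOperationOutput {
    fn decode(&self) -> String {
        self.state.clone()
    }
    fn into_any(self: Box<Self>) -> Box<dyn Any> {
        self
    }
}

/// A second implementation, used only to show a failed downcast.
struct OtherOutput;

impl OperationOutput<String> for OtherOutput {
    fn decode(&self) -> String {
        String::new()
    }
    fn into_any(self: Box<Self>) -> Box<dyn Any> {
        self
    }
}

/// Returns the concrete output, or None if the trait object has another type.
fn downcast(output: Box<dyn OperationOutput<String>>) -> Option<MyOperationOutput> {
    output
        .into_any()
        .downcast::<MyOperationOutput>()
        .ok()
        .map(|boxed| *boxed)
}
```

The price is a runtime type check and an extra method on the trait; in exchange, a mismatched type yields `None` instead of undefined behavior.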

Hitting the wall with flow-control Operations

Even with all the boilerplate code and the explicit type declarations, all was working well as long as it was possible to know the cardinality and the type of the inputs and output of the Operation. This is not the case when we need to provide an Operation that collects several results into one array, or, even more naively, outputs the identity. These are possible behaviors for an Operation of the FLOWCONTROL group, called op_collect, which provides no input or output in its manifest (as it’s not known in advance) but has them specified in the Workflow manifest when op_collect is used in conjunction with other operations.

Whenever I looked at the problem, I found it impossible to declare or implement the OperationOutput trait for the output of op_collect, and difficult to get around Rust's strict type checking to provide a variable cardinality of inputs, having unknown types.

In the end, the Rust side was the easiest to implement, since in this category of Operations the execute() function is replaced by an execute!() macro and the Operation manifest is generated by a derive macro out of the latter. But implementing the WASM side required completely reconsidering the direction I had taken.
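To see why a macro helps here, consider this minimal sketch. The `collect!` macro below is hypothetical and is not Anagolay's `execute!()`: it only shows that a declarative macro can accept any number of inputs of any type, something a plain function signature cannot express.

```rust
/// Hypothetical macro: gathers any number of heterogeneous inputs into one tuple.
/// The arity and the types are resolved at each call site, at compile time.
macro_rules! collect {
    ($($input:expr),* $(,)?) => {
        ($($input,)*)
    };
}
```

For example, `collect!(1u8, "two", vec![3u8])` expands to the tuple `(1u8, "two", vec![3u8],)`, whose type is inferred by the compiler where the macro is invoked.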

The dual input/decode nature of OperationOutput became a problem, since there is no information about its decoded type in the WASM-bound execute() function of op_collect. Therefore I had to get rid of the structure completely, unwrapping the type it was wrapping and passing all values to Javascript as a serialization of themselves, and not as bytes.

This way, the WASM implementation of a FLOWCONTROL Operation can blindly deal with JsValue inputs and produce a JsValue output, ignoring which objects it is actually manipulating. The only way to do so was to revert to serde_wasm_bindgen. The code was greatly simplified, and I was much more satisfied with its readability and architecture:

```
#[describe([
    groups = [
      "SYS",
    ],
    config = []
])]
pub async fn execute(
    state: &String,
    _: BTreeMap<String, String>,
) -> Result<String, String> {
    Ok(state.to_string())
}

#[wasm_bindgen(js_name=execute)]
pub async fn wasm_execute(
    operation_inputs: Vec<JsValue>,
    config: Map
) -> Result<JsValue, JsValue> {
    // omitted code to deserialize input and config map
    let output = execute(&input, config).await?;
    // The ? converts a serialization error into a JsValue; the success value
    // still has to be wrapped in Ok to match the return type.
    Ok(serde_wasm_bindgen::to_value(&output)?)
}
```

However, as had happened before, this approach (above in the graphs) proved to be much less performant than the bincode implementation (below in the graphs). The loss became significant as the size of the input increased:

Figure: benchmark with 64KB of data

Figure: benchmark with 800KB of data

This actually had to do with the way an array of bytes is treated by the serializer. Unlike bincode decoding, **serde-wasm-bindgen requires ownership of the value it’s deserializing**, which in turn means that the whole array was cloned just to be transformed from a JsValue into a Vec<u8> (aliased Bytes). But this can be done much more efficiently, by only borrowing the JsValue for the cast to Uint8Array:

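```
use js_sys::Uint8Array;
use wasm_bindgen::{JsCast, JsValue};

pub fn from_bytes(operation_input: &JsValue) -> Result<Bytes, JsValue> {
    // dyn_ref() only borrows the JsValue; to_vec() then copies the bytes once
    let cast: Option<&Uint8Array> = operation_input.dyn_ref();
    match cast {
        Some(array) => Ok(array.to_vec()),
        None => Err(JsValue::from_str(
            "Expected a JsValue that could be cast to Uint8Array",
        )),
    }
}
```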

The Workflow template

Particular attention was given to writing the Handlebars template that generates the Workflow. It proved beneficial to have a slightly more complicated template, capable of recognizing several different cases, rather than a simpler one that is more readable but produces less performant code.

Calling in sequence the Operations that belong to the same Segment avoids crossing the WASM boundary; together with the memoisation of Segment execution results for later use, this is much more performant and user-friendly than calling each Operation manually from Javascript.

Both the WASM-bound structure and the asynchronous trait implementation expose only two methods: the constructor and next(), which advances the Workflow until either some external input is required or the Workflow terminates.
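The constructor/next() shape, together with the memoisation of Segment results, can be sketched in plain synchronous Rust. This is not the generated code: `Segment`, `Progress`, `output()` and the sample operations are hypothetical, and the real Workflow is asynchronous and works on typed values rather than raw bytes.

```rust
/// An operation, modelled here as a plain function from bytes to bytes.
type Operation = fn(&[u8]) -> Vec<u8>;

/// A group of operations executed back to back, without leaving the module.
struct Segment {
    operations: Vec<Operation>,
    needs_external_input: bool,
}

#[derive(Debug, PartialEq)]
enum Progress {
    /// The next segment cannot run until the caller provides its input.
    NeedsInput,
    /// Every segment has been executed.
    Done,
}

struct Workflow {
    segments: Vec<Segment>,
    /// Memoised output of each executed segment, in execution order.
    results: Vec<Vec<u8>>,
}

impl Workflow {
    fn new(segments: Vec<Segment>) -> Self {
        Workflow { segments, results: Vec::new() }
    }

    /// Advances until external input is missing or all segments have run.
    fn next(&mut self, mut external: Option<Vec<u8>>) -> Progress {
        while self.results.len() < self.segments.len() {
            let segment = &self.segments[self.results.len()];
            let mut value = if segment.needs_external_input {
                match external.take() {
                    Some(bytes) => bytes,
                    None => return Progress::NeedsInput,
                }
            } else {
                // Start from the memoised output of the previous segment.
                self.results.last().cloned().unwrap_or_default()
            };
            for operation in &segment.operations {
                value = operation(&value);
            }
            self.results.push(value);
        }
        Progress::Done
    }

    /// Output of the last executed segment, if any.
    fn output(&self) -> Option<&Vec<u8>> {
        self.results.last()
    }
}

/// Sample operations used to exercise the sketch.
fn increment(input: &[u8]) -> Vec<u8> {
    input.iter().map(|byte| byte.wrapping_add(1)).collect()
}

fn double(input: &[u8]) -> Vec<u8> {
    input.iter().map(|byte| byte.wrapping_mul(2)).collect()
}
```

The caller keeps invoking `next()` and supplies input only when asked for it, so the boundary is crossed once per Segment rather than once per Operation.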

The generated code is meticulously tuned for performance, from the explicit Bytes deserialization described before to the fact that only the last Segment of the WASM-bound Workflow spends time serializing the Workflow output; all these practices follow from the considerations made previously.

Wherever possible, and as extensively as possible, values are wrapped into **Rc to reuse references** instead of calling clone() or to_owned() on them. The native code pushes this even further by also returning a reference in the Workflow output, and it’s up to the caller to decide whether to take ownership or not.
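A minimal sketch of this idea, using a hypothetical `Memo` type rather than the generated code: handles to a memoised output are shared through `Rc`, and the bytes are copied only if the caller asks for ownership while other handles still exist.

```rust
use std::rc::Rc;

/// Hypothetical memoised output of a segment, shared with later segments.
struct Memo {
    output: Rc<Vec<u8>>,
}

impl Memo {
    fn new(bytes: Vec<u8>) -> Self {
        Memo { output: Rc::new(bytes) }
    }

    /// Hands out another handle to the same allocation.
    /// Only the reference count changes; the bytes are not copied.
    fn share(&self) -> Rc<Vec<u8>> {
        Rc::clone(&self.output)
    }

    /// Lets the caller take ownership: the bytes are moved out when this is
    /// the only handle left, and cloned otherwise.
    fn into_owned(self) -> Vec<u8> {
        Rc::try_unwrap(self.output).unwrap_or_else(|shared| (*shared).clone())
    }
}
```

Every `share()` is constant time regardless of the size of the data, which is what makes this preferable to `clone()` on large inputs.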

Together, the optimizations in this latest Workflow (below in the graphs) allowed it to perform even slightly better than the original bincode Workflow (above in the graphs):

Figure: original and improved Workflow benchmark

Conclusion

To sum up, this search for performance will make the difference when calling the WASM interface with even larger amounts of input data in near real time. Moreover, the possibility of calling the same code from the native Rust interface greatly increases the versatility of code written in the form of an Anagolay Workflow. If you want to learn more about this implementation and Anagolay Workflows, join our Discord server or follow our updates on Twitter.

14.07.2022
