Dataflow, Perform Flow

In the last post, I presented some basic algorithms to operate on the dataflow dependency graph. I will assume now that you have both the flow graph, and the topological dependency list. (Recall the dependency list is a vector where each slot is a set of independent nodes. Understanding this is important.)

The next next algorithm we will discuss is perform flow. It takes a series of cells needing update, and follows along the flow graphs updating each, along with all cells which depend on them. The critical aspect of this algorithm is that is does no unnecessary work. No cell is visited twice. No cell is touched if a dependent did not change.

Here it is:

   A flow graph (fg)
   A topological dependency list (dl)
   A set of cells needing update (needed)
For each d in dl
   If needed is empty, break out, you’re done
   For each cell in the intersection of needed and d
      Recompute the cell, if it isn’t a source cell
      If the cell changed add all its neighbors
                        from the flow graph to needed.
      (Note: source cells always "changed" for this purpose)
      Remove cell from needed.

This iterates over the topological vector, visiting the cells at each level that need updating. If a cell is updated, those that immediately depend on it are added to the needed list (note, those new cells will never be in the current or an earlier strata of the dependency list, or else it was computed incorrectly). This continues until you run out of needed cells, or complete the dependency list (the latter should not occur if the first does not, although the algorithm does not check).

When you first establish your model, you will need to set all the source cells to their initial value, and the add all of them to the needed list, and run the algorithm. When you update a cell (or cells), set those that directly changed to the needed list, and also run the algorithm.

Next: I will discuss the Clojure implementation of all of this.


Dataflow, Algorithms

Let’s assume we have some cell data structures.  There need to be at least two types:

  1. Source Cells, which contains unevaluated data, directly provided by the client, and
  2. Evaluated Cells, which contain a function to evaluate the content, and a slot to hold the current content.

Depending on your implementation language, you’ll likely want to make these polymorphic, and generic if your language has compile time typing.  Also, you’ll need some sentinel value to indicate an unevaluated cell.  Null/nil/void is usually not a good choice, since it is possible for an evaluated cell to be legitimately null.  We will call this sentinel value “undefined.”

Here is a basic algorithm to evaluate a dataflow:

  1. All source cells are set to some initial value
  2. All evaluated cells are set to undefined.
  While some cell changes:
    For each cell that is currently undefined:
      If all cells on which the current cell depends are defined:
        Evaluate this cell.
  Check that all cells are now evaluated.
      (If any are not, there is a circularity in your data.)

This is an example of a fixed point algorithm; it repeatedly evaluates its data until there are no more changes.  Fixed point algorithms are nice because they are usually easy to understand.  Often, however, they are too slow to be useful.  This is one of those cases, particularly if you’ve only changed one or two source values.  We shall see how to speed this up.

We will need an additional data structure: a directed graph where the nodes are cells, and the edges represent dependencies.  (I hope the reader is reasonably familiar with graph algorithms.  If not, I recommend some time on Wikipedia.)

Actually, we will need two graphs, but they mirror each other.  One we will call the, “dependency graph,” the other the, “flow graph.”  In the dependency graph, the edges point from the evaluated cell to its sources, in the flow graph the edges point from the sources to the resulting cells.  So, for this data flow:

source-cell fred = 0
source-cell mary = 1
eval-cell sally = fred * mary
eval-cell joan = sally * fred

The dependency edges would be:

sally -> fred,
sally -> mary,
joan -> sally,
joan -> fred.

And the flow graph would be:

fred -> sally,
mary -> sally,
sally -> joan,
fred -> joan.

The dependency graph is easy to compute from the cell equations above, and the flow graph is simply the reversal of all the nodes.

Next we need to build a topological sort of the cells.  Actually, for reasons I’ll discuss in a later post, I’m going to build a somewhat more elaborate dependency list.  To do so we will need to annotate each cell with a priority, which are initialize to zero.

Here is the algorithm:

  Precondition: Each cell must have its priority set to zero.
  While no priorities change:
    For each cell c:
      Compute the maximum priority (m) of its dependent cells.
      Set c’s priority to m + 1.
  Return an array of sets, where the set at position i contains
  those cells whose priority is equal to i.

So, for our example flow, we’d have:

0 {fred, mary}

1 {sally}

2 {joan}

This algorithm is also fixed point, and in fact is very similar to the first algorithm listed above.  The difference is it need only be computed once for a set of cells.

Given a flow graph, and the dependency list, it is possible to propagate any changes to the dataflow quite efficiently.

Next: Performing the flow


Dataflow, Introduction

So, I think I’ll start this by talking about dataflow techniques in general, and also my Clojure implementation of the concept.

“Dataflow,” as a term gets used all over the place in Computer Science. In general, it refers to any algorithm that has:

  1. Data elements with non-trivial dependencies, and
  2. Some automated procedure to cascade updates.

 A concrete example will help.

Say you were working on an application to track your customer’s nutritional intake. The model would include their daily ranges of calories, fats, and other nutritional measures. These would be assigned by dietitians based on rather complex formulae based on the user’s height, weight, sex, and so on.

Alright, so the user would enter what she ate during the day. In a conventional application, your code when then loop over these meals, adding up each nutritional (in some sort of aggregator object), and then compare these values to daily limits. Simple, right?

Now, imagine there were changes to the application. Perhaps the formulae change. Perhaps new criteria are added. Specific clients might wish certain formulae to apply in certain limited cases. Over several years, the simple architecture would start to grow complex. It might reach the case that a simple question like, “What are the calorie limits for diabetics from customer X?” gets very difficult to answer just from inspecting the code. The idea that the dietitians could directly examine the code seem preposterous.

A dataflow model works like this. For each requirement you enter an equation, ideally using a domain specific language that hides away the complex code. Into this model you could plug the user’s meals, and the dataflow engine would “auto-magically” cascade the values across the model. You write no loops, and the code looks more-or-less like the equations that appear in the dietitian’s text books. Done correctly, they can directly examine the code to find errors. This makes the model highly tractable. This is priceless.

Next: some dataflow algorithms.



Log in