Previous Entry Share Next Entry
Dataflow, Algorithms

Let’s assume we have some cell data structures.  There need to be at least two types:

  1. Source Cells, which contains unevaluated data, directly provided by the client, and
  2. Evaluated Cells, which contain a function to evaluate the content, and a slot to hold the current content.

Depending on your implementation language, you’ll likely want to make these polymorphic, and generic if your language has compile time typing.  Also, you’ll need some sentinel value to indicate an unevaluated cell.  Null/nil/void is usually not a good choice, since it is possible for an evaluated cell to be legitimately null.  We will call this sentinel value “undefined.”

Here is a basic algorithm to evaluate a dataflow:

  1. All source cells are set to some initial value
  2. All evaluated cells are set to undefined.
  While some cell changes:
    For each cell that is currently undefined:
      If all cells on which the current cell depends are defined:
        Evaluate this cell.
  Check that all cells are now evaluated.
      (If any are not, there is a circularity in your data.)

This is an example of a fixed point algorithm; it repeatedly evaluates its data until there are no more changes.  Fixed point algorithms are nice because they are usually easy to understand.  Often, however, they are too slow to be useful.  This is one of those cases, particularly if you’ve only changed one or two source values.  We shall see how to speed this up.

We will need an additional data structure: a directed graph where the nodes are cells, and the edges represent dependencies.  (I hope the reader is reasonably familiar with graph algorithms.  If not, I recommend some time on Wikipedia.)

Actually, we will need two graphs, but they mirror each other.  One we will call the, “dependency graph,” the other the, “flow graph.”  In the dependency graph, the edges point from the evaluated cell to its sources, in the flow graph the edges point from the sources to the resulting cells.  So, for this data flow:

source-cell fred = 0
source-cell mary = 1
eval-cell sally = fred * mary
eval-cell joan = sally * fred

The dependency edges would be:

sally -> fred,
sally -> mary,
joan -> sally,
joan -> fred.

And the flow graph would be:

fred -> sally,
mary -> sally,
sally -> joan,
fred -> joan.

The dependency graph is easy to compute from the cell equations above, and the flow graph is simply the reversal of all the nodes.

Next we need to build a topological sort of the cells.  Actually, for reasons I’ll discuss in a later post, I’m going to build a somewhat more elaborate dependency list.  To do so we will need to annotate each cell with a priority, which are initialize to zero.

Here is the algorithm:

  Precondition: Each cell must have its priority set to zero.
  While no priorities change:
    For each cell c:
      Compute the maximum priority (m) of its dependent cells.
      Set c’s priority to m + 1.
  Return an array of sets, where the set at position i contains
  those cells whose priority is equal to i.

So, for our example flow, we’d have:

0 {fred, mary}

1 {sally}

2 {joan}

This algorithm is also fixed point, and in fact is very similar to the first algorithm listed above.  The difference is it need only be computed once for a set of cells.

Given a flow graph, and the dependency list, it is possible to propagate any changes to the dataflow quite efficiently.

Next: Performing the flow



Log in

No account? Create an account