Datasports on Software Development

Articles and updates from Datasports about the craft of software

Tuple Identity and Lifetime


1. Introduction

The graphical nature of EventFlow applications, and the metaphor of streams of tuples flowing through operators and adapters, naturally give rise to a mental model of tuples that is not strictly correct and that can become a barrier to understanding how tuples are actually processed in the StreamBase platform.

This article provides a brief overview of tuple lifetime and identity that is meant to clarify understanding of what a tuple is, what it does, and for how long.

2. Discussion

2.1 Lifetime

It is natural to look at an EventFlow application like the following:

[Figure: Basic Stream]

And imagine a single tuple flying through this processing pipeline, entering at InParameters, being modified at CalcProduct, and being passed on out through OutResults, like a car going through a carwash. As it turns out, this is not correct. A tuple is not a mutable entity that travels through a set of operators and adapters. It is better thought of as an immutable package that bundles up parameters and associated metadata so they can be passed to downstream method calls.

In the above simple example, InParameters constructs a tuple and relays it to CalcProduct. On receipt of that tuple, CalcProduct creates a new tuple, evaluating the expressions in its configuration using fields in the source tuple, module parameters, dynamic variables, global parameters, built-in functions, etc. That new tuple is then relayed to OutResults.
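
To make this concrete, here is a minimal sketch in plain Python (not StreamBase code) of the same three steps. The field names `a`, `b` and `product`, the two schema classes, and the function names are all assumptions made for illustration; the article does not say what fields the real application uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InTuple:
    """Hypothetical input schema for InParameters (assumed fields)."""
    a: float
    b: float


@dataclass(frozen=True)
class OutTuple:
    """Hypothetical output schema for CalcProduct (assumed fields)."""
    a: float
    b: float
    product: float


def calc_product(t: InTuple) -> OutTuple:
    # Build a brand-new tuple from the fields of the source tuple.
    # The source tuple is only read, never modified.
    return OutTuple(a=t.a, b=t.b, product=t.a * t.b)


def out_results(t: OutTuple) -> None:
    # Stand-in for the output stream: just show what arrived.
    print(t)


source = InTuple(a=3.0, b=4.0)   # constructed by "InParameters"
result = calc_product(source)    # a new tuple, handed downstream
out_results(result)
```

Because both classes are frozen, an attempt such as `source.a = 1.0` raises an error, which mirrors the idea that a tuple is a package of values rather than something an operator edits in place.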

That this must be so becomes clearer when we consider a slightly more complicated application:

[Figure: More Sophisticated Stream]

Items are added to the qtItems Query Table via InStoreItem and StoreItem. InGetItem requests a single item by ItemID. If the item is found, it is sent via OutItemResults; otherwise, a tuple indicating failure is sent via OutEmptyResults. InDumpItems sends out all currently stored items via OutItemResults, or a tuple indicating failure via OutEmptyResults if there are no items currently stored.

For the InStoreItem path and the InGetItem path, we can imagine a single tuple traveling through the processing logic, as every operator in those paths emits at most one tuple for each input tuple. However, for the InDumpItems path, a single input tuple hitting DumpItems may result in an arbitrarily large number of output tuples. If we were to continue to think of a single tuple traversing this processing logic, we would have to imagine the tuples coming out of DumpItems as somehow being constituent parts of the tuple that was passed in, which really doesn’t make sense.
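
The fan-out on the InDumpItems path can be sketched in plain Python as follows. This is only an illustration of the one-in, many-out behavior, not StreamBase code: the `Item` fields, the `DumpRequest` and `EmptyResult` classes, and the dictionary standing in for the qtItems Query Table are all assumptions.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Item:
    """Hypothetical row stored in the query table (assumed fields)."""
    item_id: int
    name: str


@dataclass(frozen=True)
class DumpRequest:
    """Hypothetical trigger tuple arriving on InDumpItems; carries no data."""


@dataclass(frozen=True)
class EmptyResult:
    """Hypothetical failure tuple sent when nothing is stored."""
    reason: str


def dump_items(
    request: DumpRequest, qt_items: Dict[int, Item]
) -> Tuple[List[Item], List[EmptyResult]]:
    # One trigger tuple in; zero or more freshly built tuples out.
    # Nothing in the outputs refers back to the request that caused them.
    items = [Item(row.item_id, row.name) for row in qt_items.values()]
    empties = [] if items else [EmptyResult(reason="no items stored")]
    return items, empties


table = {1: Item(1, "apple"), 2: Item(2, "pear"), 3: Item(3, "plum")}
item_results, empty_results = dump_items(DumpRequest(), table)
print(len(item_results), len(empty_results))

no_items, failures = dump_items(DumpRequest(), {})
print(len(no_items), len(failures))
```

A single `DumpRequest` yields three `Item` tuples in the first call and one `EmptyResult` in the second, so there is no sensible way to call any of those outputs "the same tuple" as the input.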

As EventFlow logic gets more complicated, the metaphor of a tuple flowing through multiple operators and adapters gets more strained. It is still a useful abstraction, but one must be careful how it’s applied.

2.2 Identity

An understanding of the management of tuple identity follows logically from the above discussion of tuple lifetime. As a tuple only exists to pass a set of values from one component to another, there is no requirement to have a notion of the identity of a tuple.

When one tuple hitting DumpItems generates N output tuples, nothing is done to associate the output tuples with the input tuple that triggered their creation.

2.3 References

Following up on some comments made on the article Thoughts on Encapsulation Schemes, Part 2, Steve Barber and I had an email exchange which shed a lot of light on some of these interrelated issues for me. I am including some excerpts here as a bit of back story, and because I expect that others may find it elucidating.

Phil Martin to Steve Barber:

  1. What constitutes “processing to completion”? In a single-threaded application, if the processing of a tuple (call it the parent) gives rise to one or more new tuples (call them children, say via Iterate or Query operators), and those child tuples get queued (via blue arcs or entry to referenced modules), is the parent tuple considered to have been processed to completion at the point when all child tuples have been queued, or only when each child tuple has completed or hit a thread boundary?
  2. Is the identity of a tuple tracked through the processing somehow, or is some other mechanism used to determine when something has been processed to completion?
    1. When a parent tuple gives rise to child tuples, are those children somehow associated with the parent?
    2. If there is a notion of tuple identity, is that identity somehow preserved when passing through Operators like the Map, where the content of an output tuple may have nothing at all in common with the input tuple that triggered its processing? What about in custom operators?
    3. The more I think about this, the more I think that there must not be a notion of tuple identity, and that the “some other mechanism” has to relate to the call stack of the compiled EventFlow code. Is that correct?

Readers who have made a thorough study of StreamBase Execution Order and Concurrency will recognize some RTFM-failure on my part. However, I suspect that I’m not the only person doing StreamBase development with an understanding of these issues that has grown organically through trial, error, superstition, assumption, and occasional skimming of documentation. Taking a bit of time to challenge assumptions and think analytically is something from which we all benefit.

Anyhow, on to some key excerpts from Mr. Barber’s reply:

Steve Barber to Phil Martin:

In the case of the Execution Order and Concurrency Page, it is worth reading through several times and thinking about slowly, for anyone that wants to write EventFlow applications that are more than trivial. Read the page, though, only from 7.1.3 and beyond. Earlier ones are missing a lot of rules about concurrency. There are also some nice pictures to help.

  1. Processing a tuple “to completion” (covered in Rule 1) means “as far downstream as it can go before an operator either does not send a tuple out of any output port in response to receiving that tuple on an input port or is queued.” A tuple can be queued at the input to a parallel region — which is some operator or module reference that is marked as concurrent (Rule 8, especially when read together with Rule 7.) — or queued or discarded at an Output Stream. Note that module calls do not by themselves introduce a parallel region into the processing path and there is no queueing going on at module boundaries by default (Rule 4) — operators in submodules execute synchronously with the calling module unless the module reference itself is marked as concurrent. In the case of loop queue, that’s a little special (see Rule 5) — the queue is fully drained before processing of the tuple that caused the queuing is considered to have completed. To say it another way, a loop queue does not create a parallel region, and doesn’t by itself introduce a thread re-scheduling opportunity. For operators that can emit multiple tuples in response to a single input tuple, all those “child” tuples are processed to completion before another “parent” tuple is processed (see Rule 3).
  2. Identity of a tuple is not maintained as processing flows from operator to operator. In fact, even for operators as simple as a “pass thru” (no transformations) Map, the output tuple is distinct from the input tuple, even where the input and output schemas are “the same.” There’s no tracking from “parent” to “child” in any way (I’m quoting these terms because we don’t define them in our own rules; I’m just using your terminology!). The processing rules are enforced by the runtime, which is to say, yes, you are right, it might be helpful to think of each operator being a method call on a call stack and tuples being passed by value through the call stack. That’s not exactly what happens all the time, but as a mental model it’s not bad.
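
The call-stack mental model from this reply can be sketched in plain Python. As Mr. Barber notes, this is not exactly what the runtime does, so treat it as an illustration of the model only. The `Row` schema, the operator functions, and the `trace` list are all hypothetical names invented for this sketch, and no concurrency or queueing is modeled.

```python
from typing import Callable, List, NamedTuple


class Row(NamedTuple):
    """Hypothetical one-field schema used only for this sketch."""
    value: int


Downstream = Callable[[Row], None]
trace: List[str] = []


def output_stream(t: Row) -> None:
    # End of the synchronous path for this tuple.
    trace.append(f"out {t.value}")


def pass_through_map(t: Row, downstream: Downstream) -> None:
    # Same schema in and out, but a new tuple is still constructed.
    downstream(Row(value=t.value))


def iterate(t: Row, downstream: Downstream) -> None:
    # Emits two tuples per input. Each downstream call returns only
    # after that emitted tuple has gone as far as it can go.
    for n in range(2):
        downstream(Row(value=t.value * 10 + n))


def process(t: Row) -> None:
    trace.append(f"enter {t.value}")
    iterate(t, lambda child: pass_through_map(child, output_stream))
    trace.append(f"exit {t.value}")


for parent in (Row(1), Row(2)):
    process(parent)
print(trace)

# A pass-through produces an equal but distinct tuple object.
original = Row(value=5)
copies: List[Row] = []
pass_through_map(original, copies.append)
print(copies[0] == original, copies[0] is original)  # True False
```

In the trace, both tuples produced from the first input reach the output before the second input is even entered, which is the ordering described for “child” tuples above. The last line shows the pass-through case: the values match, but the output is a different object from the input.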

3. Conclusion

The discussion of tuple identity and lifetime can be made short and sweet: Tuples aren’t identified or associated, and they don’t live long. This may seem like a simple point not worthy of an article, but I hope that it brings some clarity to your understanding that isn’t obvious from examining (and even writing and debugging) EventFlow applications on the canvas.

If you have found this article interesting or helpful, please take some time to post your comments, questions, or suggestions on this blog.

Thanks,
Phil Martin
Datasports Inc.

Written by datasports

Oct 10, 2011 at 5:40 PM

Posted in Tutorials
