What the fook is a pipeline?

Published on: 2025-05-24

Introduction

One thing I've grown to love about managing a team of software engineers is that they call out the hand-wavey terms you use without thinking. This is not to say the terms are not well founded or even, arguably, used with a decent underpinning ethos in mind. It's more to say, be prepared to define something when you talk about it. Pipeline is a term that definitely fits this notion.

Some etymology

My favourite thing to do is to look at the Wikipedia disambiguation page for "pipeline", and there are some really nice links to definitions in there.

But this isn't etymology, though I can imagine the first of those goes some way towards defining the word. "Pipeline", from a typical etymology website, is interesting, offering the definition "channel of communication", derived from pipe + line. That's it: it's literally a flow of information in one direction, with no other defining characteristic.

This is funny. Think about it: you draw complex pictures from many lines, and (depending on style) you might use many lines just to draw one flowing, directional line.

The definition of a pipeline is, I feel, much more than the sum of its parts.

So what do I think a pipeline actually is!?!

A unix pipe. Put programs either side of it. Repeat.

Or

A pipe is a flow of information. Put some things either side and transfer information using that flow. Repeat.

Is that useful?

Well yes, but not so much as a definition that tells you how to design a pipeline. It's a useful way to think about how things interoperate. I like pipeline definition #1 simply because it's easily demonstrated in any *nix shell by chaining commands like this:

$ downloadmydata.sh | prepmydata.sh | savemydata.sh | runmyprocessing.sh

That works to demonstrate the idea. If the programs are doing more interesting things and processing each other's data, literally feeding one another resiliently, it becomes really powerful as a tool for breaking apart steps in an algorithm. As usual for me, selective parts of the unix philosophy offer a nice engineering approach to think about things: write programs that do one thing well and that use each other's outputs as inputs.
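As a minimal sketch of what "resiliently" can mean at the shell level (the *.sh stages are the same hypothetical placeholders as in the example above), the chain can be wrapped in a small script so that a failure in any stage stops the whole pipeline rather than quietly passing broken data downstream:

#!/usr/bin/env bash
# Hypothetical wrapper around the chain above; each stage is a placeholder script.
# pipefail makes the whole pipeline fail if any stage fails, not just the last one.
set -euo pipefail

downloadmydata.sh | prepmydata.sh | savemydata.sh | runmyprocessing.sh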

I think pipelines as a concept are even more important though, especially if you think about them beyond the shell and *nix environment. For example, in a machine learning system or a remote sensor network, having components connected by a sequential, well-understood flow is pivotal when deploying applications that started out as research or exploratory engineering efforts.

Processing and data flows

A useful intuition is to think about data transfer between processing engines. This gives a more general rationale for pipelines, and lets you step back from the implementation when you talk about them. In my day job, there are numerous examples where pipelines exist:

  • Remote renewable sensor networks: we wait for things to wake up; the things send data back to us; data processing chains that lead to insight or outcomes are executed.
  • Data management: acquisition processes deposit data in stores; processing starts generating datasets; these are annotated and published or processed downstream.
  • Machine learning: training datasets are composed from various sources; models are trained and evaluated; incremental training or inference datasets are generated as new data arrives for pretrained models; predictions are pushed out as that new data lands.
  • Procurement: a project starts; equipment is bought based on designs; it arrives and development and deployment builds are arranged; things are built; data processing pipelines are configured; deployments take place; data flows back.

The list goes on, and on, and on, and....

Though there are so many examples, they aren't necessarily well engineered everywhere. Well-engineered flows are what I've tried to exemplify, particularly through my work on developing ensembling workflows for environmental modelling and machine learning for operations in the polar environment. As such, I'd like to highlight two key characteristics that help communicate the benefits of thinking of these flows as data pipelines: configuration as data and the separation of concerns.

Configuration as data

For me the critical notion of a pipeline is that the data, the processing and the configuration controlling the processing of those data are kept separate, but all influence each other. This is a design imperative. If you have processing to deal with certain types of data, aside from being well engineered to handle anomalies, that processing should be controlled by independently managed configuration. The configuration needs to be the controller of the processing, so that it can be managed as a data source in its own right: one that scales, is well managed, and describes the operation of the processing and the data it is applied to.
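As a minimal sketch of the shape this takes (the file names, keys and values are hypothetical, purely for illustration), the configuration lives in its own file, managed and versioned separately from the code:

# pipeline.conf - hypothetical configuration, kept separately from the processing code
INPUT_DIR=/data/raw
OUTPUT_DIR=/data/processed
RESOLUTION=25km

A processing step then only ever reads that configuration, never hard-codes it:

#!/usr/bin/env bash
# Hypothetical processing step: the configuration is handed to it, never baked in.
set -euo pipefail
CONFIG=${1:-pipeline.conf}
source "$CONFIG"

prepmydata.sh "$INPUT_DIR" "$OUTPUT_DIR" "$RESOLUTION"

Because the configuration is just data, it can be diffed, reviewed and versioned like any other input, and swapping it out changes the behaviour of the pipeline without touching the processing itself.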

Separation of concerns

I like thinking about pipelines in a way that doesn't describe the execution infrastructure. To be fair, I have a really bad habit of leaning on simple implementations to describe pipelines, with software tooling and configuration bound together by something nice and portable like shell scripts.

This is not to say that I don't favour things like Airflow, Cylc, BPM suites or other workflow management systems. My intention is simply that when I implement pipelines in bash, connecting tooling together and feeding it with configuration to process data, it's all done in a way that means you can hop the set of actions between these WMS systems. That way, the only thing left to worry about is integrating a known workflow, built from portable steps, into a new graph-based execution system, hopefully without implementation changes!
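To make the pattern concrete, here's a minimal, hedged sketch (the step and argument names are placeholders, not a real workflow of mine): each step is a self-contained script that takes its configuration and data locations as arguments, and knows nothing about whatever is orchestrating it:

#!/usr/bin/env bash
# run_step.sh - hypothetical portable pipeline step.
# bash, Airflow or Cylc just invoke it; the step itself never changes.
set -euo pipefail

CONFIG=$1   # configuration controlling the processing
INPUT=$2    # data produced by the previous step
OUTPUT=$3   # location for this step's output

source "$CONFIG"
prepmydata.sh "$INPUT" "$OUTPUT"

$ run_step.sh pipeline.conf /data/raw /data/processed

The same invocation is what, for instance, an Airflow BashOperator or a Cylc task script would wrap, so moving between orchestrators becomes an integration exercise rather than a rewrite.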

Conclusions

The flow in one direction might seem like a limitation, but ultimately any fully integrated system still has to work around the principle of unidirectional information flow. If there is a real-world object and a decision-making entity independent of it, information must flow one way before the information provided by the entity flows back the other; each leg is still a unidirectional flow.

I'd be interested to understand from information theorists whether this flies in the face of cybernetics and feedback systems, but I don't think it does. What I describe is merely a way of handling that complexity, by splitting it into individual information exchanges. This then helps me to build more complex systems with well engineered wares!

Comprehending the simplicity of data flows is important, especially if you build them.

As I'm hoping to describe what I do to more people in the future, being able to describe these flows will matter even more. "Pipeline" is an unhelpful term if you can't really work out how to apply it in the digital domain.

The next rambling piece will be "what the flip is a..." something or other.

Please give me your thoughts below, I'd like to talk pipelines some more!


Comments

Please do leave a comment. I'm moderating them manually for the moment, and I'm finding the Isso project slightly experimental, but AMAZING nonetheless. I won't reset the comments database now though, so feedback will be valued!


Tags:

engineering  work  software