Why We Wrote Zephyr

Since releasing Zephyr, we’ve been asked by numerous people why we wrote Zephyr instead of sticking to OSGi. Our goal was pretty simple: create an extensible system suitable for SaaS or on-prem.  We looked in our toolbox and knew that we could do this using OSGi, Java, and Spring, and so that’s how it started.

How We Started

First, we wrote our extensible distributed graph reduction machine: Gyre.  This allowed us to describe computations as graphs. It generated a maximally parallel schedule, did its best to decide whether to ship (a) a computation to the data, (b) the data to the computation, or (c) both to an underutilized node, and then executed the schedule.

Then we wrote Anvil, our general-purpose optimization engine that efficiently solved linear and non-linear optimization problems. These problems were described as Gyre graphs (including the problem of how Gyre itself could better execute tasks based on its internal metrics). We deployed Anvil and Gyre together as bundles into an OSGi runtime.  Obviously, Anvil couldn’t operate without Gyre, and so we referenced Gyre services in Anvil.  But Anvil and Gyre themselves were extensible.  We wrote additional solvers and dynamically installed them into Anvil, or wrote different concurrency/distribution/serialization strategies and deployed them into Gyre, and gradually added more and more references.

Then we wrote Troposphere, our deployment engine. Troposphere would execute its tasks on Gyre, and Anvil would optimize them. Troposphere would define types of tasks, and we exported them as requirements to be satisfied by capabilities. (For example, Troposphere would define a “discovery” task as a requirement, and an AWS EC2 plugin would provide the capability that fulfilled it.)

Handling OSGi with Spring

Being a small team, we pretty much only used one actual framework (Spring), so we deployed yet another bundle containing only the Spring classpath, to be depended on by any bundle that required it.  We initially used bnd to generate our package import/export statements in our manifest, and pulled in the bnd Gradle plugin as part of the build, but the reality was that if a plugin depended on Troposphere, then it pretty much always depended on Gyre, Anvil, and Spring.

If Anvil contains a service-reference to Gyre, and Troposphere contains one to Anvil, you get the correct start-order.  But if you stop Gyre while Troposphere is running?  Well, that’s a stale reference, and Troposphere needs to handle it, which means refactoring Troposphere and Gyre to use service factories, prototype service factories, or whatever else.

But we just wanted to write Spring and Java.  To really use Spring in an OSGi-friendly way, you have to use Blueprint, and now you’re back to writing XML in addition to all of the OSGi-y things you’re doing in your code. The point isn’t that OSGi’s way doesn’t work. It does. These are solid technologies written by smart people. The point is that it introduces a lot of additional complexity, and you’re forced to really understand both Spring and OSGi to be productive. Yet Spring is the only framework that’s actually providing value (in the form of features) to your users; the extensibility component (OSGi) is a management concern.

What Zephyr gets us that OSGi didn’t

Testability

We’re big fans of unit tests, and we write a lot of them.  Ideally, if you’re sure components A and B both work, then the combination of A and B should work.  The reality is that sometimes it doesn’t, for a huge variety of reasons. For example, for us, using any sort of concurrency mechanism outside of Gyre could severely bork Gyre, which could and did bamboozle dozens of plugins. We’re small enough that we could just set a pattern, decree that hey, that is the pattern, and catch violations in reviews or with PMD rules. But once again, we just wanted to write integration tests, and we wanted to use Spring Test to do it.

With OSGi, you can create projects whose test classpath matches the deployment classpath (although statically), and we did.  We also wrote harnesses and simulations that would set up OSGi and deploy plugins from Maven, etc., and it all worked. But it was still complex, and it wasn’t just Spring Test. This was, and continues to be, a big source of pain for us.  The fact of the matter is that, once again, Spring was providing the developer benefit and OSGi was introducing complexity.

Quick Startup/Shutdown Times

We use a lot of Spring’s features and perform DB migrations in a variety of plugins, which is not an unusual use case.  A plugin might only take a few seconds to start, but added up over dozens of plugins, startup time became pretty noticeable.  There are some ways to configure a parallel bundle lifecycle, but they’re pretty esoteric, sometimes implementation-dependent, and always require additional metadata or code. With Zephyr, we get parallel deployments out of the box and by default, reducing startup times from 30+ seconds to 5 or so.
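
To make the idea concrete, here is a minimal sketch of dependency-ordered parallel startup in plain Java. It is not Zephyr’s implementation: the ParallelStarter class, the “reporting” plugin, and the no-op start action are all hypothetical, and the sketch assumes the dependency graph has no cycles. Each plugin gets a future that only runs once the futures of its dependencies have completed, so independent plugins are free to start at the same time.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: not Zephyr's API. Plugin names are for illustration only.
final class ParallelStarter {
    private final Map<String, List<String>> dependsOn; // plugin -> direct dependencies
    private final Map<String, CompletableFuture<Void>> started = new HashMap<>();
    private final Queue<String> startOrder = new ConcurrentLinkedQueue<>();
    private final ExecutorService pool;

    ParallelStarter(Map<String, List<String>> dependsOn, int threads) {
        this.dependsOn = dependsOn;
        this.pool = Executors.newFixedThreadPool(threads);
    }

    // Returns a future that completes once the plugin and all of its dependencies are up.
    // Assumes an acyclic graph and a single calling thread.
    private CompletableFuture<Void> start(String plugin) {
        CompletableFuture<Void> existing = started.get(plugin);
        if (existing != null) {
            return existing;
        }
        CompletableFuture<Void> dependenciesUp = CompletableFuture.allOf(
                dependsOn.getOrDefault(plugin, List.<String>of()).stream()
                        .map(this::start)
                        .toArray(CompletableFuture[]::new));
        // Stand-in for real startup work (context refresh, DB migrations, and so on).
        CompletableFuture<Void> future =
                dependenciesUp.thenRunAsync(() -> startOrder.add(plugin), pool);
        started.put(plugin, future);
        return future;
    }

    // Starts every plugin, blocks until all are up, and returns the observed start order.
    List<String> startAll() {
        try {
            CompletableFuture.allOf(
                    dependsOn.keySet().stream()
                            .map(this::start)
                            .toArray(CompletableFuture[]::new)).join();
            return new ArrayList<>(startOrder);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> graph = Map.of(
                "gyre", List.<String>of(),
                "anvil", List.of("gyre"),
                "troposphere", List.of("anvil"),
                "aws-ec2", List.of("troposphere"),
                "reporting", List.of("gyre"));
        // The exact order varies between runs, but a plugin never precedes its dependencies.
        System.out.println(new ParallelStarter(graph, 4).startAll());
    }
}
```

In this sketch “anvil” and “reporting” can start concurrently as soon as “gyre” is up, which is where the wall-clock savings would come from.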

Remote Plugins

One of our requirements is the ability to run plugins whose processes and lifecycles reside outside of Zephyr’s JVM. OSGi (understandably) wasn’t designed to support this, but Zephyr was.

Getting it right with Zephyr

We spent about two years wrangling OSGi and Spring, by turns coping with these and other problems either in code or operations. It was generally successful, but there was always an understanding that we were paying a high price in terms of time and complexity. After the first dozen or so plugins, we’d really come to understand what we wanted from a plugin framework.

To boot, we are pretty good at graph processing, and it had been clear to us for a while that the plugin management issues we were continually encountering were graph problems. Classpath dependency issues could be easily understood through the transitive closure of a plugin, and most of our plugins had the same transitive closure. Even if they didn’t, that was the disjoint-subgraph problem and we could easily cope with that. Correct parallel start schedules were easily found and correctly executed by Coffman-Graham scheduling, and we could tweak all of these subgraphs through subgraph-induction under a property.  Transitive reduction allowed us to easily and transparently avoid problems caused by non-idempotent plugin management operations.
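
A rough sketch of three of these graph operations, in plain Java, might look like the following. The PluginGraph class and its method names are hypothetical and are not Zephyr’s API, and the schedule here is a simple layered topological sort rather than Coffman-Graham, which additionally caps how many items may share a layer. The sketch assumes an acyclic dependency graph.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: class and method names are illustrative, not Zephyr's API.
final class PluginGraph {
    // plugin -> the plugins it directly depends on
    private final Map<String, Set<String>> dependencies = new HashMap<>();

    void addPlugin(String name, String... dependsOn) {
        dependencies.computeIfAbsent(name, k -> new TreeSet<>());
        for (String dep : dependsOn) {
            dependencies.computeIfAbsent(dep, k -> new TreeSet<>());
            dependencies.get(name).add(dep);
        }
    }

    // Transitive closure of one plugin: everything it needs, directly or indirectly.
    Set<String> closure(String plugin) {
        Set<String> seen = new TreeSet<>();
        Deque<String> stack =
                new ArrayDeque<>(dependencies.getOrDefault(plugin, Set.<String>of()));
        while (!stack.isEmpty()) {
            String next = stack.pop();
            if (seen.add(next)) {
                stack.addAll(dependencies.get(next));
            }
        }
        return seen;
    }

    // Transitive reduction for one plugin: drop direct dependencies that are
    // already implied by another direct dependency.
    Set<String> reducedDependencies(String plugin) {
        Set<String> direct = dependencies.getOrDefault(plugin, Set.<String>of());
        Set<String> kept = new TreeSet<>(direct);
        for (String dep : direct) {
            kept.removeAll(closure(dep));
        }
        return kept;
    }

    // Start schedule as layers: each plugin depends only on plugins in earlier
    // layers, so the members of one layer may be started in parallel.
    List<List<String>> startLayers() {
        Map<String, Set<String>> remaining = new HashMap<>();
        dependencies.forEach((plugin, deps) -> remaining.put(plugin, new HashSet<>(deps)));
        List<List<String>> layers = new ArrayList<>();
        while (!remaining.isEmpty()) {
            List<String> ready = new ArrayList<>();
            for (Map.Entry<String, Set<String>> entry : remaining.entrySet()) {
                if (entry.getValue().isEmpty()) {
                    ready.add(entry.getKey());
                }
            }
            if (ready.isEmpty()) {
                throw new IllegalStateException("dependency cycle among " + remaining.keySet());
            }
            Collections.sort(ready);
            remaining.keySet().removeAll(ready);
            for (Set<String> deps : remaining.values()) {
                deps.removeAll(ready);
            }
            layers.add(ready);
        }
        return layers;
    }

    public static void main(String[] args) {
        PluginGraph graph = new PluginGraph();
        graph.addPlugin("anvil", "gyre");
        graph.addPlugin("troposphere", "anvil", "gyre", "spring");
        graph.addPlugin("aws-ec2", "troposphere");
        System.out.println(graph.startLayers());
        // prints [[gyre, spring], [anvil], [troposphere], [aws-ec2]]
        System.out.println(graph.closure("aws-ec2"));
        // prints [anvil, gyre, spring, troposphere]
        System.out.println(graph.reducedDependencies("troposphere"));
        // prints [anvil, spring]
    }
}
```

Note how the direct troposphere-to-gyre edge disappears in the reduction because it is already implied through Anvil, and how every plugin that depends on Troposphere ends up with the same closure.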

Once we’d implemented those, we discovered that a lot of the problems we struggled with just went away. Required services could never become stale, and optional services just came and went.  A lot of the OSGi-Spring integration code we’d written became dramatically simpler, and we could provide simple but powerful Spring Test extensions that felt very natural.

What’s Next

But we’re not stopping with Spring: Zephyr can support any platform and any JVM language, and we’re planning on creating support for Clojure, Kotlin, and Scala initially as installable runtimes. We’re investigating Node.js support via Graal and should have some announcements about that in the new year. Spring is already supported, and we hope to add Quarkus and Dropwizard soon. And keep in mind that these integrations should require little or no knowledge of Zephyr at all.

We’re also in the process of open-sourcing a beautiful management UI, a powerful repository, and a host of other goodies — stay tuned!
