Happy 1.0, Knative!

It’s been a long time coming, as mentioned in the blog post (depending on how you count it, almost 4 years). Here’s the timeline, as I experienced it:

Formation

In the fall of 2017, I was working on Google App Engine (in particular, the management API) as well as Google Cloud Functions, as part of a larger group inside Google Cloud known as “Serverless”. We’d just gotten a new senior PM in the previous year, and he was asking good questions about why we offered customers two very different interfaces that shared some of the same back-end infrastructure (while differing in other parts of the back end). Insert the standard xkcd.com/927 joke, and we decided that we’d build one platform to unify them all.

But we wouldn’t just build one platform, we’d build two platforms – one open-source, so that you could use it anywhere in the world, and one built on Google’s existing serverless experience. We’d also write a specification, so that tools could target both platforms with a common interface, and there would be API conformance tests to make sure that the two implementations were compatible. Work started in earnest around December of 2017, and a team of about 8 met 2-3 times a week for 5-6 weeks to design the control plane APIs and concepts based on some PM requirements and research on existing serverless platforms.

Once we had a general shape for the serving platform – the Routes, Configurations, and Revisions that you see in Knative today – we started shopping around for partners to build the open-source platform. Pretty quickly, we got a lot of interest from Red Hat, Pivotal, Cisco, VMware, IBM, and others.

By July, we had basic functionality for Knative Serving, and skeletons of Eventing and Build (the latter of which would later become Tekton).

The Real World

We announced Knative and Google Cloud Run in July 2018 at GCP Next, and got a lot of great feedback pretty much immediately. One piece of feedback came from Joe Beda (who looked at the project in two TGIK episodes, #44 and #46, and had a lot of kind words and early advice): we’d built a very tightly coupled system that tried to do everything “out of the box”, but was very hard to weave into a larger cluster story. For example, our recommended install included both Prometheus and an ELK stack, as well as an HA Istio install. The minimum cluster size was around 6 CPUs and 10-12GB of memory, before running any applications!

I dove right into this feedback, and we ended up making both the Serving and Eventing stacks substantially more “pluggable” at the bottom. This eventually grew into the “KIngress” interface (here’s an early doc I wrote collecting the requirements), supporting 4+ HTTP routing layers (Istio, Contour, Kong, and Kourier) with a “more-advanced-than-Ingress” feature set. We also worked on thinning out the requirements and unbundling some of the core components (the monitoring and log collection turned out to be something we weren’t even very good at; it was eventually abandoned, then re-initiated when people expressed interest in building and maintaining it in our “sandbox”).

This Is What Progress Looks Like

We spent much of the next year building capabilities towards a GA release. Eventing figured out an object model (Channel, Subscription) and some interesting Kubernetes tricks (duck typing resources), Build figured out a bunch of machinery for running standard containers in a specific order, Serving built out pluggability and features, and we figured out a release cadence (every 6 weeks, starting around 0.3).
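
As a quick aside on the duck-typing trick: the idea is that a controller only cares about a small fragment of a resource’s shape (for Eventing, typically that status.address.url exists), not the resource’s concrete Kind. Here’s a minimal, hypothetical Go sketch of that pattern; the AddressableDuck type is a simplified stand-in for Knative’s real duck types, and the example object is made up.

```go
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/runtime"
)

// AddressableDuck is a simplified stand-in for Knative's "Addressable" duck
// type: it describes only the fragment of a resource we care about
// (status.address.url), not the resource's full schema.
type AddressableDuck struct {
	Status struct {
		Address struct {
			URL string `json:"url"`
		} `json:"address"`
	} `json:"status"`
}

func main() {
	// Pretend this arrived from the API server as arbitrary JSON; it could be a
	// Knative Service, a Channel, or anything else that happens to expose
	// status.address.url.
	obj := map[string]interface{}{
		"apiVersion": "serving.knative.dev/v1",
		"kind":       "Service",
		"metadata":   map[string]interface{}{"name": "hello"},
		"status": map[string]interface{}{
			"address": map[string]interface{}{
				"url": "http://hello.default.svc.cluster.local",
			},
		},
	}

	var duck AddressableDuck
	if err := runtime.DefaultUnstructuredConverter.FromUnstructured(obj, &duck); err != nil {
		panic(err)
	}
	// A controller can now use the URL without knowing the concrete Kind.
	fmt.Println("sink URL:", duck.Status.Address.URL)
}
```

The real machinery in the project is more involved (typed informers, validation, and so on), but the core trick is just decoding arbitrary objects into a partial schema like this.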

In early 2019, the Tekton project split off from Knative, taking the “Build” component with it. Amusingly, a lot of people missed this split, and still think that Knative has something to do with converting code into containers…

Around the same time, the team realized that the initial API shape that we’d built had some rough edges and inconsistencies that suggested that we should revise the API on the way to a v1 API. A few principles emerged:

  • We would try to make a Configuration look similar to other resources that used a PodTemplateSpec, like Deployment and Job. This would improve the ability for users to transition between Kubernetes resources and Knative resources (see the sketch after this list).
  • We removed the “build” components from Configuration. This simplified the Configuration and Revision lifecycle, because they would always be driven by explicit Kubernetes actions, rather than (sometimes) having an embedded build which could trigger a new revision.
    • Removing build from the Configuration also smoothed out a wart that we discovered – having on-cluster build as part of deployment was great when getting started, but it ended up being a bit of a dead-end when you needed to manage more sophisticated rollouts. You ended up building a bunch of tooling which you’d need to throw away later when you graduated to a more capable solution. Lesson: think about the user’s overall project lifecycle, and try to respect their investments.
  • Service should be a simple composition of Configuration and Route. The initial “Service” concept was somewhat shoehorned in by Google PM and UX researchers, who (correctly) recognized that a single object was easier for novice developers to grok. Unfortunately, the initial design ended up being more “modal” than you’d want a resource to be, so the “Service = Route + Configuration” model was pretty appealing.
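
To make those principles concrete, here’s a rough Go sketch of the v1 API shape as I understand the knative.dev/serving client types: ServiceSpec is literally the composition of a ConfigurationSpec and a RouteSpec, and the Revision template embeds a plain PodSpec, mirroring Deployment and Job. The names and image below are placeholders, not from the original post.

```go
package main

import (
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	servingv1 "knative.dev/serving/pkg/apis/serving/v1"
	"sigs.k8s.io/yaml"
)

func main() {
	// Service = Configuration + Route: ServiceSpec embeds both ConfigurationSpec
	// and RouteSpec, so a Service is just the composition of the two.
	svc := servingv1.Service{
		TypeMeta:   metav1.TypeMeta{APIVersion: "serving.knative.dev/v1", Kind: "Service"},
		ObjectMeta: metav1.ObjectMeta{Name: "hello", Namespace: "default"},
		Spec: servingv1.ServiceSpec{
			ConfigurationSpec: servingv1.ConfigurationSpec{
				// The Revision template embeds an ordinary corev1.PodSpec,
				// mirroring the PodTemplateSpec shape of Deployment and Job.
				Template: servingv1.RevisionTemplateSpec{
					Spec: servingv1.RevisionSpec{
						PodSpec: corev1.PodSpec{
							Containers: []corev1.Container{{
								Image: "gcr.io/knative-samples/helloworld-go",
							}},
						},
					},
				},
			},
			// RouteSpec is left empty here; by default the Route sends 100% of
			// traffic to the latest ready Revision of the Configuration.
		},
	}

	out, _ := yaml.Marshal(svc)
	fmt.Println(string(out)) // prints the YAML you'd normally `kubectl apply`
}
```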

Updating the Serving API at that time meant crawling over a lot of broken glass with CustomResourceDefinitions, before tools like storage conversion webhooks were complete. We did a bunch of that, and passed the feedback upstream. Things were running pretty smoothly and we started talking about what “GA” might look like internally at Google, when…

Cooperation Is Hard

Senior leadership at Google decided that they didn’t want to donate Knative to the CNCF (which many internally and externally had expected). This was announced with little discussion in October of 2019, a little over a year after the public launch of the project. Note that the decision was made well above the actual team at Google working on the project, and the Google steering members had no say in the matter; they were just the messengers.

What had been a fairly cooperative and friendly working environment was quickly poisoned; for about 2 months, :strike: emoji reactions were a common sight in the Knative Slack. Those of us on the project working for Google understood how our colleagues felt, but couldn’t do much except scramble inside Google to figure out what sort of damage control and concessions we could offer.

In November, two of the three TOC members at the time (Matt Moore, Ville Aikas) left Google for VMware in protest, along with Scott Nichols. Unrelated to these events, I also left Google in December to join VMware; my reasons had nothing to do with the Knative project, and everything to do with the larger changes in culture at Google over the previous 8 years. Being at Google when the “Thanksgiving Four” were fired gave me one last opportunity to register my protest; I ended up departing on my 15-year anniversary at the company.

Eventually, fences were mended, but the community had lost a certain amount of trust in Google. Lesson: once spent, trust is expensive to regain.

Getting Back on Track

Heading into 2020, it was unclear whether Knative would remain a multi-vendor project, or whether there would be a fork and a project rename. Much of 2020 was devoted to improving the internal governance of the project, starting with official procedures and elections for the Technical Oversight Committee (TOC), followed by procedures for electing the Steering Committee and the formation of a separate vendor-appointed Trademark Committee.

2020 ended with a TOC with 3 elected members and 2 bootstrap members, and a Steering Committee with 2 elected members and 3 historical members, which was an enormous step forward on the governance front. Prior to 2020, all seats were held by vendor representatives based on contribution numbers, which sometimes meant that important leadership seats were filled by “warm bodies” who didn’t really have time to devote to the project.

For a lot of contributors, 2020 was a really painful year; we also had a number of other contributors switch jobs and drift away from the project (a trend which would continue in 2021, though 2021 would also see new contributors arrive in greater numbers than previously).

Technically, the Serving side of the project continued to add features to become more production-ready, including better support for automatic TLS, the introduction of the DomainMapping construct, and feature flags for gating new components and capabilities. The Eventing side made a few major moves during this time as well. The initial implementations of Channel and Broker were single-tenant, and tended to create a lot of unexpected resources in the user’s namespace; in 2020, many of these were moved to multi-tenant implementations which live in the knative-eventing namespace and can be scaled based on cluster size, rather than per-resource.
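
For readers who haven’t run into it, DomainMapping points a custom domain at an existing Knative Service (or any other addressable target). Below is a hedged Go sketch of what that looks like, assuming the v1beta1 API types; the domain, namespace, and Service name are made up for illustration, so double-check the field names against the actual API before relying on this.

```go
package main

import (
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	duckv1 "knative.dev/pkg/apis/duck/v1"
	servingv1beta1 "knative.dev/serving/pkg/apis/serving/v1beta1"
	"sigs.k8s.io/yaml"
)

func main() {
	// The mapping's name is the custom domain itself; spec.ref points at the
	// Knative Service (or other Addressable) that should serve that domain.
	dm := servingv1beta1.DomainMapping{
		TypeMeta:   metav1.TypeMeta{APIVersion: "serving.knative.dev/v1beta1", Kind: "DomainMapping"},
		ObjectMeta: metav1.ObjectMeta{Name: "hello.example.com", Namespace: "default"},
		Spec: servingv1beta1.DomainMappingSpec{
			Ref: duckv1.KReference{
				APIVersion: "serving.knative.dev/v1",
				Kind:       "Service",
				Name:       "hello",
			},
		},
	}

	out, _ := yaml.Marshal(dm)
	fmt.Println(string(out))
}
```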

Getting Ready for GA

At the beginning of 2021, Serving had effectively been GA-ready for a year or a bit more, but the “1.0” version number had been held up by governance concerns and a general need to figure out what being “1.0” actually meant. Eventing was also getting close to complete, and began an effort both to define a conformance specification and to migrate its API resources from v1alpha1 to v1. By April, there was talk about launching a 1.0 and having a conformance program that would allow vendors to license the Knative trademark. I participated a bit in this in June and July by rewriting the eventing specs (with lots of helpful community feedback) to make the conformance test conditions a bit more precise.

By September, we had a fairly good handle on what the Serving and Eventing specs for conformance would look like, but also a much better idea of how big a task a conformance program would actually be from the technical, legal, and operational points of view. In addition to writing or reviewing 100-200 tests for each badge, there were challenges around packaging and versioning the tests, administering and recording the results, deciding what rights would be granted and for how long, how that versioning interacted with vendor stability requirements, etc.

With some regret, we removed “have conformance program running” from the GA requirements, and replaced it with “verify that 1.0 would pass conformance suite”, a somewhat fuzzier goal. The intent was to avoid having to either adjust the specifications or make a backwards-incompatible change immediately post-1.0 just so the open-source Knative release would actually conform to its own specification, without needing to build out all the conformance tests first. The last of these changes landed early in the 0.27 release cycle, and it looked like we would have all the GA requirements complete with room to spare.

On Oct 7, we announced that the next release would be numbered 1.0. Amusingly, two weeks before the release, we realized that while we’d announced the new version number, our automation didn’t actually support it, and we needed to figure out exactly what that meant. We ended up drawing up a number of plans, comparing them, and then executing on the best one given the various constraints. Maybe we’ll wake up smarter tomorrow and pick a different plan in the future.

That Brings Us To Today

The amazing release team pulled out all the stops to get the release done in one day (it often takes 2-3 days, though not all of that is intensive work). The docs and operator are released, and there are fresh knative-v1.0.0 releases in Serving, Eventing, and the other 27-ish repos which are part of the release train.

And, this blog is running on Knative 1.0!
