Page MenuHomePhabricator

Implement Almanac edit endpoints in Conduit
Closed, ResolvedPublic

Description

Provisioning in the cluster still has some significant manual components. I've automated some of the more annoying / error-prone parts, but ideally the whole thing should be automated.

The next-most-annoying manual step is currently configuring instances in Almanac. Between the provision launch and remote deploy steps, devices and bindings need to be created in Almanac. This is currently manual, but can be fully automated:

  • provision bind can read addresses out of EC2 (this code exists today).
  • Then it can read device and service records out of Almanac (I haven't written this yet, but it could be written today).
  • Finally, it can create or update missing bindings and records (this can't be written today because there are no Almanac *.edit endpoints yet).

Automating this would mostly leave us with just instance launching and Route 53 un-automated.

Revisions and Commits

Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
Restricted Differential Revision
rP Phabricator
D19343
D19342
D19341
D19340
D19338
D19337
D19336
D19335
D19334
D19329
D19328
D19325
D19324
D19323
D19322
D19321
D19320
D19318
D19317
D19316

Event Timeline

epriestley mentioned this in Unknown Object (Maniphest Task).Apr 20 2017, 10:07 PM
epriestley added a commit: Restricted Diffusion Commit.Jun 6 2017, 11:18 AM
amckinley claimed this task.Sep 8 2017, 1:45 AM

What's the best way to add API endpoints for resources like bindings? Call it almanac.create_binding and have it take a service and an interface as arguments?

The "most right" way in terms of consistency is to fully convert Binding to EditEngine, then implement almanac.binding.edit which can create/edit bindings. When creating a binding, it would require transactions specifying the service and interface. PhamePostBlogTransaction is sort of an example of this: when you create a new post with phame.post.edit, you must specify a blog transaction.

This is probably not hugely complicated. AlmanacInterface might be a little more work to modernize since it isn't currently transactional, but it's a fairly simple object so it's probably not too bad.

We should probably do this eventually regardless, just for consistency, but I think it will give us a pretty verbose API -- we'll need to make a lot of calls to synchronize things.

Because Almanac is unusual, it might be worth considering a special almanac.service.publish or somesuch which takes a big blob of JSON describing the desired state and makes all the edits to put things in that state. I'm not sure how practical this is -- we possibly run into the Terraform problem where slight misuses of the endpoint delete the world. A possible advantage of this approach is that Almanac mutations could be atomic by default, while calling 4-5 different almanac.thing.edit methods might require some care to ensure Almanac is never in an inconsistent/bogus state. But this might be less work than I'm imagining.

We currently do something sort-of-similar here:

https://secure.phabricator.com/source/services/browse/master/src/management/ServicesSyncWorkflow.php

That synchronizes services from admin to instances, so they know which databases to connect to.

It uses Almanac objects directly and sort of follows the path of "a lot of different edit methods" and feels pretty bad, but I think I might have just not written it very well. It's also doing a lot of stuff that we probably don't need to do to make sure that resources have the same PHIDs on instnaces and on admin. I was worried that if a device was renamed on admin for some reason, we might end up with a huge mess on instances. Another concern is that users can technically interact with Almanac themselves, so the code has to be careful about that. A sync from AWS to Almanac on admin doesn't need to worry about these issues as much.

The other shadow lurking in the water here -- which I think we can mostly avoid -- is that Almanac is mostly a-bit-bare-bones-but-overall-pretty-functional, except that the way properties on Bindings and Services are specified and edited is complete garbage. You more or less just have to magically know which properties are valid, and there's no real support for defaults or nice UI controls or hints about what you can set or suggestions that you're making stuff up and probably typo'd something.

EditEngine/EditField have some pretty good support for fixing this at some point, the Almanac pieces just got written before those existed. I think they won't impact this, but just something to maybe keep an eye out for.

It's also possible to write a custom instances.do-exactly-what-we-need sort of endpoint and generalize later if that seems like a more promising approach.

There's a bit of a mess with AlmanacInterface and AlmanacDevice. Currently, AlmanacInterface does not use transactions, and is edited purely as a side effect of INTERFACE transactions applying to AlmanacDevice. I'm going to change how this works so that AlmanacInterface is a normal transactional object and can use the same rules and infrastructure as everything else.

Before I can get rid of AlmanacDeviceTransaction::TYPE_INTERFACE, we have two meaningful callsites in rSERVICES and one unit test in rSAAS to clean up.

epriestley added a revision: Restricted Differential Revision.Apr 10 2018, 1:38 PM
epriestley added a revision: Restricted Differential Revision.Apr 10 2018, 2:24 PM
epriestley added a revision: Restricted Differential Revision.Apr 10 2018, 5:08 PM
epriestley added a revision: Restricted Differential Revision.
amckinley reassigned this task from amckinley to epriestley.Apr 10 2018, 7:32 PM
epriestley added a comment.EditedApr 11 2018, 2:08 PM

Everything here should pretty much work except:

Enable/Disable: There's no way to disable or enable bindings from the API. Before adding this, I want to consider whether "enable/disable" should be converted from a bool to a "status" field.

The use cases for API disable are testing failover (in PHI145) and something in PHI473 although I'm not actually sure what the use case was there. I think we don't directly need this capability in the upstream/Phacility today.

An adjacent use case is T10883: marking nodes as read-only.

One possible approach is to change "Enable/Disable" to "Access: Read-Only, Read/Write, None".

I tend not to like this because "Read-Only" won't be meaningful for all services, and I don't anticipate any other possible values here. Phacility/cluster plans don't involve anything here, I don't think. PHI270 discusses a "drain-this-resource/no-new-allocations" mode from a Drydock/Harbormaster perspective but that seems orthogonal. T10884 discusses sorting services by network distance but that couldn't be captured with a single "status" flag.

So I'm inclined to keep "enable/disable" as-is and just provide API access for them.

That implies that T10883 (read-only) will happen with a "writable" or property.

Properties: Properites can be read from some (but not all?) objects now, but can't be edited.

I'd like to improve the way properties are specified (which isn't actually quite as bad as I remembered) and provide this via EditEngine extension; I expect this to be straightforward.

Once both these pieces are in, I'll add some kind of "writable" property for repository bindings.

epriestley added a commit: Restricted Diffusion Commit.Apr 11 2018, 5:30 PM
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley added a commit: Restricted Diffusion Commit.
epriestley closed this task as Resolved.EditedApr 11 2018, 5:54 PM

I think that's pretty much everything. There will be a little followup work in T10883 and maybe T13076 / T13120.