Evaluate various "infrastructure-as-code" products
Closed, Resolved · Public

Description

We're looking into some off-the-shelf solutions to help manage hosted phacility.com instances, as well as private cluster support.

Desired features:

  • Great integration with AWS
  • Works well for both "orchestration" (spinning up new EC2 resources on demand) and "config management" (installing PHP/nginx/etc on new nodes)
  • Preferably not a client/server model, to minimize the number of moving parts
  • Supports multiple simultaneous users

Candidates:

  • Terraform
  • Chef
  • Puppet
  • CFEngine
  • Ansible
  • SaltStack

Event Timeline

Terraform Review
Pros:

  • Simple declarative syntax
  • Intended to be invoked first with a dry run, which shows in detail all the resources that will be rebuilt or altered
  • Very focused on "immutable" infrastructure. Instead of manipulating resources in place, prefers to spin up new resources with the new config and spin down the old ones. For the stateful hosts, we might need to be clever about mounting and unmounting EBS volumes when instances get rebuilt.

Cons:

  • Client-only model requires the maintenance of a "state file". This can live in S3, but out of the box there's no locking support (plugins are available to use Consul, though). Corrupting the state file isn't the end of the world because it can be regenerated by querying AWS, but it still looks like a very brittle piece. I'll play around with corrupting the state file on purpose to see what the worst case is (maybe the current operation fails, maybe all infrastructure is wiped). As an alternative, running terraform only on the bastion with a lockfile on that host seems like a much better setup, and arguably preferable to having devs walking around with AWS credentials on their laptops. (A minimal sketch of the S3 backend and a bootstrap provisioner follows this review.)
  • "Don't use Terraform for config management" is pretty much written on the tin. This isn't quite a dealbreaker if we decide to use Docker containers (please no), pre-baked AMIs (somewhat less insane), or some compromise that continues to use first-party solutions for config management. From the docs:

"Provisioners are only run when a resource is created. They are not a replacement for configuration management and changing the software of an already-running server, and are instead just meant as a way to bootstrap a server. For configuration management, you should use Terraform provisioning to invoke a real configuration management solution."

  • Written in Go, which none of us have a lot of expertise in.
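
For reference, here's a minimal sketch of what a config with remote state in S3 and a bootstrap-only provisioner looks like (this is the 0.9-style backend config; all names, the AMI, and the bucket are placeholders, and I haven't run this exact snippet):

terraform {
  backend "s3" {
    bucket = "example-terraform-state"   # placeholder bucket
    key    = "clusters/example.tfstate"
    region = "us-west-2"
  }
}

resource "aws_instance" "repo001" {
  ami           = "ami-00000000"         # placeholder AMI
  instance_type = "m4.large"

  # Provisioners only run when the resource is created: they bootstrap the
  # host and then hand off to real config management (rCORE, in our case).
  # Connection/SSH details omitted.
  provisioner "remote-exec" {
    inline = ["sudo mkdir -p /core"]
  }
}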

Just out of curiosity, why is Salt not a candidate? I think it is very comparable to the others.

Just out of curiosity, why is Salt not a candidate? I think it is very comparable to the others.

Just an oversight when putting up the list. Will look at Salt as well.

You probably don't want to put much value in advice from passing Internet strangers, but... Having used Ansible for configuration management for a while, it certainly has a "worse is better" feeling to it. You folks seem like you might be more of a "better is better" outfit. Contrast, for example, your approach to upstreaming user-written linters versus their approach to upstreaming user-written modules.

Now, I'd actually quite like you to pick Ansible, but for the selfish reason that you'll probably wind up building some interesting integrations into Phabricator for it, not because it's necessarily the right fit for you.

Having played with a few more of these tools and thinking about the problem, I'm starting to lean towards "use Terraform or CloudFormation for orchestrating AWS infrastructure, and leave the provisioning stuff in rCORE as-is". The basic flow for provisioning private instances could be:

  1. Internal status page that checks if stuff like TLS certs have been set up by the customer
  2. "Go" button that generates the CloudFormation or Terraform template from the customer's config
  3. Maybe that template gets automatically checked in to version control? Maybe you just download it locally, check it in yourself, and run terraform apply by hand on the bastion? Maybe launch a daemon job to run terraform apply and eventually report the results back to the status page?
  4. Operator observes that the output of terraform apply looks good
  5. Operator runs bin/remote deploy
  6. Profit
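
For step 2, the generated artifact could be as small as a single Terraform file with the customer-specific values baked in by PHP. A sketch with made-up values (untested):

# Hypothetical generated file: customers/red-widgets/cluster.tf
variable "customer" {
  default = "red-widgets"
}

resource "aws_instance" "db001" {
  ami               = "ami-00000000"     # placeholder base AMI
  instance_type     = "r4.large"
  availability_zone = "us-west-2a"
  tags = {
    Name = "${var.customer}-db001"
  }
}

resource "aws_ebs_volume" "db001_data" {
  availability_zone = "us-west-2a"
  size              = 256                # GB, from the customer's plan
}

resource "aws_volume_attachment" "db001_data" {
  device_name = "/dev/xvdf"
  volume_id   = "${aws_ebs_volume.db001_data.id}"
  instance_id = "${aws_instance.db001.id}"
}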

Some other thoughts:

  • I'm leaning heavily away from the tools that require a dedicated server (Chef, Puppet, Salt) because that's just a ton more code to bring into our environment, manage, and depend on. It also adds another trusted piece of infrastructure that needs to have access to every single instance for all time, and it requires running a dedicated agent process on every host and opening up ports to access it, instead of just relying on SSH. That leaves Ansible, CloudFormation, and Terraform.
  • I've found Terraform modules for all the AWS functionality I think we might need, so Terraform doesn't give anything up to CloudFormation there, and unlike CloudFormation it isn't AWS-only. CloudFormation is also closed-source, limiting our debugging options.
  • Ansible does everything we would need, but is more focused on the provisioning/configuring stuff that we already handle pretty well in rCORE. It's basically "provisioning is an afterthought in Terraform", "orchestration is an afterthought in Ansible", and orchestration is the piece we're missing right now.

How does the weekly deployment to all clusters work?

In the older bin/provision model, the script updates Almanac on admin after it brings resources online. (This doesn't actually happen today because it's blocked by T12414, so the actual workflow is "go manually copy stuff into Almanac", but the next change under bin/provision-based orchestration would be to automate that step.) Then, bin/remote and phage read from Almanac to identify hosts (today, with phage --pools db,repo for example), and production services also read from Almanac to figure out where other services (like databases) live.

If we're sticking with this model but using terraform apply to make API calls for us, what updates Almanac? Are we going to write a Terraform module for that? How involved is it? Or, if we aren't sticking with this model, where does the list of devices that need to be deployed come from, and how do Service definitions make it into Almanac so that, e.g., https://phabricator.red-widgets.com/ knows which database server it should connect to to serve a page?

Currently, we have a fair amount of information about the structure of the cluster in PHP. For example, bin/host knows which volumes attach to which mountpoints on a "repo" machine, and bin/provision knows how large those volumes should be. Does the authoritative copy of this information stay in PHP, or move to some kind of "resource definition" in Terraform? (What if we let installs purchase add-on storage?)

Today, bin/provision has some code to execute a limited set of orchestration changes. Terraform appears to replace "call AWS" with "write to a file, then call terraform apply to have Terraform call AWS". Admittedly, this is easier: the Terraform syntax is nicer than the raw AWS syntax, especially for some of the APIs like Route 53 which are kind of a mess to call. But "call AWS" isn't that hard -- what else are we getting from Terraform? Or is that the bulk of the value?

When you run terraform apply and the cluster already exists but reality disagrees with the configuration, what happens? Suppose a host has a volume of the wrong size mounted on /dev/xdf. What does Terraform do?

From the documentation here, it looks like the answer might be "destroy the volume", based on the workflow of changing an instance type, running terraform plan, and seeing Terraform plan to destroy the resource?

https://www.terraform.io/intro/getting-started/change.html

If a future hire wants to increase the size of rdata volumes, so they update the config from 256GB to 512GB and then run terraform apply, does that destroy all the repository data in the entire cluster without so much as prompting them?

The CLI output of terraform apply destroying resources is very surprising to me. And it seems like it truly destroys them, permanently, without prompting? There's no terraform undo or terraform empty-trash? This approach makes perfect sense if your infrastructure is totally stateless, but about 90% of the hosts we deploy today are stateful. The only related option I see here is lifecycle.prevent_destroy, which is something, but a far cry from how carefully we handle data in first-party stuff.
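
For concreteness, here's roughly how that one guard rail reads (hypothetical resource, untested):

resource "aws_ebs_volume" "rdata001" {
  availability_zone = "us-west-2a"
  size              = 256

  # With this set, a plan that would destroy the volume errors out instead
  # of silently scheduling the destroy.
  lifecycle {
    prevent_destroy = true
  }
}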

Today, we technically have a process for growing volumes "mostly in place" in bin/host swap: it mounts the new volume on /dev/xdvi, copies the donor volume to it, then detaches the new volume and degrades into some manual nonsense with button clicking. This doesn't tend to make much sense in the shared cluster as an actual approach, but might make good sense in private clusters. If we wanted to do this workflow with Terraform (mount new volume, copy data, umount both volumes, swap volumes), how would we? Can Terraform do something like this on its own? Would we have to write a module? Is this basically not a workflow Terraform can do, so we'd just have bin/host swap talk to AWS directly?

One point from earlier is that Terraform is necessarily more battle-tested than anything first-party, but looking through the GitHub issues list I see a couple of open issues which look like they might be significant correctness problems:

Some more AWS-specific stuff here, too: https://github.com/terraform-providers/terraform-provider-aws/issues

How long would it take us to understand and fix a problem like https://github.com/hashicorp/terraform/issues/14547 if we encountered it during an incident response? Generally, what do we do if terraform apply doesn't work? We can't stop a terraform apply in the middle to make a manual adjustment by clicking a button, right?

Terraform also seems to have terraform taint (mark a resource for destruction) but no, e.g., terraform freeze (mark a resource for preservation). This is sort of philosophically bewildering to me. Is the modern approach to ops just like "data isn't important"? Is everyone just really, really careful when they type commands into their fully-automatic resource destruction rifles loaded with high-explosive foot-seeking shells?

Also:

https://github.com/hashicorp/terraform/issues/3885

Also also:

https://github.com/hashicorp/terraform/issues/1139

I hear what you're saying [that you didn't like it when Terraform deleted your database], and I'm sorry [deleting your database] caused you trouble. The other side of this tradeoff is "should terraform by default leak artifacts that a user will have to pay for?"

Yes? "Obviously, yes?" This was apparently changed in 0.9, released in March 2017.

First, a disclaimer that I'm not by any means wedded to terraform. From playing with the other tools, terraform looked like the clearest pathway towards my goal of "minimizing AWS UI clicks", so I decided to see what the terraform config for a private cluster would look like and play with the tool. See example terraform configs here: P2063 P2064. I agree with the general thrust of your comment, which I interpret as "this all sounds terrifying and we probably shouldn't do it", but I wanted to at least get a feeling for the tool and build a proof of concept.

How does the weekly deployment to all clusters work?

I envision only invoking terraform apply to create new clusters or to make hardware changes to existing clusters (like adding new nodes or regions). The weekly deploy through bin/remote would work as-is.

If we're sticking with this model but using terraform apply to make API calls for us, what updates Almanac?

What information needs to get propagated back to Almanac? From poking around a bit, it looks like it's just hostnames and IPs (as opposed to things like EC2 instance IDs). If PHP just injects that information into the templates during generation, we might not need to feed anything back into Almanac.

Currently, we have a fair amount of information about the structure of the cluster in PHP. For example, bin/host knows which volumes attach to which mountpoints on a "repo" machine, and bin/provision knows how large those volumes should be. Does the authoritative copy of this information stay in PHP, or move to some kind of "resource definition" in Terraform? (What if we let installs purchase add-on storage?)

I don't see any reason why we can't leave PHP as the authority. In my mental model, we'll have PHP code generating these terraform configs and terraform just carries them out.

Today, bin/provision has some code to execute a limited set of orchestration changes. Terraform appears to replace "call AWS" with "write to a file, then call terraform apply to have Terraform call AWS". Admittedly, this is easier: the Terraform syntax is nicer than the raw AWS syntax, especially for some of the APIs like Route 53 which are kind of a mess to call. But "call AWS" isn't that hard -- what else are we getting from Terraform? Or is that the bulk of the value?

We also get the ability to deploy changes across the infrastructure quickly (say, adding a new listener to the load balancers), resource creation parallelism (not that important since it's not like we're spinning up dozens of clusters per day), dependency ordering (don't create routing tables until the subnets are created), waiting on AWS calls that run asynchronously (i.e., polling to see when an EC2 instance has actually finished creating as opposed to when the API call completes), and version control for infrastructure (because we're using text files).
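
For example, the dependency ordering mostly falls out of interpolation; a sketch (placeholder CIDRs, untested):

resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  vpc_id     = "${aws_vpc.main.id}"
  cidr_block = "10.0.1.0/24"
}

resource "aws_route_table" "private" {
  vpc_id = "${aws_vpc.main.id}"
}

# Interpolating the other resources' ids creates implicit dependencies, so
# Terraform creates the VPC first, then the subnet and route table, and
# only then this association -- no explicit ordering needed.
resource "aws_route_table_association" "private" {
  subnet_id      = "${aws_subnet.private.id}"
  route_table_id = "${aws_route_table.private.id}"
}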

When you run terraform apply and the cluster already exists but reality disagrees with the configuration, what happens? Suppose a host has a volume of the wrong size mounted on /dev/xdf. What does Terraform do?

I intend to test this particular case, but I believe it will make a new host with a new volume of the correct size, and attempt to destroy the old host and volume.

If a future hire wants to increase the size of rdata volumes, so they update the config from 256GB to 512GB and then run terraform apply, does that destroy all the repository data in the entire cluster without so much as prompting them?

That is probably what would happen. Assuming we wanted to use terraform to handle migrations like changing rdata volume sizes, we would likely need to write our own module that handles preserving our stateful resources.
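
Very roughly, the shape such a module might take (entirely hypothetical, just to size the work):

# Hypothetical module: modules/stateful_volume/main.tf
variable "name" {}

variable "availability_zone" {}

variable "size_gb" {
  default = 256
}

resource "aws_ebs_volume" "this" {
  availability_zone = "${var.availability_zone}"
  size              = "${var.size_gb}"

  tags = {
    Name = "${var.name}"
  }

  # Refuse to let Terraform destroy the volume; growing it would be handled
  # by first-party tooling (snapshot, copy, swap), not by Terraform.
  lifecycle {
    prevent_destroy = true
  }
}

output "volume_id" {
  value = "${aws_ebs_volume.this.id}"
}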

The CLI output of terraform apply destroying resources is very surprising to me. And it seems like it truly destroys them, permanently, without prompting? There's no terraform undo or terraform empty-trash? This approach makes perfect sense if your infrastructure is totally stateless, but about 90% of the hosts we deploy today are stateful. The only related option I see here is lifecycle.prevent_destroy, which is something, but a far cry from how carefully we handle data in first-party stuff.

Yes, terraform apply will cheerfully destroy things and re-create them if there's no way to modify in-place an attribute you want changed. Agreed that terraform is much better suited to a stateless infrastructure.

Today, we technically have a process for growing volumes "mostly in place" in bin/host swap: it mounts the new volume on /dev/xdvi, copies the donor volume to it, then detaches the new volume and degrades into some manual nonsense with button clicking. This doesn't tend to make much sense in the shared cluster as an actual approach, but might make good sense in private clusters. If we wanted to do this workflow with Terraform (mount new volume, copy data, umount both volumes, swap volumes), how would we? Can Terraform do something like this on its own? Would we have to write a module? Is this basically not a workflow Terraform can do, so we'd just have bin/host swap talk to AWS directly?

We would either write our own module to handle this, or have bin/host swap talk to AWS directly.

One point from earlier is that Terraform is necessarily more battle-tested than anything first-party, but looking through the GitHub issues list I see a couple of open issues which look like they might be significant correctness problems:

I've been looking through the open GitHub issues as well, but I didn't spot 14547. As far as I'm concerned, assuming there's no user error explanation forthcoming, that's a dealbreaker for having terraform manage EC2 instances. "terraform apply will faithfully carry out the output of terraform plan, especially if you provide the same state file" is pretty much the core promise of the product.

How long would it take us to understand and fix a problem like https://github.com/hashicorp/terraform/issues/14547 if we encountered it during an incident response? Generally, what do we do if terraform apply doesn't work? We can't stop a terraform apply in the middle to make a manual adjustment by clicking a button, right?

Agreed that this is not confidence-inspiring. Some more lurid stories here and here. There is no button (aside from ctrl-c) to stop a terraform apply in the middle.

To the extent that terraform could be useful to us, we should probably just use it for stateless pieces, like network configurations and load balancers. That would at least give us the ability to rapidly change the infrastructure without writing dedicated tooling for each change we wanted to make.

Some other anecdotes from playing around: terraform seems to have mixed opinions about which API errors should result in bailing out. I've seen some errors like "an ALB must be attached to at least two subnets" stop the run right away (but, disturbingly, in the apply phase instead of the plan phase), while other errors like "you're trying to attach a routing table to a subnet that doesn't exist" only fail after a bunch of retries and an eventual timeout. Presumably errors like the latter happen all the time and can't be treated as unambiguous "time to abort" errors: AWS APIs are eventually consistent, so a call can succeed before its results have propagated everywhere, and later calls that depend on those results fail temporarily. Alternately, this behavior could be considered a strength, since our first-party tooling will have to deal with this "maybe this isn't an error and we're just waiting for AWS to become consistent" issue when making our own calls.

I've also seen terraform attempt to recreate EC2 instances for changes completely unrelated to the instance. This turned out to be the problem: if you specify security_groups in your instance config instead of the more modern vpc_security_group_ids, terraform pretty much always wants to recreate the instance.
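
For reference, the working shape is roughly (placeholder ids, untested):

resource "aws_security_group" "web" {
  name   = "web"
  vpc_id = "vpc-00000000"                # placeholder VPC id
}

resource "aws_instance" "web001" {
  ami           = "ami-00000000"         # placeholder AMI
  instance_type = "m4.large"

  # On a VPC instance, reference security groups by id here. Listing group
  # names under "security_groups" is what made Terraform want to rebuild
  # the instance on nearly every plan.
  vpc_security_group_ids = ["${aws_security_group.web.id}"]
}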

Maybe we should just use CloudFormation ¯\_(ツ)_/¯

At least with CF templates, we can do stuff like "wait on signal" to get nodes to announce stuff like "Dear AWS: I'm done importing this backup (or whatever); please proceed to the next phase of the plan".

If PHP just injects that information into the templates during generation, we might not need to feed anything back into Almanac.

PHP can't know the IPs ahead of time (I think?) since they don't exist until EC2 allocates them. So PHP can say "I want a new database server", and then make that a real thing with either "terraform apply" or by calling AWS directly, but then it needs to either ask AWS or Terraform what the server's IP is, or AWS/Terraform need to tell it as a side effect of whatever they do.

Having Terraform publish results into Almanac is presumably possible (and seems a little cleaner?) but I assume it's a moderate amount of work in Go and from looking at the files I'm not immediately sure how you would structure a side effect like that (via "Modules"?). Also not sure how fluent you are in Go. I've generally had positive experiences with it but have roughly ~8 hours of total experience under my belt.

Having PHP run terraform apply and then go look at AWS to see what it did and read IPs from AWS would work, but seems a little flimsier and relegates Terraform to a fairly thin wrapper around the AWS APIs and means we're still talking to AWS directly from PHP.

Or PHP could run terraform apply, then parse the Terraform state file, I think? That feels very fragile but would be less work than either alternative.
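
A possibly less fragile middle ground: have the generated config declare the values we need as outputs, and have PHP shell out to terraform output (it has a JSON mode) after the apply, rather than parsing the raw state file. A sketch (hypothetical resource name):

output "db001_private_ip" {
  value = "${aws_instance.db001.private_ip}"
}

output "db001_public_ip" {
  value = "${aws_instance.db001.public_ip}"
}

PHP would still be shelling out and parsing JSON, but against a small, intentional interface instead of Terraform's internal state format.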

(Or am I misunderstanding a piece of the flow?)

We also get the ability to deploy changes across the infrastructure quickly (say, adding a new listener to the load balancers), resource creation parallelism (not that important since it's not like we're spinning up dozens of clusters per day), dependency ordering (don't create routing tables until the subnets are created), waiting on AWS calls that run asynchronously (i.e., polling to see when an EC2 instance has actually finished creating as opposed to when the API call completes), and version control for infrastructure (because we're using text files).

Quick changes are conceptually nice, but does Terraform accomplish them by actually making changes, or always by throwing away resources and starting new ones? If adding a listener to the LBs deletes all the LBs first, that seems sort of not useful. There are some create_before_destroy flags, so maybe this would be fine for LBs with the right config (but wouldn't we need to create, change DNS, wait for it to propagate, then tear down the old ones? And Terraform can't do "wait for it to propagate", at least not easily?). I also don't know how often we need this capability, and if we were first-party I don't think "write the call to add a listener" would dramatically extend the total time involved in "plan, make a test change, test that it works, make the real production change, test that it works".
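
For concreteness, the lifecycle flag reads roughly like this (placeholder values, untested):

resource "aws_elb" "web" {
  name    = "web"
  subnets = ["subnet-00000000", "subnet-00000001"]   # placeholder subnets

  listener {
    instance_port     = 80
    instance_protocol = "http"
    lb_port           = 80
    lb_protocol       = "http"
  }

  # If a change forces replacement, build the new ELB before destroying the
  # old one. Terraform still won't wait for DNS to cut over, though.
  lifecycle {
    create_before_destroy = true
  }
}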

Like, it seems as though the Terraform strategy for doing this is "bring up a new cluster with a full set of nodes (but this time they have all the listeners), then swap over, then delete the old one", maybe? If our servers were stateless and our cluster size was bounded I think that would be a good strategy, but it seems weirdly unscalable. Notes like this one make me think that Terraform isn't trying to deal with this?

"Note: using apply_immediately can result in a brief downtime as the server reboots."
"https://www.terraform.io/docs/providers/aws/r/db_instance.html"

But maybe I'm misunderstanding how Terraform works and it actually can perform a useful set of mutations without destroying + recreating?

I intend to test this particular case, but I believe it will make a new host with a new volume of the correct size, and attempt to destroy the old host and volume.

I, uh... okay. What? How can any non-toy infrastructure work with this tool except by destroying and recreating the entire world on every change? If we try to resize the log volume on a database server, we not only lose all the data forever but the database server moves to a different IP so all the services that talk to it break?

aside from ctrl-c

I found this -- I think this issue is sort of not very important (local state shouldn't matter much?) rather than disastrous -- but for completeness, "^C destroys all your data too":

https://github.com/hashicorp/terraform/issues/13851

Alternately, this behavior could be considered a strength, since our first-party tooling will have to deal with this "maybe this isn't an error and we're just waiting for AWS to become consistent" issue when making our own calls.

I'd guess that sooner or later we'll have at least one case where we want an error to have a different severity than the provider gives it (e.g., we want "INSTANTLY ABORT" and Terraform might keep going), and it seems somewhat involved to change that (fork the module, make Go changes, recompile?).

As a sort of concrete example, I previously observed some cases where instances launched with the "Allocate Public IP" checkbox checked failed to come up with a public IP. I'm fairly sure this was an AWS bug: in several cases I launched a series of instances (e.g., 8 hosts), a small number (e.g., 2) were affected, and relaunching them without adjusting configuration fixed it. The Terraform provider doesn't appear to check for this (although I might be misreading), and adding a check seems like fork + Go? Or "manual AWS API calls after terraform apply to verify the state". This isn't specifically something we should need to do going forward, and I "mostly" trust that AWS works correctly almost all the time, but the actual code in the Terraform providers looks a lot like the PHP code in bin/provision launch -- individual calls and manual checks, not some kind of cohesive magic which completely compares state -- except much further out of reach if we need to modify it to tailor checks.

That would at least give us the ability to rapidly change the infrastructure without writing dedicated tooling for each change we wanted to make.

Can Terraform actually do this without downtime (and without deleting all the hosts behind that infrastructure)?


Terraform seems potentially promising as a sort of bridge tool between the dumb button clicking of today and some magical distant future where all the calls are built out, but its apparent tendency to delete all hosts and resources makes me uneasy about it.

That said, it's obviously great for infrastructure prototyping, and I think the fact that P2063 / P2064 actually build working infrastructure that's reproducible and described in a compact text file is pretty huge. I can get a much better sense of what you've done there by looking at those files than by clicking around a bunch of panels in EC2. But maybe terraform is more of a prototyping tool to get config into the right ballpark, which we stretch a bit into initial provisioning, and then once it works we do the legwork to write a real provisioning flow or something like that, and never ever let terraform run against anything in production.

It looks like Terraform probably can add a listener without destroying the LB (since it's a separate resource type? Although the "changing the size of a volume destroys the instance" case suggests maybe it doesn't work like that), but if you want to enable access logging on your LBs, hope you didn't need 'em for anything important? I could probably just install it and figure this out / nuke production forever.
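
From skimming the provider docs, a listener on the newer ALBs is its own resource that just points at the LB by ARN, so adding one should plan as a straight create rather than a modification of the LB itself. A sketch (placeholder ids, untested):

resource "aws_alb" "web" {
  name    = "web"
  subnets = ["subnet-00000000", "subnet-00000001"]   # placeholder subnets
}

resource "aws_alb_target_group" "web" {
  name     = "web"
  port     = 80
  protocol = "HTTP"
  vpc_id   = "vpc-00000000"                          # placeholder VPC id
}

# Adding this block later should show up in the plan as a new
# aws_alb_listener only, without touching the aws_alb resource.
resource "aws_alb_listener" "http" {
  load_balancer_arn = "${aws_alb.web.arn}"
  port              = 80
  protocol          = "HTTP"

  default_action {
    type             = "forward"
    target_group_arn = "${aws_alb_target_group.web.arn}"
  }
}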


Here's another more recent post:

https://blog.heapanalytics.com/terraform-gotchas/

They wanted to add a volume to an instance, and accomplished this by manually clicking buttons in the web UI and then explicitly synchronizing Terraform to the new state. They carefully didn't run terraform apply in the middle because it would have destroyed 10TB of data.

They wanted to change how some resources were configured in Terraform, so they wrote a parser for the Terraform state file, parsed it, modified it, and wrote it back out.

They have a lot of workarounds for keeping the barrel of terraform apply aimed roughly away from anything important.

They say "You can iterate on your config changes and verify them without ever risking a stray apply ruining your day -- or week!" unironically.

Wow:

Most outages are caused by human error and configuration changes, and applying Terraform changes is a terrifying mix of the two.

For example, with a tiny team, it's easy to be sure only one person is running Terraform at any given time. With a larger team, that becomes less of a guarantee and more of a hope. If two terraform apply runs were happening at the same time, the result could be a horrible non-deterministic mess.

Apparently a fix for this was introduced in March 2017, but they aren't using it yet.

I don't know how the author managed to remain so upbeat and positive while writing this.


This is old (July 2014) but the author appears to work at HashiCorp (and now be VP Eng):

https://serverfault.com/a/616945

How to use terraform.io to change the image of a stateful server without downtime or data loss?

Practically speaking, the best way to handle this problem is maintain your database servers differently, outside of Packer [or Terraform] (building the initial image, yes! But not necessarily upgrading them in the same way as the stateless web servers) or outsource managing the state to someone else. Notable options include Heroku Postgres or AWS RDS.


I guess I don't really understand what class of problems this is aimed at solving. Who has a large number of different types of totally stateless services which need to be configured in a complicated way?

It looks like one big answer is "Heroku customers deploying applications on Heroku", and this is the first "Use Case" on their "Use Cases" page:

https://www.terraform.io/intro/use-cases.html

If we think of Terraform as "Fancy Heroku App Installer" some of the design kind of makes sense?

Also we aren't the only ones to be surprised by the AWS NAT junk:

BUT MIND THE FUCKING TRAP. You do not attach these NAT gateways to your PRIVATE subnets, you attach them to the PUBLIC FUCKING SUBNETS, and then a route from the private subnet to that gateway. Gahhhhhh.
https://charity.wtf/2016/04/14/scrapbag-of-useful-terraform-tips/

In my personal experience, CloudFormation is vastly better than Ansible/Chef/insertflavourofthemonthtoolhere. Those other tools all require some specialised syntax, meanwhile CF is just JSON and you can use the CloudFormation designer to get the JSON for any AWS resource that you're not sure how to describe with JSON.

It's also very specific about what it's going to change and will flat out deny config updates if it would require data destruction (i.e. if you try and change a configuration value on an RDS instance that can't be changed after creation, CloudFormation will deny the update instead of unexpectedly destroying and recreating the RDS instance). The only way to apply the change in this scenario is to explicitly remove the RDS resource from the JSON (which indicates that you do actually want to destroy it in the stack), or to delete the CloudFormation stack which will delete all associated resources.

Whenever you apply a CloudFormation JSON file, it creates a changeset which describes all the changes it would make, and then you apply the changeset. This means that only the actions listed in the changeset will be taken, so you won't get things magically happening differently if two people try to apply a change at the same time (one will lose out because changes lock the stack while they apply, and can only either be allowed to continue to completion or rolled back to previous state).

Not sure if this information is helpful, but that's been my experience with infrastructure-as-code stuff.

@hach-que thanks for the suggestion. I've used CloudFormation with some success previously. My biggest problem is occasionally having a stack just get "stuck" while deploying, and taking forever to timeout on a failure. I've also seen very ambiguous error messages when some random resource fails to deploy, but these are all just anecdotes. I'm working on a demo CF config to give it a try.

One thing I should add is that your different application tiers should be different stacks: i.e. DNS config should be one stack, web boxes another stack, DB another etc.

That way if the web stack gets stuck or hosed, you can spin up a new copy of the web stack next to it (since it doesn't store any data), flick the DNS to point at the new one and tear the old one down as a recovery mechanism. If you don't do this and you put routing, stateless and stateful all in one stack then you're going to have a bad time if deployment gets stuck as you mentioned.

FWIW, I am actively working on a Puppet module for Phabricator, see https://forge.puppet.com/joshuaspence/phabricator.

Terraform Review
Pros:

  • Written in Python, which everyone knows is superior to Ruby in every way

Huh? Terraform is written in Golang, not Python.

Whoops; you're right, I mixed up my notes from Ansible. Correcting original comment...

Another correction: Puppet doesn't need a dedicated server. You can run Puppet in a standalone (agent) mode, if you are willing to somehow get your Puppet manifests onto each host that you wish to provision.

We use Almanac + Passphrase + Ansible + a dynamic inventory client for this.

Ansible can work with dynamic inventories (data from Almanac + Passphrase). With some rules in Herald, Harbormaster, and Drydock, you can even automatically build or update server config or software on commits.

PS:
I wrote a small dynamic inventory client (based on the Almanac/Passphrase Conduit API). It is not open source, but I can open source it if there is a need.

I wrote a small dynamic inventory client (based on the Almanac/Passphrase Conduit API). It is not open source, but I can open source it if there is a need.

I would be interested in what this is and how it fits together with the other pieces.

PS: This is off-topic, I guess, so feel free to reach out over the email in my profile.

For anybody still interested: the project was in PHP; I rewrote it in Go so it can be used elsewhere.

Here is the link:
https://github.com/uniwue-rz/a2a

Continued in T13542. I wrote a Terraform/CloudFormation-style service in PHP over the last couple of days.

It physically can't delete resources since I haven't written any code to support deleting things, which makes it a strict upgrade over Terraform.

// Resource "sources": one backed by the EC2 API, one backed by Conduit
// (Almanac).
$ec2_source = $this->getSource('aws.ec2');
$conduit_source = $this->getSource('conduit');

// Declare a small test volume as an EC2 resource.
$volume_ref = $ec2_source->newVolumeRef()
  ->setRefName('test-volume')
  ->setSize(1);

// Declare an Almanac device whose name is derived from the volume's EC2
// resource name, which isn't known until the volume actually exists.
$device_ref = $conduit_source->newAlmanacDeviceRef()
  ->setRefName('test-volume.device')
  ->setDeviceName(
    $this->newString(
      '%s.phacility.net',
      $volume_ref->bindEC2ResourceName()));