
Perhaps off-topic but how have people upgraded TF codebases to new versions? Just last year we had a big effort to upgrade a huge code-base from 0.11 to 0.12. I feel like it should be a lot smoother than a full-team full-sprint effort.


I'm one of the HashiCorp founders.

Terraform 0.11 to 0.12 is by far the most difficult upgrade between versions. I am really sorry about that. The other upgrades should be relatively minor as long as you read and follow the upgrade guides and upgrade one minor version at a time (0.11 => 0.12 => 0.13, etc.). There are rough edges for very specific cases, but most of our customers were able to upgrade from 0.12 to subsequent versions same-day without issue.

Breaking changes and difficult upgrades are not something we want for Terraform (0.12 being a big exception, as that was a very core "reset" so to speak). The reason there have been these changes in recent releases is that we've been getting Terraform to a place where, come 1.0, we won't have to have difficult upgrades.

You can see this path happening in each release:

- Terraform 0.15: state file format stability

- Terraform 0.14: provider dependency lock file

- Terraform 0.13: required provider block (for more deterministic providers)

- Terraform 0.12: stable JSON formats for plans/config parsing/etc. (particularly useful for interop with things like Terraform Cloud and 3rd party tooling)

This was all done to lead up to a stable 1.0 release.
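To make the 0.13 item concrete, a required_providers block looks roughly like this (the provider name and version constraint are illustrative, not a recommendation):

```
terraform {
  required_version = ">= 0.13"

  required_providers {
    aws = {
      # An explicit source address makes provider resolution deterministic.
      source  = "hashicorp/aws"
      version = "~> 3.0"
    }
  }
}
```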

As noted in the blog post, 0.15 is effectively a 1.0 pre-release so if everything goes well, we've MADE IT. For 1.0, we'll be outlining a number of very detailed compatibility promises which will make upgrades much easier going forward. :)


Our teams have something like 100,000 LOC in Terraform 0.12, and it's not all in one big monorepo. At that scale there is no such thing as a relatively minor version upgrade.

We want to upgrade to get away from some persistent 0.12 bugs, but we literally don't have the time. We have to change all of the code, and then test every single project that uses that code in non-prod, and pray that the testing finds most of the problems that will appear in production. And it's all owned by different groups and used in different projects, so that makes things longer/more complex. We also have to deal with provider version changes, upgrading CI pipelines and environments to be able to switch between Terraform binaries, and conventions to switch between code branches.

I am already looking around for some way to remove Terraform from our org because it is slowly strangling our productivity. It's way too slow, there's too many footguns, it doesn't reliably predict changes, it breaks on apply like half of the time, and it's an arduous manual process to fix and clean up its broken state when it does eventually break. Not to mention just writing and testing the stuff takes forever, and the very obvious missing features like auto-generation and auto-import. I keep a channel just to rant about it. After Jenkins, Ansible and Puppet, it's one of those tools I dread but can't get away from.


You can use tfenv to upgrade individual workspaces one at a time. You don't need to do a big bang upgrade.

Note that upgrading to 0.13 is quite easy; terraform actually has a subcommand that does most of the work for you (usually no additional steps required).

> I am already looking around for some way to remove Terraform from our org because it is slowly strangling our productivity.

The only real alternative you have is Pulumi. All other alternatives are, in my opinion, way worse. You can use Ansible, which is even worse because you have to manage Ansible version upgrades and have no way of figuring out what changes will be made (yes, --diff is usually useless). You can manage things manually, but good luck. Lastly, your option is CFN (or the Azure/GCP equivalent), but then you have no way of managing anything outside of that cloud environment.


There is no solution where 100k LOC is not going to be challenging to maintain over time.


While it's not possible to make an apples-to-apples comparison (Terraform-to-?), if we compare to something based on an imperative language, say Puppet or Chef, there is a huge difference.

In my opinion, Terraform's big issue is that it was born as a declarative tool for managing infrastructure. Large configurations (IMO) necessarily ossify, because you don't have an imperative language that makes small progressive changes tolerable - it's a giant interrelated lump.

What's worse, when it grows, one needs to split it into different configurations, and one loses referential safety (resources will need to be linked dynamically).

A Chef project of equivalent size, say, can be changed with more ease, even if it's in a way even less safe, because you have the flexibility of a programming language (of course, configuration management frameworks like that have a different set of problems).

I'm really puzzled by the design choice of a declarative language. Having experience with configuration management, it's obvious to me that a declarative language is insufficient and destined to implode (and make projects implode). Look at the iterative constructs, for example, or the fact that some entities like modules have taken a long time to become first-class citizens (for example, we're stuck with old-style modules that are hard to migrate).


> compare to something based on an imperative language, say Puppet or Chef

I'm puzzled by this comparison. I consider both of these to be primarily declarative languages. You declare the state you want puppet or chef to enforce, not how they get there.

E.G. https://puppet.com/blog/puppets-declarative-language-modelin...


I've indeed stretched the concept by equating Chef and Puppet (I guess the latter is closer to TF).

To be more accurate, I'd say that Chef has a declarative structure supported by the imperative constructs of the underlying language, and this is what makes for me a big difference.

Consider the for loop as example. By the time it was added (v0.12), there was a (200 pages) commercial book available. And there are people in this discussion stuck at v0.11.

The difference in the declarative vs. imperative nature, as I see it now that the for loop is implemented in TF, is that it's embedded inside resources, that is, it fits strictly a declarative approach, and has limits. In Chef, you can place a for loop wherever you prefer.

Object instances are another significant difference. It took a while for TF to be able to move (if I remember correctly) module instances around (that is, to promote them to "more" first class citizens), which made a big difference. In an imperative language, accessing/moving instances around is a core part of the language. In Chef, pretty much everything is global - both in the good and in the bad. But certainly the good part is that refactoring is way more flexible.

I think TF has always been plagued by repetition; in my view, this is inherent in the "more" declarative approach (since they're trying to embed imperative constructs in the language).


I have really bad memories of the change between puppet 2 and 3 for example.


Same. I went through a puppet 2->3 migration and also through a terraform 0.11->0.12 update.

The puppet migration was definitely more painful, because of the entangled code.


It's not clear if it was entangled because it was written in that specific framework or because it was just badly written code. In the latter case, this hasn't really anything to do with the framework. Additionally: did you make the v0.12 migration just work, or did you change the codebase to take advantage of the new features (and remove inherent duplication)?

There are inherent problems in the TF framework and its migrations. 0.12 introduced for loops, and 0.13 added modules support to them. So a proper migration should deduplicate resources into lists of resources. This is painful for big models, since one needs to write scripts in order to convert resource associations in the statefile. And hope not to miss anything!
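For what it's worth, those statefile-conversion scripts often boil down to generating `terraform state mv` commands that move each duplicated resource into an indexed address. A minimal Python sketch, where the resource names and the `name<N>` numbering convention are invented for illustration (always work against a state backup):

```python
import re

def mv_commands(addresses, base_name):
    """Map e.g. aws_instance.web1 -> aws_instance.web[0] and emit a
    `terraform state mv` command for each match, in discovery order."""
    cmds = []
    index = 0
    for addr in addresses:
        rtype, name = addr.split(".", 1)
        # Only touch resources following the duplicated-name convention.
        if re.fullmatch(re.escape(base_name) + r"\d+", name):
            cmds.append(
                f"terraform state mv '{addr}' '{rtype}.{base_name}[{index}]'"
            )
            index += 1
    return cmds

if __name__ == "__main__":
    # In practice this list would come from `terraform state list`.
    state = ["aws_instance.web1", "aws_instance.web2", "aws_instance.db"]
    for cmd in mv_commands(state, "web"):
        print(cmd)
```

The generated commands would then be reviewed and run by hand against the workspace, rather than executed blindly.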

Due to the strictly declarative nature, it's also difficult to slowly move duplicated resources into lists, and handle both of them at the same time.

At this time, our team is stuck with a certain TF version, and can't move without spending considerable resources.


Yeah, same boat. We ended up doing several complete rewrites and finally gave up. My main grievance is HCL; it's so close to, yet so far from, an actual programming language that it drives me mad, even after a few kilolines of it in prod. We ended up going with Pulumi, which so far has served us well.


It seems that Terraform CDK has been introduced to compete directly with Pulumi.

I think both are a great idea as the DSL has given me so many headaches over the years.


Tangential, but I'm curious how you got to 100k lines of TF? I'd imagine most things within your company would follow very similar patterns and therefore be extracted into modules, and the per app/team code would be relatively small and focused on how to compose these modules together.


Modules are useful only up to a point. Creating complex modules with a ton of moving parts makes it difficult to make changes, to upgrade, etc. The best recipe that I've found is to use modules to encapsulate some core functional component and then compose these modules to build infrastructure, rather than defining your entire stack in a single module.
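A hedged sketch of that composition style, with the module sources, names, and outputs all invented for illustration:

```
# Small single-purpose modules composed at the root,
# instead of one mega-module that owns the whole stack.
module "network" {
  source     = "./modules/network"
  cidr_block = "10.0.0.0/16"
}

module "service" {
  source    = "./modules/service"
  subnet_id = module.network.private_subnet_id
}
```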


We also found tf 0.12 to be quite slow. But this was fixed in 0.13, and now it feels lightning fast compared to before.


> and it's not all in one big monorepo

There's your first problem.


Thanks, good to know that the upgrade to 12 is the biggest jump.


I had the same question or concern. I also realized too late that 0.12 is a bigger jump than first thought. I was not severely impacted in the end, but boy, it had been a long time since I'd experienced such a tough upgrade. Happy to know that the hardest is behind us and looking forward to trying 0.15. Thanks


I upgraded from 11 to 12 about a year ago and from 12 to 13 some days ago (the upgrade to 14 looks like it will be straightforward). In my case, this is what I did:

  - Don't upgrade directly from 12 to 14, go to 13 first
  - If you have warnings after moving from 11, fix them first
  - Run the 0.13upgrade command in your code that will generate the required_providers
  - Run terraform-v13 init
  - Change to correct workspace if using some
  - Run terraform-v13 plan, which will probably fail due to the new explicit required-providers rule; if that happens, you need to modify the state with the correct providers: https://www.terraform.io/upgrade-guides/0-13.html#why-do-i-see-provider-during-init- . In my case I have a lot of modules, so I created a script that automated that process
  - Execute again terraform-v13 plan and verify that it will not make uncommon changes
  - Then run terraform-v13 apply.


> Don't upgrade directly from 12 to 14, go to 13 first

The line above, plus running apply on each version, is key. I literally just did the update from 11 -> latest for 3 different repos a couple of weeks ago. And tbh, it was only the first update where I had to make any code changes. The rest mostly worked.


Someone not on our team upgraded the version by mistake from 0.12 to 0.13 (he was contributing something small and used the latest); the CTO got involved and made us update everything, and it was a big undertaking.

Personally I have a nix shell file pinned to the exact version of terraform (as in, commit hash on the nix-packages repo) we use in every repo and just switch to that shell before doing anything.


We had this issue as well, where another team was on 0.12 and we were still on 11, and I think even running a plan could ruin the state.

We now have the tf version pinned in all of our systems:

```
terraform {
  required_version = "= 0.12.29"
  ...
}
```

As others said, tfenv lets you easily switch between versions, but you don't get a warning if you're accidentally on 0.12 while the repo is currently using something else.



I'm a fan of tfenv for this; it's really easy to use and makes it trivial to pin each stack to an exact version of TF.
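One detail worth knowing (from tfenv's docs rather than this thread): tfenv reads a `.terraform-version` file in the working directory and switches automatically, so pinning a stack is just a one-line file; the version shown is only an example:

```
0.12.29
```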


Between rbenv, tfenv, pyenv, sdkman and so on and so forth, maybe it's time for some sort of common OS-level env-management interface...?



I use asdf and it's great. You keep your versions in .tool-versions, and when you switch branches you automatically get the right versions of node, java, terraform etc. on your path.
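For reference, the `.tool-versions` format is one tool name and version per line; the versions below are made up for illustration:

```
terraform 0.14.11
nodejs 14.17.0
```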


Using this. It's fantastic.


You can whip up something pretty quickly as a shell wrapper and add to it as you need. I just threw this together: https://gist.github.com/peterwwillis/755002d6d3849af5bbc6cb8...

  $ cliv
  
  Usage: /home/vagrant/bin/cliv [OPTS]
         /home/vagrant/bin/cliv [-i] VERSION [CMD [ARGS ..]]
  Opts:
          -l              List versions
          -h              This screen
          -n              Create a new /home/vagrant/.cliv/VERSION
          -i              Clear current environment
  
  $ cliv -n tf012
  $ cliv -n tf013
  $ cliv -l
  tf012
  tf013
  $ cp terraform-v12 ~/.cliv/tf012/bin/terraform
  $ cp terraform-v13 ~/.cliv/tf013/bin/terraform
  $ cliv tf012 terraform
  Terraform v0.12.29
  $ cliv tf013 terraform
  Terraform v0.13.3
  $ cliv -i tf012
  PATH=/home/vagrant/.cliv/tf012/bin:/home/vagrant/bin:/usr/local/sbin:/usr/local/bin:
  /usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/snap/bin
  PWD=/home/vagrant


I had to put a newline in your PATH because the unbroken string was borking the page layout. Sorry, it's our bug. Someday someone will show me how to fix it. Some people have come close.


That is a selling point of nix and docker, yes :) The catch is that doing it generically makes the whole thing more complicated (although I suspect nix and docker are more complex than is strictly required for that use case).


Our team just made a big effort to get from 11 to 12. After that, the effort was quite minimal. Some little gotchas re: providers in 13, but we just today finished the 14 upgrade and will probably let 15 marinate before we upgrade to it.


I think we're dealing with the providers now, as we get warnings on 0.12.29 that they'll be deprecated in 13, so it's nice to know that we're over the biggest hill.


It becomes much easier between other versions. Though upgrading 0.12 to 0.13, I remember I had to pull the state and change the provider field manually to avoid recreation of some resources.


> i had to pull the state and change the provider field manually to avoid recreation of some resources.

Terraform CLI introduced an upgrade command (can't remember what it's called) that automatically does this for you.


In 0.14 that is no longer available.


Terraform 0.14 is pretty much fully compatible with 0.13, so no such command is necessary. All you have to do is make sure your state is for version 0.13. Outside that it's a bunch of usability changes that do not affect your tf code.


On the bright side, now that they've committed to a stable state file format it should get better


With lots of swearing, at least in my experience when we did the same upgrade as you. I too am hoping this will be smoother in future and the changes stabilise. I guess if you're doing very simple stuff then it's not an issue but the larger the estate and 'custom' stuff that gets added the more difficult any upgrade becomes (tf or not)


From our experience. 0.11 -> 0.12: We're not attempting. We're in the process of changing out config management anyway and we're unhappy with a bunch of decisions in that terraform stack, so it's a good point to get rid of it.

0.12 -> 0.13: Well, we had to add what feels like a million required-provider blocks and then some more. And sometimes it was tricky to pinpoint the module pulling in a default provider and crashing - `terraform providers` and `terraform graph` help there. The graph is easy to grep through to find the resources and modules pulling in wrong providers. And in the beginning, the error message we got when we had to run `terraform state replace-provider` was .. obscure. In newer versions, that message is much better.

0.13 -> 0.14 just happened, and now the lock files are slowly piling up on demand.


As no one seems to have mentioned it: version numbers below 1.0 are generally considered unstable in terms of API interface (like HCL in this case), so if you'd like to avoid similar churn in the future (with either Terraform or other tools), you're best off waiting until they at least release version 1.0.


I started using Terraform on our project in early 2019, at version 0.11.13. The upgrade to 0.12.x seemed non-trivial, so I put it off... now, 2 years later, we're at 0.15.x.

Looks like I need to clear my schedule in an upcoming sprint to get this done so the pain doesn't get even worse :)


As everyone else said: 0.11 to 0.12 was painful. After that, you just need to deal with converting the state whenever a new version comes out, but that's basically automated so not a biggie.


The terraform 0.1Xupgrade subcommands went pretty smoothly most of the time, with some minor manual changes and fixes here and there. Most of them were easily done in batch over the repository using sed.

This approach worked for me in various setups from a small SaaS company to a major travel company as well as personal projects without issues.


After 0.12 it has been much easier.



