PostgreSQL with Node.js: Migrations and schema dumps

Published on August 05, 2017 under Blog

While working on the Fizzyo project for one of my summer internships, I noticed the lack of articles on database migrations and database integration testing for Node.js projects. Don't get me wrong, there are plenty of resources talking about each part separately, but you still have to do the job of combining all of the pieces of the puzzle yourself and coming up with something that works.

In this article I'll talk about how I tackled PostgreSQL migrations and database schema dumps, pointing out useful tools and techniques along the way. This is less of a tutorial and more of a description of one possible approach that worked for me. In a follow-up article, I'll cover the actual integration testing.

Quick word about Fizzyo backend

I was working on the backend architecture: developing a RESTful API running on Node, defining several PostgreSQL databases and writing a bunch of scripts for data processing and validation, all in JavaScript. Microsoft was a huge stakeholder in this project, so everything was running on Azure, some parts on Linux servers and others on Azure SaaS services.

The whole Fizzyo system had not reached the production stage yet, but there were already other developers using the API for their prototypes, so I wanted to provide a more or less stable service even during the initial development phase. I wrote unit tests for all of the basic features using Mocha and Chai. This is a fairly standard procedure, so I won't talk about unit testing in this article.

My DB setup

I often work on the go and tend to use different machines at home and at work, so I decided to keep all databases online. At the time I was also the only backend developer, so I didn't have to worry about anyone else making changes to the development database. If you're working in a team, it would make more sense to host your testing and development databases locally. In my app, one could switch to a local database by simply changing the database connection string in the JSON config. If you're going to take the same approach, here's a friendly reminder to never push any sensitive data to your Git repo.
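To make this concrete, here is a minimal sketch of how such a config lookup could work. The file name, config layout and helper name below are illustrative assumptions, not the actual Fizzyo code:

```js
// db-config.js - a tiny helper for picking a connection string per database type.
// NOTE: the file name, config layout and key names are illustrative assumptions.
const fs = require('fs');

// db-config.json is kept out of Git (add it to .gitignore) and looks something like:
// {
//   "test": "postgres://user:password@host:5432/fizzyo_test",
//   "dev":  "postgres://user:password@host:5432/fizzyo_dev",
//   "prod": "postgres://user:password@host:5432/fizzyo_prod"
// }
const config = JSON.parse(fs.readFileSync(`${__dirname}/db-config.json`, 'utf8'));

function getConnectionString(dbType) {
    const connectionString = config[dbType];
    if (!connectionString) {
        throw new Error(`Unknown database type: ${dbType}`);
    }
    return connectionString;
}

module.exports = {getConnectionString};
```

Switching to a local database then comes down to editing one value in the JSON, and since the config file itself stays out of Git, credentials never end up in the repository.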

I ended up using 3 PostgreSQL databases in total, each with its own purpose:

  1. Testing database. Automatically dropped and rebuilt from schema before every integration testing sequence. After the database is rebuilt, the bare minimum of "dummy" data is inserted to make testing a bit easier. This database will be the main focus of this article; I'll elaborate on it below.
  2. Development database. Used for running the local instance of the application (think npm start on localhost). Unlike the testing DB, this one retains all of its data unless it's dropped manually. As you can imagine, running an instance of the HTTP API locally is quite boring and pretty much pointless if most of its features are covered by unit tests. In the end, I hardly ever used the development database.
  3. Production database. This is the holy grail of Fizzyo databases. It contains actual, real data submitted by various components of the Fizzyo system and is used by the production instance of the API. Needless to say, the data stored here is very important and resetting the database is not an option. At the same time, its table structure still has to stay in sync with the development database whenever a new API update rolls out, which is where DB migrations come in.

Alternatively, I could have set up an embedded database every time I ran integration tests, i.e. a database that would be created programmatically before tests are run and deleted without a trace once testing is complete. This is a perfectly feasible approach, but I was still wrapping my head around PostgreSQL and pg-promise at the time, and being able to browse the database after some tests had failed was very helpful.

Keeping databases in sync using migrations

Database migrations are essential when it comes to versioning your database. While some developers (especially those who work in teams) can't imagine life without them, people who work alone might not even be familiar with the concept. If you fall into the latter category, I suggest you google around to learn more before proceeding with this article.

Migration tools

There are tons of different migration tools for all sorts of databases. I've tried around five PostgreSQL migration tools from the npm registry alone. Which tool you end up using heavily depends on your preferences and use case, but my choice fell on pg-migrator. It's a dead simple tool that lets you write your migrations in SQL (unlike some other tools that force you to create a JS file for every migration). To perform a migration, you need to specify a connection string for the Postgres database you're about to migrate, which was perfect for my use case.

I wrote a simple JS script that takes the type of database as an input parameter (e.g. prod, dev or test), pulls the relevant connection string from the config and plugs it into pg-migrator. Like most migration tools, pg-migrator creates a version table in your database (unless one already exists) and uses it to store an integer representing the ID of the most recent migration that was run. This way, I could migrate my production database with a single command when needed, e.g. when rolling out a new version of the API to the production server.
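Here's a rough sketch of what that wrapper script could look like. It reuses the hypothetical getConnectionString() helper from earlier, and the only thing it assumes about pg-migrator is that its CLI takes the connection string of the database to migrate:

```js
// migrate-db.js - a sketch of a wrapper around the pg-migrator CLI.
// Usage: node migrate-db.js <prod|dev|test>
const {spawnSync} = require('child_process');
const {getConnectionString} = require('./db-config');

const dbType = process.argv[2];
if (!dbType) {
    console.error('Usage: node migrate-db.js <prod|dev|test>');
    process.exit(1);
}

// pg-migrator only needs the connection string of the database to migrate;
// it keeps track of the current version in a table inside the database itself.
// It has to be installed globally, or locally and invoked through an npm script
// (npm puts node_modules/.bin on the PATH for scripts).
const result = spawnSync('pg-migrator', [getConnectionString(dbType)], {
    stdio: 'inherit', // stream pg-migrator's output straight to the console
    shell: true,      // let the OS resolve the pg-migrator executable
});

process.exit(result.status === null ? 1 : result.status);
```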

Writing migrations

Since I'm using pg-migrator, when I say "migration" I mean a .sql file that can be used to transform the database from one version to another. How the tool of your choice defines "migration" may differ, but the basic concept is the same.

I would strongly suggest storing all of your migrations in your Git repository. If you have an npm project set up (which you should, if you're reading this article), you can simply define a new script in your package.json that handles migrations. Imagine the blissful expression on the face of your coworker or teammate when they realise they can simply run npm run migrate-db test and the database schema will be updated to the most recent version, without them ever needing to know what's going on behind the scenes.
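For reference, the relevant part of package.json could look something like this (the script name and file path are made up for this example):

```json
{
  "scripts": {
    "migrate-db": "node scripts/migrate-db.js"
  }
}
```

Running npm run migrate-db test then forwards the database type to the wrapper script (on some npm versions you may need npm run migrate-db -- test).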

I'm not too good with SQL and I certainly cannot recall all PostgreSQL functions from memory, so I relied on the DataGrip IDE to do most of the heavy lifting for me. (In fact, if you're using any of the recent JetBrains IDEs, you should have access to the built-in database manager.) DataGrip provides a ton of useful features: if you connect the IDE to your database, you get auto-completion and validation for table and column names, and you can run queries against your database directly from the IDE. This can be very useful when you're writing non-trivial queries.

The feature I found most useful for writing migrations is that DataGrip prints all of the queries it's about to execute. When you use the graphical interface to, say, create or edit a table, you can see exactly which SQL commands will be run against the database. The trick is this: when you define a new table/column/whatever, use the interface to make the changes and then copy the SQL script DataGrip generates without actually executing it. This gives you a very convenient way to create SQL scripts for migrations, but remember to never change the database manually - that defeats the whole purpose of migrations. In the screenshot below, I made some arbitrary changes to a table to demonstrate the kind of SQL DataGrip generates.

Editing a PostgreSQL table in DataGrip IDE

Maintaining a DB schema

Having database migration scripts ready is great and all, but sometimes you have to create a database from scratch. Technically, you could have migrations set up to take an empty database all the way to your current schema, but that would take forever and introduce unnecessary complexity.

In my system, I created a script based on Postgres' pg_dump to generate a schema dump of the database. The process of creating a reusable, up-to-date schema is very simple (there's a sketch of the dump script after the list below):

  1. I run all of the available migrations on some database, e.g. dev, to make sure its structure is up to date.
  2. I run my DB dump script to dump the database schema (without any data) into an SQL file, called schema.sql from here onwards.
  3. I note down the database version (i.e. the ID of the most recent migration that ran) and create a separate SQL file called dummy-data.sql that inserts this version into a newly created database. This step is very important: without it, your DB migration tool would not know which version the database is on and could end up running some migrations multiple times.
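The dump script itself can be a thin wrapper around pg_dump. Here's a minimal sketch under the same assumptions as before (the hypothetical db-config helper and the file names are mine); the --schema-only flag is what tells pg_dump to skip the data:

```js
// dump-schema.js - a sketch of dumping a database's schema (no data) with pg_dump.
// Usage: node dump-schema.js <prod|dev|test>
const {spawnSync} = require('child_process');
const {getConnectionString} = require('./db-config');

const dbType = process.argv[2] || 'dev';

// --schema-only skips the data, --no-owner keeps the dump portable between machines.
// pg_dump accepts a connection URI in place of a database name.
const result = spawnSync('pg_dump', [
    '--schema-only',
    '--no-owner',
    '--file=schema.sql',
    getConnectionString(dbType),
], {stdio: 'inherit'});

process.exit(result.status === null ? 1 : result.status);
```

For step 3, dummy-data.sql just needs one extra INSERT that records the current version in whatever table your migration tool uses for bookkeeping; check the tool's documentation (or simply look at an already-migrated database) for the exact table and column names.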

Now, imagine someone else has to set up an instance of your application locally. Given they have a PostgreSQL server running, all they need to do is install your npm dependencies, run schema.sql to create the database structure, run dummy-data.sql to insert some basic data into that brand-new database, and they're ready to use your app!

The part about running schema.sql and dummy-data.sql can be automated further by adding an npm script that takes a database connection string as input and runs these files. If your application relies on some data already being present in the database on startup, you can add it to dummy-data.sql. The only thing I'd suggest here is to keep the schema and the dummy data in separate SQL files, as that makes them easier to manage. Just to clarify - schema.sql and dummy-data.sql are meant to go into your Git repository so that other developers can use them too.
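Such a bootstrap script could look something like the sketch below. It assumes the psql command-line client is installed and reuses the file names from above:

```js
// setup-db.js - a sketch of bootstrapping a fresh database from schema.sql and dummy-data.sql.
// Usage: node setup-db.js <connection-string>
// Assumes the psql command-line client is installed and on the PATH.
const {spawnSync} = require('child_process');

const connectionString = process.argv[2];
if (!connectionString) {
    console.error('Usage: node setup-db.js <connection-string>');
    process.exit(1);
}

// psql accepts a connection URI in place of a database name; -f runs a file,
// and ON_ERROR_STOP=1 aborts on the first error instead of ploughing on.
for (const file of ['schema.sql', 'dummy-data.sql']) {
    const result = spawnSync('psql', [
        '--set', 'ON_ERROR_STOP=1',
        '-f', file,
        connectionString,
    ], {stdio: 'inherit'});

    if (result.status !== 0) {
        console.error(`Failed while running ${file}`);
        process.exit(result.status === null ? 1 : result.status);
    }
}
```

With something like this in place, the whole local setup boils down to npm install followed by a single command that points the script at a fresh database.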

Conclusion

This article didn't describe anything revolutionary, but it provides an introduction to the DB migration and schema dump flow in the Fizzyo system, which will become important in the Node.js database integration testing article I will post shortly.



Timur Kuzhagaliyev (Author)

I'm a computer science graduate from UCL & Caltech, working as a systems engineer at Jump Trading. Before Jump, I was doing computer vision for the Video Quality Analysis team at Amazon.