You don’t understand your code: seeding/benchmarking methodology

Niole Nelson · Published in Niole Net · Nov 13, 2021

In my opinion, there should be certain prerequisites to writing app-dev code. You should know all inputs to your code.

By all inputs, I mean all. We love to debate different solutions. We discuss the pros and cons until we are blue in the face. But when it comes down to how things will actually go down in front of a user, nobody really knows unless they’ve tried it out.

You don’t really know how your code works. You think you know, but you don’t.

You should know what’s in your database, how your users will behave, and what resources are available on your deployment. If you don’t know these things, you need to find out.

This is a major gap in app-dev culture. This gap exists because it’s hard to “do it all”. But you really shouldn’t be developing a feature if you don’t know how it works. And if you haven’t at least benchmarked your best-case and worst-case scenarios, you definitely don’t know how it works.

First, we undergo a phase of exploration: what’s in our database, and what do our users do? Then we move on to our seeding and benchmarking methodology.

What’s in your database?

You should know how to query your database and you should be really really good at doing so.

You should be able to answer questions about what’s in what collection. You should be able to answer questions about how different collections relate to one another. You should be able to issue database queries in order to get all the data necessary for the feature that you are building.

Finally, you should be able to manufacture this data because this is an important part of how your feature works.
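
As a rough sketch of what this kind of exploration can look like (assuming a mongodb deployment at a local URI, with hypothetical users and orders collections):

```bash
# Count what's in every collection (the URI and database name are assumptions).
mongosh "mongodb://localhost:27017/myapp" --quiet --eval '
  db.getCollectionNames().forEach(function (name) {
    print(name + ": " + db.getCollection(name).countDocuments());
  });
'

# Follow a relationship between two hypothetical collections:
# grab one user, then find an order that points back at it.
mongosh "mongodb://localhost:27017/myapp" --quiet --eval '
  const user = db.users.findOne();
  printjson(user);
  printjson(db.orders.findOne({ userId: user._id }));
'
```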

What do your users do?

You should know what your users do most often and least often.

You should be able to manufacture the interactions that they do. You should at least be able to issue curl commands that mimic what they do. You should eventually be able to use a load testing tool to mimic what they do.
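
For instance, a first pass at mimicking users can be as small as a couple of curl calls (the base URL, endpoints, and parameters here are all assumptions about your app):

```bash
BASE="http://localhost:3000"   # hypothetical app URL

# The most common interaction: a user lists their items.
curl -s "$BASE/api/items?userId=user-123" | jq 'length'

# A rarer interaction: a user opens one item with everything attached to it.
curl -s "$BASE/api/items/item-456?expand=comments,attachments" | jq '.'
```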

Keep it simple

First of all, we must face the facts: we’ve never done this before and we don’t know what we’re doing.

Let’s not make this fancy and complicated. Remember that this is another reason why we’re here: we’re breaking the app-dev pattern of premature optimization. We are accepting that we don’t know how our code works. We are reformulating our approach to come from first principles.

We are in a process of discovery: 1. we are discovering how our code works, and 2. we are discovering how to seed and benchmark.

This leads us into our seeding and benchmarking methodology.

Seeding Methodology

It’s easy to overcomplicate this. Your brain is probably already deadlocked on how to seed the scenarios that you want, while also making your code extensible, reusable, and all the other things that good code is supposed to be.

Here are some things that your seeding code absolutely must do:

it must be able to remove data

If you mess up your seeding, you need to be able to easily remove that data without giving yourself carpal tunnel. This will also make it easier to reseed with a different configuration.

data from a seeding session must be identifiable

Often you will want to differentiate between groups of data for different seeding scenarios. You will want to know the size of the data, what it looks like, and you will want to remove only it without disturbing other data in your deployment.
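
A minimal sketch of both requirements, assuming mongodb and a hypothetical users collection: every seeded document carries a seedTag, so a whole seeding session can be summarized or deleted without touching anything else.

```bash
SEED_TAG="seed-2021-11-13-bigload"   # one tag per seeding session

# How big is this seeding session, and what does the data look like?
mongosh "mongodb://localhost:27017/myapp" --quiet --eval "
  print(db.users.countDocuments({ seedTag: '$SEED_TAG' }));
  printjson(db.users.findOne({ seedTag: '$SEED_TAG' }));
"

# Remove only this session's data, leaving the rest of the deployment alone.
mongosh "mongodb://localhost:27017/myapp" --quiet --eval "
  db.users.deleteMany({ seedTag: '$SEED_TAG' });
"
```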

this code must be performant

This is another place where a lot of people crash and burn when writing seeding code. At some point, you will end up waiting an hour for your seeding code to complete and you will have to rewrite everything.

  • Write code that lets big arrays of stuff get garbage collected ASAP. Don’t underestimate the power of garbage collection. Never let multiple arrays get created in the same function scope.
  • Never do many IOs in a row. Always do bulk database queries for creating/deleting data. Send your data in batches to your database.
  • Bring in parallel processing.
  • If you want to go REALLY big with your seeding, consider writing your data to a file and then uploading those files into your database. It is a LOT faster to upload millions of objects this way instead of having a database client chunk through that giant array inside of your seeding program. Always write these files to the /tmp directory so that if you fill up your disk and your machine crashes, the data will be gone on startup. If you are using mongodb, for example, you would use mongoimport to do this (see the sketch below).
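
Here is a minimal sketch of that file-based approach, assuming mongodb and mongoimport; the collection, fields, and counts are made up for illustration:

```bash
SEED_TAG="seed-bigload"
OUT="/tmp/${SEED_TAG}-users.jsonl"

# Generate a million small documents as newline-delimited JSON.
# Streaming to a file keeps the giant array out of the seeding program's memory,
# and /tmp means a crashed machine comes back up with that disk space reclaimed.
seq 1 1000000 | awk -v tag="$SEED_TAG" \
  '{ printf "{\"name\":\"user-%s\",\"seedTag\":\"%s\"}\n", $1, tag }' > "$OUT"

# Bulk-load the whole file in one go instead of issuing millions of inserts.
mongoimport --uri "mongodb://localhost:27017/myapp" \
  --collection users --file "$OUT"
```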

this code must be simple and easy to understand

You must understand what’s on your deployment; otherwise, your benchmarking results will be meaningless. Writing data summary functions will also be extremely helpful.

keep it SIMPLE

It can be helpful to think through the general things that your application does and then write very simple seeding code according to that.

For example, all web applications have API requests that get many things and others that get only one thing. With that in mind, you can think of the good and bad performance scenarios that these kinds of querying patterns create.

Get one

In the “get one” element scenario, bad things happen when that one element is linked to many other elements and they are all retrieved in order to create the API response.

Get many

In the “get many” elements scenario, bad things happen when there are a lot of those elements, and performance gets even worse when many things are associated with each of those elements.

With these two ideas in mind, you don’t even have to read through the entire codebase in order to figure out how to design your seeding code in order to tease out performance problems.
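
As a sketch, seeding for both worst cases at once can be as simple as skewing the data you generate (the projects/tasks names here are hypothetical):

```bash
SEED_TAG="seed-worstcase"
OUT="/tmp/${SEED_TAG}-tasks.jsonl"

# "Get many" worst case: a large number of tasks spread across many projects.
seq 1 50000 | awk -v tag="$SEED_TAG" \
  '{ printf "{\"projectId\":\"project-%s\",\"seedTag\":\"%s\"}\n", $1, tag }' > "$OUT"

# "Get one" worst case: a single project linked to a huge number of tasks.
seq 1 100000 | awk -v tag="$SEED_TAG" \
  '{ printf "{\"projectId\":\"project-monster\",\"seedTag\":\"%s\"}\n", tag }' >> "$OUT"

# Load it the same way as before.
mongoimport --uri "mongodb://localhost:27017/myapp" --collection tasks --file "$OUT"
```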

Why not just seed from an existing database?

There definitely is a time and place for seeding with an existing database. Seeding from an existing database may let you reproduce scenarios that you missed with your manufactured data.

It’s still helpful to do the seeding yourself, because then you will fully understand what’s “in there”.

Tooling

It’s easy to overcomplicate this as well. The prevailing principle is to keep this simple for yourself. Stick to a tool that you already understand or that’s in a language that you can already write. Pick something that will let you iterate and experiment easily.

Keep an open mind to trying out something different if your first choice isn’t working out. Every time I think to myself, “gee, this is kind of hard” or “oh no, not this again”, that takes a point away from the approach or tooling that I’m using. Eventually, it’s time to try something different. The tooling has to be easy for me to use.

Over time, I have found that working from the command line is easiest for me. Having tools that work well together to seed some data, verify what it looks like, run a test, verify the output, and then upload the results, all without leaving the command line, has been the most helpful for me.

Tools for data munging and analysis

less, cat, and grep, plus a good knowledge of regex, are number one because they’re on every Linux machine. For fancier data processing, jq is very helpful.
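
A few hypothetical one-liners of that kind, assuming the load test wrote its results to a JSON file in /tmp:

```bash
# How many requests failed? (the field name and file path are assumptions)
grep -c '"status": *500' /tmp/benchmark-results.json

# Average latency, assuming the file holds a JSON array of { "latencyMs": ... }.
jq '[.[].latencyMs] | add / length' /tmp/benchmark-results.json

# Page through a huge seed file without loading the whole thing.
less /tmp/seed-bigload-users.jsonl
```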

Tools for running stuff remotely

Use scp to copy stuff onto a remote machine so that you can run your seeding and testing without interference from your home network, VPNs, etc., and then ssh in to actually run the thing.

If kubectl or docker is an option, that makes life even easier: submit your seeding/benchmarking code as a deployment into your k8s cluster, or launch it in a docker container.
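
A sketch of both routes; the host name, paths, and manifest are hypothetical:

```bash
# Copy the seeding/benchmarking scripts to a remote box and run them there.
scp -r ./seed-and-bench/ me@bench-box:~/seed-and-bench/
ssh me@bench-box 'cd ~/seed-and-bench && ./seed.sh && ./benchmark.sh'

# Or run it inside the cluster, right next to the database.
kubectl apply -f seed-and-bench-deployment.yaml
kubectl logs -f deployment/seed-and-bench
```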

Tools for seeding

You can get yourself up and running pretty quickly with bash and curl, since they’re everywhere. Run curl in a for-loop in bash. Go directly into the command line for your database and duplicate objects, add slightly different parameters, and reinsert them. In mongodb, the syntax on the command line is JavaScript. You can save data fetched from the database into variables and manipulate it using JavaScript. This makes creating more data in mongodb really easy.
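
A quick-and-dirty sketch of both ideas; the endpoint, collection, and field names are all hypothetical:

```bash
# Hammer an endpoint in a bash for-loop and watch status codes and timings.
for i in $(seq 1 200); do
  curl -s -o /dev/null -w "%{http_code} %{time_total}s\n" \
    "http://localhost:3000/api/items?userId=user-$i"
done

# Duplicate existing documents straight from the mongo shell:
# fetch one, tweak a field, drop the _id, and reinsert it many times.
mongosh "mongodb://localhost:27017/myapp" --quiet --eval '
  const original = db.items.findOne();
  for (let i = 0; i < 1000; i++) {
    const copy = Object.assign({}, original, { name: original.name + "-copy-" + i });
    delete copy._id;   // let the database assign a fresh id
    db.items.insertOne(copy);
  }
'
```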

Once you have a good idea of what you want, you can transfer what you’ve learned to a script that is callable from the command line using whatever database client you want.
