air born

An odd database

Sep 25, 2023

Imagine a database that allows storing data for both transactional and analytical usage. Not very difficult to imagine. There are several out there already.

Now imagine each record being able to store a matrix (of arbitrary size) as the value of a field. That is, in addition to the usual string, boolean and numeric types, you can store matrices (and therefore arrays) and reference them while querying. Now, there could be other databases that did this too.

There’s probably very few databases that did both of these things. But most do not even anticipate the size of their matrices to be more than a few hundred cells. Fortunately, there is at least one format, that I am aware of, that can be used to build such a database. The creator(s) of TileDB seem to be designing for the kinds of domains that I used to be a part of. They have an enterprise offering now, but it seems to be driven by broader market needs. I had a specific set of requirements. One that could benefit from a dedicated solution. Anyway, before I detail out my requirements, I will shed some light on why I had them in the first place.

Because science!

Scientific experiments tend to generate a lot of data. The computational scientists from biology, astronomy, geology, etc. have invented various custom information exchange formats, built especially to support the kind of data their experiments generate. I have some experience with biomedical data formats. So I will only talk about these. Each experiment may generate many files that may get consolidated into a single large textual or binary file of observations. Observations are often represented as matrices of two, three or more dimensions. These files are only queryable using niche (but often well supported) libraries. The libraries may help the practitioner deduce some conclusions from their experiment.

But a lot of science is also about running your hypothesis or observations against past data. A lot of past experimental data is available publicly or in your lab’s private archives. Naturally, many companies have realized the need for collating and curating such data and offering it to practitioners through various interfaces. This interface could be a user-friendly GUI for searching on all datasets, based on different biological factors (species, tissue, disease, etc.). This interface could also be a powerful query language to find datasets of interest, again based on biological factors.

The company I work for, while writing this piece, is in the business of collating and curating biomedical datasets and allowing users to find relevant insights from them. We have a large corpus of datasets in various file formats. At one point, we were exploring this cool idea of letting users query over all datasets of a given file format using SQL. Querying over multiple such files is of-course a challenge. Most of these formats were never meant to be used for meta-level querying or analysis. Supporting SQL querying over each format, across a large number of files, each with their own dimensions was not a smart idea.


Update: I am not going to work on this. We found that most users do not care about SQL on data matrices. They would rather pull them into their workspace and run their scripts & notebooks on them. And for the metadata, we just needed Postgres. People are happy.

Copyright © Saiful B. Khan 2024