air born

Technical sketch: Database pages

Writing technical proposals at work has made me realize that this stuff is best done before writing the code. It lets me ask the hard questions a little ahead of time. These things are never perfect, and I often end up having to revise some of them during the implementation phase. But at least I get a rough blueprint of why and what I am doing.

I will try to write these for the major design decisions while building kyadb. Unfortunately, nobody will review these, so I will call them technical sketches instead of proposals. This will be the first of many (hopefully).


Databases store data. They store it such that reads and writes are as fast as possible for the assumed use cases. Sometimes, read performance is favored over writes and sometimes writes are favored over reads. Either way, the data has to be organized in some structure.

Typically, records in a database are stored in "pages". A page is the smallest piece of data that a database reads and writes. It is usually a fixed-size block of memory or disk which houses a collection of records. If the database persists its data to disk, then many such pages together make up a database file. A database may have one or many such files. That's it. That's the long and short of database storage. Of course, there is more to it, but this is enough to give the reader an overview of what this article is about. I am about to describe what pages could look like in kyadb, my toy database.


Most databases target a certain kind of workload and optimize everything from storage to querying for that workload. There are many types of workloads (not just transactional and analytical). Since I am writing my database from scratch, I am going to imagine some fairly typical requirements:

  1. Records hold typical OLTP data: numbers, strings, and small collections like arrays and maps.

  2. Records can be variable-length.

  3. Records can be deleted.

  4. Records can be updated, and an update may grow a record beyond its current size.

With our requirements set, we can finally talk about how a database page can be implemented. Before doing that, let me say that I am not going to talk about doing any file I/O here. How to read and write pages to and from a disk will be a topic of another future sketch. This sketch is just about the representation of data on a page. Might sound like simple stuff, but I feel like this is a powerful abstraction that deserves to be written about on its own.

The layout of a record

Before I talk more about pages, let's first illustrate what a record will look like. A record here is just a row that has been serialized into a sequence of bytes. There are many efficient ways of packing a row into a byte representation. I will implement a simple one:


An element in a record can be one of the following types of data:

  1. Fixed-size primitives such as integers, floats and booleans.

  2. Variable-length strings.

  3. Arrays and maps built from the above.

With these data types, it is possible to model 90% of the data for common OLTP use cases.
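Since the encoding is not pinned down yet, here is one way it could look, sketched in Python (kyadb itself may end up with a different language and layout entirely). The type tags, the eight-byte number encodings and the two-byte string length prefix are all assumptions for illustration:

```python
import struct

# Hypothetical type tags -- kyadb's real encoding may differ.
INT, FLOAT, STRING = 0, 1, 2

def encode_record(values):
    """Pack a row into bytes: one type-tag byte per element, fixed-size
    encodings for numbers, a two-byte length prefix for strings."""
    out = bytearray()
    for v in values:
        if isinstance(v, float):
            out += struct.pack("<Bd", FLOAT, v)   # tag + 8-byte float
        elif isinstance(v, int):
            out += struct.pack("<Bq", INT, v)     # tag + 8-byte signed int
        elif isinstance(v, str):
            data = v.encode("utf-8")
            out += struct.pack("<BH", STRING, len(data)) + data
        else:
            raise TypeError(f"unsupported element type: {type(v)}")
    return bytes(out)

def decode_record(buf):
    """Inverse of encode_record: walk the buffer tag by tag."""
    values, i = [], 0
    while i < len(buf):
        tag = buf[i]
        i += 1
        if tag == FLOAT:
            (v,) = struct.unpack_from("<d", buf, i)
            i += 8
        elif tag == INT:
            (v,) = struct.unpack_from("<q", buf, i)
            i += 8
        else:  # STRING
            (n,) = struct.unpack_from("<H", buf, i)
            i += 2
            v = buf[i:i + n].decode("utf-8")
            i += n
        values.append(v)
    return values
```

A round trip like `decode_record(encode_record([42, "hello"]))` gives back the original row.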

The page

Knowing that records are just a bunch of bytes, we can now talk about how they will be organized within a page. Let's assume one thing about our system: the process that reads and writes a record will always identify a record using two pieces of information, the page ID and the slot number on that page. I will talk more about the slot number very soon. A global page table maps page IDs to the memory location of each page. If the page is not in memory, it will be brought into memory (more on this in a later article).

Each page has a header that contains the following:

  1. A two-byte number (uint16) storing the number of "slots" currently reserved on the page.

  2. A two-byte number (uint16) storing the pointer to the end of free space on the page. This pointer will be used when placing a new record on the page.

  3. An array of "slots". Each slot is two bytes long. The n-th slot stores the pointer to the n-th record.

Records are added to a page starting from the end. So, the first record added to the page will occupy the last bytes on the page. As new records are added, they keep filling the page from the end. But each record also needs a slot in the header that will store the pointer to that record. So with each new record, a new slot is also allocated. This way, the header's "slot array" expands from left to right on the page, whereas the records fill the page from right to left. At some point, the two will no longer be able to expand because the page does not have enough space left to accommodate a new record and its slot. New records will then be sent to other pages.
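The header and the two growth directions can be sketched like this, assuming a 4 KB page. A real page would serialize the slot count and free-space pointer into its first four bytes; here they are plain Python attributes for brevity:

```python
import struct

PAGE_SIZE = 4096  # assumed page size; kyadb's actual size may differ

class Page:
    """A slotted page: the slot array grows left to right after the header,
    record bytes fill the page right to left."""

    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        # A real page would store these two uint16 fields in its first
        # four bytes; they are kept as attributes here for brevity.
        self.num_slots = 0          # number of slots reserved so far
        self.free_end = PAGE_SIZE   # end of free space on the page

    def header_size(self):
        # 2 bytes slot count + 2 bytes free-space pointer + 2 bytes per slot
        return 4 + 2 * self.num_slots

    def insert(self, record):
        """Write a record into free space and reserve a slot for it.
        Returns the slot number, or None if the page is full."""
        if self.free_end - len(record) < self.header_size() + 2:
            return None  # no room for the record plus its new slot
        self.free_end -= len(record)
        self.buf[self.free_end:self.free_end + len(record)] = record
        slot = self.num_slots
        struct.pack_into("<H", self.buf, 4 + 2 * slot, self.free_end)
        self.num_slots += 1
        return slot

    def slot_offset(self, slot):
        """Where the n-th record starts on the page."""
        (off,) = struct.unpack_from("<H", self.buf, 4 + 2 * slot)
        return off
```

Inserting two records gives slots 0 and 1, with the first record sitting at the very end of the page and the second just before it.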


One of our requirements was to be able to store variable-length records. The slotted-page architecture satisfies this requirement: because a record is always found through the indirection of its slot, it can be placed anywhere on the page and be of any length.


Deletion of records is, of course, supported. However, there is one complication with deletion - there may be pointers to a given record. I am not yet sure why this is the case, but that's what the literature says. I will find out the reason later in my journey. Anyway, if we delete the record and replace it with some other record, or rearrange the page to fill in that space, then those pointers will end up pointing to something completely different from what they should. To get around this problem, slot values of deleted records are often replaced with something called a tombstone. In my database, I will use zero as my tombstone. This number will be present where you would expect the record's pointer to be. Since the header occupies the first bytes of the page, no record can ever start at offset zero, so it will be easy to tell whether the value is a tombstone or a real pointer.
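As a tiny illustration of the tombstone idea, with the slot array modeled as a plain Python list of record offsets:

```python
TOMBSTONE = 0  # offset 0 always holds the page header, never a record

def delete(slots, n):
    """Tombstone slot n; the record's bytes stay put as dead space."""
    slots[n] = TOMBSTONE

def is_live(slots, n):
    """A slot is live unless its pointer has been zeroed out."""
    return slots[n] != TOMBSTONE

# hypothetical record offsets on a page, for illustration
slots = [4091, 4085, 4060]
delete(slots, 1)
```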


As long as we are updating the fixed-size elements of a record, things will be straightforward. But when we are updating strings, arrays or maps, it is possible that the new value makes the record larger than it used to be. The record may no longer fit in the space it currently occupies. What we can then do is move the record to a fresh location where it fits and update its slot to point to this new location. The location may even be on another page if the current page does not have enough space to accommodate the updated record. Since all records are identified using just a page ID and a slot number on that page, processes will still be able to find the right record through the pointer stored in the slot. We need not reclaim the space left by the old version of the record just yet. An operation like that would involve sliding multiple records along the page and can be time-consuming.
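A sketch of that update path, with the page modeled as a dict and each slot as a hypothetical (offset, length) pair - none of this is the final interface:

```python
PAGE_SIZE = 4096

def new_page():
    # a page as a dict, for illustration; slots hold (offset, length) pairs
    return {"buf": bytearray(PAGE_SIZE), "free_end": PAGE_SIZE, "slots": []}

def insert(page, record):
    """Append a record into free space and reserve a new slot for it."""
    page["free_end"] -= len(record)
    off = page["free_end"]
    page["buf"][off:off + len(record)] = record
    page["slots"].append((off, len(record)))
    return len(page["slots"]) - 1

def update(page, slot, new_record):
    """Overwrite in place when the new value fits; otherwise move the
    record into fresh free space and repoint the slot. The old bytes
    are left behind as dead space, to be reclaimed later."""
    off, length = page["slots"][slot]
    if len(new_record) <= length:
        page["buf"][off:off + len(new_record)] = new_record
        page["slots"][slot] = (off, len(new_record))
    else:
        page["free_end"] -= len(new_record)
        new_off = page["free_end"]
        page["buf"][new_off:new_off + len(new_record)] = new_record
        page["slots"][slot] = (new_off, len(new_record))
```

The record's identity (its slot number) never changes; only the offset stored in the slot does.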


So are we just going to let the unused space generated by deletions and updates fill up our memory and disks? No, we should give the user the option to consolidate this space and make it available for use again. Pages will support a consolidation or vacuum function in which any unused space is reclaimed by rearranging records and updating slots. This operation should be part of a global operation that reconstructs the table's data (and associated indexes) completely, making sure that there are no dangling or corrupted pointers arising from the rearrangement.
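Here is what per-page consolidation could look like, again assuming hypothetical (offset, length) slot entries with (0, 0) as the tombstone:

```python
PAGE_SIZE = 4096
TOMBSTONE = (0, 0)  # a slot whose record has been deleted

def vacuum(buf, slots):
    """Rebuild a page so live records sit contiguously at the end.
    Tombstoned slots are kept so surviving slot numbers do not shift;
    a full table rebuild could drop them as well."""
    new_buf = bytearray(PAGE_SIZE)
    new_slots = []
    free_end = PAGE_SIZE
    for off, length in slots:
        if (off, length) == TOMBSTONE:
            new_slots.append(TOMBSTONE)
            continue
        free_end -= length
        new_buf[free_end:free_end + length] = buf[off:off + length]
        new_slots.append((free_end, length))
    return new_buf, new_slots, free_end
```

After vacuuming, the dead space formerly occupied by deleted records and old record versions is folded back into the free region between the slot array and the records.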

That's about it. I could write more about how it can be implemented at the code level and all. But this is not a documentation exercise. I think sketching this out gives me enough clarity on what interfaces I need to write.