air born

Technical sketch: Database pages

Writing technical proposals at work has made me realize that this stuff is best done before writing the code. It lets me ask the hard questions a little ahead of time. These things are never perfect, and I often end up having to revise some of them during the implementation phase. But at least I get a rough blueprint of why and what I am doing.

I will try to write these for the major design decisions while building kyadb. Unfortunately, nobody will review these, so I will call them technical sketches instead of proposals. This will be the first of many (hopefully).


Databases store data. They store it such that reads and writes are as fast as possible for the assumed use cases. Sometimes, read performance is favored over writes and sometimes writes are favored over reads. Either way, the data has to be organized in some structure.

Typically, records in a database are stored in "pages". A page is the smallest piece of data that a database reads and writes. It is usually a fixed-size block of memory or disk which houses a collection of records. If the database persists its data to disk, then many such pages together make up a database file. A database may have one or many such files. That's it. That's the long and short of database storage. Of course, there is more to it, but this is enough to give the reader an overview of what this article is about. I am about to describe what pages could look like in kyadb, my toy database.


Most databases target a certain kind of workload and optimize everything from storage to querying for that workload. There are many types of workloads (not just transactional and analytical). Since I am writing my database from scratch, I am going to imagine some fairly typical requirements:

  1. Records hold typical OLTP data: numbers, strings, and small collections like arrays and maps.

  2. Records can be variable-length.

  3. Records can be deleted.

  4. Records can be updated, and an update may grow a record beyond its current size.

With our requirements set, we can finally talk about how a database page can be implemented. Before doing that, let me say that I am not going to talk about doing any file I/O here. How to read and write pages to and from a disk will be a topic of another future sketch. This sketch is just about the representation of data on a page. Might sound like simple stuff, but I feel like this is a powerful abstraction that deserves to be written about on its own.

The layout of a record

Before I talk more about pages, let's first illustrate what a record will look like. A record here is just a row that has been serialized into a sequence of bytes. There are many efficient ways of packing a row into a byte representation. I will implement a simple one:


An element in a record can be one of the following types of data:

  1. Fixed-size primitives such as integers, floats and booleans.

  2. Variable-length strings.

  3. Arrays and maps built from the above.

With these data types, it is possible to model 90% of the data for common OLTP use cases.
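Since the encoding is not pinned down yet, here is one way it could look, sketched in Python (kyadb itself may end up with a different language and layout entirely). The type tags, the eight-byte number encodings and the two-byte string length prefix are all assumptions for illustration:

```python
import struct

# Hypothetical type tags -- kyadb's real encoding may differ.
INT, FLOAT, STRING = 0, 1, 2

def encode_record(values):
    """Pack a row into bytes: one type-tag byte per element, fixed-size
    encodings for numbers, a two-byte length prefix for strings."""
    out = bytearray()
    for v in values:
        if isinstance(v, float):
            out += struct.pack("<Bd", FLOAT, v)   # tag + 8-byte float
        elif isinstance(v, int):
            out += struct.pack("<Bq", INT, v)     # tag + 8-byte signed int
        elif isinstance(v, str):
            data = v.encode("utf-8")
            out += struct.pack("<BH", STRING, len(data)) + data
        else:
            raise TypeError(f"unsupported element type: {type(v)}")
    return bytes(out)

def decode_record(buf):
    """Inverse of encode_record: walk the buffer tag by tag."""
    values, i = [], 0
    while i < len(buf):
        tag = buf[i]
        i += 1
        if tag == FLOAT:
            (v,) = struct.unpack_from("<d", buf, i)
            i += 8
        elif tag == INT:
            (v,) = struct.unpack_from("<q", buf, i)
            i += 8
        else:  # STRING
            (n,) = struct.unpack_from("<H", buf, i)
            i += 2
            v = buf[i:i + n].decode("utf-8")
            i += n
        values.append(v)
    return values
```

A round trip like `decode_record(encode_record([42, "hello"]))` gives back the original row.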

The page

Knowing that records are just a bunch of bytes, we can now talk about how they will be organized within a page. Let's assume one thing about our system: the process that reads and writes a record will always identify a record using two pieces of information, the page ID and the slot number on that page. I will talk more about the slot number very soon. A global page table maps page IDs to the memory location of each page. If the page is not in memory, it will be brought into memory (more on this in a later article).

Each page has a header that contains the following:

  1. A two-byte number (uint16) storing the number of "slots" currently reserved on the page.

  2. A two-byte number (uint16) storing the pointer to the end of free space on the page. This pointer will be used when placing a new record on the page.

  3. An array of "slots". Each slot is two bytes long. The n-th slot stores the pointer to the n-th record.

Records are added to a page starting from the end. So, the first record added to the page will occupy the last bytes on the page. As new records are added, they keep filling the page from the end. But each record also needs a slot in the header that will store the pointer to that record. So with each new record, a new slot is also allocated. This way, the header's "slot array" expands from left to right on the page, whereas the records fill the page from right to left. At some point, the two will no longer be able to expand because the page does not have enough space left to accommodate a new record and its slot. New records will then be sent to other pages.
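The header and the two growth directions can be sketched like this, assuming a 4 KB page. A real page would serialize the slot count and free-space pointer into its first four bytes; here they are plain Python attributes for brevity:

```python
import struct

PAGE_SIZE = 4096  # assumed page size; kyadb's actual size may differ

class Page:
    """A slotted page: the slot array grows left to right after the header,
    record bytes fill the page right to left."""

    def __init__(self):
        self.buf = bytearray(PAGE_SIZE)
        # A real page would store these two uint16 fields in its first
        # four bytes; they are kept as attributes here for brevity.
        self.num_slots = 0          # number of slots reserved so far
        self.free_end = PAGE_SIZE   # end of free space on the page

    def header_size(self):
        # 2 bytes slot count + 2 bytes free-space pointer + 2 bytes per slot
        return 4 + 2 * self.num_slots

    def insert(self, record):
        """Write a record into free space and reserve a slot for it.
        Returns the slot number, or None if the page is full."""
        if self.free_end - len(record) < self.header_size() + 2:
            return None  # no room for the record plus its new slot
        self.free_end -= len(record)
        self.buf[self.free_end:self.free_end + len(record)] = record
        slot = self.num_slots
        struct.pack_into("<H", self.buf, 4 + 2 * slot, self.free_end)
        self.num_slots += 1
        return slot

    def slot_offset(self, slot):
        """Where the n-th record starts on the page."""
        (off,) = struct.unpack_from("<H", self.buf, 4 + 2 * slot)
        return off
```

Inserting two records gives slots 0 and 1, with the first record sitting at the very end of the page and the second just before it.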


One of our requirements was to be able to store variable-length records. The slotted-page architecture satisfies this requirement: because a record is always found through the indirection of its slot, it can be placed anywhere on the page and be of any length.


Deletion of records is, of course, supported. However, there is one complication with deletion - there may be pointers to a given record. I am not yet sure why this is the case, but that's what the literature says. I will find out the reason later in my journey. Anyway, if we delete the record and replace it with some other record, or rearrange the page to fill in that space, then those pointers will end up pointing to something completely different from what they should. To get around this problem, slot values of deleted records are often replaced with something called a tombstone. In my database, I will use zero as my tombstone. This number will be present where you would expect the record's pointer to be. Since the header occupies the first bytes of the page, no record can ever start at offset zero, so it will be easy to tell whether the value is a tombstone or a real pointer.
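As a tiny illustration of the tombstone idea, with the slot array modeled as a plain Python list of record offsets:

```python
TOMBSTONE = 0  # offset 0 always holds the page header, never a record

def delete(slots, n):
    """Tombstone slot n; the record's bytes stay put as dead space."""
    slots[n] = TOMBSTONE

def is_live(slots, n):
    """A slot is live unless its pointer has been zeroed out."""
    return slots[n] != TOMBSTONE

# hypothetical record offsets on a page, for illustration
slots = [4091, 4085, 4060]
delete(slots, 1)
```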


As long as we are updating the fixed-size elements of a record, things will be straightforward. But when we are updating strings, arrays or maps, it is possible that the new value makes the record larger than it used to be. The record may no longer fit in the space it currently occupies. What we can then do is move the record to a fresh location where it fits and update its slot to point to this new location. The location may even be on another page if the current page does not have enough space to accommodate the updated record. Since all records are identified using just a page ID and a slot number on that page, processes will still be able to find the right record through the pointer stored in the slot. We need not reclaim the space left by the old version of the record just yet. An operation like that would involve sliding multiple records along the page and can be time-consuming.
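A sketch of that update path, with the page modeled as a dict and each slot as a hypothetical (offset, length) pair - none of this is the final interface:

```python
PAGE_SIZE = 4096

def new_page():
    # a page as a dict, for illustration; slots hold (offset, length) pairs
    return {"buf": bytearray(PAGE_SIZE), "free_end": PAGE_SIZE, "slots": []}

def insert(page, record):
    """Append a record into free space and reserve a new slot for it."""
    page["free_end"] -= len(record)
    off = page["free_end"]
    page["buf"][off:off + len(record)] = record
    page["slots"].append((off, len(record)))
    return len(page["slots"]) - 1

def update(page, slot, new_record):
    """Overwrite in place when the new value fits; otherwise move the
    record into fresh free space and repoint the slot. The old bytes
    are left behind as dead space, to be reclaimed later."""
    off, length = page["slots"][slot]
    if len(new_record) <= length:
        page["buf"][off:off + len(new_record)] = new_record
        page["slots"][slot] = (off, len(new_record))
    else:
        page["free_end"] -= len(new_record)
        new_off = page["free_end"]
        page["buf"][new_off:new_off + len(new_record)] = new_record
        page["slots"][slot] = (new_off, len(new_record))
```

The record's identity (its slot number) never changes; only the offset stored in the slot does.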


So are we just going to let the unused space generated by deletions and updates fill up our memory and disks? No, we should give the user the option to consolidate this space and make it available for use again. Pages will support a consolidation or vacuum function in which any unused space is reclaimed by rearranging records and updating slots. This operation should be part of a global operation that reconstructs the table's data (and associated indexes) completely, making sure that there are no dangling or corrupted pointers arising from the rearrangement.
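Here is what per-page consolidation could look like, again assuming hypothetical (offset, length) slot entries with (0, 0) as the tombstone:

```python
PAGE_SIZE = 4096
TOMBSTONE = (0, 0)  # a slot whose record has been deleted

def vacuum(buf, slots):
    """Rebuild a page so live records sit contiguously at the end.
    Tombstoned slots are kept so surviving slot numbers do not shift;
    a full table rebuild could drop them as well."""
    new_buf = bytearray(PAGE_SIZE)
    new_slots = []
    free_end = PAGE_SIZE
    for off, length in slots:
        if (off, length) == TOMBSTONE:
            new_slots.append(TOMBSTONE)
            continue
        free_end -= length
        new_buf[free_end:free_end + length] = buf[off:off + length]
        new_slots.append((free_end, length))
    return new_buf, new_slots, free_end
```

After vacuuming, the dead space formerly occupied by deleted records and old record versions is folded back into the free region between the slot array and the records.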

That's about it. I could write more about how it can be implemented at the code level and all. But this is not a documentation exercise. I think sketching this out gives me enough clarity on what interfaces I need to write.