Having a fundamental understanding of how databases work is crucial for developers - even if you work as a frontend developer and will never touch a database directly.
A database is fundamentally a systematic collection of data, managed by software called a DBMS (Database Management System).
A DBMS is an abstraction over a file system: it hides the complexity of the underlying storage from the developer and takes care of a huge number of details - concurrency, data integrity, availability, optimization, etc. - which we usually take for granted.
As I periodically find myself refreshing my knowledge on the subject, and I’m a huge fan of learning by doing, I had the idea of trying to implement a simple DBMS from scratch. As the project’s name (DumbDB) suggests, the aim is not to create a database useful for anything, but rather to explore some concepts at a deeper level, such as:
- basic database strategies
- basic database APIs
- indexes
- query parser (if I’m able to be consistent up to this point :D)
- …
The implementation will be done in Python and the source code will be available on GitHub.
A Database Management System interface
We will define a basic interface for our DBMS - the minimal set of operations that we expect it to support. We do this using an abstract class:
from abc import ABC, abstractmethod
from dataclasses import dataclass
from functools import wraps
from pathlib import Path


def require_isset_database(func):
    """Decorator: fail early if no database has been selected."""
    @wraps(func)
    def wrapper(self, *args, **kwargs):
        if self.current_database is None:
            raise ValueError(
                "No database selected. Use 'use_database' to select a database first.")
        return func(self, *args, **kwargs)
    return wrapper


@dataclass
class DBMS(ABC):
    root_dir: Path = Path("./data")
    current_database: str | None = None

    @abstractmethod
    def create_database(self, name: str) -> None:
        raise NotImplementedError()

    @abstractmethod
    def use_database(self, name: str) -> None:
        raise NotImplementedError()

    @abstractmethod
    def create_table(self, name: str) -> None:
        raise NotImplementedError()

    @abstractmethod
    def drop_table(self, name: str) -> None:
        raise NotImplementedError()

    @abstractmethod
    def insert(self, table: str, value: dict) -> None:
        raise NotImplementedError()

    @abstractmethod
    def update(self, table: str, value: dict) -> None:
        raise NotImplementedError()

    @abstractmethod
    def delete(self, table: str, value: dict) -> None:
        raise NotImplementedError()

    @abstractmethod
    def query(self, table: str, query: dict) -> QueryResult:
        raise NotImplementedError()
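The interface references a QueryResult type that holds the result of a query. Judging from the results printed later in this post, a minimal definition could be a small dataclass carrying the query time and the matching rows (it would need to be defined before, or imported into, the interface module):

@dataclass
class QueryResult:
    time: float        # seconds spent executing the query
    rows: list[dict]   # matching rows, one dict per row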
Basically, we want to be able to:
- create a database
- use a database
- create tables in the database
- drop tables from the database
- insert data (rows) into a table
- update data
- delete data
- query data from a table, based on some criteria
Most of the operations need a specific database to be selected, so we’ll add a decorator to check this and raise an error if no database is selected.
As previously said, a DBMS is basically an abstraction over a file system, so we’ll need to define the structure of the database files. For this project, the approach will be to have:
- a folder for each database
- inside the database folder, a folder for tables (in the future, we could have a folder for indexes and other things as well)
- inside the tables folder, a file per table containing the table data
The choice of how to structure the data for the table is a fundamental one, and it will have a big impact on the performance of the database. Our first implementation will be an append-only database.
An Append-Only Database
An append-only database only allows appending data to the end of the file. This makes the database much simpler to implement, and makes inserts extremely fast.
On the other hand, to find a specific row, we need to read the entire file, which makes the queries much slower.
Let’s see how we can implement the operations of the interface described above.
Create Database
Creating a database is straightforward: we just need to create a folder for the database, and inside of it a folder for the tables.
import csv   # used by the table operations shown later
import time  # used by the query operation shown later
from dataclasses import dataclass
from pathlib import Path


@dataclass
class AppendOnlyDBMS(DBMS):
    """
    A class representing an append-only DBMS.
    This DBMS does not store data in memory, but rather on disk.
    It only appends data to the end of the file. For each primary key, the last record is the valid one.
    """

    def get_database_dir(self, db_name: str) -> Path:
        return self.root_dir / db_name

    def get_tables_dir(self, db_name: str) -> Path:
        return self.get_database_dir(db_name) / "tables"

    @property
    def current_database_dir(self) -> Path:
        return self.get_database_dir(self.current_database)

    @property
    def tables_dir(self) -> Path:
        return self.get_tables_dir(self.current_database)

    def create_database(self, db_name: str) -> None:
        db_dir = self.get_database_dir(db_name)
        if db_dir.exists():
            raise ValueError(f"Database '{db_name}' already exists")
        tables_dir = self.get_tables_dir(db_name)
        tables_dir.mkdir(parents=True)
Use Database
Using a database simply means setting the current database to the one we want to use - this allows us to use the decorator we defined earlier to check that a database is selected.
Note: From now on, all the operations will have the require_isset_database decorator.
def use_database(self, name: str) -> None:
    db_dir = Path(self.root_dir / name)
    if not db_dir.exists():
        raise ValueError(
            f"Database '{name}' does not exist. You must create it before using it.")
    self.current_database = name
Create Table
Creating a table means creating a file for the table data.
One of the simplest ways to store data is to use a CSV file.
At this point, we only need to store the header of the CSV - which is the column names for the table.
We enforce that each table has a primary key column named id (the reason will be clear later).
Note: We also append a column named __deleted__ to the table, which is used to implement the delete operation.
def get_table_file_path(self, table_name: str) -> Path:
    return self.tables_dir / f"{table_name}.csv"

@require_isset_database
def create_table(self, table_name: str, headers: list[str] = None) -> None:
    """Create a new table in the database - which is just a new csv file."""
    table_file = self.get_table_file_path(table_name)
    if table_file.exists():
        raise ValueError(f"Table '{table_name}' already exists")
    if not headers:
        headers = ["id"]
    headers.append("__deleted__")
    # Write the header row: a freshly created table contains only its column names.
    with open(table_file, 'w', newline='') as f:
        csv.writer(f).writerow(headers)
At this point, we are able to run the following code:
import dumbdb
dbms = dumbdb.AppendOnlyDBMS()
dbms.create_database("my_first_db")
dbms.use_database("my_first_db")
dbms.create_table("my_first_table", ["id", "name", "age"])
And the situation on disk should be:
>> tree data
data
└── my_first_db
└── tables
└── my_first_table.csv
>> cat data/my_first_db/tables/my_first_table.csv
id,name,age,__deleted__
Hurray! We have successfully initialized a database and a table!
Drop Table
Since a table is just a file on disk, the drop operation is straightforward: we just need to delete the file.
@require_isset_database
def drop_table(self, table_name: str) -> None:
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")
    table_file.unlink()
Insert Data
Right now, our database is pretty much useless. To do anything useful, we need at least to be able to store data. Since the file is an append-only CSV, inserting a row simply means opening the file in append mode and writing the row at the end. We must be careful to set the __deleted__ column to False for all inserted rows.
@require_isset_database
def insert(self, table_name: str, value: dict):
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")
    with open(table_file, 'a', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(list(value.values()) + [False])
Note: At this point, we are not checking that the inserted data respects the table schema, so we just trust the user.
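If we wanted to be a little less trusting, a minimal check could compare the keys of the inserted dict against the table header. The helper below is not part of DumbDB - just a hypothetical sketch of what such a validation might look like:

import csv

def validate_row_against_schema(table_file, value: dict) -> None:
    """Hypothetical helper: reject rows whose keys don't match the table header."""
    with open(table_file, 'r', newline='') as f:
        headers = next(csv.reader(f))  # the first line of the file is the header row
    expected = [h for h in headers if h != "__deleted__"]
    if list(value.keys()) != expected:
        raise ValueError(
            f"Row keys {list(value.keys())} do not match table columns {expected}")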
Update Data
Any respectable DBMS (even if DumbDB is not one of them!) should be able to update data. The constraint of having an append-only file means that we cannot update a previously inserted row directly on the CSV file, so we have to think of another strategy.
In reality, the strategy is pretty simple: we just insert a new row with the updated data, and when we query we only consider the most recent data. So, updating data is just a wrapper around the insert operation.
However, for this to work, we need to be able to understand when two CSV rows are referring to the same logical row. To do this, we’ll use the primary key column that we enforced to be present in each table.
Note: Before inserting, we add a check to make sure that the row that we want to update exists.
@require_isset_database
def update(self, table_name: str, value: dict) -> None:
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")
    query_result = self.query(table_name, {"id": value["id"]})
    if not query_result.rows:
        raise ValueError(f"Row with id {value['id']} does not exist")
    self.insert(table_name, value)
Delete Data
As with the update operation, since the file is append-only, we cannot directly delete a row. That’s why we have been preparing the way for this operation by adding the __deleted__ column. When we want to delete a row, we can just insert a new row with the same id, but with the __deleted__ column set to True. When we query the data, if we find a row with the __deleted__ column set to True, we need to ignore all previous rows with the same id.
Note: The delete operation is identical to the insert operation, except for the __deleted__ column value - so it could be better engineered (a possible refactor is sketched after the code below).
@require_isset_database
def delete(self, table_name: str, value: dict) -> None:
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")
    with open(table_file, 'a', newline='') as f:
        csv_writer = csv.writer(f)
        csv_writer.writerow(list(value.values()) + [True])
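Since insert and delete differ only in the value appended for the __deleted__ column, one possible refactor (a sketch, not how DumbDB is actually written; _append_row is a hypothetical name) would be a shared private helper:

def _append_row(self, table_name: str, value: dict, deleted: bool) -> None:
    """Hypothetical shared helper: append a row with the given __deleted__ flag."""
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")
    with open(table_file, 'a', newline='') as f:
        csv.writer(f).writerow(list(value.values()) + [deleted])

@require_isset_database
def insert(self, table_name: str, value: dict) -> None:
    self._append_row(table_name, value, deleted=False)

@require_isset_database
def delete(self, table_name: str, value: dict) -> None:
    self._append_row(table_name, value, deleted=True)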
Query Data
We’re almost there. We are now able to insert, update and delete data. However, this is pretty much useless if we cannot query it. The query operation is probably the most complex one, since it needs to handle the update and delete operations as we have defined them.
Basically, our simplified query operation will take a dict of column names and values, and will return a list of rows that match the query. Only the equality operator is supported at this point.
The query operation will need to:
- find all of the rows in the CSV file that match the query
- for each logical row, keep only the latest row
- if a row is marked as deleted, ignore all previous rows with the same id
The strategy implemented is pretty simple: we create a dictionary of the rows that we have found, using the id as the key. When we find an updated version of the row, we just overwrite the previous value in the dictionary. We do the same for the deleted rows. After that, we exclude from the result all of the rows that are marked as deleted.
@require_isset_database
def query(self, table_name: str, query: dict) -> QueryResult:
    start_time = time.time()
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")

    matching_rows = {}
    with open(table_file, 'r', newline='') as f:
        csv_reader = csv.DictReader(f)
        for row in csv_reader:
            if all(row[key] == query[key] for key in query):
                matching_rows[row['id']] = row

    # Remove all rows for which the last line is a delete
    matching_rows = {k: v for k, v in matching_rows.items()
                     if v['__deleted__'] == 'False'}

    return QueryResult(time.time() - start_time, list(matching_rows.values()))
Let’s test it!
At this point, we have a fully functional database - at least with respect to our initial requirements. We can run a few tests and see if it works as expected.
Let’s add some data:
import dumbdb
dbms = dumbdb.AppendOnlyDBMS()
dbms.create_database("my_first_db")
dbms.use_database("my_first_db")
dbms.create_table("users", ["id", "name", "age"])
dbms.insert("users", {"id": "1", "name": "John", "age": "30"})
dbms.insert("users", {"id": "2", "name": "Jane", "age": "25"})
dbms.insert("users", {"id": "3", "name": "Luke", "age": "25"})
Now, let’s query the data:
print(dbms.query("users", {"id": "2"}))
>>> QueryResult(time=0.0006139278411865234, rows=[{'id': '2', 'name': 'Jane', 'age': '25', '__deleted__': 'False'}])
print(dbms.query("users", {"age": "25"}))
>>> QueryResult(
time=0.0006849765777587891,
rows=[
{'id': '2', 'name': 'Jane', 'age': '25', '__deleted__': 'False'},
{'id': '3', 'name': 'Luke', 'age': '25', '__deleted__': 'False'}
]
)
Oh, we forgot that today is Jane’s birthday! Let’s update her age:
dbms.update("users", {"id": "2", "name": "Jane", "age": "26"})
print(dbms.query("users", {"id": "2"}))
>>> QueryResult(time=0.0006139278411865234, rows=[{'id': '2', 'name': 'Jane', 'age': '26', '__deleted__': 'False'}])
Now that we are done with it, let’s drop our table and call it a day!
dbms.drop_table("users")
print(dbms.query("users", {"id": "2"}))
>>> ValueError: Table 'users' does not exist
Performance considerations
Now let’s test the performance of our database in various situations. Let’s try to insert batches of rows, first into an empty database, and then into a database that already contains data.
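The test code is not shown here; a minimal sketch of how the insert timing could be measured, assuming only the API defined above (the database and table names below are made up for the example), is:

import time

import dumbdb

dbms = dumbdb.AppendOnlyDBMS()
dbms.create_database("perf_test")
dbms.use_database("perf_test")
dbms.create_table("users", ["id", "name", "age"])

# Insert batches of 1000 rows and time each batch separately.
for batch in range(10):
    start = time.time()
    for i in range(1000):
        row_id = batch * 1000 + i
        dbms.insert("users", {"id": str(row_id), "name": f"user_{row_id}", "age": "30"})
    print(f"Time taken to insert 1000 rows: {time.time() - start:.4f} seconds")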
-------------------------------- live log call ---------------------------------
15:34:32 INFO Current number of rows in the db: 0
15:34:32 INFO Time taken to insert 1000 rows: 0.0286 seconds
15:34:32 INFO Current number of rows in the db: 1000
15:34:32 INFO Time taken to insert 1000 rows: 0.0282 seconds
15:34:32 INFO Current number of rows in the db: 2000
15:34:32 INFO Time taken to insert 1000 rows: 0.0282 seconds
15:34:32 INFO Current number of rows in the db: 3000
15:34:32 INFO Time taken to insert 1000 rows: 0.0268 seconds
15:34:32 INFO Current number of rows in the db: 4000
15:34:32 INFO Time taken to insert 1000 rows: 0.0272 seconds
15:34:32 INFO Current number of rows in the db: 5000
15:34:32 INFO Time taken to insert 1000 rows: 0.0275 seconds
15:34:32 INFO Current number of rows in the db: 6000
15:34:32 INFO Time taken to insert 1000 rows: 0.0272 seconds
15:34:32 INFO Current number of rows in the db: 7000
15:34:32 INFO Time taken to insert 1000 rows: 0.0279 seconds
15:34:32 INFO Current number of rows in the db: 8000
15:34:32 INFO Time taken to insert 1000 rows: 0.0293 seconds
15:34:32 INFO Current number of rows in the db: 9000
15:34:32 INFO Time taken to insert 1000 rows: 0.0294 seconds
As expected, the insert operation performance is not affected by the number of rows in the database.
Now, let’s try to query the data at different sizes of the database:
-------------------------------- live log call ---------------------------------
15:37:37 INFO Current number of rows in the db: 1000
15:37:37 INFO Time taken to query 1 row: 0.0008 seconds
15:37:37 INFO Current number of rows in the db: 2000
15:37:37 INFO Time taken to query 1 row: 0.0015 seconds
15:37:37 INFO Current number of rows in the db: 3000
15:37:37 INFO Time taken to query 1 row: 0.0022 seconds
15:37:37 INFO Current number of rows in the db: 4000
15:37:37 INFO Time taken to query 1 row: 0.0029 seconds
15:37:37 INFO Current number of rows in the db: 5000
15:37:37 INFO Time taken to query 1 row: 0.0036 seconds
15:37:37 INFO Current number of rows in the db: 6000
15:37:37 INFO Time taken to query 1 row: 0.0044 seconds
15:37:37 INFO Current number of rows in the db: 7000
15:37:37 INFO Time taken to query 1 row: 0.0050 seconds
15:37:37 INFO Current number of rows in the db: 8000
15:37:37 INFO Time taken to query 1 row: 0.0057 seconds
15:37:37 INFO Current number of rows in the db: 9000
15:37:37 INFO Time taken to query 1 row: 0.0064 seconds
15:37:37 INFO Current number of rows in the db: 10000
15:37:37 INFO Time taken to query 1 row: 0.0071 seconds
Again, as expected, the query operation performance is affected by the number of rows in the database. It scales linearly with the number of rows.
Compaction
Given how we have handled the file so far, it is easy to see that it will grow indefinitely. This both wastes disk space unnecessarily and, even worse, makes the query operation slower and slower as the file grows.
To avoid this, we need to implement a compaction operation.
The compaction operation will read the file, and for each id it will keep only the latest row. It will also remove all of the rows that are marked as deleted. The compacted file will contain only the current state of the table.
@require_isset_database
def compact_table(self, table_name: str):
    """Compact a table."""
    table_file = self.get_table_file_path(table_name)
    if not table_file.exists():
        raise ValueError(f"Table '{table_name}' does not exist")

    compacted_data = {}
    with open(table_file, 'r', newline='') as f:
        csv_reader = csv.DictReader(f)
        headers = csv_reader.fieldnames
        for row in csv_reader:
            compacted_data[row['id']] = row

    # Remove all rows for which the last line is a delete
    compacted_data = {k: v for k, v in compacted_data.items()
                      if v['__deleted__'] == 'False'}

    with open(table_file, 'w', newline='') as f:
        csv_writer = csv.writer(f)
        # Rewrite the header, then the latest surviving version of each row
        csv_writer.writerow(headers)
        for row in compacted_data.values():
            csv_writer.writerow(row.values())
Let’s check if it works:
import dumbdb
dbms = dumbdb.AppendOnlyDBMS()
dbms.create_database("compact_test")
dbms.use_database("compact_test")
dbms.create_table("users", ["id", "name", "age"])
dbms.insert("users", {"id": "1", "name": "John Smith", "age": "20"})
dbms.insert("users", {"id": "2", "name": "Mike Smith", "age": "21"})
dbms.insert("users", {"id": "3", "name": "Luke Skywalker", "age": "26"})
dbms.update("users", {"id": "1", "name": "John Smith", "age": "21"})
dbms.delete("users", {"id": "2", "name": "Mike Smith", "age": "21"})
File content:
>> cat data/compact_test/tables/users.csv
id,name,age,__deleted__
1,John Smith,20,False
2,Mike Smith,21,False
3,Luke Skywalker,26,False
1,John Smith,21,False
2,Mike Smith,21,True
Let’s compact the table:
dbms.compact_table("users")
File content after compaction:
>> cat data/compact_test/tables/users.csv
id,name,age,__deleted__
1,John Smith,21,False
3,Luke Skywalker,26,False
As we can see, the row with id 2 has been removed, and for the row with id 1 only the latest version has been kept.
Considerations
When to run the compaction operation is a design choice that, for real databases, would involve a lot of trade-offs. Would the DB halt operations to run the compaction? Would it run the compaction in the background? Fortunately, for our DumbDB we can just ignore this and run the compaction whenever we want.
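For illustration only (DumbDB does not implement this), one simple heuristic could be to trigger compaction whenever a table file grows past a size threshold:

# Hypothetical trigger, not part of DumbDB: compact when the file grows too large.
MAX_TABLE_FILE_SIZE = 10 * 1024 * 1024  # 10 MiB, an arbitrary threshold

def maybe_compact(dbms: AppendOnlyDBMS, table_name: str) -> None:
    table_file = dbms.get_table_file_path(table_name)
    if table_file.stat().st_size > MAX_TABLE_FILE_SIZE:
        dbms.compact_table(table_name)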
References
- Designing Data-Intensive Applications: The Big Ideas Behind Reliable, Scalable, and Maintainable Systems (1st Edition), by Martin Kleppmann
- What is a database?
- DumbDB source code