= SQLite - SQL



O4 is a Container File Format which combines Information and Logic into a single file, giving a result that is comparable to SQLite (an embedded relational database) but without any of that SQL nonsense.

Instead of relying on SQL syntax, O4 uses Smalltalk-80, which gives the user a full-featured programming environment comparable to Java, Python, Swift and so on.

How It Works

O4 is a library which can be embedded in C/C++ programs, similarly to how you might use a library like SQLite or Lua. The program is then able to load O4 files, inspect/modify their contents, and invoke behaviours embedded within the files. It can also expose any C/C++-accessible behaviour to the O4 file, allowing it to access optimised algorithms or system-specific functionality.

A tool is also provided to create and edit O4 files from the command line. Both the library and the tool include the full compiler and storage system.

Comparison To SQLite

SQLite is probably the best system to compare O4 to, both from a use-case perspective and from an implementation perspective. The two systems are very similar in some ways, but completely different in others.


(Note that the term "relational database" is usually taken very strictly, but the data model is still basically equivalent to an "object database" like O4.)


(Support for random-access databases is planned for a future release.)

Comparison To Pharo, Squeak & Amber

O4 is based on Smalltalk, but it is very different from other modern Smalltalk implementations.



Format Details

O4 files consist of a simple file header, followed by one or more object descriptions, terminating at a "nil" object description.

Each object description contains the "key" id of the object, some flags indicating the type of data it holds, as well as the data size and a class reference. Each object description is followed by the data in that object, if any. The data is either raw bytes, or compressed object identifiers. (Future versions will also support endian-agnostic 16/32/64-bit values, but for now these are stored as bytes in an endian-specific format.)
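The layout described above (header, then object descriptions each followed by their data, ending at a "nil" description) can be sketched as a simple walk over a byte buffer. This is only an illustration: the 4-byte header, the one-byte fields, the key value 0 standing in for "nil", and the names `ObjectDesc` and `count_objects` are all assumptions made for the example. The real O4 encoding is more compact and is not specified here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical sizes and values; NOT the real O4 layout. */
enum { HEADER_SIZE = 4, DESC_SIZE = 4, NIL_KEY = 0 };

/* One object description, using invented one-byte fields. */
typedef struct {
    uint8_t key;        /* "key" id of the object; NIL_KEY ends the file */
    uint8_t flags;      /* type of data held (raw bytes or object ids) */
    uint8_t class_ref;  /* reference to the object's class */
    uint8_t size;       /* number of data bytes following the description */
} ObjectDesc;

/* Walks the descriptions after the header and returns how many objects
 * precede the "nil" description, or -1 if the buffer ends too early. */
static long count_objects(const uint8_t *buf, size_t len)
{
    size_t pos = HEADER_SIZE;
    long count = 0;

    assert(buf != NULL);
    while (len >= DESC_SIZE && pos <= len - DESC_SIZE) {
        ObjectDesc d = { buf[pos], buf[pos + 1], buf[pos + 2], buf[pos + 3] };
        pos += DESC_SIZE;
        if (d.key == NIL_KEY)
            return count;           /* terminating "nil" description */
        if (d.size > len - pos)
            return -1;              /* data would run past the buffer */
        pos += d.size;              /* skip the object's data */
        count++;
    }
    return -1;                      /* no terminator found */
}
```

Because each description carries its own data size, a reader can skip objects it does not care about without understanding their contents.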

O4 currently uses a simple palette-based compression method. More-common object references use a smaller palette, which can be addressed in a single byte, while less-common objects use a larger palette addressed in three bytes (and so on). This is surprisingly effective for object data, shaving about 40% off the uncompressed file size and allowing complete object descriptions to fit in as little as two bytes. However, it only helps with relational data (references between objects), not with binary data or text strings.
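As a rough sketch of the idea (not O4's actual encoding), a two-tier palette reference could be written like this. The escape byte `0xFF`, the 16-bit index range, and the names `put_ref` and `get_ref` are assumptions chosen for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical marker byte meaning "a two-byte index follows". */
enum { ESCAPE = 0xFF };

/* Writes a palette index to out and returns the bytes used: 1 for the
 * small palette (indices 0..254), 3 for the large one (escape + 16 bits).
 * out must have room for 3 bytes. */
static size_t put_ref(uint8_t *out, uint16_t index)
{
    assert(out != NULL);
    if (index < ESCAPE) {
        out[0] = (uint8_t)index;
        return 1;
    }
    out[0] = ESCAPE;
    out[1] = (uint8_t)(index >> 8);     /* high byte first */
    out[2] = (uint8_t)(index & 0xFF);
    return 3;
}

/* Reads a palette index written by put_ref; returns the bytes consumed. */
static size_t get_ref(const uint8_t *in, uint16_t *index)
{
    assert(in != NULL && index != NULL);
    if (in[0] != ESCAPE) {
        *index = in[0];
        return 1;
    }
    *index = (uint16_t)((in[1] << 8) | in[2]);
    return 3;
}
```

The saving comes from assigning the low indices to the most frequently referenced objects, so that most references cost one byte.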

Two additional compression methods will be incorporated in the future: Run-Length Encoding (a simpler relative of the schemes used by the Zip/PNG/GIF formats) and Delta Compression (similar to Diff/Git formats). The former deals with repetitive binary/text/string data, while the latter removes higher-level redundancies. Additional palette-based compression methods may also be applied to integer arrays. I expect the resulting combination of techniques to provide close to optimal byte efficiency with close to minimum compression/decompression overheads in most scenarios.
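Since Run-Length Encoding is not implemented in O4 yet, the following is only a generic textbook sketch of the technique, using (count, value) byte pairs. The function names and the pair format are assumptions, not a planned O4 format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encodes n input bytes as (count, value) pairs, with runs capped at 255.
 * out must have room for 2 * n bytes (the worst case, with no runs).
 * Returns the number of bytes written. */
static size_t rle_encode(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t i = 0, o = 0;

    assert(in != NULL && out != NULL);
    while (i < n) {
        uint8_t value = in[i];
        size_t run = 1;
        while (i + run < n && in[i + run] == value && run < 255)
            run++;
        out[o++] = (uint8_t)run;
        out[o++] = value;
        i += run;
    }
    return o;
}

/* Expands (count, value) pairs back into bytes; out must be large enough
 * for the original data. Returns the number of bytes written. */
static size_t rle_decode(const uint8_t *in, size_t n, uint8_t *out)
{
    size_t o = 0;

    assert(in != NULL && out != NULL);
    for (size_t i = 0; i + 1 < n; i += 2)
        for (size_t k = 0; k < in[i]; k++)
            out[o++] = in[i + 1];
    return o;
}
```

Note that this naive pair format doubles the size of data containing no runs, which is one reason practical formats mix literal and run segments.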

Implementation Details

O4 is implemented in C (and Smalltalk), with the core functionality being based on a heavily-reworked version of Public Domain Smalltalk (PDST). It has no third-party dependencies apart from a C compiler and standard library, and it is mostly reentrant and thread-safe (provided that individual files are locked by individual threads).

O4 doesn't yet have a proper I/O or VFS mechanism under the hood, but instead relies on a simple byte streaming API (i.e. stdio with a thin abstraction over the top). In the future, SQLite's VFS or a comparable system may be used instead. Future versions will also offer improved reentrancy and thread safety.
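To give a feel for what "stdio with a thin abstraction over the top" can look like, here is a minimal sketch of a byte stream backed by a `FILE *`. The `ByteStream` type and its functions are hypothetical names invented for this example; they are not O4's actual API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical byte-stream interface. Callers only see get/put, so the
 * stdio backend could later be swapped for a VFS-style one. */
typedef struct ByteStream {
    int (*get)(struct ByteStream *s);            /* next byte, or EOF */
    int (*put)(struct ByteStream *s, int byte);  /* byte written, or EOF */
    void *state;                                 /* backend data (a FILE *) */
} ByteStream;

static int file_get(ByteStream *s)
{
    return fgetc((FILE *)s->state);
}

static int file_put(ByteStream *s, int byte)
{
    return fputc(byte, (FILE *)s->state);
}

/* Wraps an already-open stdio file in the stream interface. */
static ByteStream stream_from_file(FILE *fp)
{
    ByteStream s = { file_get, file_put, fp };
    assert(fp != NULL);
    return s;
}
```

A memory-backed or VFS-backed stream would only need to supply its own `get` and `put` functions, leaving the code that reads and writes objects unchanged.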

Foolishly Assumed Queries

People are confusing. I have no idea how they think or what they want, but they might be wondering...

Why Smalltalk-80?

As well as being very simple and flexible, Smalltalk-80 is extremely stable and well-documented. It also deals with persistence (i.e. memory) in a much more natural way than most other systems.

O4 is designed to power certain types of next-gen business appliances. These appliances have very different hardware and use cases compared to a PC/Phone/Server, so only a small fraction of today's solutions remain applicable (e.g. Java and .NET are equally useless on resource-constrained hardware, regardless of what Oracle or Microsoft may tell you).

Before settling on Smalltalk-80, many other languages were considered. Five new ones were developed (Magic, Magic 2, MettaScript, ZXE & ZPL), and new base libraries were developed for other languages (C, Objective C, Object Pascal). Special attention was also given to Oberon, Lua, Haxe, Go and Erlang. Around two dozen languages were given serious consideration, and many others were briefly investigated.

All of the languages had advantages and disadvantages when compared to each other, but Smalltalk-80 worked well enough for everything, and nothing else really offered any compelling advantages outside of certain niche cases.

One final consideration settled the matter: Smalltalk is the closest to English, and since internationalised programming languages don't work (long story), it makes a lot more sense to use something close to English, especially for important business logic.

(To make a long story short, internationalised programming languages don't work because Unicode only works for very simple cases. When you try to mix languages or implement cross-language logic, you realise that Unicode is exactly equivalent to the Latin alphabet except approximately one hundred million times larger and much more ambiguous. For an example, try using an Arabic or Hebrew identifier in a Java program, then open it in a non-Java-based editor. For extra credit, try it in a string, next to an equals sign and before a semicolon - this should make it abundantly clear that Unicode is the most confusing version of Latin ever created.)