Steward’s data structures

Steward uses a directory tree to store the binaries and a database to store their metadata.

About object identifiers

Steward assigns a unique identifier to each object uploaded into its store, derived from the object’s contents. While the file is being uploaded, Steward calculates a one-way SHA-1 hash of the contents. This 160-bit hash is then base32 encoded, which gives an identifier that is exactly 32 characters long and consists only of digits and uppercase letters. The probability that two different files will have the same identifier is exceedingly low: for a 50% chance of a single collision you need about 2^80, or 1.2*10^24, objects in a store.
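As a minimal sketch of the scheme described above, the following computes such an identifier with the Python standard library. The function name object_identifier and the chunk size are hypothetical, and the standard RFC 4648 base32 alphabet is an assumption; Steward's own code may differ.

```python
import base64
import hashlib
import io


def object_identifier(stream, chunk_size=64 * 1024):
    """Return the base32-encoded SHA-1 digest of a binary stream.

    Hypothetical helper for illustration; not taken from Steward's code.
    """
    digest = hashlib.sha1()
    # Read in chunks so a large upload never has to fit in memory.
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    # 20 digest bytes = 160 bits = 32 base32 characters, so no padding appears.
    return base64.b32encode(digest.digest()).decode("ascii")


identifier = object_identifier(io.BytesIO(b"hello steward"))
print(identifier, len(identifier))
```

Because 160 bits divide evenly into 5-bit base32 symbols, the result is always 32 characters drawn from A-Z and 2-7.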

Using this identifier has a few interesting properties:

  • It is next to impossible to guess the name of a document.
  • It is simple to know the name of the object before it is uploaded into the Steward instance.
  • Changing anything in the document will change the identifier completely.
  • It’s easy to verify whether a file has been modified while in the store, since modified contents will no longer match the identifier.
  • Uploading the same file twice will store it only once.
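Two of the properties above, integrity checking and storing duplicates only once, can be illustrated with a short sketch. A plain dict stands in for the store here, and the helper names identifier_for and is_intact are hypothetical.

```python
import base64
import hashlib


def identifier_for(data):
    """Base32-encoded SHA-1 of the given bytes (hypothetical helper)."""
    return base64.b32encode(hashlib.sha1(data).digest()).decode("ascii")


def is_intact(identifier, data):
    """True when the bytes still hash to the name they are stored under."""
    return identifier_for(data) == identifier


# A dict stands in for the store: keys are identifiers, values are contents.
store = {}
for upload in (b"report v1", b"report v1", b"report v2"):
    # The second upload of identical bytes maps to the same key.
    store[identifier_for(upload)] = upload

original_id = identifier_for(b"report v1")
print(len(store))
print(is_intact(original_id, store[original_id]))
print(is_intact(original_id, b"report v1 (tampered)"))
```

Three uploads leave only two entries in the store, and any change to the stored bytes makes the check against the identifier fail.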

The store

In principle, the files are stored in a directory with sub-directories. Each file is stored under the identifier that Steward assigns to it based on its contents. This directory is called “the store”.

Next to the directories where objects are stored, there is also a directory that holds files while they are being uploaded. These files have random names. Once the entire object has been received, the final identifier is calculated and the file is moved to its final location and renamed to this identifier. Lastly, the metadata is added to the database, making the object available in Steward.
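The upload sequence described above can be sketched roughly as follows. This is an illustration under stated assumptions, not Steward's implementation: the function name store_upload, the directory names, and the rule that the first two characters of the identifier select the sub-directory are all hypothetical. The final step of adding metadata to the database is left out.

```python
import base64
import hashlib
import io
import os
import tempfile


def store_upload(stream, store_dir, incoming_dir, chunk_size=64 * 1024):
    """Spool an upload to a randomly named file, then move it into the store."""
    os.makedirs(incoming_dir, exist_ok=True)
    digest = hashlib.sha1()
    # mkstemp picks a random, unused file name inside the upload directory.
    fd, temp_path = tempfile.mkstemp(dir=incoming_dir)
    try:
        with os.fdopen(fd, "wb") as spool:
            for chunk in iter(lambda: stream.read(chunk_size), b""):
                digest.update(chunk)  # hash while the data streams in
                spool.write(chunk)
        identifier = base64.b32encode(digest.digest()).decode("ascii")
        # Hypothetical layout: first two characters select the sub-directory.
        target_dir = os.path.join(store_dir, identifier[:2])
        os.makedirs(target_dir, exist_ok=True)
        final_path = os.path.join(target_dir, identifier)
        # A duplicate upload replaces a file with identical bytes,
        # so the object is still stored only once.
        os.replace(temp_path, final_path)
    except BaseException:
        if os.path.exists(temp_path):
            os.remove(temp_path)
        raise
    return identifier, final_path


root = tempfile.mkdtemp()
store_dir = os.path.join(root, "store")
incoming_dir = os.path.join(root, "incoming")
first = store_upload(io.BytesIO(b"some binary"), store_dir, incoming_dir)
second = store_upload(io.BytesIO(b"some binary"), store_dir, incoming_dir)
print(first == second)
```

Keeping the upload directory on the same filesystem as the store lets the final move be a single rename, so a partially received object never appears under a valid identifier.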

The database

The metadata about the files is stored in the database.

This database can be any of the database systems that SQLAlchemy supports, so you can scale from a maintenance-free SQLite to a full-blown PostgreSQL, MySQL or MSSQL (see http://www.sqlalchemy.org/docs/04/dbengine.html#dbengine_supported for the entire list).

The registry table

This is the main table containing all the files known in the Steward instance.

Name         Type         Observations
-----------  -----------  ------------
sha (PK)     varchar(32)  Identifier of the object in the store.
length       integer      Length in bytes of the object.
mimetype     text(64)     MIME type of the object.
date_stored  timestamp    Moment that the object was made available in the store.
status       varchar(7)   Current status of the object. Can be ‘active’, ‘deleted’ or ‘purged’.

The statuses have the following meanings:

  • active: The object can be retrieved from the Steward instance.
  • deleted: The object has been marked for deletion, but is still physically available in the store.
  • purged: An object with this metadata once existed, but it has been removed and is no longer physically in the store.
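To make the table and its statuses concrete, here is a sketch that creates an equivalent table in an in-memory SQLite database. The table name registry is taken from the heading; the NOT NULL, DEFAULT and CHECK constraints are assumptions added for illustration and may not match the schema Steward actually creates.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE registry (
        sha         VARCHAR(32) PRIMARY KEY,
        length      INTEGER NOT NULL,
        mimetype    TEXT,
        date_stored TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        -- Assumed constraint: only the three documented statuses are valid.
        status      VARCHAR(7) NOT NULL DEFAULT 'active'
                    CHECK (status IN ('active', 'deleted', 'purged'))
    )
    """
)

sha = "A" * 32  # placeholder identifier, not a real hash
conn.execute(
    "INSERT INTO registry (sha, length, mimetype) VALUES (?, ?, ?)",
    (sha, 11, "text/plain"),
)

# Mark the object for deletion; the row itself stays in the table.
conn.execute("UPDATE registry SET status = 'deleted' WHERE sha = ?", (sha,))
status = conn.execute(
    "SELECT status FROM registry WHERE sha = ?", (sha,)
).fetchone()[0]

# An undocumented status is rejected by the CHECK constraint.
try:
    conn.execute("UPDATE registry SET status = 'archived' WHERE sha = ?", (sha,))
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

print(status, rejected)
```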

The events table

Each access to or change in the Steward database is registered in a simple table called ‘events’.

Name         Type         Observations
-----------  -----------  ------------
id (PK)      integer      Internal number, never used outside of the database.
date         timestamp    Moment of the event.
register_id  varchar(32)  The object the event refers to. (FK with registry.sha)
action       varchar(6)   Kind of action. This can be one of ‘get’, ‘save’, ‘check’, ‘delete’, ‘purge’.
user         varchar(32)  The user that performed the action (if authenticated).
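A sketch of how such an event log might be written and read back, again using in-memory SQLite. The helper log_event, the cut-down registry table and the CHECK constraint on the action are hypothetical; the column names follow the table above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only on request
conn.executescript(
    """
    CREATE TABLE registry (sha VARCHAR(32) PRIMARY KEY);
    CREATE TABLE events (
        id          INTEGER PRIMARY KEY,
        "date"      TIMESTAMP DEFAULT CURRENT_TIMESTAMP,
        register_id VARCHAR(32) REFERENCES registry (sha),
        "action"    VARCHAR(6)
                    CHECK ("action" IN ('get', 'save', 'check', 'delete', 'purge')),
        "user"      VARCHAR(32)
    );
    """
)


def log_event(conn, sha, action, user=None):
    """Record one event; user stays NULL for unauthenticated access."""
    conn.execute(
        'INSERT INTO events (register_id, "action", "user") VALUES (?, ?, ?)',
        (sha, action, user),
    )


sha = "B" * 32  # placeholder identifier, not a real hash
conn.execute("INSERT INTO registry (sha) VALUES (?)", (sha,))
log_event(conn, sha, "save", "alice")
log_event(conn, sha, "get")  # anonymous access

history = conn.execute(
    'SELECT "action", "user" FROM events WHERE register_id = ? ORDER BY id',
    (sha,),
).fetchall()
print(history)
```

The column names date, action and user are quoted because some database systems treat them as keywords.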

The alias table

Sometimes it is desirable to associate an easy-to-remember name with a file. This can be done via aliases. The table below maintains the aliases.

Name         Type          Observations
-----------  ------------  ------------
id (PK)      integer       Internal number, never used outside of the database.
name         varchar(255)  The alias itself.
register_id  varchar(32)   The object the alias points to. (FK with registry.sha)
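Resolving an alias to an object identifier then amounts to a join against the registry. The sketch below assumes that alias names are unique and that only ‘active’ objects should resolve; neither rule is stated in this document, and the function name resolve_alias is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE registry (sha VARCHAR(32) PRIMARY KEY, status VARCHAR(7));
    CREATE TABLE alias (
        id          INTEGER PRIMARY KEY,
        name        VARCHAR(255) UNIQUE,  -- uniqueness is an assumption
        register_id VARCHAR(32) REFERENCES registry (sha)
    );
    """
)


def resolve_alias(conn, name):
    """Return the identifier an alias points to, or None if it cannot resolve."""
    row = conn.execute(
        """
        SELECT registry.sha
        FROM alias
        JOIN registry ON registry.sha = alias.register_id
        WHERE alias.name = ? AND registry.status = 'active'
        """,
        (name,),
    ).fetchone()
    return row[0] if row else None


sha = "C" * 32  # placeholder identifier, not a real hash
conn.execute("INSERT INTO registry (sha, status) VALUES (?, 'active')", (sha,))
conn.execute("INSERT INTO alias (name, register_id) VALUES (?, ?)", ("logo", sha))

print(resolve_alias(conn, "logo"))
print(resolve_alias(conn, "missing"))
```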

The stewardversion table

Steward tries to detect the version of the database schema each time the server starts up. It does this by checking the stewardversion table. If it detects a version different from what it expects, it will fail.

Name     Type          Observations
-------  ------------  ------------
id (PK)  integer       Internal number, never used outside of the database.
schema   integer       The number of the schema. Higher is later.
version  varchar(255)  The version of the software that installed that version of the schema.
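A start-up check along these lines could look like the sketch below. The expected schema number, the exception class SchemaMismatch, the function check_schema and the choice to compare against the highest recorded schema number are all assumptions; the document only states that Steward fails on a mismatch.

```python
import sqlite3

EXPECTED_SCHEMA = 3  # hypothetical value compiled into the server


class SchemaMismatch(RuntimeError):
    """Raised when the database schema is not the one the server expects."""


def check_schema(conn, expected=EXPECTED_SCHEMA):
    # Assumption: the highest schema number is the currently installed one.
    found = conn.execute('SELECT MAX("schema") FROM stewardversion').fetchone()[0]
    if found != expected:
        raise SchemaMismatch(f"expected schema {expected}, found {found}")
    return found


conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE stewardversion (
        id       INTEGER PRIMARY KEY,
        "schema" INTEGER,
        version  VARCHAR(255)
    )
    """
)
# Placeholder version strings, for illustration only.
conn.executemany(
    'INSERT INTO stewardversion ("schema", version) VALUES (?, ?)',
    [(1, "0.1"), (2, "0.2"), (3, "0.3")],
)

print(check_schema(conn))
```

An empty stewardversion table yields no schema number at all, which this sketch also treats as a mismatch.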