Why several Database Systems?
It’s no secret that I do most of my daily work in a Python environment. In my previous blog entries I already gave some insights into the variety of data-processing tools it provides. But besides processing, the storage of data is a playing field with large potential for optimization.
We usually use so-called relational databases (most famous candidate: MySQL) in our data analysis tasks. These store data sets in tables with well-defined columns and relations to one another.
MongoDB, however, is a document-oriented database system: there are no tables, only collections of documents. The purpose of the former lies mostly in grouping and assigning indices, while the latter represent the individual data sets as data objects that may include various levels of nesting.
Which database concept is superior depends largely on the intended purpose: in a relational database you always know the type of a column’s value – and even whether the column exists at all – which can be an enormous advantage for the efficient analysis of large quantities of data. The document-oriented database, on the other hand, offers for example:
- Each object can be extended by arbitrary attributes.
- Hierarchical structures can be mapped directly.
- It’s unnecessary to define a structure beforehand.
This implies that it may be used easily ad hoc – you don’t have to define a structure in the database itself, as it is already part of the saved object. Furthermore, many objects in Python are structured hierarchically, which enables a data conversion with minimal friction in both directions.
It therefore has multiple areas of application where relational databases would yield enormous overhead. For some of them this article will provide Python recipes, following an insight into the interface of MongoDB for Python:
- Caching: Especially our webserver’s on-demand reports benefit from intelligent caching that stores required objects with an expiry date in the database.
- Logging: We connect the native logging module of Python with our MongoDB to be able to store, search and load arbitrary information.
The database system has a long list of other features that I won’t touch on in this article. Its main purpose is therefore to help users with some Python experience along their first steps with MongoDB.
There are several cloud services providing MongoDB access, but for first steps and many small applications in your intranet a local installation is easily sufficient. It’s available for all relevant operating systems and usually a matter of a couple of minutes.
The current standard configuration does not use authentication, and access is only allowed locally – so you can start using it right away. However, it must be highlighted that it is your responsibility to verify this before storing sensitive data or permanently exposing an interface to your network!
The steps to enforce authentication are best described in the official documentation. As a newbie it might help you to know that users only exist within individual databases: a user in database “test” exists only there and can therefore only authenticate there – excluded from this rule are users on the “admin” database, who have global rights. It might also be confusing that MongoDB creates databases on demand, but does not save them permanently if they stay empty.
The interface for Python is named PyMongo and can be easily installed via pip on systems with a properly configured Python environment:
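On a system with pip available, the installation is a one-liner:

```shell
pip install pymongo
```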
The base class that manages all operations on the database system is MongoClient. To establish a connection, you only need the hostname (default: “localhost”) and the port (default: 27017):
The most important method of the client object is get_database. For administration purposes you can, for example, get the names of all databases with list_database_names (called database_names in older pymongo versions), but you need special rights for that, and usually only one database is needed in a single session. MongoClient can also be used as a context manager (i.e. a with statement in Python), which closes the connection automatically by calling the close method:
The Database class represents an individual database. If it does not exist, it is simply created as needed. It is the database against which you have to authenticate – if the server is configured that way (recent pymongo versions take the credentials on the client; older ones offered an authenticate method on the Database object):
This is the object that you will use most of the time. It is therefore quite handy to include it in a context manager:
Here, the function get_default_mongo_database uses authentication data that obviously needs to be set beforehand for the respective database. For convenience it will be used in many of the following examples, and it may be implemented in many different ways.
The elements of the database are the collections, which group data sets by topic and have additional purposes like managing indices. The list_collection_names method of a database (collection_names in older pymongo versions) returns a list of all existing collections. Just like databases, they are created on demand and only saved if they don’t stay empty. They are accessed via name, which is possible in multiple ways:
As documents are converted directly to Python dictionaries in query responses, collections are already the lowest-level objects of the pymongo module that are relevant for this article. They are the starting point for all search, delete and replacement operations on data sets.
Data sets within MongoDB are based on BSON, which – except for some details (e.g. datetime fields) – corresponds to a binary JSON format. Therefore each data set is convertible to a Python dictionary with any level of nesting and unambiguous data types. Some peculiarities apply, however:
- A key/value pair in a dictionary is called a field.
- As in JSON, field names must be strings.
- BSON knows datetime field values, which convert to Python datetime.datetime objects – and vice versa.
- Field names must not contain whitespace or dots, nor begin with a “$” sign.
- Each entry contains a unique “_id” field, which is created automatically if not provided.
In order to insert documents there is a choice between single execution with insert_one and batch execution with insert_many:
The collection object used here is of course a previously created instance of a pymongo Collection. The insert_* methods return objects containing the results of the operation (as do all delete and replacement methods later on). The value of acknowledged indicates the success of the operation, and inserted_id(s) provides the _ids assigned to the objects within the database, for their unambiguous identification.
Finding and reading documents is the task of the find_one and find (not find_many!) methods. The former returns exactly one document as a dictionary (if it could find one at all), while the latter returns a Cursor object, which allows iterating over the found documents or applying further operations, like sorting them with sort() or counting the matches (via the cursor’s count() in older pymongo versions; recent ones use count_documents on the collection instead). Otherwise the handling of the two find methods is identical.
So find_one always returns the first matching document. Called without arguments, any document matches. The first and most important argument is the filter, which itself is a JSON-compatible dictionary, just as the second one – the projection. This is used to reduce the returned documents to certain fields. It is always a dictionary, whose keys are field names, and the values are either True or False – depending on whether the field is desired or not. If a field is set True, all unmentioned fields are automatically False – except for _id, which has to be set False explicitly – turning the argument into a whitelist instead of a blacklist. I’ll be using this in the following for demonstration and to improve readability of the results:
The first two examples are simple queries by exact field value. From the third example on you’ll see the syntax “FieldA.FieldB” regularly: it is used to progress along nested documents and therefore filter by field values in deeper levels of the hierarchy. The fourth example introduces operators, identified by a leading “$” sign. First a string pattern is searched using “$regex”, then the existence of a particular field is tested via “$exists”, and finally “$or” is used to logically connect the check for a field’s existence with its value. There are many more operators for you to explore.
Extending the pattern of the inserts, there are two methods for deletion: delete_one and delete_many. Instead of an object to insert they take a filter as argument and delete the first or all matching documents respectively. Their return value is an object with the result, whose deleted_count property gives the number of deleted entries.
Replace and update
For replacements there is only the method replace_one, which takes a filter and the replacement document as arguments and replaces the first match. Of interest here is the optional upsert argument, which causes the replacement document to be written even if the filter doesn’t match any document.
Updates are more versatile: there are update_one and update_many, both taking a filter as argument and modifying the first or all matches respectively. The second argument is the update operation: it can set field values directly, or modify them based on the current values, for example by incrementing them or – in the case of an array – adding or removing elements.
The optional upsert argument exists here too, in order to guarantee, that at least one matching document exists after the operation.
You can find more detailed information on read and write processes in the documentation of Collections.
A simple but common scenario is the storing and loading of Python objects, where a possibly existing entry should be overwritten so that you can, for example, store intermediate states of a complex process. For that purpose, one can store the object as a field value in a document of a certain collection and add a field for identification:
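A minimal sketch; the helper names and the pickling flag are illustrative, and the collection is any pymongo Collection:

```python
import pickle

def save_object(collection, name, obj, pickling=False):
    """Store obj in the "data" field of the document identified by name.

    With pickling=True the object is serialized first, which also covers
    types BSON cannot represent directly (e.g. pandas.DataFrame).
    """
    data = pickle.dumps(obj) if pickling else obj
    collection.replace_one({"name": name},
                           {"name": name, "data": data},
                           upsert=True)

def load_object(collection, name, pickling=False):
    """Load the object stored under name, or None if there is no entry."""
    document = collection.find_one({"name": name})
    if document is None:
        return None
    return pickle.loads(document["data"]) if pickling else document["data"]
```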
This way one only has to choose the name of a collection and the name for the entry, and the object will be written into the “data” field of a document – or loaded from it. This works directly for many Python objects, and for more complex ones (e.g. pandas.DataFrame) the pickling flag enforces a conversion to a string.
We can now expand this principle: many of my reports are generated on demand in a Flask webserver – as shown in my last article – which can consume substantial processing power. As they are requested multiple times a day by different people, there is a lot of overlap, e.g. in request parameters or underlying data sets. So why not simply store intermediate states of data or even full websites together with a timestamp in a MongoDB, and add a grace period to the requesting method? In total it looks like this:
The entry point is the get_cached_object function. In your procedure it replaces a function call with certain parameters that would return the desired object (DataFrame, HTML string etc.). The replaced function becomes the fcn argument, and its arguments are simply passed as keyword arguments. This is completed by passing a name for identification and a grace period in hours. The collection “website_cache” is hardcoded here. If the grace period is zero or negative, the cache (= MongoDB) is not queried. Otherwise _mongodb_get_entry searches for an entry with the given name and the same arguments, verifies that the timestamp lies within the grace period and – if successful – returns the data object, which is then unpickled. If no data set could be retrieved from the cache, it is created by calling the function and then stored in the database via _mongodb_upsert_entry as a pickle string, along with all necessary information. Its usage is demonstrated in the following:
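A usage sketch, assuming a get_cached_object as described above; expensive_report and its parameters are hypothetical:

```python
def expensive_report(customer, month):
    """Stand-in for a report that is costly to compute."""
    return "<html>report for {}, {}</html>".format(customer, month)

if __name__ == "__main__":
    # the first call computes and caches; repeated calls within 12 hours
    # are served from the "website_cache" collection
    html = get_cached_object(expensive_report,
                             name="customer_report",
                             grace_period=12,
                             customer="ACME",
                             month="2020-01")
    print(html)
```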
It is worth mentioning here that in rapidly growing collections it can quickly become beneficial to create indices using the create_index method.
With the logging module, Python provides a very flexible logging interface that you can also connect to a MongoDB. This way, access to logs is detached from the file system of the logging computer, and the logs can be browsed efficiently without command line voodoo. The core element of such a connection is a logging.Handler, which structures the information and writes it to the MongoDB:
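One possible sketch of such a handler; here the Database object is passed in at construction time, which keeps the connection details out of the class:

```python
import logging
from datetime import datetime

class MongoHandler(logging.Handler):
    """Log handler writing records to a MongoDB database.

    The target collection is "log_" plus the logger name, with dots
    replaced because they cause trouble in collection names.
    """

    def __init__(self, database):
        super().__init__()
        self.database = database  # a pymongo Database instance

    def emit(self, record):
        entry = {
            "timestamp": datetime.now(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            # save the stack traceback if an exception is attached
            formatter = logging.Formatter()
            entry["traceback"] = formatter.formatException(record.exc_info)
        collection_name = "log_" + record.name.replace(".", "_")
        self.database[collection_name].insert_one(entry)
```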
In case of an exception, the handler saves the available stack traceback. As collection name it chooses “log_” plus the name of the logger – which by Python convention is individual for each module. The replacement of dots in the name is necessary, as they cause trouble in collection names. One way to attach this logger to a module is by adding
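```python
import logging

logger = logging.getLogger(__name__)
logger.setLevel(logging.DEBUG)
# attach the MongoDB handler sketched above; the database object to pass
# in depends on your connection setup, so this line is a placeholder
# logger.addHandler(MongoHandler(db))
```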
to its head. One shortcoming here is that __name__ as the name of the logger can also be “__main__” if the module itself is executed, which leads to the probably unwanted collection “log___main__”.
Now you just need access to the concerning database in MongoDB to search specific log entries:
In larger-scale productive systems there are often obvious arguments for and against certain database types, and a quick migration to another type is an improbable scenario. However, it takes only little effort and no powerful computing infrastructure to launch a MongoDB instance capable of supporting many small but useful tasks within a Python environment. Therefore – even in computing environments with otherwise contrary requirements concerning data storage – it may very well pay off to take this peek beyond the horizon. If you have questions or ideas, or want to share your experiences, I’m looking forward to your comments!