What is Azure Cosmos DB in general?
Cosmos DB is a document database with an API modeled on the SQL query language, which distinguishes it from other NoSQL databases. It is worth mentioning that this is not the only API for Cosmos DB, but the SQL API is the recommended one and, in fact, the most popular option among developers. The SQL syntax has its limits, though. For example, joins across different documents are not possible.
In MongoDB, support for SQL queries is available as a preview feature.
Data modeling. Forget about things you learned using relational databases.
The most important thing when you start your adventure with Cosmos DB is how to organize your data. In a SQL database we mostly group data into entity tables, so each type has its own table: Users, Orders, Products, and so on. In Cosmos DB this is not a good approach, because the data in the database is split into partitions, and each partition has its own partition key value. Operations such as queries are efficient within a partition boundary, and of course we get the best performance when we query by the partition key. Querying across more than one partition has its price, both in efficiency and in money.
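To make the partition boundary concrete, here is a minimal in-memory sketch. It is not the Cosmos DB SDK: `PartitionedContainer`, the `userId` partition key, and the sample documents are all hypothetical, chosen only to show that a query scoped to one partition key touches a single partition, while the same query without a key has to visit every partition.

```python
from collections import defaultdict


class PartitionedContainer:
    """Toy stand-in for a partitioned container (hypothetical, not the SDK)."""

    def __init__(self, partition_key_field):
        self.partition_key_field = partition_key_field
        self.partitions = defaultdict(list)  # partition key value -> items

    def insert(self, item):
        self.partitions[item[self.partition_key_field]].append(item)

    def query(self, predicate, partition_key=None):
        """Return (matching items, number of partitions scanned)."""
        if partition_key is not None:
            # Scoped query: only the partition that owns this key is read.
            targets = [partition_key] if partition_key in self.partitions else []
        else:
            # Cross-partition query: every partition has to be visited.
            targets = list(self.partitions)
        hits = [
            item
            for key in targets
            for item in self.partitions[key]
            if predicate(item)
        ]
        return hits, len(targets)


# Users and their orders live together, partitioned by userId,
# instead of sitting in separate "tables" per entity type.
container = PartitionedContainer(partition_key_field="userId")
container.insert({"id": "u1", "type": "user", "userId": "u1", "name": "Ann"})
container.insert({"id": "o1", "type": "order", "userId": "u1", "total": 30})
container.insert({"id": "o2", "type": "order", "userId": "u1", "total": 12})
container.insert({"id": "u2", "type": "user", "userId": "u2", "name": "Bob"})
container.insert({"id": "o3", "type": "order", "userId": "u2", "total": 99})


def is_order(item):
    return item["type"] == "order"


in_partition, scanned_one = container.query(is_order, partition_key="u1")
cross_partition, scanned_all = container.query(is_order)
```

With this toy data, fetching the orders of user `u1` scans one partition, while fetching all orders scans both. The real service adds indexing and request routing on top, so treat this only as a picture of why the partition key matters.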
A few more things about partitions.
As I wrote, partition keys let us query data in the most efficient way. The partition key also affects how data is distributed across physical partitions (in other words, how it is stored on the servers' drives). This is important because, as Microsoft says, Cosmos DB works best when data is spread evenly across partitions. One more related thing: Microsoft sets a limit on the size of a single logical partition, which is 20 GB.
As you see, it is extremely important to define good partition keys, especially because the partition key cannot be changed once the container has been created. A mistake here means creating a new container and migrating the data. Yep, try to imagine this on your production database fully loaded with data.
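One way to sanity-check a candidate partition key before committing to it is to group sample data by that key and look at the resulting partition sizes. The sketch below assumes a made-up dataset (two tenants, 1 MB per order) and a hypothetical helper `partition_report`; only the 20 GB limit comes from the text above.

```python
from collections import defaultdict

# 20 GB logical partition limit quoted in the text, expressed in MB.
LOGICAL_PARTITION_LIMIT_MB = 20 * 1024


def partition_report(items, key_field):
    """Summarize how evenly `key_field` would spread the given items."""
    sizes = defaultdict(int)
    for item in items:
        sizes[item[key_field]] += item["size_mb"]
    mean = sum(sizes.values()) / len(sizes)
    return {
        "partitions": len(sizes),
        # 1.0 means perfectly even; higher means one partition is "hot".
        "skew": max(sizes.values()) / mean,
        "over_limit": sorted(
            key for key, size in sizes.items()
            if size > LOGICAL_PARTITION_LIMIT_MB
        ),
    }


# Made-up sample: one large tenant and one small tenant, 1 MB per order.
items = [
    {"tenant": "big", "orderId": f"b{i}", "size_mb": 1} for i in range(25_000)
] + [
    {"tenant": "small", "orderId": f"s{i}", "size_mb": 1} for i in range(2_000)
]

by_tenant = partition_report(items, "tenant")
by_order = partition_report(items, "orderId")
```

In this invented dataset, partitioning by `tenant` puts the large tenant over the limit and leaves the load skewed, while partitioning by `orderId` spreads it evenly. The trade-off is that queries for "all orders of a tenant" would then cross partitions, so the right key depends on the queries you actually run.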
The costs were the thing that put me off using Cosmos DB. Or rather, not so much the costs themselves as their unpredictability and the difficulty of estimating them upfront. They are still something of a mystery to me. Generally, most of the money we pay goes to operations: reading, inserting, upserting, deleting, and querying. A read is when we retrieve an item from the database directly by its ID and partition key. It is the cheapest operation: for a small item (about 1 KB) it consumes 1 RU (Request Unit), the billing unit of Cosmos DB. A query, i.e. a read operation that does not use the partition key, may be more resource-consuming, which makes query costs really hard to estimate. Inserts, upserts, and deletes consume more RUs the more indexes have to be updated during the operation. As you see, something like adding an index to a container changes the cost of CRUD operations. This is new when we compare it with, for example, Azure SQL Server pricing.
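The arithmetic behind such an estimate is simple once the per-operation charges are known. In the sketch below only the 1 RU point read comes from the text; every other charge is a placeholder I made up for illustration and should be replaced with values measured on your own data. The function name `rus_per_second` and the workload numbers are hypothetical as well.

```python
# RU charge per single operation.
# Only "point_read" comes from the text; the rest are PLACEHOLDERS.
ASSUMED_RU = {
    "point_read": 1.0,
    "query": 5.0,    # placeholder, depends heavily on the query
    "insert": 8.0,   # placeholder, grows with the number of indexed paths
    "upsert": 10.0,  # placeholder
    "delete": 8.0,   # placeholder
}


def rus_per_second(ops_per_second, ru_per_op):
    """Total RUs consumed each second by a mix of operations."""
    return sum(ru_per_op[op] * rate for op, rate in ops_per_second.items())


# Hypothetical workload: operations per second of each kind.
workload = {"point_read": 200, "query": 20, "insert": 10}

baseline = rus_per_second(workload, ASSUMED_RU)

# Same workload after adding an index that (by assumption) makes inserts pricier.
with_extra_index = rus_per_second(workload, {**ASSUMED_RU, "insert": 12.0})
```

With these placeholder numbers the baseline comes to 200·1 + 20·5 + 10·8 = 380 RUs per second, and the extra index adds 40 more. The point is not the numbers themselves but that a change in indexing shifts the bill for every write.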
How to estimate the costs?
Calculating RUs without any data in the database may be hard. So maybe we should deploy to production and wait to see what happens? Yeah… we could, but there is one more option. The SDK helps here: it lets us read the cost of each individual database operation. With a small effort, this allows us to run simulations that return a cost summary programmatically. We can prepare different data models, run some simulated inbound and outbound traffic against each of them, and get the total cost of each as soon as the simulation ends. This, in my opinion, is the smartest way to estimate the overall cost of a given data model before that cost becomes a problem in production.
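A minimal sketch of such a simulation in Python follows. It assumes the `azure-cosmos` package, where the charge of the most recent request can be read from `container.client_connection.last_response_headers["x-ms-request-charge"]`; check this against the SDK version you use. The helpers `charged` and `simulate` are hypothetical names, and `FakeContainer` is an offline stand-in with made-up charges so the sketch runs without a Cosmos DB account.

```python
from collections import defaultdict

CHARGE_HEADER = "x-ms-request-charge"


def charged(container, operation):
    """Run operation(container) and return (result, RU charge of that call)."""
    result = operation(container)
    headers = container.client_connection.last_response_headers
    return result, float(headers[CHARGE_HEADER])


def simulate(container, operations):
    """operations: iterable of (label, callable). Returns total RUs per label."""
    totals = defaultdict(float)
    for label, operation in operations:
        _, ru = charged(container, operation)
        totals[label] += ru
    return dict(totals)


class _FakeConnection:
    def __init__(self):
        self.last_response_headers = {}


class FakeContainer:
    """Offline stand-in for a real container; the charges are MADE UP."""

    def __init__(self):
        self.client_connection = _FakeConnection()
        self.items = {}

    def _charge(self, ru):
        self.client_connection.last_response_headers = {CHARGE_HEADER: str(ru)}

    def upsert_item(self, body):
        self.items[body["id"]] = body
        self._charge(10.0)  # made-up write charge
        return body

    def read_item(self, item, partition_key):
        self._charge(1.0)  # point read, 1 RU as in the text
        return self.items[item]


container = FakeContainer()
traffic = [
    ("upsert", lambda c: c.upsert_item({"id": "o1", "userId": "u1"})),
    ("upsert", lambda c: c.upsert_item({"id": "o2", "userId": "u1"})),
    ("read", lambda c: c.read_item(item="o1", partition_key="u1")),
]
summary = simulate(container, traffic)
```

To compare data models, you would swap `FakeContainer` for real containers (one per candidate model), replay the same simulated traffic against each, and compare the summaries. Queries need a bit more care than this sketch shows, because their results come back in pages and each page carries its own charge.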