One of Studio 3T’s strengths is our support for SQL: for direct queries written in SQL, for simple imports of SQL tables, and also for more complex migrations of entire databases between MongoDB and SQL.
The import and migration tools help data flow smoothly between MongoDB and SQL databases such as MySQL, Microsoft SQL Server, Oracle, PostgreSQL, Sybase and IBM DB2. In the recent 2021.5 release of Studio 3T, the SQL feature received a performance boost. We wanted to know more, so we sat down with Hugo Almeida, developer lead on the SQL connections.
Evaluating Speeds
Was this a response to a specific problem, or standalone performance work? It turns out that migrating a single table was fine, but anything with relationships could slow the migration down. The slowdown grew with the network distance between the machine running Studio 3T and the SQL database server, which shows up as higher latency on every call.
Hugo explained how the SQL to MongoDB migration tool dealt with building collections. When there were two tables with a relationship, Studio 3T would get one record from the first table, then look up the related records in the second table, build the document, write it and move on to the next record. These simple atomic operations work well with the database, keeping I/O and compute load down. More relationships mean, of course, more secondary queries.
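The record-at-a-time pattern described above can be sketched as follows. This is only an illustration, not Studio 3T's code: the `customers` and `orders` tables, the `make_demo_db` helper and the in-memory SQLite database are all assumptions made for the example, and the parent rows are fetched in one query for brevity. The point is the query count: one secondary query per parent record.

```python
import sqlite3


def make_demo_db():
    """Create a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO orders VALUES (1, 1, 'keyboard'), (2, 1, 'mouse'), (3, 3, 'monitor');
    """)
    return conn


def migrate_row_by_row(conn):
    """Build one document per customer, issuing one lookup query per record."""
    documents = []
    queries = 0

    customers = conn.execute("SELECT id, name FROM customers ORDER BY id").fetchall()
    queries += 1

    for customer_id, name in customers:
        # One secondary query per parent record: each one pays the full
        # network round trip when the server is far away.
        orders = conn.execute(
            "SELECT item FROM orders WHERE customer_id = ? ORDER BY id",
            (customer_id,),
        ).fetchall()
        queries += 1
        documents.append(
            {"_id": customer_id, "name": name, "orders": [row[0] for row in orders]}
        )

    return documents, queries


if __name__ == "__main__":
    docs, query_count = migrate_row_by_row(make_demo_db())
    print(query_count, docs)
```

With three customers this issues four queries, and the count grows linearly with the number of records and relationships.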
But, while this approach works well with local SQL databases, the addition of latency into the process had a dramatic effect on the migration performance, pushing what should have been a quick transfer up into the hours or even days. Every lookup ends up waiting for at least the connection latency time to get a response, and the more relationships there are, the more waiting there is.
Doing the math
“If you just do simple math,” Hugo explained, “now, imagine each round trip is like two seconds, which it isn’t, but for simplicity’s sake let’s say that, and the query itself takes one second, then it takes three seconds to get a result. Then for 100 records, that would be three hundred seconds.” Give each record relationships to two other tables and every record needs three queries instead of one, so you are now up to 900 seconds. And the vast majority of that time is spent waiting for requests to travel across the Internet.
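Hugo's back-of-the-envelope figures can be reproduced with a few lines of Python. The function name and parameters here are made up for illustration; the two-second round trip and one-second query are the deliberately exaggerated numbers from the example above.

```python
def row_by_row_seconds(records, relationships, round_trip=2.0, query=1.0):
    """Estimate total time when every record and every lookup is its own request."""
    # One query for the record itself, plus one per related table.
    queries_per_record = 1 + relationships
    return records * queries_per_record * (round_trip + query)


if __name__ == "__main__":
    print(row_by_row_seconds(100, relationships=0))  # single table
    print(row_by_row_seconds(100, relationships=2))  # two related tables
```

Of the 900 seconds in the second case, 600 are pure round-trip waiting under these assumptions.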
Now, one approach might be to make the relationships happen with an SQL join at the server end. This turns out to be a bad idea though; you not only move the compute/I/O load up to the server, but to get any value out of doing that, you have to perform the query for a substantial number of records. If they have any number of relationships, that’s going to potentially impact your server performance and that’s something we don’t want to see with production servers.
Batch to the future
The solution was to batch requests in a smart way. Rather than requesting records one at a time, the team moved to requesting them in large batches of 1,000 or more. For each relationship, Studio 3T scans the batch for the IDs of the related records and crafts a single query that selects the appropriate records for the entire batch. “Going back to our example, when we batch everything together the round trip time for the batch is the same as a query for a single record, and even if the query takes longer, 300 seconds becomes 3 seconds for a simple case, and 900 seconds becomes 10 seconds. It’s a huge difference.”
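A minimal sketch of that batching idea, again using a made-up `customers`/`orders` schema and an in-memory SQLite database rather than anything from Studio 3T itself: read a batch of parent rows, collect their IDs, and fetch all related rows for the batch with a single `IN (...)` query. The default batch size is kept below 1,000 here only because some SQLite builds cap bind parameters at 999; that limit is specific to this demo.

```python
import sqlite3
from collections import defaultdict


def make_demo_db():
    """Create a tiny in-memory database (hypothetical schema for illustration)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, item TEXT);
        INSERT INTO customers VALUES (1, 'Ada'), (2, 'Grace'), (3, 'Linus');
        INSERT INTO orders VALUES (1, 1, 'keyboard'), (2, 1, 'mouse'), (3, 3, 'monitor');
    """)
    return conn


def migrate_in_batches(conn, batch_size=500):
    """Build documents using one lookup query per batch instead of per record."""
    documents = []
    queries = 0

    cursor = conn.execute("SELECT id, name FROM customers ORDER BY id")
    queries += 1

    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break

        # Scan the batch for the IDs needed by the relationship.
        ids = [row[0] for row in batch]
        placeholders = ", ".join("?" for _ in ids)

        # One simple query (no join) fetches related rows for the whole batch.
        related = conn.execute(
            "SELECT customer_id, item FROM orders "
            f"WHERE customer_id IN ({placeholders}) ORDER BY id",
            ids,
        ).fetchall()
        queries += 1

        # Group the related rows by parent ID, then assemble the documents locally.
        orders_by_customer = defaultdict(list)
        for customer_id, item in related:
            orders_by_customer[customer_id].append(item)

        for customer_id, name in batch:
            documents.append(
                {"_id": customer_id, "name": name, "orders": orders_by_customer[customer_id]}
            )

    return documents, queries


if __name__ == "__main__":
    docs, query_count = migrate_in_batches(make_demo_db())
    print(query_count, docs)
```

The number of round trips now depends on the number of batches rather than the number of records, while each query stays a plain indexed lookup on the server.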
The queries being made on the server are also essentially simple queries so they should place minimal load on the server, unlike a complex join, and be safe to run alongside production workloads.
With this performance enhancement in place in 2021.5, we are looking forward to people being able to migrate their SQL data to MongoDB faster and from further afield than before.
Not just migrations
This isn’t the only performance optimization work going on in Studio 3T’s SQL stack. Hugo explained that the team have been diving into various aspects of vendors’ SQL drivers and isolating issues within them too. Probably the most spectacular of those issues came with the Oracle driver. Its performance for import and migration was weak; in one test, importing two million records from Oracle was taking over four hours. In the forthcoming 2021.6 release of Studio 3T, that import took six minutes. It’s a huge boost and we aim to keep pushing the performance envelope on our import and export operations to SQL databases going forward.