I recently received a message on LinkedIn congratulating me on five years in my current job. This led me to reflect on the developments I’ve witnessed, and I have some thoughts I want to share with you. In this first article, I want to reflect on QVD files and what they’re worth in the current Qlik and data world. I will describe my perception of current developments in data storage and how and why QVD files could become redundant – and why they probably won’t. I’d be happy to hear your thoughts in the comments! Full disclaimer: it might get technical.
For the uninitiated (read: non-Qlik developers)
In case you’re new or foreign to the Qlik ecosystem, here’s the primer: Qlik Sense is a business intelligence tool that evolved from QlikView. Both systems share an engine (the QIX engine, a.k.a. The Beast) that is built on in-memory technology, which allows rapid queries and – as a convincing argument to heads of IT and infrastructure – therefore doesn’t require an underlying database or data warehouse containing transformed, aggregated or consolidated data. Qlik Sense (and QlikView, which I will omit from now on, although most considerations apply to both) comes with a built-in Extract, Transform and Load (ETL) processor, mostly referred to as Qlik Script. Combined with a large number of connectors, it’s possible to mount Qlik Sense directly on many source systems and mix and mangle the data to an architect’s heart’s content. Qlik developers then have the option to store the memory image of a table as a binary file – a QVD file – on system storage. Since the data structure in this binary file resembles, in layman’s terms, the data structure in memory, storing and loading back into Qlik are lightning fast and – not to forget – highly compressed. Depending on cardinality, compression factors range from 10x to 100x. If you want to know more about this, or I forgot something, please leave a comment.
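The extract-and-store round trip is only a few lines of Qlik Script. Here is a minimal sketch – the connection, table and file names are made up for illustration:

```
// Pull a table from a connected source system into memory
Orders:
LOAD OrderID,
     CustomerID,
     Amount,
     OrderDate
FROM [lib://Source/orders.csv] (txt, utf8, embedded labels);

// Persist the in-memory image as a compressed binary QVD file
STORE Orders INTO [lib://QVD/Extract/Orders.qvd] (qvd);
```

A later script (or a different app) can then simply `LOAD ... FROM [lib://QVD/Extract/Orders.qvd] (qvd);` without touching the source system again.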
What I’ve used QVD files for
When I started working with Qlik, one of the first things I learned was how to layer my data pipeline within Qlik: source systems were replicated as QVD files to create a copy of the source tables to work and load from. This raw data could then, still using Qlik Sense, be transformed to create a logical layer which in some cases already resembles the final data model. Back in the day, these aggregated files were then mixed in a QlikView document to compose the data model (or data mart). Finally, these data models were loaded into one or more front-end applications to facilitate analytics. In our practice, we still do this, except for the data model layer, which has become somewhat awkward in Qlik Sense – there are approaches to still do this, or something equally efficient, but I don’t want to go into too much detail here.
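The layering described above can be sketched in Qlik Script as one STORE per layer – again with made-up names, and greatly simplified:

```
// Layer 1 (extract): raw 1:1 copy of the source table, stored as QVD
Orders_Raw:
LOAD * FROM [lib://QVD/Extract/Orders.qvd] (qvd);

// Layer 2 (transform): apply business logic on top of the raw QVD
Orders:
LOAD OrderID,
     CustomerID,
     Amount,
     Year(OrderDate) AS OrderYear      // derived field, illustrative only
RESIDENT Orders_Raw;

DROP TABLE Orders_Raw;

STORE Orders INTO [lib://QVD/Transform/Orders.qvd] (qvd);

// Layer 3: front-end apps load only the prepared Transform QVDs
```

Each layer reads only QVDs produced by the layer below it, so the source system is hit exactly once, in the extract layer.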
Why did we pretty much build data warehouses even though there were data warehouses? Speed, cost and convenience! A data warehousing project used to be long (months to years for creating pipelines, defining storage, modelling data and running batch jobs), expensive (6- to 7-digit projects and insane maintenance costs) and inflexible. No parentheses for the last point, because I want to elaborate a bit: getting a new dimension, measure or field into a cube in e.g. SAP Business Warehouse is a multistage effort that starts in the ECC and needs to be manually modelled all the way into the MultiProvider or BEx query it is supposed to be used in. This is far from a business user’s skillset (and probably one of the reasons SAP is a DAX company…). So when I say convenience, I mean: flexibility, ease of access and time to deploy.
Sounds like a pretty good deal Boris, why am I reading this?
Good question, and selling Qlik Sense would be easier if I didn’t ask myself the same thing. All the praise for QVDs is offset by some relevant downsides:
- Vendor lock-in. This may sound weird, but you cannot easily use QVD data in external applications. That was OK when the main purpose of data was using it in Qlik analytics, but times have changed. With AI, machine learning and other data science projects getting lined up, the question of central data storage becomes relevant again. When you use the layered model above, you do not only lock in raw data, but also business knowledge and logic by modelling the data in Qlik Sense. This can lead to redundancy should that logic be needed outside the Qlik ecosystem as well.
- Storage on hard drives is a data integrity nightmare. I’m not talking about access, even though that is a problem as well. I’m talking about file corruption, file backup and restore and all the other reasons companies moved from MS Access to MS SQL Server.
These are my two main gripes with an otherwise pretty solid system. The first point has certainly become more relevant these days, while the second one has always been relevant but was trumped (not my favourite word in 2020) by the benefits outlined above.
What has also changed is the availability of manageable, affordable and easy-to-deploy cloud storage solutions.
Most of you will have come across Snowflake, Microsoft Azure Synapse, Google BigQuery, Amazon Redshift, and all the other names that are selling convenience in data warehousing at an unprecedented level.
Your data in the cloud
At an entry point of $23 per month for a terabyte of storage (e.g. Snowflake as of the writing of this article), we cannot really speak of expensive data storage anymore. Where the cost comes in is at the compute layer. Any database operation that is run as a query – SELECT (data retrieval), DELETE (well… delete), but also UPDATE (I’m not doing this) or INSERT (…) – will be charged by resource consumption in a serverless architecture, or is run by a dedicated data warehouse virtual machine in the cloud. The available resources (read: query performance) are highly elastic, so you can tailor response times to the business requirements – at a cost. I don’t want to go into too much detail about cloud data warehouses and how they work; for further reading refer to an article by e.g. Julian Schulte, who posted a free version here.
These data warehouses come with substantial connectivity options to a multitude of data sources. Usability is well documented and intuitive – still not low-tech, but not a shell prompt either – and they come with built-in features for security and data governance. User access to data is straightforward, well organised and intuitive, to name just a few benefits. Most importantly, they are fully agnostic to how and where you are going to use your data.
This leads me to wonder whether data staging as we (in the Qlik world) know it is still the best way to go. While in the past we mostly told our customers to get rid of their data warehouses, because they no longer needed them for analytics, I would be hesitant to make such a suggestion today, with so many business cases and use cases for data now readily available.
So, no more QVD files? Ever?
Don’t be sad, but I think Qlik developers will have to take data consumption and usage into consideration. When designing data pipelines for analytics, it is important to ask oneself whether business logic is only required in Qlik Sense or whether it could possibly be required elsewhere. Whenever the answer is that it could be required elsewhere, modelling the use case in the data warehouse is probably the better choice. Qlik Script is still very powerful, and sometimes an immensely quicker time to value can be achieved by starting in Qlik and transitioning when necessary, but this is a decision analysts and architects will have to make.
What about the data in general? To be honest, I think having data in source systems and QVDs only is a risk when it comes to deploying future use cases and serving data to business users and data science teams in the organisation. This risk is compounded if the data in the source systems is subject to change or is regularly discarded – QVDs can buffer this, but see above.
I’ve mentioned above that every procedure in a modern cloud data warehouse will cost you money, so there is life in QVDs yet. Especially during development, QVD files can save you a ton of requests to e.g. Azure Synapse. Data that isn’t highly volatile can also be “cached” as a QVD to save money – or think of a very complex calculation that the QIX engine can handle more easily (and thus more cost-efficiently)…
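Qlik Script even has a built-in prefix for exactly this caching pattern: `Buffer` transparently stores the result of a load as a QVD and reuses it on subsequent reloads. A minimal sketch – the connection name and table are hypothetical, and the staleness window is something you would tune to how volatile the data is:

```
// Hypothetical cloud data warehouse connection
LIB CONNECT TO 'AzureSynapse';

// Cache the query result as a QVD; the warehouse is only queried
// again once the cached copy is more than one day old
Sales:
Buffer (Stale after 1 days)
SELECT SaleID, Region, Revenue
FROM dbo.FactSales;
```

During development this means you pay for the warehouse query once, and every script run after that reads from the local QVD cache until it goes stale.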
What about other options?
Cloud data warehouses are certainly not the only way data can be stored these days. Data lakes with potent query engines, like Dremio, allow an even more price-efficient way to store data and make it available to the business.
I’m not talking about on-premises data warehouses because they seldom offer the benefits of low maintenance, elasticity and flexibility, although technology has evolved in this sector as well.
Where does this leave us?
I think it gives us – as Qlik developers – more things to consider when deploying our architecture. We will have to align the way we model our pipelines with our customers’ data strategy. Cloud data warehousing gives even smaller companies access to technology that was formerly exclusive to those who could afford high maintenance contracts and had engineers to take care of data processing and provisioning.
I think the standard (if there ever was such a thing) data processing for Qlik will become more similar to its competitors since more and more companies will want their data to be accessed via centralised data storage – and to be honest that’s the key concept of data democratisation.
I would be interested to learn whether you’re experiencing the same shift. How do you see this development? How do you see QVD files today, and how do you use them? Have I been using QVD files wrong this entire time? 😀 Let me know – I’m looking forward to the discussion.