With the amount of data growing exponentially over the last few years, one of the biggest challenges has become finding the most optimal way to store various data flavors. Unlike in the (not so distant) past, when relational databases were considered the only way to go, organizations now want to perform analysis over raw data – think of social media sentiment analysis, audio/video files, and so on – which usually can't be stored in a traditional (relational) way, or storing it in a traditional way would require significant time and effort, which increases the overall time-for-analysis.
Another challenge was to somehow stick with a traditional approach of keeping data stored in a structured way, but without the necessity to design complex and time-consuming ETL workloads to move this data into the enterprise data warehouse. Additionally, what if half of the data professionals in your organization are proficient with, let's say, Python (data scientists, data engineers), and the other half (data engineers, data analysts) with SQL? Would you insist that the "Pythonists" learn SQL? Or, vice versa?
Or, would you prefer a storage option that can play to the strengths of your entire data team? I have good news for you – something like this has already existed since 2013, and it's called Apache Parquet!
Parquet file format in a nutshell
Before I show you the ins and outs of the Parquet file format, there are (at least) five main reasons why Parquet is considered a de facto standard for storing data nowadays:
- Data compression – by applying various encoding and compression algorithms, Parquet files provide reduced memory consumption (you'll see a short example of this right after the list)
- Columnar storage – this is of paramount importance in analytic workloads, where fast data read operations are the key requirement. But, more on that later in the article…
- Language agnostic – as already mentioned, developers may use different programming languages to manipulate the data in a Parquet file
- Open-source format – meaning, you are not locked in with a specific vendor
- Support for complex data types
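To make those bullet points a bit more tangible, here is a minimal sketch (assuming pandas and pyarrow are installed; the file name and column names are made up for illustration) of writing and reading a Parquet file from Python – the same file could just as easily be read from Spark, R, or any other Parquet-aware engine:

```python
import pandas as pd

# A tiny sample dataset – columns are invented purely for illustration
df = pd.DataFrame({
    "Product": ["T-Shirt", "Socks", "Ball"],
    "Country": ["USA", "Germany", "USA"],
    "Amount":  [19.99, 4.99, 9.99],
})

# Write to Parquet – data is laid out column by column and compressed (snappy here)
df.to_parquet("sales.parquet", engine="pyarrow", compression="snappy")

# Read it back – any Parquet-aware tool (Spark, DuckDB, R, ...) could open the same file
df_again = pd.read_parquet("sales.parquet", engine="pyarrow")
print(df_again)
```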
Row-store vs Column-store
We've already mentioned that Parquet is a column-based storage format. However, to understand the benefits of using the Parquet file format, we first need to draw the line between the row-based and column-based ways of storing data.
In traditional, row-based storage, the data is stored as a sequence of rows. Something like this:

Now, when we are talking about OLAP scenarios, some of the common questions that your users may ask are:
- How many balls did we sell?
- How many users from the USA bought a T-shirt?
- What's the total amount spent by customer Maria Adams?
- How many sales did we have on January 2nd?
To be able to answer any of these questions, the engine must scan every row from the beginning to the very end! So, to answer the question "how many users from the USA bought a T-shirt?", the engine has to do something like this:

Essentially, we only need the information from two columns: Product (T-Shirts) and Country (USA), but the engine will scan all five columns! This is not the most efficient solution – I think we can agree on that…
Column store
Let's now examine how the column store works. As you may assume, the approach is 180 degrees different:

In this case, each column is a separate entity – meaning, each column is physically separated from the other columns! Going back to our previous business question: the engine can now scan only the columns that are needed by the query (Product and Country), while skipping the unnecessary ones. And, in most cases, this should improve the performance of analytical queries. The short sketch below illustrates the difference between the two layouts.
Okay, that's nice, but the column store existed before Parquet and it still exists outside of Parquet as well. So, what is so special about the Parquet format?
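If it helps to think about it in code, here is a purely illustrative sketch in plain Python of the same records laid out row-wise versus column-wise (the sample values are made up):

```python
# Row store: each record is kept together, so answering "how many T-Shirts
# were sold in the USA?" means touching every field of every row.
rows = [
    {"Product": "T-Shirt", "Country": "USA",     "Amount": 19.99},
    {"Product": "Socks",   "Country": "Germany", "Amount": 4.99},
    {"Product": "T-Shirt", "Country": "USA",     "Amount": 21.50},
]

# Column store: each column lives on its own, so only the Product and
# Country columns need to be read – the Amount column is never touched.
columns = {
    "Product": ["T-Shirt", "Socks", "T-Shirt"],
    "Country": ["USA", "Germany", "USA"],
    "Amount":  [19.99, 4.99, 21.50],
}

count = sum(
    1
    for product, country in zip(columns["Product"], columns["Country"])
    if product == "T-Shirt" and country == "USA"
)
print(count)  # 2
```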
Parquet is a columnar format that stores the data in row groups
Wait, what?! Wasn't it complicated enough even before this? Don't worry, it's much easier than it sounds 🙂
Let's go back to our previous example and depict how Parquet stores this same chunk of data:

Let's stop for a moment and explain the illustration above, as this is exactly the structure of a Parquet file (some additional things were intentionally omitted, but we will get to them soon). Columns are still stored as separate units, but Parquet introduces an additional structure, called a row group.
Why is this additional structure so important?
You'll need to wait a bit for the answer :). In OLAP scenarios, we are mainly concerned with two concepts: projection and predicate(s). Projection refers to the SELECT statement in SQL language – which columns are needed by the query. Back to our previous example, we need only the Product and Country columns, so the engine can skip scanning the remaining ones.
Predicate(s) refer to the WHERE clause in SQL language – which rows satisfy the criteria defined in the query. In our case, we are interested in T-Shirts only, so the engine can completely skip scanning Row group 2, where all the values in the Product column equal socks!
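To see projection and predicates in action from Python, here is a minimal sketch (assuming pandas with the pyarrow engine, and the hypothetical sales.parquet file from earlier): the `columns` argument expresses the projection, while `filters` expresses the predicate, which the engine can push down to skip entire row groups whose min/max statistics cannot match.

```python
import pandas as pd

df = pd.read_parquet(
    "sales.parquet",
    engine="pyarrow",
    columns=["Product", "Country"],          # projection: only the columns we need
    filters=[("Product", "==", "T-Shirt")],  # predicate: pushed down to row groups
)
print(df)
```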

Let's quickly stop here, as I want you to appreciate the difference between the various types of storage in terms of the work that needs to be performed by the engine:
- Row store – the engine needs to scan all 5 columns and all 6 rows
- Column store – the engine needs to scan 2 columns and all 6 rows
- Column store with row groups – the engine needs to scan 2 columns and 4 rows
Obviously, this is an oversimplified example, with only 6 rows and 5 columns, where you will definitely not see any difference in performance between these three storage options. However, in real life, when you're dealing with much larger amounts of data, the difference becomes more evident.
Now, the fair question would be: how does Parquet "know" which row group to skip/scan?
Parquet files contain metadata
This means that every Parquet file contains "data about data" – information such as the minimum and maximum values in a specific column within a certain row group. Furthermore, every Parquet file contains a footer, which keeps the information about the format version, schema information, column metadata, and so on. You can find more details about Parquet metadata types here.
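If you want to peek at this metadata yourself, pyarrow exposes it directly – a small sketch, again assuming the sales.parquet file from earlier:

```python
import pyarrow.parquet as pq

parquet_file = pq.ParquetFile("sales.parquet")

# Footer information: schema, number of row groups, format version, ...
print(parquet_file.metadata)
print(parquet_file.schema_arrow)

# Per-row-group, per-column statistics – the min/max values the engine
# uses to decide whether a whole row group can be skipped
stats = parquet_file.metadata.row_group(0).column(0).statistics
print(stats.min, stats.max, stats.null_count)
```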
Important: In order to optimize performance and eliminate unnecessary data structures (row groups and columns), the engine first needs to "get acquainted" with the data, so it reads the metadata first. This is not a slow operation, but it still takes a certain amount of time. Therefore, if you're querying data from multiple small Parquet files, query performance can degrade, because the engine has to read the metadata from each file. So, you're better off merging multiple smaller files into one bigger file (but still not too big :)…
I hear you, I hear you: Nikola, what is "small" and what is "big"? Unfortunately, there is no single "golden" number here, but as one example, Microsoft Azure Synapse Analytics recommends that individual Parquet files should be at least a few hundred MBs in size.
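As a rough sketch of that "merge the small files" advice (assuming pyarrow, a folder of small Parquet files with the same schema, and data that fits in memory – the paths are made up):

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Treat the folder of small Parquet files as one logical dataset...
small_files = ds.dataset("raw/sales/", format="parquet")

# ...and rewrite it as a single, larger file, so the engine reads
# one footer instead of hundreds (note: to_table() loads everything into memory)
pq.write_table(small_files.to_table(), "merged/sales.parquet")
```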
What else is in there?
Here is a simplified, high-level illustration of the Parquet file format:

Can it be better than this? Yes, with data compression
Okay, we've explained how skipping the scan of unnecessary data structures (row groups and columns) may benefit your queries and improve overall performance. But, it's not only about that – remember when I told you at the very beginning that one of the main advantages of the Parquet format is the reduced memory footprint of the file? This is achieved by applying various compression algorithms.
I've already written about various data compression types in Power BI (and the Tabular model in general) here, so it may be a good idea to start by reading that article.
There are two main encoding types that enable Parquet to compress the data and achieve astonishing savings in space:
- Dictionary encoding – Parquet creates a dictionary of the distinct values in the column, and afterwards replaces the "real" values with index values from the dictionary. Going back to our example, this process looks something like this:

You might think: why this overhead, when product names are quite short, right? Okay, but now imagine that you store the detailed description of the product, such as: "Long arm T-Shirt with application on the neck". And, now imagine that this product has been sold a million times… Yeah, instead of repeating the value "Long arm…bla bla" a million times, Parquet will store only the index value (an integer instead of text).
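In pyarrow, both knobs – dictionary encoding and the compression codec – are exposed when writing the file. Here is a small sketch (the long description column and file name are made up for illustration):

```python
import pyarrow as pa
import pyarrow.parquet as pq

description = "Long arm T-Shirt with application on the neck"
table = pa.table({
    # the same long text, a million times over
    "ProductDescription": [description] * 1_000_000,
})

# use_dictionary stores each distinct string once and repeats only small
# integer indexes; the compression codec then squeezes the encoded pages further
pq.write_table(
    table,
    "products.parquet",
    use_dictionary=True,
    compression="snappy",
)
```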
Can it be better than THIS?! Yes, with the Delta Lake file format
Okay, what the heck is Delta Lake format now?! This is an article about Parquet, right?
So, to put it in plain English: Delta Lake is nothing else but the Parquet format "on steroids". When I say "steroids", the main one is the versioning of Parquet files. It also stores a transaction log to enable tracking of all changes applied to the Parquet files. This is also known as ACID-compliant transactions.
Since it supports not only ACID transactions, but also time travel (rollbacks, audit trails, etc.) and DML (Data Manipulation Language) statements, such as INSERT, UPDATE and DELETE, you won't be wrong if you think of Delta Lake as a "data warehouse on the data lake" (who said: Lakehouse 😉😉😉). Examining the pros and cons of the Lakehouse concept is out of the scope of this article, but if you're curious to go deeper into this, I suggest you read this article from Databricks.
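Just to give you a feeling of what that looks like in practice, here is a minimal sketch using the open-source deltalake Python package (delta-rs) – Spark offers equivalent functionality through its own APIs; the table path and data are invented for illustration:

```python
import pandas as pd
from deltalake import DeltaTable, write_deltalake

df = pd.DataFrame({"Product": ["T-Shirt", "Socks"], "Amount": [19.99, 4.99]})

# Every write creates a new version of the underlying Parquet files
# and records the change in the _delta_log transaction log
write_deltalake("delta/sales", df)                 # version 0
write_deltalake("delta/sales", df, mode="append")  # version 1

# Time travel: read the table as of an earlier version
dt = DeltaTable("delta/sales", version=0)
print(dt.to_pandas())
print(dt.history())  # audit trail of all transactions
```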
Conclusion
We evolve! And just like us, data is evolving too. So, new flavors of data required new ways of storing it. The Parquet file format is one of the most efficient storage options in the current data landscape, as it provides multiple benefits – both in terms of memory consumption, by leveraging various compression algorithms, and in terms of fast query processing, by enabling the engine to skip scanning unnecessary data.
Thanks for reading!