How to Reduce Your Power BI Model Size by 90%

What makes Power BI so fast and powerful when it comes to performance? So powerful that it performs complex calculations over millions of rows in the blink of an eye.

In this article, we’ll dig deep to discover what’s “under the hood” of Power BI: how your data is stored, compressed, queried, and finally brought back to your report. Once you finish reading, I hope you will have a better understanding of the hard work happening in the background and appreciate the importance of creating an optimal data model to get maximum performance from the Power BI engine.

First look under the hood: Formula Engine and Storage Engine

First, I want you to meet the VertiPaq engine, the “brain and muscles” of the system behind not only Power BI, but also Analysis Services Tabular and Excel Power Pivot. Truth be told, VertiPaq represents only one part of the storage engine within the Tabular model, alongside DirectQuery, which we’ll discuss separately in one of the next articles.

When you send a query to get data for your Power BI report, here is what happens:

  • Formula Engine (FE) accepts the request, processes it, generates the query plan, and finally executes it
  • Storage Engine (SE) pulls the data out of the Tabular model to satisfy the request issued within the query generated by the Formula Engine

Storage Engine works in two different ways to retrieve the requested data. VertiPaq keeps a snapshot of the data in memory, and this snapshot can be refreshed from time to time from the original data source.

On the contrary, DirectQuery doesn’t store any data. It just forwards the query straight to the data source for every single request.

Photo by RKTW prolong on Unsplash

Data in the Tabular model is usually stored either as an in-memory snapshot (VertiPaq) or in DirectQuery mode. However, there is also the possibility of implementing a hybrid Composite model, which relies on both architectures in parallel.

Formula Engine: “Brain” of Power BI

As I already stressed, Formula Engine accepts the query, and since it’s able to “understand” DAX (and MDX too, but that is out of the scope of this series), it “translates” DAX into a specific query plan, consisting of physical operations that need to be executed to get results back.

These physical operations can be joins between multiple tables, filtering, or aggregations. It’s important to know that Formula Engine works in a single-threaded way, which means that requests to Storage Engine are always sent sequentially.

Storage Engine: “Muscles” of Power BI

Once the query has been generated and executed by the Formula Engine, the Storage Engine comes onto the scene. It physically goes through the data stored within the Tabular model (VertiPaq) or goes directly to a different data source (SQL Server, for example, if DirectQuery storage mode is in place).

When it comes to specifying the storage engine for a table, there are three possible options to choose from:

  • Import mode: based on VertiPaq. Table data is stored in memory as a snapshot. Data can be refreshed periodically
  • DirectQuery mode: data is retrieved from the data source at query time. Data resides in its original source before, during, and after the query execution
  • Dual mode: a combination of the first two options. Data from the table is loaded into memory, but at query time it can also be retrieved directly from the source

As opposed to Formula Engine, which doesn’t support parallelism, the Storage Engine can work asynchronously.

Meet VertiPaq Storage Engine

Now that we have drawn the big picture, let me explain in more detail what VertiPaq does in the background to boost the performance of our Power BI reports.

When we choose Import mode for our Power BI tables, VertiPaq performs the following actions:

  • Reads the data source, transforms data into a columnar structure, encodes, and compresses data within each of the columns
  • Establishes a dictionary and index for each of the columns
  • Prepares and establishes relationships
  • Computes all calculated columns and calculated tables, and compresses them

The two main characteristics of VertiPaq are:

  1. VertiPaq is a columnar database
  2. VertiPaq is an in-memory database

Image by author

As you can see in the illustration above, columnar databases store and compress data differently from traditional row-store databases. Columnar databases are optimized for vertical data scanning, which means that every column is structured in its own way and physically separated from other columns!

Without going into a deep analysis of the advantages and disadvantages of row-store vs column-store databases, since that would require a separate series of articles, let me just pinpoint a few key differences in terms of performance.

With columnar databases, single-column access is fast and effective. Once the computation starts to involve multiple columns, things become more complex, as the results of the intermediate steps need to be temporarily stored in some way.

Simply said, columnar databases are more CPU-intensive, while row-store databases increase I/O, because of many scans of useless data.

So far, we have painted a big picture of the architecture that enables Power BI to fully shine as an ultimate BI tool. Now, we’re ready to dive deeper into specific architectural solutions and consequently leverage this knowledge to make the most of our Power BI reports, by tuning our data model to extract the maximum from the underlying engine.

Inside VertiPaq in Power BI: Compress for success!

Image by Kaboompics at Pexels

As you might recall from the previous part of this article, we scratched the surface of VertiPaq, a powerful storage engine, which is “responsible” for the blazing-fast performance of most of your Power BI reports (whenever you are using Import mode or a Composite model).

3, 2, 1…Fasten your seatbelts!

One of the key characteristics of VertiPaq is that it’s a columnar database. We learned that columnar databases store data optimized for vertical scanning, which means that every column has its own structure and is physically separated from other columns.

That fact enables VertiPaq to apply different types of compression to each of the columns independently, choosing the optimal compression algorithm based on the values in that specific column.

Compression is achieved by encoding the values within the column. But, before we dive deeper into a detailed overview of encoding techniques, just keep in mind that this architecture is not exclusive to Power BI: in the background is a Tabular model, which is also “under the hood” of Analysis Services Tabular and Excel Power Pivot.

Value Encoding

This is the most desirable encoding type, since it works exclusively with integers and therefore requires less memory than, for example, working with text values.

How does this look in reality? Let’s say we have a column containing the number of phone calls per day, and the value in this column varies from 4,000 to 5,000. What VertiPaq would do is find the minimum value in this range (which is 4,000) as a starting point, then calculate the difference between this value and all the other values in the column, storing this difference as a new value.

Image by author

At first glance, 3 bits per value might not look like a significant saving, but multiply this by millions or even billions of rows and you will appreciate the amount of memory saved.

As I already stressed, Value Encoding is applied exclusively to integer data type columns (the currency data type is also stored as an integer).
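
To make the arithmetic behind those 3 bits concrete, here is a minimal Python sketch of the idea. It is only an illustration of the subtract-the-minimum principle described above, not VertiPaq’s actual implementation; the sample values and the function names (bits_needed, value_encode) are my own assumptions.

```python
def bits_needed(max_value: int) -> int:
    """Bits required to store any integer in the range 0..max_value."""
    return max(1, max_value.bit_length())


def value_encode(values: list[int]) -> tuple[int, list[int]]:
    """Return the column minimum and each value's offset from that minimum."""
    base = min(values)
    return base, [v - base for v in values]


# Hypothetical phone calls per day, all between 4,000 and 5,000
calls_per_day = [4000, 4250, 4875, 5000, 4100]
base, offsets = value_encode(calls_per_day)

raw_bits = bits_needed(max(calls_per_day))  # bits for the largest raw value (5000)
encoded_bits = bits_needed(max(offsets))    # bits for the largest offset (1000)
decoded = [base + o for o in offsets]       # adding the base back restores the data

print(base, offsets, raw_bits - encoded_bits)
```

Storing 5,000 takes 13 bits, while the largest offset (1,000) fits into 10 bits, which is where the 3 bits per value come from.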

Hash Encoding (Dictionary Encoding)

This is probably the compression type most frequently used by VertiPaq. Using Hash encoding, VertiPaq creates a dictionary of the distinct values within one column and afterward replaces the “real” values with index values from the dictionary.

Here is an example to make things clearer:

Image by author

As you may notice, VertiPaq identified distinct values within the Subjects column, built a dictionary by assigning indexes to those values, and finally stored index values as pointers to the “real” values. I assume you are aware that integer values require far less memory space than text, so that’s the logic behind this type of data compression.

Additionally, by being able to build a dictionary for any data type, VertiPaq is practically data type independent!

This brings us to another key takeaway: no matter if your column is of text, bigint, or float data type, from the VertiPaq perspective it’s the same. It needs to create a dictionary for each of those columns, which means that all these columns will deliver the same performance, both in terms of speed and memory space allocated! Of course, this assumes that there are no significant differences in dictionary sizes between these columns.

So, it’s a myth that the data type of the column affects its size within the data model. On the contrary, the number of distinct values within the column, which is known as cardinality, mostly influences column memory consumption.
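
The dictionary idea can be sketched in a few lines of Python. This is a simplified illustration under my own assumptions (made-up subject names, a hypothetical hash_encode helper), not the way VertiPaq stores things internally. Notice that the size of the dictionary equals the cardinality of the column, regardless of the data type of the values.

```python
def hash_encode(values):
    """Build a dictionary of distinct values and replace each value with its index."""
    dictionary = {}
    indexes = []
    for value in values:
        if value not in dictionary:
            dictionary[value] = len(dictionary)  # next free index
        indexes.append(dictionary[value])
    return dictionary, indexes


# Hypothetical column of subjects
subjects = ["Math", "Physics", "Math", "Chemistry", "Physics", "Math"]
dictionary, indexes = hash_encode(subjects)

# Decoding is a simple lookup from index back to the "real" value
lookup = {index: value for value, index in dictionary.items()}
decoded = [lookup[i] for i in indexes]

print(dictionary)  # distinct values and their indexes
print(indexes)     # the column as it would be stored
```

Six text values shrink to six small integers plus a dictionary with three entries, so the fewer distinct values a column has, the cheaper it gets.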

RLE (Run-Length Encoding)

The third algorithm (RLE) creates a kind of mapping table, containing ranges of repeating values, which avoids storing every single (repeated) value individually.

Again, taking a look at an example will help to better understand this concept:

Image by author

In real life, VertiPaq doesn’t store Start values, because it can quickly calculate where the next node begins by summing the previous Count values.

As powerful as it might look at first glance, the RLE algorithm is highly dependent on the ordering within the column. If the data is stored the way you see in the example above, RLE will perform great. However, if your data buckets are smaller and rotate more frequently, then RLE would not be an optimal solution.

One more thing to keep in mind regarding RLE: in reality, VertiPaq doesn’t store data the way it is shown in the illustration above. First, it performs Hash encoding and creates a dictionary of the subjects, and then applies the RLE algorithm, so the final logic, in its most simplified form, would be something like this:

Image by author

So, RLE occurs after Value or Hash Encoding, in those scenarios when VertiPaq “thinks” that it makes sense to compress data additionally (when data is ordered in such a way that RLE would achieve better compression).
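
Here is a minimal Python sketch of RLE applied on top of dictionary indexes, again only as an illustration. The index values are hypothetical, and the helper names (rle_encode, rle_decode) are my own. It also shows why ordering matters, and how the Start positions can be derived from the Count values instead of being stored.

```python
from itertools import accumulate, groupby


def rle_encode(ids):
    """Collapse consecutive repeats into (value, count) pairs."""
    return [(value, len(list(run))) for value, run in groupby(ids)]


def rle_decode(runs):
    """Expand (value, count) pairs back into the full column."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical dictionary indexes: once nicely ordered, once rotating frequently
sorted_ids = [0, 0, 0, 0, 1, 1, 1, 2, 2]
rotating_ids = [0, 1, 2, 0, 1, 2, 0, 1, 2]

sorted_runs = rle_encode(sorted_ids)      # only 3 runs to store
rotating_runs = rle_encode(rotating_ids)  # 9 runs, so nothing is gained

# Start positions don't need to be stored: they are running sums of the counts
counts = [count for _, count in sorted_runs]
starts = [0] + list(accumulate(counts))[:-1]

print(sorted_runs, starts, len(rotating_runs))
```

The same nine values compress to three runs when they are ordered, and to nine runs (no saving at all) when they rotate on every row.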

Re-Encoding considerations

No matter how “smart” VertiPaq is, it can also make some bad decisions, based on incorrect assumptions. Before I explain how re-encoding works, let me just briefly walk through the process of data compression for a specific column:

  • VertiPaq scans a sample of rows from the column
  • If the column data type is not an integer, it will look no further and use Hash encoding
  • If the column is of integer data type, some additional parameters are evaluated: if the numbers in the sample increase linearly, VertiPaq assumes that it is probably a primary key and chooses Value encoding
  • If the numbers in the column are reasonably close to each other (the number range is not very wide, like in our example above with 4,000–5,000 phone calls per day), VertiPaq will use Value encoding. On the contrary, when values fluctuate significantly within the range (for example between 1,000 and 1,000,000), then Value encoding doesn’t make sense, and VertiPaq will apply the Hash algorithm

However, it can sometimes happen that VertiPaq decides which algorithm to use based on the sample data, but then some outlier pops up and it needs to re-encode the column from scratch.

Let’s use our previous example of the number of phone calls: VertiPaq scans the sample and chooses to apply Value encoding. Then, after processing 10 million rows, all of a sudden it finds a value of 500,000 (it can be an error, or whatever). Now, VertiPaq re-evaluates the choice, and it can decide to re-encode the column using the Hash algorithm instead. Surely, that would impact the whole process in terms of the time needed for reprocessing.
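
The sample-then-re-encode behavior can be mimicked with a toy Python model. Please treat this strictly as a sketch: the real heuristics and thresholds VertiPaq uses are not covered in this article, so the MAX_VALUE_RANGE constant, the sample size, and both function names are assumptions made up purely for illustration.

```python
MAX_VALUE_RANGE = 100_000  # hypothetical threshold, NOT VertiPaq's real one


def choose_encoding(sample, is_integer=True):
    """Pick an encoding from a sample: non-integers always get Hash encoding."""
    if not is_integer:
        return "hash"
    spread = max(sample) - min(sample)
    return "value" if spread <= MAX_VALUE_RANGE else "hash"


def process_column(values, sample_size=5):
    """Choose from the sample, then re-encode if a later outlier widens the range."""
    sample = values[:sample_size]
    encoding = choose_encoding(sample)
    reencoded = False
    if encoding == "value":
        low, high = min(sample), max(sample)
        for v in values[sample_size:]:
            low, high = min(low, v), max(high, v)
            if high - low > MAX_VALUE_RANGE:
                # The outlier invalidates the earlier choice: start over with Hash
                encoding, reencoded = "hash", True
                break
    return encoding, reencoded


clean = [4000, 4200, 4500, 4800, 5000, 4300]
with_outlier = clean + [500_000]

print(process_column(clean))         # the sample was representative
print(process_column(with_outlier))  # the outlier forces a re-encode
```

The practical takeaway is the same as in the text above: cleaning up outliers at the source can spare the engine an expensive second pass over the column.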

Finally, here is the list of parameters (in order of importance) that VertiPaq considers when choosing which algorithm to use:

  • Number of distinct values in the column (Cardinality)
  • Data distribution in the column: a column with many repeating values can be compressed better than one containing frequently changing values (RLE can be applied)
  • Number of rows in the table
  • Column data type: affects only the dictionary size

Reducing the data model size by 90%: a real story!

Now that we have laid the theoretical groundwork for understanding the architecture behind the VertiPaq storage engine, and which types of compression it uses to optimize your Power BI data model, it’s the right moment to get our hands dirty and apply our knowledge to a real-life case!

Starting point = 776 MB

Our data model is quite simple, yet memory-intensive. We have a fact table (factChat), which contains data about live support chats, and one dimension table (dimProduct), which relates to the fact table. Our fact table has around 9 million rows, which shouldn’t be a big deal for Power BI, but the table was imported as it is, without any additional optimization or transformation.

Image by author

Now, this pbix file consumes a whopping 776 MB!!! You can’t believe it? Just take a look:

Image by author

Just remember this picture! Of course, I don’t need to tell you how much time this report needs to load or refresh, and how slow our calculations are because of the file size.

…and it’s even worse!

Additionally, it’s not just the 776 MB that takes up our memory, since memory consumption is calculated by taking into account the following factors:

  • PBIX file
  • Dictionary (you’ve learned about the dictionary in the first sections of this article)
  • Column hierarchies
  • User-defined hierarchies
  • Relationships

Now, if I open Task Manager, go to the Details tab, and find the msmdsrv.exe process, I will see that it burns more than 1 GB of memory!

Oh, man, that really hurts! And we haven’t even interacted with the report! So, let’s see what we can do to optimize our model…

Rule #1: Import only those columns you really need

The first and most important rule is: keep in your data model only those columns you really need for the report!

That being said, do I really need both the chatID column, which is a surrogate key, and the sourceID column, which is a primary key from the source system? Both of these values are unique, so even if I need to count the total number of chats, I would still be fine with only one of them.

Image by author

So, I’ll remove the sourceID column and check how the file looks now:

Image by author

By removing just one unnecessary column, we saved more than 100 MB!!! Let’s examine further what can be removed without taking a deeper look (and we’ll come to this later, I promise).

Do we really need both the original start time of the chat and the UTC time, one stored as a Date/Time/Timezone type, the other as Date/Time, and both going down to second-level precision??!!

Let me get rid of the original start time column and keep only the UTC values.

Image by author

Another 100 MB of wasted space gone! By removing just two columns we don’t need, we reduced the size of our file by 30%!

Now, that was without even looking into the details of the memory consumption. Let’s now turn on DAX Studio, my favorite tool for troubleshooting Power BI reports. As I have already stressed a few times, this tool is a MUST if you plan to work seriously with Power BI, and it’s completely free!

One of the features in DAX Studio is VertiPaq Analyzer, a very useful tool built by Marco Russo and Alberto Ferrari from sqlbi.com. When I connect to my pbix file with DAX Studio, here are the numbers related to my data model size:

Image by author

I can see here which columns are the most expensive in my data model and decide if I can discard some of them, or if I need to keep them all.

At first glance, I have a few candidates for removal: the sessionReferrer and referrer columns have high cardinality and therefore can’t be optimally compressed. Moreover, as these are text columns and need to be encoded using the Hash algorithm, you can see that their dictionary size is extremely high! If you take a closer look, you can notice that these two columns take up almost 40% of my table size!

After checking with my report users whether they need any of these columns, or maybe only one of them, I got confirmation that they don’t perform any analysis on these columns. So, why on Earth should we bloat our data model with them??!!

Another strong candidate for removal is the LastEditDate column. This column just shows the date and time when the record was last edited in the data warehouse. Again, I checked with the report users, and they didn’t even know that this column existed!

I removed these three columns, and the result is:

Image by author

Oh, God, we halved the size of our data model by just removing a few unnecessary columns.

Truth be told, there are a few more columns that could be dismissed from the data model, but let’s now focus on other techniques for data model optimization.

Rule #2: Reduce the column cardinality!

As you may recall from the previous part of the article, the rule of thumb is: the higher the cardinality of a column, the harder it is for VertiPaq to optimally compress the data. Especially if we are not working with integer values.

Let’s take a deeper look at the VertiPaq Analyzer results:

Image by author

As you can see, even though the chatID column has higher cardinality than the datetmStartUTC column, it takes almost 7 times less memory! Since it is a surrogate key integer value, VertiPaq applies Value encoding, and the size of the dictionary is irrelevant. On the other hand, Hash encoding is applied to the high-cardinality column of date/time data type, so the dictionary size is enormously higher.

There are several techniques for reducing column cardinality, such as splitting columns. Here are a few examples of using this technique.

For integer columns, you can split them into two even columns using division and modulo operations. In our case, it would be:

SELECT chatID/1000 AS chatID_div
,chatID % 1000 AS chatID_mod……….

This optimization technique must be performed on the source side (in this case, by writing a T-SQL statement). If we use calculated columns, there is no benefit at all, since the original column has to be stored in the data model first.
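
To see why the split pays off, here is a small Python sketch that counts distinct values before and after. The ID range is a made-up stand-in for a surrogate key (our real table has around 9 million rows), and Python’s // operator is used as the counterpart of T-SQL integer division for non-negative integers.

```python
# Hypothetical surrogate keys: one million consecutive, unique IDs
chat_ids = list(range(1_000_000))

# The same split as in the T-SQL statement: chatID / 1000 and chatID % 1000
div_part = [chat_id // 1000 for chat_id in chat_ids]
mod_part = [chat_id % 1000 for chat_id in chat_ids]

# Cardinality (number of distinct values) of the original and of each new column
cardinalities = (len(set(chat_ids)), len(set(div_part)), len(set(mod_part)))

# The original value can always be rebuilt from the two parts
rebuilt = [d * 1000 + m for d, m in zip(div_part, mod_part)]

print(cardinalities)
```

One column with a million distinct values becomes two columns with a thousand distinct values each, at the price of having to recombine them (div * 1000 + mod) whenever the original ID is needed.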

A similar technique can bring significant savings when you have decimal values in the column. You can simply split the values before and after the decimal point, as explained in this article.

Since we don’t have any decimal values, let’s focus on our problem: optimizing the datetmStartUTC column. There are several valid options to optimize this column. The first is to check whether your users need granularity finer than the day level (in other words, can you remove hours, minutes, and seconds from your data).

Let’s check what savings this solution would bring:

Image by author

The first thing we notice is that our file is now 271 MB, so about 1/3 of what we started with. VertiPaq Analyzer’s results show that this column is now almost perfectly optimized, going from taking up 62% of our data model to just slightly over 2.5%! That’s huuuuge!

Image by author

However, it turned out that the day-level grain was not fine enough, and my users needed to analyze figures at the hour level. OK, so we can at least get rid of minutes and seconds, and that would also decrease the cardinality of the column.

So, I’ve imported values rounded per hour:

SELECT chatID
      -- truncate the timestamp to the start of its hour
      ,dateadd(hour, datediff(hour, 0, datetmStartUTC), 0) AS datetmStartUTC
      ,customerID
      ,userID
      ,ipAddressID
      ,productID
      ,countryID
      ,userStatus
      ,isUnansweredChat
      ,totalMsgsOp
      ,totalMsgsUser
      ,userTimezone
      ,waitTimeSec
      ,waitTimeoutSec
      ,chatDurationSec
      ,sourceSystem
      ,topic
      ,usaccept
      ,transferUserID
      ,languageID
      ,waitFirstClick
FROM factChat
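
The effect of that truncation on cardinality can be illustrated with a short Python sketch. The timestamps below are synthetic (one hypothetical chat every 97 seconds, starting on an arbitrary date), so the exact numbers mean nothing for the real factChat table; the point is only how quickly the number of distinct values drops as the grain gets coarser.

```python
from datetime import datetime, timedelta

# Synthetic data: 10,000 chats, one starting every 97 seconds
first_chat = datetime(2020, 1, 1)
timestamps = [first_chat + timedelta(seconds=97 * i) for i in range(10_000)]

# Same idea as dateadd(hour, datediff(hour, 0, ...), 0): keep date and hour only
by_hour = [ts.replace(minute=0, second=0, microsecond=0) for ts in timestamps]

# Day-level grain: drop the time part completely
by_day = [ts.date() for ts in timestamps]

second_level = len(set(timestamps))
hour_level = len(set(by_hour))
day_level = len(set(by_day))

print(second_level, hour_level, day_level)
```

With this synthetic data, 10,000 distinct second-level values collapse to 270 distinct hours and just 12 distinct days, which is exactly the kind of cardinality drop that shrinks the dictionary.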

It turned out that my users also didn’t need the chatVariables column for analysis, so I’ve removed it from the data model as well.

Finally, after disabling Auto Date/Time in Options for Data Load, my data model size was around 220 MB! However, one thing still bothered me: the chatID column was still occupying almost 1/3 of my table. And this is just a surrogate key, which isn’t used in any of the relationships within my data model.

Image by author

So, here I was examining two different solutions: the first was to simply remove this column and aggregate the number of chats, counting them using the GROUP BY clause. However, there would be no benefit in keeping the chatID column at all, since it’s not being used anywhere in our data model. Once I’ve removed it from the model, let’s check the pbix file size one last time:

Image by author

Please recall the number we started at: 776 MB! So, I’ve managed to reduce my data model size by almost 90%, applying some simple techniques which enabled the VertiPaq storage engine to perform more optimal compression of the data.

And this was a real use case, which I faced during the last year!

General rules for reducing data model size

To conclude, here is the list of general rules you should keep in mind when trying to reduce the data model size:

  • Keep only those columns your users need in the report! Just sticking with this one single rule will save you an unbelievable amount of space, I assure you…
  • Try to optimize column cardinality whenever possible. The golden rule here is: test, test, test…and if there is a significant benefit from, for example, splitting one column into two, or substituting a decimal column with two whole number columns, then do it! But also keep in mind that your measures need to be rewritten to handle these structural changes in order to show the expected results. So, if your table is not big, or if you have to rewrite hundreds of measures, maybe it’s not worth splitting the column. As I said, it depends on your specific scenario, and you should carefully evaluate which solution makes more sense
  • Same as for columns, keep only those rows you need: for example, maybe you don’t need to import data from the last 10 years, but only 5! That will also reduce your data model size. Talk to your users and ask them what they really need, before blindly putting everything inside your data model
  • Aggregate your data whenever possible! That means fewer rows and lower cardinality, so all the good things you are aiming to achieve! If you don’t need hour, minute, or second level of granularity, don’t import it! Aggregations in Power BI (and the Tabular model in general) are a very important and wide topic, which is out of the scope of this series, but I strongly recommend you check Phil Seamark’s blog and his series of posts on creative usage of aggregations
  • Avoid using DAX calculated columns whenever possible, since they are not optimally compressed. Instead, try to push all calculations to the data source (a SQL database, for example) or perform them using the Power Query editor
  • Use proper data types (for example, if your data granularity is at the day level, there is no need to use the Date/Time data type. The Date data type will suffice)
  • Disable the Auto Date/Time option for data loading (this will remove a bunch of automatically created date tables in the background)

Conclusion

Now that you have learned the basics of the VertiPaq storage engine and the different techniques it uses for data compression, I wanted to wrap up this article by showing you a real-life example of how we can “help” VertiPaq (and Power BI consequently) to get the best out of report performance and optimal resource consumption.

Thanks for reading, and I hope that you enjoyed the article!