This week I was assisting a client who was faced with a very interesting challenge. At its crux, the issue was that the data being ingested into the report had a high degree of interdependence, making it very challenging to extract the insights required of the data. The data in question surrounded corporate risk, but differed from Risk Reports I had worked with in the past, because risk wasn't the only element being investigated. Essentially, there was a need to identify patterns in the risk data, including the usual, controls, incidents, issues, and treatments. This part of the report was familiar and what I expected. Where the interesting challenge came in was when the client explained they also need to report on each of the other profiles. i.e. they want to know all of the incidents, and which ones are associated with risks, controls, issues and treatments, all the controls and which ones are associated with risks, issues, treatments and incidents, and so on. This problem could be solved in a number of different ways, but the best solution is to create an appropriate model to best support this ask, as it leads to much simpler DAX when it is time to extract the meaningful insights.
In this article, I'll explore the different modelling approaches that could be used to support this report, but ultimately will establish that putting more effort in up front to shape the data appropriately to support the ask is the best solution.
Words of Wisdom from the Masters of DAX
Recently, there has been a flurry of LinkedIn posts surrounding a tweet from Marco Russo of SQLBI. Marco's post was a hint at a book that is in development, focusing on the importance of investing your learning into Modelling as a priority over, or at least in line with the priority of learning DAX. As we dive deeper into this article, you will see that this proves to be true.
Honestly, it would be better to be a good data modeler and a mediocre DAX author than a DAX guru but poor data modeler. Therefore, if data modelling is not your cup of tea, roll up your sleeves and start learning it.
Marco Russo
SQLBI
The Star Schema is the Optimal Model Shape for Power BI
I won't go into detail about the components of the Star Schema today, but the generally accepted best practice approach to loading data into Power BI is to load it in a “Star Schema” format. This is a technical term coming from the dimensional modeling techniques of Kimball et al., which describes the logical way data should be structured for optimal reporting performance. You can see why it is called a Star Schema by looking at the following image – note the star shape.
I am a firm believer in the importance of the Star Schema, I am also a firm believer that there is a significant benefit to be gained (from a DAX authoring perspective) if we opt to alter the layout of our Star Schemas to a top-down approach. A method of model shaping promoted by a different set of Power BI Masters, Rob Collie and Matt Allington.
If you compare the 2 images above, you will see they have exactly the same logical relationships (links) between the tables, it is just that they have a different visual layout. By adopting this top-down approach, at a glance, it is possible to identify that the tables at the top filter the table at the bottom, but the table at the bottom does not filter the tables above it. This is evident by the arrow heads all pointing toward the bottom table.
Too Complex for a Star
You have probably already identified that with the number of transaction (fact) tables in the model (Risk, Controls, Treatments (Actions), Incidents and Issues) a simple Star Schema would be difficult to achieve. That being the case, the approach I have taken to modelling risk data in the past revolves around layering the transactional records at a different level. Typically, the Controls, Actions, Issues and Incidents are all sub-facts of the Risks, and therefore would be appropriate to model as a second layer of transaction tables, as is illustrated below.
This model shape however is not appropriate for the current task at hand, because there is a need to monitor concepts like how many Incidents are not associated with a Risk, or How many Issues do not have a Control. With the current layered approach, the model is designed to support interrogation against Risks and its associated items, but does not allow analysis of each of the transactions independent of Risk.
Solving the Challenge with Multiple Stars
The first response to this discovery was to redesign the model with all of the Transactions at the same level. This was achieved by splitting each transaction table's sub-facts to rows, effectively duplicating the other transaction details.
For example, in the raw data, risk 1001000 had multiple Controls and multiple Actions associated with it.
When the data was split, Risk 1001000 repeated many times for each Control and Action combination.
This approach was repeated on all of the transaction tables, allowing the tables to be modelled as multiple overlapping Star Schemas.
With this model, each of the transaction tables could be analyzed independent of Risks.
Simple to Model, Hard to Analyze
Although the multiple-star model allowed for simple analysis initially, one of the early warning signs that this wasn't the best solution was the fact that to calculate the total number of Risks, it was necessary to run a DISTINCTCOUNT on the Risk[RiskID], rather than a simple COUNTROWS of Risk. The complexity in the DAX only grew from here.
Because there was no record of Incidents in the Risk table, in order to determine Risks that were associated with Incidents, it was necessary to write a complex DAX formula to search the Incidents table for mentions of RiskID's.
Risks with Linked Incidents =
1 CALCULATE (
2 CALCULATE ( [Total Risks], TREATAS ( VALUES ( Incidents[RiskID] ), Risk[ID] ) ),
3 Incidents[RiskID] <> BLANK (),
4 CROSSFILTER ( Incidents[Business Unit], BusUnit[Business Unit], NONE ),
5 CROSSFILTER ( Incidents[Create Date], 'Calendar'[Date], NONE )
6 )
It's okay if you look at this formula and get a little perplexed. That is to be expected given that I am now using DAX to wrangle the data to compensate for inadequate model design, but let me step through the calculation to help you understand it.
The first CALCULATE (lines 1-6) is modifying the filter context impacting the data.
Line 5 is removing the filter being applied to the Incidents table by the Calendar
Line 4 is removing the filter being applied to the Incidents table by the BusUnit
Line 3 is adding a filter to the Incidents table, filtering the table for rows that contain a value in the Incidents[RiskID] column
The first CALCULATE is effectively turning the Incidents table into a Descriptive (dimension) table.
Now that the data tables are receiving the correct filters, the second CALCULATE (line 2) is engaged. This CALCULATE is applying a filter onto the Risk table that is being driven by the list of unique Incidents[RiskID] values. The TREATAS function is generating a virtual 1-to-many relationship from the Incidents table to the Risk table Incidents[RiskID] to Risk[RiskID], and passes the Incidents[RiskID] <> BLANK filter onto the Risk table.
The Risk table is now receiving filters from the Calendar, BusUnit and Incidents tables.
The second CALCULATE then allows the calculation of [Total Risks] to be evaluated.
Rest assured that this calculation is working as intended, and correctly identifies the Risks with associated Incidents.
The problem we now face however, is that this has only identified the Risks with associated Incidents. We now have to replicate this logic for each of the other tables, until we are finally in a position where we can determine which Risks do not have an associated link with any other table. The calculation to achieve this would look something like this:
Risks with No Linked Controls/Actions/Issues/Incidents =
VAR Risks = [Total Risks]
VAR IncidentRisks =
CALCULATE (
CALCULATE ( [Total Risks], TREATAS ( VALUES ( Incidents[RiskID] ), Risk[ID] ) ),
Incidents[RiskID] <> BLANK (),
CROSSFILTER ( Incidents[Business Unit], BusUnit[Business Unit], NONE ),
CROSSFILTER ( Incidents[Create Date], 'Calendar'[Date], NONE )
)
VAR IssueRisks =
CALCULATE (
CALCULATE ( [Total Risks], TREATAS ( VALUES ( Issues[RiskID] ), Risk[ID] ) ),
Issues[RiskID] <> BLANK (),
CROSSFILTER ( Issues[Business Unit], BusUnit[Business Unit], NONE ),
CROSSFILTER ( Issues[Create Date], 'Calendar'[Date], NONE )
)
VAR ControlRisks =
CALCULATE (
CALCULATE ( [Total Risks], TREATAS ( VALUES ( Controls[RiskID] ), Risk[ID] ) ),
Controls[RiskID] <> BLANK (),
CROSSFILTER ( Controls[Business Unit], BusUnit[Business Unit], NONE ),
CROSSFILTER ( Controls[Create Date], 'Calendar'[Date], NONE )
)
VAR ActionRisks =
CALCULATE (
CALCULATE ( [Total Risks], TREATAS ( VALUES ( Actions[RiskID] ), Risk[ID] ) ),
Actions[RiskID] <> BLANK (),
CROSSFILTER ( Actions[Business Unit], BusUnit[Business Unit], NONE ),
CROSSFILTER ( Actions[Create Date], 'Calendar'[Date], NONE )
)
VAR Result = Risks - IncidentRisks - IssueRisks - ControlRisks - ActionRisks
RETURN
Result
Needless to say, this is a complex DAX formula, and requires a reasonably significant level of DAX knowledge to author and understand.
Digging Deeper into the Schema
DAX is a simple language, but it is not easy! The above calculation highlights this fact all too well. It is for this reason that Marco's quote above says to focus on becoming a good modeler rather than a DAX guru. Our job as data modeler's is to prepare our data to an optimal shape for answering the questions at hand, to ensure we can extract the insights required to analyze the data according to the business problems that need answers. To help communicate this to others in the organization, it is also critical that we keep the logic necessary to achieve these answers as simple to understand as possible. With that in mind, let's take another pass at the data model shape.
This time, we will take an approach that is a combination of the first layered model and the multi-star of the second model. To achieve this we will only keep unique records in each of the main transaction tables. To break out the interdependency of the tables, we will create a second layer of transaction tables that will be an appended list of each of the transaction tables combined with the list of Items and ItemIDs from the associated tables.
When these new tables are loaded into the multi-star model, they can be connected to their respective parent table.
Harder to Model, Simpler to Analyze
With this new data model in place, extracting the insights explored earlier is significantly easier. To identify [Total Risks], there is no longer a need to perform a DISTINCTCOUNT. This time around, because the Risk table only contains unique Risk records, we can tally the total number of risks with a simple COUNTROWS. To calculate the Risks with Linked Incidents, the calculation is now a simple DISTINCTCOUNT of RiskItems[RiskID], with a filter on RiskItems[Item].
The true power of the model is revealed however when we try to solve the complex Risks with No Linked Controls/Actions/Issues/Incidents calculation. This time we simply need a CALCULATE (to alter the natural filtering behavior of the model), and a FILTER to perform a roll-call on the data (to find things that haven't happened).
Risks with No Linked Controls/Actions/Issues/Incidents =
CALCULATE ( [Total Risks], FILTER ( Risk, [Risk Items] = 0 ) )
A Final Word on Shaping
While the optimum shape of a data model in Power BI is a Star Schema, that doesn’t mean this is the only model shape available for us to use. Power BI is a very flexible tool and will let you work with any shape you require. The catch to remember is that DAX is simple, but it's not always easy, so it is important to find a data shape that allows you to extract the insights you need with clear easy to understand calculation logic.
Although the galaxy schema model used in this article has more tables and consequently more relationships, the DAX calculations are significantly more performant because they are able to leverage the natural filtering behavior of the model, without need to generate virtual relationships on the fly, in order to evaluate.
All things considered, it is clear there are many reasons that support Marco's advice, to invest your effort in learning how to model well rather than get caught up on wanting to become a master at authoring and demystifying complex Data Analysis eXpressions.
Excellent article, very insightful and helpful, going straight to the business need. I had a similar revelation when analyzing Salesforce data in the past. The SF model is quite a gargantuan mess of tables and relationships, and the first attempt in Power BI was soon plagued with weird DAX measures. These measures were stretched totally out of proper shape to show the required metric, and any change cause other parts to fail. Not to mention that nobody could really say why a measure was coded in the way it did.
We redesigned the model from scratch following a similar approach to handle interconnected Leads, Contacts, Opportunities, Salesperson, etc fact tables.
Your article puts a very comprehensive frame to such endeavors.