In this article, we will explore how to view a Data flow execution plan in Azure Data Factory. The execution plan provides partition and performance metrics for data engineers or the operations team to review. This crucial step allows developers to verify that performance goals are met.
In our previous article, we created a pipeline using Mapping Data flow. Now, we need to review the time it took to transform the data and build a better understanding of where we can optimize our solution.
Azure Data Factory Monitor provides this capability for us, and it is available for both debug runs and scheduled pipeline runs. Before we dive into the tutorial, we need to understand four key concepts:
- Stage: A Mapping data flow pipeline is broken up into different stages. A stage consists of a group of activities that can be executed together.
- Partition: A dataset is split into multiple chunks so it can be processed in parallel. Each chunk is represented as a partition.
- Skewness: Skewness measures how evenly the data is distributed across the dataset. In data flow, the ideal range is between -3 and 3.
- Kurtosis: Kurtosis measures how many outliers are in the dataset. A high kurtosis value means the dataset has heavy outliers. In data flow, the ideal value is less than 10.
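To build intuition for the two statistics, here is a minimal sketch that computes them over a list of per-partition row counts. The row counts are made-up illustration data, and we use the standard moment-based (Fisher-Pearson) definitions; the exact formulas ADF uses internally may differ slightly.

```python
from statistics import mean

def skewness(values):
    # Fisher-Pearson population skewness: m3 / m2**1.5
    # 0 for symmetric data; large |value| means a lopsided distribution
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m3 = sum((x - mu) ** 3 for x in values) / n
    return m3 / m2 ** 1.5

def kurtosis(values):
    # Pearson (non-excess) kurtosis: m4 / m2**2
    # a normal distribution scores ~3; much higher means heavy outliers
    n = len(values)
    mu = mean(values)
    m2 = sum((x - mu) ** 2 for x in values) / n
    m4 = sum((x - mu) ** 4 for x in values) / n
    return m4 / m2 ** 2

# Hypothetical per-partition row counts: one partition is far smaller
rows = [1100, 1080, 1150, 1120, 1090, 1110, 400]
print(f"skewness = {skewness(rows):.2f}, kurtosis = {kurtosis(rows):.2f}")
```

One small partition is enough to pull skewness sharply negative, which is exactly the pattern the Monitor view helps you spot.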
In Azure Data Factory Studio, click on ‘Monitor’. Under ‘Runs’, click on ‘Pipeline runs’.
Select the pipeline and click on the ‘eyeglasses icon’ under ‘Activity runs’.
Now, we see the data flow. Click on ‘Stages’ to see how long each stage took. A ‘Stages’ side panel will be shown.
Let’s examine the different stages we have:
Let’s dive into the ‘Join transformation’ by clicking on the Join icon. We observe that:
- The first thing we notice is that both Skewness and Kurtosis fall within the ideal ranges listed above.
- The ‘Partition chart’ shows that most partitions hold between 1000 and just under 1200 records. The last few partitions hold a below-average number of records.
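The same below-average check can be done programmatically. The sketch below flags partitions that hold significantly fewer rows than the average; the row counts and the 25%-under-average threshold are illustrative assumptions, not values from the tutorial's run.

```python
# Hypothetical per-partition row counts, as read off the 'Partition chart'
rows = [1050, 1100, 1180, 1120, 1090, 1160, 620, 480]

avg = sum(rows) / len(rows)
# Flag partitions more than 25% under the average row count
below = [(i, n) for i, n in enumerate(rows) if n < 0.75 * avg]

print(f"avg rows/partition: {avg:.0f}, min: {min(rows)}, max: {max(rows)}")
for i, n in below:
    print(f"partition {i} holds {n} rows ({n / avg:.0%} of average) - possible skew")
```

A handful of underfilled partitions is usually harmless; many of them suggests the partitioning key or scheme on the transformation is worth revisiting.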
In summary, it is very important to review the execution plan of the Mapping data flow. With Azure Data Factory Monitor, we can examine, for each data flow transformation, the row counts, Skewness, Kurtosis, and the number of rows in each partition. This information enables us to optimize our pipeline when required.
Finally, remember to turn on verbose logging in the Data flow activity to generate the full set of monitoring details. Verbose logging has some performance impact, but without this information it is difficult to identify where an issue lies.
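In the pipeline's JSON definition, this setting appears on the Data flow (Execute Data Flow) activity. The fragment below is a sketch based on our reading of the ARM schema (the `traceLevel` property, with `Fine` corresponding to verbose); verify the property name and values against the JSON of your own factory before relying on them.

```json
{
  "name": "Execute my data flow",
  "type": "ExecuteDataFlow",
  "typeProperties": {
    "dataFlow": {
      "referenceName": "MyMappingDataFlow",
      "type": "DataFlowReference"
    },
    "traceLevel": "Fine"
  }
}
```

Switching `traceLevel` back from `Fine` once the investigation is done avoids paying the logging overhead on every scheduled run.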