How To Transform Data In Azure Data Factory


In this article, we will explore how to use Azure Data Factory's Mapping Data Flow to transform data. Mapping Data Flow runs on scaled-out Apache Spark clusters, similar to Databricks. However, Data Flow provides data engineers with a graphical designer to assemble transformation logic with little to no code.


In our previous article, we created an Azure Data Factory instance with a managed virtual network, and we want to leverage this service to do our data transformation.

One of the common tasks a data engineer gets is to create aggregations over a set of data. We will demonstrate this using the WideWorldImporters sample database provided by Microsoft. You can set up the database by following the link in the References section.

In this database, we are interested in two tables:

  1. Sales.Invoices: This table contains the details of customer invoices, including the invoice date.
  2. Sales.InvoiceLines: This table contains the line items of the customer invoices, including the profit for each line item.

By joining the two tables, we can calculate the total profit for each invoice date. We will create a new table in the database to store the output, so the values are pre-calculated.
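The join and aggregation we are about to build can be sketched in plain T-SQL first. This query is my own illustration (assuming the standard WideWorldImporters schema) and is not part of the pipeline itself:

```sql
-- Daily profit per invoice date: the same inner join and
-- group-by aggregation we will assemble visually in Data Flow.
SELECT
    i.[InvoiceDate],
    SUM(il.[LineProfit]) AS [DailyProfit]
FROM [Sales].[Invoices] AS i
INNER JOIN [Sales].[InvoiceLines] AS il
    ON i.[InvoiceID] = il.[InvoiceID]
GROUP BY i.[InvoiceDate]
ORDER BY i.[InvoiceDate];
```

Running this directly in the database is a quick way to know what the Data Flow output should look like before we build it.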


Create database schema and report table

Before we dive into Mapping Data Flow, we will create a schema and the output table first. We will do this using SQL:

-- Create the schema and result table
CREATE SCHEMA [SampleReports];
GO

CREATE TABLE [SampleReports].[DailyProfits] (
    [InvoiceDate] [date] NOT NULL,
    [DailyProfit] [decimal](18, 2) NOT NULL,
    [CreatedWhen] [datetime2](7) NOT NULL,
    [PipelineId] [nvarchar](100) NOT NULL
);

In addition to the data we are interested in, we added two extra columns for audit purposes:

  • CreatedWhen contains the timestamp when the record is created.
  • PipelineId contains the Pipeline Run Id from Data Factory.

After executing the SQL statements above, we are ready to go into Azure Data Factory.

Mapping Data Flow

  1. In Azure Data Factory Studio, click on ‘Manage’. Under ‘Linked services’, create a new Linked service to your Azure SQL server.

    • We selected the Azure Integration Runtime for the Managed VNet and enabled ‘Interactive authoring’ to test the connection.
    • Creating a new ‘Managed private endpoint connection’ is required. Please refer to my previous article if you need help with this step.
    • Click on ‘Test Connection’ before saving to ensure all the configurations are correct.
    • For simplicity, I provided the password directly to the linked service. The best practice would be to leverage Managed Identity for the connection.
  2. In Azure Data Factory Studio, click on ‘Author’. Under Factory Resources, click on the ‘+’ icon and select ‘Dataset’.

  3. We will create three datasets, one for each table, by selecting from the ‘Table name’ drop-down:

    • SalesInvoices: Sales.Invoices
    • SalesInvoiceLines: Sales.InvoiceLines
    • ReportDailyProfits: SampleReports.DailyProfits

  4. We should see three datasets under Factory Resources now.

  5. To create a Data flow, we will click on the ‘+’ icon under Factory Resources and select ‘Data flow’.

  6. We now have a blank Data flow canvas. We will name our Data flow: ReportsDataflow.

  7. Before we start adding components to the data flow, we will turn on ‘Data flow debug’. This will allow us to preview the data in the database and any transformation output.

    We should review the settings in the screenshot:

    • We have selected the AutoResolveIntegrationRuntime this time. The advantage of this is that it lets us dynamically configure the Compute type and Core count at runtime in a later step.
    • We will be paying for a minimum of one hour of compute.
    • The Core count can be configured in the Manage > Integration Runtimes screen.

  8. Let’s add our SalesInvoices dataset by clicking on ‘Add Source’. We will repeat this same step to add the SalesInvoiceLines dataset.

  9. Let’s click on ‘Data preview’ for the ‘salesInvoices’ dataset. We can see some of the records in the database.

  10. We will join the two tables together using the InvoiceID.

    1. Click on the ‘+’ next to ‘salesInvoices’ and select ‘Join’ under ‘Multiple Inputs/Outputs’.
    2. Provide an Output stream name. This name will be used as a reference in later steps.
    3. Select ‘salesInvoices’ for the Left stream.
    4. Select ‘salesInvoiceLines’ for the Right stream.
    5. Select ‘Inner’ for Join type.
    6. Under Join conditions, select ‘salesInvoices.InvoiceID’ and ‘salesInvoiceLines.InvoiceID’.

  11. Before we do the aggregation, we need to remove all the extra columns first.

    1. Click on the ‘+’ next to ‘JoinInvoiceWithLines’ and select ‘Select’ under Schema modifier.
    2. Provide an Output stream name.
    3. Select ‘JoinInvoiceWithLines’ for Incoming stream. A list of mappings will be displayed under Input columns. If you have ‘Auto mapping’ enabled, disable it.
    4. Select all the columns except InvoiceDate and LineProfit, then click ‘Delete’.

  12. Let’s aggregate the data to create the DailyProfit column.

    1. Click on the ‘+’ next to ‘SelectColumns’ and select ‘Aggregate’ under Schema modifier.
    2. Provide an Output stream name.
    3. Under the ‘Group by’ section, select ‘InvoiceDate’ as the Column.
    4. Under the ‘Aggregates’ section, type ‘DailyProfit’ under Column and ‘sum(LineProfit)’ under Expression. If you are unfamiliar with which functions can be called, you can access the Expression Builder for assistance.

  13. Now is a good time to verify our aggregation is correct. Click on ‘Data preview’ for the sample result.

  14. With the sample result validated, we need to add our audit columns. To do so, we need to introduce a Data flow parameter. This parameter will be passed in by the ADF pipeline in a later step.

    1. Click on the empty canvas area.
    2. Under Parameters, click ‘+ New’.
    3. Enter ‘PipelineId’ for Name, ‘string’ for Type, and ‘0000-0000’ as the Default value.

  15. To create the audit columns:

    1. Click on the ‘+’ next to ‘DailyProfit’ and select ‘Derived Column’ under Schema modifier.
    2. Provide an Output stream name.
    3. Select ‘DailyProfit’ as the Incoming stream.
    4. Click on ‘+ Add’ twice to create two entries.
    5. Enter ‘CreatedWhen’ for Column and ‘currentTimestamp()’ for Expression. The column will be populated with the processing timestamp.
    6. Enter ‘PipelineId’ for Column and ‘$PipelineId’ for Expression. The value will be retrieved during processing from the Data flow parameter we created in the last step.

  16. Finally, we will write the result to our destination table.

    1. Click on the ‘+’ next to ‘AuditColumns’ and select ‘Sink’ under Destination.
    2. Provide an Output stream name.
    3. Select ‘AuditColumns’ as the Incoming stream.
    4. Select ‘ReportDailyProfits’ as the Dataset.

  17. To complete the end-to-end process, we need to create an ADF pipeline. We will not be able to cover the full detail in this article, but I will cover three key items:

    1. Settings
      As referenced earlier in the tutorial, if we utilize the ‘AutoResolveIntegrationRuntime’, we can configure the Compute type and Core count. This is done here.

    2. Parameters
      The screenshot shows how to pass the Pipeline Run Id to the data flow.

    3. ADF Pipeline Debug
      Run the pipeline debug to ensure the end-to-end process completes and the output is saved in the database. Use your favorite SQL editor to check the result!
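One way to check the result after a debug run is to compare the persisted report against a fresh recomputation. This query is my own suggestion, assuming the table and pipeline were built as described above:

```sql
-- Rows returned here are mismatches between the stored report
-- and a recomputed aggregate; an empty result means the sink
-- wrote the expected values.
SELECT [InvoiceDate], [DailyProfit]
FROM [SampleReports].[DailyProfits]
EXCEPT
SELECT
    i.[InvoiceDate],
    CAST(SUM(il.[LineProfit]) AS decimal(18, 2))
FROM [Sales].[Invoices] AS i
INNER JOIN [Sales].[InvoiceLines] AS il
    ON i.[InvoiceID] = il.[InvoiceID]
GROUP BY i.[InvoiceDate];
```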


With the power of Apache Spark, Azure Data Factory gives data engineers a single tool to orchestrate and process big data efficiently. The no-code approach enables quicker adoption and offers some similarity to SSIS development.

To productionize the solution above, we need to consider the retry settings and creating a trigger as well.
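For example, if a retried or re-triggered run simply appends to the sink, the report table will accumulate duplicate rows. One simple guard, which is my own suggestion rather than part of the tutorial, is to clear the table before each load, either via the sink's pre-copy options or a statement such as:

```sql
-- Clear the previous load so a re-run does not duplicate rows.
TRUNCATE TABLE [SampleReports].[DailyProfits];
```

The PipelineId audit column also makes it easy to spot which run produced which rows if duplicates do slip through.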

Happy Learning!

