Azure

CETAS (Creating External Table As Select) In Azure Synapse Analytics

Introduction

In this post, we will discuss how to create an external table and store the data inside your specified Azure storage in parallel  using TSQL statements.

What is CETAS

CETAS or ‘Create External Table as Select’ can be used with both Dedicated SQL Pool and Serverless SQL Pool to create an external table and parallelly export the results using SQL statement to Hadoop, Azure storage blob or Azure Data Lake Storage Gen2. The data will be stored inside a folder path within the storage which has to be specified. Some of the important elements of CETAS are External Data Source and External File Format and I have written in detail about External data source in my previous article.

I am going to use the same logical database that I created in my previous exercise when creating external datasource and the sample data we will be using is ‘NycTaxi.parquet’ file which I have already uploaded to the data lake storage. Below are the complete steps including the TSQL statements, please refer to my previous article for a detailed explanation of steps 1 & 2.

Step 1 Creating Database scoped Credential

CREATE DATABASE SCOPED CREDENTIAL [ADLS_credential]
WITH IDENTITY='SHARED ACCESS SIGNATURE',
SECRET = 'sv=XXXXXxxxxxxxxxxxxxxxxxxx'
GO

Step 2 – Creating External Data Source

CREATE EXTERNAL DATA SOURCE demo
WITH ( LOCATION = 'https://synadlsgen2.blob.core.windows.net/output',
CREDENTIAL=[ADLS_credential])

Step 3 – Creating External File Format

External file format defines the format of the external data that you are going to access. It will specify the layout of the type of data that is going to be referenced by the external table. In azure synapse currently only two file formats are supported,

  1. Delimited Text
  2. Parquet

I have used the second one since the demo file NycTaxi we are going to use is parquet file format. Default snappycodec compression option has been used for this.

CREATE EXTERNAL FILE FORMAT ParquetFileFormat
WITH (
   FORMAT_TYPE = PARQUET,
   DATA_COMPRESSION = 'org.apache.hadoop.io.compress.SnappyCodec'
)

Step 4 – Creating Schema

The next step is creating the schema as all the external table and other database objects will be contained within the Schema.

CREATE SCHEMA nyctaxi

Step 5 – Creating External Table

It is now time to create the external table which is the objective of this demo. Given below is the query which creates external table and the select statement from which it pulls the data from the parquet file. Now it all looks good, let’s run it.

You can see that the select statement has been included when creating the external table as I am loading the aggregated results from the select statement directly onto the table.

CREATE EXTERNAL TABLE nycdemotable
WITH (
   LOCATION = 'synadlsgen2/output',
   DATA_SOURCE = demo,
   FILE_FORMAT = ParquetFileFormat
)
AS
SELECT passengercount, sum(tripdistancemiles) as TripDistance_SUM, avg(tripdistancemiles) as TripDistance_AVG
FROM
OPENROWSET(BULK 'https://synadlsgen2.blob.core.windows.net/adlsfssyn/NYCTripSmall.parquet',
FORMAT='PARQUET') AS [rows]
WHERE tripdistancemiles>0 AND passengercount>0
GROUP BY passengercount
GO

I have run the select query alone to verify the table count and output and if the parquet available in ADLS is accessible.

We could see the external table is now created. Right click and select top 100 rows to see the results.

Azure Synapse Analytics

Now go to the ADLS storage to check if the output from select has been saved into file format as well.

Azure Synapse Analytics

Summary

This article explains what is CETAS in synapse and how could we utilize it for parallel processing on different file formats. This works best when studying the previous article which explains base concepts of what we used here.

Reference

Microsoft official docs

Tags
Show More

Related Articles

Leave a Reply

Your email address will not be published. Required fields are marked *

Close