This is part one of a two-part series, with a demo, on analyzing data with a Spark pool in Azure Synapse Analytics. Since the topic leans heavily on Apache Spark, I decided to write a dedicated article explaining Apache Spark in Azure, hence this part one. Please make sure to read part two as well for a complete understanding.
Apache Spark in Azure Synapse Analytics
Apache Spark is an open-source, in-memory data processing engine for storing and processing data, including in real time, across a cluster of computers in a simplified way. It can load data into memory for frequently run queries, which returns results faster than normal disk-based reads. Azure lets you create and configure a serverless Apache Spark pool easily, and Spark pools in Azure Synapse are compatible with Azure Storage and Azure Data Lake Storage Gen2, so you can process data that is already stored in Azure.
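As a minimal sketch of the two ideas above (reading from Azure Data Lake Storage Gen2 and keeping frequently queried data in memory), the PySpark snippet below builds an `abfss://` path and caches the resulting DataFrame. The helper functions, the container, account and folder names, and the `region` column are hypothetical placeholders of mine, and the data is assumed to be stored as Parquet. `spark` is the session that a Synapse notebook normally provides.

```python
def adls_gen2_uri(container: str, account: str, path: str) -> str:
    """Build an abfss:// URI for a path inside an ADLS Gen2 container."""
    return f"abfss://{container}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def load_and_cache(spark, uri: str):
    """Read Parquet files at `uri` and mark the DataFrame for in-memory caching."""
    df = spark.read.parquet(uri)
    # cache() is lazy: the data is held in memory only after the first action
    # (for example count()), so later queries avoid re-reading from storage.
    return df.cache()


# Hypothetical usage inside a Synapse notebook, where `spark` already exists:
# uri = adls_gen2_uri("sales", "mystorageaccount", "/raw/2021")
# df = load_and_cache(spark, uri)
# df.count()                            # first action fills the cache
# df.groupBy("region").count().show()   # later queries read from memory
```

Caching only pays off when the same DataFrame is queried more than once; for a single pass over the data it adds memory pressure without any benefit.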
Normally in Azure Databricks we create Spark clusters to run notebooks, but in Azure Synapse Analytics we do not create a cluster; instead, we create a Spark pool. A Spark pool is simply a fully managed Spark service. Its advantages include speed, efficiency and ease of use, to name a few. When creating the Spark pool, we define the number of nodes and the size of each node, after which the pool executes the notebook. Although we set the minimum and maximum number of nodes, the Spark pool itself decides how many nodes to use based on the tasks being executed; we have no control over that.
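To make the min/max idea concrete, here is a toy model in Python. It is not how Synapse actually scales a pool (that logic is internal to the service); the `PoolAutoscale` class and its `nodes_for` method are hypothetical names I made up to show that whatever a job would like to use, the node count always stays inside the range you configured.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PoolAutoscale:
    """Hypothetical stand-in for the min/max node setting of a Spark pool."""
    min_nodes: int
    max_nodes: int

    def __post_init__(self):
        # Reject ranges that make no sense, such as min greater than max.
        if self.min_nodes < 1 or self.max_nodes < self.min_nodes:
            raise ValueError("expected 1 <= min_nodes <= max_nodes")

    def nodes_for(self, requested: int) -> int:
        """Clamp the node count a job would like into the configured range."""
        return max(self.min_nodes, min(requested, self.max_nodes))


pool = PoolAutoscale(min_nodes=3, max_nodes=10)
light_job = pool.nodes_for(1)    # below the minimum, so raised to min_nodes
heavy_job = pool.nodes_for(50)   # above the maximum, so capped at max_nodes
```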
Creating Apache Spark Pool
A serverless Spark pool is similar to a serverless SQL pool (refer to my previous article, which explains the difference between serverless and dedicated pools in Azure Synapse Analytics): you pay only for the nodes consumed by the code you run. If you created the pool and did not run anything on it, you are not charged any cloud cost. Under the hood, the serverless Spark pool creates a Spark session on demand, which runs your code. You also have the option to auto-pause the Spark pool after a set number of idle minutes.
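The pay-for-what-you-use model can be sketched as simple arithmetic. The function below assumes billing is proportional to vCores multiplied by running time; the function name, its parameters and the price used in the example are all hypothetical, so check the Azure pricing page for real rates.

```python
def estimate_spark_cost(nodes: int, vcores_per_node: int,
                        hours_running: float, price_per_vcore_hour: float) -> float:
    """Rough cost estimate: total vCores x hours actually running x unit price."""
    if min(nodes, vcores_per_node) < 0 or hours_running < 0 or price_per_vcore_hour < 0:
        raise ValueError("inputs must be non-negative")
    return nodes * vcores_per_node * hours_running * price_per_vcore_hour


# Made-up numbers: 3 nodes with 4 vCores each, running for half an hour,
# at a hypothetical price of 0.25 per vCore-hour.
busy_cost = estimate_spark_cost(3, 4, 0.5, 0.25)

# A pool that exists but never runs anything accrues no compute cost.
idle_cost = estimate_spark_cost(3, 4, 0.0, 0.25)
```

This is also why auto-pause matters: the shorter the time the nodes stay up after your last job, the smaller the `hours_running` term becomes.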
Apache Spark natively supports many languages, such as Scala, Java, Python and R, with Python and Scala having interactive shells (PySpark and spark-shell). This makes it easy for the other services within Synapse Analytics to consume the large volumes of data processed by Apache Spark. Apache Spark also has a machine learning library (MLlib) built on top of it, which users can access from within the Synapse Spark pool. This can be combined with the built-in support for notebooks to create machine learning applications in Apache Spark.
Microsoft official documentation