Amazon EMR FAQs

General

Amazon EMR is the industry-leading cloud big data platform for data processing, interactive analysis, and machine learning using open source frameworks such as Apache Spark, Apache Hive, and Presto. With EMR you can run petabyte-scale analysis at less than half of the cost of traditional on-premises solutions and over 1.7x faster than standard Apache Spark.

Amazon EMR lets you focus on transforming and analyzing your data without having to worry about managing compute capacity or open-source applications, and saves you money. Using EMR, you can instantly provision as much or as little capacity as you like on Amazon EC2 and set up scaling rules to manage changing compute demand. You can set up CloudWatch alerts to notify you of changes in your infrastructure and take actions immediately. If you use Kubernetes, you can also use EMR to submit your workloads to Amazon EKS clusters. Whether you use EC2 or EKS, you benefit from EMR’s optimized runtimes, which speed your analysis and save both time and money.

You can deploy your workloads to EMR using Amazon EC2, Amazon Elastic Kubernetes Service (EKS), or on-premises AWS Outposts. You can run and manage your workloads with the EMR Console, API, SDK, or CLI, and orchestrate them using Amazon Managed Workflows for Apache Airflow (MWAA) or AWS Step Functions. For an interactive experience, you can use EMR Studio or SageMaker Studio.

To sign up for Amazon EMR, click the “Sign Up Now” button on the Amazon EMR detail page http://aws.amazon.com/emr. You must be signed up for Amazon EC2 and Amazon S3 to access Amazon EMR; if you are not already signed up for these services, you will be prompted to do so during the Amazon EMR sign-up process. After signing up, please refer to the Amazon EMR documentation, which includes our Getting Started Guide – the best place to get going with the service.

Please refer to our Service Level Agreement.

Developing & debugging

Check out the sample code in these Articles and Tutorials. If you use EMR Studio, you can explore the features using a set of notebook examples.

You can develop, visualize and debug data science and data engineering applications written in R, Python, Scala, and PySpark in Amazon EMR Studio. You can also develop a data processing job on your desktop, for example, using Eclipse, Spyder, PyCharm, or RStudio, and run it on Amazon EMR. Additionally, you can select JupyterHub or Zeppelin in the software configuration when spinning up a new cluster and develop your application on Amazon EMR using one or more instances.

The Command Line Tools or APIs provide the ability to programmatically launch and monitor progress of running clusters, to create additional custom functionality around clusters (such as sequences with multiple processing steps, scheduling, workflow, or monitoring), or to build value-added tools or applications for other Amazon EMR customers. In contrast, the AWS Management Console provides an easy-to-use graphical interface for launching and monitoring your clusters directly from a web browser.

Yes. Once the job is running, you can optionally add more steps to it via the AddJobFlowSteps API. The AddJobFlowSteps API will add new steps to the end of the current step sequence. You may want to use this API to implement conditional logic in your cluster or for debugging.
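
As a minimal boto3 sketch, assuming a running cluster (the cluster ID and S3 script path below are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Append a step to the end of a running cluster's step sequence.
response = emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",  # placeholder cluster ID
    Steps=[
        {
            "Name": "Additional Spark step",
            "ActionOnFailure": "CONTINUE",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": ["spark-submit", "s3://my-bucket/jobs/extra_step.py"],
            },
        }
    ],
)
print(response["StepIds"])
```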

You can sign up for Amazon SNS and have the cluster post to your SNS topic when it is finished. You can also view your cluster progress on the AWS Management Console, or you can use the Command Line, SDK, or APIs to get a status on the cluster.
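
For example, a minimal boto3 sketch that fetches the current cluster state (the cluster ID is a placeholder):

```python
import boto3

emr = boto3.client("emr")

# Returns the cluster state, e.g. STARTING, RUNNING, WAITING, or TERMINATED.
status = emr.describe_cluster(ClusterId="j-XXXXXXXXXXXXX")["Cluster"]["Status"]
print(status["State"])
```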

Yes. You can terminate your cluster automatically when all your steps finish by turning the auto-terminate flag on.

Amazon EMR 5.30.0 and later, and the Amazon EMR 6.x series, are based on Amazon Linux 2. You can also specify a custom AMI that you create based on Amazon Linux 2. This allows you to perform sophisticated pre-configuration for virtually any application. For more information, see Using a Custom AMI.

Yes. You can use Bootstrap Actions to install third-party software packages on your cluster. You can also upload statically compiled executables using the Hadoop distributed cache mechanism. EMR 6.x supports Hadoop 3, which allows the YARN NodeManager to launch containers either directly on the EMR cluster host or inside a Docker container. Please see our documentation to learn more.

There are several tools you can use to gather information about your cluster to help determine what went wrong. If you use Amazon EMR Studio, you can launch tools like the Spark UI and YARN Timeline Service to simplify debugging. From the Amazon EMR Console, you can get off-cluster access to persistent application user interfaces for Apache Spark, the Tez UI, and the YARN timeline server, several on-cluster application user interfaces, and a summary view of application history for all YARN applications. You can also connect to your master node using SSH and view cluster instances via the web interfaces. For more information, see our documentation.

EMR Studio

EMR Studio is an integrated development environment (IDE) that makes it easy for data scientists and data engineers to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark.

It is a fully managed application with single sign-on, fully managed Jupyter Notebooks, automated infrastructure provisioning, and the ability to debug jobs without logging into the AWS Console or cluster. Data scientists and analysts can install custom kernels and libraries, collaborate with peers using code repositories like GitHub and Bitbucket, or execute parameterized notebooks as part of scheduled workflows using orchestration services like Apache Airflow, AWS Step Functions, and Amazon Managed Workflows for Apache Airflow. You can read Orchestrating analytics jobs on Amazon EMR Notebooks using Amazon MWAA to learn more. EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark. Administrators can set up EMR Studio for analysts to run their applications on existing EMR clusters or create new clusters using pre-defined AWS CloudFormation templates for EMR.

With EMR Studio, you can log in directly to fully managed Jupyter notebooks using your corporate credentials without logging into the AWS console, start notebooks in seconds, get onboarded with sample notebooks, and perform your data exploration. You can also customize your environment by loading custom kernels and Python libraries from notebooks. EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark. You can collaborate with peers by sharing notebooks via GitHub and other repositories. You can also run notebooks directly as continuous integration and deployment pipelines. You can pass different parameter values to a notebook. You can also chain notebooks, and integrate notebooks into scheduled workflows using workflow orchestration services like Apache Airflow. Further, you can debug clusters and jobs using as few clicks as possible with native application interfaces such as the Spark UI and the YARN Timeline Service.

There are five main differences.

  1. There is no need to access AWS Management Console for EMR Studio. EMR Studio is hosted outside of the AWS Management Console. This is useful if you do not provide data scientists or data engineers access to the AWS Management Console.

  2. You can use enterprise credentials from your identity provider using AWS IAM Identity Center (successor to AWS SSO) to log in to EMR Studio. 

  3. EMR Studio brings you a notebook first experience. EMR Studio kernels and applications run on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark. Running code on a cluster is as simple as attaching the notebook to an existing cluster or provisioning a new one.

  4. EMR Studio has a simplified user interface and abstracts hardware configurations. For example, you can set up cluster templates once and use the templates to start new clusters. 

  5. EMR Studio enables a simplified debugging experience so that you can access the native application user interfaces in one place using as few clicks as possible.

You can use both EMR Studio and SageMaker Studio with Amazon EMR. EMR Studio provides an integrated development environment (IDE) that makes it easy for you to develop, visualize, and debug data engineering and data science applications written in R, Python, Scala, and PySpark. Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all machine learning development steps. SageMaker Studio gives you complete access, control, and visibility into each step required to build, train, and deploy models. You can quickly upload data, create new notebooks, train and tune models, move back and forth between steps to adjust experiments, compare results, and deploy models to production all in one place, making you much more productive. 

Your administrator must first set up an EMR Studio. When you receive a unique sign-on URL for your Amazon EMR Studio from your administrator, you can log in to the Studio directly using your corporate credentials.

No. After your administrator sets up an EMR Studio and provides the Studio access URL, your team can log in using corporate credentials. There’s no need to log in to the AWS Management Console. In an EMR Studio, your team can perform tasks and access resources configured by your administrator.

AWS IAM Identity Center (successor to AWS SSO) is the single sign-on service provider for EMR Studio. The list of identity providers supported by AWS IAM Identity Center can be found in our documentation.

Workspaces help you organize Jupyter Notebooks. All notebooks in a Workspace are saved to the same Amazon S3 location and run on the same cluster. You can also link a code repository like a GitHub repository to all notebooks in a workspace. You can create and configure a Workspace before attaching it to a cluster, but you should connect to a cluster before running a notebook.

Yes, you can create or open a Workspace without attaching it to a cluster. You only need to connect it to a cluster when you are ready to execute code. EMR Studio kernels and applications are executed on EMR clusters, so you get the benefit of distributed data processing using the performance optimized Amazon EMR runtime for Apache Spark.

All Spark queries run on your EMR cluster, so you need to install all runtime libraries that your Spark application uses on the cluster. You can easily install notebook-scoped libraries within a notebook cell. You can also install Jupyter Notebook kernels and Python libraries on a cluster master node either within a notebook cell or while connected using SSH to the master node of the cluster. For more information, see documentation. Additionally, you can use a bootstrap action or a custom AMI to install required libraries when you create a cluster. For more information, see Create Bootstrap Actions to Install Additional Software and Using a Custom AMI in the Amazon EMR Management Guide.
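For example, a minimal sketch of installing notebook-scoped libraries, run from a notebook cell in a PySpark session attached to an EMR cluster (the package names are just examples):

```python
# Run inside a notebook cell in a PySpark session attached to an EMR cluster.
# Installs libraries scoped to this notebook session only; package names
# here are illustrative examples.
sc.install_pypi_package("matplotlib")
sc.install_pypi_package("pandas==1.3.5")

# List the packages currently available to the session.
sc.list_packages()
```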

Workspaces, together with the notebook files they contain, are saved automatically at regular intervals in the ipynb file format to the Amazon S3 location that you specify when you create the Workspace. The notebook file has the same name as your notebook in Amazon EMR Studio.

You can associate Git-based repositories with your Amazon EMR Studio notebooks to save your notebooks in a version controlled environment.

With EMR Studio, you can run notebook code on Amazon EMR running on Amazon Elastic Compute Cloud (Amazon EC2) or Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS). You can attach notebooks to either existing or new clusters. You can create EMR clusters in two ways in EMR Studio: by using a pre-configured cluster template via AWS Service Catalog, or by specifying a cluster name, the number of instances, and the instance type.

Yes. Open your Workspace, choose the EMR Clusters icon on the left, choose Detach, select a cluster from the Select cluster drop-down list, and then choose Attach.

In EMR Studio, choose the Workspaces tab on the left to view all Workspaces created by you and other users in the same AWS account.

Each EMR Studio needs permissions to interoperate with other AWS services. To give your EMR Studios the necessary permissions, your administrators need to create an EMR Studio service role with the provided policies. They also need to specify a user role for EMR Studio that defines Studio-level permissions. When they add users and groups from AWS IAM Identity Center (successor to AWS SSO) to EMR Studio, they can assign a session policy to a user or group to apply fine-grained permission controls. Session policies help administrators refine user permissions without the need to create multiple IAM roles. For more information about session policies, see Policies and Permissions in the AWS Identity and Access Management User Guide.

Yes. High Availability (Multi-master) clusters, Kerberized clusters, and AWS Lake Formation clusters are currently not supported.

Amazon EMR Studio is provided at no additional charge to you. Applicable charges for Amazon Simple Storage Service storage and for Amazon EMR clusters apply when you use EMR Studio. For more information about pricing options and details, see Amazon EMR pricing.

EMR Notebooks

We recommend that new customers use Amazon EMR Studio, not EMR Notebooks. EMR Notebooks provide a managed environment, based on Jupyter Notebook, that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis using EMR clusters. Although we recommend that new customers use EMR Studio, EMR Notebooks is supported for compatibility.

You can use EMR Notebooks to build Apache Spark applications and run interactive queries on your EMR cluster with minimal effort. Multiple users can create serverless notebooks directly from the console, attach them to an existing shared EMR cluster, or provision a cluster directly from the console and immediately start experimenting with Spark. You can detach notebooks and re-attach them to new clusters. Notebooks are auto-saved to S3 buckets, and you can retrieve saved notebooks from the console to resume work. EMR Notebooks are prepackaged with the libraries found in the Anaconda repository, allowing you to import and use these libraries in your notebook code to manipulate data and visualize results. Further, EMR Notebooks have integrated Spark monitoring capabilities that you can use to monitor the progress of your Spark jobs and debug code from within the notebook.

To get started with EMR Notebooks, open the EMR console and choose Notebooks in the navigation pane. From there, choose Create Notebook, enter a name for your notebook, choose an EMR cluster or instantly create a new one, provide a service role for the notebook to use, choose an S3 bucket where you want to save your notebook files, and then choose Create Notebook. After the notebook shows a Ready status, choose Open to start the notebook editor.

EMR Notebooks can be attached to EMR clusters running EMR release 5.18.0 or later.

EMR Notebooks are provided at no additional charge to you. You will be charged as usual for the attached EMR clusters in your account. You can find out more about the pricing for your cluster by visiting https://aws.amazon.com/emr/pricing/.

Managing data

Amazon EMR provides several ways to get data onto a cluster. The most common way is to upload the data to Amazon S3 and use the built-in features of Amazon EMR to load the data onto your cluster. You can use the Distributed Cache feature of Hadoop to transfer files from a distributed file system to the local file system. For more details, see documentation. Alternatively, if you are migrating data from on premises to the cloud, you can use one of the Cloud Data Migration services from AWS.

Hadoop system logs as well as user logs will be placed in the Amazon S3 bucket that you specify when creating a cluster. Persistent application UIs run off-cluster; Spark History Server, Tez UI, and YARN timeline server logs are available for 30 days after an application terminates.

No. At this time Amazon EMR does not compress logs as it moves them to Amazon S3.

Yes. You can use AWS Direct Connect to establish a private dedicated network connection to AWS. If you have large amounts of data, you can use AWS Import/Export. For more details refer to our documentation.

Billing

No. As each cluster and input data is different, we cannot estimate your job duration.

Amazon EMR pricing is simple and predictable: you pay a per-second rate for every second you use, with a one-minute minimum. You can estimate your bill using the AWS Pricing Calculator. Usage for other Amazon Web Services including Amazon EC2 is billed separately from Amazon EMR.

Amazon EMR billing commences when the cluster is ready to execute steps. Amazon EMR billing ends when you request to shut down the cluster. For more details on when Amazon EC2 begins and ends billing, please refer to the Amazon EC2 Billing FAQ.

You can track your usage in the Billing & Cost Management Console.

On the AWS Management Console, every cluster has a Normalized Instance Hours column that displays the approximate number of compute hours the cluster has used, rounded up to the nearest hour.

Normalized Instance Hours are hours of compute time based on the standard of 1 hour of m1.small usage = 1 hour normalized compute time. You can view our documentation to see a list of different sizes within an instance family, and the corresponding normalization factor per hour.

For example, if you run a 10-node r3.8xlarge cluster for an hour, the total number of Normalized Instance Hours displayed on the console will be 640 (10 (number of nodes) x 64 (normalization factor) x 1 (number of hours that the cluster ran) = 640).

This is an approximate number and should not be used for billing purposes. Please refer to the Billing & Cost Management Console for billable Amazon EMR usage.

Yes. Amazon EMR seamlessly supports On-Demand, Spot, and Reserved Instances. Click here to learn more about Amazon EC2 Reserved Instances. Click here to learn more about Amazon EC2 Spot Instances. Click here to learn more about Amazon EC2 Capacity Reservations.

Except as otherwise noted, our prices are exclusive of applicable taxes and duties, including VAT and applicable sales tax. For customers with a Japanese billing address, use of AWS services is subject to Japanese Consumption Tax. Learn more

Security and Data Access Control

Amazon EMR starts your instances in two Amazon EC2 security groups, one for the master and another for the other cluster nodes. The master security group has a port open for communication with the service. It also has the SSH port open to allow you to SSH into the instances, using the key specified at startup. The other nodes start in a separate security group, which only allows interaction with the master instance. By default both security groups are set up to not allow access from external sources including Amazon EC2 instances belonging to other customers. Since these are security groups within your account, you can reconfigure them using the standard EC2 tools or dashboard. Click here to learn more about EC2 security groups. Additionally, you can configure Amazon EMR block public access in each region that you use to prevent cluster creation if a rule allows public access on any port that you don't add to a list of exceptions.

Amazon S3 provides authentication mechanisms to ensure that stored data is secured against unauthorized access. Unless the customer who is uploading the data specifies otherwise, only that customer can access the data. Amazon EMR customers can also choose to send data to Amazon S3 using the HTTPS protocol for secure transmission. In addition, Amazon EMR always uses HTTPS to send data between Amazon S3 and Amazon EC2. For added security, customers may encrypt the input data before they upload it to Amazon S3 (using any common data encryption tool); they then need to add a decryption step to the beginning of their cluster when Amazon EMR fetches the data from Amazon S3.

Yes. AWS CloudTrail is a web service that records AWS API calls for your account and delivers log files to you. The AWS API call history produced by CloudTrail enables security analysis, resource change tracking, and compliance auditing. Learn more about CloudTrail at the AWS CloudTrail detail page, and turn it on via CloudTrail's AWS Management Console.

By default, Amazon EMR application processes use EC2 instance profiles when they call other AWS services. For multi-tenant clusters, Amazon EMR offers three options to manage user access to Amazon S3 data.

  1. Integration with AWS Lake Formation allows you to define and manage fine-grained authorization policies in AWS Lake Formation to access databases, tables, and columns in AWS Glue Data Catalog. You can enforce the authorization policies on jobs submitted through Amazon EMR Notebooks and Apache Zeppelin for interactive EMR Spark workloads, and send auditing events to AWS CloudTrail. By enabling this integration, you also enable federated Single Sign-On to EMR Notebooks or Apache Zeppelin from enterprise identity systems compatible with Security Assertion Markup Language (SAML) 2.0.

  2. Native integration with Apache Ranger allows you to set up a new or an existing Apache Ranger server to define and manage fine-grained authorization policies for users to access databases, tables, and columns of Amazon S3 data via Hive Metastore. Apache Ranger is an open-source tool to enable, monitor, and manage comprehensive data security across the Hadoop platform.

    This native integration allows you to define three types of authorization policies on the Apache Ranger Policy Admin server. You can set table, column, and row level authorization for Hive, table and column level authorization for Spark, and prefix and object level authorization for Amazon S3. Amazon EMR automatically installs and configures the corresponding Apache Ranger plugins on the cluster. These Ranger plugins sync up with the Policy Admin server for authorization policies, enforce data access control, and send auditing events to Amazon CloudWatch Logs.

  3. Amazon EMR User Role Mapper allows you to leverage AWS IAM permissions to manage accesses to AWS resources. You can create mappings between users (or groups) and custom IAM roles. A user or group can only access the data permitted by the custom IAM role. This feature is currently available through AWS Labs.

Regions and Availability Zones

Amazon EMR launches all nodes for a given cluster in the same Amazon EC2 Availability Zone. Running a cluster in the same zone improves performance of the job flows. By default, Amazon EMR chooses the Availability Zone with the most available resources in which to run your cluster. However, you can specify another Availability Zone if required. You also have the option to optimize your allocation for lowest-priced On-Demand Instances, optimal Spot capacity, or use On-Demand Capacity Reservations.

For a list of the supported Amazon EMR AWS regions, please visit the AWS Region Table for all AWS global infrastructure.

EMR supports launching clusters in the Los Angeles AWS Local Zone. You can use EMR in the US West (Oregon) region to launch clusters into subnets associated with the Los Angeles AWS Local Zone.

When creating a cluster, typically you should select the Region where your data is located.

Yes, you can. If you transfer data from one region to the other you will be charged bandwidth charges. For bandwidth pricing information, please visit the pricing section on the EC2 detail page.

The AWS GovCloud (US) region is designed for US government agencies and customers. It adheres to US ITAR requirements. In GovCloud, EMR does not support spot instances or the enable-debugging feature. The EMR Management Console is not yet available in GovCloud.

Deployment options

Amazon EMR on Amazon EC2

A cluster is a collection of Amazon Elastic Compute Cloud (Amazon EC2) instances. Each instance in the cluster is called a node and has a role within the cluster, referred to as the node type. Amazon EMR also installs different software components on each node type, giving each node a role in a distributed application like Apache Hadoop. Every cluster has a unique identifier that starts with "j-".

An Amazon EMR cluster has three types of nodes:

  1. Master node: A node that manages the cluster by running software components to coordinate the distribution of data and tasks among other nodes for processing. The master node tracks the status of tasks and monitors the health of the cluster. Every cluster has a master node, and it's possible to create a single-node cluster with only the master node.

  2. Core node: A node with software components that run tasks and store data in the Hadoop Distributed File System (HDFS) on your cluster. Multi-node clusters have at least one core node.

  3. Task node: A node with software components that only runs tasks and does not store data in HDFS. Task nodes are optional.

A cluster step is a user-defined unit of processing, mapping roughly to one algorithm that manipulates the data. A step is a Hadoop MapReduce application implemented as a Java jar or a streaming program written in Java, Ruby, Perl, Python, PHP, R, or C++. For example, to count the frequency with which words appear in a document, and output them sorted by the count, the first step would be a MapReduce application which counts the occurrences of each word, and the second step would be a MapReduce application which sorts the output from the first step based on the counts.
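As an illustrative sketch of the same two steps, expressed in PySpark rather than as Hadoop MapReduce jars (not an official EMR sample; the S3 paths are placeholders):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("WordCount").getOrCreate()
sc = spark.sparkContext

# Step 1: count the occurrences of each word (S3 paths are placeholders).
counts = (
    sc.textFile("s3://my-bucket/input/")
      .flatMap(lambda line: line.split())
      .map(lambda word: (word, 1))
      .reduceByKey(lambda a, b: a + b)
)

# Step 2: sort the words by count, descending, and write the result.
counts.sortBy(lambda pair: pair[1], ascending=False) \
      .saveAsTextFile("s3://my-bucket/output/")
```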

STARTING – The cluster starts by configuring EC2 instances.
BOOTSTRAPPING – Bootstrap actions are being executed on the cluster.
RUNNING – A step for the cluster is currently being run.
WAITING – The cluster is currently active, but has no steps to run.
TERMINATING - The cluster is in the process of shutting down.
TERMINATED - The cluster was shut down without error.
TERMINATED_WITH_ERRORS - The cluster was shut down with errors.

PENDING – The step is waiting to be run.

RUNNING – The step is currently running.

COMPLETED – The step completed successfully.

CANCELLED – The step was cancelled before running because an earlier step failed or the cluster was terminated before it could run.

FAILED – The step failed while running.

You can launch a cluster through the AWS Management Console by filling out a simple cluster request form. In the request form, you specify the name of your cluster, the location in Amazon S3 of your input data, your processing application, your desired data output location, and the number and type of Amazon EC2 instances you’d like to use. Optionally, you can specify a location to store your cluster log files and an SSH key to log in to your cluster while it is running. Alternatively, you can launch a cluster using the RunJobFlow API or using the ‘create’ command in the Command Line Tools. For launching a cluster with EMR Studio, refer to the EMR Studio section above.
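
For example, a minimal boto3 sketch of the RunJobFlow API (the names, instance types, and S3 log path are placeholders, and the default EMR roles are assumed to already exist in your account):

```python
import boto3

emr = boto3.client("emr")

# Launch a small cluster; all names, paths, and roles below are placeholders.
response = emr.run_job_flow(
    Name="example-cluster",
    ReleaseLabel="emr-6.6.0",
    LogUri="s3://my-bucket/emr-logs/",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # auto-terminate when steps finish
    },
    Applications=[{"Name": "Spark"}],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```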

At any time, you can terminate a cluster via the AWS Management Console by selecting a cluster and clicking the “Terminate” button. Alternatively, you can use the TerminateJobFlows API. If you terminate a running cluster, any results that have not been persisted to Amazon S3 will be lost and all Amazon EC2 instances will be shut down.
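
For example, with boto3 (the cluster ID is a placeholder):

```python
import boto3

emr = boto3.client("emr")

# Terminate the cluster; results not persisted to Amazon S3 are lost.
emr.terminate_job_flows(JobFlowIds=["j-XXXXXXXXXXXXX"])
```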

You can start as many clusters as you like. When you get started, you are limited to 20 instances across all your clusters. If you need more instances, complete the Amazon EC2 instance request form. Once your Amazon EC2 limit is raised, the new limit will be automatically applied to your Amazon EMR clusters.

You can upload your input data and a data processing application into Amazon S3. Amazon EMR then launches the number of Amazon EC2 instances that you specified. The service begins the cluster execution while pulling the input data from Amazon S3 using the S3 URI scheme into the launched Amazon EC2 instances. Once the cluster is finished, Amazon EMR transfers the output data to Amazon S3, where you can then retrieve it or use it as input in another cluster.

Amazon EMR uses the Hadoop data processing engine to conduct computations implemented in the MapReduce programming model. The customer implements their algorithm in terms of map() and reduce() functions. The service starts a customer-specified number of Amazon EC2 instances, consisting of one master node and multiple other nodes. Amazon EMR runs Hadoop software on these instances. The master node divides input data into blocks, and distributes the processing of the blocks to the other nodes. Each node then runs the map function on the data it has been allocated, generating intermediate data. The intermediate data is then sorted and partitioned and sent to processes which apply the reducer function to it locally on the nodes. Finally, the output from the reducer tasks is collected in files. A single “cluster” may involve a sequence of such MapReduce steps.

See the EMR pricing page for details on latest available instance types and pricing per region.

The time to run your cluster will depend on several factors including the type of your cluster, the amount of input data, and the number and type of Amazon EC2 instances you choose for your cluster.

Yes. You can launch an EMR cluster (version 5.23 or later) with three master nodes and support high availability of applications like YARN Resource Manager, HDFS Name Node, Spark, Hive, and Ganglia. Amazon EMR automatically fails over to a standby master node if the primary master node fails or if critical processes, like Resource Manager or Name Node, crash. Since the master node is not a potential single point of failure, you can run your long-lived EMR clusters without interruption. In the event of a failover, Amazon EMR automatically replaces the failed master node with a new master node with the same configuration and bootstrap actions. 

Yes. Amazon EMR is fault tolerant for node failures and continues job execution if a node goes down. Amazon EMR will also provision a new node when a core node fails. However, Amazon EMR will not replace nodes if all nodes in the cluster are lost.

Yes. You can SSH onto your cluster nodes and execute Hadoop commands directly from there. If you need to SSH into a specific node, you have to first SSH to the master node, and then SSH into the desired node.

Bootstrap Actions is a feature in Amazon EMR that provides users a way to run custom set-up prior to the execution of their cluster. Bootstrap Actions can be used to install software or configure instances before running your cluster. You can read more about bootstrap actions in EMR's Developer Guide.

You can write a Bootstrap Action script in any language already installed on the cluster instance including Bash, Perl, Python, Ruby, C++, or Java. There are several pre-defined Bootstrap Actions available. Once the script is written, you need to upload it to Amazon S3 and reference its location when you start a cluster. Please refer to the Developer Guide for details on how to use Bootstrap Actions.
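
As a hedged sketch, a Bootstrap Action can be referenced at cluster launch with boto3; the script path and arguments below are placeholders:

```python
import boto3

emr = boto3.client("emr")

# Bootstrap Actions run on each node before the cluster starts processing.
response = emr.run_job_flow(
    Name="cluster-with-bootstrap",  # placeholder
    ReleaseLabel="emr-6.6.0",
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,
    },
    BootstrapActions=[
        {
            "Name": "Install custom packages",
            "ScriptBootstrapAction": {
                "Path": "s3://my-bucket/bootstrap/install.sh",  # placeholder script
                "Args": ["--with-extras"],                      # placeholder args
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(response["JobFlowId"])
```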

The EMR default Hadoop configuration is appropriate for most workloads. However, based on your cluster’s specific memory and processing requirements, it may be appropriate to tune these settings. For example, if your cluster tasks are memory-intensive, you may choose to use fewer tasks per core and reduce your job tracker heap size. For this situation, a pre-defined Bootstrap Action is available to configure your cluster on startup. See the Configure Memory Intensive Bootstrap Action in the Developer’s Guide for configuration details and usage instructions. An additional predefined bootstrap action is available that allows you to customize your cluster settings to any value of your choice. See the Configure Hadoop Bootstrap Action in the Developer’s Guide for usage instructions.

Yes. Nodes can be of two types: (1) core nodes, which both host persistent data using Hadoop Distributed File System (HDFS) and run Hadoop tasks and (2) task nodes, which only run Hadoop tasks. While a cluster is running you may increase the number of core nodes and you may either increase or decrease the number of task nodes. This can be done through the API, Java SDK, or through the command line client. Please refer to the Resizing Running Clusters section in the Developer’s Guide for details on how to modify the size of your running cluster. You can also use EMR Managed Scaling.
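
For example, a minimal boto3 sketch that resizes a running cluster’s task instance group (both identifiers are placeholders):

```python
import boto3

emr = boto3.client("emr")

# Scale the task instance group to 8 nodes; both IDs are placeholders.
emr.modify_instance_groups(
    ClusterId="j-XXXXXXXXXXXXX",
    InstanceGroups=[
        {"InstanceGroupId": "ig-XXXXXXXXXXXXX", "InstanceCount": 8}
    ],
)
```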

As core nodes host persistent data in HDFS and cannot be removed, core nodes should be reserved for the capacity that is required until your cluster completes. As task nodes can be added or removed and do not contain HDFS, they are ideal for capacity that is only needed on a temporary basis. You can launch task instance fleets on Spot Instances to increase capacity while minimizing costs.

There are several scenarios where you may want to modify the number of nodes in a running cluster. If your cluster is running slower than expected, or timing requirements change, you can increase the number of core nodes to increase cluster performance. If different phases of your cluster have different capacity needs, you can start with a small number of core nodes and increase or decrease the number of task nodes to meet your cluster’s varying capacity requirements. You can also use EMR Managed Scaling.

Yes. You may include a predefined step in your workflow that automatically resizes a cluster between steps that are known to have different capacity needs. As all steps are guaranteed to run sequentially, this allows you to set the number of nodes that will execute a given cluster step.

To create a new cluster that is visible to all IAM users within the EMR CLI: Add the --visible-to-all-users flag when you create the cluster. For example: elastic-mapreduce --create --visible-to-all-users. Within the Management Console, simply select “Visible to all IAM Users” on the Advanced Options pane of the Create cluster Wizard.

To make an existing cluster visible to all IAM users you must use the EMR CLI. Use --set-visible-to-all-users and specify the cluster identifier. For example: elastic-mapreduce --set-visible-to-all-users true --jobflow j-xxxxxxx. This can only be done by the creator of the cluster.

To learn more, see the Configuring User Permissions section of the EMR Developer Guide.

You can add tags to an active Amazon EMR cluster. An Amazon EMR cluster consists of Amazon EC2 instances, and a tag added to an Amazon EMR cluster will be propagated to each active Amazon EC2 instance in that cluster. You cannot add, edit, or remove tags from terminated clusters or terminated Amazon EC2 instances which were part of an active cluster.

No, Amazon EMR does not support resource-based permissions by tag. However, it is important to note that tags propagated to Amazon EC2 instances behave as normal Amazon EC2 tags. Therefore, an IAM policy for Amazon EC2 will act on tags propagated from Amazon EMR if they match conditions in that policy.

You can add up to ten tags on an Amazon EMR cluster.

Yes, Amazon EMR propagates the tags added to a cluster to that cluster's underlying EC2 instances. If you add a tag to an Amazon EMR cluster, it will also appear on the related Amazon EC2 instances. Likewise, if you remove a tag from an Amazon EMR cluster, it will also be removed from its associated Amazon EC2 instances. However, if you are using IAM policies for Amazon EC2 and plan to use Amazon EMR's tagging functionality, you should make sure that permission to use the Amazon EC2 tagging APIs CreateTags and DeleteTags is granted.

Select the tags you would like to use in your AWS billing report here. Then, to see the cost of your combined resources, you can organize your billing information based on resources that have the same tag key values.

An Amazon EC2 instance associated with an Amazon EMR cluster will have two system tags:

  • aws:elasticmapreduce:instance-group-role=CORE
    Key = instance-group-role; Value = [CORE or TASK]

  • aws:elasticmapreduce:job-flow-id=j-12345678
    Key = job-flow-id; Value = [JobFlowID]

Yes, you can add or remove tags directly on Amazon EC2 instances that are part of an Amazon EMR cluster. However, we do not recommend doing this, because Amazon EMR’s tagging system will not sync the changes you make to an associated Amazon EC2 instance directly. We recommend that tags for Amazon EMR clusters be added and removed from the Amazon EMR console, CLI, or API to ensure that the cluster and its associated Amazon EC2 instances have the correct tags.

EMR Serverless

Amazon EMR Serverless is a new deployment option in Amazon EMR that allows you to run big data frameworks such as Apache Spark and Apache Hive without configuring, managing, and scaling clusters.

Data engineers, analysts, and scientists can use EMR Serverless to build applications using open-source frameworks such as Apache Spark and Apache Hive. They can use these frameworks to transform data, run interactive SQL queries, and run machine learning workloads.

You can use EMR Studio, the AWS CLI, or APIs to submit jobs, track job status, and build your data pipelines to run on EMR Serverless. To get started with EMR Studio, sign in to the AWS Management Console, navigate to Amazon EMR under the Analytics category, and select Amazon EMR Serverless. Follow the directions in the Getting started guide to create your EMR Serverless application and submit jobs. You can refer to the Interacting with your application on the AWS CLI page to launch your applications and submit jobs using the CLI. You can also find EMR Serverless examples and sample code in our GitHub repository.
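
As a minimal boto3 sketch of the same flow (the application name, IAM role ARN, and S3 paths are placeholders):

```python
import boto3

serverless = boto3.client("emr-serverless")

# 1) Create a Spark application on a specific EMR release.
app = serverless.create_application(
    name="example-app",  # placeholder name
    releaseLabel="emr-6.6.0",
    type="SPARK",
)

# 2) Submit a job run; the role ARN and S3 paths are placeholders.
job = serverless.start_job_run(
    applicationId=app["applicationId"],
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",
            "entryPointArguments": ["--input", "s3://my-bucket/input/"],
        }
    },
)
print(job["jobRunId"])
```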

EMR Serverless currently supports Apache Spark and Apache Hive engines. If you want support for additional frameworks such as Apache Presto or Apache Flink, please send a request to emr-feedback@amazon.com.

EMR Serverless is available in the following AWS Regions: Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), South America (São Paulo), US East (N. Virginia), US East (Ohio), US West (N. California), and US West (Oregon).

Amazon EMR provides the option to run applications on EC2 based clusters, EKS clusters, Outposts, or Serverless. EMR on EC2 clusters are suitable for customers who need maximum control and flexibility over running their application. With EMR on EC2 clusters, customers can choose the EC2 instance type to meet application-specific performance needs, customize the Linux AMI, customize the EC2 instance configuration, customize and extend open-source frameworks, and install additional custom software on cluster instances. Amazon EMR on EKS is suitable for customers who want to standardize on EKS to manage clusters across applications or use different versions of an open-source framework on the same cluster. Amazon EMR on AWS Outposts is for customers who want to run EMR closer to their data center, within an Outpost. EMR Serverless is suitable for customers who want to avoid managing and operating clusters and prefer to run applications using open-source frameworks.

| Feature | EMR Serverless | Amazon EMR on EC2 | Amazon EMR on EKS |
| --- | --- | --- | --- |
| Support for latest open-source versions | Y | Y | Y |
| Resilience to Availability Zone failures | Y | N | Y |
| Automatically scale resources up and down as needed | Y | Y | Y |
| Encryption for data at rest | Y | Y | Y |
| Open-source frameworks supported | Spark and Hive | Multiple | Spark |
| Support for fine-grained authorization using AWS Lake Formation | N | Y | N |
| Support for Apache Hudi and Apache Iceberg | Y | Y | Y |
| Integration with Apache Ranger for table- and column-level permission control | N | Y | N |
| Customize operating system images | Y | Y | Y |
| Customize open-source framework installed | Y | Y | Y |
| Customize and load additional libraries and dependencies | Y | Y | Y |
| Run workloads from SageMaker Studio as part of machine learning (ML) workflow | N | Y | N |
| Connect to self-hosted Jupyter Notebooks | N | Y | Y |
| Build and orchestrate pipelines using Apache Airflow and Amazon Managed Workflows for Apache Airflow (MWAA) | Y | Y | Y |
| Build and orchestrate pipelines using AWS Step Functions | Y | Y | Y |

EMR Serverless supports EMR release labels 6.6 and above. With EMR Serverless, you get the same performance-optimized EMR runtime available in other EMR deployment options, which is 100% API-compatible with standard open-source frameworks.

BilledResourceUtilization solely factors in the duration for which pre-initialized capacity was utilized for the job and does not take into account any idle time of such capacity.

If a worker’s runtime duration is less than 60 seconds, BilledResourceUtilization would count it as 60 seconds, whereas TotalResourceUtilization would round it up to the nearest second. Additionally, BilledResourceUtilization excludes 20 GB of free storage from the calculation.

With Amazon EMR Serverless, you can create one or more EMR Serverless applications that use open-source analytics frameworks. To create an application, you must specify the following attributes: 1) the Amazon EMR release version for the open-source framework version you want to use and 2) the specific analytics engines that you want your application to use, such as Apache Spark 3.1 or Apache Hive 3.0. After you create an application, you can start running your data processing jobs or interactive requests to your application.

An EMR Serverless application internally uses workers to execute your workloads. When a job is submitted, EMR Serverless computes the resources needed for the job and schedules workers. EMR Serverless breaks down your workloads into tasks, provisions and sets up workers with the open-source framework, and decommissions them when the job completes. EMR Serverless automatically scales workers up or down depending on the workload and parallelism required at every stage of the job, thereby removing the need for you to estimate the number of workers required to run your workloads. The default size of these workers is based on your application type and Amazon EMR release version. You can override these sizes when scheduling a job run.

With EMR Serverless, you can specify the minimum and maximum number of concurrent workers and the vCPU and memory configuration for workers. You can also set the maximum capacity limits on the application’s resources to control costs.

Consider creating multiple applications when doing any of the following:

  1. Using different open-source frameworks

  2. Using different versions of open-source frameworks for different use cases

  3. Performing A/B testing when upgrading from one version to another

  4. Maintaining separate logical environments for test and production scenarios

  5. Providing separate logical environments for different teams with independent cost controls and usage tracking

  6. Separating different lines-of-business applications

Yes, you can modify application properties such as initial capacity, maximum capacity limits, and network configuration using EMR Studio or the update-application API/CLI call.

An EMR Serverless application without pre-initialized workers takes up to 120 seconds to determine the required resources and provision them. EMR Serverless provides an optional feature that keeps workers initialized and ready to respond in seconds, effectively creating an on-call pool of workers for an application. This feature is called pre-initialized capacity and can be configured for each application by setting the initial-capacity parameter of an application.

Pre-initialized capacity allows jobs to start immediately, making it ideal for implementing time-sensitive jobs. You can specify the number of workers that you want to pre-initialize when you start an EMR Serverless application. Subsequently, when users submit jobs, the pre-initialized workers can be used to immediately start the jobs. If the job requires more workers than what you have chosen to pre-initialize, EMR Serverless automatically adds more workers (up to the maximum concurrent limit that you specify). After the job finishes, EMR Serverless automatically reverts back to maintaining the pre-initialized workers that you specified. Workers automatically shut down if they have been idle for 15 minutes. You can change the default idle timeout for your application using the updateApplication API or EMR Studio.
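
As a hedged boto3 sketch (the application ID, worker counts, worker sizes, and timeout are placeholder values), pre-initialized capacity and the idle timeout can be adjusted with the update_application call:

```python
import boto3

serverless = boto3.client("emr-serverless")

# All IDs, sizes, and the timeout below are illustrative placeholders.
serverless.update_application(
    applicationId="00f1abcdexample",
    initialCapacity={
        "DRIVER": {
            "workerCount": 1,
            "workerConfiguration": {"cpu": "2 vCPU", "memory": "4 GB"},
        },
        "EXECUTOR": {
            "workerCount": 4,
            "workerConfiguration": {"cpu": "4 vCPU", "memory": "8 GB"},
        },
    },
    # Shut down idle workers after 30 minutes instead of the default 15.
    autoStopConfiguration={"enabled": True, "idleTimeoutMinutes": 30},
)
```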

You can submit and manage EMR Serverless jobs using EMR Studio, SDK/CLI, or our Apache Airflow connectors.

For PySpark, you can package your Python dependencies using virtualenv and pass the archive file using the --archives option, which enables your workers to use the dependencies during the job run. For Scala or Java, you can package your dependencies as jars, upload them to Amazon S3, and pass them using the --jars or --packages options with your EMR Serverless job run.
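
For example, a hedged boto3 sketch that passes a packed virtualenv archive to a Spark job run (the application ID, role ARN, and S3 paths are placeholders; your job may also need Spark properties pointing PYSPARK_PYTHON at the unpacked environment):

```python
import boto3

serverless = boto3.client("emr-serverless")

# pyspark_venv.tar.gz is a virtualenv archive previously uploaded to S3.
serverless.start_job_run(
    applicationId="00f1abcdexample",  # placeholder
    executionRoleArn="arn:aws:iam::123456789012:role/EMRServerlessJobRole",
    jobDriver={
        "sparkSubmit": {
            "entryPoint": "s3://my-bucket/jobs/etl.py",
            # Unpack the archive on each worker under the alias "environment".
            "sparkSubmitParameters": (
                "--archives s3://my-bucket/envs/pyspark_venv.tar.gz#environment"
            ),
        }
    },
)
```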

EMR Serverless supports Java-based UDFs. You can package them as jars, upload them to S3, and use them in your Spark or HiveQL scripts.

Refer to the Supported Worker Configuration for details.

Yes, you can cancel a running EMR Serverless job from EMR Studio or by calling the cancelJobRun API/CLI.
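
For example, with boto3 (both identifiers are placeholders):

```python
import boto3

serverless = boto3.client("emr-serverless")

# Cancel a running job; both identifiers are placeholders.
serverless.cancel_job_run(
    applicationId="00f1abcdexample",
    jobRunId="00f2defgexample",
)
```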

You can add extra storage to the workers in EMR Serverless by selecting the appropriate storage option during job submission. EMR Serverless offers two ephemeral storage options:

  • Standard storage: This option comes with 20 GB of ephemeral storage per worker by default. You can customize this during job submission and increase the storage capacity from 20 GB up to 200 GB per worker.

  • Shuffle-optimized Disk storage: This option provides up to 2 TB of ephemeral storage per worker, optimized for shuffle-intensive workloads.

You can find EMR Serverless code samples in our GitHub repository.

EMR Serverless offers two options for workers: on-demand workers and pre-initialized workers.

On-demand workers are launched only when needed for a job and are released automatically when the job is complete. This helps you save costs by paying only for the resources that you use and avoids any additional costs for idle capacity. On-demand workers scale your application up or down based on your workload, so you don’t have to worry about over- or under-provisioning resources.

Pre-initialized workers are an optional feature where you can keep workers ready to respond in seconds. This effectively creates a warm pool of workers for an application. This allows jobs to start instantly, making it ideal for iterative applications and time-sensitive jobs.

Yes, it is possible to configure EMR Serverless applications across multiple AZs. The process to set up multiple AZs depends on the type of workers that you use.

When using On-Demand workers only, EMR Serverless distributes jobs across multiple AZs by default, but each job runs only in one AZ. You can choose which AZs to use by associating subnets with the AZs. If an AZ fails, EMR Serverless automatically runs your job in another healthy AZ.

When using Pre-Initialized workers, EMR Serverless selects a healthy AZ from the subnets that you specify. Jobs are submitted in that AZ until you stop the application. If an AZ becomes impaired, you can restart the application to switch to another healthy AZ.

EMR Serverless can only access certain AWS resources in the same Region when configured without VPC connectivity. See considerations. To access AWS resources in a different Region or non-AWS resources, you will need to set up VPC access and a NAT gateway to route to public endpoints for the AWS resources.

Amazon EMR Serverless application- and job-level metrics are published every 30 seconds to Amazon CloudWatch.

From EMR Studio, you can select a running or completed EMR Serverless job and then click on the Spark UI or Tez UI button to launch them.

Yes, you can configure Amazon EMR Serverless applications to access resources in your own VPC. See the Configuring VPC access section in the documentation to learn more. 

Each EMR Serverless application is isolated from other applications and runs on a secure Amazon VPC.

Amazon EMR Serverless is introducing a new service quota called Max concurrent vCPUs per account. This vCPU-based quota allows you to set the maximum number of aggregate vCPUs your applications are able to scale up to within a Region. The existing application-level, worker-based quotas (Maximum active workers) will reach end of support on February 1, 2023.

You can view, manage, and request a quota increase in the AWS Service Quotas Management console. For more information, see Requesting a Quota Increase in the Service Quotas User Guide.

EMR Serverless provides two cost controls: 1/ the maximum concurrent vCPUs per account quota, which is applied across all EMR Serverless applications in a Region in your account, and 2/ the maximumCapacity parameter, which limits the vCPU of a specific EMR Serverless application. You should use the vCPU-based quota to limit the maximum concurrent vCPUs used by all applications in a Region, and the maximumCapacity property to limit the resources used by a specific application. For example, if you have 5 applications and each application can scale up to 1000 vCPUs, set the maximumCapacity property to 1000 vCPUs for each application and configure the account-level vCPU-based quota to 5 * 1000 = 5000 vCPUs.

If you exceed your account-level vCPU quota, EMR Serverless will stop provisioning new capacity. If you try creating a new application after exceeding the quota, the application creation will fail with an error message “Application failed to create as you have exceeded the maximum concurrent vCPUs per account service quota. You can view and manage your service quota using AWS Service Quotas console.” If you submit a new job after exceeding the quota, the job will fail with an error message: “Job failed as you have exceeded the maximum concurrent vCPUs per account service quota. You can view and manage your service quota using AWS Service Quotas console.” Please refer to the documentation for more details.

There are three ways in which Amazon EMR Serverless can help you save costs. First, there is no operational overhead of managing, securing, and scaling clusters. Second, EMR Serverless automatically scales workers up at each stage of processing your job and scales them down when they’re not required. You’re charged for aggregate vCPU, memory, and storage resources used from the time a worker starts running until it stops, rounded up to the nearest second with a 1-minute minimum. For example, your job may require 10 workers for the first 10 minutes of processing the job and 50 workers for the next 5 minutes. With fine-grained automatic scaling, you incur costs for only 10 workers for 10 minutes and 50 workers for 5 minutes. As a result, you don’t have to pay for underutilized resources. Third, EMR Serverless includes the Amazon EMR performance-optimized runtime for Apache Spark, Apache Hive, and Presto. The Amazon EMR runtime is API-compatible and over twice as fast as standard open-source analytics engines, so your jobs run faster and incur fewer compute costs.

It depends on your current EMR on EC2 cluster utilization. If you are running EMR clusters using EC2 On-Demand Instances, EMR Serverless will offer a lower total cost of ownership (TCO) if your current cluster utilization is less than 70%. If you are using the EC2 Savings Plans, EMR Serverless will offer a lower TCO if your current cluster utilization is less than 50%. And if you use EC2 Spot Instances, Amazon EMR on EC2 and Amazon EMR on EKS will continue to be more cost effective.

Yes, if you do not stop workers after a job is complete, you will incur charges on pre-initialized workers.
