
You are viewing documentation for Immuta version 2023.4.

For the latest version, view our documentation for Immuta SaaS or the latest self-hosted version.

Quickstart Installation Guide for Immuta on AWS EMR

Audience: System Administrators

Content Summary: This simple deployment guide familiarizes users with Immuta on EMR. It is only meant for deploying clusters for non-production purposes, such as demos or proofs of concept. For more robust deployments, please see the main installation guide for Immuta on EMR.

Deprecation notice

Support for this integration has been deprecated.

Installation Prerequisites

AWS Resources

  • AWS CLI (v1.16.x or greater) installed in a bash environment.

    • The CLI should be configured to use a role that is able to fully manage EMR, IAM, and S3 resources. This can be a user role in a local environment or an instance role on an EC2 instance.
  • Resource IDs for your chosen AWS VPC subnet and EMR-managed security groups.

    • Be sure that your master and worker security groups are configured for bi-directional communication with your Immuta instance.
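The AWS CLI prerequisites above can be checked before running the quickstart. The following is a minimal sketch, not part of the Immuta tooling: the helper names `version_ge` and `parse_cli_version` are hypothetical, the comparison relies on GNU `sort -V`, and `1.16.0` is assumed as the concrete floor for "v1.16.x or greater". It does not check whether the role can manage EMR, IAM, and S3; it only shows which identity the CLI is using.

```shell
#!/usr/bin/env bash
# Preflight sketch: confirm the AWS CLI meets the 1.16.x minimum and has credentials.

# version_ge A B: succeeds when version A >= version B (relies on GNU `sort -V`).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# parse_cli_version LINE: extract "1.16.312" from a line such as
# "aws-cli/1.16.312 Python/3.6.0 ..." as printed by `aws --version`.
parse_cli_version() {
  printf '%s\n' "$1" | sed -n 's|^aws-cli/\([0-9][0-9.]*\).*|\1|p'
}

if command -v aws >/dev/null 2>&1; then
  installed="$(parse_cli_version "$(aws --version 2>&1)")"
  if version_ge "$installed" "1.16.0"; then
    echo "AWS CLI $installed meets the minimum version"
  else
    echo "AWS CLI '$installed' is older than 1.16.0 or could not be parsed" >&2
  fi
  # Prints the account and ARN that the CLI is currently acting as.
  aws sts get-caller-identity || echo "AWS CLI has no usable credentials" >&2
else
  echo "AWS CLI not found on PATH" >&2
fi
```

If the identity printed by `aws sts get-caller-identity` is not the role you intend to use, reconfigure the CLI before continuing.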

Immuta Resources

  • An instance of Immuta that is reachable from your chosen AWS VPC.
  • A username and password for the Immuta archives site. If you need these, reach out to your Immuta support professional.

Run the Immuta EMR Quickstart Script

First, download the quickstart script here.

Next, run the script. Note that you will be prompted for input variables. If a variable is not required, you can press enter to use the displayed default value.

See below for an example of the script being run and prompting for variables. Note that any input in the example is simply for demonstration purposes; you will need to provide your own values.

$ ./immuta-emr-quickstart.sh

* Enter Cluster Name [immuta-quickstart]:

* Enter EMR Version [Default: 5.23.0]:

* Enter Immuta Version [Default: 2023.4.12_20240507]:

* Enter Immuta Instance URL [REQUIRED]: https://immuta.mycompany.com

* Enter AWS Region [us-east-1]:

* Enter Instance Count [3]:

* Enter Instance Type [m5.xlarge]:

* Enter AWS Key Name for SSH [REQUIRED]: my-aws-key

* Enter AWS Subnet ID [REQUIRED]: subnet-xxxxxx

* Enter EMR Service Managed Security Group ID [REQUIRED]: sg-xxxxxxxxxxxx

* Enter EMR Master Node Managed Security Group ID [REQUIRED]: sg-yyyyyyyyyyy

* Enter EMR Worker Node Managed Security Group ID [REQUIRED]: sg-zzzzzzzzzzz

* Enter Immuta Archive Username [REQUIRED]: abjgksdthghjksgslkjaghsdfsj

* Enter Immuta Archive Password [REQUIRED]: gjw4a8906y423432r93hf3f03rhfqfq470ty3

* Enter Bootstrap Bucket Name. If the bucket does not exist, it will be created with default permissions [immuta-emr-bootstrap-<account id>-us-east-1]:

* Enter Data Bucket Name. If the bucket does not exist, it will be created with default permissions [immuta-emr-data-<account id>-us-east-1]:

* Enter Kerberos Admin Password [Default: <generated>]:

* Enter HDFS System Token [Default: <generated>]:

< Cluster creation begins>
...

Input Variables

The immuta-emr-quickstart.sh script prompts for input variables to configure the AWS resources required for the cluster. Each prompt corresponds to one of the environment variables listed below. Exporting these environment variables before running the script skips the prompts.

  • CLUSTER_NAME

    • Optional. The name of the EMR cluster to be created.
    • Default: immuta-quickstart.
  • EMR_VERSION

    • Optional. The EMR version of the cluster. Currently supported versions are 5.17.0 through 5.23.0.
    • Default: 5.23.0.
  • IMMUTA_VERSION

    • Optional. The full Immuta version to be installed on the cluster.
    • Default: 2023.4.12_20240507.
  • IMMUTA_INSTANCE_URL

    • Required. The URL of the Immuta instance that will drive policies on the cluster.
  • AWS_REGION

    • Optional. The AWS Region that the cluster will run in.
    • Default: us-east-1.
  • INSTANCE_COUNT

    • Optional. The number of instances (master + worker) in the cluster.
    • Default: 3.
  • INSTANCE_TYPE

    • Optional. The type of instance for cluster nodes.
    • Default: m5.xlarge.
  • AWS_KEY_NAME

    • Required. The name of the SSH keypair in AWS that will be used to connect to the cluster.
  • AWS_SUBNET_ID

    • Required. The ID string of the subnet that the cluster will run in.
  • SERVICE_SECURITY_GROUP

    • Required. The ID string of the security group for the cluster's EMR services.
  • MASTER_SECURITY_GROUP

    • Required. The ID string of the security group for the cluster's master node.
  • WORKER_SECURITY_GROUP

    • Required. The ID string of the security group for the cluster's worker nodes.
  • ARCHIVE_USERNAME

    • Required. The username for Immuta Archives.
  • ARCHIVE_PASSWORD

    • Required. The password for Immuta Archives.
  • BOOTSTRAP_BUCKET

    • Optional. The S3 bucket where bootstrap artifacts will be stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
    • Default: immuta-emr-bootstrap-$AWS_ACCOUNT_ID-$AWS_REGION.
  • DATA_BUCKET

    • Optional. The S3 bucket where partitioned data is stored. If the specified bucket does not exist, a new one will be created with default private ACLs.
    • Default: immuta-emr-data-$AWS_ACCOUNT_ID-$AWS_REGION.
  • KADMIN_PASSWORD

    • Optional. The Kerberos admin password that will be used to create Kerberos principals on the cluster's dedicated internal KDC.
    • Default: random.
  • HDFS_SYSTEM_TOKEN

    • Optional. The HDFS System Token that the cluster will use to securely communicate with the Immuta instance. You should generate this value in the Immuta Configuration UI before creating your cluster.

    • Default: random.
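As an illustration of exporting these variables ahead of time, the sketch below sets the required ones and checks that none is empty before the script is invoked. The values are placeholders in the style of the example run above, the archive credentials are made up, and the `missing_required` helper is hypothetical rather than part of the quickstart script. It assumes the script is in the current directory; the optional variables are left unset here, so how the script treats them (default or prompt) is up to the script.

```shell
#!/usr/bin/env bash
# Sketch: pre-set the required quickstart variables.
# Every value is a placeholder; substitute your own.

export IMMUTA_INSTANCE_URL="https://immuta.mycompany.com"
export AWS_KEY_NAME="my-aws-key"
export AWS_SUBNET_ID="subnet-xxxxxx"
export SERVICE_SECURITY_GROUP="sg-xxxxxxxxxxxx"
export MASTER_SECURITY_GROUP="sg-yyyyyyyyyyy"
export WORKER_SECURITY_GROUP="sg-zzzzzzzzzzz"
export ARCHIVE_USERNAME="my-archive-user"
export ARCHIVE_PASSWORD="my-archive-password"

# missing_required: print the name of each required variable that is unset or empty.
missing_required() {
  for req_name in IMMUTA_INSTANCE_URL AWS_KEY_NAME AWS_SUBNET_ID \
                  SERVICE_SECURITY_GROUP MASTER_SECURITY_GROUP WORKER_SECURITY_GROUP \
                  ARCHIVE_USERNAME ARCHIVE_PASSWORD; do
    # Indirect lookup of the variable named by $req_name.
    eval "req_value=\${$req_name:-}"
    [ -n "$req_value" ] || echo "$req_name"
  done
}

missing="$(missing_required)"
if [ -n "$missing" ]; then
  echo "Set these before running the quickstart:" $missing >&2
elif [ -x ./immuta-emr-quickstart.sh ]; then
  ./immuta-emr-quickstart.sh
else
  echo "Required variables are set; run ./immuta-emr-quickstart.sh from its download directory."
fi
```

Avoid committing real archive credentials to a file; reading them from a secret store or an interactive prompt is a safer choice than hard-coding them as shown here.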

Post-installation

Copy Kerberos Resources to Immuta Instance

You will need to copy the immuta.keytab and krb5.conf files from the cluster and upload them to your Immuta instance using the Immuta Configuration UI.

scp -i my-aws-key.pem hadoop@ip-x-x-x-x.ec2.internal:/etc/krb5.conf .
scp -i my-aws-key.pem hadoop@ip-x-x-x-x.ec2.internal:~/.keytabs/immuta.keytab .
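Before uploading, it can help to confirm that both copies completed. The following is a small sketch with a hypothetical helper, `check_kerberos_files`, which only verifies that the two files exist and are non-empty in a given directory; it does not validate their contents.

```shell
# check_kerberos_files DIR: succeed only if krb5.conf and immuta.keytab
# both exist in DIR and are non-empty.
check_kerberos_files() {
  kf_status=0
  for kf_name in krb5.conf immuta.keytab; do
    if [ ! -s "$1/$kf_name" ]; then
      echo "missing or empty: $1/$kf_name" >&2
      kf_status=1
    fi
  done
  return "$kf_status"
}

# Check the directory the scp commands copied into.
if check_kerberos_files .; then
  echo "Both files are ready to upload"
fi
```

If the MIT Kerberos client tools are installed locally, `klist -kt immuta.keytab` additionally lists the principals stored in the keytab.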

Associate Quickstart Principals with Immuta Users

The quickstart bootstrap automatically seeds the cluster with three user principals for you to use while familiarizing yourself with the Immuta platform and data policies: owner, consumer1, and consumer2. The default Kerberos password for these users is immuta-quickstart.

You can associate these users with your Immuta users by following this guide. Note that only the owner principal will have access to the data in your chosen S3 data bucket, so this is the principal that you should use to create your data sources in Immuta.