POV Data Setup Guide
We’ve provided data that will allow you to complete the walkthroughs. Of course, you would use your own data with Immuta in production, but since we are going to walk through very specific use cases, it’s easier to work off the same sheet of music, data-wise.
While this page is long, you will only need to worry about your specific data warehouses/compute.
1 - Download the Data
Databricks Workspaces
A Databricks workspace (your Databricks URL) can be configured to use traditional notebooks or SQL Analytics. Select one of these options from the menu in the top left corner of the Databricks console.
Select one of the tabs below to download the script to generate fake data for your specific warehouse.
Databricks Notebooks (Data Science and Engineering or Machine Learning Notebooks)
Download this Databricks Notebook.
Databricks SQL (SQL Workspace)
Download this SQL script.
Snowflake
Download this SQL script.
Redshift
Download this SQL script.
2 - Load the Data
This will get the data downloaded in the first step into your data warehouse.
Assumptions: Part 2 assumes you have a user with permission to create databases/schemas/tables in your warehouse/compute (and potentially write files to cloud storage).
Databricks Notebooks
Databricks and SQL Analytics Imports
If you’ve already done the import using SQL Analytics and SQL Analytics shares the same workspace with Databricks Notebooks, you will not have to do it again in Databricks because they share a metastore.
- Before importing and running the notebook, ensure you are either logged in as a Databricks admin or running it on a cluster that is NOT Immuta-enabled.
- Import the Notebook downloaded from step 1 into Databricks.
- Go to your workspace and click the down arrow next to your username.
- Select import.
- Import the file from step 1.
- Run all cells in the Notebook, which will create both tables (a quick spot check is sketched after this list).
- For simplicity, the data is being stored in DBFS; however, we do not recommend this in real deployments, and you should instead store your real data in your cloud-provided object store (S3, ADLS, Google Storage).
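If you’d like a quick sanity check that the load worked, something like the following can be run in a new cell (run each statement in its own %sql cell if needed). This is a minimal sketch that assumes the notebook created the immuta_pov database and the two tables referenced later in this guide; adjust the names if you edited the notebook.

```sql
-- Confirm the POV database and both tables exist.
SHOW TABLES IN immuta_pov;

-- Spot check row counts for the two tables the notebook creates.
SELECT COUNT(*) AS hr_rows FROM immuta_pov.immuta_fake_hr_data;
SELECT COUNT(*) AS transaction_rows FROM immuta_pov.immuta_fake_credit_card_transactions;
```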
Databricks SQL
Databricks and SQL Analytics Imports
Note: if you’ve already done the import using Databricks Notebooks and the notebooks share the same workspace with SQL Analytics, you will not have to do it again here because they share a metastore.
- Before importing and running the script, ensure you are logged in as a user who can create databases.
- Select SQL from the upper left menu in Databricks.
- Click Create → Query.
- Copy the contents of the SQL script you downloaded from step 1 and paste that script into the SQL area.
- Run the script (a quick way to confirm the result is sketched after this list).
- For simplicity, the data is being stored in DBFS; however, we do not recommend this in real deployments, and you should instead store your real data in your cloud-provided object store (S3, ADLS, Google Storage).
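To confirm what the script created (and to see the DBFS location mentioned above), you can run a quick check from the SQL editor. This sketch assumes the script created the immuta_pov database used throughout this guide.

```sql
-- Confirm both tables exist in the POV database.
SHOW TABLES IN immuta_pov;

-- Inspect one table's metadata; the Location field shows the DBFS path backing it.
DESCRIBE TABLE EXTENDED immuta_pov.immuta_fake_hr_data;
```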
Snowflake
- Open up a worksheet in Snowflake using a user that has CREATE DATABASE and CREATE SCHEMA permission. Alternatively, you can save the data in a pre-existing database or schema by editing the provided SQL script.
- To the right of your schema selection in the worksheet, click the ... menu to find the Load Script option.
- Load the script downloaded from step 1.
- Optional: Edit the database and schema names at the top of the script if desired.
- Check the All Queries checkbox next to the Run button.
- Ensure you have a warehouse selected, and then click the Run button to execute the script. (There should be 11 commands it plans to run.)
Both tables should be created and populated.
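If you’d like to verify the load, a quick check such as the one below works. The pov_data schema name here is an assumption borrowed from the other platforms; match it to whatever database and schema are set at the top of the script you ran.

```sql
-- List the tables the script created.
SHOW TABLES IN DATABASE immuta_pov;

-- Spot check row counts (adjust the database/schema if you edited the script).
SELECT COUNT(*) AS hr_rows FROM immuta_pov.pov_data.immuta_fake_hr_data;
SELECT COUNT(*) AS transaction_rows FROM immuta_pov.pov_data.immuta_fake_credit_card_transactions;
```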
Redshift
Redshift RA3 Instance Type
You must use a Redshift RA3 instance type because Immuta requires cross-database views, which are only supported in Redshift RA3 instance types.
Unfortunately, there is not a standard query editor for Redshift, so creating the POV tables in Redshift is going to be a bit less automated.
- Connect to your Redshift instance using your query editor of choice.
- Create a new database called immuta_pov using the command CREATE DATABASE immuta_pov; alternatively, you can connect to a pre-existing database and load the tables there.
- If you created a new database, disconnect from Redshift and reconnect to that database.
- Upload the script you downloaded from step 1 above. If your query editor does not support uploading a SQL script, you can simply open that file in a text editor to copy the commands and paste them in the editor.
- Run the script.
- Both tables should be created and populated.
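Because query editors vary, a portable way to confirm the load is to ask the information schema where the tables landed; the pov_data schema in the second query is an assumption, so substitute whatever schema the first query returns.

```sql
-- Find which schema the script created the POV tables in.
SELECT table_schema, table_name
FROM information_schema.tables
WHERE table_name LIKE 'immuta_fake%';

-- Spot check a row count (replace pov_data with the schema returned above if it differs).
SELECT COUNT(*) AS hr_rows FROM pov_data.immuta_fake_hr_data;
```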
Synapse
Synapse Analytics Dedicated SQL Pools
Immuta supports Synapse Analytics dedicated SQL pools only.
Creating the data in Synapse is potentially a four-step process. First you will need to upload the data to a storage account, then create a Synapse workspace (if you don’t already have one you plan to use), then create a dedicated SQL pool, and then point Synapse to that stored data.
1 - Upload the Data to an Azure Storage Account
- Log in to the Azure Portal.
- Select or create a storage account.
- If selecting an existing storage account and you already have a Synapse Workspace you plan to use, make sure the storage container(s) are attached to that Synapse workspace.
- The selected or created storage account MUST have Data Lake Storage Gen2 Hierarchical namespace enabled. Note this has to be enabled at creation time and cannot be changed after creation.
- The setting Enable hierarchical namespace is found under advanced settings when creating the storage account.
- Click Containers.
- Select or create a new container to store the data.
- Upload both files from step 1 to the container by clicking the upload button.
2 - Create a Synapse Workspace (you can skip this step if you already have one you plan to use)
- Go to Azure Synapse Analytics (still logged in to Azure Portal).
- Create a Synapse workspace.
- Select a resource group.
- Provide a workspace name.
- Select a region.
- For the account name, use the storage account from the steps above, remembering it MUST have Data Lake Storage Gen2 Hierarchical namespace enabled.
- Select the container you created in the above steps.
- Make sure Assign myself the Storage Blob Data Contributor role on the Data Lake Storage Gen2 account to interactively query it in the workspace is checked.
- Go to the security section.
- Enter your administrator username/password (save these credentials).
- Review and create.
- Once the Synapse workspace is created, a Workspace web URL (this links to Synapse Studio) should be available on the overview page; go there.
3 - Create a Dedicated Pool
With Synapse, a dedicated SQL pool is essentially a database, so we want to create one for this POV data.
- On the Azure portal home page click Azure Synapse Analytics.
- Click the Synapse workspace from the steps above and then click + New dedicated SQL pool (Immuta only works with Synapse dedicated pools).
- Next, enter immuta_pov as the name of your dedicated SQL pool.
- Choose an appropriate performance level. For testing purposes (other than performance testing), it is recommended to use the lowest performance level to avoid high costs if the instance is left running.
- Once that information is chosen, click Review + Create and then Create.
4 - Point Synapse to the Stored Data
- From Synapse Studio (this is the Workspace web URL you were given when the Synapse workspace was completed), click the Data menu on the left.
- Click on the Workspace tab.
- Expand Databases and you should see the dedicated pool you created above. Even once the dedicated pool has been deployed, it can take some time to appear in Synapse Studio; wait a bit and refresh the browser.
- Once the dedicated pool is there, hover over it, and click the Actions button.
- Select New SQL script.
- Select Empty script.
- Paste the contents of the script you downloaded in Part 1 into the script window.
- Run it.
- From that same Synapse Studio window, click the Integrate menu on the left.
- Click the + button to Add a new resource.
- Select Copy Data tool.
- Leave it as Built-in copy task with Run once now, and then click Next >.
- For Source type select Azure Data Lake Storage Gen 2.
- For connection, you should see your workspace; select it.
- For File or folder, click the browse button and select the container where you placed the data.
- Navigate into that container (double-click the folder if the file is inside one) and select the immuta_fake_hr_data_tsv.txt file.
- Uncheck recursive and click Next >.
- For File format, select Text format.
- For Column delimiter, leave the default, Tab (\t).
- For Row delimiter, leave the default, Line feed (\n).
- Leave First row is header checked.
- Click Next >.
- For Target type, select Azure Synapse dedicated SQL pool.
- For Connection, select the dedicated pool you created: immuta_pov.
- Under Azure Data Lake Storage Gen2 file, click Use existing table.
- Select the pov_data.immuta_fake_hr_data table.
- Click Next >.
- Leave all the defaults on the column mapping page, and then click Next >.
- On the settings page, name the task hr_data_import.
- Open the Advanced section.
- Uncheck Allow PolyBase.
- Uncheck the Edit button under Degree of copy parallelism.
- Click Next >.
- Review the Summary page and click Next >.
- This should run the task and load the data; you can click Finish when it completes.
If you’d like, you can test that it worked by opening a new SQL script from the Data menu and running: SELECT * FROM pov_data.immuta_fake_hr_data.
Repeat these steps for the immuta_fake_credit_card_transactions_tsv.txt file, loading it into the pov_data.immuta_fake_credit_card_transactions table.
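Once both copy tasks have finished, you can confirm the loads from a new SQL script in Synapse Studio with a simple row-count check against the two tables the script created:

```sql
-- Row counts for both POV tables in the dedicated pool.
SELECT COUNT(*) AS hr_rows FROM pov_data.immuta_fake_hr_data;
SELECT COUNT(*) AS transaction_rows FROM pov_data.immuta_fake_credit_card_transactions;
```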
Starburst (Trino)
Since Starburst (Trino) can connect to and query many different systems, it would be impossible for us to list instructions for every single one. To load these tables into Starburst (Trino), you should:
- Upload the data from Part 1 to whatever backs your Starburst (Trino) instance. If that is cloud object storage, this would mean loading the files downloaded from Part 1. If it’s a database, you may want to leverage some of the SQL scripts listed for the other databases in Part 1.
- Follow the appropriate guide from here.
- Create a database named immuta_pov for the schema/tables.
- Create a schema named pov_data for the tables.
- Load both tables into that schema:
- immuta_fake_hr_data
- immuta_fake_credit_card_transactions
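As a rough illustration only: if the staged data is already queryable through another catalog, the Trino SQL could look something like the sketch below. The hive and staging catalog names (and the staging schema/table names) are placeholders for your environment, not anything provided by this guide, and whether CREATE SCHEMA / CREATE TABLE AS is supported depends on your backing connector.

```sql
-- Hypothetical sketch: create the pov_data schema and copy staged data into the two POV tables.
-- "hive" and "staging" are placeholder catalog names; replace them with catalogs from your environment.
CREATE SCHEMA IF NOT EXISTS hive.pov_data;

CREATE TABLE hive.pov_data.immuta_fake_hr_data AS
SELECT * FROM staging.raw.immuta_fake_hr_data;

CREATE TABLE hive.pov_data.immuta_fake_credit_card_transactions AS
SELECT * FROM staging.raw.immuta_fake_credit_card_transactions;
```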
3 - Configure the Immuta Integration(s)
Assumptions: Part 3 assumes your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
- APPLICATION_ADMIN: in order to configure the integration
- USER_ADMIN: in order to create a non-admin user
User Accounts (admin, non-admin)
When the integration is configured, since it allows you to query the data directly from the data warehouse/compute, there needs to be a mapping between Immuta users and data warehouse/compute users. Typically in production this is accomplished through a shared identity manager like LDAP or Okta.
However, for the purposes of this POV, you may prefer to do a simple mapping instead, which we will describe. Taking this a step further, you need two different “levels” of users to really see the power of Immuta. For example, you want an “admin” user that has more permissions to create policies (and avoid policies) and a regular “non-admin” user to see the impacts of policies on queries - think of this as your regular downstream analyst.
It’s best to follow these rules of thumb:
- When running through the below native configurations, it’s best to use a system account from your data warehouse/compute to configure them (when we ask for credentials in the configuration steps, we are not talking about the actual Immuta login), although you can use your admin account.
- You need some kind of Immuta admin account for registering data, building policies, etc. In many cases this should be the user that initially stood up the Immuta instance, but it could be a different user as long as you give them all required permissions. This user should map to your data warehouse/compute admin user. We get more into segmentation of duties later in the walkthroughs.
- You need a non-admin user. This may be more difficult if you have an external identity / SSO system where you can’t log in as a different user to your data warehouse/compute. But if possible, you should create a second user for yourself with no special permissions in Immuta and map that user to a user you can log in as on your data warehouse/compute of choice.
Mapping the users
Understanding the rules of thumb above, you will need an admin and non-admin user in Immuta, and those users need to map to users in your data warehouse/computes of choice. Typically, if the users in both places are identified with email addresses, this all “just works” - they are already mapped. However, if they do not match, you can manually configure the mapping. For example, if you want to map steve@immuta.com to the plain old “steve” username in Synapse, you can do that by following the steps below. Again, this is not necessary if your users in Immuta match your users (spelling) in your data warehouse/compute (typically by email address).
- Log in to Immuta.
- Click the People icon and select Admin in the left sidebar.
- Click on the user of interest.
- Next to the username on the left, click the three dot menu.
- Here you will see options such as Change Databricks username, Change Snowflake username, etc.
- Select which data warehouse/compute username you want to map.
- Enter the data warehouse/compute username that maps to that Immuta user.
- Click Save.
Configuring the Integrations per data warehouse/compute
For Immuta to enforce controls, you must enable what we call integrations. This is done slightly differently for each database/warehouse/compute, and how it works is explained in more detail in our Query Your Data Guide. For now, let’s just get the integrations of interest configured.
Databricks
- Log in to Immuta.
- Click the App Settings icon in the left sidebar (the wrench).
- Under the Configuration menu on the left, click System API Key under HDFS.
- Click Generate Key.
- Click Save in the bottom left of the Configuration screen, and Confirm when prompted.
- Under the Configuration menu on the left, click Native Integrations.
- Click the + Add Native Integration button.
- Select Databricks Integration.
- Enter the Databricks hostname.
- For Immuta IAM, there should only be one option, bim. Select it. This is the built-in Immuta identity manager. It’s likely you would hook up a different identity manager in production, like Okta or LDAP, but this is sufficient for POV testing.
- Access Model: this one is a pretty big decision; read the descriptions for each, as eventually you will need to decide which mode to use. For the purposes of this POV guide, we assume the default: Protected until made available by policy.
- Select the appropriate Storage Access Type.
- Enter the required authentication information based on which Storage Access Type you select.
- No Additional Hadoop Configuration is required.
- Click Add Native Integration.
- This will pop up a message stating that your Databricks integration will not work properly until your cluster policies are configured, along with a button that allows you to select the cluster policies to deploy to your Databricks instance. We encourage you to read that table closely, including the detailed notes linked in it, to decide which cluster policies to use.
- Once you select the cluster policies you want deployed, you can either:
- Automatically Push Cluster Policies to the Databricks cluster if you provide your Databricks admin token (Immuta will not store it), or
- Manually Push Cluster Policies yourself without providing your Databricks admin token.
- Please also Download the Benchmarking Suite; you will use that later in the Databricks Performance Test walkthrough.
- If you choose to Manually Push Cluster Policies, you will also have to Download Init Script.
- Click Download Policies or Apply Policies, depending on which option you selected.
- Once adding the integration is successful, click Save in the bottom left of the Configuration screen (and Confirm when warned). This may take a little while to run.
If you took the manual approach, you must deploy those cluster policies manually in Databricks. You should configure Immuta-enabled cluster(s) using the deployed cluster policies.
Congratulations, you have successfully configured the Immuta integration with Databricks. To leverage it, you will need to use a cluster configured with one of the cluster policies created through the above steps.
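As a rough sanity check once you have also registered the data (Part 4) and opened it up with a subscription policy (Part 5), you could run a simple query from a notebook attached to an Immuta-enabled cluster. Exactly what you should see is covered in the Query Your Data Guide, so treat this only as a sketch.

```sql
-- From an Immuta-enabled cluster, Immuta enforces policy on queries like this
-- (run only after Parts 4 and 5 are complete):
SELECT * FROM immuta_pov.immuta_fake_hr_data LIMIT 10;
```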
Databricks SQL
- Log in to Immuta.
- Click the App Settings icon in the left sidebar (the wrench).
- Under the Configuration menu on the left, click Native Integrations.
- Click the + Add Native Integration button.
- Select Databricks SQL analytics.
- Enter the Databricks SQL analytics host.
- Enter the Databricks SQL analytics port.
- Enter the HTTP Path of the SQL Endpoint that will be used to execute DDL to create views. This is not to be confused with an HTTP Path to a regular Databricks cluster!
- Enter the Immuta Database: immuta_pov_secure. This is the database name where all the secure views Immuta creates will be stored. (You’ll learn more about this in the Query Your Data Guide.) That is why we named it immuta_pov_secure (since the original data is in the immuta_pov database), but, remember, it could contain data from multiple different databases if desired, so in production you likely want to name this database something more generic.
- Enter any additional required Connection String Options.
- Enter the Personal Access Token Immuta needs to connect to Databricks to create the integration database, configure the necessary procedures and functions, and maintain state between Databricks and Immuta. The Personal Access Token provided here should not have a set expiration and should be tied to an account with the privileges necessary to perform the operations listed above (e.g., an admin user).
- Make sure the SQL Endpoint is running (if it isn't, you may get a timeout waiting for it to start when you test the connection), and then click Test Databricks SQL Connection.
- Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
Congratulations, you have successfully configured the Immuta integration with Databricks SQL. Be aware, you can connect multiple Databricks SQL workspaces to Immuta.
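If you want to confirm the integration database was created, a quick check from the SQL editor is sketched below; the secure views themselves will only appear after you register the data in Part 4.

```sql
-- The integration database Immuta created; it will be empty until data sources are registered.
SHOW TABLES IN immuta_pov_secure;
```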
Snowflake
- Log in to Immuta.
- Click the App Settings icon in the left sidebar (the wrench).
- Under the Configuration menu on the left, click Native Integrations.
- Click the + Add Native Integration button.
- Select Snowflake.
- Enter the Snowflake host.
- Enter the Snowflake port.
- Enter the default warehouse. This is the warehouse Immuta uses to compute views, so it does not need to be very big; XS is fine.
- Enter the Immuta Database: IMMUTA_POV_SECURE. This is the database name where all the secure schemas and views Immuta creates will be stored (you’ll learn more about this in the Query Your Data Guide). That is why we named it IMMUTA_POV_SECURE (since the original data is in the IMMUTA_POV database), but, remember, it could contain data from multiple different databases if desired, so in production you likely want to name this database something more generic.
- For Additional Connection String Options, you may need to specify something here related to proxies depending on how your Snowflake is set up.
- You now need to decide if you want to do an automated installation or not. Immuta can automatically install the necessary procedures, functions, and system accounts into your Snowflake account if you provide privileged credentials (described in the next step). These credentials will not be stored or saved by Immuta. However, if you do not feel comfortable providing these credentials, you can manually run the provided bootstrap script.
- Select Automatic or Manual depending on your decision above.
- Automatic:
- Enter the username (when performing an automated installation, the credentials provided must have the ability to both CREATE databases and CREATE, GRANT, REVOKE, and DELETE roles).
- Enter the password.
- You can use a key pair if required.
- For role, considering this user must be able to both CREATE databases and CREATE, GRANT, REVOKE, and DELETE roles, make sure you enter the appropriate role.
- Click Test Snowflake Connection.
- Manual:
- Download the bootstrap script.
- Enter a NEW user; this is the account that will be created, and the bootstrap script will populate it with the appropriate permissions.
- Please feel free to inspect the bootstrap script for more details.
- Enter a password for that NEW user.
- You can use a key pair if required.
- Click Test Snowflake Connection.
- Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
- Run the bootstrap script in Snowflake.
Congratulations, you have successfully configured the Immuta integration with Snowflake. Be aware, you can connect multiple Snowflake instances to Immuta.
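To confirm the integration database was created, you can run a quick check in a Snowflake worksheet; secure schemas and views will only appear after you register data in Part 4.

```sql
-- The integration database Immuta created during configuration.
SHOW SCHEMAS IN DATABASE immuta_pov_secure;
```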
Redshift
Redshift RA3 Instance Type
You must use a Redshift RA3 instance type because Immuta requires cross-database views, which are only supported in Redshift RA3 instance types.
- Log in to Immuta.
- Click the App Settings icon in the left sidebar (the wrench).
- Under the Configuration menu on the left, click Native Integrations.
- Click the + Add Native Integration button and select Redshift.
- Enter the Redshift host.
- Enter the Redshift port.
- Enter the Immuta Database: immuta_pov_secure. This is the database name where all the secure schemas and views Immuta creates will be stored (you’ll learn more about this in the Query Your Data Guide). That is why we named it immuta_pov_secure (since the original data is in the immuta_pov database), but, remember, it could contain data from multiple different databases if desired, so in production you likely want to name this database something more generic.
- You now need to decide if you want to do an automated install or not. Immuta can automatically install the necessary procedures, functions, and system accounts into your Redshift account if you provide privileged credentials. These credentials will not be stored or saved by Immuta. However, if you do not feel comfortable providing these credentials, you can manually run the provided bootstrap script. Please ensure you enter the username and password that were set in the bootstrap script.
- Select Automatic or Manual depending on your decision above.
- Automatic:
- Enter the initial database. This should be a database that already exists; it doesn’t really matter which. Immuta simply needs this because you must include a database when connecting to Redshift.
- Enter the username (this must be a user that can create databases, users, and modify grants).
- Enter the password.
- Manual:
- Download the bootstrap script.
- Enter a NEW user; this is the account that will be created, and the bootstrap script will populate it with the appropriate permissions.
- Please feel free to inspect the bootstrap script for more details.
- Enter a password for that NEW user.
- Click Test Redshift Connection.
- Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
- Run the bootstrap scripts in Redshift.
Congratulations, you have successfully configured the Immuta integration with Redshift. Be aware, you can connect multiple Redshift instances to Immuta.
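To confirm the integration database exists, a quick catalog query works; secure schemas and views appear only after you register data in Part 4.

```sql
-- Confirm the integration database Immuta uses was created.
SELECT datname FROM pg_database WHERE datname = 'immuta_pov_secure';
```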
Synapse
- Log in to Immuta.
- Click the App Settings icon in the left sidebar (the wrench).
- Under the Configuration menu on the left, click Native Integrations.
- Click the + Add Native Integration button and select Azure Synapse Analytics.
- Enter the Synapse Analytics host (this should come from the SQL dedicated pool).
- Enter the Synapse Analytics port.
- Enter the Immuta Database. This should be a database that already exists; this is where Immuta will create the schemas that contain the secure views that will be generated. In our case, that should be immuta_pov.
- Enter the Immuta Schema: pov_data_secure.
- This is the schema name where all the secure views Immuta creates will be stored (you’ll learn more about this in the Query Your Data Guide). That is why we named it pov_data_secure (since the original data is in the pov_data schema), but, remember, it could contain data from multiple different schemas if desired, so in production you likely want to name this schema something more generic.
- Add any additional connection string options.
- Since Synapse does not support array/json primitives, Immuta must store user attribute information using a delimiter. If you expect any of these in user profiles, please update them accordingly. (It’s likely you don’t.)
- You now need to decide if you want to do an automated installation or not. Immuta can automatically install the necessary procedures, functions, and system accounts into your Azure Synapse Analytics account if you provide privileged credentials. These credentials will not be stored or saved by Immuta. However, if you do not feel comfortable providing these credentials, you can manually run the two provided bootstrap scripts.
- Select Automatic or Manual depending on your decision above.
- Automatic:
- Enter the username. (We recommend using the system account you created associated to the workspace.)
- Enter the password.
- Manual:
- Download both bootstrap scripts.
- Enter a NEW user; this is the account that will be created, and then the bootstrap script will populate it with the appropriate permissions.
- Please feel free to inspect the bootstrap scripts for more details.
- Enter a password for that NEW user.
- Click Test Azure Synapse Analytics Connection.
- Once the connection is successful, click Save in the bottom left of the Configuration screen. This may take a little while to run.
- Run the bootstrap scripts in Synapse Analytics.
Congratulations, you have successfully configured the Immuta integration with Azure Synapse Analytics. Be aware, you can connect multiple Azure Synapse Analytics instances to Immuta.
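To confirm the secure schema exists in your dedicated pool, you can run a quick check from Synapse Studio; the secure views appear after you register data in Part 4.

```sql
-- Confirm the schema Immuta uses for secure views exists in the immuta_pov dedicated pool.
SELECT name FROM sys.schemas WHERE name = 'pov_data_secure';
```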
Starburst (Trino)
- Log in to Immuta.
- Click the App Settings icon in the left sidebar (the wrench).
- Under the Configuration menu on the left, click Native Integrations.
- Click the + Add Native Integration button and select Trino.
- To connect a Trino cluster to Immuta, you must install the Immuta Plugin and register an Immuta Catalog. For Starburst clusters, the plugin comes installed, so you just need to register an Immuta Catalog. The catalog configuration needed to connect to this instance of Immuta is displayed in this section of the App Settings page. Copy that information to configure the plugin.
- Should you want Starburst (Trino) queries to be audited, you must configure an Immuta Audit Event Listener. The event listener configuration needed is displayed below. The catalog name should match the name of the catalog associated with the Immuta connector configuration displayed above. Copy that information to configure the Audit Event Listener.
- Click Save in the bottom left of the Configuration screen. This may take a little while to run.
- Go to Starburst (Trino) and configure the plugin and audit listener.
Congratulations, you have successfully configured the Immuta integration with Starburst (Trino). Be aware, you can connect multiple Starburst (Trino) instances to Immuta.
4 - Register the Data with Immuta
Assumptions: Part 4 assumes your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta installation):
- CREATE_DATA_SOURCE: in order to register the data with Immuta
- GOVERNANCE: in order to create a custom tag to tag the data tables with
These steps are captured in our first walkthrough under the Scalability & Evolvability theme: Schema monitoring and automatic sensitive data discovery. Please do that walkthrough to register the data to complete your data setup. Make sure you come back here to complete Part 5 below after doing this!
5 - Create a Subscription Policy to Open the Data to Everyone
Assumptions: Part 5 assumes your user has the following permissions in Immuta (note you should have these by default if you were the initial user on the Immuta install):
- GOVERNANCE: in order to build policy against any table in Immuta OR
- “Data Owner” of the registered tables from Part 4 without GOVERNANCE permission. (You likely are the Data Owner and have GOVERNANCE permission.)
Only do this part if you created the non-admin user in Part 3 (it is highly recommended you do that). If you did, you must give them access to the data as well. Immuta has what are called subscription policies; these are what control access to tables - you may think of them as table GRANTs.
To get things going, let’s simply open those tables you created to anyone:
- Click the Policy icon in the left sidebar of the Immuta console.
- Click the Subscription Policies tab at the top.
- Click + Add Subscription Policy.
- Name the policy Open Up POV Data.
- For How should this policy grant access? select Allow Anyone.
- For Where should this policy be applied?, select On Data Sources.
- Select tagged for the circumstance (make sure you pick “tagged” and not “with columns tagged”).
- Type in Immuta POV for the tag. (Remember, this was the tag you created in the Schema Monitoring and Automatic Sensitive Data Discovery walkthrough under Part 4 above). Note that if you are a Data Owner of the tables without GOVERNANCE permission the policy will be automatically limited to the tables you own.
- Click Create Policy → Activate Policy.
That will allow anyone access to those tables you created. We’ll come back to subscription policies later to learn more.
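If you created the non-admin user, a simple way to confirm the policy took effect is to log in to your data warehouse/compute as that user and query the POV data. Exactly which database/schema you query differs per platform (the Query Your Data Guide covers this), so the Synapse-flavored example below is only an illustration and assumes the secure views carry the source table names.

```sql
-- As the non-admin user in Synapse (for example), query the secure view for the HR table.
SELECT TOP 10 * FROM pov_data_secure.immuta_fake_hr_data;
```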
Next Steps
Return to the POV Guide to move on to your next topic.