Dataproc cooperative multi-tenancy

Data analysts run their BI workloads on Dataproc to generate dashboards, reports, and insights. Because analysts from many different teams analyze data on shared infrastructure, Dataproc workloads need multi-tenancy. Today, workloads from all users on a cluster run as a single service account, so every workload has the same data access. Dataproc Cooperative Multi-tenancy enables multiple users with distinct data access to run workloads on the same cluster. A Dataproc cluster normally runs workloads as the cluster service account; creating a cluster with Cooperative Multi-tenancy enabled lets you isolate user identities when running jobs that access Cloud Storage resources. The mapping of Cloud IAM users to service accounts is specified at cluster creation time, and many service accounts can be configured for a given cluster. As a result, interactions with Cloud Storage are authenticated as the service account mapped to the user who submits the job, instead of as the cluster service account.

Considerations

Dataproc Cooperative Multi-tenancy has the following considerations:

- Set up the mapping of Cloud IAM users to service accounts with the dataproc:dataproc.cooperative.multi-tenancy.user.mapping cluster property. When a user submits a job to the cluster, the VM service account impersonates the service account mapped to that user and interacts with Cloud Storage as that service account, through the GCS connector.
- Requires GCS connector version 2.1.4 or later.
- Does not support clusters with Kerberos enabled.
- Intended for jobs submitted through the Dataproc Jobs API only.

Objectives

This blog demonstrates the following:

- Create a Dataproc cluster with Dataproc Cooperative Multi-tenancy enabled.
- Submit jobs to the cluster as different user identities and observe the different access rules applied when interacting with Cloud Storage.
- Verify through Stackdriver logging that interactions with Cloud Storage are authenticated as different service accounts.

Before You Begin

Create a Project

1. In the Cloud Console, on the project selector page, select or create a Cloud project.
2. Make sure that billing is enabled for your Google Cloud project. Learn how to confirm billing is enabled for your project.
3. Enable the Dataproc API.
4. Enable the Stackdriver Logging API.
5. Install and initialize the Cloud SDK.

Simulate a Second User

Normally the second tenant would simply be another user, but you can also simulate a second user with a separate service account. Since you are going to submit jobs to the cluster as different users, you can activate a service account in your gcloud settings to play the role of the second user.

First, get the currently active account in gcloud. In most cases this is your personal account:

FIRST_USER=$(gcloud auth list --filter=status:ACTIVE --format="value(account)")

Then:

1. Create a service account.
2. Grant the service account the permissions it needs to submit jobs to a Dataproc cluster.
3. Create a key for the service account and use the key to activate it in gcloud. You can delete the key file after the service account is activated.

Now if you run the following command:

gcloud auth list --filter=status:ACTIVE --format="value(account)"

you will see that the service account is the active account. A minimal sketch of these steps is shown below.
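The following sketch illustrates one way to perform those steps. The project variable, the service account name second-user, and the Dataproc Editor role are placeholder assumptions for illustration, not the exact choices from the original post:

# Create a service account to act as the second user (name is a placeholder).
gcloud iam service-accounts create second-user --project=${PROJECT_ID}
SECOND_USER=second-user@${PROJECT_ID}.iam.gserviceaccount.com

# Allow it to submit jobs to Dataproc clusters in the project.
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member=serviceAccount:${SECOND_USER} \
  --role=roles/dataproc.editor

# Create a key, activate the service account in gcloud, then delete the key file.
gcloud iam service-accounts keys create /tmp/second-user-key.json --iam-account=${SECOND_USER}
gcloud auth activate-service-account ${SECOND_USER} --key-file=/tmp/second-user-key.json
rm /tmp/second-user-key.json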
To proceed with the examples below, switch back to your original active account:

gcloud config set account ${FIRST_USER}

Configure the Service Accounts

Create three additional service accounts: one to be the Dataproc VM service account, and the other two to be the service accounts mapped to users (user service accounts). Note: we recommend using a per-cluster VM service account and only allowing it to impersonate the user service accounts you intend to use on that specific cluster.

Grant the iam.serviceAccountTokenCreator role to the VM service account on the two user service accounts so it can impersonate them, and grant the dataproc.worker role to the VM service account so it can perform the necessary operations on the cluster VMs.

Create Cloud Storage Resources and Configure Service Accounts

Create a bucket and write a simple file to it:

echo "This is a simple file" | gsutil cp - gs://${BUCKET}/file

Grant only the first user service account, USER_SA_ALLOW, admin access to the bucket:

gsutil iam ch serviceAccount:${USER_SA_ALLOW}:admin gs://${BUCKET}

Create a Cluster and Configure Service Accounts

In this example, we map the user FIRST_USER (a personal account) to the service account with GCS admin permissions, and the user SECOND_USER (simulated with a service account) to the service account without GCS access.

Note that cooperative multi-tenancy is only available in GCS connector version 2.1.4 and later. The connector is pre-installed on Dataproc image version 1.5.11 and up, but you can use the connectors initialization action to install a specific version of the GCS connector on older Dataproc images.

The VM service account needs to call the generateAccessToken API to fetch access tokens for the job service accounts, so make sure your cluster has the right scopes. The example below uses the cloud-platform scope; a sketch of the cluster creation command appears after the following notes.

1. The user service accounts might need access to the config bucket associated with the cluster in order to run jobs, so make sure you grant them access.
2. On Dataproc clusters with 1.5+ images, Spark and MapReduce history files are written by default to the temp bucket associated with the cluster, so you might also want to grant the user service accounts access to that bucket.
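A minimal sketch of that cluster creation command, assuming placeholder values for CLUSTER_NAME, REGION, VM_SA, and a second user service account USER_SA_DENY (the exact image version and flags in the original post may differ). The mapping property takes a comma-separated list of user:service-account pairs:

gcloud dataproc clusters create ${CLUSTER_NAME} \
  --region=${REGION} \
  --image-version=1.5 \
  --service-account=${VM_SA} \
  --scopes=cloud-platform \
  --properties="dataproc:dataproc.cooperative.multi-tenancy.user.mapping=${FIRST_USER}:${USER_SA_ALLOW},${SECOND_USER}:${USER_SA_DENY}"

With this mapping in place, jobs submitted by FIRST_USER access Cloud Storage as USER_SA_ALLOW, and jobs submitted by SECOND_USER access it as USER_SA_DENY.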
Run Example Jobs

Run a Spark job as FIRST_USER. Since the mapped service account has access to the GCS file gs://${BUCKET}/file, the job succeeds. Now run the same job as SECOND_USER. Since the mapped service account has no access to gs://${BUCKET}/file, the job fails, and the driver output shows that the failure is a permission issue: the service account used does not have storage.objects.get access to the GCS file.

Similarly for a Hive job (creating an external table in GCS, inserting records, then reading them back): when it is run as FIRST_USER, it succeeds because the mapped service account has access to the bucket ${BUCKET}. However, when the table employee is queried as SECOND_USER, the job uses the second user service account, which has no access to the bucket, and the job fails. (A reference sketch of these job submissions appears at the end of this post.)

Verify Service Account Authentication with Cloud Storage Through Stackdriver Logging

First, check the usage of the first user service account, which has access to the bucket. Make sure the gcloud active account is your personal account:

gcloud config set account ${FIRST_USER}

Find logs about access to the bucket using the service account with GCS permissions:

gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${USER_SA_ALLOW}\""

The returned entries show that permission was always granted. Running the same query with the second user service account's email instead shows that access was never granted. Finally, you can verify that the VM service account was never used directly to access the bucket; the following command returns 0 log entries:

gcloud logging read "resource.type=\"gcs_bucket\" AND resource.labels.bucket_name=\"${BUCKET}\" AND protoPayload.authenticationInfo.principalEmail=\"${VM_SA}\""

Cleanup

Delete the cluster:

gcloud dataproc clusters delete ${CLUSTER_NAME} --region ${REGION} --quiet

Delete the bucket:

gsutil rm -r gs://${BUCKET}

Deactivate the service account used to simulate a second user:

gcloud auth revoke ${SECOND_USER}

Then delete the service accounts.

Notes

The cooperative multi-tenancy feature does not yet work on clusters with Kerberos enabled. Jobs submitted by users without service accounts mapped to them fall back to using the VM service account when accessing GCS resources. However, you can set the `core:fs.gs.auth.impersonation.service.account` property to change the fallback service account; the VM service account must be able to call `generateAccessToken` to fetch access tokens for this fallback service account as well.

This blog demonstrates how you can use Dataproc Cooperative Multi-tenancy to share Dataproc clusters across multiple users.
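For reference, the job submissions described in Run Example Jobs might look like the following sketch. The example class, jar path, and Hive statements are illustrative assumptions, not the exact commands from the original post; the global --account flag controls which gcloud identity submits each job:

# Spark word count over the GCS file, submitted as the first user (succeeds).
gcloud dataproc jobs submit spark \
  --cluster=${CLUSTER_NAME} --region=${REGION} \
  --account=${FIRST_USER} \
  --class=org.apache.spark.examples.JavaWordCount \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- gs://${BUCKET}/file

# The same job submitted as the simulated second user fails with a
# storage.objects.get permission error on gs://${BUCKET}/file.
gcloud dataproc jobs submit spark \
  --cluster=${CLUSTER_NAME} --region=${REGION} \
  --account=${SECOND_USER} \
  --class=org.apache.spark.examples.JavaWordCount \
  --jars=file:///usr/lib/spark/examples/jars/spark-examples.jar \
  -- gs://${BUCKET}/file

# A Hive job that creates an external table on the bucket, inserts a record,
# and reads it back behaves the same way depending on the submitting user.
gcloud dataproc jobs submit hive \
  --cluster=${CLUSTER_NAME} --region=${REGION} \
  --account=${FIRST_USER} \
  --execute="CREATE EXTERNAL TABLE employee (id INT, name STRING) LOCATION 'gs://${BUCKET}/employee'; INSERT INTO employee VALUES (1, 'alice'); SELECT * FROM employee;"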
Source: Google Cloud Platform
