Protecting Healthcare data with DLP: A guide for getting started

Protecting patient data is of paramount importance to any healthcare provider. This is not only because many laws and regulations around the world require this data to be safeguarded; it is a foundational requirement for trust between a provider and their patient. This does create some tension, though: to give patients the best care possible, a significant amount of Protected Health Information (PHI) is shared with providers, who in turn share it with other providers, insurers, labs, and more. While sharing data can lead to better quality of care, it also introduces more risk to patient privacy.

With many healthcare providers and insurers leveraging cloud technologies, the concerns around protecting PHI evolve. The data types encountered in the healthcare ecosystem are diverse and complex, which means there are many different systems and formats to protect. This is where Cloud Data Loss Prevention (DLP) comes into play. Cloud DLP can help identify PHI/PII and obfuscate it, allowing that data to be used while adding an additional layer of protection for patient privacy. This layer complements traditional security measures like access control, encryption at rest, and encryption in transit by changing or masking the data itself, helping you attain a deeper level of “least privilege” access, or data minimization.

At Google Cloud we help many healthcare providers build and deploy globally scaled data infrastructure. These providers use Cloud Data Loss Prevention to discover PHI and protect it. In this series of articles, our goal is to discuss the various data formats (e.g. structured data, CSV) and systems in Google Cloud that may handle PHI, and how the Google DLP API can be used across all of them to protect sensitive data.

What is DLP?

Cloud DLP helps customers inspect and mask sensitive data with techniques like redaction, bucketing, date-shifting, and tokenization, which help strike the balance between risk and utility. This is especially crucial when dealing with unstructured or free-text workloads, in which it can be challenging to know what data to redact. Google Cloud DLP provides many system- and storage-agnostic capabilities that enable it to be used in virtually any workload, migration, or real-time application. In 2021, Forrester Research named Google Cloud a Leader in The Forrester Wave™: Unstructured Data Security Platforms, Q2 2021 report, and rated Google Cloud highest in the current offering category among the providers evaluated. Additionally, Google Cloud received the highest possible score in the Obfuscation criterion, a technique that can help protect sensitive data, like personally identifiable information (PII).
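To make this concrete, here is a minimal sketch (not from the original post) of redacting PHI from a snippet of free text with the Cloud DLP Python client. The project ID, infotype selection, and sample text are illustrative assumptions.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

# A free-text note containing PHI.
item = {"value": "Pt Jane Doe (DOB 03/04/1977) called about her lab results."}

# Which infotypes to look for, and how to transform matches:
# replace each match with its infotype name.
inspect_config = {"info_types": [{"name": "PERSON_NAME"}, {"name": "DATE_OF_BIRTH"}]}
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

response = dlp.deidentify_content(
    request={
        "parent": parent,
        "inspect_config": inspect_config,
        "deidentify_config": deidentify_config,
        "item": item,
    }
)
print(response.item.value)
# e.g. "Pt [PERSON_NAME] (DOB [DATE_OF_BIRTH]) called about her lab results."
```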
While there are many great articles describing Google Cloud DLP and its components, for the purposes of this article we will focus on the key features of Cloud DLP that healthcare providers leverage, including:

- Data discovery and classification
- Data masking and obfuscation
- Measuring re-identification risk

Types of Data Healthcare Providers Manage

While all healthcare providers have data needs that are unique to their organization, we generally see healthcare data falling into two buckets:

- Text-based data: typically seen as CSVs, flat files, and transactional database entries.
- HL7/FHIR/DICOM data: typically received from EMR and other systems that follow interoperability standards in healthcare. HL7/FHIR data is usually formed as a JSON or XML object; DICOM, the standard for medical imaging, stores image files that often have text embedded in them.

These common healthcare data structures leverage various data sources and sinks to ingest, store, and analyze data. The following services are the ones we typically see leveraged for healthcare data:

- Google Cloud Storage
- Google BigQuery
- Google Cloud SQL
- Google Cloud Pub/Sub
- Google Cloud Dataflow

One Google Cloud service deserves a special callout here due to its impact on managing and leveraging healthcare data: the Google Healthcare API. Check out the overview videos and documentation for more complete information on the Google Healthcare API. For the purposes of these articles we will focus on a few of its key capabilities, including:

- Ingestion of HL7/FHIR data
- Ingestion of DICOM data
- Leveraging DLP through the built-in features of the Healthcare API

Getting Started with DLP for Healthcare Data

There are a few key steps required to begin leveraging the DLP API, which we will walk through. We will begin with a simple use case that scans a Google Cloud Storage bucket for the information we define and replaces that information with the name of the information type. This is a good method for periodically scanning a common data store for PHI, for example in development environments.

Inspecting Data

The first step is to create a template to instruct the DLP API on what data you need to find. To do so you will build an inspect template (example shown below). Inspect templates offer many built-in infotypes (over 150) that allow users to discover common data elements that require redaction, like names, Social Security numbers, and so on. They also allow custom infotypes to be built in case you need to extend past the built-in detectors. Knowing where sensitive data exists is a critical step to protecting it. Using Cloud DLP to help discover, inspect, and classify data can help you understand how to best protect and secure your data. This inspection can be integrated into workflows to proactively detect and prevent data loss, or it can be used for ongoing inspections, which can generate security notifications/alerts when data is found in areas where it’s not expected. For the purposes of this article, we will show how to build an inspect template similar to the one that has been integrated into our Cloud Healthcare API for the de-identification of FHIR data, with a couple of additional infotypes we often see used by our customers. There are many other built-in infotypes that are useful for healthcare, like ICD10 codes.
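A minimal sketch of such an inspect template, using the Python client library, might look like the following. The project ID, the infotype selection, and the custom "C_MRN" regex are illustrative assumptions, not the exact list used by the Cloud Healthcare API.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

inspect_config = {
    # Illustrative subset of built-in infotypes; a production template would
    # likely include more.
    "info_types": [
        {"name": "PERSON_NAME"},
        {"name": "DATE_OF_BIRTH"},
        {"name": "PHONE_NUMBER"},
        {"name": "EMAIL_ADDRESS"},
        {"name": "US_SOCIAL_SECURITY_NUMBER"},
        {"name": "ICD10_CODE"},
    ],
    # Hypothetical custom infotype: match an internal medical record number
    # format with a regex.
    "custom_info_types": [
        {
            "info_type": {"name": "C_MRN"},
            "regex": {"pattern": r"MRN-\d{8}"},
            "likelihood": dlp_v2.Likelihood.LIKELY,
        }
    ],
    "min_likelihood": dlp_v2.Likelihood.POSSIBLE,
}

template = dlp.create_inspect_template(
    request={
        "parent": parent,
        "inspect_template": {
            "display_name": "phi-inspect-template",
            "inspect_config": inspect_config,
        },
    }
)
print(template.name)
```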
We also added a custom infotype to show how any regex can be used to match data based on unique needs.

De-Identifying Data

The next step, de-identification, tells the DLP API what to do once it finds information based on the inspect template that you built. The de-identify template can do many things, such as:

- Tokenization with secure one-way hashing
- Tokenization with two-way deterministic encryption or format-preserving encryption
- Date shifting
- Data masking
- Bucketing or generalization
- Combinations and variations of the above

For the purposes of this article we created a de-identify template (shown below) that replaces data matching the inspection configuration with the name of the infotype detected (e.g. [DATE_OF_BIRTH]). Many more options for what the de-identify template can do are covered in the documentation.
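A minimal sketch of such a de-identify template, using the Python client and the replace-with-infotype transformation (project ID and display name are placeholders):

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

# Replace every finding with its infotype name,
# e.g. "03/04/1977" -> "[DATE_OF_BIRTH]".
deidentify_config = {
    "info_type_transformations": {
        "transformations": [
            {"primitive_transformation": {"replace_with_info_type_config": {}}}
        ]
    }
}

template = dlp.create_deidentify_template(
    request={
        "parent": parent,
        "deidentify_template": {
            "display_name": "phi-redact-template",
            "deidentify_config": deidentify_config,
        },
    }
)
print(template.name)
```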
Scheduled Inspection Jobs

Now that you know what you are scanning for and the actions you want to take when sensitive data is discovered, you can schedule a scan. This process is well documented, but here are a few key callouts before we proceed:

- When configuring a scan you must know the GCS bucket (or BigQuery dataset) that you want to scan.
- When configuring your scan, consider reducing the sampling rate (the percentage of data scanned) and only scanning data changed since the last scan to reduce costs.
- While configuring the scan you will be given several options for notifications and outputs of DLP scans, called Actions.

These actions have various capabilities for analysis and notifications. Detailed descriptions of the options are listed below, followed by a code sketch of a scheduled scan:

- Publish to BigQuery: publish all results of DLP scans to a BigQuery dataset for future analysis.
- Publish to Pub/Sub: create messages in a selected Pub/Sub topic about the outcome of DLP scans. This is a great option if you want to have other applications, like a SIEM, consume the results.
- Publish to Security Command Center: results can be published to Security Command Center for review by security teams.
- Publish to Data Catalog: if you leverage Data Catalog in your environment to manage and understand your data in Google Cloud, you can add your scan result data to your catalog.
- Notify by email: send an email to project owners and editors when the job completes.
- Publish to Cloud Monitoring: send inspection results to Cloud Monitoring in Google Cloud’s Operations suite.
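To tie these pieces together, here is a minimal sketch of a scheduled job trigger that scans a Cloud Storage bucket daily with the inspect template from earlier and attaches two of the actions above. The project, bucket, template, dataset, and topic names are illustrative assumptions.

```python
from google.cloud import dlp_v2

dlp = dlp_v2.DlpServiceClient()
parent = "projects/my-project/locations/global"  # placeholder project

job_trigger = {
    "inspect_job": {
        # Reference the inspect template created earlier (placeholder name).
        "inspect_template_name": f"{parent}/inspectTemplates/phi-inspect-template",
        "storage_config": {
            "cloud_storage_options": {
                "file_set": {"url": "gs://my-dev-bucket/**"},  # bucket to scan
                "files_limit_percent": 10,  # sample a subset of files to control cost
            },
            # Only scan data changed since the previous run.
            "timespan_config": {"enable_auto_population_of_timespan_config": True},
        },
        "actions": [
            # Publish to BigQuery: save findings for later analysis.
            {
                "save_findings": {
                    "output_config": {
                        "table": {
                            "project_id": "my-project",
                            "dataset_id": "dlp_results",
                            "table_id": "gcs_findings",
                        }
                    }
                }
            },
            # Publish to Pub/Sub: notify downstream consumers such as a SIEM.
            {"pub_sub": {"topic": "projects/my-project/topics/dlp-scan-results"}},
        ],
    },
    # Run once a day.
    "triggers": [{"schedule": {"recurrence_period_duration": {"seconds": 86400}}}],
    "status": dlp_v2.JobTrigger.Status.HEALTHY,
}

trigger = dlp.create_job_trigger(request={"parent": parent, "job_trigger": job_trigger})
print(trigger.name)
```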
See Results

In the console, navigate to Data Loss Prevention and select the Inspection tab to see your inspection jobs. If you select any job ID you will see the job details, including findings, bytes scanned, errors, and a result listing that shows which infotypes were discovered. As noted above, you can send your results to many locations via DLP Actions. Thanks to that flexibility, there are many options for building bespoke solutions on your current BI tools to analyze scan results. If you don’t have that tooling available, a great solution could be to send your results to BigQuery and use a Data Studio dashboard to analyze them.

Next Steps

In this article we started down the path of leveraging DLP to protect PHI in healthcare environments. Now that we have the basic building blocks set up, we want to start using them. In the rest of this blog post series we will discuss:

- DLP use cases for the different data stores commonly used in healthcare
- DLP in the Healthcare API
- Alternate de-identification methods
- Viewing and managing scanning results

Google Cloud DLP is built for the modern technology landscape. By utilizing the steps above, you can create a secure foundation for protecting patient data. The Google Cloud team is here to help. To learn more about getting started on DLP or general best practices to manage risk, reach out to your Technical Account Manager or contact a Google Cloud account team.
Quelle: Google Cloud Platform

New Cloud Functions min instances reduces serverless cold starts

Cloud Functions, Google Cloud’s Function as a Service (FaaS) offering, is a lightweight compute platform for creating single-purpose, standalone functions that respond to events, without needing an administrator to manage a server or runtime environment. Over the past year we have shipped many important new capabilities on Cloud Functions: new runtimes (Java, .NET, Ruby, PHP), new regions (now up to 22), an enhanced user and developer experience, fine-grained security, and cost and scaling controls. But as we continue to expand the capabilities of Cloud Functions, the number-one friction point of FaaS remains the “startup tax,” a.k.a. cold starts: if your function has been scaled down to zero, it can take a few seconds for it to initialize and start serving requests.

Today, we’re excited to announce minimum (“min”) instances for Cloud Functions. By specifying a minimum number of instances of your application to keep online during periods of low demand, this new feature can dramatically improve performance for your serverless applications and workflows, minimizing your cold starts.

Min instances in action

Let’s take a deeper look at min instances with a popular, real-world use case: recording, transforming, and serving a podcast. When you record a podcast, you need to get the audio into the right format (mp3, wav) and then make the podcast accessible so that users can easily access, download, and listen to it. It’s also important to make your podcast accessible to the widest audience possible, including those who have trouble hearing and those who would prefer to read the transcript. In this post, we show a demo application that takes a recorded podcast, transcribes the audio, stores the text in Cloud Storage, and then emails an end user with a link to the transcribed file, both with and without min instances.

Approach 1: Building the application with Cloud Functions and Cloud Workflows

In this approach, we use Cloud Functions and Google Cloud Workflows to chain together three individual cloud functions. The first function (transcribe) transcribes the podcast, the second function (store-transcription) consumes the result of the first function in the workflow and stores it in Cloud Storage, and the third function (send-email) is triggered by Cloud Storage when the transcribed result is stored and sends an email to the user to inform them that the workflow is complete.

Fig 1. Transcribe Podcast Serverless Workflow

Cloud Workflows executes the functions in the right order and can be extended to add additional steps in the workflow in the future. While the architecture in this approach is simple, extensible, and easy to understand, the cold start problem remains, impacting end-to-end latency.
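To make the architecture concrete, here is a hypothetical sketch (not the demo’s actual code) of the middle step, store-transcription, written as an HTTP-triggered Cloud Function in Python. The bucket name and request fields are illustrative assumptions; note that the module-level initialization is exactly the work a cold start pays for and that min instances keep warm.

```python
from google.cloud import storage

# Module-level initialization runs once per instance, at cold start.
storage_client = storage.Client()
BUCKET_NAME = "podcast-transcripts"  # assumed bucket name

def store_transcription(request):
    """HTTP entry point called by the workflow with a JSON payload."""
    payload = request.get_json(silent=True) or {}
    transcript = payload.get("transcript", "")
    episode = payload.get("episode", "episode")

    # Write the transcript to Cloud Storage for the next step in the workflow.
    blob = storage_client.bucket(BUCKET_NAME).blob(f"{episode}.txt")
    blob.upload_from_string(transcript, content_type="text/plain")

    return {"gcs_uri": f"gs://{BUCKET_NAME}/{episode}.txt"}
```

Approach 2 below deploys the same functions with a minimum instance count configured, so that warm instances are already sitting behind each step of the workflow.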
Approach 2: Building the application with Cloud Functions, Cloud Workflows and min instances

In this approach, we follow all the same steps as in Approach 1, with a slightly modified configuration that enables a set of min instances for each of the functions in the workflow.

Fig 2. Transcribe Podcast Serverless Workflow (Min Instances)

This approach presents the best of both worlds. It has the simplicity and elegance of wiring up the application architecture using Cloud Workflows and Cloud Functions. Further, each of the functions in this architecture leverages a set of min instances to mitigate the cold-start problem and reduce the time it takes to transcribe the podcast.

Comparison of cold start performance

Now consider executing the podcast transcription workflow using Approach 1, where no min instances are set on the functions that make up the app. In one instance of this run, the start and end timestamps of the log entries show that the total runtime in Approach 1 was 17 s.

Approach 1: Execution Time (without Min Instances)

Now consider executing the podcast transcription workflow using Approach 2, where min instances are set on the functions. In one instance of this run, the start and end timestamps show a total runtime of 6 s.

Approach 2: Execution Time (with Min Instances)

That’s an 11-second difference between the two approaches. The example set of functions is hardcoded with a 2-to-3-second sleep during function initialization, and when combined with average platform cold-start times, you can clearly see the cost of not using min instances. You can reproduce the above experiment in your own environment using the tutorial here.

Check out min instances on Cloud Functions

We are super excited to ship min instances on Cloud Functions, which will allow you to run more latency-sensitive applications, such as podcast transcription workflows, in the serverless model. You can also learn more about Cloud Functions and Cloud Workflows in the following Quickstarts: Cloud Functions, Cloud Workflows.
Quelle: Google Cloud Platform

Google Cloud VMware Engine, PowerCLI and BigQuery Analytics

Google Cloud Billing allows Billing Account Administrators to configure the export of Google Cloud billing data to a BigQuery dataset for analysis and intercompany billback scenarios. Developers may choose to extract Google Cloud VMware Engine (GCVE) configuration and utilization data and apply internal cost and pricing data to create custom reports that support GCVE resource billback scenarios. Using VMware PowerCLI, a collection of modules for Windows PowerShell, data is extracted and loaded into BigQuery. Once the data is loaded into BigQuery, analysts may choose to create billing reports and dashboards using Looker or Google Sheets. Exporting Google Cloud billing data into a BigQuery dataset is relatively straightforward; however, exporting data from GCVE into BigQuery requires PowerShell scripting. This blog details the steps to extract data from GCVE and load it into BigQuery for reporting and analysis.

Initial Setup

VMware PowerCLI Installation

Installing and configuring VMware PowerCLI is a relatively quick process, with the primary dependency being network connectivity between the GCVE Private Cloud and the host used to develop and run the PowerShell script.

- Option A: Provision a Google Compute Engine instance, for example Windows Server 2019, for use as a development server.
- Option B: Alternatively, use a Windows laptop with PowerShell 3.0 or higher installed.

Launch the PowerShell ISE as Administrator, then install and configure VMware PowerCLI and any required dependencies and perform a connection test.
Reference: https://www.powershellgallery.com/packages/VMware.PowerCLI/12.3.0.17860403

VMware PowerCLI Development

Next, develop a script to extract data and load it into BigQuery. Note that this requires the developer to have permissions to create a BigQuery dataset, create tables, and insert data. An example process follows; a Python sketch of the load step (steps 8–9) appears after the list.

1. Import the PowerCLI module and connect to the GCVE cluster.
2. [Optional] A vCenter simulator Docker container (nimmis/vcsim, a vCenter and ESXi API based simulator) may be useful for development purposes. For information on setting up a vCenter simulator, see https://www.altaro.com/vmware/powercli-scripting-vcsim/.
3. Create a dataset to hold data tables in BigQuery. This dataset may hold multiple tables.
4. Create a list of vCenters you would like to collect data from.
5. Create a file name variable.
6. For a simple VM inventory, extract data using the Get-VM cmdlet. You may also choose to extract data using other cmdlets, for example Get-VMHost and Get-Datastore. Review the vSphere developer documentation for more information on available cmdlets along with specific examples.
7. View and validate the JSON data as required.
8. Create a table in BigQuery. This only needs to be done once. For example, the .json file can first be loaded into a Cloud Storage bucket and the table created from the file in the bucket.
9. Load the file into BigQuery.
10. Disconnect from the server.
11. Consider scheduling the script using Windows Task Scheduler, cron, or another scheduling tool so that it runs on the required schedule.
12. Using the BigQuery UI or Google Data Studio, create views and queries referencing the staging tables to extract and transform data for reporting and analysis purposes.
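However you produce the export, the load into BigQuery (steps 8–9) can be scripted. Here is a minimal sketch using the google-cloud-bigquery Python client, assuming the export was written as newline-delimited JSON to a Cloud Storage bucket; the project, dataset, table, and bucket names are placeholders.

```python
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names for the staging table and the exported file.
table_id = "my-project.gcve_inventory.vm_inventory_staging"
source_uri = "gs://my-gcve-exports/vm-inventory.json"

job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.NEWLINE_DELIMITED_JSON,
    autodetect=True,  # infer the schema from the exported file
    write_disposition=bigquery.WriteDisposition.WRITE_APPEND,
)

load_job = client.load_table_from_uri(source_uri, table_id, job_config=job_config)
load_job.result()  # wait for the load to complete

table = client.get_table(table_id)
print(f"Loaded {table.num_rows} rows into {table_id}")
```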
It’s a good idea to create supporting tables in BigQuery for cost analysis, such as a date dimension table, a pricing schedule table, and other relevant lookup tables to support allocations and departmental billback scenarios. Connect to BigQuery using Looker to create reports and dashboards, or connect to BigQuery from Google Sheets and import the data.

Using Custom Tags

Custom tags allow a GCVE administrator to associate a VM with a specific service or application and are useful for billback and cost allocation. For example, VMs that have a custom tag populated with a service name (e.g. x-callcenter) can be grouped together to calculate the direct costs required to deliver a service. Jump boxes or shared VMs may be tagged accordingly and grouped to support shared-service and indirect cost allocations. Custom tags combined with key metrics such as provisioned, utilized, and available capacity enable GCVE administrators to optimize infrastructure and support budgeting and accounting requirements.

Serverless Billing Exports scheduled with Cloud Scheduler

In addition to running PowerShell code as a scheduled task, you may choose to host your script in a container and enable script execution through a web service. One possible solution could look something like this:

1. Create a Dockerfile based on ubuntu:18.04 that installs Python 3, PowerShell 7, and VMware PowerCLI, along with a requirements.txt for the Python dependencies.
2. For your main.py script, use subprocess to run your PowerShell script (a minimal sketch follows below).
3. Push your container to Container Registry, deploy it, and schedule the ELT run using Cloud Scheduler.
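A minimal sketch of such a main.py, assuming Flask is included in the container’s requirements.txt and that the PowerCLI export script is named export_gcve.ps1 (both names are illustrative). Cloud Scheduler can then invoke the /run endpoint over HTTP on whatever schedule you need.

```python
import subprocess

from flask import Flask, jsonify

app = Flask(__name__)

@app.route("/run", methods=["POST"])
def run_export():
    # Shell out to the PowerCLI script that extracts GCVE data and loads it
    # into BigQuery. "pwsh" is the PowerShell 7 binary installed in the image.
    result = subprocess.run(
        ["pwsh", "-File", "export_gcve.ps1"],
        capture_output=True,
        text=True,
        timeout=3600,
    )
    status = 200 if result.returncode == 0 else 500
    return (
        jsonify(
            {
                "returncode": result.returncode,
                "stdout": result.stdout[-2000:],  # tail of the logs for debugging
                "stderr": result.stderr[-2000:],
            }
        ),
        status,
    )

if __name__ == "__main__":
    # For local testing; a production container would typically run gunicorn.
    app.run(host="0.0.0.0", port=8080)
```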
Quelle: Google Cloud Platform

AWS DataSync improves task filtering and queuing

When you create an AWS DataSync task to transfer your data to and from AWS Storage, you can now specify include filters as well as exclude filters, giving you even more control over how your data is transferred. With this enhancement, you can now schedule tasks that use both exclude and include filters to transfer only a subset of the files in your source location. In addition, you can now queue multiple executions of a task even when the filter settings differ between executions.
Quelle: aws.amazon.com

IPv6 endpoints are now available for the Amazon EC2 Instance Metadata Service, Amazon Time Sync Service, and Amazon VPC DNS server

The Amazon EC2 Instance Metadata Service, the Amazon Time Sync Service, and the Amazon VPC DNS server can now be accessed over IPv6 endpoints from instances built on the Nitro System. These instance-local services have IPv6 addresses that are reachable from your Amazon EC2 instances. The IPv6 endpoints use Unique Local Addresses (ULAs); IPv6 for instance-local services is useful for running software and containers in an IPv6-only, single-stack configuration. If you are beginning your transition to IPv6 in a dual-stack environment, the endpoints for the Instance Metadata Service, Amazon Time Sync Service, and Amazon VPC DNS are also available over both IPv4 and IPv6.
Quelle: aws.amazon.com

Amazon Elasticsearch Service now supports three Availability Zone deployments in the AWS GovCloud (US-East) Region

Amazon Elasticsearch Service (Amazon ES) now lets you deploy your instances across three Availability Zones (AZs), giving your domains better availability. If you enable replicas for your Elasticsearch indices, Amazon Elasticsearch Service distributes the primary and replica shards across nodes in different AZs to maximize availability.
Quelle: aws.amazon.com

Amazon MSK adds metrics for better capacity visibility

Amazon Managed Streaming for Apache Kafka (Amazon MSK) now provides better visibility into Amazon MSK resource utilization through 19 new metrics published to Amazon CloudWatch. These metrics give customers additional insight into resource utilization across CPU, memory, and network, so they can maximize the performance and uptime of their Apache Kafka applications that interact with Amazon MSK.
Quelle: aws.amazon.com