Efficient PyTorch training with Vertex AI

Vertex AI provides flexible, scalable hardware and secured infrastructure to train PyTorch-based deep learning models with pre-built or custom containers. For model training with large amounts of data, the best practice is to use the distributed training paradigm and read data from Cloud Storage. However, training with data in the cloud, such as remote storage on Cloud Storage, introduces a new set of challenges. For example, when a dataset consists of many small individual files, randomly accessing them can introduce network overhead. Another challenge is data throughput: the speed at which data is fed to the hardware accelerators (GPUs) to keep them fully utilized.

In this post, we walk through methods to improve training performance step by step, starting without distributed training and then moving to distributed training paradigms with data in the cloud. In the end, we boost training by 6x with data on Cloud Storage, approaching the same speed as data on a local disk. We also show how the Vertex AI Training service, together with Vertex AI Experiments and Vertex AI TensorBoard, can be used to keep track of experiments and results. You can find the accompanying code for this blog post in the GitHub repo.

PyTorch distributed training

PyTorch natively supports distributed training strategies. DataParallel (DP) is a simple strategy often used for single-machine, multi-GPU training, but the single process it relies on can become a performance bottleneck. This approach loads an entire mini-batch on the main thread and then scatters the sub-mini-batches across the GPUs. The model parameters are updated only on the main GPU and then broadcast to the other GPUs at the beginning of the next iteration.

DistributedDataParallel (DDP) fits multi-node, multi-GPU scenarios, where the model is replicated on each device and each device is controlled by an individual process. Each process loads its own mini-batch and passes it to its GPU.
Each process also has its own optimizer with no parameter broadcast, reducing communication overhead. Finally, unlike DP, an all-reduce operation is performed across GPUs. This multi-process design benefits training performance.

FullyShardedDataParallel (FSDP) is another data-parallel paradigm similar to DDP. It enables fitting more data and larger models by sharding the optimizer states, gradients, and parameters into multiple FSDP units, unlike DDP, where model parameters are replicated on each GPU.

Different distributed training strategies ideally fit different training scenarios. However, it is not always easy to pick the best one for a specific environment configuration. For example, the effectiveness of the data loading pipeline to GPUs, the batch size, and the network bandwidth in a multi-node setup can all affect the performance of a distributed training strategy. In this post, we use the PyTorch ResNet-50 as the example model and train it on ImageNet validation data (50K images) to measure training performance for different training strategies.

Demonstration

Environment configurations

For the test environment, we create custom jobs on Vertex AI Training; the environment setup and training hyperparameters are kept the same across all of the following experiments. For each experiment, we train the model for 10 epochs and use the averaged epoch time as the training performance metric. Please note that we focus on improving training time, not on model performance itself.

Read data from Cloud Storage with gcsfuse and WebDataset

We use gcsfuse to access data on Cloud Storage from Vertex AI Training jobs. Vertex AI training jobs have Cloud Storage buckets already mounted via gcsfuse, so no additional work is required to use it. With gcsfuse, training jobs on Vertex AI can access data on Cloud Storage as simply as files in the local file system.
This also provides high throughput for large-file sequential reads.

```python
open('/gcs/test-bucket/path/to/object', 'r')
```

The data loading pipeline can become a bottleneck of distributed training when it reads individual data files from the cloud. WebDataset is a PyTorch dataset implementation designed to improve streaming data access, especially in remote storage settings. The idea behind WebDataset is similar to TFRecord: it collects multiple raw data files and compiles them into one POSIX tar file. But unlike TFRecord, it doesn't do any format conversion or assign object semantics to data; the data format is the same inside the tar file as it is on disk. Refer to this blog post for key pipeline performance enhancements achievable with WebDataset.

WebDataset shards a large number of individual images into a small number of tar files. During training, each single network request can fetch multiple images and cache them locally for the next couple of batches. This sequential I/O greatly lowers the overhead of network communication. In the demonstration below, we compare training with data on Cloud Storage via gcsfuse, with and without WebDataset.

NOTE: WebDataset has been incorporated into the official TorchData library as torchdata.datapipes.iter.WebDataset. But the TorchData library is currently in Beta and doesn't have a stable release, so we stick with the original WebDataset package as the dependency.

Without distributed training

We train the ResNet-50 on a single GPU first to get a baseline performance. From the result we can see that, when training on a single GPU, using data on Cloud Storage takes about 2x the time of using a local disk.
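As an aside, the sharding step that WebDataset performs is conceptually simple: many small files are packed into a few sequentially numbered tar archives. Here is a minimal, library-free sketch of that packing step; the file names, `.jpg` extension, and shard size are illustrative, and in practice you would write real image bytes plus labels and read the shards back with the webdataset package.

```python
import io
import math
import tarfile
import tempfile
from pathlib import Path

def shard_to_tars(samples, out_dir, shard_size):
    """Pack (key, payload-bytes) samples into sequentially numbered tar
    shards, mimicking WebDataset's packing of many small files."""
    out_dir = Path(out_dir)
    n_shards = math.ceil(len(samples) / shard_size)
    shard_paths = []
    for idx in range(n_shards):
        shard_path = out_dir / f"shard-{idx:06d}.tar"
        with tarfile.open(shard_path, "w") as tar:
            for key, payload in samples[idx * shard_size:(idx + 1) * shard_size]:
                # each sample becomes one member of the tar, named by its key
                info = tarfile.TarInfo(name=f"{key}.jpg")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
        shard_paths.append(shard_path)
    return shard_paths

# ten fake "images" packed into shards of four -> 3 shards (4 + 4 + 2)
samples = [(f"img{i:03d}", bytes(100)) for i in range(10)]
with tempfile.TemporaryDirectory() as tmp:
    shards = shard_to_tars(samples, tmp, shard_size=4)
    shard_names = [p.name for p in shards]
```

A training job then issues one network request per shard instead of one per image, which is where the sequential I/O benefit comes from.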
Keep this in mind; we will use multiple methods to improve the performance step by step.

DataParallel (DP)

The DataParallel strategy is the simplest method introduced by PyTorch to enable single-machine, multi-GPU training with the smallest code change, as small as one line:

```python
model = torch.nn.DataParallel(model)
```

We train the ResNet-50 on a single node with 4 GPUs using the DP strategy. After applying DP on 4 GPUs, we can see that:

- Training with data on the local disk gets 3x faster (from 489s to 157s).
- Training with data on Cloud Storage gets a little faster (from 804s to 738s).

It's apparent that distributed training with data on Cloud Storage becomes input bound, waiting for data to be read due to the network bottleneck.

DistributedDataParallel (DDP)

DistributedDataParallel is more sophisticated and powerful than DataParallel. It's recommended to use DDP over DP, despite the added complexity, because DP is single-process, multi-thread and suffers from Python GIL contention, while DDP fits more scenarios such as multi-node and model-parallel training.
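Before the experiment, it helps to see what DDP's synchronization conceptually does: after each backward pass, the per-worker gradients are averaged with an all-reduce so that every replica applies the same update. A minimal, library-free sketch of that averaging step (illustrative only, not the real NCCL implementation):

```python
def all_reduce_mean(per_worker_grads):
    """Average each parameter's gradient across workers, as DDP's
    all-reduce does, so every replica ends up with the same gradient."""
    n_workers = len(per_worker_grads)
    n_params = len(per_worker_grads[0])
    return [
        sum(worker[i] for worker in per_worker_grads) / n_workers
        for i in range(n_params)
    ]

# gradients for 3 parameters, computed on 4 workers' different mini-batches
grads = [
    [0.1, 0.2, 0.3],
    [0.3, 0.2, 0.1],
    [0.2, 0.2, 0.2],
    [0.2, 0.2, 0.2],
]
avg = all_reduce_mean(grads)  # every worker now holds the same averaged gradient
```

Because each process reads and preprocesses its own mini-batch independently, data loading also parallelizes across processes, which is why DDP relieves the GIL-related input bottleneck seen with DP.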
Here we experimented with DDP on a single node with 4 GPUs, where each GPU is handled by an individual process. We use the nccl backend to initialize the process group for DDP and construct the model:

```python
dist.init_process_group(
    backend='nccl', init_method='env://',
    world_size=4, rank=rank)

torch.nn.parallel.DistributedDataParallel(model)
```

We train the ResNet-50 on 4 GPUs using the DDP strategy and WebDataset. After enabling DDP on 4 GPUs, we can see that:

- Training with data on the local disk gets faster still compared to DP (from 157s to 134s).
- Training with data on Cloud Storage gets much better (from 738s to 432s), but is still about 3x slower than using a local disk.
- Training with data on Cloud Storage gets a lot faster (from 432s to 133s) when using source files in WebDataset format, which is very close to, or as good as, the speed of training with data on the local disk.

The input-bound problem is largely relieved when using DDP, which is expected because there is no longer Python GIL contention for reading data. And despite the additional data preprocessing work, sharding data with WebDataset benefits performance by removing the overhead of network communication. Overall, DDP and WebDataset improve training performance by 6x (from 804s to 133s) compared to non-distributed training on individual smaller files.

FullyShardedDataParallel (FSDP)

FullyShardedDataParallel wraps model layers into FSDP units. It gathers full parameters before the forward and backward operations and runs reduce-scatter to synchronize gradients.
It achieves lower peak memory usage than DDP with some configurations.

```python
# policy to recursively wrap layers with FSDP
fsdp_auto_wrap_policy = functools.partial(
    size_based_auto_wrap_policy,
    min_num_params=100)

# construct the model to shard model parameters
# across data parallel workers
model = torch.distributed.fsdp.FullyShardedDataParallel(
    model,
    auto_wrap_policy=fsdp_auto_wrap_policy)
```

We train the ResNet-50 on 4 GPUs using the FSDP strategy and WebDataset. We can see that FSDP achieves a similar training performance as DDP in this configuration on a single node with 4 GPUs.

Comparing performance across these different training strategies, with and without the WebDataset format, we see an overall 6x performance improvement with data on Cloud Storage when using WebDataset together with the DistributedDataParallel or FullyShardedDataParallel strategies. The training performance with data on Cloud Storage is then similar to training with data on a local disk.

Tracking with Vertex AI TensorBoard and Experiments

As you have seen so far, we carried out performance improvement trials step by step, and it was necessary to run experiments with several configurations and track their development and outcomes. Vertex AI Experiments enables seamless experimentation along with tracking: you can track parameters, and visualize and compare the performance metrics of your model and pipeline experiments.

You can use the Vertex AI Python SDK to create an experiment and log the parameters, metrics, and artifacts associated with experiment runs. The SDK provides a handy initialization method to create a TensorBoard instance using Vertex AI TensorBoard for logging model time-series metrics.
For example, we tracked training loss, validation accuracy, and training run times for each epoch. Below is a snippet that starts an experiment, logs model parameters, runs the training job, and tracks metrics at the end of the training session:

```python
# Create Tensorboard instance and initialize Vertex AI client
TENSORBOARD_RESOURCE_NAME = aiplatform.Tensorboard.create()
aiplatform.init(project=PROJECT_ID,
                location=REGION,
                experiment=EXPERIMENT_NAME,
                experiment_tensorboard=TENSORBOARD_RESOURCE_NAME,
                staging_bucket=BUCKET_URI)

# start experiment run
aiplatform.start_run(EXPERIMENT_RUN_NAME)

# log parameters to the experiment
aiplatform.log_params(exp_params)

# create job
job = aiplatform.CustomJob(
    display_name=DISPLAY_NAME,
    worker_pool_specs=WORKER_SPEC,
    staging_bucket=BUCKET_URI,
    base_output_dir=BASE_OUTPUT_DIR
)

# run job
job.run(
    service_account=SERVICE_ACCOUNT,
    tensorboard=TENSORBOARD_RESOURCE_NAME
)

# log metrics to the experiment
metrics_df = pd.read_json(metrics_path, typ='series')
aiplatform.log_metrics(metrics_df[metrics_cols].to_dict())

# stop the run
aiplatform.end_run()
```

The SDK also supports a handy get_experiment_df method that returns experiment run information as a pandas DataFrame. Using this DataFrame, we can effectively compare performance between different experiment configurations. Since the experiment is backed by Vertex AI TensorBoard, you can also access TensorBoard from the console for deeper analysis.
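To illustrate the kind of comparison that get_experiment_df enables, here is a plain-Python summary of the averaged epoch times reported in the experiments above (the run names are illustrative labels, not actual Vertex AI run IDs):

```python
# Averaged epoch times (seconds) for data on Cloud Storage, as reported above.
epoch_times = {
    "1xGPU / Cloud Storage": 804,
    "DP 4xGPU / Cloud Storage": 738,
    "DDP 4xGPU / Cloud Storage": 432,
    "DDP 4xGPU / Cloud Storage + WebDataset": 133,
}

# speedup of each configuration relative to the single-GPU baseline
baseline = epoch_times["1xGPU / Cloud Storage"]
speedups = {run: round(baseline / t, 1) for run, t in epoch_times.items()}
```

The final row reproduces the headline result: DDP plus WebDataset yields roughly a 6x speedup over the non-distributed baseline.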
For the experiment, we modified the training code to add TensorBoard scalars with the metrics we were interested in.

Conclusion

In this post, we demonstrated how PyTorch training can become input bound when data is read from Google Cloud Storage, and we showed approaches to improve performance by comparing distributed training strategies and introducing the WebDataset format.

- Use WebDataset to shard individual files, which can improve sequential I/O performance by reducing network bottlenecks.
- When training on multiple GPUs, choose the DistributedDataParallel or FullyShardedDataParallel distributed training strategies for better performance.
- For large-scale datasets that you cannot download to a local disk, use gcsfuse to simplify data access to Cloud Storage from Vertex AI, and use WebDataset to shard individual files, reducing network overhead.
- Vertex AI improves productivity when carrying out experiments while offering flexibility, security, and control. Vertex AI Training custom jobs make it easy to run experiments with several training configurations, GPU shapes, and machine specs. Combined with Vertex AI Experiments and Vertex AI TensorBoard, you can track parameters and visualize and compare the performance metrics of your model and pipeline experiments.

You can find the accompanying code for this blog post in this GitHub repo.
Source: Google Cloud Platform

Using Vertex AI to build an industry leading Peer Group Benchmarking solution

The modern world of financial markets is fraught with volatility and uncertainty. Market participants and members are rethinking the way they approach problems and rapidly changing the way they do business. Access to models, usage patterns, and data has become key to keeping up with ever-evolving markets. One of the biggest challenges firms face in futures and options trading is determining how they benchmark against their competitors. Market participants are continually looking for ways to improve performance, identifying what happened, why it happened, and any associated risks. Leveraging the latest technologies in automation and artificial intelligence, many organizations are using Vertex AI to build a solution around peer group benchmarking and explainability.

Introduction

Using the speed and efficiency of Vertex AI, we have developed a solution that allows market participants to identify similar trading group patterns and assess performance relative to their competition. Machine learning (ML) models for dimensionality reduction, clustering, and explainability are trained to detect patterns and transform data into valuable insights. This blog post goes over these models in detail, as well as the ML operations (MLOps) pipeline used to train and deploy them at scale.

A series of successive models is used, each feeding its predictive results as training data into the next (dimensionality reduction -> clustering -> explainability). This requires a robust automated system for training and maintaining models and data, and provides an ideal use case for the MLOps capabilities of Vertex AI.

The Solution

Data

A market analytics dataset was used which contains market participant trading metrics aggregated and averaged across a 3-month period. This dataset contains a high number of dimensions. Specific features include buying and selling counts, trade and order quantities and types, first and last fill times, aggressive vs.
passive trading indicators, and a number of other features related to trading behavior.

Modeling

Dimensionality Reduction

Clustering in high-dimensional space presents a challenge, particularly for distance-based clustering algorithms. As the number of dimensions grows, the distances between all points in the dataset converge and become more similar. This distance concentration problem makes it difficult to perform typical cluster analysis on highly dimensional data.

For the task of dimensionality reduction, an artificial neural network (ANN) autoencoder was used to learn a supervised similarity metric for each market participant in the dataset. The autoencoder takes in each market participant and their associated features, and pushes the information through a hidden layer that is constrained in size, forcing the network to learn how to condense the information into a small encoded representation.

The constrained layer is a vector (z) in latent space, where each element of the vector is a learned reduction of the original market participant features (X), allowing dimensionality reduction by simply applying X * z. This results in a new distribution of customer data q(X' | X), where the distribution is constrained in size to the shape of z. By minimizing the reconstruction error between the initial input X and the autoencoder's reconstructed output X', we can balance the overall size of the similarity space (the number of latent dimensions) against the amount of information lost. The resulting output of the autoencoder is a 2-dimensional learned representation of the highly dimensional data.

Clustering

Experiments were conducted to determine the optimal clustering algorithm, number of clusters, and hyperparameters. A number of models were compared, including density-based spatial clustering of applications with noise (DBSCAN), agglomerative clustering, Gaussian mixture models (GMM), and k-means.
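Of these candidates, k-means is the easiest to sketch. The following minimal pure-Python implementation (toy 2-D data; a real pipeline would use scikit-learn or similar, with random or k-means++ initialization) shows the assign-then-recompute refinement at the heart of the algorithm:

```python
def nearest(point, centroids):
    """Index of the centroid nearest to point (squared Euclidean distance)."""
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
    return dists.index(min(dists))

def kmeans(points, k, n_iter=20):
    # initialize centroids from the first k points (illustrative only)
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(n_iter):
        labels = [nearest(p, centroids) for p in points]
        new_centroids = []
        for j in range(k):
            members = [p for p, lbl in zip(points, labels) if lbl == j]
            if members:  # recompute centroid as the mean of its members
                new_centroids.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:        # keep an empty cluster's centroid unchanged
                new_centroids.append(centroids[j])
        if new_centroids == centroids:  # assignments stable: converged
            break
        centroids = new_centroids
    return centroids, [nearest(p, centroids) for p in points]

# two well-separated blobs standing in for trading-metric vectors
pts = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
       (5.0, 5.0), (5.1, 5.2), (5.2, 5.1)]
centroids, labels = kmeans(pts, k=2)  # labels: [0, 0, 0, 1, 1, 1]
```

Each refinement step reduces the within-cluster sum-of-squares, which is exactly the criterion discussed in this section.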
Using silhouette score as the evaluation criterion, it was ultimately determined that k-means performed best for clustering the dimensionally reduced data. The k-means algorithm is an iterative refinement technique that aims to separate data points into n groups of equal variance. Each group is defined by a cluster centroid, which is the mean of the data points in the cluster. Cluster centroids are initially randomly generated, then iteratively reassigned until the within-cluster sum-of-squares, Σᵢ minⱼ ‖xᵢ − μⱼ‖², is minimized.

Explainability

Explainable AI (XAI) aims to provide insights into why a model predicts the way it does. For this use case, XAI models are used to explain why a market participant was placed into a particular peer group. This is achieved through feature importance, e.g., for each market participant, the top contributing factors toward a peer group cluster assignment.

Deriving explainability from clustering models is somewhat difficult. Clustering is an unsupervised learning problem, which means there are no labels or "ground truth" for the model to analyze. Distance-based clustering algorithms instead create labels for the data points based on their relative positioning to each other. These labels are assigned as part of the prediction by the k-means algorithm: each point in the dataset is given a peer group assignment that associates it with a particular cluster. XAI models can then be trained on top of k-means by fitting a classifier to these peer group cluster assignments. Using the cluster assignments as labels turns the problem into supervised learning, where the end goal is to determine feature importance for the classifier.
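One way to compute such feature attributions is with Shapley values. Real pipelines use the shap library on the fitted classifier; as a self-contained illustration, here is an exact Shapley computation for a toy additive scoring function (the feature names and contribution amounts are made up, and exact enumeration is only tractable for a handful of features):

```python
from itertools import combinations
from math import factorial

def shapley_values(features, value_fn):
    """Exact Shapley values: each feature's average marginal contribution
    to value_fn over all coalitions of the other features."""
    n = len(features)
    phi = {}
    for f in features:
        others = [g for g in features if g != f]
        total = 0.0
        for r in range(n):
            for coalition in combinations(others, r):
                # standard Shapley weight |S|! (n - |S| - 1)! / n!
                weight = factorial(r) * factorial(n - r - 1) / factorial(n)
                subset = frozenset(coalition)
                total += weight * (value_fn(subset | {f}) - value_fn(subset))
        phi[f] = total
    return phi

# toy additive "classifier score": each feature contributes a fixed amount,
# so its Shapley value should recover exactly that amount
contrib = {"buy_count": 3.0, "fill_time": 1.0, "order_qty": 0.0}
score = lambda coalition: sum(contrib[f] for f in coalition)
phi = shapley_values(list(contrib), score)
ranked = sorted(phi, key=phi.get, reverse=True)  # most important feature first
```

Ranking the resulting values, as in the last line, is exactly how top contributing factors per peer group assignment are surfaced.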
Shapley values are used for feature importance; they explain the marginal contribution of each feature to the final classification prediction. The Shapley values are then ranked, giving market participants a powerful tool to analyze which features contribute most to their peer group assignments.

MLOps

MLOps is an ML engineering culture and practice that aims to unify ML system development (Dev) and ML system operation (Ops). Using Vertex AI, a fully functioning MLOps pipeline has been constructed that trains and explains the peer group benchmarking models. The pipeline is complete with automation and monitoring at all steps of ML system construction, including integration, testing, releasing, deployment, and infrastructure management. It also includes a comprehensive approach to continuous integration / continuous delivery (CI/CD). Vertex AI's end-to-end platform was used to meet these MLOps needs, including:

- Distributed training jobs to construct ML models at scale using Vertex AI Pipelines
- Hyperparameter tuning jobs to quickly tune complex models using Vertex AI Vizier
- Model versioning using Vertex AI Model Registry
- Batch prediction jobs using Vertex AI Prediction
- Tracking metadata related to training jobs using Vertex ML Metadata
- Tracking model experimentation using Vertex AI Experiments
- Storing and versioning training data from prediction jobs using Vertex AI Feature Store
- Data validation and monitoring using TensorFlow Data Validation (TFDV)

The MLOps pipeline is broken down into 5 core areas:

- CI/CD & Orchestration
- Data Ingestion & Preprocessing
- Dimensionality Reduction
- Clustering
- Explainability

The CI/CD and orchestration layer was implemented using Vertex AI Pipelines, Cloud Source Repositories (CSR), Artifact Registry, and Cloud Build. When changes are made to the code base, Cloud Build triggers automatically run unit tests, build containers, push the containers to Artifact Registry, and compile and run the Vertex AI pipeline.
The pipeline is a sequence of connected components that run successive training and prediction jobs; the outputs from one model are stored in Vertex AI Feature Store and used as inputs to the next model. The end result of this pipeline is a series of trained models for dimensionality reduction, clustering, and explainability, all stored in Vertex AI Model Registry. Peer groups and explainable results are written to Feature Store and BigQuery, respectively.

Working with AI Services in Google Cloud's Professional Services Organization (PSO)

AI Services leads the transformation of enterprise customers and industries with cloud solutions. We are seeing widespread application of AI across financial services and capital markets. Vertex AI provides a unified platform for training and deploying models and helps enterprises make data-driven decisions more effectively. You can learn more about our work at: Google Cloud, Vertex AI, Google Cloud consulting services, and Custom AI-as-a-Service.

This post was edited with help from Mike Bernico, Eugenia Inzaugarat, Ashwin Mishra, and the rest of the delivery team. I would also like to thank core team members Rochak Lamba, Anna Labedz, and Ravinder Lota.

Related article: Accelerate the deployment of ML in production with Vertex AI. Google Cloud expands Vertex AI to help customers accelerate deployment of ML models into production.
Source: Google Cloud Platform

BigQuery Omni: solving cross-cloud challenges by bringing analytics to your data

Research shows that over 90% of large organizations already deploy multicloud architectures, with their data distributed across several public cloud providers. Additionally, data is increasingly split across various storage systems such as warehouses, operational and relational databases, and object stores. With the proliferation of new applications, data now serves many more use cases, such as data science, business intelligence, analytics, and streaming. Given these trends, customers are increasingly gravitating toward an open multicloud data lake. However, multicloud data lakes present several challenges, such as data silos, data duplication, fragmented governance, complexity of tools, and increased costs.

With Google's data cloud technologies, customers can leverage a unique combination of distributed cloud services. They can create an agile cross-cloud semantic business layer with Looker, and manage data lakes and data warehouses across cloud environments at scale with BigQuery and capabilities like BigLake and BigQuery Omni. BigLake is a storage engine that unifies data warehouses and lakehouses by standardizing across different storage formats, including BigQuery managed tables and open file formats such as Parquet and Apache Iceberg on object storage. BigQuery Omni provides the compute engine that runs local to the storage on AWS or Azure, which customers can use to query data in AWS or Azure seamlessly.
This provides several key benefits:

- A single pane of glass to query your multicloud data lakes (across Google Cloud Platform, Amazon Web Services, and Microsoft Azure)
- Cross-cloud analytics that combine data across different platforms with little to no egress costs
- Unified governance and secure management of your data wherever it resides

In this blog, we share the cross-cloud analytics use cases customers are solving with Google's Data Cloud and the benefits they are realizing.

Unified marketing analytics for 360-degree insights

Organizations want to perform marketing analytics: ads optimization, inventory management, churn prediction, buyer propensity trends, and more. Before BigQuery Omni, customers had to use data from several different sources, such as Google Analytics, public datasets, and other proprietary information stored across cloud environments. This required moving large amounts of data, managing duplicate copies, and paying incremental costs to perform any cross-cloud analytics and derive actionable insights.

With BigQuery Omni, organizations can greatly simplify this workflow. Using the familiar BigQuery interface, users can access data residing in AWS or Azure, then discover and select just the relevant data that needs to be combined for further analysis. This subset of data can be moved to Google Cloud using Omni's new cross-cloud transfer capabilities. Customers can combine it with other Google Cloud datasets, and the consolidated tables can be made available to key business stakeholders through advanced analytics tools such as Looker and Looker Studio. Customers can now also tie this data into world-class AI models via Vertex AI. As an illustrative example, consider a retailer who has sales and inventory, user, and search data spread across multiple data silos.
Using BigQuery Omni, they can seamlessly bring these datasets together and power several marketing analytics scenarios such as customer segmentation, campaign management, and demand forecasting.

"Interested in performing cross-cloud analytics, we tested BigQuery Omni and really liked the SQL support to easily get data from AWS S3. We have seen great potential and value in BigQuery Omni for adopting a multi-cloud data strategy." — Florian Valeye, Staff Data Engineer, Back Market, a leading online marketplace for renewed technology based in France

Data platform with consistent and unified cross-cloud governance

Another pattern is customers looking to analyze operational, transactional, and business data across data silos in different clouds through a unified data platform. These silos result from factors such as mergers and acquisitions, standardization of analytical tools, leveraging best-of-breed solutions in different clouds, and diversification of data footprint across clouds. In addition to a single pane of glass for data access across silos, customers strongly desire consistent and uniform governance of their data across clouds.

With BigLake and BigQuery Omni abstracting the storage and compute layers respectively, organizations can access and query their data from Google Cloud irrespective of where it resides. They can also set fine-grained row-level and column-level access policies in BigQuery and enforce them consistently across clouds. These building blocks enable data engineering teams to build a unified, governed data platform for their data users without having to build and manage complex data pipelines. Furthermore, with BigQuery Omni's integration with Dataplex and Data Catalog, you can discover and search your data across clouds and enrich it with relevant business context, such as a business glossary and rich text.

"Several SADA customers use GCP to build and manage their data analytics platform.
During many explorations and proofs of concept, our customers have seen the great potential and value in BigQuery Omni. Enabling seamless cross-cloud data analytics has allowed them to realize the value of their data quicker while lowering the barrier to entry for BigQuery adoption in a low-risk fashion." — Brian Suk, Associate Chief Technology Officer, SADA, a strategic partner of Google Cloud

Simplified data sharing between data providers and their customers

A third emerging pattern in cross-cloud analytics is data sharing. Many services have a business need to share information, such as inventory or subscriber data, with their customers or users, who in turn analyze or aggregate that data with their own proprietary data and often share the results back with the service provider. In many cases, the two parties are on different cloud environments, requiring them to move data back and forth.

Consider an example from a company like ActionIQ, which operates in the customer data platform (CDP) space. CDPs were designed to help activate customer data, and a critical first step is unifying and managing that data. To enable this, many CDP vendors built their solutions on one of the available cloud infrastructure technologies and copied data from the client's systems.

"Copying data from client applications and infrastructure has always been a requirement to deploy a CDP, but it doesn't have to be anymore." — Justin DeBrabant, Senior Vice President of Product, ActionIQ

While a small percentage of customers are fine with moving data across cloud environments, the majority are hesitant to onboard new services and would rather provide governed access to their data sets.
"A new architectural pattern is emerging, allowing organizations to keep their data in one location and make it accessible, with the proper guardrails, to applications used by the rest of the organization's stack," adds Justin at ActionIQ.

With BigQuery Omni, services on Google Cloud Platform can more easily access and share data with their customers and users in other cloud environments with limited data movement. One of the UK's largest statistics providers has explored Omni for their data sharing needs.

"We tested BigQuery Omni and really like the ability to get data from AWS directly into BQ. We're excited about managing data sharing with different organizations without onboarding new clouds." — Simon Sandford-Taylor, Chief Information and Digital Officer, UK's Office for National Statistics

With BigQuery Omni, customers are able to:

- Access and query data across clouds through a single user interface
- Reduce the need for data engineering before analyzing data
- Lower operational overhead and risk by deploying an application that runs across multiple clouds while leveraging the same, consistent security controls
- Accelerate access to insights by significantly reducing the time for data processing and analysis
- Create consistent and predictable budgeting across multiple cloud footprints
- Enable long-term agility and maximize the benefit of every cloud investment

Over the last year, we've seen great momentum in customer adoption and added significant innovations to BigQuery Omni, including improved performance and scalability for querying your data in AWS S3 or Azure Blob Storage, Iceberg support for Omni, a larger query result set size of up to 10GB, and cross-cloud transfer, which helps customers easily, securely, and cost-effectively move just enough data across cloud environments for advanced analytics.
BigQuery Omni has launched several features to support unified governance of your data across multiple clouds: you can control fine-grained access to your multicloud data with row-level and column-level security. Building on this, we are excited to announce that BigQuery Omni now supports data masking. We've also made it easy for customers to try BigQuery Omni and see its benefits through a limited-time free trial available until March 30, 2023. BigQuery Omni running on public clouds outside of Google Cloud is available in the AWS US East1 (N. Virginia) and Azure US East2 (US East) regions, and we will be bringing BigQuery Omni to more regions in the future, starting with Asia Pacific (AWS Korea) coming soon.

Getting started

Get started with a free trial to learn about Omni. Check out the documentation to learn more about BigQuery Omni. You can also leverage the self-paced labs to learn how to set up BigQuery Omni easily.
Source: Google Cloud Platform

Learn how Microsoft datacenter operations prepare for energy issues

The war in Ukraine and the resultant shortage of natural gas have forced the European Union (EU) and European countries to proactively prepare for the possibility of more volatile energy supplies, both this winter and beyond. Microsoft is working with customers, governments, and other stakeholders throughout the region to bring clarity, continuity, and compliance in the face of possible energy-saving strategies at the local and national level. In solidarity with Europe, where even essential services are likely to be asked to find energy savings, we have validated plans and contingencies in place to responsibly reduce energy use in our operations across Europe, and we will do so in a way that minimizes risk to customer workloads running in the Microsoft Cloud.

We want to share some of the contingencies and mitigations that our teams have put in place to responsibly operate our cloud services.

Supporting grid stability by responsibly managing our energy consumption

The power Microsoft consumes from utilities primarily runs our network and servers, cooling systems, and other datacenter operations. We have contingency plans to contribute to energy grid stability while working to ensure minimal disruption to our customers and their workloads, including:

The scale and distribution of the Microsoft datacenters gives us the ability to reposition non-regional platform as a service (PaaS) services, internal infrastructure, and many of our internal non-customer research and development (R&D) workloads to other nearby regions, while still meeting our data residency and EU Data Boundary commitments.
Actively working with local governments and large organizations to closely monitor and respond to power consumption to ensure grid stability and minimal disruption to our customers’ critical workloads. We are working with local utility providers to ensure our systems are ready for a range of circumstances.
Our datacenter regions are planned and built to withstand grid emergencies. When needed, we quickly transition to backup power sources to reduce load on the grid without affecting customer workloads.

Resilient infrastructure investment

Microsoft is responsible for providing our customers with a resilient foundation in the Microsoft Cloud—in how it is designed, operated, and monitored to ensure availability. We make considerable investments in the platform itself—physical things like our datacenters, as well as software things like our deployment and maintenance processes.

We strive to provide our cloud customers with “five-nines” of service availability, meaning the service is available 99.999 percent of the time. However, knowing that service interruptions and failures happen for a myriad of reasons, we build systems designed with failure in mind.
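The arithmetic behind "five-nines" is worth making concrete: at 99.999 percent availability, the allowed downtime is only a few minutes per year. A quick calculation:

```python
# Allowed downtime per year for a given availability target
# (using an average year of 365.25 days to account for leap years).
def allowed_downtime_minutes_per_year(availability: float) -> float:
    minutes_per_year = 365.25 * 24 * 60
    return (1.0 - availability) * minutes_per_year

for label, target in [("three nines", 0.999), ("five nines", 0.99999)]:
    print(f"{label}: {allowed_downtime_minutes_per_year(target):.2f} min/year")
```

Three nines allows roughly 526 minutes of downtime per year; five nines allows only about 5.26 minutes, which is why the resiliency measures described in this section matter.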

We have Azure Availability Zones (AZs) in every country in which we operate datacenter regions. AZs are composed of a minimum of three zone locations, each with independent power, cooling, and networking, allowing customers to spread their infrastructure and applications across discrete, dispersed datacenters for added resiliency and availability.

Battery backups and backup generators are an additional resiliency capability we implement; they are used during power grid outages and other service interruptions so we can meet service levels and maintain operational reliability. We have contracted access to additional fuel supplies to keep generators running, and we also hold critical spares to maintain generator health. We are ready to use backup generators across Europe, when necessary, to keep our services running in case of a serious grid emergency.

Across our global infrastructure, it’s not unusual for us to operate with heightened awareness due to external factors. For instance, severe winter weather in Texas in 2021 put substantial pressure on the Texas energy grid. Microsoft was able to take its San Antonio datacenter off grid power. Because Microsoft’s onsite substations were designed with redundancy, we were able to quickly transition to our tertiary redundant systems: generators. These systems kept the datacenters running, with zero impact to our cloud customers, while the utility grid could ensure residential homes stayed warm. During this event, we maintained 100 percent uptime for our customers while keeping our workloads off the grid.

Resiliency recommendations for cloud architectures

This is a challenging time for organizations monitoring the growing energy concerns in Europe. We are providing important infrastructure for the communities where we operate, and our customers are counting on us to provide reliable cloud services to run their critical workloads. We recognize the importance of continuity of service for our customers, including those providing essential services: health care providers, police and emergency responders, financial institutions, manufacturers of critical supplies, grocery stores and health agencies. Organizations wondering what more they can do to improve the reliability of their applications, or wondering how they can reduce their own energy consumption, can consider the following:

Customers who have availed themselves of high availability tools, including geo-redundancy, should be unaffected by impacts to a single datacenter region. For software as a service (SaaS) services like Microsoft 365, Microsoft Dynamics 365, and Microsoft Power Platform, the business continuity and resiliency are managed by Microsoft. For Microsoft Azure, customers should always consider designing their Azure workloads with high availability in mind.

We always encourage customers to have a Business Continuity and Disaster Recovery (BCDR) plan in place as part of the Microsoft Well-Architected Framework. Customers who want to proactively migrate their Azure resources from one region to another can do so at any time.
On-premises customers can reduce their own energy consumption by moving their applications, workloads, and databases to the cloud. The Microsoft Cloud can be up to 93 percent more energy efficient than traditional enterprise datacenters, depending on the specific comparison being made. Start your sustainability journey today.
Energy use in our datacenters is driven by customer use. Customers can play a part in reducing energy consumption by following green software development guidelines, including shutting down unused server instances and adopting sustainable application design.
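One of the green-software steps above, shutting down unused server instances, starts with identifying them. A purely illustrative sketch (not an Azure API; instance names and the utilization threshold are assumptions):

```python
# Illustrative only: flag instances whose low average CPU utilization
# suggests they are unused and could be shut down to save energy.
def idle_instances(avg_cpu_by_instance: dict, threshold_pct: float = 5.0) -> list:
    """Return instance names whose average CPU utilization is below the threshold."""
    return sorted(
        name for name, cpu in avg_cpu_by_instance.items() if cpu < threshold_pct
    )

fleet = {"web-1": 42.0, "batch-7": 1.2, "dev-3": 0.4}
print(idle_instances(fleet))  # → ['batch-7', 'dev-3']
```

In a real deployment the utilization figures would come from a monitoring service, and flagged instances would be reviewed before being deallocated.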

We continue to improve the energy efficiency of our datacenters as part of our ongoing commitment to make our global infrastructure more sustainable and efficient. As countries and energy providers consider options to reduce their consumption of electricity in the event of an energy capacity shortage, we are working with grid operators on this evolving situation. Given the scale, expertise, and partnerships with which we operate, we are confident that our risk mitigation activities will offset any potential disruption to our customers running their critical workloads in the cloud.
Source: Azure

NoSQL Workbench for Amazon DynamoDB now supports creating data models directly from sample data model templates

NoSQL Workbench for Amazon DynamoDB now supports creating data models directly from sample data model templates to help you generate data schemas for your workloads. With this feature, you can familiarize yourself with NoSQL data modeling best practices as you build your applications on DynamoDB.
Source: aws.amazon.com

AWS IoT Device Defender Audit now identifies potential misconfigurations in IoT policies

Today, AWS IoT Device Defender introduced a new audit check that finds certain potential misconfigurations in AWS IoT policies. Incorrect security configurations, such as overly permissive policies, can be a leading cause of security incidents. With this new audit check in AWS IoT Device Defender, you can now more easily identify vulnerabilities, troubleshoot issues, and take the necessary corrective actions.
Source: aws.amazon.com

Target multiple resource types with wildcard configuration for AWS CloudFormation Hooks

Today, AWS CloudFormation Hooks launches wildcard resource types for Hook configurations, allowing customers to target multiple resource types. With wildcards, customers can define resource targets to build flexible Hooks. Such Hooks can be invoked for resource types that were not explicitly known when the Hook was created. For example, customers can use the wildcard AWS::ECR::* to define a Hook that is triggered for all resource types in Amazon ECR.
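To make the wildcard targeting concrete, here is an illustrative sketch of matching concrete CloudFormation resource type names against a wildcard target like AWS::ECR::*. Glob-style matching via Python's `fnmatch` is an assumption for illustration, not the service's exact matcher.

```python
# Illustrative: which resource types would a wildcard Hook target apply to?
from fnmatch import fnmatchcase

def hook_targets(wildcard: str, resource_types: list) -> list:
    """Return the resource types matched by a wildcard Hook target."""
    return [rt for rt in resource_types if fnmatchcase(rt, wildcard)]

types = ["AWS::ECR::Repository", "AWS::ECR::PublicRepository", "AWS::S3::Bucket"]
print(hook_targets("AWS::ECR::*", types))
# → ['AWS::ECR::Repository', 'AWS::ECR::PublicRepository']
```

The point of the feature is exactly this decoupling: the Hook is written once against the pattern, and new ECR resource types introduced later are matched without updating the Hook configuration.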
Source: aws.amazon.com

Recycle Bin for Amazon Machine Images is now available in the Asia Pacific (Hyderabad) Region

Amazon EC2 customers can now use the Recycle Bin for Amazon Machine Images (AMIs) in the Asia Pacific (Hyderabad) Region to recover accidentally deleted AMIs and meet their business continuity requirements. With the Recycle Bin, you can set a retention period and, if needed, restore a deregistered AMI before the retention period expires. A restored AMI retains the attributes it had before deletion, such as tags, permissions, and encryption status, and can be used for launches immediately. AMIs that are not restored from the Recycle Bin are permanently deleted once the retention period expires.
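The retention-period behavior described above can be sketched as a simple date calculation: an AMI not restored within the retention period is permanently deleted once it elapses. The dates and day-based granularity below are illustrative assumptions.

```python
# Sketch: when would a deregistered AMI in the Recycle Bin be permanently deleted?
from datetime import datetime, timedelta

def permanent_deletion_date(deregistered_at: datetime, retention_days: int) -> datetime:
    """An AMI not restored within the retention period is deleted at this time."""
    return deregistered_at + timedelta(days=retention_days)

deadline = permanent_deletion_date(datetime(2023, 1, 10), retention_days=14)
print(deadline.date())  # → 2023-01-24
```

Restoring the AMI at any point before this deadline preserves it with its original tags, permissions, and encryption status.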
Source: aws.amazon.com