NVIDIA Tesla T4 GPUs now available in beta

In November, we announced that Google Cloud Platform (GCP) was the first and only major cloud vendor to offer NVIDIA's newest data center GPU, the Tesla T4, via a private alpha. Today, these T4 GPU instances are publicly available in beta in Brazil, India, Netherlands, Singapore, Tokyo, and the United States. For Brazil, India, Japan, and Singapore, these are the first GPUs we have offered in those GCP regions.

The T4 GPU is well suited for many machine learning, visualization, and other GPU-accelerated workloads. Each T4 comes with 16GB of GPU memory, offers the widest precision support (FP32, FP16, INT8 and INT4), includes NVIDIA Tensor Core and RTX real-time visualization technology, and delivers up to 260 TOPS [1] of compute performance. Customers can create custom VM shapes that best meet their needs, with up to four T4 GPUs, 96 vCPUs, 624GB of host memory and, optionally, up to 3TB of in-server local SSD.

Our T4 GPU prices are as low as $0.29 per hour per GPU on Preemptible VM instances. On-demand instances start at $0.95 per hour per GPU, with up to a 30% discount with sustained use discounts. Committed use discounts are also available for the greatest savings on on-demand T4 GPU usage; talk with sales to learn more.

Broadest GPU availability
We've distributed our T4 GPUs across the globe in eight regions, allowing you to provide low-latency solutions to your customers no matter where they are. The T4 joins our NVIDIA K80, P4, P100 and V100 GPU offerings, providing customers with a wide selection of hardware-accelerated compute options. T4 GPUs are now available in the following regions: us-central1, us-west1, us-east1, asia-northeast1, asia-south1, asia-southeast1, europe-west4, and southamerica-east1.

Machine learning inference
The T4 is the best GPU in our product portfolio for running inference workloads. Its high performance for FP16, INT8 and INT4 lets you run high-scale inference with flexible accuracy/performance tradeoffs that are not available on any other GPU. The T4's 16GB of memory supports large ML models, or running inference on multiple smaller models simultaneously. ML inference performance on Google Compute Engine's T4s has been measured at up to 4267 images/sec [2] with latency as low as 1.1ms [3]. Running production workloads on T4 GPUs on Compute Engine is a great solution thanks to the T4's price, performance, global availability across eight regions, and the high-speed Google network. To help you get started with ML inference on the T4 GPU, we also have a technical tutorial demonstrating how to deploy a multi-zone, auto-scaling ML inference service on top of Compute Engine VMs and T4 GPUs.

Machine learning training
The V100 GPU has become the primary GPU for ML training workloads in the cloud thanks to its high performance, Tensor Core technology and 16GB of GPU memory to support larger ML models. The T4 supports all of this at a lower price point, making it a great choice for scale-out distributed training or for cases where a V100 GPU's power is overkill. Our customers tell us they like the near-linear scaling of many training workloads on our T4 GPUs as they speed up training with large numbers of T4 GPUs.

ML cost savings options only on Compute Engine
Our T4 GPUs complement our V100 GPU offering nicely. You can scale up with large VMs of up to eight V100 GPUs, scale down with lower-cost T4 GPUs, or scale out with either T4 or V100 GPUs based on your workload characteristics.
With Google Cloud as the only major cloud provider to offer T4 GPUs, our broad product portfolio lets you save money or do more with the same resources.

* Prices listed are current Compute Engine on-demand pricing for certain regions. Prices may vary by region, and lower prices are available through sustained use discounts and Preemptible GPUs.

Strong visualization with RTX
The NVIDIA T4, with its Turing architecture, is the first data center GPU to include dedicated ray-tracing processors. Called RT Cores, they accelerate the computation of how light travels in 3D environments. Turing accelerates real-time ray tracing over the previous-generation NVIDIA Pascal architecture and can render final frames for film effects faster than CPUs, providing hardware-accelerated ray-tracing capabilities via NVIDIA's OptiX ray-tracing API. In addition, we are glad to offer virtual workstations running on T4 GPUs that give creative and technical professionals the power of the next generation of computer graphics with the flexibility to work from anywhere and on any device.

Getting started
We make it easy to get started with T4 GPUs for ML, compute and visualization. Check out our GPU product page to learn more about the T4 and our other GPU offerings. For those looking to get up and running quickly with GPUs and Compute Engine, our Deep Learning VM image comes with NVIDIA drivers and various ML libraries pre-installed.
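As a hedged illustration of the custom VM shapes described above (the instance name, zone, and CPU/memory sizes are placeholder choices, and the default boot image is used for brevity), a preemptible, four-T4 instance with local SSD could be created along these lines:

    # Illustrative only: a custom-shape VM with four T4 GPUs, local SSD, and
    # preemptible pricing. Name, zone, and sizes are placeholder choices.
    # Repeat --local-ssd for additional 375GB partitions, up to the 3TB mentioned above.
    gcloud compute instances create my-t4-vm \
        --zone=us-central1-b \
        --custom-cpu=48 \
        --custom-memory=312GB \
        --accelerator=type=nvidia-tesla-t4,count=4 \
        --maintenance-policy=TERMINATE \
        --local-ssd=interface=NVME \
        --preemptible

GPU instances cannot live-migrate, which is why the maintenance policy is set to TERMINATE.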
Not a Google Cloud customer? Sign up today and take advantage of our $300 free tier.

1. 260 TOPS INT4 performance, 130 TOPS INT8, 65 TFLOPS FP16, 8.1 TFLOPS FP32
2. INT8 precision, ResNet-50, batch size 128
3. INT8 precision, ResNet-50, batch size 1

Source: Google Cloud Platform

Running TensorFlow inference workloads at scale with TensorRT 5 and NVIDIA T4 GPUs

Today, we announced that Google Compute Engine now offers machine types with NVIDIA T4 GPUs, to accelerate a variety of cloud workloads, including high-performance computing, deep learning training and inference, broader machine learning (ML) workloads, data analytics, and graphics rendering.

In addition to its GPU hardware, NVIDIA also offers tools to help developers make the best use of their infrastructure. NVIDIA TensorRT is a cross-platform library for developing high-performance deep learning inference: the stage in the machine learning process where a trained model is used, typically in a runtime, live environment, to recognize, process, and classify results. The library includes a deep learning inference data type (quantization) optimizer, a model conversion process, and a runtime that delivers low latency and high throughput. TensorRT-based applications perform up to 40 times faster [1] than CPU-only platforms during inference. With TensorRT, you can optimize neural network models trained in most major frameworks, calibrate for lower precision with high accuracy, and finally, deploy to a variety of environments. These might include hyperscale data centers, embedded systems, or automotive product platforms.

In this blog post, we'll show you how to run deep learning inference on large-scale workloads with NVIDIA TensorRT 5 running on Compute Engine VMs configured with our Cloud Deep Learning VM image and NVIDIA T4 GPUs.

Overview
This tutorial shows you how to set up a multi-zone cluster for running an inference workload on an autoscaling group that scales to meet changing GPU utilization demands, and covers the following steps:
Preparing a model using a pre-trained graph (ResNet)
Benchmarking the inference speed for a model with different optimization modes
Converting a custom model to TensorRT format
Setting up a multi-zone cluster that is:
  Built on Deep Learning VMs preinstalled with TensorFlow, TensorFlow Serving, and TensorRT 5.
  Configured to auto-scale based on GPU utilization.
  Configured for load balancing.
  Firewall enabled.
Running an inference workload in the multi-zone cluster.
Here's a high-level architectural perspective for this setup:

Preparing and optimizing the model with TensorRT
In this section, we will create a VM instance to run the model, and then download a model from the TensorFlow official models catalog.

Create a new Deep Learning Virtual Machine instance
Create the VM instance:
If the command is successful, you should see a confirmation message.
Notes:
You can create this instance in any available zone that supports T4 GPUs.
A single GPU is enough to compare the different TensorRT optimization modes.
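As a hedged sketch (not necessarily the exact command used in the tutorial), creating such a Deep Learning VM instance with a single T4 GPU could look roughly like the following; the instance name, zone, and machine type are placeholders, and the image family shown is one of the TensorFlow GPU families published in the deeplearning-platform-release project:

    # Rough sketch: a Deep Learning VM image instance with one T4 GPU.
    # Instance name, zone, and machine type are placeholders; pick any zone
    # that has T4 capacity and matching GPU quota.
    gcloud compute instances create trt-inference-vm \
        --zone=us-central1-b \
        --machine-type=n1-standard-8 \
        --image-family=tf-latest-gpu \
        --image-project=deeplearning-platform-release \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --maintenance-policy=TERMINATE \
        --metadata="install-nvidia-driver=True"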
Download a ResNet model pre-trained graph
This tutorial uses a ResNet model that was trained on the ImageNet dataset and is available in TensorFlow. To download the ResNet model to your VM instance, run the following command:
Verify that the model was downloaded correctly:
Save the location of your ResNet model in the $WORKDIR variable:

Benchmarking the model
Leveraging fast linear algebra libraries and hand-tuned kernels, TensorRT can speed up inference workloads, but the most significant speed-up comes from the quantization process. Model quantization is the process by which you reduce the precision of weights for a model. For example, if the initial weights of a model are FP32, you have the option to reduce the precision to FP16, INT8, or even INT4, with the goal of improving runtime performance. It's important to pick the right balance between speed (precision of weights) and accuracy of a model.

Luckily, TensorFlow includes functionality that does exactly this, measuring accuracy vs. speed, as well as other metrics such as throughput, latency, node conversion rates, and total training time.
Note: This test is limited to image recognition models at the moment; however, it should not be too hard to implement a custom test based on this code.

Set up the ResNet model
To set up the model, run the following command:
This test requires a frozen graph from the ResNet model (the same one that we downloaded before), as well as arguments for the different quantization modes that we want to test. The following command prepares the test for execution:

Run the test
This command will take some time to finish.
Notes:
$WORKDIR is the directory in which you downloaded the ResNet model.
The --native arguments are the different available quantization modes you can test.

Review the results
When the test completes, you will see a comparison of the inference results for each optimization mode. To see the full results, run the following command:
(Results are reported per GPU type: V100 (old), V100, T4, and P4.)
From the above results, you can see that the FP32 and FP16 prediction numbers are identical. This means that if you are comfortable working with TensorRT, you can definitely start using FP16 right away. INT8, on the other hand, shows slightly worse accuracy and requires understanding the accuracy-versus-performance tradeoffs for your models.
In addition, you can observe that when you run the model with TensorRT 5:
Using FP32 optimization improves throughput by 40% (440 vs. 314). At the same time it decreases latency by ~30%, making it 0.28 ms instead of 0.40 ms.
Using FP16 optimization rather than the native TF graph increases the speed by 214% (from 314 to 988 fps). At the same time, latency decreases to 0.12 ms (almost a 3x decrease).
Using INT8, the last result displayed above, we observed a speedup of 385% (from 314 to 1524), with latency decreasing to 0.08 ms.
Notes:
The above results do not include latency for image pre-processing or HTTP request latency. In production systems the inference speed may not be a bottleneck at all, and you will need to account for all of these factors in order to measure your end-to-end inference speed.
Now, let's pick a model; in this case, INT8.

Converting a custom model to TensorRT

Download and extract ResNet model
To convert a custom model to a TensorRT graph you will need a saved model. To download a saved INT8 ResNet model, run the following command:

Convert the model to a TensorRT graph with TFTools
Now we can convert this model to its corresponding TensorRT graph with a simple tool:
You now have an INT8 model in your $WORKDIR/resnet_v2_int8_NCHW/00001 directory.
To ensure that everything is set up properly, try running an inference test.

Upload the model to Cloud Storage
You'll need to run this step so that the model can be served from the multi-zone cluster that we will set up in the next section. To upload the model, complete the following steps:
1. Archive the model.
2. Upload the archive.
If needed, you can obtain an INT8 precision variant of the frozen graph from Cloud Storage at this URL:

Setting up a multi-zone cluster

Create the cluster
Now that we have a model in Cloud Storage, let's create a cluster.

Create an instance template
An instance template is a useful way to create new instances; a hedged sketch of one follows.
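As a rough sketch only (the template name, machine shape, and image family are illustrative assumptions), creating such a template might look like this, with the startup script assumed to be available locally as the start_agent_and_inf_server.sh file mentioned in the notes that follow:

    # Illustrative sketch: an instance template with one T4 GPU and a startup script
    # passed through instance metadata. Template name, machine type, and image
    # family are assumptions; the startup script path is assumed to be local.
    gcloud compute instance-templates create t4-inference-template \
        --machine-type=n1-standard-8 \
        --accelerator=type=nvidia-tesla-t4,count=1 \
        --maintenance-policy=TERMINATE \
        --image-family=tf-latest-gpu \
        --image-project=deeplearning-platform-release \
        --metadata-from-file=startup-script=start_agent_and_inf_server.sh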
Notes:
This instance template includes a startup script that is specified by the metadata parameter.
The startup script runs during instance creation on every instance that uses this template, and performs the following steps:
  Installs NVIDIA drivers on each new instance. Without NVIDIA drivers, inference will not work.
  Installs a monitoring agent that monitors GPU usage on the instance.
  Downloads the model.
  Starts the inference service.
The startup script runs tf_serve.py, which contains the inference logic. For this example, I have created a very small Python file based on the TFServe package. To view the startup script, see start_agent_and_inf_server.sh.

Create a managed instance group
You'll need to set up a managed instance group to allow you to run multiple instances in specific zones. The instances are created based on the instance template generated in the previous step.
Notes:
INSTANCE_TEMPLATE_NAME is the name of the instance template that you created in the previous step.
You can create this instance in any available zone that supports T4 GPUs. Ensure that you have available GPU quota in the zone.
Creating the instance takes some time. You can watch the progress with the following command:
Once the managed instance group is created, you should see output that resembles the following:

Confirm metrics in Stackdriver
1. Open Stackdriver's Metrics Explorer (Stackdriver > Resources > Metrics Explorer).
2. Search for gpu_utilization.
3. If data is coming in, you should see the GPU utilization metric plotted in the chart.

Enable auto-scaling
Now, you'll need to enable auto-scaling for your managed instance group.
Notes:
custom.googleapis.com/gpu_utilization is the full path to our metric.
We are using a target level of 85: whenever GPU utilization reaches 85, the autoscaler creates a new instance in our group.
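A hedged sketch of what enabling such an autoscaler might look like (the zone and the replica bounds are illustrative; the group name, metric path, and target of 85 come from the steps above):

    # Illustrative sketch: autoscale the managed instance group on the custom
    # gpu_utilization metric with a target of 85. Zone and replica counts are
    # placeholder choices.
    gcloud compute instance-groups managed set-autoscaling deeplearning-instance-group \
        --zone=us-central1-b \
        --min-num-replicas=1 \
        --max-num-replicas=4 \
        --custom-metric-utilization=metric=custom.googleapis.com/gpu_utilization,utilization-target=85,utilization-target-type=GAUGE

The GAUGE target type treats the reported utilization as an instantaneous value, which matches how the monitoring agent reports GPU load.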
Test auto-scaling
To test auto-scaling, perform the following steps:
1. SSH to the instances. See Connecting to Instances for more details.
2. Use the gpu-burn tool to load your GPU to 100% utilization for 600 seconds:
Notes:
During the make process, you may see some warnings; ignore them.
You can monitor the GPU usage information with a refresh interval of 5 seconds:
3. You can observe the autoscaling in Stackdriver, one instance at a time.
4. Go to the Instance Groups page in the Google Cloud Console.
5. Click on the deeplearning-instance-group managed instance group.
6. Click on the Monitoring tab.
At this point your auto-scaling logic should be trying to spin up as many instances as possible to reduce the load, and that is exactly what happens. You can then safely stop any loaded instances (loaded by the burn-in tool) and watch the cluster scale down.

Set up a load balancer
Let's revisit what we have so far:
A trained model, optimized with TensorRT 5 (using INT8 quantization)
A managed instance group with auto-scaling enabled based on GPU utilization
Now you can create a load balancer in front of the instances.

Create health checks
Health checks are used to determine if a particular host on our backend can serve the traffic.

Create an inference forwarder
Configure the named ports of the instance group so that the load balancer can forward inference requests, sent via port 80, to the inference service that is served on port 8888.

Create a backend service
Create a backend service that has an instance group and a health check.
First, create the health check:
Then, add the instance group to the new backend service:

Set up the forwarding URL
The load balancer needs to know which URLs can be forwarded to the backend services.

Create the load balancer
Add an external IP address to the load balancer:
Find the allocated IP address:
Set up the forwarding rule that tells GCP to forward all requests from the public IP to the load balancer:
After creating the global forwarding rules, it can take several minutes for your configuration to propagate.

Enable the firewall
You need to open the firewall on your project, or else it will be impossible to connect to your VM instances from the external internet. To enable a firewall rule for your instances, run the following command:

Running inference
You can use the following Python script to convert images to a format that can be uploaded to the server.
Finally, run the inference request against the load balancer's external IP; a hedged example follows.
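As a rough sketch only: assuming the conversion script writes its output to a JSON file, the request could be sent with curl through the load balancer's public IP. The payload file name and the endpoint path are placeholders; the real path depends on how tf_serve.py exposes the service.

    # Hypothetical sketch: POST the JSON payload produced by the conversion script
    # to the inference service through the load balancer. Replace PUBLIC_IP with
    # the address allocated above; payload.json and /predict are placeholders.
    PUBLIC_IP=203.0.113.10
    curl -X POST -H "Content-Type: application/json" \
         -d @payload.json \
         "http://${PUBLIC_IP}/predict"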
That's it!

Toward TensorFlow inference bliss
Running ML inference workloads with TensorFlow has come a long way. The combination of NVIDIA T4 GPUs and the TensorRT library makes running inference workloads a relatively trivial task, and with T4 GPUs available on Google Cloud, you can spin them up and down on demand. If you have feedback on this post, please reach out to us here.

Acknowledgements: Viacheslav Kovalevskyi, Software Engineer; Gonzalo Gasca Meza, Developer Programs Engineer; Yaboo Oyabu, Machine Learning Specialist; and Karthik Ramasamy, Software Engineer, contributed to this post.

1. Inference benchmarks show ResNet training times to be 27x faster, and GNMT times to be 36x faster.

Source: Google Cloud Platform

Deloitte’s HealthInteractive Platform on IBM Cloud helps state Medicaid programs innovate

For decades, state Medicaid programs have relied on large, established Medicaid management information systems (MMIS) to support core functions. Historically, these solutions have been inflexible and expensive to maintain, with long, and sometimes failed, implementations.
To address the costs and complexity in maintaining and updating these platforms, the Centers for Medicare and Medicaid Services (CMS) established guidelines to shift programs towards a new information technology (IT) approach focused on modular, interoperable solutions.
In response to the new CMS guidelines, we teamed with IBM to create a standardized Medicaid enterprise system (MES) called the HealthInteractive Platform. It’s a more flexible solution to manage state healthcare programs.
How the new HealthInteractive Platform works
The HealthInteractive Platform is a Medicaid management environment, running on IBM Cloud, that aligns with federal guidelines.
At its core, the solution is a systems integration platform that uses IBM service bus components and provides a data management platform for data exchange and access. For operational automation, the HealthInteractive Platform uses both IBM Business Process Manager and IBM Operational Decision Manager. For security, the platform relies on IBM Cloud Identity and Access Management (ICAM). Some agencies also require a data management solution. In those instances, we incorporate IBM InfoSphere Master Data Management (MDM). We offer the hybrid cloud solution as a software-as-a-service (SaaS) single-tenant model so a state’s data is segregated.
Expanding on the fast deployment capabilities delivered through the IBM Cloud, we developed a rapid deployment framework, which enables the HealthInteractive Platform to be up and running in 24 hours. Critical workloads and data are migrated between on-premises environments and VMware Cloud Foundation on IBM Cloud.
We chose to team with IBM in developing the solution because IBM Cloud delivers end-to-end, cohesive products that align with state agency needs. We also have a deep, two-decade relationship with IBM. We know how to get the job done together.
Benefits of the new Medicaid platform
The HealthInteractive Platform helps address the challenges of heritage Medicaid systems that state agencies have faced for years. Some benefits include:

A modular IT approach to help programs reduce cost and complexity
Deployment time reduction from months to hours, providing state agencies with the agility needed to meet evolving federal regulations and client expectations
Elimination of layers of patching and upgrades that make a system difficult and expensive to maintain and adapt

Consider everything that goes into a Medicaid system: member information, medical and pharmacy claims, health care provider data, payment details, and so on. These interrelated functions are part of one large system, but also can stand alone. Modularity helps break functions apart so that each module can take advantage of new technologies, update to comply with evolving regulations and adapt to customer needs. This helps manage costs and increase flexibility.
The HealthInteractive Platform increases connectivity between modules, allowing users to quickly and easily access up-to-date information from various functions. With near-real-time processing, users can access operational reporting and short-term analytics in hours or days versus waiting for monthly or even longer-term batch reporting.
Read the case study for more details.
Source: Thoughts on Cloud