Unify Kubernetes and GCP resources for simpler and faster deployments

Adopting containers and Kubernetes means adopting new ways of doing things, not least of which is how you configure and maintain your resources. As a declarative system, Kubernetes allows you to express your intent for a given resource, and then creates and updates those resources using continuous reconciliation. Compared with imperative configuration approaches, Kubernetes-style declarative config helps ensure that your organization follows GitOps best practices like storing configuration in a version control system and defining it in YAML files.

However, applications that run on Kubernetes often use resources that live outside of Kubernetes, for example, Cloud SQL or Cloud Storage, and those resources typically don’t use the same approach to configuration. This can cause friction between teams, and force developers into frequent “context switching”. Further, configuring and operating those applications is a multi-step process: configuring the external resources, then the Kubernetes resources, and finally making the former available to the latter. To help, today we’re announcing the general availability of Config Connector, which lets you manage Google Cloud Platform (GCP) resources as Kubernetes resources, giving you a single place to configure your entire application.

Config Connector is a Kubernetes operator that makes GCP resources behave as if they were Kubernetes resources, so you don’t have to learn and use multiple conventions and tools to manage your infrastructure. For cloud-native developers, Config Connector simplifies operations and resource management by providing a uniform and consistent way to manage all of your cloud infrastructure through Kubernetes.

Automating infrastructure consistency

With its declarative approach, Kubernetes is continually reconciling the resources it manages. Resources managed by Kubernetes are continuously monitored, and “self-heal” to continuously meet the user’s desired state. However, monitoring and reconciliation of non-Kubernetes resources (a SQL instance, for example) happens as part of a separate workflow. In the most extreme cases, changes to your desired configuration, for example, changes to the number of your Cloud Spanner nodes, are not propagated to your monitoring and alerting infrastructure, causing false alarms and creating additional work for your teams.

By bringing these resources under the purview of Kubernetes with Config Connector, you get resource reconciliation across your infrastructure, automating the work of achieving eventual consistency. Instead of spinning up that SQL instance separately and monitoring it for changes as a second workflow, you ask Config Connector to create a SQL instance and a SQL database on that instance. Config Connector creates these resources, and now that they’re part of your declarative approach, the SQL instance is effectively self-healing, just like the rest of your Kubernetes deployment. Using Kubernetes’ resource model also relieves you from having to explicitly order resources in your deployment scripts. Just as for pods, deployments, or other native Kubernetes resources, you no longer have to explicitly wait for the SQL instance to be created before starting to provision a SQL database on that instance, as illustrated in the example below.

Additionally, by defining GCP resources as Kubernetes objects, you get to leverage familiar Kubernetes features with these resources, such as Kubernetes Labels and Selectors.
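The original post shows this as two YAML manifests; as a hedged sketch of the same idea, here is how they might be applied with the official Kubernetes Python client. The resource group, version, and spec fields follow the Config Connector SQLInstance and SQLDatabase samples and should be checked against the resource reference for your Config Connector version; the names, namespace, and cost-center label value are placeholders.

    # Sketch: ask Config Connector for a SQL instance and a database on that instance.
    # Config Connector reconciles the dependency, so there is no need to wait for the
    # instance to be ready before creating the database.
    from kubernetes import client, config

    config.load_kube_config()  # or config.load_incluster_config() inside a cluster
    api = client.CustomObjectsApi()

    sql_instance = {
        "apiVersion": "sql.cnrm.cloud.google.com/v1beta1",
        "kind": "SQLInstance",
        "metadata": {"name": "my-sql-instance", "labels": {"cost-center": "retail"}},
        "spec": {
            "databaseVersion": "MYSQL_5_7",
            "region": "us-central1",
            "settings": {"tier": "db-n1-standard-1"},
        },
    }
    sql_database = {
        "apiVersion": "sql.cnrm.cloud.google.com/v1beta1",
        "kind": "SQLDatabase",
        "metadata": {"name": "my-sql-database", "labels": {"cost-center": "retail"}},
        "spec": {"instanceRef": {"name": "my-sql-instance"}},  # a reference, not explicit ordering
    }

    for resource, plural in [(sql_instance, "sqlinstances"), (sql_database, "sqldatabases")]:
        api.create_namespaced_custom_object(
            group="sql.cnrm.cloud.google.com",
            version="v1beta1",
            namespace="default",
            plural=plural,
            body=resource,
        )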
For example, here we used cost-center as a label on the resources. You can now filter by this label using kubectl get. Furthermore, you can apply your organization’s governance policy using admission controllers, such as Anthos Policy Controller. For example, you can enforce that the cost-center label must exist on all resources in the cluster and may only take values from an allowed range.

Faster development with simplified operations

For Etsy, Kubernetes was instrumental in helping them move to the cloud, but the complexity of their applications meant they were managing resources in multiple places, slowing down their deployments.

“At Etsy, we run complex Kubernetes applications that combine custom code and cloud resources across many environments. Config Connector will allow Etsy to move from having two distinct, disconnected CI/CD pipelines to a single pipeline for both application code and the infrastructure it requires. Config Connector will simplify our delivery and enable end-to-end testing of cloud infrastructure changes, which we expect will result in faster deployment and lower friction usage of cloud infrastructure” – Gregg Donovan, Senior Staff Software Engineer, Etsy.

Getting started with Config Connector

Today, Config Connector can be used to manage more than 60 GCP services, including Bigtable, BigQuery, IAM policies, service accounts and service account keys, Pub/Sub, Redis, Spanner, Cloud SQL, Cloud Storage, Compute Engine, networking, and Cloud Load Balancing. Config Connector can be installed standalone on any Kubernetes cluster, and is also integrated into Anthos Config Management for managing hybrid and multi-cloud environments. Get started with Config Connector today to simplify configuration management across GKE and GCP.
Source: Google Cloud Platform

10 top tips: Unleash your BigQuery superpowers

Lots of us are already tech heroes by day. If you know SQL, for example, you’re a hero. You have the power to transform data into insights. You can save the day when someone in need comes to you to reveal the magic numbers they can then use in their business proposals. You can also amaze your colleagues with patterns you found while roaming around your data lakes.

With BigQuery, Google Cloud’s enterprise data warehouse, you quickly become a superhero: You can run queries faster than anyone else. You’re not afraid of running full table scans. You’ve made your datasets highly available, and you no longer live in fear of maintenance windows. Indexes? We don’t need indexes where we’re going, or vacuums either.

If you’re a BigQuery user, you’re already a superhero. But superheroes don’t always know all their superpowers, or how to use them. Here are the top 10 BigQuery superpowers to discover.

1. The power of data

Let’s say your favorite person has been trapped by an evil force, which will only release them if you answer this simple riddle: Who were the top superheroes on Wikipedia the first week of February 2018?

Oh no! Where will you get a log of all the Wikipedia page views? How can you tell which pages are superheroes? How long will it take to collect all of this data, and comb through it all? Well, I can answer that question (see the source data here). Once the data is loaded, it takes only about 10 seconds to get the query results: a join of the Wikipedia pageviews with the Wikidata superhero entities does the job.

There it is—all the superheroes on the English Wikipedia page, and the number of page views for whatever time period you choose. And these are the top 10, for the first week of February 2018. You’ve saved your friend! But first, the evil spirit needs more detail; a slightly more detailed version of the same query takes care of that.

You can have the power of data too: check out the Wikipedia pageviews and my latest Wikidata experiments (plus all of BigQuery’s public datasets), copy and paste these queries, modify them, and save your friends.

2. The power of teleportation

You want to see the tables with the Wikipedia pageviews and Wikidata? Let’s jump to the BigQuery web UI. Did you know that you can autocomplete your queries while typing them? Just press tab while writing your queries. Or you can run a sub-query by selecting it and pressing CMD-E. And teleportation? Jump straight to your tables with CMD and click on them. For example, that Wikipedia 2018 page views table we queried previously has more than 2TB of data, and the Wikidata one has facts for more than 46 million entities. And we just joined them to get the results we wanted. Also, while looking at the schema, you can click on the fields, and that will auto-populate your query. Ta-da!

3. The power of miniaturization

Did I just say that the page views table has more than 2TB of data? That’s a lot! Remember that in BigQuery you have 1TB of free queries every month, so going through 2TB in one query means you will be out of the free quota pretty quickly. So how much data did I just consume? Let me run that first query again, without hitting the cache.

The result? 4.6 sec elapsed, 9.8 GB processed.

How is that possible? I just joined a 2TB table with a 750GB one. Even with partitioning: one year of Wikipedia page views is 2TB, divided by 52 weeks…that’s about 38.5GB per week. So even with daily partitioning, I’m somehow querying less data. Well, it turns out I have the data in these tables clustered by the Wikipedia language and title, so I can make sure to always use those filters when going through the Wikipedia logs.

And that’s how you miniaturize your queries!
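If you want to check that kind of pruning yourself, a dry run reports how many bytes a query would scan without actually running it. Here’s a hedged sketch using the BigQuery Python client; the table name is a placeholder for your own clustered copy of the pageviews data, and the column names follow the pageviews tables described above.

    # Sketch: estimate the bytes a query would process before spending any quota.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT title, SUM(views) AS views
        FROM `my-project.wiki.pageviews_2018`      -- placeholder clustered table
        WHERE wiki = 'en'                          -- clustering column: prunes most of the table
          AND title IN ('Superman', 'Batman')      -- clustering column: prunes even further
          AND DATE(datehour) BETWEEN '2018-02-01' AND '2018-02-07'
        GROUP BY title
        ORDER BY views DESC
    """
    job_config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=job_config)
    print("This query would process {:.1f} GB".format(job.total_bytes_processed / 1e9))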
4. The power of X-ray vision

Let’s say you want to get more data out of Wikidata for each superhero; a slightly bigger query over the same tables will do. Why did this query take more time to process? Well, with our X-ray vision powers, we can see what BigQuery did in the background. Let’s look at the query history and the execution details tab. Those are all the steps BigQuery had to go through to run our query. Now, if this is a little hard to read, we have some alternatives. For example, the legacy BigQuery web UI has more compact results. You can see that the slowest operations were computing while reading the 56-million-row table twice. I’ll focus on that to improve performance: after reworking the query so those two reads become one, it runs in half the time! The slowest part has moved elsewhere; it’s now a JOIN. It even shows us that it’s looking for all the superheroes between “3-D Man” and “Zor-El”…yes, it’s going through the whole alphabet. Get an even deeper view with the BigQuery query plan visualizer.

5. The power of materialization

It’s really cool to have these tables in BigQuery. But how did I load them? I periodically bring new raw files into Cloud Storage, and then I read them raw into BigQuery. In the case of the Wikipedia pageviews, I do all the CSV parsing inside BigQuery, as there are many edge cases and I need to solve some of them case by case. Then I materialize these tables periodically into my partitioned and clustered tables. In the case of Wikidata, they have some complicated JSON, so I read each JSON row raw into BigQuery. I could parse it with SQL, but that’s not enough. And that brings us to our next superpower.

6. Navigating the multiverse

So we live in this SQL universe, a place where you can go beyond SQL alone. It’s an incredible place to manipulate and understand data, but each universe has its limitations and its rules. What if we could jump to a different universe, with different rules and powers, and manage to connect both universes, somehow? What if we could jump into the…JavaScript universe? We can, with UDFs—user-defined functions. They can easily extend BigQuery’s standard SQL. For example, I can download a random JavaScript library and use it from within BigQuery, like for performing natural language processing and lots more. Using UDFs means I can take each row of Wikidata JSON from above and parse it inside BigQuery, using whatever JavaScript logic I want to use, and then materialize this into BigQuery.

7. Time travel

Let’s take one particular table. It’s a beautiful table, with a couple thousand rows. But not everyone is happy—turns out someone wants to delete half of its rows, randomly. How would our super-enemy pull this off? Oh no. Half of the rows of our peaceful universe are gone. Randomly. How is that even fair? How will we ever recover from this?

5 days later

We learned how to move forward without these rows, but we still miss them. If only there was a way to travel back in time and bring them back. Yes we can. Instead of re-creating the table from its current, damaged contents, we can query a snapshot of the table as it was before the deletion (BigQuery keeps table history for up to seven days) and write those rows somewhere safe. Warning: CREATE OR REPLACE TABLE deletes the table history, so write the results elsewhere.

8. The power of super-speed

How fast is BigQuery? It’s this fast. The quick summary: BigQuery can run HyperLogLog++, Google’s internal implementation of the HyperLogLog algorithm, for cardinality estimation. It lets BigQuery count uniques a lot faster than other databases can, and has some other cool features that make BigQuery perform incredibly well.
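As a small, hedged illustration of that (not from the original post), you can compare an exact distinct count with the HyperLogLog++-backed approximation on one of BigQuery’s public sample tables:

    # Sketch: exact vs. approximate (HyperLogLog++-based) distinct counts.
    from google.cloud import bigquery

    client = bigquery.Client()
    sql = """
        SELECT
          COUNT(DISTINCT title)        AS exact_unique_titles,   -- exact, but heavier
          APPROX_COUNT_DISTINCT(title) AS approx_unique_titles   -- HLL++ estimate
        FROM `bigquery-public-data.samples.wikipedia`
    """
    row = list(client.query(sql).result())[0]
    print(row.exact_unique_titles, row.approx_unique_titles)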
9. Invulnerability

Our most annoying enemy? It’s a black hole of data, that thing that happens when we try to divide by zero. However, it’s possible to avoid that using BigQuery expressions like the SAFE. prefix. If you begin a function with the SAFE. prefix, it will return NULL instead of an error. Operators such as + and = do not support the SAFE. prefix; to prevent errors from a division operation, use SAFE_DIVIDE. Some operators, such as IN, ARRAY, and UNNEST, resemble functions but do not support the SAFE. prefix, and the CAST and EXTRACT functions also do not support it; to prevent errors from casting, use SAFE_CAST. Find out more in the BigQuery docs.

10. The power of self-control

All superheroes struggle when they first discover their superpowers. Having super strength is cool, but you can break a lot of things if you’re not careful. Having super-speed is fun—but only if you also learn how to brake. You can query 5PB of data in three minutes, sure—but then remember that querying 1PB is one thousand times more expensive than querying 1TB. And you only have 1TB free every month. If you have not entered a credit card, don’t worry—you will keep getting your free terabyte every month. But if you want to go further, you need to be aware of your budget and set up cost controls. Check out this doc on creating custom cost controls, and find out how BigQuery Reservations work to easily use our flat-rate pricing model. Remember, with great power comes great responsibility. Turn on your cost controls.

And there are a lot more superpowers. How about the power to predict the future? There’s a whole world of ML to explore, not to mention all the GIS capabilities you can find in BigQuery. Check out Lak Lakshmanan’s talk for more of the awesome resources we have. And that brings me to our bonus superpower:

11. The power of community

No superhero should stand alone. Join our Reddit community, where we share tips and news. Come to Stack Overflow for answers, and to help new superheroes learn the ropes. We can all learn from each other. And follow me and my friends on Twitter. If you’re ready to test your powers, try to solve our weekly BigQuery Data Challenge. It’s fun, free of charge, and you might win $500 in cloud credits!
Source: Google Cloud Platform

Emaar: Improving customer engagement across industries with APIs

Editor’s note: Today’s post comes from Binoo Joseph, CIO, and Venkadesh Sivalingam, integration architect at Emaar. Based in Dubai, Emaar is a real-estate development company operating across a number of verticals, including properties, shopping malls, hospitality, and entertainment. Learn how Emaar develops new customer experiences using APIs.

Emaar is known worldwide for our luxurious properties and communities. Our most well-known property is likely the Burj Khalifa, which is the tallest structure in the world at 829 meters. But we’re also involved in many other businesses, including hotels, leisure clubs, shopping malls, events, and entertainment.

One thing that ties our businesses together is our commitment to innovation and delivering excellent customer experiences. Traditionally that’s meant creating a variety of front-end applications powered by middleware for our customers to interact with. This might be a web portal where property owners pay service fees, a booking site for restaurant or hotel reservations, or a mobile app guiding visitors around Dubai Mall. Every business had its own workflows for creating and managing the APIs needed to run these customer experiences, but we wanted to enable greater access to information and services for tenants, property owners, customers, and partners.

We realized that in order to innovate faster, we needed to pool resources and share microservices across teams. That meant having a single way of developing, deploying, and securing APIs across all verticals. With that goal, we looked for an API management solution that could bring us security and speed in a flexible hybrid architecture. Apigee checked all the right boxes. Combined with the great support that we got from the local Apigee team, we knew Apigee was the ideal solution for us.

Creating APIs in a standard environment

Apigee is helping us transform how we work with APIs. We now have a consistent structure for designing and securing APIs, and that consistency enables us to be much more secure in the way we connect APIs with data and deploy them. Our developers focus on microservices, and Apigee gives us full visibility into work across development groups. Teams can share and reuse APIs rather than rebuilding them for every new customer experience, so we can push them out faster for our customers. In just a few months after going live, we exposed around 250 APIs through Apigee. Some APIs connect enterprise and internal systems, but many connect with external customers.

Bringing key information to customers

One of the biggest ways we currently use APIs is to connect with customers. Our flagship property app lets property owners manage their properties from anywhere. They can view service requests, pay service fees, and stay up to date with any information from Emaar. With Apigee, we can continue to add new app functionality to keep owners connected and offer more self-service options.

We’re also starting to connect our shopping mall tenants with our new APIs. Our flagship shopping center, Dubai Mall, is the world’s largest retail destination with more than 1,300 shopping, dining, and entertainment options, including a cinema, ice rink, and aquarium. We connect with mall visitors through the Dubai Mall app. But to provide customers with the most up-to-date information, we need to communicate with our tenants.
Tenants keep us updated about events, traffic, and sales, and we can use that information to push shoppers towards deals. The developer portal plays a role in helping tenants start taking advantage of our APIs quickly, as it’s easy to use and centralizes API documentation. In the short time since we started working with mall tenants, we’ve already onboarded 60 retailers. The speed of integration is also about five to six times faster than it was before we employed the developer portal, making us confident many more retailers will come onboard in the coming months.

Looking to future experiences with APIs

With Apigee, we not only have the tools that we need to quickly deploy and secure APIs, but eventually to monetize them. One potential way to do so would be to provide our APIs to aggregator apps and services, which are popular in many of the verticals that we operate in—hotel bookings, restaurant reservations, and tickets to events and attractions. It’s good to know that when we’re ready to monetize, the technology is there for us. Apigee is also opening the doors for us to explore new lines of business. We look forward to seeing how we can use APIs to give our customers engaging and beneficial experiences in the future.
Source: Google Cloud Platform

4 ways Anthos delivers ROI to customers, according to new Forrester Consulting Study

In our conversations with technology leaders about Anthos, we quickly get into strategic questions about long-term transformation and selection of the right technology architectures. At the heart of those discussions is an exploration of the potential economic value to their organization. So we commissioned Forrester Consulting to interview a few early adopters of Anthos and conduct a comprehensive study on the impact that Anthos, Google Cloud’s application modernization platform, had on their IT organizations. Today we’re excited to share Forrester’s New Technology Projection: The Total Economic Impact™ of Anthos study, which describes how Anthos can help make operators, developers, and security teams more productive and satisfied.

Forrester conducted interviews with several early Anthos customers to evaluate the benefits, costs, and risks of investing in Anthos across an organization. Based on those interviews, Forrester identified major financial benefits across four different areas: operational efficiency, developer productivity, security productivity, and increased customer advocacy and retention. In fact, Forrester projects that customers adopting Anthos can achieve a return on investment (ROI) of up to 4.8x within three years.

Save money, increase customer satisfaction

Like you, Anthos customers work with multiple clouds (including on-premises environments) and report that managing a hybrid or multi-cloud platform is incredibly complex. Anthos enables them to modernize in place, build and deploy apps fast, and scale effectively without compromising their security or increasing complexity. Let’s take a deeper look at the ways that Forrester found Anthos can help you achieve your goals and unlock your business potential.

1. Streamline operational efficiency

Anthos gives operators a single platform to manage applications across environments, saving time on management and speeding up modernization. Anthos Service Mesh simplifies delivery and lifecycle management of microservices and helps ensure the health of the overall application. The composite organization is projected to reduce the time spent on platform management by 40% to 55%, across both on-prem and cloud environments. When you are ready to migrate existing applications to the cloud, Migrate for Anthos makes that process simple and fast. The composite organization is projected to have a 58% to 75% faster app migration and modernization process when using Anthos. After you containerize your existing applications, you can take advantage of Anthos Google Kubernetes Engine (GKE), both on-prem and in the cloud, and consistently manage your Kubernetes deployments.

2. Accelerated development velocity and increased developer productivity

Anthos gives time back to developers by providing consistency across on-prem and various cloud environments. Instead of managing configurations and deployments, developers can focus on what they do best: writing, testing, and pushing code. Developers also enjoy a better experience while using environment-agnostic Anthos capabilities like our Cloud Run serverless solution, Anthos Service Mesh, Anthos Config Management, and Anthos GKE. Ultimately, developers at enterprises in this study projected reduced non-coding time by 23% to 38%. Saving developers time also directly contributes to organizational agility. One Google Cloud financial services customer using Anthos expects to move their updates from a quarterly to a weekly roll-out—a 13x improvement in time-to-market.
When developers are freed from the burden of infrastructure management, they can accelerate progress in your organization.

3. Consistent, unified security policy creation and enforcement across environments

Regardless of where you are in your application modernization journey, keeping your cloud platform reliable and secure is absolutely critical. Just minutes of downtime can mean millions of dollars in lost sales, and one security breach can be a billion-dollar mistake. One major concern surrounding moving to the cloud is the difficulty of securing applications across a range of different environments that may not sit solely in your data centers. Anthos Config Management allows you to automate and standardize security policies and best practices across all of your environments. Anthos GKE combines the power of containerization with the ease of management from a single UI and API surface. Consistent and unified policy creation and enforcement through Anthos is projected to save security operators 60% to 96% of their time on deployment-related tasks.

4. Improve customer advocacy and retention for a lift in top-line revenue

Perhaps most importantly, Forrester’s analysis found that Anthos can enhance customer-facing application performance and availability, leading to more satisfied customers and a significant financial sales lift. Containerization, microservices, and serverless all provide tools that improve cloud agility and governance. Application downtime is incredibly frustrating for your customers and can result in lost revenue. Anthos is projected to reduce application downtime by 20% to 60% in the composite organization, which also contributed to a better customer experience and increased overall sales. Additionally, productive developers and efficient operators can push new features and updates more frequently, enhancing the customer experience across all your applications.

Download the Forrester Total Economic Impact study today to hear directly from enterprise engineering leaders and dive deep into the economic impact Anthos can deliver to your organization. We would love to partner with you to explore the potential Anthos can unlock in your teams. Please reach out to our sales team to start a conversation about your digital transformation with Google Cloud.
Source: Google Cloud Platform

What’s happening in BigQuery: New federated queries, easier ML, and more metadata

BigQuery, Google Cloud’s petabyte-scale data warehouse, lets you ingest and analyze data quickly and with high availability, so you can find new insights, trends, and predictions to efficiently run your business. Our engineering team is continually making improvements to BigQuery so you can get even more out of it. Recently added BigQuery features include new federated data sources, BigQuery ML transforms, integer range partitioning, and more. Read on to learn more about these new capabilities and get quick demos and tutorial links so you can try them yourself.

Query ORC and Parquet files directly with BigQuery

Parquet and ORC are popular columnar open source formats for large-scale data analytics. As you make your move to the cloud, you can use BigQuery to analyze data stored in these formats. Choosing between keeping these files in Cloud Storage and loading your data into BigQuery can be a difficult decision. To make it easier, we launched federated query support for Apache ORC and Parquet files in Cloud Storage from BigQuery’s standard SQL interface; check out the demo video to see it in action. This new feature joins other federated querying capabilities from within BigQuery, including storage systems such as Cloud Bigtable, Google Sheets, and Cloud SQL, as well as Avro, CSV, and JSON file formats in Cloud Storage—all part of BigQuery’s commitment to building an open and accessible data warehouse. Read more details on this launch.

Your turn: Load and query millions of movie recommendations

Love movies? Here’s a way to try query federation: analyze over 20 million movie ratings and compare analytic performance between Cloud SQL, BigQuery federated queries, and BigQuery native storage. Launch the code workbook and follow along with the video. Don’t have a Google Cloud account? Sign up for free.

New data transformations with BigQuery ML

The success of machine learning (ML) models depends heavily on the quality of the dataset used in training. Preprocessing your training data during feature engineering can get complicated when you also have to do those same transformations on your production data at prediction time. We announced some new features in BigQuery ML that can help preprocess and transform data with simple SQL functions. In addition, because BigQuery automatically applies these transformations at the time of predictions, the productionization of ML models is greatly simplified.

Binning data values with ML.BUCKETIZE

One decision you will face when building your models is whether to throw away records where there isn’t enough data for a given dimension. For example, if you’re evaluating taxi cab fares in NYC, do you throw away rides to latitudes and longitudes that only appear once in the data? One common technique is to bucketize continuous values (like lat/long) into discrete bins. This helps group, rather than discard, infrequent dropoff locations in the long tail. BigQuery provides out-of-the-box support for several common machine learning operations that do not require a separate analysis pass through the data; for example, you can bucketize the inputs if you know the latitude and longitude boundaries of New York. BigQuery ML now also supports running the same preprocessing steps at serving time if you wrap all of your transformations in the new TRANSFORM clause. This saves you from having to remember and implement the same data transformations you did during training on your raw prediction data, as sketched below.
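Here’s a hedged sketch of what that can look like (not the full example from the linked post): a model whose TRANSFORM clause bucketizes pickup coordinates with ML.BUCKETIZE, so the same binning is re-applied automatically at prediction time. The dataset, table, and split points are placeholders.

    # Sketch: train a BigQuery ML model with preprocessing inside a TRANSFORM clause.
    from google.cloud import bigquery

    client = bigquery.Client()
    create_model_sql = """
        CREATE OR REPLACE MODEL `mydataset.taxi_fare_model`
        TRANSFORM(
          ML.BUCKETIZE(pickup_latitude,  [40.5, 40.6, 40.7, 40.8, 40.9]) AS pickup_lat_bin,
          ML.BUCKETIZE(pickup_longitude, [-74.1, -74.0, -73.9, -73.8])   AS pickup_lon_bin,
          passenger_count,
          fare_amount  -- the label column is passed through the TRANSFORM clause
        )
        OPTIONS(model_type = 'linear_reg', input_label_cols = ['fare_amount'])
        AS
        SELECT pickup_latitude, pickup_longitude, passenger_count, fare_amount
        FROM `mydataset.taxi_trips`
    """
    client.query(create_model_sql).result()
    # ML.PREDICT on this model later takes raw lat/long columns; BigQuery re-applies the TRANSFORM.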
Check out the blog post to see the complete example for NYC taxi cab fare prediction and read more in the BigQuery ML preprocessing documentation.

Use flat-rate pricing with BigQuery Reservations

BigQuery Reservations is now in beta in the U.S. and E.U. regions. BigQuery Reservations allows you to seamlessly purchase BigQuery slots to take advantage of flat-rate pricing and manage BigQuery spending with complete predictability. BigQuery Reservations allows you to:
- Purchase dedicated BigQuery slots by procuring commitments in a matter of seconds
- Programmatically and dynamically distribute committed BigQuery slots to reservations for workload management purposes
- Use assignments to assign Google Cloud projects, folders, or your entire organization to reservations

Quickly analyze metadata with INFORMATION_SCHEMA

As a data engineer or analyst, you are often handed a dataset or table name with little or no context of what’s inside or how it is structured. Knowing what tables and columns are available across your datasets is a critical part of exploring for insights. Previously, you could select a preview of the data or click each table name in the BigQuery UI to inspect the schema. Now, with INFORMATION_SCHEMA, you can do these same tasks at scale with SQL. How can you quickly tell how many tables are in a dataset? What about the total number and names of columns, and whether they are partitioned or clustered? BigQuery natively stores all this metadata about your datasets, tables, and columns in a queryable format that you can quickly access with INFORMATION_SCHEMA. Querying INFORMATION_SCHEMA.COLUMNS, for example, returns a list of every column and its data type for all tables in the baseball dataset. You can expand that query to aggregate some useful metrics like:
- Count of tables
- Names of tables
- Total number of columns
- Count of partitioned columns
- Count of clustered columns
Try that query with different public datasets like github_repos or new_york, or even your own Google Cloud project and dataset.

Your turn: Analyze BigQuery public dataset and table metadata quickly with INFORMATION_SCHEMA

Practice analyzing dataset metadata with this 10-minute demo video and code workbook. And check out the documentation for INFORMATION_SCHEMA for reference.

Partition your tables by an integer range

BigQuery natively supports partitioning your tables, which makes managing big datasets easier and faster to query. You can segment your table by a timestamp, a date, or a range of integer values in your data.

Creating an integer range partitioned table

Let’s assume we have a large dataset of transactions and a customer_id field that we want to partition by. After the table is set up, we will be able to filter for a specific customer without having to scan the entire table, which means faster queries and more control over costs. We’ll use the bq command-line tool to create an integer range partitioned table named mypartitionedtable in mydataset in your default project. The partitioning is based on a start of 0, an end of 100, and an interval of 10: we have customer IDs from 0 to 100 and want to segment them in groups of 10, so IDs 0-9 will be in one partition, and so on.
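If you prefer the client libraries to the bq tool, here’s a hedged sketch of roughly the same table definition with the BigQuery Python client; the project, dataset, and schema are placeholders.

    # Sketch: create an integer range partitioned table on customer_id.
    from google.cloud import bigquery

    client = bigquery.Client()
    table = bigquery.Table(
        "my-project.mydataset.mypartitionedtable",   # placeholder table ID
        schema=[
            bigquery.SchemaField("customer_id", "INTEGER"),
            bigquery.SchemaField("transaction_amount", "NUMERIC"),
        ],
    )
    # Start 0, end 100, interval 10: IDs 0-9 share a partition, 10-19 the next, and so on.
    table.range_partitioning = bigquery.RangePartitioning(
        field="customer_id",
        range_=bigquery.PartitionRange(start=0, end=100, interval=10),
    )
    client.create_table(table)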
Note that new data written to an integer range partitioned table will automatically be partitioned. This includes writing to the table via load jobs, queries, and streaming. As your dataset changes over time, BigQuery will automatically re-partition your data too.

Using table decorators to work with specific partitions

You can use table decorators to quickly access specific segments of your data. For example, if you want all the customers in the first partition (the 0-9 ID range), add the $0 suffix to the table name: mydataset.mypartitionedtable$0. This is particularly useful when you need to load in additional data—you can just specify the partitions using decorators. For additional information about integer range partitions, see Integer range partitioned tables.

In case you missed it

For more on all things BigQuery, check out these recent posts, videos, and how-tos:
- New features for BigQuery Audit Logs are now generally available
- Persistent SQL UDFs are now generally available
- Check out the new book BigQuery: The Definitive Guide, by Jordan Tigani and Valliappa Lakshmanan
To keep up on what’s new with BigQuery, subscribe to our release notes. You can try BigQuery with no charge in our sandbox. And let us know how we can help.
Source: Google Cloud Platform

Exploring an Apache Kafka to Pub/Sub migration: Major considerations

The fastest way to migrate a business application into Google Cloud is to use the lift-and-shift strategy—and part of that transition is migrating any OSS or third-party services that the application uses. But sometimes it can be more efficient and beneficial to use Google Cloud services instead. One of the services that customers often think about migrating is Apache Kafka, a popular message distribution solution that performs asynchronous message exchange between different components of an application. When following the lift-and-shift strategy, the natural solution is to migrate to a self-managed Kafka cluster or to use Confluent Cloud, a managed partner service. But in many cases, our Pub/Sub messaging and event distribution service can successfully replace Apache Kafka, with lower maintenance and operational costs and better integration with other Google Cloud services.

Kafka is designed to be a distributed commit log. In other words, it includes the functionality of both a messaging system and a storage system, providing features beyond those of a simple message broker. These features, which include log compaction, partitioned ordering, exactly-once delivery, the ability to browse committed messages, and long message retention times, often complicate the migration decision. The migration task is easier when Kafka is simply used as a message broker or event distribution system, but it is also possible to migrate from Kafka to Pub/Sub when the former is used for data streaming. In this post, we compare some key differences between Kafka and Pub/Sub to help you evaluate the effort of the migration. Then, in an upcoming post, we’ll show you how to implement some Kafka functionality with the Pub/Sub service as well as how to accomplish the migration itself.

Pub/Sub Key Advantages

Despite the fact that Apache Kafka offers more features, many applications that run in Google Cloud can benefit from using Pub/Sub as their messaging service. Some of Pub/Sub’s benefits include:
- Zero maintenance costs – Apache Kafka is highly customizable and flexible, but that can translate to expensive, often manual maintenance. In contrast, running Pub/Sub does not require any manpower.
- Lower operational costs – Running Kafka OSS in Google Cloud incurs operational costs, since you have to provision and maintain the Kafka clusters. In addition, infrastructure costs might be higher in some circumstances since they are based on allocated resources rather than used resources. In contrast, Pub/Sub pricing is based on pay-per-use and the service requires almost no administration.
- Native integration with other Google Cloud services, e.g., Cloud Functions, Cloud Storage, or Stackdriver – To use Kafka with these services, you need to install and configure additional software (connectors) for each integration.
- A push mechanism – In addition to the conventional message pulling mechanism, Pub/Sub can deliver messages posted to a topic via push delivery.
- Implicit scaling – Pub/Sub automatically scales in response to a change in load. In contrast, Kafka’s topic partitioning requires additional management, including making decisions about resource consumption vs. performance.
- Integrated logging and monitoring – Pub/Sub is natively integrated with Stackdriver, with no external configuration or tooling required. Kafka provides monitoring using the JMX plugin.
When you deploy Kafka on Google Cloud, you’ll need to do additional development to integrate Kafka logs into Stackdriver logging and monitoring, and to maintain multiple sources of logs and alerts.

Key differences affecting migration decisions

It’s not easy to know upfront how complex it will be to migrate from Kafka to Pub/Sub. Here’s a decision tree that suggests solutions to potential migration blockers.

If you use exactly-once message delivery

Kafka’s exactly-once message delivery guarantee comes with a price: a degradation in performance. You can use it in production environments if you’re not expecting high message throughput and you don’t need to scale under load. A more effective way to achieve exactly-once processing at high scale might be to make your message processing idempotent or to use Dataflow to deduplicate the messages. These approaches can be used with Kafka too.

If you consume messages that were published more than seven days ago

There are few business reasons to postpone message processing. One of the most common is processing of messages that for some reason were not processed at the time they were posted by a publisher, for example, due to a commit failure. Another use case is the dead letter queue pattern, where messages that cannot be processed by the current application are stored until it is modified to accommodate them. In Kafka, you implement a dead letter queue using Kafka Connect or Kafka Streams. Pub/Sub now has a native dead letter queue too; this functionality is in alpha, so follow the Pub/Sub release notes to see when it will be generally available. Alternatively, you can implement dead letter queue logic using a combination of Google Cloud services; this post shows you how, using Dataflow and a Google Cloud database. When you use Kafka to store messages over long time periods, the migration guideline is to store the posted messages in a database such as Cloud Bigtable or the BigQuery data warehouse.

If you use log compaction, random message access, or message deletion

Being able to overwrite or delete messages is functionality that you usually find in a storage service rather than in a message distribution service. Kafka’s log compaction ensures that Kafka will always retain at least the last known value for each message key within the log of data for a single topic partition. There is no equivalent feature in Pub/Sub, and compaction requires explicit reprocessing of messages or incremental aggregation of state. Pub/Sub does provide the ability to discard messages automatically after as little as 10 minutes. For random message access, you can consider using the seek functionality; seeking to a timestamp also lets you manually discard (or replay) messages within a retention period of between 10 minutes and 7 days.
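As a hedged sketch of that seek functionality (the project and subscription IDs are placeholders, and replaying already-acknowledged messages also requires the subscription to retain acked messages), seeking a subscription back to a point in time looks roughly like this with the Pub/Sub Python client:

    # Sketch: replay (or skip) messages by seeking a subscription to a timestamp.
    import datetime
    from google.cloud import pubsub_v1

    subscriber = pubsub_v1.SubscriberClient()
    subscription_path = subscriber.subscription_path("my-project", "my-subscription")

    # Seek two days back; messages published after this time become outstanding again.
    # Replaying acknowledged messages requires retain_acked_messages on the subscription.
    two_days_ago = datetime.datetime.now(datetime.timezone.utc) - datetime.timedelta(days=2)
    subscriber.seek(request={"subscription": subscription_path, "time": two_days_ago})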
If you use keyed message ordering

One of Kafka’s flagship features is its partition message ordering, sometimes referred to as keyed message ordering. Compared to Kafka, Pub/Sub offers only best-effort ordered message delivery. The feature is often cited as a functional blocker for migrating to another message distribution solution; however, the problem is not that clear-cut. Let’s briefly review message ordering in Kafka. Kafka promises to order messages within a single partition of a topic. This means that when a producer sends messages to a topic in some order, the broker writes the messages to the topic’s partition in that order, and all consumers read them in that order too. A broker distributes messages among partitions randomly, and because topics usually have many partitions, it is hard to maintain the ordering of messages across a topic. To solve that problem, Kafka offers keyed messages—a mechanism that allows a single producer to assign keys to published messages. All messages that come with a specific key go to the same partition, so a consumer can process the messages with the same key chronologically by reading them from that partition. Kafka’s ordering therefore provides partial message ordering within a topic. Total topic ordering can be achieved with Kafka by configuring only one partition in the topic; however, this configuration removes parallelism and usually is not used in production.

The Pub/Sub documentation reviews different use cases for message ordering and proposes solutions using additional Google Cloud services. You can also use third-party solutions if you don’t want to use these Google Cloud services. In addition, Pub/Sub has an “ordering key” feature (in limited alpha) that guarantees that messages successfully published by a single publisher for the same ordering key will be sent to subscribers in that order. Follow the Pub/Sub release notes to see when it will be generally available.

What’s next?

If you are considering a migration from Apache Kafka to Pub/Sub, we hope that this post helps you evaluate the change and compare the unique features of both tools. In our next post, we’ll review the implementation complexity of the migration and how to address it using the Pub/Sub features mentioned here.

Resources
- Kafka reference documentation
- Spotify on replacing Kafka with Pub/Sub
- Implement exactly-once delivery using Google Cloud Dataflow
- Error handling strategy using Cloud Pub/Sub and Dead Letter queue
- Pub/Sub product page
- Message ordering documentation
- Pub/Sub FAQ
Source: Google Cloud Platform

New GA Dataproc features extend data science and ML capabilities

The life of a data scientist can be challenging. If you’re in this role, your job may involve anything from understanding the day-to-day business behind the data to keeping up with the latest machine learning academic research. With all that a data scientist must do to be effective, you shouldn’t have to worry about migrating data environments or dealing with processing limitations associated with working with raw data. Google Cloud’s Dataproc lets you run cloud-native Apache Spark and Hadoop clusters easily. This is especially helpful as data growth relocates data scientists and machine learning researchers from personal servers and laptops into distributed cluster environments like Apache Spark, which offers Python and R interfaces for data of any size. You can run open source data processing on Google Cloud, making Dataproc one of the fastest ways to extend your existing data analysis to cloud-sized datasets.

We’re announcing the general availability of several new Dataproc features that will let you apply the open source tools, algorithms, and programming languages that you use today to large datasets, without having to manage clusters and computers. These new GA features make it possible for data scientists and analysts to build production systems based on personalized development environments. You can keep data as the focal point of your work, instead of getting bogged down with peripheral IT infrastructure challenges. Here’s more detail on each of these features.

Streamlined environments with autoscaling and notebooks

With Dataproc autoscaling and notebooks, data scientists can work in familiar notebook environments that remove the need to change underlying resources or contend with other analysts for cluster processing. You can do this by combining Dataproc’s component gateway for notebooks with autoscaling, now GA. With Dataproc autoscaling, a data scientist can work on their own isolated and personalized small cluster while running descriptive statistics, building features, developing custom packages, and testing various models. Once you’re ready to run your analysis on the full dataset, you can do the full analysis within the same cluster and notebook environment, as long as autoscaling is enabled. The cluster will simply grow to the size needed to process the full dataset and then scale itself back down when the processing is completed. You don’t need to waste time trying to move over to a larger server environment or figure out how to migrate your work.

Remember that when working with large datasets in a Jupyter notebook for Spark, you may often want to stop the Spark context that is created by default and instead use a configuration with larger memory limits, as shown below.

    from pyspark import SparkConf, SparkContext

    # In Jupyter you have to stop the current context first
    sc.stop()
    # In this example, the driver program is given access to all memory on the master
    conf = (SparkConf().set("spark.driver.maxResultSize", 0))
    # Restart the Spark context with your new configuration
    sc = SparkContext(conf=conf)

Autoscaling and notebooks make a great development environment for right-sizing cluster resources and working in a collaborative environment. Once you are ready to move from development to an automated process for production jobs, the Dataproc Jobs API makes this an easy transition.
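For instance, here’s a hedged sketch of that hand-off using the Dataproc Python client (the project, region, cluster, and Cloud Storage paths are placeholders): the logic you developed in the notebook is submitted as a PySpark job, and R code would use a spark_r_job block instead.

    # Sketch: submit production work to an existing cluster through the Dataproc Jobs API.
    from google.cloud import dataproc_v1

    job_client = dataproc_v1.JobControllerClient(
        client_options={"api_endpoint": "us-central1-dataproc.googleapis.com:443"}
    )
    job = {
        "placement": {"cluster_name": "my-autoscaling-cluster"},
        "pyspark_job": {"main_python_file_uri": "gs://my-bucket/jobs/score_model.py"},
        # For R, use e.g. "spark_r_job": {"main_r_file_uri": "gs://my-bucket/jobs/score.R"}
    }
    operation = job_client.submit_job_as_operation(
        request={"project_id": "my-project", "region": "us-central1", "job": job}
    )
    result = operation.result()  # waits for the job to finish
    print(result.driver_output_resource_uri)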
Logging and monitoring for SparkR job types

The Dataproc Jobs API makes it possible to submit a job to an existing Dataproc cluster with a jobs.submit call over HTTP, using the gcloud command-line tool, or in the Google Cloud Platform Console itself. With the GA release of the SparkR job type, you can have SparkR jobs logged and monitored, which makes it easy to build automated tooling around R code. The Jobs API also allows for separation between the permissions of who has access to submit jobs on a cluster and who has permission to reach the cluster itself, making it possible for data scientists and analysts to schedule production jobs without setting up gateway nodes or networking configurations.

You can combine the Dataproc Jobs API with Cloud Scheduler’s HTTP target for automating tasks such as re-running jobs for operational pipelines or re-training ML models at predetermined intervals. Just be sure that the scope and service account used in the Cloud Scheduler job have the correct access permissions to all the Dataproc resources required. For more sophisticated tasks that involve running jobs with multiple steps or creating a cluster alongside the jobs, workflow templates offer another HTTP target that can be used from schedulers. For both workflow templates and Dataproc jobs, Cloud Scheduler is a good choice as a time-based scheduling tool. Cloud Functions is a good option if you prefer to run Dataproc jobs in response to new files in Cloud Storage or events in Pub/Sub. Cloud Composer is yet another scheduling option if your job needs to orchestrate data pipelines outside of Dataproc. For more on moving R to the cloud and how customers are using the Dataproc Jobs API for SparkR, check out this blog post on SparkR and this Google Cloud Next ’19 talk on Data Science at Scale with R on Google Cloud.

Accelerator support for GPUs

Often Spark and other Hadoop frameworks are preprocessing steps for creating datasets that are appropriate for deep learning models that use GPUs. With this in mind, Dataproc now offers support for attaching GPUs to clusters. This is another feature that unifies the processing environment for data scientists and saves analysts the time and hassle of re-configuring underlying cluster resources. In a single workflow template, you can automate a series of jobs that mixes and matches Spark ML and GPU-based deep learning algorithms. For datasets that need to scale beyond the memory of a single GPU, RAPIDS on Dataproc is a framework that uses both GPUs and Dataproc’s ability to launch and control a cluster of VMs with API calls. To get started with RAPIDS on Dataproc, check out the RAPIDS initialization action and the associated example notebooks based on NYC taxi data.

Scheduled cluster deletion

Oftentimes, you may have spent your day building a model or tweaking that SQL query to pull back the cohort of information you need from a massive dataset. You click “run” to kick off a long-running job and head home for the weekend. When Monday comes, the results are ready for review. While using the autoscaler to utilize more compute resources is one way to help get answers faster, there will inevitably be long-running queries and jobs that go unattended. To make sure you do not overpay for these unattended jobs, use cluster scheduled deletion to automatically delete a cluster after a specified idle period when submitting a job using the Dataproc Jobs API.
That way, you can leave for the weekend without having to keep checking on when you can delete the cluster, and you stop paying for Dataproc clusters that are just sitting idle. Cluster scheduled deletion also offers two additional options for time-based deletion, so you stop paying for Dataproc clusters if you forget to delete a cluster when you leave for the day or inadvertently leave it running. Learn more about Dataproc.
Source: Google Cloud Platform

Cheaper Cloud AI deployments with NVIDIA T4 GPU price cut

Google Cloud offers a wide range of GPUs to accelerate everything from AI deployment to 3D visualization. These use cases are now even more affordable with the price reduction of the NVIDIA T4 GPU. As of early January, we’ve reduced T4 prices by more than 60%, making it the lowest-cost GPU instance on Google Cloud. Prices vary by region (the figures cited here are for us-central1); a full GPU pricing table is here.

Locations and configurations

Google Cloud was the first major cloud provider to launch the T4 GPU and offer it globally (in eight regions). This worldwide footprint, combined with the performance of the T4 Tensor Cores, opens up more possibilities for our customers. Since our global rollout, T4 performance has improved: the T4 and V100 GPUs now boast networking speeds of up to 100 Gbps, in beta, with additional regions coming online in the future. These GPU instances are also flexible enough to suit different workloads. The T4 GPUs can be attached to our N1 machine types that support custom VM shapes. This means you can create a VM tailored specifically to meet your needs, whether it’s a low-cost option like one vCPU, one GB of memory, and one T4 GPU, or as high-performance as 96 vCPUs, 624 GB of memory, and four T4 GPUs—and most anything in between. This is helpful for machine learning (ML), since you may want to adjust your vCPU count based on your pre-processing needs. For visualization, you can create VM shapes for lower-end solutions all the way up to powerful, cloud-based professional workstations.

Machine Learning

With mixed precision support and 16 GB of memory, the T4 is also a great option for ML workloads. For example, Compute Engine preemptible VMs work well for batch ML inference workloads, offering lower-cost compute in exchange for variable capacity availability. We previously shared sample T4 GPU performance numbers for ML inference of up to 4,267 images per second (ResNet-50, batch size 128, precision INT8). That means you can perform roughly 15 million image predictions in an hour for a $0.11 add-on cost for a single T4 GPU attached to your N1 VM. Google Cloud offers several options to access these GPUs. One of the simplest ways to get started is through Deep Learning VM Images for AI Platform and Compute Engine, and Deep Learning Containers for Google Kubernetes Engine (GKE). These are configured for software compatibility and performance, and come pre-packaged with your favorite ML frameworks, including PyTorch and TensorFlow Enterprise.

We’re committed to making GPU acceleration more accessible, whatever your budget and performance requirements may be. With the reduced cost of NVIDIA T4 instances, we now have a broad selection of accelerators for a multitude of workloads, performance levels, and price points. Check out the full pricing table and regional availability, and try the NVIDIA T4 GPU for your workload today.
Source: Google Cloud Platform

Want to use AutoML Tables from a Jupyter Notebook? Here’s how

While there’s no doubt that machine learning (ML) can be a great tool for businesses of all shapes and sizes, actually building ML models can seem daunting at first. Cloud AutoML—Google Cloud’s suite of AutoML products—provides tools and functionality to help you build ML models that are tailored to your specific needs, without needing deep ML expertise. AutoML solutions provide a user interface that walks you through each step of model building, including importing data, training your model on the data, evaluating model performance, and predicting values with the model. But what if you want to use AutoML products outside of the user interface? If you’re working with structured data, one way to do it is with the AutoML Tables SDK, which lets you trigger—or even automate—each step of the process through code. There is a wide variety of ways that the SDK can help embed AutoML capabilities into applications. In this post, we’ll use an example to show how you can use the SDK end-to-end within your Jupyter Notebook.

Jupyter Notebooks are one of the most popular development tools for data scientists. They enable you to create interactive, shareable notebooks with code snippets and markdown for explanations. Without leaving Google Cloud’s hosted notebook environment, AI Platform Notebooks, you can leverage the power of AutoML technology. There are several benefits of using AutoML technology from a notebook. Each step and setting can be codified so that it runs the same every time for everyone. Also, it’s common, even with AutoML, to need to manipulate the source data before training the model with it; by using a notebook, you can use common tools like pandas and numpy to preprocess the data in the same workflow. Finally, you have the option of creating a model with another framework and ensembling it together with the AutoML model, for potentially better results. Let’s get started!

Understanding the data

The business problem we’ll investigate in this blog is how to identify fraudulent credit card transactions. The technical challenge we’ll face is how to deal with imbalanced datasets: only 0.17% of the transactions in the dataset we’re using are marked as fraud. More details on this problem are available in the research paper Calibrating Probability with Undersampling for Unbalanced Classification.

To get started, you’ll need a Google Cloud Platform project with billing enabled. To create a project, follow the instructions here. For a smooth experience, check that the necessary storage and ML APIs are enabled. Then, follow this link to access BigQuery public datasets in the Google Cloud console. In the Resources tree in the bottom-left corner, navigate through the list of datasets until you find ml-datasets, and then select the ulb-fraud-detection table within it. Click the Preview tab to preview sample records from the dataset. Each record has the following columns:
- Time is the number of seconds between the first transaction in the dataset and the time of the selected transaction.
- V1-V28 are columns that have been transformed via a dimensionality reduction technique called PCA that has anonymized the data.
- Amount is the transaction amount.

Set up your Notebook Environment

Now that we’ve looked at the data, let’s set up our development environment. The notebook we’ll use can be found in AI Hub. Select the “Open in GCP” button, then choose to either deploy the notebook in a new or existing notebook server.

Configure the AutoML Tables SDK

Next, let’s highlight key sections of the notebook. Some details, such as setting the project ID, are omitted for brevity, but we highly recommend running the notebook end-to-end when you have an opportunity. We’ve recently released a new and improved AutoML Tables client library. You will first need to install the library and initialize the Tables client.
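As a hedged sketch of that setup (the project ID is a placeholder, and the import path follows the current google-cloud-automl client library):

    # Sketch: install and initialize the AutoML Tables client.
    # In a notebook cell, first run: !pip install google-cloud-automl
    from google.cloud import automl_v1beta1 as automl

    # Project ID is a placeholder; AutoML Tables runs in us-central1.
    client = automl.TablesClient(project="my-project", region="us-central1")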
By the way, we recently announced that AutoML Tables can now be used in Kaggle kernels. You can learn more in this tutorial notebook, but the setup is similar to what you see here.

Import the Data

The first step is to create a dataset, which is essentially a container for the data. Next, import the data from the BigQuery fraud detection dataset. You can also import from a CSV file in Google Cloud Storage or directly from a pandas dataframe.

Train the Model

First, we have to specify which column we would like to predict (our target column) with set_target_column(). The target column for our example will be “Class”: either 1 or 0, depending on whether the transaction is fraudulent or not. Then, we’ll specify which columns to exclude from the model. We’ll only exclude the target column, but you could also exclude IDs or other information you don’t want to include in the model. There are a few other things you might want to do that aren’t necessarily needed in this example:
- Set weights on individual columns
- Create your own custom test/train/validation split and specify the column to use for the split
- Specify which timestamp column to use for time-series problems
- Override the data types and nullable status that AutoML Tables inferred during data import
The one slightly unusual thing that we did in this example is override the default optimization objective. Since this is a very imbalanced dataset, it’s recommended that you optimize for AU-PRC (the area under the precision/recall curve) rather than the default AU-ROC.

Evaluate the Model

After training has completed, you can review various performance statistics on the model, such as the accuracy, precision, recall, and so on. The metrics are returned in a nested data structure, and here we pull the AU-PRC and AU-ROC out of that data structure.

Deploy and Predict with the Model

To enable online predictions, the model must first be deployed. (You can perform batch predictions without deploying the model.) We’ll create a hypothetical transaction record with similar characteristics and predict on it. After invoking the predict() API with this record, we receive a data structure with each class and its score; the class with the maximum score is the model’s prediction.

Conclusion

Now that we’ve seen how you can use AutoML Tables straight from your notebook to produce an accurate model of a complex problem, all with a minimal amount of code, what’s next? To find out more, the AutoML Tables documentation is a great place to start. When you’re ready to use AutoML in a notebook, the SDK guide has detailed descriptions of each operation and parameter. You might also find our samples on GitHub helpful. After you feel comfortable with AutoML Tables, you might want to look at other AutoML products: you can apply what you’ve learned to solve problems in the Natural Language, Translation, Video Intelligence, and Vision domains. Find me on Twitter at @kweinmeister, and good luck with your next AutoML experiment!
Source: Google Cloud Platform

New Anthos training: a masterclass in hybrid cloud architecture and management

You’re moving faster than ever to build new applications, innovate, and bring value to your customers. Anthos, Google Cloud’s open application modernization platform, can help you modernize your existing applications, making them more portable, maintainable, scalable and secure. And now, our newest learning specialization, Architecting Hybrid Cloud Infrastructure with Anthos, is live, showing how you can use its technologies to transform your IT environments.

Designed for infrastructure operators, architects, and DevOps professionals, Architecting Hybrid Cloud Infrastructure with Anthos teaches you how to modernize, observe, secure, and manage your applications using Istio-powered service mesh and Kubernetes, whether you’re on-premises, on Google Cloud, or distributed across both. With a mix of lectures and hands-on labs, you’ll learn about compute, networking, service mesh, config management, and their underlying control-planes, so you can begin to understand the full scope of the platform’s capabilities. The training also unpacks the complexities of modern environments, and equips you with the foundational knowledge needed to address challenges such as migrating and scaling among environments hosted in multiple regions and by multiple providers.

This specialization builds on the Architecting with Google Kubernetes Engine (GKE) learning specialization, and assumes that students have extensive hands-on experience with Kubernetes. Architecting Hybrid Cloud Infrastructure with Anthos is delivered as three courses, which are available on demand and in a classroom setting:
- Hybrid Cloud Infrastructure Foundations with Anthos – This course lays the groundwork for assembling hybrid infrastructure by presenting the Anthos platform architecture, including Anthos GKE and Anthos Service Mesh.
- Hybrid Cloud Service Mesh with Anthos – Gain the practical skills you need to deploy a service mesh to overcome challenges in multi-service application management, operation, and delivery.
- Hybrid Cloud Multi-Cluster with Anthos – The final course will help you to understand configuration and get hands-on practice managing a multi-cluster Anthos GKE deployment, including on-premises and in-cloud clusters.

Interested in hearing more? Register today for our webinar, Architecting Hybrid Cloud Infrastructure with Anthos, on Jan 31 at 9:00 am PST to get hands-on Anthos experience and receive a special discount on additional Anthos training.
Source: Google Cloud Platform