Grow and scale your startup with Google Cloud

At Google Cloud, we understand how important it is for startups to get holistic technical support so that you can build and scale your businesses to the next level. The Google Cloud Technical Guides for Startups series helps you do this and more with its Start, Build, and Grow multi-series. In the Start and Build series, we explored how to get started on Google Cloud, as well as how to build and optimize existing applications. It's time to take the next step and learn how to scale them.

Boost your startup game with our Grow Series

We are excited to announce the launch of our third installment, the Grow Series! This series focuses on growing and scaling your deployments, and is the final piece of the technical enablement multi-series. Not only will we traverse some exciting and innovative Google Cloud solutions, but we will also throw the spotlight on some industry-specific use cases.

Scale your deployments
Learn to scale with solutions such as Looker for powerful insights, Cloud Spanner, a highly scalable relational database, AlloyDB, our high-performing PostgreSQL-compatible database, and Anthos for your hybrid connectivity needs.

Explore industry-specific architectures
Deep dive into various industry examples as we explore startup architectures from healthcare to retail verticals and more.

Optimize for sustainability
Learn about implementing digital and operational sustainability as we discuss how to build your startup on Google Cloud, a platform with net zero carbon emissions.

Get started with the first episode
Check out the first episode of the Grow Series, "Introducing Google Cloud Technical Guides for Startups – Grow Series," for an overview of the topics covered and to find out what else is in store.

Hop on to the Google Cloud channel
We are excited to have you with us on the final chapter of this journey. Check out our website and join us on the Google Cloud Tech channel to find the Start, Build, and Grow series. If you want to learn more about how Google Cloud can help your startup, visit our page here to get more information about our program, and sign up for our communications to get a look at our community activities, digital events, special offers, and more.
Source: Google Cloud Platform

Reducing the storage impact of Point-in-Time Recovery

Point-in-time recovery (PITR) is a critical capability for enterprise applications. It allows database administrators to recover from accidental data deletion by restoring their production databases to a time before the incident. Cloud SQL for PostgreSQL launched support for PITR in July 2020, allowing you to recover from disasters like data corruption or accidental deletion by restoring your Cloud SQL instance to a previous time.

We're excited to announce an additional enhancement to PITR for Cloud SQL for PostgreSQL that makes enabling PITR an even easier decision: for instances with PITR newly enabled, the write-ahead logs stored for PITR operations (the transaction logs used to go back in time) will no longer consume disk storage space. Instead, when you enable PITR for new instances, Cloud SQL will store transaction logs collected during the retention window in Google Cloud Storage, and retrieve them when you perform a restore. Because transaction logs can grow rapidly when your database experiences a burst of activity, this change helps reduce the impact these bursts have on your provisioned disk storage. These logs will be stored for up to seven days in the same Google Cloud region as your instance at no additional cost to you.

PITR is enabled by default when you create a new Cloud SQL for PostgreSQL instance from the Google Cloud console, and transaction logs will no longer be stored on the instance for instances that have PITR newly enabled. If you have already enabled PITR on your PostgreSQL instances, this enhancement will be rolled out to your instances at a later point. If you want to take advantage of this change sooner, you can first disable and then re-enable PITR on your instance, which will reset your ability to perform a point-in-time restore to the time at which PITR was re-enabled (a sketch of toggling PITR programmatically appears at the end of this post).

On instances with this feature enabled, you'll notice that consumed storage on your instance will drop relative to the volume of write-ahead logs (WAL) generated by your instance. The actual amount of storage your logs consume will vary by instance and by database activity – during busy times for your database, log size may shrink or grow. However, these logs will now only be stored on your instance long enough to successfully replicate to any replicas of the instance and to ensure that they are safely written to Cloud Storage; afterwards, they will be removed from your instance.

We're excited to continue to enhance Cloud SQL for PostgreSQL to ensure that disaster recovery is easy to enable, cost effective, and seamless to use. Learn more about this change in our documentation.
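For reference, here is a rough sketch of how you might toggle PITR on an existing instance programmatically. It uses the Cloud SQL Admin API through the google-api-python-client library; the project and instance names are placeholders, and the backupConfiguration field names come from the public Admin API rather than from this announcement, so treat it as an illustration rather than a recipe.

```python
# Sketch only: toggle PITR on a Cloud SQL for PostgreSQL instance via the
# Cloud SQL Admin API. Uses Application Default Credentials.
from googleapiclient import discovery

sqladmin = discovery.build("sqladmin", "v1")

def set_pitr(project, instance, enabled):
    body = {
        "settings": {
            "backupConfiguration": {
                # Automated backups must remain enabled for PITR to work.
                "enabled": True,
                "pointInTimeRecoveryEnabled": enabled,
            }
        }
    }
    # patch() merges only the fields provided into the instance settings
    # and returns a long-running operation.
    return sqladmin.instances().patch(
        project=project, instance=instance, body=body
    ).execute()

op = set_pitr("my-project", "my-postgres-instance", False)  # disable PITR
# In practice, poll the returned operation until it completes before
# re-enabling; the disable/enable cycle also resets your recovery window.
op = set_pitr("my-project", "my-postgres-instance", True)   # re-enable PITR
```

The same toggle is available from the console and with gcloud; whichever path you use, remember that the point-in-time recovery window starts over once PITR is re-enabled.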
Source: Google Cloud Platform

Extending reality: Immersive Stream for XR is now Generally Available

Last year at Google I/O, we announced the preview of Immersive Stream for XR, which leverages Google Cloud GPUs to host, render, and stream high-quality photorealistic experiences to millions of mobile devices around the world. Today, we are excited to announce that the service is now generally available for Google Cloud customers. With Immersive Stream for XR, users don't need powerful hardware or a special application to be immersed in a 3D or AR world; instead, they can click a link or scan a QR code and immediately be transported to extended reality. Immersive Stream for XR is being used to power the "immersive view" feature in Google Maps, while automotive and retail brands are enhancing at-home shopping experiences for consumers, from virtually configuring a new vehicle to visualizing new appliances in the home.

What's new with GA

With this latest product milestone, Immersive Stream for XR now supports content developed in Unreal Engine 5.0. We have also added the ability to render content in landscape mode to support tablet and desktop devices. With landscape mode and the ability to render to larger screens, there is more real estate for creating sophisticated UIs and interactions, enabling more full-featured immersive applications. Finally, you can now embed Immersive Stream for XR content on your own website using an HTML iframe, allowing users to access your immersive applications without leaving your domain.

How customers are using Immersive Stream for XR

A common type of experience our customers want to create is a "space" where users can walk around and interact with objects. For example, home improvement retailers can let their shoppers place appliance options or furniture in renderings of their actual living spaces; travel and hospitality companies can provide virtual tours of a hotel room or event space; and museums can offer virtual experiences where users can walk around and interact with virtual exhibits. To help customers create these experiences faster, we collaborated with Google Partner Innovation (PI) to create a spaces template, the first of a series of templates developed with close customer involvement within the PI Early Access Program. The spaces template standardizes the common interactions across these scenarios, such as user movement and object interaction.

See also: "Industries embrace XR" – https://cloud.google.com/blog/transform/augment-reality-virtual-reality-smartphone-secrets-immersive-stream

Aosom, a home and garden ecommerce retailer, recently used this template to launch an experience that allows users to place furniture in either a virtual living room or in their own space using AR. Users can customize the item's color and options, then add products to their shopping cart once satisfied. "Home & Garden shoppers are always looking for offerings that are unique and compatible with their own living space," said Chunhua Wang, Chief Executive Officer, Aosom. "Google Cloud's Immersive Stream for XR has enabled Aosom to deliver a visually vivid and immersive shopping experience to our customers."

Immersive Stream for XR especially benefits automakers, who can now enable prospective buyers to browse and customize new vehicles in photorealistic detail and visualize them in their own driveway.
Most recently, Kia Germany leveraged the technology to promote the Kia Sportage, one of their top-selling vehicles. The virtual experience was accessible via a QR code on the Kia website. "At Kia Germany we are excited to use Google Immersive Stream for XR to reach new consumers and provide them the perfect experience to discover our Sportage," said Jean-Philippe Pottier, Manager of Digital Platforms at Kia Germany. "Our users love that they can change colors, engines, and interact with the model in 3D and augmented reality."

Last, with the addition of Unreal Engine 5.0 and support for bigger and more realistic worlds, users can explore faraway historical landmarks without leaving their home. For example, Virtual Worlds uses photogrammetry techniques to capture historical sites, polish them with a team of designers, and then create interactive experiences on top. Because of the visual detail involved, these experiences have historically required expensive workstations with GPUs to perform the rendering, limiting their availability to physical exhibits. Using Unreal Engine 5.0's new Nanite and Lumen capabilities, the team created an educational tour of the Great Sphinx of Giza and made it accessible to anyone using Immersive Stream for XR, available here. Elliot Mizroch, CEO of Virtual Worlds, explains, "We've captured incredible sites from Machu Picchu to the Pyramids of Giza and we want everyone to be able to explore these monuments and learn about our heritage. Immersive Stream for XR finally gives us this opportunity."

Next steps

We're excited to see all of the innovative use cases you build using Google Cloud's Immersive Stream for XR. Learn more by reading our documentation, or get started by downloading the Immersive Stream for XR template project. To get started with Unreal Engine 5.0 and landscape mode, download our updated Immersive Stream for XR template project, load it into Unreal Engine 5.0.3, and start creating your content. If you'd like to embed your experience on your own website, you can contact us to allowlist your domain.
Source: Google Cloud Platform

Transform SQL into SQLX for Dataform

Introduction

Developing in SQL poses significant problems when compared to other languages and frameworks. It's not easy to reuse statements across different scripts, there's no way to write tests to ensure data consistency, and dependency management requires external software solutions. Developers will typically write thousands of lines of SQL to ensure data processing occurs in the correct order. Additionally, documentation and metadata are afterthoughts because they need to be managed in an external catalog.

Google Cloud offers Dataform and SQLX to solve these challenges. Dataform is a service for data analysts to test, develop, and deploy complex SQL workflows for data transformation in BigQuery. Dataform lets you manage data transformation in the Extraction, Loading, and Transformation (ELT) process for data integration. After extracting raw data from source systems and loading it into BigQuery, Dataform helps you transform it into a well-defined, tested, and documented suite of data tables.

SQLX is an open source extension of SQL and the primary tool used in Dataform. Because it is an extension, every SQL file is also a valid SQLX file. SQLX brings additional features to SQL to make development faster, more reliable, and scalable, including dependency management, automated data quality testing, and data documentation.

Teams should quickly transform their SQL into SQLX to gain the full suite of benefits that Dataform provides. This blog contains a high-level, introductory guide demonstrating this process. The steps in this guide use Dataform in the Google Cloud console. You can follow along or implement these steps with your own SQL scripts!

Getting Started

Here is an example SQL script we will transform into SQLX. This script takes a source table containing Reddit data, then cleans, deduplicates, and inserts the data into a new table with a partition.

```sql
CREATE OR REPLACE TABLE reddit_stream.comments_partitioned
PARTITION BY
  comment_date
AS

WITH t1 as (
SELECT
  comment_id,
  subreddit,
  author,
  comment_text,
  CAST(total_words AS INT64) total_words,
  CAST(reading_ease_score AS FLOAT64) reading_ease_score,
  reading_ease,
  reading_grade_level,
  CAST(sentiment_score AS FLOAT64) sentiment_score,
  CAST(censored AS INT64) censored,
  CAST(positive AS INT64) positive,
  CAST(neutral AS INT64) neutral,
  CAST(negative AS INT64) negative,
  CAST(subjectivity_score AS FLOAT64) subjectivity_score,
  CAST(subjective AS INT64) subjective,
  url,
  DATE(comment_date) comment_date,
  CAST(comment_hour AS INT64) comment_hour,
  CAST(comment_year AS INT64) comment_year,
  CAST(comment_day AS INT64) comment_day
FROM reddit_stream.comments_stream
)
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
  FROM t1 row
  GROUP BY comment_id
)
```

1. Create a new SQLX file and add your SQL

In this guide we'll title our file comments_partitioned.sqlx. As you can see below, our dependency graph does not provide much information yet.

2. Refactor the SQL to remove DDL and use only SELECT

In SQLX, you only write SELECT statements. You specify what you want the output of the script to be, such as a view or a table (among other available types), in the config block. Dataform takes care of adding the CREATE OR REPLACE or INSERT boilerplate statements.
3. Add a config object containing metadata

The config object contains the output type, description, schema (dataset), tags, columns and their descriptions, and the BigQuery-related configuration. Check out the example below.

```
config {
  type: "table",
  description: "cleaned comments data and partitioned by date for faster performance",
  schema: "demo_optimized_staging",
  tags: ["reddit"],
  columns: {
    comment_id: "unique id for each comment",
    subreddit: "which reddit community the comment occurred",
    author: "which reddit user commented",
    comment_text: "the body of text for the comment",
    total_words: "total number of words in the comment",
    reading_ease_score: "a float value for comment readability score",
    reading_ease: "a plain-text english categorization of readability",
    reading_grade_level: "a plain-text english categorization of readability by school grade level",
    sentiment_score: "float value for sentiment of comment between -1 and 1",
    censored: "whether the comment needed censoring by some process upstream",
    positive: "one-hot encoding 1 or 0 for positive",
    neutral: "one-hot encoding 1 or 0 for neutral",
    negative: "one-hot encoding 1 or 0 for negative",
    subjectivity_score: "float value for comment subjectivity score",
    subjective: "one-hot encoding 1 or 0 for subjective",
    url: "link to the comment on reddit",
    comment_date: "date timestamp for when the comment occurred",
    comment_hour: "integer for hour of comment post time",
    comment_year: "integer for year of comment post time",
    comment_month: "integer for month of comment post time",
    comment_day: "integer for day of comment post time"
  },
  bigquery: {
    partitionBy: "comment_date",
    labels: {
      cost_center: "123456"
    }
  }
}
```

4. Create declarations for any source tables

In our SQL script, we directly reference reddit_stream.comments_stream. In SQLX, we'll want to use a declaration to create relationships between source data and tables created by Dataform. Add a new comments_stream.sqlx file to your project for this declaration:

```
config {
  type: "declaration",
  database: "my-project",
  schema: "reddit_stream",
  name: "comments_stream",
  description: "A BigQuery table acting as a data sink for comments streaming in real-time."
}
```

We'll use this declaration in the next step.

5. Add references to declarations, tables, and views

This will help build the dependency graph. In our SQL script, there is a single reference to the declaration. Simply replace reddit_stream.comments_stream with ${ref("comments_stream")}. Managing dependencies with the ref function has numerous advantages:

- The dependency tree complexity is abstracted away. Developers simply need to use the ref function and list dependencies.
- It enables us to write smaller, more reusable, and more modular queries instead of thousand-line-long queries. That makes pipelines easier to debug.
- You get alerted in real time about issues like missing or circular dependencies.

6. Add assertions for data validation

You can define data quality tests, called assertions, directly from the config block of your SQLX file. Use assertions to check for uniqueness, null values, or any custom row condition.
The dependency tree also shows assertions for visibility. Here are the assertions for our example:

```
assertions: {
  uniqueKey: ["comment_id"],
  nonNull: ["comment_text"],
  rowConditions: [
    "total_words > 0"
  ]
}
```

These assertions will pass if comment_id is a unique key, if comment_text is non-null, and if all rows have total_words greater than zero.

7. Utilize JavaScript for repeatable SQL and parameterization

Our example has a deduplication SQL block. This is a perfect opportunity to create a JavaScript function so we can reference this functionality in other SQLX files. For this scenario, we'll create the includes folder and add a common.js file with the following contents:

```javascript
function dedupe(table, group_by_cols) {
  return `
SELECT k.*
FROM (
  SELECT ARRAY_AGG(row LIMIT 1)[OFFSET(0)] k
  FROM ${table} row
  GROUP BY ${group_by_cols}
)
  `
}

module.exports = { dedupe };
```

Now we can replace that code block in our SQLX file with the function call ${common.dedupe("t1", "comment_id")}.

In certain scenarios, you may want to use constants in your SQLX files. Let's add a constants.js file to our includes folder and create a cost center dictionary.

```javascript
const COST_CENTERS = {
  dev: "000000",
  stage: "123123",
  prod: "123456"
}

module.exports = { COST_CENTERS }
```

We can use this to label our output BigQuery table with a cost center. Here's an example of using the constant in a SQLX config block:

```
bigquery: {
  partitionBy: "comment_date",
  labels: {
    cost_center: constants.COST_CENTERS.dev
  }
}
```
8. Validate the final SQLX file and compiled dependency graph

After completing the above steps, let's have a look at the final SQLX files.

comments_stream.sqlx

```
config {
  type: "declaration",
  database: "my-project",
  schema: "reddit_stream",
  name: "comments_stream",
  description: "A BigQuery table acting as a data sink for comments streaming in real-time."
}
```

comments_partitioned.sqlx

```
config {
  type: "table",
  description: "cleaned comments data and partitioned by date for faster performance",
  schema: "demo_optimized_staging",
  tags: ["reddit"],
  columns: {
    comment_id: "unique id for each comment",
    subreddit: "which reddit community the comment occurred",
    author: "which reddit user commented",
    comment_text: "the body of text for the comment",
    total_words: "total number of words in the comment",
    reading_ease_score: "a float value for comment readability score",
    reading_ease: "a plain-text english categorization of readability",
    reading_grade_level: "a plain-text english categorization of readability by school grade level",
    sentiment_score: "float value for sentiment of comment between -1 and 1",
    censored: "whether the comment needed censoring by some process upstream",
    positive: "one-hot encoding 1 or 0 for positive",
    neutral: "one-hot encoding 1 or 0 for neutral",
    negative: "one-hot encoding 1 or 0 for negative",
    subjectivity_score: "float value for comment subjectivity score",
    subjective: "one-hot encoding 1 or 0 for subjective",
    url: "link to the comment on reddit",
    comment_date: "date timestamp for when the comment occurred",
    comment_hour: "integer for hour of comment post time",
    comment_year: "integer for year of comment post time",
    comment_month: "integer for month of comment post time",
    comment_day: "integer for day of comment post time"
  },
  bigquery: {
    partitionBy: "comment_date",
    labels: {
      cost_center: constants.COST_CENTERS.dev
    }
  },
  assertions: {
    uniqueKey: ["comment_id"],
    nonNull: ["comment_text"],
    rowConditions: [
      "total_words > 0"
    ]
  }
}

WITH t1 as (
SELECT
  comment_id,
  subreddit,
  author,
  comment_text,
  CAST(total_words AS INT64) total_words,
  CAST(reading_ease_score AS FLOAT64) reading_ease_score,
  reading_ease,
  reading_grade_level,
  CAST(sentiment_score AS FLOAT64) sentiment_score,
  CAST(censored AS INT64) censored,
  CAST(positive AS INT64) positive,
  CAST(neutral AS INT64) neutral,
  CAST(negative AS INT64) negative,
  CAST(subjectivity_score AS FLOAT64) subjectivity_score,
  CAST(subjective AS INT64) subjective,
  url,
  DATE(comment_date) comment_date,
  CAST(comment_hour AS INT64) comment_hour,
  CAST(comment_year AS INT64) comment_year,
  CAST(comment_month AS INT64) comment_month,
  CAST(comment_day AS INT64) comment_day
FROM ${ref('comments_stream')}
WHERE CAST(total_words AS INT64) > 0)

${common.dedupe("t1", "comment_id")}
```

Let's validate the dependency graph and ensure the order of operations looks correct. Now it's easy to visualize where the source data comes from, what output type comments_partitioned is, and what data quality tests will occur!

Next Steps
This guide outlines the first steps of transitioning legacy SQL solutions to SQLX and Dataform for improved metadata management, comprehensive data quality testing, and efficient development. Adopting Dataform streamlines the management of your cloud data warehouse processes, allowing you to focus more on analytics and less on infrastructure management. For more information, check out Google Cloud's Overview of Dataform, and explore our official Dataform guides and Dataform sample script library for even more hands-on experience.
Source: Google Cloud Platform

How to migrate Cloud Storage data from multi-region to regional

There are many considerations to take into account when choosing the location type of your Cloud Storage bucket. However, as business needs change, you may find that regional storage offers lower cost and/or better performance than multi-region or dual-region storage. By design, once your data is already stored in a bucket, the location type of that bucket cannot be changed. The path forward is clear: you must create new, regional buckets and move your existing data into them.

Migrating from multi-region to regional storage

The tool for this job is the Storage Transfer Service (STS), which uses parameterization to bulk migrate files. The basic steps are as follows:

- Create new buckets in the region you desire.
- Use STS to transfer the objects from the original multi-region buckets to the new regional ones.
- Test the objects (e.g., using Cloud Storage Insights) in the new buckets, and if the test passes, delete the old buckets.

While there is no charge for use of the STS itself, performing a migration will incur Cloud Storage charges associated with the move, including storage charges for the data in the source and destination until you delete the source bucket; charges for the Class A and B operations involved in listing, reading, and writing the objects; egress charges for moving the data across the network; and retrieval and/or early deletion fees associated with migrating Nearline, Coldline, and Archive objects. Please see the STS pricing documentation for more information.

Though we have focused on a multi-region to regional Cloud Storage migration, in the steps that follow, the considerations and process for any other type of location change will be much the same — for example, you might want to migrate from multi-region to dual-region, which could be a good middle ground between the options, or even migrate a regional bucket from one location to a different regional location.

Planning the migration

The first determination will be which buckets to migrate. There could be a number of reasons why you would choose not to migrate certain buckets: the data inside might be stale and/or no longer needed, or it might serve a workload that is a better fit for multi-region, such as an image hosting service for an international user base.

If you're transferring massive amounts of data, it is also important to consider the time it will take to complete the transfer. To prevent any one customer from overloading the service, the STS has queries-per-second and bandwidth limitations at a project level. If you're planning a massive migration (say, over 100 PB or 1 billion objects), you should notify your Google Cloud sales team or create a support ticket to ensure that the required capacity is available in the region where you're doing the transfer. Your sales team can also help you calculate the time the transfer will take, which is a complex process that involves many factors. To determine whether you need to worry about how long the transfer could take, consider the following data points:

- A bucket with 11 PB of data and 70 million objects should take around 24 hours to transfer.
- A bucket with 11 PB of data and 84 billion objects could take 3 years to transfer if jobs are not executed concurrently.

In general, if the number of objects you need to transfer is over a billion, the transfer could take prohibitively long, so you will need to work with Google Cloud technicians to reduce the transfer time by parallelizing the transfer. Note that these metrics are for cloud-to-cloud transfers, not HTTP transfers.
There may also be metadata that you want to transfer from your old buckets to your new buckets. Some metadata, like user-created custom fields, is automatically transferred by the STS, whereas other fields, like storage classes or CMEK, must be manually enabled via the STS API. The API or gcloud CLI must also be used if you wish to transfer all versions of your objects, as opposed to just the latest one. If you are using Cloud Storage Autoclass in the destination bucket (it must be enabled at bucket creation time), all of your objects will start out in the Standard storage class after the transfer. Refer to the Transfer between Cloud Storage buckets documentation for guidance on handling all the complexities you may have to account for.

Your final decision point will be whether you want to keep the exact same names for your buckets, or whether you can work with new bucket names (e.g., no application changes are needed if you keep the same bucket name). As you will see in the next section, the migration plan will require an additional step if you need to keep the original names.

Steps for migration

The diagram below shows how the migration process will unfold for a single bucket. You may decide that, in order to avoid having to recode the names of the buckets in every downstream application, you want your regional buckets to have the exact same names as your original multi-region buckets did. Since bucket names are as immutable as their location types, and the names need to be globally unique, this requires transferring your data twice: once to temporary intermediate buckets, then to the new target buckets that were created after the source buckets had been deleted. While this will obviously take additional time, note that the second transfer to the new target buckets will take approximately a tenth of the time of the first transfer because you are doing a simple copy within a region.

Be sure to account for the fact that there will be downtime for your services while you are switching them to the new buckets. Also keep in mind that when you delete the original multi-region buckets, you should create the regional buckets with the same names immediately afterwards. Once you've deleted them, theoretically anyone can claim their names.

If you are aiming to transfer multiple buckets, you can run multiple jobs simultaneously to decrease the overall migration time. STS supports around 200 concurrent jobs per project. Additionally, if you have very large buckets, either by size or number of objects, it is possible that a job may take several days to fully transfer the data in the bucket, as each job copies one object at a time. In these cases, you can run multiple jobs per bucket and configure each job to filter objects by prefix; if configured correctly, this can significantly reduce the overall migration time for very large buckets (see the sketch below for one way to script this). This library can help with managing your STS jobs and testing the objects that have been transferred.
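As an illustration of the prefix-per-job approach, here is a minimal sketch that assumes the google-cloud-storage-transfer Python client library; the project ID, bucket names, and prefixes are placeholders, and the field names mirror the public Storage Transfer Service API rather than anything specific to this article.

```python
# Sketch only: create one STS job per object prefix so several jobs can run
# against the same source bucket in parallel.
import datetime
from google.cloud import storage_transfer

client = storage_transfer.StorageTransferServiceClient()
today = datetime.date.today()

def create_prefix_job(project_id, source_bucket, sink_bucket, prefix):
    return client.create_transfer_job({
        "transfer_job": {
            "project_id": project_id,
            "description": f"Migrate objects under {prefix}",
            "status": "ENABLED",
            # A one-time schedule (start date == end date) runs the job once.
            "schedule": {
                "schedule_start_date": {"year": today.year, "month": today.month, "day": today.day},
                "schedule_end_date": {"year": today.year, "month": today.month, "day": today.day},
            },
            "transfer_spec": {
                "gcs_data_source": {"bucket_name": source_bucket},
                "gcs_data_sink": {"bucket_name": sink_bucket},
                # Limit this job to one slice of the bucket.
                "object_conditions": {"include_prefixes": [prefix]},
            },
        }
    })

for prefix in ["2021/", "2022/", "2023/"]:  # placeholder prefixes
    create_prefix_job("my-project", "my-multi-region-bucket", "my-regional-bucket", prefix)
```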
What's next?

With great flexibility of storage options comes great responsibility. To determine whether a migration is necessary, you will need to do a careful examination of your data and the workloads that use it. You will also need to consider what data and metadata should be transferred to the buckets of the new location type. Luckily, once you've made those decisions, Cloud Storage and the STS make it easy to migrate your data.

Once your data is transferred, there are other ways to optimize your usage of Cloud Storage, such as leveraging customizable monitoring dashboards. If you're not using the STS, perhaps for smaller transfers or analytical workloads where you're downloading and uploading data to a VM, consider using the gcloud storage CLI.
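If you would rather script such a small copy than run the CLI by hand, the google-cloud-storage Python client can perform server-side copies between buckets; the sketch below is an illustration only, and the bucket names are placeholders.

```python
# Sketch only: copy every object from a small source bucket to a new regional
# bucket, one object at a time. Suitable for modest amounts of data; use the
# Storage Transfer Service jobs described above for anything large.
from google.cloud import storage

client = storage.Client()
source = client.bucket("my-small-multi-region-bucket")
destination = client.bucket("my-new-regional-bucket")

for blob in client.list_blobs(source):
    # copy_blob performs a server-side copy; object bytes never leave Cloud Storage.
    source.copy_blob(blob, destination, blob.name)
    print(f"copied gs://{source.name}/{blob.name} -> gs://{destination.name}/{blob.name}")
```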
Source: Google Cloud Platform

Submit your entry now for our *new* Talent Transformation Google Cloud Customer Award

Has your business made talent investments that directly link to your digital transformation strategy? Google Cloud wants to honor your organization for its dedication to developing your team's Google Cloud skills through our *new* Talent Transformation Google Cloud Customer Award. Submit your application before March 31, 2023 to be recognized as a global leader in cloud talent transformation.

Celebrating Google Cloud customers' outstanding contributions

Google Cloud Customer Awards recognize organizations that are leading business transformation with Google Cloud products and solutions. We want to hear how you are growing one of the most important elements of your organization — your people! Tell us your story for a chance to win and enjoy benefits like:

- A Google Cloud Customer Award designation for your website
- Collaboration with Google Cloud leaders, engineers, and product managers at a variety of roundtables, discussions, and events
- Google Cloud press release and announcement support to help strengthen your brand as a visionary leader in technology
- Promotion through the Google Cloud results blog and social media to share your success story with our extensive customer and partner network
- Inclusion in the annual Google Cloud Customer ebook
- A place amongst the global leaders who are recognized at Google Cloud events and celebrations

How to get started and tips for a submission that will stand out

Tell your compelling and unique story about cloud talent transformation! This can include mentorship, skills training, Google Cloud certification preparation support, or anything you've built to invest in your people's Google Cloud skills. To help your accomplishments shine, use the distinct voice and personality of your organization. You'll want to begin by gathering:

- Business and deployment metrics
- Solution overview diagrams, workflows, architectural diagrams, or images
- Existing public case studies, webinars, or other content

These awards recognize customers who demonstrate unique transformation and innovation, business/operational excellence, industry-wide problem solving, and long-term, lasting benefits. You can add depth to your submission by asking stakeholders to share their perspectives — for example, your CEO's or your customers' testimonies are great ways to do this. Metrics and impact are also important. Share how your company is now faster, smarter, more collaborative, and more flexible due to the Google Cloud skills development opportunities that you provided.

Judging the entries and results announcements

A diverse panel of senior technical judges from around the world carefully assesses hundreds of entries, which are ranked using a scoring framework. We ensure high-quality assessment through a three-round process, using specified benchmarks at least twice per entry. The Google Cloud Customer Awards team and the judges are the only people who see submissions, and winners are under embargo until official announcements take place. All participants will be notified of results via email at least two months prior to announcements, with results notification scheduled for May 31, 2023. Results will be formally announced and celebrated at a special event later this year, where winners take their place amongst other outstanding leaders in innovative thinking and business transformation.

For inspiration and to learn more about the transformative organizations that have won Customer Awards with their visionary achievements, take a look at last year's industry winners.
We encourage entry by any customer – new to established, small to large, across all types of products and solutions.

Submitting your entry

In order to enter, you must be a Google Cloud customer with success you can demonstrate within the last 12 months. Google Cloud partners and Googlers can also submit on behalf of customers. Award categories include Industry Customer Awards across a range of verticals, and our Technology for Good Awards, which include the Talent Transformation Award. You may apply for one Industry Customer Award, plus any or all of the Technology for Good Awards.

Start by using this template to gather all of the relevant information as a team. Designate one person to complete the application and submit it via the Customer Awards online questionnaire. The submission window is now open through March 31, 2023.

We are so excited to hear about the wonderful things you are doing to empower your teams to build upon their Google Cloud knowledge and skills — making you a leader in your industry. Happy submitting — get started here!
Source: Google Cloud Platform

How to use advanced feature engineering to preprocess data in BigQuery ML

Preprocessing and transforming raw data into features is a critical but time-consuming step in the ML process. This is especially true when a data scientist or data engineer has to move data across different platforms to do MLOps. In this blog post, we describe how we streamline this process by adding two feature engineering capabilities in BigQuery ML.

Our previous blog outlines the data-to-AI journey with BigQuery ML, highlighting two powerful features that simplify MLOps: data preprocessing functions for feature engineering and the ability to export the BigQuery ML TRANSFORM statement as part of the model artifact. In this blog post, we share how to use these features to create a seamless experience from BigQuery ML to Vertex AI.

Data Preprocessing Functions

Preprocessing and transforming raw data into features is a critical but time-consuming step when operationalizing ML. We recently announced the public preview of advanced feature engineering functions in BigQuery ML. These functions help you impute, normalize, or encode data. When this is done inside the database, BigQuery, the entire preprocessing step becomes easier, faster, and more secure.

Here is a list of the new functions we are introducing in this release. The full list of preprocessing functions can be found here.

- ML.MAX_ABS_SCALER: Scale a numerical column to the range [-1, 1] without centering, by dividing by the maximum absolute value.
- ML.ROBUST_SCALER: Scale a numerical column by centering with the median (optional) and dividing by the quantile range of choice ([25, 75] by default).
- ML.NORMALIZER: Turn an input numerical array into a unit-norm array for any p-norm: 0, 1, >1, +inf. The default is 2, resulting in a normalized array where the sum of squares is 1.
- ML.IMPUTER: Replace missing values in a numerical or categorical input with the mean, median, or mode (most frequent value).
- ML.ONE_HOT_ENCODER: One-hot encode a categorical input. Optionally, it can do dummy encoding by dropping the most frequent value. It is also possible to limit the size of the encoding by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.
- ML.LABEL_ENCODER: Encode a categorical input to integer values in [0, n categories], where 0 represents NULL and excluded categories. You can exclude categories by specifying k for the k most frequent categories and/or a lower threshold for the frequency of categories.

Model Export with TRANSFORM Statement

You can now export BigQuery ML models that include a feature TRANSFORM statement. The ability to include TRANSFORM statements makes models more portable when exporting them for online prediction. This capability also works when BigQuery ML models are registered with the Vertex AI Model Registry and deployed to Vertex AI Prediction endpoints. More details about exporting models can be found in BigQuery ML Exporting models. These new features are available through the Google Cloud console, the BigQuery API, and the client libraries.

Step-by-step guide to use the two features

In this tutorial, we will use the bread recipe competition dataset to predict judges' ratings using linear regression and boosted tree models.

Objective: To demonstrate how to preprocess data using the new functions, register the model with the Vertex AI Model Registry, and deploy the model for online prediction with Vertex AI Prediction endpoints.

Dataset: Each row represents a bread recipe, with columns for each ingredient (flour, salt, water, yeast) and procedure (mixing time, mixing speed, cooking temperature, resting time).
There are also columns that include judges' ratings of the final product from each recipe.

Overview of the tutorial: Steps 1 and 2 show how to use the TRANSFORM statement. Steps 3 and 4 demonstrate how to manually export and register the models. Steps 5 through 7 show how to deploy a model to a Vertex AI Prediction endpoint. For the best learning experience, follow this blog post alongside the tutorial notebook.

Step 1: Transform BigQuery columns into ML features with SQL

Before training an ML model, exploring the data within columns is essential to identifying the data type, distribution, scale, missing patterns, and extreme values. BigQuery ML enables this exploratory analysis with SQL. With the new preprocessing functions, it is now even easier to transform BigQuery columns into ML features with SQL while iterating to find the optimal transformation. For example, when using the ML.MAX_ABS_SCALER function for an input column, each value is divided by the maximum absolute value (10 in the example):

```sql
SELECT
  input_column,
  ML.MAX_ABS_SCALER(input_column) OVER() AS scale_column
FROM
  UNNEST([0, -1, 2, -3, 4, -5, 6, -7, 8, -9, 10]) as input_column
ORDER BY input_column
```

Once the input columns for an ML model are identified and the feature transformations are chosen, it is enticing to apply the transformations and save the output as a view. But this has an impact on our predictions later on, because these same transformations will need to be applied before requesting predictions. Step 2 shows how to prevent this separation of preprocessing and model training.

Step 2: Iterate through multiple models with inline TRANSFORM functions

Building on the preprocessing explorations in Step 1, the chosen transformations are applied inline with model training using the TRANSFORM statement. This interlocks the model iteration with the preprocessing explorations while making any candidate model ready for serving with BigQuery or beyond. This means you can immediately try multiple model types without any delayed impact of feature transformations on predictions.
In this step, two models, linear regression and boosted tree, are trained side by side with identical TRANSFORM statements.

Training with linear regression – Model a

```sql
CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`
TRANSFORM (
  JUDGE_A,

  ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt,
  ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,
  ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,
  ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,
  ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,

  ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,
  ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,
  ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,
  ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,

  ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,
  ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,
  ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,
  ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,
  ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,
  ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,
  ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime
)
OPTIONS (
  model_type = 'LINEAR_REG',
  input_label_cols = ['JUDGE_A'],
  enable_global_explain = TRUE,
  data_split_method = 'AUTO_SPLIT',
  MODEL_REGISTRY = 'VERTEX_AI',
  VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2a',
  VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234821']
) AS
SELECT * EXCEPT(Recipe, JUDGE_B)
FROM `statmike-mlops-349915.feature_engineering.bread`
```

Training with boosted tree – Model b

```sql
CREATE OR REPLACE MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`
TRANSFORM (
  JUDGE_A,

  ML.MIN_MAX_SCALER(flourAmt) OVER() as scale_flourAmt,
  ML.ROBUST_SCALER(saltAmt) OVER() as scale_saltAmt,
  ML.MAX_ABS_SCALER(yeastAmt) OVER() as scale_yeastAmt,
  ML.STANDARD_SCALER(water1Amt) OVER() as scale_water1Amt,
  ML.STANDARD_SCALER(water2Amt) OVER() as scale_water2Amt,

  ML.STANDARD_SCALER(waterTemp) OVER() as scale_waterTemp,
  ML.ROBUST_SCALER(bakeTemp) OVER() as scale_bakeTemp,
  ML.MIN_MAX_SCALER(ambTemp) OVER() as scale_ambTemp,
  ML.MAX_ABS_SCALER(ambHumidity) OVER() as scale_ambHumidity,

  ML.ROBUST_SCALER(mix1Time) OVER() as scale_mix1Time,
  ML.ROBUST_SCALER(mix2Time) OVER() as scale_mix2Time,
  ML.ROBUST_SCALER(mix1Speed) OVER() as scale_mix1Speed,
  ML.ROBUST_SCALER(mix2Speed) OVER() as scale_mix2Speed,
  ML.STANDARD_SCALER(proveTime) OVER() as scale_proveTime,
  ML.MAX_ABS_SCALER(restTime) OVER() as scale_restTime,
  ML.MAX_ABS_SCALER(bakeTime) OVER() as scale_bakeTime
)
OPTIONS (
  model_type = 'BOOSTED_TREE_REGRESSOR',
  booster_type = 'GBTREE',
  num_parallel_tree = 1,
  max_iterations = 30,
  early_stop = TRUE,
  min_rel_progress = 0.01,
  tree_method = 'HIST',
  subsample = 0.85,
  input_label_cols = ['JUDGE_A'],
  enable_global_explain = TRUE,
  data_split_method = 'AUTO_SPLIT',
  l1_reg = 10,
  l2_reg = 10,
  MODEL_REGISTRY = 'VERTEX_AI',
  VERTEX_AI_MODEL_ID = 'bqml_03_feature_engineering_2b',
  VERTEX_AI_MODEL_VERSION_ALIASES = ['run-20230112234926']
) AS
SELECT * EXCEPT(Recipe, JUDGE_B)
FROM `statmike-mlops-349915.feature_engineering.bread`
```
Identical input columns with the same preprocessing mean you can easily compare the accuracy of the models. Using the BigQuery ML function ML.EVALUATE makes this comparison as simple as a single SQL query that stacks the outcomes with the UNION ALL set operator:

```sql
SELECT 'Manual Feature Engineering – 2A' as Approach, mean_squared_error, r2_score
FROM ML.EVALUATE(MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2a`)
UNION ALL
SELECT 'Manual Feature Engineering – 2B' as Approach, mean_squared_error, r2_score
FROM ML.EVALUATE(MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`)
```

The results of the evaluation comparison show that the boosted tree model is much better than linear regression, with drastically lower mean squared error and higher r2. Both models are ready to serve predictions, but the clear choice is the boosted tree regressor. Once we decide which model to use, we can predict directly within BigQuery ML using the ML.PREDICT function. In the rest of the tutorial, we show how to export the model outside of BigQuery ML and predict using Google Cloud Vertex AI.

Using BigQuery Models for Inference Outside of BigQuery

Once your model is trained, if you want low-latency online inference from your application, you have to deploy the model outside of BigQuery. The following steps demonstrate how to deploy the models to Vertex AI Prediction endpoints. This can be accomplished in one of two ways:

- Manually export the model from BigQuery ML and set up a Vertex AI Prediction endpoint. To do this, you need to do steps 3 and 4 first.
- Register the model and deploy it from the Vertex AI Model Registry automatically. This capability is not available yet but will arrive in a forthcoming release; once it is available, steps 3 and 4 can be skipped.

Step 3. Manually export models from BigQuery

BigQuery ML supports an EXPORT MODEL statement to deploy models outside of BigQuery. A manual export includes two models: a preprocessing model that reflects the TRANSFORM statement, and a prediction model. Both models are exported with a single export statement in BigQuery ML.

```sql
EXPORT MODEL `statmike-mlops-349915.feature_engineering.03_feature_engineering_2b`
  OPTIONS (URI = 'gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model')
```

The preprocessing model that captures the TRANSFORM statement is exported as a TensorFlow SavedModel file. In this example it is exported to a GCS bucket located at gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model/transform. The prediction models are saved in portable formats that match the frameworks in which they were trained by BigQuery ML: the linear regression model is exported as a TensorFlow SavedModel, and the boosted tree regressor is exported as a Booster file (XGBoost).
In this example, the boosted tree model is exported to a GCS bucket located at gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model. These export files are in a standard, open format for their native model types, making them completely portable: they can be deployed to Vertex AI (Steps 4-7 below), on your own infrastructure, or even in edge applications. Steps 4 through 7 show how to register and deploy a model to a Vertex AI Prediction endpoint. These steps need to be repeated separately for the preprocessing model and the prediction model.

Step 4. Register models to the Vertex AI Model Registry

To deploy the models in Vertex AI Prediction, they first need to be registered with the Vertex AI Model Registry. To do this, two inputs are needed: the links to the model files and a URI to a pre-built container. Go to Step 4 in the tutorial to see exactly how it's done. The registration can be done with the Vertex AI console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to register the models like this:

```python
vertex_model = aiplatform.Model.upload(
    display_name = 'gcs_03_feature_engineering_2b',
    serving_container_image_uri = 'us-docker.pkg.dev/vertex-ai/prediction/xgboost-cpu.1-1:latest',
    artifact_uri = "gs://statmike-mlops-349915-us-central1-bqml-exports/03/2b/model"
)
```

Step 5. Create Vertex AI Prediction endpoints

Vertex AI includes a service for hosting models for online predictions. To host a model on a Vertex AI Prediction endpoint, you first create an endpoint. This can also be done directly from the Vertex AI Model Registry console or programmatically with one of the clients. In the example below, the Python client for Vertex AI is used to create the endpoint like this:

```python
vertex_endpoint = aiplatform.Endpoint.create(
    display_name = '03_feature_engineering_manual_2b'
)
```

Step 6. Deploy models to endpoints

Deploying a model from the Vertex AI Model Registry (Step 4) to a Vertex AI Prediction endpoint (Step 5) is done in a single deployment action, where the model definition is supplied to the endpoint along with the type of machine to utilize. Vertex AI Prediction endpoints can automatically scale up or down to handle prediction traffic needs based on the number of replicas you provide (the default is 1 for both min and max). In the example below, the Python client for Vertex AI is used with the deploy method of the endpoint (Step 5) and the model (Step 4):

```python
vertex_endpoint.deploy(
    model = vertex_model,
    deployed_model_display_name = vertex_model.display_name,
    traffic_percentage = 100,
    machine_type = 'n1-standard-2',
    min_replica_count = 1,
    max_replica_count = 1
)
```

Step 7. Request predictions from endpoints

Once the model is deployed to a Vertex AI Prediction endpoint (Step 6), it can serve predictions. Rows of data, called instances, are passed to the endpoint, and results are returned that include the processed information: the preprocessing result or a prediction.
Getting prediction results from Vertex AI Prediction endpoints can be done with any of the Vertex AI API interfaces (REST, gRPC, gcloud, Python, Java, Node.js). Here, the request is demonstrated directly with the predict method of the endpoint (Step 6) using the Python client for Vertex AI, as follows:

```python
results = vertex_endpoint.predict(instances = [
    {'flourAmt': 511.21695405324624,
     'saltAmt': 9,
     'yeastAmt': 11,
     'mix1Time': 6,
     'mix1Speed': 4,
     'mix2Time': 5,
     'mix2Speed': 4,
     'water1Amt': 338.3989183746999,
     'water2Amt': 105.43955159464981,
     'waterTemp': 48,
     'proveTime': 92.27755071811586,
     'restTime': 43,
     'bakeTime': 29,
     'bakeTemp': 462.14028505497805,
     'ambTemp': 38.20572852497746,
     'ambHumidity': 63.77836403396154}])
```

The result of an endpoint with a preprocessing model will be identical to applying the TRANSFORM statement from BigQuery ML. The results can then be pipelined to an endpoint with the prediction model to serve predictions that match the results of the ML.PREDICT function in BigQuery ML. The results of both methods, Vertex AI Prediction endpoints and BigQuery ML with ML.PREDICT, are shown side by side in the tutorial to show that the results of the model are replicated. Now the model can be used for online serving with extremely low latency. This even includes using private endpoints for even lower latency and secure connections with VPC Network Peering.

Conclusion

With the new preprocessing functions, you can simplify data exploration and feature preprocessing. Further, by embedding preprocessing within model training using the TRANSFORM statement, the serving process is simplified by using prepped models without needing additional steps. In other words, predictions are done right inside BigQuery, or alternatively the models can be exported to any location outside of BigQuery, such as Vertex AI Prediction for online serving. The tutorial demonstrated how BigQuery ML works with the Vertex AI Model Registry and Vertex AI Prediction to create a seamless end-to-end ML experience. In the future you can expect to see more capabilities that bring BigQuery, BigQuery ML, and Vertex AI together.

Click here to access the tutorial, or check out the documentation to learn more about BigQuery ML.

Thanks to Ian Zhao, Abhinav Khushraj, Yan Sun, Amir Hormati, Mingge Deng and Firat Tekiner from the BigQuery ML team.
Source: Google Cloud Platform

Advancing cancer research with public imaging datasets from the National Cancer Institute Imaging Data Commons

Medical imaging offers remarkable opportunities in research for advancing our understanding of cancer, discovering new non-invasive methods for its detection, and improving overall patient care. Advancements in artificial intelligence (AI), in particular, have been key in unlocking our ability to use this imaging data as part of cancer research. Development of AI-powered research approaches, however, requires access to large quantities of high-quality imaging data.

Sample images from NCI Imaging Data Commons. Left: Magnetic Resonance Imaging (MRI) of the prostate (credit: http://doi.org/10.7937/K9/TCIA.2018.MR1CKGND), along with annotations of the prostate gland and substructures. Right: highly multiplexed fluorescence tissue imaging of melanoma (credit: https://humantumoratlas.org/hta7/).

The US National Cancer Institute (NCI) has long prioritized collection, curation, and dissemination of comprehensive, publicly available cancer imaging datasets. Initiatives like The Cancer Genome Atlas (TCGA) and the Human Tumor Atlas Network (HTAN), to name a few, work to make robust, standardized datasets easily accessible to anyone interested in contributing their expertise: students learning the basics of AI, engineers developing commercial AI products, researchers developing innovative proposals for image analysis, and of course the funders evaluating those proposals.

Even so, there continue to be challenges that complicate sharing and analysis of imaging data:

- Data is spread across a variety of repositories, which means replicating data to bring it together or within reach of tooling (such as cloud-based resources).
- Images are often stored in vendor-specific or specialized research formats, which complicates analysis workflows and increases maintenance costs.
- The lack of a common data model or tooling makes capabilities such as search, visualization, and analysis of data difficult and repository- or dataset-specific.
- Achieving reproducibility of analysis workflows, a critical function in research, is challenging and often lacking in practice.

Introducing Imaging Data Commons

To address these issues, as part of the Cancer Research Data Commons (CRDC) initiative that establishes the national cancer research ecosystem, NCI launched the Imaging Data Commons (IDC), a cloud-based repository of publicly available cancer imaging data with several key advantages:

- Colocation: Image files are curated into Google Cloud Storage buckets, side by side with on-demand computational resources and cloud-based tools, making it easier and faster for you to access and analyze them.
- Format: Images, annotations, and analysis results are harmonized into the standard DICOM (Digital Imaging and Communications in Medicine) format to improve interoperability with tools and support uniform processing pipelines.
- Tooling: IDC maintains tools that, without requiring any downloads, allow you to explore and search the data and visualize images and annotations.
You can easily access IDC data from the cloud-based tools available in Google Cloud, such as Vertex AI, Colab, or deploy your own tools in highly configurable virtual environments.Reproducibility: Sharing reproducible analysis workflows is streamlined through maintaining persistent versioned data that you can use to precisely define cohorts used to train or validate algorithms, which in turn can be deployed in virtual environments that can provide consistent software and hardware configuration.IDC ingests and harmonizes de-identified data from a growing list of repositories and initiatives, spanning a broad range of image types and scales, cancer types, and manufacturers. A significant portion of these images are accompanied by annotations and clinical data. For a quick summary of what is available in IDC, check the IDC Portal or this Looker Studio dashboard! Exploring the IDC dataIDC PortalA great place to start exploring the data is the IDC Portal. From this in-browser portal, you can use some of the key metadata attributes to navigate the images and visualize them.Navigating the IDC portal to view dataset imagesAs an example, here are the steps you can follow to find slide microscopy images for patients with lung cancer:From the IDC Portal, proceed to “Explore images”.In the top right portion of the exploration screen, use the summary pie chart to select Chest primary site (you could alternatively select Lung, noting that annotation of cancer location can use different terms).In the same pie chart summary section, navigate to Modality and select Slide Microscopy.In the right-hand panel, scroll to the Collections section, which will now list all collections containing relevant images. Select one or more collections using the checkboxes. Navigate to the Selected Cases section just below, where you will find a list of patients within the selected collections that meet the search criteria. Next, select a given patient using the checkbox. Navigating to the Selected Studies section just below will now show the list of studies – think of these as specific imaging exams available for this patient.  Click the “eye” icon on the far right which will open the viewer allowing you to see the images themselves.BigQuery Public DatasetWhen it’s time to search and select the subsets (or cohorts) of the data that you need to support your analysis more precisely, you’ll head to the public dataset in BigQuery. This dataset contains the comprehensive set of metadata available for the IDC images (beyond the subset contained in the IDC portal), which you can use to precisely define your target data subset with a custom, standard SQL query.You can run these queries from the in-browser BigQuery Console by creating a BigQuery sandbox. The BigQuery sandbox enables you to query data within the limits of the Google Cloud free tier without needing a credit card. If you decide to enable billing and go above the free tier threshold, you are subject to regular BigQuery pricing. However, we expect most researchers’ needs will fit within this tier. 
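If you prefer to work from a script or notebook rather than the BigQuery console, the same public table can also be queried with the BigQuery Python client library. The snippet below is a minimal sketch rather than an official IDC sample: it assumes the google-cloud-bigquery package is installed and that you are authenticated against a Google Cloud project (a sandbox project works), and it simply counts studies per modality for the same Chest primary site used in the portal walkthrough above.

# Minimal sketch: count IDC studies per modality for the "Chest" tumor location.
# Assumes google-cloud-bigquery is installed and default credentials are configured.
from google.cloud import bigquery

client = bigquery.Client()  # Uses your default project and credentials.

query = """
    SELECT Modality, COUNT(DISTINCT StudyInstanceUID) AS num_studies
    FROM `bigquery-public-data.idc_current.dicom_all`
    WHERE tcia_tumorLocation = 'Chest'
    GROUP BY Modality
    ORDER BY num_studies DESC
"""

for row in client.query(query).result():
    print(f"{row.Modality}: {row.num_studies} studies")

The gcs_url column returned by queries such as the second example below can then be handed to the Cloud Storage client library or the gsutil command-line tool to download the corresponding image files.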
To get started with an exploratory query, you can select studies corresponding to the same criteria you just used in your exploration of the IDC Portal:

SELECT
  DISTINCT(StudyInstanceUID)
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  tcia_tumorLocation = "Chest"
  AND Modality = "SM"

Now you're ready to write a query that creates precisely defined cohorts. This time we'll shift from exploring digital pathology images to subsetting Computed Tomography (CT) scans that meet certain criteria. The following query selects all files (identified by their unique storage path in the gcs_url column) that correspond to CT series with a SliceThickness between 0 and 1 mm. It also builds a URL in series_viewer_url that you can follow to visualize the series in the IDC Portal viewer. For the sake of this example, the results are limited to a single series.

SELECT
  collection_id,
  PatientID,
  SeriesDescription,
  SliceThickness,
  gcs_url,
  CONCAT("https://viewer.imaging.datacommons.cancer.gov/viewer/", StudyInstanceUID, "?seriesInstanceUID=", SeriesInstanceUID) AS series_viewer_url
FROM
  `bigquery-public-data.idc_current.dicom_all`
WHERE
  SeriesInstanceUID IN (
    SELECT
      SeriesInstanceUID
    FROM
      `bigquery-public-data.idc_current.dicom_all`
    WHERE
      Modality = "CT"
      AND SAFE_CAST(SliceThickness AS FLOAT64) > 0
      AND SAFE_CAST(SliceThickness AS FLOAT64) < 1
    LIMIT
      1)

As you start to write more complex queries, it will be important to familiarize yourself with the DICOM format and how it is connected with the IDC dataset. This getting started tutorial is a great place to start learning more.

What can you do with the results of these queries? For example:
- You can build the URL to open the IDC Portal viewer and examine individual studies, as demonstrated in the second query above.
- You can learn more about the patients and studies that meet these search criteria by exploring the annotations or clinical data that accompany these images. The getting started tutorial provides several example queries along these lines.
- You can link DICOM metadata describing imaging collections with related clinical information, which is linked when available. This notebook can help in navigating the clinical data available for IDC collections.
- Finally, you can download all images contained in the resulting studies. Thanks to the support of the Google Cloud Public Datasets Program, you can download IDC image files from Cloud Storage at no cost.

Integrating with other Cloud tools

There are several Cloud tools we want to mention that can help in your explorations of the IDC data:
- Colab: Colab is a hosted Jupyter notebook solution that allows you to write and share notebooks that combine text and code, download images from IDC, and execute the code in the cloud with a free virtual machine. You can expand beyond the free tier to use custom VMs or GPUs, while still controlling costs with fixed monthly pricing plans. Notebooks can easily be shared with colleagues (such as readers of your academic manuscript). Check out these example Colab notebooks to help you get started.
- Vertex AI: Vertex AI is a platform that handles all the steps of the ML workflow. It also includes managed Jupyter notebooks, but with more control over the environment and hardware you use. As part of Google Cloud, it comes with enterprise-grade security, which may be important to your use case, especially if you are joining in your own proprietary data. Its Experiments functionality allows you to automatically track architectures, hyperparameters, and training environments, to help you discover the optimal ML model faster.
- Looker Studio: Looker Studio is a platform for developing and sharing custom interactive dashboards. You can create dashboards that focus on a specific subset of the metadata accompanying the images and cater to users who prefer an interactive interface over SQL queries. As an example, this dashboard provides a summary of IDC data, and this dashboard focuses on the preclinical datasets within the IDC.
- Cloud Healthcare API: IDC relies on the Cloud Healthcare API to extract and manage DICOM metadata with BigQuery, and to maintain DICOM stores that make IDC data available via the standard DICOMweb interface. IDC users can use these tools to store and provide access to the artifacts resulting from the analysis of IDC images. As an example, a DICOM store can be populated with the results of image segmentation, which can then be visualized using a user-deployed, Firebase-hosted instance of OHIF Viewer (deployment instructions are available here).

Next Steps

The IDC dataset is a powerful tool for accelerating data-driven research and scientific discovery in cancer prevention, treatment, and diagnosis. We encourage researchers, engineers, and students alike to get started by following the onboarding steps we laid out in this post: familiarize yourself with the data by heading to the IDC Portal, tailor your cohorts using the BigQuery public dataset, and then download the images to analyze with your on-prem tools, with Google Cloud services, or with Colab. The Getting started with the IDC notebook series should help you get familiar with the resource. For questions, you can reach the IDC team at support@canceridc.dev, or join the IDC community and post your questions. Also, see the IDC user guide for more details, including official documentation.

Related Article: Boost medical discoveries with AlphaFold on Vertex AI. Learn 3 ways to run AlphaFold on Google Cloud using no-cost solutions and guides. Read Article
Related Article: Most popular public datasets to enrich your BigQuery analyses. Check out free public datasets from Google Cloud, available to help you get started easily with big data analytics in BigQuery and Cloud … Read Article
Source: Google Cloud Platform

Demystifying BigQuery BI Engine

BigQuery BI Engine is a fast, in-memory analysis system for BigQuery, currently processing over 2 billion queries per month and growing. BigQuery has its roots in Google's Dremel system and is a data warehouse built with scalability as a goal. BI Engine, on the other hand, was envisioned with data analysts in mind and focuses on providing value on gigabyte to sub-terabyte datasets, with minimal tuning, for real-time analytics and BI purposes. Using BI Engine is simple: create a memory reservation on the project that runs BigQuery queries, and it will cache data and use the optimizations. This post is a deep dive into how BI Engine helps deliver blazing fast performance for your BigQuery queries and what users can do to leverage its full potential.

BI Engine optimizations

The two main pillars of BI Engine are in-memory caching of data and vectorized processing. Other optimizations include CMETA metadata pruning, single-node processing, and join optimizations for smaller tables.

Vectorized engine

BI Engine utilizes the "Superluminal" vectorized evaluation engine, which is also used for YouTube's analytic data platform query engine, Procella. In BigQuery's row-based evaluation, the engine processes all columns within a row for every row, potentially alternating between column types and memory locations before going to the next row. In contrast, a vectorized engine like Superluminal processes a block of values of the same type from a single column for as long as possible and only switches to the next column when necessary. This way, hardware can run multiple operations at once using SIMD, reducing both latency and infrastructure costs. BI Engine dynamically chooses block sizes to fit into caches and available memory. For example, the query "SELECT AVG(word_count), MAX(word_count), MAX(corpus_date) FROM samples.shakespeare" will have the following vectorized plan. Note how the evaluation processes "word_count" separately from "corpus_date".

In-memory cache

BigQuery is a disaggregated storage and compute engine. Usually the data in BigQuery is stored on Google's distributed file system, Colossus, most often in blocks in Capacitor format, and the compute is represented by Borg tasks. This enables BigQuery's scaling properties. To get the most out of vectorized processing, BI Engine needs to feed the raw data at CPU speeds, which is achievable only if the data is already in memory. BI Engine runs Borg tasks as well, but its workers are more memory-heavy so they can cache the data as it is being read from Colossus.

A single BigQuery query can be either sent to a single BI Engine worker, or sharded and sent to multiple BI Engine workers. Each worker receives a piece of a query to execute, along with the set of columns and rows necessary to answer it. If the data is not cached in the worker's memory from a previous query, the worker loads the data from Colossus into local RAM. Subsequent requests for the same columns and rows, or a subset of them, are served from memory only. Note that workers will unload the contents if the data hasn't been used for over 24 hours. As multiple queries arrive, they might sometimes require more CPU time than is available on a worker; if there is still reservation available, a new worker will be assigned to the same blocks, and subsequent requests for those blocks will be load-balanced between the workers.

BI Engine can also process super-fresh data that was streamed to the BigQuery table. Therefore, there are two formats currently supported by BI Engine workers: Capacitor and streaming.
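One way to observe this caching behaviour from the client side is to run the same query twice and compare the elapsed time, while also peeking at the BI Engine acceleration information that BigQuery reports in the job statistics. The snippet below is an illustrative sketch, not an official sample: it assumes the google-cloud-bigquery Python package is installed and that the billing project has a BI Engine reservation in the table's region, and it reads the biEngineStatistics field from the raw job resource because the exact property exposed by the client library can vary between versions.

# Illustrative sketch: run the example query twice and compare elapsed times.
# The first run typically includes loading column data into BI Engine memory;
# the second run should be served from the in-memory cache.
from google.cloud import bigquery

client = bigquery.Client()

QUERY = """
    SELECT AVG(word_count), MAX(word_count), MAX(corpus_date)
    FROM `bigquery-public-data.samples.shakespeare`
"""

for attempt in (1, 2):
    # Disable the regular result cache so the second run actually hits BI Engine
    # instead of returning previously cached query results.
    job = client.query(QUERY, job_config=bigquery.QueryJobConfig(use_query_cache=False))
    job.result()  # Wait for the query to finish.
    elapsed_ms = (job.ended - job.started).total_seconds() * 1000
    # biEngineStatistics (biEngineMode, biEngineReasons) is part of the query job
    # statistics in the REST resource; it is simply None if not reported.
    query_stats = job._properties.get("statistics", {}).get("query", {})
    print(f"run {attempt}: {elapsed_ms:.0f} ms, "
          f"biEngineStatistics={query_stats.get('biEngineStatistics')}")

On a project with an active BI Engine reservation, the second run is typically served from memory and completes noticeably faster than the first.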
In-memory capacitor blocks

Generally, data in a Capacitor block is heavily pre-processed and compressed during generation. There are a number of different ways the data from the Capacitor block can be cached; some are more memory efficient, while others are more CPU efficient. The BI Engine worker intelligently chooses between them, preferring latency- and CPU-efficient formats where possible. Thus, actual reservation memory usage might not be the same as logical or physical storage usage, due to the different caching formats.

In-memory streaming data

Streaming data is stored in memory as blocks of native array-columns and is lazily unloaded when blocks get extracted into Capacitor by the underlying storage processes. Note that for streaming, BI Engine workers need to either go to streaming storage every time to potentially obtain new blocks, or serve slightly stale data. BI Engine prefers serving slightly stale data and loading the new streaming blocks in the background instead. The BI Engine worker does this opportunistically during queries: if the worker detects streaming data and the cache is newer than 1 minute, a background refresh is launched in parallel with the query. In practice, this means that with enough requests, the data is no more stale than the previous request time. For example, if a request arrives every second, then the streaming data will be around a second stale.

First requests loading data are slow

Due to these read-time optimizations, loading data from previously unseen columns can take longer in BI Engine than it does in BigQuery. Subsequent reads benefit from these optimizations. For the example query above, here is the backend time for a sample run with BI Engine off, for the first run, and for a subsequent run.

Multiple block processing and dynamic single worker execution

BI Engine workers are optimized for BI workloads, where the output size is small compared to the input size and the output is mostly aggregated. In regular BigQuery execution, a single worker tries to minimize data loading due to network bandwidth limitations; instead, BigQuery relies on massive parallelism to complete queries quickly. BI Engine, on the other hand, prefers to process more data in parallel on a single machine. If the data has been cached, there is no network bandwidth limitation, and BI Engine further reduces network utilization by reducing the number of intermediate "shuffle" layers between query stages. With small enough inputs and a simple query, the entire query will be executed on a single worker and the query plan will have a single stage for the whole processing. We constantly work on making more tables and query shapes eligible for single-stage processing, as this is a very promising way to improve the latency of typical BI queries. For the example query, which is very simple and runs over a very small table, here is a sample run with BI Engine distributed execution vs. single-node (default) execution.

How to get the most out of BI Engine

While we all want a switch that we can toggle to make everything fast, there are still some best practices to think about when using BI Engine.

Output data size

BI optimizations assume human eyes on the other side, and that the size of the output data is small enough to be comprehensible by a human. This limited output size is achieved by selective filters and aggregations.
As a corollary, instead of SELECT * (even with a LIMIT), a better approach is to request only the fields you are interested in, with an appropriate filter and aggregation. To show this with an example: the query "SELECT * FROM samples.shakespeare" processes about 6 MB and takes over a second with both BigQuery and BI Engine. If we instead add MAX to every field, as in "SELECT MAX(word), MAX(word_count), MAX(corpus), MAX(corpus_date) FROM samples.shakespeare", both engines will read all of the data and perform some simple comparisons, but finish 5 times faster on BigQuery and 50 times faster on BI Engine.

Help BigQuery by organizing your data

BI Engine uses query filters to narrow down the set of blocks to read. Therefore, partitioning and clustering your data will reduce the amount of data to read, the latency, and the slot usage. One caveat: "over-partitioning", or having too many partitions, might interfere with BI Engine's multi-block processing. For optimal BigQuery and BI Engine performance, partitions larger than one gigabyte are preferred.

Query depth

BI Engine currently accelerates the stages of the query that read data from the table, which are typically the leaves of the query execution tree. What this means in practice is that almost every query will use some BigQuery slots. That's why you get the most speedup from BI Engine when a lot of time is spent on leaf stages. To mitigate this, BI Engine tries to push as many computations as possible into the first stage, ideally executing them on a single worker, where the tree is just one node.

For example, Query 1 of the TPCH 10G benchmark is relatively simple: it is 3 stages deep, with efficient filters and aggregations, and it processes 30 million rows but outputs just one. Running this query with BI Engine, the full query took 215 ms, with the "S00: Input" stage (the one accelerated by BI Engine) taking 26 ms. Running the same query in BigQuery gets us 583 ms, with "S00: Input" taking 229 ms. What we see here is that the "S00: Input" stage run time went down 8x, but the overall query did not get 8x faster, because the other two stages were not accelerated and their run time remained roughly the same, as illustrated by the stage breakdown in the following figure.

In a perfect world, where BI Engine processes its part in 0 milliseconds, the query would still take 189 ms to complete, so the maximum speed gain for this query is about 2-3x. If we make this query heavier on the first stage, for example by running TPCH 100G instead, we see that BI Engine finishes the query 6x faster than BigQuery (which takes about 1 second), while the first stage is 30 times faster.

Joins

As previously noted, BI Engine accelerates the "leaf" stages of a query. However, there is one very common pattern used in BI tools that BI Engine optimizes: one large "fact" table joined with one or more smaller "dimension" tables. In that case BI Engine can perform multiple joins, all in one leaf stage, using the so-called "broadcast" join execution strategy. During a broadcast join, the fact table is sharded to be executed in parallel on multiple nodes, while the dimension tables are read on each node in their entirety. For example, let's run Query 3 from the TPC-DS 1G benchmark. The fact table is store_sales and the dimension tables are date_dim and item.
In BigQuery, the dimension tables are loaded into shuffle first; then the "S03: Join+" stage will, for every parallel part of store_sales, read all the necessary columns of the two dimension tables, in their entirety, to perform the join. Note that the filters on date_dim and item are very efficient, and the 2.9M-row fact table is joined with only about 6,000 rows. The BI Engine plan will look a bit different, as BI Engine caches the dimension tables directly, but the same principle applies. For BI Engine, let's assume that two nodes will process the query because the store_sales table is too big for single-node processing. We can see in the image below that both nodes perform similar operations: reading the data, filtering, building the lookup table, and then performing the join. While only a subset of the store_sales data is processed on each node, all operations on the dimension tables are repeated.

Note that:
- The "build lookup table" operation is very CPU intensive compared to filtering.
- "Join" operation performance also suffers if the lookup tables are large, as this interferes with CPU cache locality.
- Dimension tables need to be replicated to each "block" of the fact table.

The takeaway is that when a join is performed by BI Engine, the fact table is sometimes split across different nodes, and all other tables are copied to every node to perform the join. Keeping dimension tables small, or using selective filters, will help to keep join performance optimal.

Conclusions

Summarizing everything above, there are some things you can do to make full use of BI Engine and make your queries faster:
- Less is more when it comes to data returned: make sure to filter and aggregate as much data as possible early in the query, and push filters and computations down into BI Engine.
- Queries with a small number of stages get the best acceleration. Preprocessing the data to minimize query complexity will help with optimal performance; for example, using materialized views can be a good option.
- Joins are sometimes expensive, but BI Engine can be very efficient at optimizing typical star-schema queries.
- It's beneficial to partition and/or cluster your tables to limit the amount of data to be read.

Special thanks to Benjamin Liles, Software Engineer for BI Engine, and Deepak Dayama, Product Manager for BI Engine, for contributing to this post.
Source: Google Cloud Platform

What Data Pipeline Architecture should I use?

Data is essential to any application and is used in the design of an efficient pipeline for delivering and managing information throughout an organization. Generally, you define a data pipeline when you need to process data during its life cycle. The pipeline can start where the data is generated and stored, in any format, and it can end with the data being analyzed, used as business information, stored in a data warehouse, or processed in a machine learning model.

Data is extracted, processed, and transformed in multiple steps depending on the downstream system requirements. These processing and transformation steps are what a data pipeline defines. Depending on the requirements, a pipeline can be as simple as a single step or as complex as many transformation and processing steps.

How to choose a design pattern?

When selecting a data pipeline design pattern, several design elements must be considered, including the following:
- Select the data source formats.
- Select which stacks to use.
- Select the data transformation tools.
- Choose between Extract Transform Load (ETL), Extract Load Transform (ELT), or Extract Transform Load Transform (ETLT).
- Determine how changed data is managed.
- Determine how changes are captured.

Data sources can have a variety of data types. Knowing the technology stack and tool sets that you use is also a key element of the pipeline build process. Enterprise environments come with challenges that require multiple, often complicated, techniques to capture changed data and merge it with the target data.

As mentioned, most of the time the downstream systems define the requirements for a pipeline and how these processes are interconnected. The processing steps and the sequence of the data flow are the major factors affecting pipeline design. Each step might include one or more data inputs, and the outputs might feed one or more stages. The processing between input and output might involve simple or complex transformation steps. I highly recommend keeping the design simple and modular to ensure that you clearly understand the steps and transformations taking place. Keeping your pipeline design simple and modular also makes it easier for a team of developers to implement development and deployment cycles, and makes debugging and troubleshooting the pipeline easier when issues occur.

The major components of a pipeline include:
- Source data
- Processing
- Target storage

Source data can be a transactional application, files collected from users, or data extracted from an external API. Processing of the source data can be as simple as a one-step copy or as complex as multiple transformations and joins with other data sources. The target data warehousing system might require the processed data that results from the transformation (such as a data type change or data extraction), along with lookups and updates from other systems. A simple data pipeline might be created by copying data from source to target without any changes. A complex data pipeline might include multiple transformation steps, lookups, updates, KPI calculations, and data storage into several targets for different reasons.

Source data can be presented in multiple formats, and each needs an appropriate architecture and tools to process and transform it.
A typical data pipeline might require multiple data types, in any of the following formats:
- Batch data: A file with tabular information (CSV, JSON, Avro, Parquet, …) where the data is collected according to a defined threshold or frequency, using conventional batch processing or micro-batch processing. Modern applications tend to generate continuous data; for this reason, micro-batch processing is a preferred design for collecting data from sources.
- Transaction data: Application data such as RDBMS (relational data), NoSQL, or Big Data stores.
- Stream data: Real-time applications that use Kafka, Google Pub/Sub, Azure Stream Analytics, or Amazon Kinesis. Streaming applications can communicate in real time and exchange messages to meet the requirements. In enterprise architecture design, real-time and stream processing is a very important component.
- Flat files: PDFs or other non-tabular formats that contain data for processing, for example medical or legal documents from which information can be extracted.

Target data is defined based on the requirements and the downstream processing needs. It's common to build target data to satisfy the needs of multiple systems. In the data lake concept, the data is processed and stored in a way that lets analytics systems gain insight while AI/ML processes use the data to build predictive models.

Architectures and examples

Several architecture designs are covered here that show how the source data is extracted and transformed into the target. The goal is to cover the general approaches; it's important to remember that each use case can be very different, unique to the customer, and in need of special consideration.

A data pipeline architecture can be broken down into logical and platform levels. The logical design describes how the data is processed and transformed from the source into the target. The platform design focuses on the implementation and tooling that each environment needs, which depends on the provider and the tooling available on the platform. GCP, Azure, and AWS have different toolsets for the transformation, while the goal of the logical design (transforming the data) remains the same no matter which provider is used.

Here is a logical design of a data warehousing pipeline:

Here is the logical design for a data lake pipeline:

Depending on the downstream requirements, these generic architecture designs can be implemented with more detail to address several use cases. The platform implementations can vary depending on the toolset selection and development skills. What follows are a few examples of GCP implementations for the common data pipeline architectures.

A Batch ETL Pipeline in GCP: the source might be files that need to be ingested into the analytics Business Intelligence (BI) engine. Cloud Storage is the data transfer medium inside GCP, and Dataflow is then used to load the data into the target BigQuery storage. The simplicity of this approach makes the pattern reusable and effective for simple transformation processes; for complex pipelines, however, it isn't efficient or effective.

A Data Analytics Pipeline is a complex process that has both batch and stream data ingestion pipelines. The processing is complex, and multiple tools and services are used to transform the data into warehousing and an AI/ML access point for further processing. Enterprise solutions for data analytics are complex and require multiple steps to process the data.
The complexity of the design can add to the project timeline and cost, so carefully review and build each component in order to achieve the business objectives.

A Machine Learning data pipeline in GCP is a comprehensive design that allows customers to use GCP-native services to build and run a machine learning workflow. For more information, see Creating a machine learning pipeline. GCP platform diagrams are created by Google Cloud Developer Architecture.

How to choose a data pipeline architecture?

There are multiple approaches to designing and implementing data pipelines, and the key is to choose the design that meets your requirements. New technologies keep emerging that provide more robust and faster implementations for data pipelines. For example, Google BigLake is a new service that introduces a new approach to data ingestion: BigLake is a storage engine that unifies data warehouses by enabling BigQuery and open-source frameworks such as Spark to access data with fine-grained access control, and it provides accelerated query performance across multi-cloud storage and open formats such as Apache Iceberg.

Another major factor in deciding on the proper data pipeline architecture is cost; building a cost-effective solution weighs heavily on the design. Usually, streaming and real-time data processing pipelines are more expensive to build and run than batch models. At times, the budget drives the decision on which design to choose and how to build the platform. Knowing the details of each component and being able to do a cost analysis of the solution ahead of time is important in choosing the right architecture design, and GCP provides a cost calculator that can be used in these cases.

Do you really need real-time analytics, or will a near real-time system be sufficient? This can settle the design decision for a streaming pipeline. Are you building a cloud-native solution or migrating an existing one from on-premises? All of these questions are important in designing a proper architecture for your data pipeline.

Don't ignore data volume when designing a data pipeline. The scalability of the design and of the services used in the platform is another very important factor to consider when designing and implementing a solution. Big Data keeps growing, and building capacity for processing and storing it is a key element of data pipeline architecture. In reality, many variables factor into proper platform design; data volume and velocity, or data flow rates, can be very important ones.

If you are planning to build a data pipeline for a data science project, consider all the data sources that the ML model requires for feature engineering. Data cleansing is largely the responsibility of the data engineering team, which must have adequate transformation toolsets. Data science projects deal with large datasets, which require planning for storage. Depending on how the ML model is used, either real-time or batch processing must serve the users.

What Next?

Big Data, and the growth of data in general, poses new challenges for data architects and continually raises the bar for data architecture. The constant increase in data variety, data formats, and data sources is a challenge as well. Businesses are realizing the value of their data, automating more processes, and demanding real-time access to analytics and decision-making information.
Taking all of these variables into consideration while building a scalable, high-performance system is becoming a challenge. The data pipeline must be robust, flexible, and reliable, the data quality must be trusted by all users, and data privacy is one of the most important factors in any design consideration. I'll cover these concepts in my next article.

I highly recommend following the Google Cloud quickstarts and tutorials as the next steps to learn more about GCP and get hands-on practice:
- Interactive Tutorial: BigQuery tour for data analysts
- Interactive Tutorial: Train an AutoML tabular model
- Interactive Tutorial: Analyze Billing data with BigQuery

Stay tuned. Thank you for reading! Have a question or want to chat? Find me on Twitter or LinkedIn.
Source: Google Cloud Platform