Google Cloud Data Heroes Series: Meet Lynn, a cloud architect equipping bioinformatic researchers with genomic-scale data pipelines on GCP

Google Cloud Data Heroes is a series where we share stories of the everyday heroes who use our data analytics tools to do amazing things. Like any good superhero tale, we explore our Google Cloud Data Heroes’ origin stories, how they moved from data chaos to a data-driven environment, what projects and challenges they are overcoming now, and how they give back to the community.

(Pictured: Lynn Langit rides her bike in the middle of a snowy Minnesota winter.)

For our first issue, we couldn’t be more excited to introduce Google Cloud Data Heroine Lynn Langit. Lynn is a seasoned businesswoman in Minnesota beginning her eleventh year as the founder of her own consulting business, Lynn Langit Consulting LLC. Lynn wears many data professional hats, including cloud architect, developer, and educator. If that weren’t already a handful, she also loves riding her bike in every season of the year (pictured above), which you might imagine gets a bit challenging when you have to invest in studded snow tires for your bike!

Tell us how you got to be a data practitioner. What was that experience like, and how did this journey bring you to GCP?

I worked on the business side of tech for many years. While I enjoyed my work, I found I was intrigued by the nuanced questions practitioners could ask – and the sophisticated decisions they could make – once they unlocked value from their data. This initial intrigue developed into a strong curiosity, and I ultimately made the switch from business worker to data practitioner over 15 years ago. It was a huge career change, considering I got my bachelor’s degree in Linguistics and German. And so I started small. I taught myself almost everything, both at the beginning and even now, through online resources, courses, and materials. I began with databases and data warehousing, specifically building and tuning many enterprise databases. It wasn’t until Hadoop/NoSQL became available that I pivoted to big data.

Back then, I supplemented my self-paced learning with Microsoft technologies, even earning all Microsoft certifications in just one year. When I noticed the industry shifting from on-premises to cloud, I shifted my learning from programming to cloud, too. I have been working in the public cloud for over ten years already!

“I started with AWS, but recently I have been doing most everything in GCP. I particularly love implementing data pipelining, data ops, and machine learning.”

How did you supplement your self-teaching with Google Cloud data upskilling opportunities like product deep dives and documentation, courses, skills, and certificates?

One of the first Google Cloud data analytics products I fell in love with was BigQuery. BigQuery was my gateway product into a much larger open, intelligent, and unified data platform full of products that combine data analytics, databases, AI/ML, and business intelligence. I’ve used BigQuery forever. It’s been amazing since its initial release, and it keeps getting better and better. Then I discovered Dataproc and Bigtable. Dataproc is my go-to for Apache Spark projects, and I’ve used Bigtable for several projects as well. I am also a heavy user of TensorFlow and AutoML.

I’ve achieved skill badges in BigQuery, Data Analysis, and more. I’ve also earned Google’s Professional Data Engineer certification, and I have been a Google Developer Expert since 2012.
Most recently, I was named one of the few Data Analysis Innovator Champions within the Google Cloud Innovators Program, which I’m particularly excited about because I’ve heard it’s a coveted spot for data practitioners and requires a Googler nomination to move from Innovator membership to the Champion title!

You’re undoubtedly a data analytics thought leader in the community. When did you know you had moved from data student to data master, and what data project are you most excited about?

I knew I had graduated, if you will, to the data architect realm once I was able to confidently do data work that matters, even if that work was outside of my usual domains of adtech and fintech. For example, my work over the past few years has been around human health outcomes, including combating the COVID-19 pandemic. I do this by supporting scientists and bioinformatics researchers with genomic-scale data pipelines. Did I know anything about genomics before I started? Not at all! I self-studied bioinformatics and recorded my learnings on GitHub. Along the way, I adapted those learnings into an open-source GCP course on GitHub aimed at researchers who are new to working with GCP. What’s cool about the course is that I begin from the true basics of how to set up a GCP account. Then I gradually work up to mapping out genomic-scale data workflows, pipelines, analyses, batch jobs, and more using BigQuery and a host of other Google Cloud data products. Now, I’ve received feedback that this repository has made a positive impact on researchers’ ability to process and synthesize enormous amounts of data quickly. Plus, it achieves the greater goal of broadening accessibility to a public cloud like GCP.

In what ways do you think you uniquely bring value back to the data community? Why is it important to you to give back to the data community?

I stay busy sharing my learnings back with the community. I record cloud and big data technical screencasts (demos) on YouTube, I’ve authored 25 data and cloud courses on LinkedIn Learning, and I occasionally write Medium articles on cloud technology and random thoughts I have about everyday life. I’m also the cofounder of Teaching Kids Programming, with a mission to help equip middle and high school teachers with a great programming curriculum in Java.

If I had to explain why giving back to the data community is important to me, I’d say this: I just turned 60 and I am constantly learning cutting-edge technology – my latest foray is into cloud quantum computing. Technology benefits us when we combine life experience with curiosity, so I feel an immense duty to keep learning and share my progress and successes along the way!

Begin your own hero’s journey

Ready to embark on your Google Cloud data adventure? Begin your own hero’s journey with GCP’s recommended learning path, where you can achieve badges and certifications along the way. Join the Cloud Innovators program today to stay up to date on more data practitioner tips, tricks, and events.

Connect with Google’s data community at our upcoming virtual event “Latest Google Cloud data analytics innovations”. Register and save your spot now to get your data questions answered live by GCP’s top data leaders and watch demos of our latest products and features, including BigQuery, Dataproc, Dataplex, Dataflow, and more. Lynn will take the main stage as an emcee for this event – you won’t want to miss it!

Finally, if you think you have a good Data Hero story worth sharing, please let us know!
We’d love to feature you in our series as well.

Related Article: Google data experts share top data practitioner skills needed in 2022 – Top data analytics skills to learn in 2022 as a data practitioner; Google Cloud experts weigh in.
Source: Google Cloud Platform

Developing high-quality ML solutions

When a deployed ML model produces poor predictions, it can be due to a wide range of problems. It can be the result of bugs that are typical in any program—but it can also be the result of ML-specific problems. Perhaps data skews and anomalies are causing model performance to degrade over time. Or the data format is inconsistent between the model’s native interface and the serving API. If models aren’t monitored, they can fail silently. When a model is embedded into an application, issues like these can create poor user experiences. If the model is part of an internal process, the issues can negatively impact business decision-making.

Software engineering has many processes, tools, and practices to ensure software quality, all of which help make sure that the software is working in production as intended. These tools include software testing, verification and validation, and logging and monitoring. In ML systems, the tasks of building, deploying, and operating the systems present additional challenges that require additional processes and practices. Not only are ML systems particularly data-dependent because they inform decision-making from data automatically, but they’re also dual training-serving systems. This duality can result in training-serving skew. ML systems are also prone to staleness in automated decision-making systems.

These additional challenges mean that you need different kinds of testing and monitoring for ML models and systems than you do for other software systems—during development, during deployment, and in production. Based on our work with customers, we’ve created a comprehensive collection of guidelines for each process in the MLOps lifecycle. The guidelines cover how to assess, ensure, and control the quality of your ML solutions. We’ve published this complete set of guidelines on the Google Cloud site. To give you an idea of what you can learn, here’s a summary of what the guidelines cover:

Model development: These guidelines are about building an effective ML model for the task at hand by applying relevant data preprocessing, model evaluation, and model testing and debugging techniques.

Training pipeline deployment: These guidelines discuss ways to implement a CI/CD routine that automates the unit tests for model functions and the integration tests of the training pipeline components. The guidelines also help you apply an appropriate progressive delivery strategy for deploying the training pipeline.

Continuous training: These guidelines provide recommendations for extending your automated training workflows with steps that validate the new input data for training, and that validate the new output model that’s produced after training. The guidelines also suggest ways to track the metadata and the artifacts that are generated during the training process.

Model deployment: These guidelines address how to implement a CI/CD routine that automates the process of validating compatibility of the model and its dependencies with the target deployment infrastructure. These recommendations also cover how to test the deployed model service and how to apply progressive delivery and online experimentation strategies to decide on a model’s effectiveness.

Model serving: These guidelines concern ways to monitor the deployed model throughout its prediction serving lifetime to check for performance degradation and dataset drift. They also provide suggestions for monitoring the efficiency of the model service.

Model governance: These guidelines concern setting model quality standards. They also cover techniques for implementing procedures and workflows to review and approve models for production deployment, as well as managing the deployed model in production.
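To make the training-serving skew idea above a little more concrete, here is a minimal sketch (not taken from the guidelines themselves) of one way to flag skew by comparing a numeric feature's training distribution against recent serving data. The synthetic feature values, sample sizes, and significance threshold are illustrative assumptions.

```python
# Minimal sketch (not from the guide): flag possible training-serving skew by
# comparing a numeric feature's training distribution with recent serving data.
# The synthetic data, feature, and threshold below are illustrative assumptions.
import numpy as np
from scipy.stats import ks_2samp


def detect_skew(train_values, serving_values, p_threshold=0.01):
    """Two-sample Kolmogorov-Smirnov test between training and serving values.

    Returns (is_skewed, ks_statistic).
    """
    statistic, p_value = ks_2samp(train_values, serving_values)
    return p_value < p_threshold, statistic


rng = np.random.default_rng(seed=42)
train_ages = rng.normal(loc=35, scale=8, size=5_000)    # stand-in for training data
serving_ages = rng.normal(loc=42, scale=8, size=1_000)  # stand-in for drifted serving logs

is_skewed, ks_stat = detect_skew(train_ages, serving_ages)
if is_skewed:
    print(f"Possible training-serving skew detected (KS statistic = {ks_stat:.3f})")
else:
    print("No significant distribution shift detected")
```

In practice the guidelines describe richer approaches, such as validating serving data against a schema and monitoring feature statistics continuously rather than in a one-off check.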
To read the full list of our recommendations, read the document Guidelines for developing high-quality ML solutions.

Acknowledgements: Thanks to Jarek Kazmierczak, Renato Leite, Lak Lakshmanan, and Etsuji Nakai for their valuable contributions to the guide.
Source: Google Cloud Platform

Build a data mesh on Google Cloud with Dataplex, now generally available

Democratizing data insights and accelerating data-driven decision making is a top priority for most enterprises seeking to build a data cloud. This often requires building a self-serve data platform that can span data silos and enable at-scale usage and application of data to drive meaningful business insights. Organizations today need the ability to distribute ownership of data across the teams that have the most business context, while ensuring that overall data lifecycle management and governance are applied consistently across their distributed data landscape.

Today we are excited to announce the general availability of Dataplex, an intelligent data fabric that enables you to centrally manage, monitor, and govern data across data lakes, data warehouses, and data marts, and make this data securely accessible to a variety of analytics and data science tools.

With Dataplex, enterprises can easily delegate ownership, usage, and sharing of data to data owners who have the right business context, while still having a single pane of glass to consistently monitor and govern data across the various data domains in their organization. With built-in data intelligence, Dataplex automates data discovery, data lifecycle management, and data quality, enabling data productivity and accelerating analytics agility.

Here is what some of our customers have to say:

“We have PBs of data stored in GCS and BigQuery in GCP, accessed by 1000s of internal users daily,” said Saral Jain, Director of Engineering, Snap Inc. “Dataplex enables us to deliver a business domain specific, self-service data platform across distributed data, with de-centralized data ownership but centralized governance and visibility. It significantly reduces the manual toil involved in data management, and automatically makes this data queryable via both BigQuery and open source applications. We are very excited to adopt Dataplex as a central component for building a unified data mesh across our analytics data.”

“As the central data team at Deutsche Bank, we are building a data mesh to standardize data discovery, access control and data quality across the distributed domains,” said Balaji Maragalla, Director Big Data Platform at Deutsche Bank. “To help us on this journey, we are excited to use Dataplex to enable centralized governance for our distributed data. Dataplex formalizes our data mesh vision and gives us the right set of controls for cross-domain data organization, data security, and data quality.”

“As one of the largest entertainment companies in Japan, we generate TBs of data every day and use it to make business-critical decisions,” said Iwao-san, Director of Data Analytics at DeNA. “While we manage each product independently as a separate domain, we want to centralize governance of data across our products. Dataplex enables us to effectively manage and standardize data quality, data security, and data privacy for data across these domains. We are looking forward to building trust in our data with Google Cloud’s Dataplex.”

One of the key use cases that Dataplex enables is a data mesh architecture. Let’s take a closer look at how you can use Dataplex as the data fabric that enables a data mesh.

What is a Data Mesh?

With enterprise data becoming more diverse and distributed, and the number of tools and users that need access to this data growing, organizations are moving away from monolithic data architectures that are domain agnostic.
While monolithic, centrally managed architectures create data bottlenecks and impact analytics agility, a completely decentralized architecture, where business domains maintain their own purpose-built data lakes, also has its pitfalls: it results in data duplication and silos, making governance of this data nearly impossible. Per Gartner, “Through 2025, 80% of organizations seeking to scale digital business will fail because they do not take a modern approach to data and analytics governance.”

The data mesh architecture, first proposed in this paper by Zhamak Dehghani, describes a modern data stack that moves away from a monolithic data lake or data warehouse architecture to a distributed, domain-specific architecture. It enables autonomy of data ownership and provides agility through decentralized, domain-aware data management, while preserving the ability to centrally govern and monitor data across domains. To learn more, refer to this Build a Modern Distributed Data Mesh whitepaper.

How to make Data Mesh real with Google Cloud

Dataplex provides a data management platform to easily build independent data domains within a data mesh that spans your organization, while still maintaining central controls for governing and monitoring the data across domains.

“Dataplex is embodying the principles of Data Mesh as we have envisioned in Adeo. Having a first party, cloud-native, product to architect a Data Mesh in GCP is crucial for effective data sharing and data quality amongst teams. Dataplex streamlines productivity, allowing teams to build data domains and orchestrate data curation across the enterprise. I only wish we had Dataplex three years ago.” —Alexandre Cote, Product Leader with ADEO

Imagine you have several such data domains in your organization. With Dataplex you can logically organize your data and related artifacts, such as code, notebooks, and logs, into a Dataplex Lake, which represents a data domain. You can model all the data in a particular domain as a set of Dataplex Assets within a lake without physically moving data or storing it in a single storage system. Assets can refer to Cloud Storage buckets and BigQuery datasets stored in multiple Google Cloud projects, and can cover both analytics and operational data, structured and unstructured, that logically belongs to a single domain. Dataplex Zones enable you to group assets and add structure that captures key aspects of your data – its readiness, the workloads it is associated with, or the data products it is serving.

The lakes and data zones in Dataplex enable you to unify distributed data and organize it based on business context. This forms the foundation for managing metadata, setting up governance policies, monitoring data quality, and so on, giving you the ability to manage your distributed data at scale.
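As a rough illustration of this lake/zone/asset organization, here is a minimal sketch using the google-cloud-dataplex Python client. The project, location, and resource IDs are placeholders, and exact field names may differ between client library releases, so treat this as a starting point rather than a definitive implementation.

```python
# Minimal sketch: organizing an existing BigQuery dataset into a Dataplex
# lake and zone. Project, location, and resource IDs are placeholder
# assumptions, and field names may vary between google-cloud-dataplex releases.
from google.cloud import dataplex_v1

client = dataplex_v1.DataplexServiceClient()
parent = "projects/my-project/locations/us-central1"

# 1. Create a lake to represent a data domain (here, a hypothetical "sales" domain).
lake_op = client.create_lake(
    parent=parent,
    lake_id="sales-domain",
    lake=dataplex_v1.Lake(display_name="Sales domain"),
)
lake = lake_op.result()  # waits for the long-running operation

# 2. Create a curated zone inside the lake for analytics-ready data.
zone_op = client.create_zone(
    parent=lake.name,
    zone_id="curated-zone",
    zone=dataplex_v1.Zone(
        type_=dataplex_v1.Zone.Type.CURATED,
        resource_spec=dataplex_v1.Zone.ResourceSpec(
            location_type=dataplex_v1.Zone.ResourceSpec.LocationType.SINGLE_REGION
        ),
    ),
)
zone = zone_op.result()

# 3. Attach an existing BigQuery dataset to the zone as an asset,
#    without moving or copying the underlying data.
asset_op = client.create_asset(
    parent=zone.name,
    asset_id="sales-dataset",
    asset=dataplex_v1.Asset(
        resource_spec=dataplex_v1.Asset.ResourceSpec(
            type_=dataplex_v1.Asset.ResourceSpec.Type.BIGQUERY_DATASET,
            name="projects/my-project/datasets/sales",
        ),
    ),
)
print(asset_op.result().name)
```

Because the asset only references the existing BigQuery dataset, the underlying data stays where it is; Dataplex discovers and catalogs it in place.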
Now let’s take a look at one of the domains in a little more detail.

Automatically discover metadata across data sources: Dataplex provides metadata management and cataloging that enables all members of the domain to easily search, browse, and discover the tables and filesets, as well as augment them with business- and domain-specific semantics. Once data is added as assets, Dataplex automatically extracts the associated metadata and keeps it up to date as the data evolves. This metadata is made available for search, discovery, and enrichment via integration with Data Catalog.

Enable interoperability of tools: The metadata curated by Dataplex is automatically made available as runtime metadata to power federated open source analytics via Apache SparkSQL, HiveQL, Presto, and so on. Compatible metadata is also automatically published as external tables in BigQuery to enable federated analytics via BigQuery.

Govern data at scale: Dataplex enables data administrators and stewards to consistently and scalably manage their IAM data policies to control data access across distributed data. It provides the ability to centrally govern data across domains while enabling autonomous and delegated ownership of data, and to manage reader/writer permissions on the domains and the underlying physical storage resources. Dataplex integrates with Stackdriver to provide observability, including audit logs, data metrics, and logs.

Enable access to high-quality data: Dataplex provides built-in data quality rules that can automatically surface issues in your data. You can run these rules as data quality tasks across your data in BigQuery and GCS.

One-click data exploration: Dataplex provides data engineers, data scientists, and data analysts with a built-in, self-serve, serverless data exploration experience to interactively explore data and metadata, iteratively develop scripts, and deploy and monitor data management workloads. It provides content management across SQL scripts and Jupyter notebooks that makes it easy to create domain-specific code artifacts and share or schedule them from that same interface.

Data management: You can also leverage the built-in data management tasks that address common needs such as tiering, archiving, or refining data. Dataplex integrates with Google Cloud’s native data tools such as Dataproc Serverless, Dataflow, Data Fusion, and BigQuery to provide an integrated data management platform.

With data, metadata, policies, code, interactive and production analytics infrastructure, and data monitoring brought together, Dataplex delivers on the core value proposition of a data mesh: data as the product.

“Consistent data management and governance of distributed data remains a top priority for most of our clients today. Dataplex enables a business-centric data mesh architecture and significantly lowers the administrative overhead associated with managing, monitoring, and governing distributed data. We are excited to collaborate with the Dataplex team to enable enterprise clients to be more data-driven and accelerate their digital transformation journeys.” —Navin Warerkar, Managing Director, Deloitte Consulting LLP, and US Google Cloud Data & Analytics GTM Leader

Next steps

Get started with Dataplex today by using this quickstart guide or this data mesh tutorial, or contact the Google Cloud sales team.

Related Article: Introducing Dataplex—an intelligent data fabric for analytics at scale – Dataplex unifies distributed data to help automate data management and power analytics at scale.
Source: Google Cloud Platform

Amazon CloudWatch Container Insights adds support for Amazon EKS Fargate using AWS Distro for OpenTelemetry

Starting today, Amazon CloudWatch Container Insights supports collecting metrics for applications that you run on Amazon Elastic Kubernetes Service (EKS) with AWS Fargate, using AWS Distro for OpenTelemetry (ADOT). ADOT is a secure, AWS-supported distribution of the OpenTelemetry project. Customers can now easily collect EKS Fargate metrics such as CPU, memory, disk, and network, and analyze them alongside their other container metrics in Amazon CloudWatch. This lets customers view the performance and resource utilization of their applications directly in the Container Insights console in CloudWatch.
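As a small illustration of consuming these metrics programmatically, the sketch below reads average pod CPU utilization from CloudWatch with boto3. The region, cluster, namespace, and pod names are placeholders, and the metric and dimension names assume what Container Insights typically publishes under the ContainerInsights namespace; verify them against your own account before relying on this.

```python
# Minimal sketch: reading a Container Insights metric for an EKS Fargate pod
# with boto3. Region, cluster, namespace, and pod names are placeholder
# assumptions; the metric/dimension names assume the standard ContainerInsights
# namespace used by Container Insights.
from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch", region_name="eu-central-1")

now = datetime.now(timezone.utc)
response = cloudwatch.get_metric_statistics(
    Namespace="ContainerInsights",
    MetricName="pod_cpu_utilization",
    Dimensions=[
        {"Name": "ClusterName", "Value": "my-fargate-cluster"},
        {"Name": "Namespace", "Value": "default"},
        {"Name": "PodName", "Value": "my-app"},
    ],
    StartTime=now - timedelta(hours=1),
    EndTime=now,
    Period=300,             # 5-minute buckets
    Statistics=["Average"],
)

for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], round(point["Average"], 2), "% CPU")
```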
Source: aws.amazon.com

AWS introduces s2n-quic, a new open-source implementation of the QUIC protocol

We are pleased to announce that s2n-quic, an open-source Rust implementation of the QUIC protocol, is now available among AWS's open-source encryption libraries. We are also renaming s2n, AWS's open-source C implementation of the TLS protocol, to s2n-tls. s2n-quic offers a fast, small API that puts simplicity first. Because it is written in Rust, it benefits from advantages such as performance, thread safety, and memory safety. For the TLS 1.3 handshake, s2n-quic depends on either s2n-tls or rustls, an open-source implementation of TLS for Rust.
Source: aws.amazon.com