Januar 2019 - Seite 18 von 89 - Cloud Computing Köln

We’re pleased to announce that The Site Reliability Workbook is available in HTML now! Site Reliability Engineering (SRE), as it has come to be generally defined at Google, is what happens when you ask a software engineer to solve an operational problem. SRE is an essential part of engineering at Google. It’s a mindset, and a set of practices, metrics, and prescriptive ways to ensure systems reliability. The new workbook is designed to give you actionable tips on getting started with SRE and maturing your SRE practice. We’ve included links to specific chapters of the workbook that align with our tips throughout this post.We’re often asked what implementing SRE means in practice, since our customers face challenges quantifying their success when setting up their own SRE practices. In this post, we’re sharing a couple of checklists to be used by members of an organization responsible for any high-reliability services. These will be useful when you’re trying to move your team toward an SRE model. Implementing this model at your organization can benefit both your services and teams due to higher service reliability, lower operational cost, and higher-value work for the humans.But how can you tell how far you have progressed along this journey? While there is no simple or canonical answer, you can see below a non-exhaustive list to check your progress, organized as checklists by ascending order of maturity of a team. Within every checklist, the items are roughly in chronological order, but we do recognize that any given team’s actual needs and priorities may vary.If you’re part of a mature SRE team, these checklists can be useful as a form of industry benchmark, and we’d love to encourage others to publish theirs as well. Of course, SRE isn’t an exact science, and challenges arise along the way. You may not get to 100% completion of the items here, but we’ve learned at Google that SRE is an ongoing journey. SRE: Just getting startedThe following three practices are key principles of SRE, but can largely be adopted by any team responsible for production systems, regardless of its name, before and in parallel to staffing an SRE team.Some service-level objectives (SLOs) have been defined (jointly with developers and business owners, if you aren’t part of one of these groups) and are met most months.There’s a culture of authoring blameless postmortems.There’s a process to manage production incidents. It may be company-wide.Beginner SRE teamsMost, if not all, SRE teams at Google have established the following practices and characteristics. We generally view these as fundamental to an effective SRE team, unless there are good reasons why they aren’t feasible for a specific team’s circumstances.A staffing and hiring plan is in place and funding has been approved.Once staffed, the team may be on-call for some services while taking at least part of the operational load (toil).There is documentation for the release process, service setup, teardown (and failover, if applicable).A canary process for releases has been evaluated as a function of the SLO.A rollback mechanism is in place where it’s applicable (though it’s understood that this is a nontrivial exercise when mobile applications are involved, for example).An operational playbook/runbook should exist, even if not complete.Theoretical (role-playing) disaster recovery testing takes place, at least annually.SRE plans and executes project work, which may not be immediately visible by their developer counterparts, such as operational load reduction efforts that may not need developer buy-in.The following practices are also common for SRE teams starting out. If they don’t exist, that can be a sign of poor team health and sustainability issues:Enough on-call load to exercise incident response procedures on a regular (i.e., weekly) basis.An SRE team charter that’s been reviewed by the appropriate leadership beyond SRE (i.e., CTO).Periodic meetings between SRE and developer leadership to discuss issues and goals and share information.Project planning and execution is done jointly by developers and SRE. SRE work and positive impact is visible to developer leadership.Intermediate SRE teamsThese characteristics are common in mature teams and generally indicate that the team is taking a proactive approach to efficient management of its services.There are periodic reviews of SRE project work and impact with business leaders.There are periodic reviews of SLIs and SLOs with business leaders.There’s a low volume of toil overall; <=50% can be measured beyond “just” low on-call load. The team establishes an approach regarding configuration changes that takes reliability into account. SREs have established a plan to scale impact beyond adding scope or services to their on-call load.There’s a rollback mechanism in case of canary failures. It may be automated.There is periodic testing of incident management, using a combination of role-playing with some automation in place.There’s an escalation policy tied to SLO violations; this might be a release process freeze/unfreeze, or something else. Check out our previous post on the possible consequences of SLO violations.There are periodic reviews of postmortems and action items that are shared between developers and SRE.Disaster recovery is periodically tested against non-production environments.Teams measure demand vs. capacity and use active forecasting to determine when demand might exceed capacity.The SRE team may produce long-term plans (i.e., a yearly roadmap) jointly with devs.Advanced SRE teamsThese practices are common in more senior teams, or sometimes can be achieved when an organization or set of SRE teams share a broader charter.At least some individuals on the team can claim major positive impact on some aspect of the business beyond firefighting or ops.Project work can be and is often executed horizontally, positively impacting many services at once as opposed to linearly or worse per service.Most service alerts are based on SLO burn rate.Automated disaster recovery testing is in place and positive impact can be measured.Another set of SRE “features” which may be desirable but unlikely to be implemented by most companies are:SREs are not on-call 24×7. SRE teams are geographically distributed in two locations, such as U.S. and Europe. It’s worth pointing out that neither half is treated as secondary.SRE and developer organizations share common goals and may have separate reporting chains up to SVP level or higher. This arrangement helps to avoid conflicts of interest.What should I do next?Once you’ve looked through these checklists, your next step is to think about whether they match your company’s needs.For those without an SRE team where most of the beginner list is unfilled, we’d highly recommend reading the associated SRE Workbook chapters in the order they have been presented. If you happen to be a Google Cloud Platform (GCP) customer and would like to request CRE involvement, contact your account manager to apply for this program. But to be clear, SRE is a methodology that will work on a huge variety of infrastructures, and using Google Cloud is not a prerequisite for pursuing this set of engineering practices.We’d also recommend attending existing conferences and organizing summits with other companies in order to share best practices on how to solve some of the blockers, such as recruiting.We have also seen teams struggling to fill out the advanced list because of churn. The rate of systems and personnel changes may be a deterrent to get there. In order to avoid teams reverting to the beginner stage and other problems, our SRE leadership reviews key metrics per team every six months. The scope is more narrow than the checklists above because several of the items have now become standard.As you may have guessed by now, answering the central question in this article involves addressing and attempting to assess a given team’s impact, health, and most importantly, how the actual work is done. After all, as we wrote in our first book on SRE: “If we are engineering processes and solutions that are not automatable, we continue having to staff humans to maintain the system. If we have to staff humans to do the work, we are feeding the machines with the blood, sweat, and tears of human beings.”So yes, you might have an SRE team already. Is it effective? Is it scalable? Are people happy? Wherever you are in your SRE journey, you can likely continue to evolve, grow and hone your team’s work and your company’s services. Learn more here about getting started building an SRE team.Thanks to Adrian Hilton, Alec Warner, David Ferguson, Eric Harvieux, Matt Brown, Myk Taylor, Stephen Thorne, Todd Underwood and Vivek Rau among others for their contributions to this post.
Quelle: Google Cloud Platform

26. Januar 2019

da Agency

Otto Group CLASH: an open-source tool to run bash scripts directly on GCP

Editor’s note: Founded in Germany in 1949, today the Otto Group is a globally active retail and services group with around 51,800 employees and generated revenues of 13.7 billion euros. Today, business intelligence experts Dr. Mahmoud Reza Rahbar Azad and Mike Czech describe an open-source tool they built to run bash-based data processing scripts directly in Google Cloud. Read on to learn why they built it, how they built it, and how you can use it in your own environment.We here at Otto Group Business Intelligence build machine learning and data-driven products for online retailers such as otto.de or aboutyou.de, to enhance our customers’ user experience. A part of that is a big data lake that we recently migrated to Google Cloud Platform (GCP). As data engineers, we sometimes need to perform data processing jobs. Since these jobs can take a long time or require a lot of compute power, we didn’t want to perform these tasks on a local machine or via a web frontend: we wanted a tool that uses the full power of GCP.A few months back, we were at a point where we understood our requirements but couldn’t find a good tool to fulfill them. So we built it ourselves: During a recent internal hacking day, we wrote CLoud bASH, or CLASH, which takes a bash script as an input and simply runs it inside a cloud environment.Running scalable data processing scripts in the cloudBefore we dive into the nitty gritty details, let me give you a little bit of a background about what we do and why we built CLASH.As mentioned above, we needed a tool that takes a bash script as an input and simply runs it inside a cloud environment. The user should have the option to either wait for the result or to be notified asynchronously when the job is finished. If the user waits for the result, log messages from the script should be forwarded to the user console and the user can cancel the job execution. This feature comes in very handy during fast development iteration cycles.The following image illustrates what we roughly had in mind.How we built itWe quickly came up with two implementations built on GCP. The first one was based on Google Kubernetes Engine (GKE), the other on Google Compute Engine. We expected the GKE variant to be a simple ‘one size fits all’ solution, whereas Compute Engine to be more customizable, allowing us, for instance, to attach a GPU to the compute unit, for additional performance.Since Kubernetes already brings a lot of scheduling primitives to the table, it was very easy to get a prototype up and running quickly. The following image shows the CLASH architecture running on GKE:The user calls the CLASH CLI to submit the scripts.sh job. Internally, CLASH utilises the gcloud CLI to spin up a Kubernetes cluster and afterwards uses kubectl to deploy the contents of the script as a ConfigMap as well as a Kubernetes job. The container logs of the job are automatically saved to Stackdriver as well as forwarded to the user’s terminal via kubectl logs. For example, here is a simple “hello world” example in the terminal:While this architecture fulfilled our requirements, it had some drawbacks. Not every user has a Kubernetes cluster lying around, so we had to spin up a cluster every time we wanted to run a job, which actually can take quite a while. Secondly, if a job only needs an individual node, we end up with a single-node node-pool. But what if a second job has different resource requirements? We would try to reuse the same cluster, but had to create again a second single-node node-pool. So while Kubernetes’ orchestration features are very nice, we switched gears and chose to go with the more straightforward Compute Engine approach. Here is the CLASH architecture on Compute Engine.From the user perspective, the functionality of this approach is the same as before, but this time, instead of spinning up a Kubernetes node, it spawns a GCE instance. We take care that the VM has the Docker engine installed so that the bash script can run again inside a docker container—more on this later. Then, since doing SSH into a machine is considered undesired overhead, we also decided not to integrate CLASH with a PKI. Instead, we use Cloud Pub/Sub to get notified about the result of a job and Stackdriver for the job logs. After the job finishes we initiate an automatic VM shutdown.We also reused the clash init function that we developed for the GKE-based deployment. The init command creates a configuration file where you can tune a lot of aspects when executing a CLASH job. A basic configuration file looks like this:The most prominent configuration is the machine_type which lets you specify how much resources the Compute Engine instance should provide, as well as basic region and networking configurations. Because CLASH needs Docker as well as the gcloud CLI present on the target machine, the fields disk_image and container_image are pre-populated accordingly. The actual script can then be deployed via cloud-init without any SSH connection to the machine. Another feature we built early on in CLASH is templating support for the configuration file using Jinja2. With this feature you can reuse the same configuration and overwrite single fields via an environment variable as shown in the example configuration with MACHINE_TYPE and PROJECT_ID.Using this design led to good results. Altogether the time to provision the infrastructure is between three and five minutes, which is manageable. For repetitive jobs we added the option to reuse an instance by specifying an instance-id. We noticed that the implementation of a job scheduling feature in the way of Kubernetes cron jobs was quite a hassle, so we dropped it for now—especially given the fact that GCP offers great services like Cloud Scheduler and Cloud Tasks.Using CLASH in the wildNow let’s dive into some use cases.One of the early use cases for CLASH was to run data synchronization jobs for BigQuery, running a script to shovel data via the bq command-line tool from one source to another. Nowadays this use case is covered by BigQuery’s scheduling feature, but that wasn’t available to us at the time. Even though the bq command is quite simple, it can also take quite a long time to complete, making it a poor fit for using Cloud Functions.Another use case is importing compressed data from a Google Cloud Storage bucket. We have set up a data importing pipeline where a new archive in a bucket triggers a cloud function which then triggers CLASH in detached mode to call the actual importing script. The script then unpacks the archive, performs some consistency checks, potentially does some data filtering and cleaning, and finally archives the result back into the target bucket.Finally, yet another use case for CLASH is specific to data scientists, namely model training. When we push new code to a model repository, we want to be able to perform regression tests for different model versions, so we have to train the model against a dataset. For obvious reasons we don’t want to do this in our CI environment, so we use CLASH to spin up a high-mem instance, perform the model training and save the model in a bucket where we can pick it up later for further investigation. We built this workflow with Google Cloud Composer, integrated CLASH via a ComputeEngineJobOperator into Airflow, and then used it in our Airflow pipelines.An example pipeline with the corresponding DAG definition looks like this:As mentioned, we use Cloud Pub/Sub to notify CLASH once a job has finished. Hence it is possible to subscribe a cloud function to the model training topic and on a successful event, trigger another CLASH job that does the regression test automatically. This is something we are currently thinking about. This also shows the potential of building workflows by combining CLASH with existing Google Cloud services.Wrap upWe wanted to share CLASH as open source because it’s a really useful and adaptive tool. We hope you can find other use cases for it in its current state. We do plan to improve CLASH and smooth out some rough edges in the future. As with any open-source software, contributions and discussions are always welcome, so please go ahead and give CLASH a try. You can find the CLASH source code here: https://github.com/ottogroup/clash.
Quelle: Google Cloud Platform