How’s your cloud confidence?

I learned a lot at IBM InterConnect this year, but most importantly, I learned how cloud and having super powers are surprisingly similar concepts.
I had a great time speaking with clients at the Cloud Confidence Center on the concourse. The central feature of the space was our cloud story string board, which was a real draw and a fantastic conversation starter to boot.
Cloud adoption leaders asked attendees to talk about their roles, their experiences with cloud so far, what they would like to do with cloud technology, what areas of cloud they would like to learn more about, and what super power they either have or would like to have.
I noticed interesting patterns in the super powers people wanted. Agility came to the fore for those interested in how cloud could help them speed up internal processes. Invisibility was key for those who saw how cloud was helping their IT systems stay up and running. Those interested in finding out more about Watson generally wanted super intelligence as their super power.
Using the board, we were able to take conversations to the next level. The area featured each component of IBM Cloud Technical Engagement, so as well as cloud adoption leaders, we had people available from the Bluemix Garage, Cloud Professional Services, Solution Architecture and our support teams, ready to help clients expand their stories, gain deeper knowledge and find paths forward.
We tracked hundreds of stories, and I noticed a few trends among the people I spoke to. To start, more people have already begun their journey to the cloud, so fewer are looking for help with the first steps. Many are now looking for help with fully adopting cloud in their organizations.
While there was still a lot of focus on moving existing workloads to the cloud, there were also many people who were looking to create their first “born-on-the-cloud” applications as well as use the cloud to improve business processes and extend their on-premises infrastructure with hybrid cloud.
The most popular topics for further learning centered around Watson, which is perhaps unsurprising, as organizations are now starting to realize the power of cognitive within their applications. They’ve been seeing the ease with which Watson APIs can be implemented in Bluemix applications.
Blockchain and Internet of Things (IoT) were big topics of conversation, along with DevOps, process transformation and Bluemix Infrastructure. I had many conversations about containers and microservices, too, with customers keen to understand how they can take advantage of technologies such as Docker, Kubernetes and OpenWhisk within their organization.
The best thing about the board was that it was a real talking point and a focus for visitors, who were themselves taking a minute to look at the patterns that were emerging. I think it may also have been the most photographed exhibit.
Missed an InterConnect keynote or want to watch again? Catch up on IBMGO.
Source: Thoughts on Cloud

Reliable releases and rollbacks – CRE life lessons

By Adrian Hilton, Customer Reliability Engineer

Editor’s note: One of the most common causes of service outages is releasing a new version of the service binaries; no matter how good your testing and QA might be, some bugs only surface when the affected code is running in production. Over the years, Google Site Reliability Engineering has seen many outages caused by releases, and now assumes that every new release may contain one or more bugs.

As software engineers, we all like to add new features to our services; but every release comes with the risk of something breaking. Even assuming that we are appropriately diligent in adding unit and functional tests to cover our changes, and undertaking load testing to determine if there are any material effects on system performance, live traffic has a way of surprising us. These are rarely pleasant surprises.

The release of a new binary is a common source of outages. From the point of view of the engineers responsible for the system’s reliability, that translates to three basic tasks:

Detecting when a new release is actually broken;
Moving users safely from a bad release to a “hopefully” fixed release; and
Preventing too many clients from suffering through a bad release in the first place (“canarying”).

For the purpose of this analysis, we’ll assume that you are running many instances of your service on machines or VMs behind a load balancer such as nginx, and that upgrading your service to use a new binary will involve stopping and starting each service instance.

We’ll also assume that you monitor your system with something like Stackdriver, measuring internal traffic and error rates. If you don’t have this kind of monitoring in place, then it’s difficult to meaningfully discuss reliability; per the Hierarchy of Reliability described in the SRE Book, monitoring is the most fundamental requirement for a reliable system.

Detection
The best case for a bad release is one where, as soon as a service instance is restarted with it, a large fraction of requests are handled improperly, generating errors such as HTTP 502 or much higher response latencies than normal. In this case, your overall service error rate rises quickly as the rollout progresses through your service instances, and you realize that your release has a problem.

A more subtle case is when the new binary returns errors on a relatively small fraction of queries – say, only for user setting change requests, or only for users whose name contains an apostrophe (for good or bad reasons). With this failure mode, the problem may only become apparent in your overall monitoring once the majority of your service instances have been upgraded. For this reason, it can be useful to have error and latency summaries for your service instances broken down by binary release version.
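
To make that breakdown concrete, here is a minimal sketch of version-labelled instrumentation. It assumes a Python service using the prometheus_client library (the post itself assumes Stackdriver, so treat this purely as an illustration of the idea); the metric names, the RELEASE_VERSION environment variable and the do_work callback are all illustrative:

```python
import os
import time

from prometheus_client import Counter, Histogram, start_http_server

# Release identifier assumed to be injected at deploy time,
# e.g. via an environment variable baked into the image.
RELEASE = os.environ.get("RELEASE_VERSION", "unknown")

REQUESTS = Counter("http_requests_total", "Requests handled",
                   ["release", "status"])
LATENCY = Histogram("http_request_latency_seconds", "Request latency",
                    ["release"])

def handle(request, do_work):
    """Wrap the application logic so every request is counted per release."""
    start = time.time()
    try:
        response = do_work(request)
        REQUESTS.labels(RELEASE, "2xx").inc()
        return response
    except Exception:
        REQUESTS.labels(RELEASE, "5xx").inc()
        raise
    finally:
        LATENCY.labels(RELEASE).observe(time.time() - start)

# Expose /metrics once at startup so the monitoring system can scrape it.
start_http_server(9100)
```

With the release label in place, a dashboard can chart error rate and latency per binary version, and a bad release shows up as soon as the first upgraded instances start taking traffic.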

Rollbacks
Before you plan to roll out a new binary or image to your service, you should ask yourself, “What will I do if I discover a catastrophic / debilitating / annoying bug in this release?” Not because it might happen, but because sooner or later it is going to happen, and it is better to have a well-thought-out plan in place than to try to make one up while your service is on fire.

The temptation for many bugs, particularly if they are not show-stoppers, is to build a quick patch and then “roll forward,” i.e., make a new release that consists of the original release plus the minimal code change necessary to fix the bug (a “cherry-pick” of the fix). We don’t generally recommend this though, especially if the bug in question is user-visible or causing significant problems internally (e.g., doubling the resource cost of queries).

What’s wrong with rolling forward? Put yourself in the shoes of the software developer: your manager is bouncing up and down next to your desk, blood pressure visibly climbing, demanding to know when your fix is going to be released because she has your company’s product director bending her ear about all the negative user feedback he’s getting. You’re coding the fix as fast as humanly possible, because for every minute it’s down another thousand users will see errors in the service. Under this kind of pressure, coding, testing or deployment mistakes are almost inevitable.

We have seen this at Google any number of times, where a hastily deployed roll-forward fix either fails to fix the original problem, or indeed makes things worse. Even if it fixes the problem it may then uncover other latent bugs in the system; you’re taking yourself further from a known-good state, into the wilds of a release that hasn’t been subject to the regular strenuous QA testing.

At Google, our philosophy is that “rollbacks are normal.” When an error is found or reasonably suspected in a new release, the releasing team rolls back first and investigates the problem second. A request for a rollback is not interpreted as an attack on the releasing team, or even the person who wrote the code containing the bug; rather, it is understood as The Right Thing To Do to make the system as reliable as possible for the user. No-one will ask “why did you roll back this change?” as long as the rollback changelist describes the problem that was seen.

Thus, for rollbacks to work, the implicit assumption is that they are:

easy to perform; and
trusted to be low-risk.

How do we make the latter true?

Testing rollbacks
If you haven’t rolled back in a few weeks, you should do a rollback “just because”; aim to find any traps with incompatible versions, broken automation or testing, and so on. If the rollback works, just roll forward again once you’ve checked all your logs and monitoring. If it breaks, roll forward to remove the breakage and then focus all your efforts on diagnosing the cause of the rollback breakage. It is far better to detect this when your new release is working well than to be forced off a release that is on fire and have to fight your way back to your known-good original release.
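
If you want to make such drills routine, a small wrapper script can take the ceremony out of them. The sketch below is hypothetical: the `deployctl` command and its arguments stand in for whatever rollout tooling you actually use.

```python
import subprocess
import sys

def run(cmd):
    """Run a deployment command, echoing it first for the record."""
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def rollback_drill(service, current, previous):
    """Exercise the rollback path while the current release is healthy."""
    run(["deployctl", "rollout", service, "--version", previous])
    input(f"Check logs and monitoring for {service} at {previous}, "
          "then press Enter to roll forward again...")
    run(["deployctl", "rollout", service, "--version", current])

if __name__ == "__main__":
    service, current, previous = sys.argv[1:4]
    rollback_drill(service, current, previous)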

Incompatible changes
Inevitably, there are going to be times when a rollback is not straightforward. One example is when the new release requires a schema change to an in-app database (such as a new column). The danger is that you release the new binary, upgrade the database schema, and then find a problem with the binary that necessitates rollback. This leaves you with a binary that doesn’t expect the new schema, and hasn’t been tested with it.

The approach we recommend here is a feature-free release; starting from version v of your binary, build a new version v+1 which is identical to v except that it can safely handle the new database schema. The new features that make use of the new schema are in version v+2. Your rollout plan is now:

Release binary v+1
Upgrade database schema
Release binary v+2

Now, if there are any problems with either of the new binaries then you can roll back to a previous version without having to also roll back the schema.
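
To make the v+1 step concrete, here is a minimal sketch of schema-tolerant code, using Python’s built-in sqlite3 module; the `users` table and the new `display_name` column are hypothetical examples:

```python
import sqlite3

def get_user(conn: sqlite3.Connection, user_id: int) -> dict:
    # v+1: tolerate the schema both with and without the new
    # 'display_name' column, so rolling the schema (or the binary)
    # back or forward never breaks this query.
    columns = {row[1] for row in conn.execute("PRAGMA table_info(users)")}
    if "display_name" in columns:
        row = conn.execute(
            "SELECT id, email, display_name FROM users WHERE id = ?",
            (user_id,)).fetchone()
        return {"id": row[0], "email": row[1], "display_name": row[2]}
    row = conn.execute(
        "SELECT id, email FROM users WHERE id = ?", (user_id,)).fetchone()
    # Old schema: the column does not exist yet, so report a default.
    return {"id": row[0], "email": row[1], "display_name": None}
```

Version v+2 can then assume the column exists and build the user-visible feature on top of it.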

This is a special case of a more general problem. When you build the dependency graph of your service and identify all its direct dependencies, you need to plan for the situation where any one of your dependencies is suddenly rolled back by its owners. If your launch is waiting for a dependency service S to move from release r to r+1, you have to be sure that S is going to “stick” at r+1. One approach here is to make an ecosystem-wide assumption that any service could be rolled back by one version, in which case your service would wait for S to reach version r+2 before moving to a version that depends on a feature in r+1.
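
That reasoning can be captured in a simple guard: assuming you can observe a dependency’s current version, only enable a feature once a one-version rollback of that dependency can no longer take the prerequisite away. A trivial sketch, with illustrative names:

```python
def safe_to_depend_on(observed_version: int, feature_version: int,
                      rollback_depth: int = 1) -> bool:
    """True once the dependency has advanced far enough past the release
    that introduced the feature that a rollback of rollback_depth versions
    still leaves the feature available."""
    return observed_version >= feature_version + rollback_depth

# Example: the feature ships in S release r+1; with the usual one-version
# rollback assumption we wait until S is observed at r+2.
r = 41
assert not safe_to_depend_on(observed_version=r + 1, feature_version=r + 1)
assert safe_to_depend_on(observed_version=r + 2, feature_version=r + 1)
```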

Summary
We’ve learned that there’s no good rollout without a corresponding rollback ready to go, but how can we know when to roll back without having our entire service burned to the ground by a bad release?

In part 2 we’ll look at the strategy of “canarying” to detect real production problems without risking the bulk of your production traffic on a new release.
Source: Google Cloud Platform

The journey of a new OpenStack service in RDO

When new contributors join RDO, they ask for recommendations about how to add new services and help RDO users adopt them. This post is not an official policy document nor a detailed description of how to carry out every activity, but it offers some high-level recommendations to newcomers based on what I have learned and observed over the last year working on RDO.

Note that you are not required to follow all these steps, and you may well have your own ideas about the process. If you want to discuss it, let us know your thoughts; we are always open to improvements.

1. Add the package to RDO

The first step is to add the package(s) to the RDO repositories as shown in the RDO documentation. This typically includes the main service package, the client library and perhaps a package with a plugin for horizon.

In some cases new packages require general-purpose libraries. If they are not in the CentOS base channels, RDO imports them from Fedora into a dependencies repository. If you need a new dependency which already exists in Fedora, just let us know and we’ll import it into the repo. If it doesn’t exist yet, you’ll have to add the new package to Fedora following the existing process.

2. Create a puppet module

Although there are multiple deployment tools for OpenStack based on several frameworks, puppet is widely used by different tools, and even directly by operators, so we recommend creating a puppet module to deploy your new service, following the Puppet OpenStack Guide. Once the puppet module is ready, remember to follow the RDO new package process to get it packaged in the repos.

3. Make sure the new service is tested in RDO-CI

As explained in a previous post, we run several jobs in RDO CI to validate the content of our repos. Most of the time, the easiest way to get a new service tested is to add it to one of the puppet-openstack-integration scenarios, which is also recommended for getting the puppet module tested in the upstream gates. An example of how to add a new service to p-o-i is in this review.

4. Add deployment support in Packstack

If you want to make it easier for RDO users to evaluate a new service, adding it to Packstack is a good idea. Packstack is a puppet-based deployment tool used by RDO users to deploy small proof-of-concept (PoC) environments to evaluate new services or configurations before deploying them in their production clouds. If you are interested, you can take a look at these two reviews, which added support for Panko and Magnum in the Ocata cycle.

5. Add it to TripleO

TripleO is a powerful OpenStack management tool able to provision and manage cloud environments with production-ready features such as high availability, extended security and so on. Adding support for new services in TripleO will help users adopt them in their cloud deployments. The TripleO composable roles tutorial can guide you through how to do it.

6. Build containers for new services

Kolla is the upstream project providing container images and deployment tools to operate OpenStack clouds using container technologies. Kolla supports building images for the CentOS distro using the binary method, which uses packages from RDO. Operators who use containers will have an easier time if you add container images for new services.

Other recommendations

Follow OpenStack governance policies

RDO’s methodology and tooling are built around the upstream OpenStack release model, so following the policies on release management and requirements is a big help in maintaining packages in RDO. It is especially important to create branches and version tags as defined by the releases team.

Advertise your work to the RDO community

Making potential users aware of the availability of new services or other improvements is good practice. RDO provides several ways to do this, such as sending mail to our mailing lists, writing a post on the blog, adding references in our documentation, creating screencast demos and so on. You can also join the RDO weekly meeting to let us know about your work.

Join RDO Test Days

RDO organizes test days at several milestones during each OpenStack release cycle. Although we do continuous integration testing in RDO, it’s good to verify that new services can be deployed by following the instructions in the documentation. You can propose new services or configurations for the test matrix and add a link to the documented instructions on how to deploy them.

Upstream documentation

RDO relies on the upstream OpenStack Installation Guide for deployment instructions, so keeping it up to date is recommended.
Source: RDO