How Bitmovin is Doing Multi-Stage Canary Deployments with Kubernetes in the Cloud and On-Prem

Editor's Note: Today's post is by Daniel Hoelbling-Inzko, Infrastructure Architect at Bitmovin, a company that provides services that transcode digital video and audio to streaming formats, sharing insights about their use of Kubernetes.

Running a large-scale video encoding infrastructure on multiple public clouds is tough. At Bitmovin, we have been doing it successfully for the last few years, but from an engineering perspective, it has been neither enjoyable nor particularly fun. So one of the main things that really sold us on Kubernetes was its common abstraction across the different supported cloud providers and the well-thought-out programming interface it provides. More importantly, the Kubernetes project did not settle for the lowest-common-denominator approach. Instead, it added the abstract concepts that are required and useful to run containerized workloads in a cloud, and then did all the hard work of mapping these concepts to the different cloud providers and their offerings.

The great stability, speed and operational reliability we saw in our early tests in mid-2016 made the migration to Kubernetes a no-brainer. And it didn't hurt that the vision for scale the Kubernetes project has been pursuing is closely aligned with our own goals as a company. Aiming for >1,000-node clusters might be a lofty goal, but for a fast-growing video company like ours, having your infrastructure aim to support future growth is essential. After initial brainstorming for our new infrastructure, we knew we would be running a huge number of containers, and a system with the express goal of working at global scale was the perfect fit for us. Now, with the recent Kubernetes 1.6 release and its support for 5,000-node clusters, we feel even more validated in our choice of a container orchestration system.

During the testing and migration phase of getting our infrastructure running on Kubernetes, we got quite familiar with the Kubernetes API and the whole ecosystem around it. So when we were looking at expanding our cloud video encoding offering for customers to use in their own datacenters or cloud environments, we quickly decided to leverage Kubernetes as our ubiquitous cloud operating system to base the solution on.

Just a few months later, this effort has become our newest service offering: Bitmovin Managed On-Premise encoding. Since all Kubernetes clusters share the same API, adapting our cloud encoding service to also run on Kubernetes enables us to deploy into a customer's datacenter, regardless of the hardware infrastructure running underneath. With great tools from the community, like kube-up, and turnkey solutions, like Google Container Engine, anyone can easily provision a new Kubernetes cluster, either within their own infrastructure or in their own cloud accounts.

To give us maximum flexibility for customers that deploy to bare metal and might not have any custom cloud integrations for Kubernetes yet, we decided to base our solution solely on facilities that are available in any Kubernetes install and not to require any integration into the surrounding infrastructure (it will even run inside Minikube!). We don't rely on Services of type LoadBalancer, primarily because enterprise IT is usually reluctant to open up ports to the open internet – and not every bare-metal Kubernetes install supports externally provisioned load balancers out of the box.
To avoid these issues, we deploy a BitmovinAgent that runs inside the cluster and polls our API for new encoding jobs, without requiring any network setup. This agent then uses the locally available Kubernetes credentials to start up new deployments that run the encoders on the available hardware through the Kubernetes API.

Even without a full cloud integration available, the consistent scheduling, health checking and monitoring we get from using the Kubernetes API really enabled us to focus on making the encoder work inside a container, rather than spending precious engineering resources on integrating a bunch of different hypervisors, machine provisioners and monitoring systems.

Multi-Stage Canary Deployments

Our first encounters with the Kubernetes API were not for the On-Premise encoding product. Building our containerized encoding workflow on Kubernetes was rather a decision we made after seeing how incredibly easy and powerful the Kubernetes platform proved during the development and rollout of our Bitmovin API infrastructure. We migrated to Kubernetes around four months ago, and it has enabled us to provide rapid development iterations to our service while meeting our requirements of downtime-free deployments and a stable development-to-production pipeline. To achieve this, we came up with an architecture that runs almost a thousand containers and meets the requirements we had laid out on day one:

Zero downtime deployments for our customers
Continuous deployment to production on each git mainline push
High stability of deployed services for customers

Obviously, the second and third requirements are at odds with each other: if each merged feature gets deployed to production right away, how can we ensure these releases are bug-free and don't have adverse side effects for our customers? To overcome this contradiction, we came up with a four-stage canary pipeline for each microservice, where we simultaneously deploy to production and keep changes away from customers until the new build has proven to work reliably and correctly in the production environment.

Once a new build is pushed, we deploy it to an internal stage that's only accessible to our internal tests and the integration test suite. Once the internal test suite passes, QA reports no issues, and we don't detect any abnormal behavior, we push the new build to our free stage. This means that 5% of our free users get randomly assigned to this new build. After some time in this stage, the build gets promoted to the next stage, which gets 5% of our paid users routed to it. Only once the build has successfully passed all three of these hurdles does it get deployed to the production tier, where it receives all traffic from our remaining users as well as our enterprise customers, who are not part of the paid bucket and never see their traffic routed to a canary track.

This setup makes us a pretty big Kubernetes installation by default, since all of our canary tiers are available at a minimum replication of 2. Since we are currently deploying around 30 microservices (and growing) to our clusters, that adds up to a minimum of 10 pods per service (8 application pods + a minimum of 2 HAProxy pods that do the canary routing). In reality, though, our preferred standard configuration is usually 2 internal, 4 free, 4 others and 10 production pods alongside 4 HAProxy pods – totalling around 700 pods.
This also means that we are running at least 150 services that provide a static ClusterIP to their underlying microservice canary tier.

A typical deployment looks like this:

Services (ClusterIP)           Deployments                           #
account-service                account-service-haproxy               4
account-service-internal       account-service-internal-v1.18.0      2
account-service-canary         account-service-canary-v1.17.0        4
account-service-paid           account-service-paid-v1.15.0          4
account-service-production     account-service-production-v1.15.0    10

An example service definition for the production track has the following label selectors:

apiVersion: v1
kind: Service
metadata:
  name: account-service-production
  labels:
    app: account-service-production
    tier: service
    lb: private
spec:
  ports:
  - port: 8080
    name: http
    targetPort: 8080
    protocol: TCP
  selector:
    app: account-service
    tier: service
    track: production

In front of the Kubernetes services, load balancing the different canary versions of the service, lives a small cluster of HAProxy pods that get their haproxy.conf from Kubernetes ConfigMaps. It looks something like this:

frontend http-in
  bind *:80
  log 127.0.0.1 local2 debug

  acl traffic_internal    hdr(X-Traffic-Group) -m str -i INTERNAL
  acl traffic_free        hdr(X-Traffic-Group) -m str -i FREE
  acl traffic_enterprise  hdr(X-Traffic-Group) -m str -i ENTERPRISE

  use_backend internal   if traffic_internal
  use_backend canary     if traffic_free
  use_backend enterprise if traffic_enterprise
  default_backend paid

backend internal
  balance roundrobin
  server internal-lb        user-resource-service-internal:8080   resolvers dns check inter 2000

backend canary
  balance roundrobin
  server canary-lb          user-resource-service-canary:8080     resolvers dns check inter 2000 weight 5
  server production-lb      user-resource-service-production:8080 resolvers dns check inter 2000 weight 95

backend paid
  balance roundrobin
  server canary-paid-lb     user-resource-service-paid:8080       resolvers dns check inter 2000 weight 5
  server production-lb      user-resource-service-production:8080 resolvers dns check inter 2000 weight 95

backend enterprise
  balance roundrobin
  server production-lb      user-resource-service-production:8080 resolvers dns check inter 2000 weight 100

Each HAProxy inspects a header called X-Traffic-Group, assigned by our API gateway, that determines which bucket of customers a request belongs to. Based on that, a decision is made to hit either a canary deployment or the production deployment.

Obviously, at this scale, kubectl (while still our main day-to-day tool to work on the cluster) doesn't really give us a good overview of whether everything is actually running as it's supposed to, or what might be over- or under-replicated. Since we do blue/green deployments, we sometimes forget to shut down the old version after the new one comes up, so some services might be running over-replicated, and finding these issues in a soup of 25 deployments listed in kubectl is not trivial, to say the least. So having a container orchestrator like Kubernetes that is very API-driven was really a godsend for us, as it allowed us to write tools that take care of that.

We built tools that either run directly off kubectl (e.g. bash scripts) or interact directly with the API and understand our special architecture to give us a quick overview of the system. These tools were mostly built in Go using the client-go library.

One of these tools is worth highlighting, as it's basically our only way to really see service health at a glance.
It goes through all our Kubernetes services that have the tier: service selector and checks that the accompanying HAProxy deployment is available and all pods are running with 4 replicas. It also checks that the 4 services behind the HAProxys (internal, free, others and production) have at least 2 endpoints running. If any of these conditions are not met, we immediately get a notification in Slack and by email.

Managing this many pods with our previous orchestrator proved very unreliable, and the overlay network frequently caused issues. Not so with Kubernetes – even doubling our current workload for test purposes worked flawlessly, and in general the cluster has been working like clockwork ever since we installed it.

Another advantage of switching over to Kubernetes was the availability of the Kubernetes resource specifications, in addition to the API (which we used to write some internal tools for deployment). This enabled us to have a Git repo with all our Kubernetes specifications, where each track is generated off a common template and only contains placeholders for variable things like the canary track and the names. All changes to the cluster have to go through tools that modify these resource specifications and get checked into git automatically, so whenever we see issues, we can debug what changes the infrastructure went through over time!

To summarize this post – by migrating our infrastructure to Kubernetes, Bitmovin is able to have:

Zero downtime deployments, allowing our customers to encode 24/7 without interruption
Fast development-to-production cycles, enabling us to ship new features faster
Multiple levels of quality assurance and high confidence in production deployments
Ubiquitous abstractions across cloud architectures and on-premise deployments
Stable and reliable health-checking and scheduling of services
Custom tooling around our infrastructure to check and validate the system
History of deployments (resource specifications in git + custom tooling)

We want to thank the Kubernetes community for the incredible job they have done with the project. The velocity at which the project moves is just breathtaking! Maintaining such a high level of quality and robustness in such a diverse environment is really astonishing.

–Daniel Hoelbling-Inzko, Infrastructure Architect, Bitmovin
Source: kubernetes

Configuring Private DNS Zones and Upstream Nameservers in Kubernetes

Editor's note: this post is part of a series of in-depth articles on what's new in Kubernetes 1.6.

Many users have existing domain name zones that they would like to integrate into their Kubernetes DNS namespace. For example, hybrid-cloud users may want to resolve their internal ".corp" domain addresses within the cluster. Other users may have a zone populated by a non-Kubernetes service discovery system (like Consul). We're pleased to announce that, in Kubernetes 1.6, kube-dns adds support for configurable private DNS zones (often called "stub domains") and external upstream DNS nameservers. In this blog post, we describe how to configure and use this feature.

Default lookup flow

Kubernetes currently supports two DNS policies, specified on a per-pod basis using the dnsPolicy flag: "Default" and "ClusterFirst". If dnsPolicy is not explicitly specified, then "ClusterFirst" is used:

If dnsPolicy is set to "Default", then the name resolution configuration is inherited from the node the pods run on. Note: this feature cannot be used in conjunction with dnsPolicy: "Default".
If dnsPolicy is set to "ClusterFirst", then DNS queries will be sent to the kube-dns service. Queries for domains rooted in the configured cluster domain suffix (any address ending in ".cluster.local" in the example above) will be answered by the kube-dns service. All other queries (for example, www.kubernetes.io) will be forwarded to the upstream nameserver inherited from the node.

Before this feature, it was common to introduce stub domains by replacing the upstream DNS with a custom resolver. However, this caused the custom resolver itself to become a critical path for DNS resolution, where issues with scalability and availability could cause the cluster to lose DNS functionality. This feature allows the user to introduce custom resolution without taking over the entire resolution path.

Customizing the DNS Flow

Beginning in Kubernetes 1.6, cluster administrators can specify custom stub domains and upstream nameservers by providing a ConfigMap for kube-dns. For example, the configuration below inserts a single stub domain and two upstream nameservers. As specified, DNS requests with the ".acme.local" suffix will be forwarded to a DNS server listening at 1.2.3.4. Additionally, Google Public DNS will serve upstream queries. See ConfigMap Configuration Notes at the end of this section for a few notes about the data format.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"acme.local": ["1.2.3.4"]}
  upstreamNameservers: |
    ["8.8.8.8", "8.8.4.4"]

The diagram below shows the flow of DNS queries specified in the configuration above. With the dnsPolicy set to "ClusterFirst", a DNS query is first sent to the DNS caching layer in kube-dns. From here, the suffix of the request is examined and then forwarded to the appropriate DNS. In this case, names with the cluster suffix (e.g. ".cluster.local") are sent to kube-dns. Names with the stub domain suffix (e.g. ".acme.local") will be sent to the configured custom resolver.
Finally, requests that do not match any of those suffixes will be forwarded to the upstream DNS.

Below is a table of example domain names and the destination of the queries for those domain names:

Domain name                              Server answering the query
kubernetes.default.svc.cluster.local     kube-dns
foo.acme.local                           custom DNS (1.2.3.4)
widget.com                               upstream DNS (one of 8.8.8.8, 8.8.4.4)

ConfigMap Configuration Notes

stubDomains (optional)
Format: a JSON map using a DNS suffix key (e.g. "acme.local") and a value consisting of a JSON array of DNS IPs.
Note: The target nameserver may itself be a Kubernetes service. For instance, you can run your own copy of dnsmasq to export custom DNS names into the ClusterDNS namespace.

upstreamNameservers (optional)
Format: a JSON array of DNS IPs.
Note: If specified, then the values specified replace the nameservers taken by default from the node's /etc/resolv.conf.
Limits: a maximum of three upstream nameservers can be specified.

Example 1: Adding a Consul DNS Stub Domain

In this example, the user has a Consul DNS service discovery system they wish to integrate with kube-dns. The Consul domain server is located at 10.150.0.1, and all Consul names have the suffix ".consul.local". To configure Kubernetes, the cluster administrator simply creates a ConfigMap object as shown below. Note: in this example, the cluster administrator did not wish to override the node's upstream nameservers, so they didn't need to specify the optional upstreamNameservers field.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  stubDomains: |
    {"consul.local": ["10.150.0.1"]}

Example 2: Replacing the Upstream Nameservers

In this example the cluster administrator wants to explicitly force all non-cluster DNS lookups to go through their own nameserver at 172.16.0.1. Again, this is easy to accomplish; they just need to create a ConfigMap with the upstreamNameservers field specifying the desired nameserver.

apiVersion: v1
kind: ConfigMap
metadata:
  name: kube-dns
  namespace: kube-system
data:
  upstreamNameservers: |
    ["172.16.0.1"]

Get involved

If you'd like to contribute or simply help provide feedback and drive the roadmap, join our community. Specifically for network-related conversations, participate through one of these channels:

Chat with us on the Kubernetes Slack network channel
Join our Special Interest Group, SIG-Network, which meets on Tuesdays at 14:00 PT

Thanks for your support and contributions. Read more in-depth posts on what's new in Kubernetes 1.6 here.

–Bowei Du, Software Engineer, and Matthew DeLio, Product Manager, Google
Source: kubernetes

We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here’s what we found out.

The post We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here's what we found out. appeared first on Mirantis | Pure Play Open Cloud.
Late last year, we did a number of tests that looked at deploying close to 1000 OpenStack nodes on a pre-installed Kubernetes cluster as a way of finding out what problems you might run into, and fixing them, if at all possible. In all, we found several problems, and though we were generally able to fix them, we thought it would still be good to go over the types of things you need to look for.
Overall, we deployed an OpenStack cluster that contained more than 900 nodes using Fuel-CCP on a Kubernetes cluster that had been deployed using Kargo. The Kargo tool is part of the Kubernetes Incubator project and uses the Large Kubernetes Cluster reference architecture as a baseline.
As we worked, we documented issues we found, and contributed fixes to both the deployment tool and the reference design document where appropriate. Here's what we found.
The setup
We started with just over 175 bare metal machines, allocating 3 of them to host the Kubernetes control plane services (API servers, etcd, the Kubernetes scheduler, etc.). Each of the remaining machines ran 5 virtual machines, and every VM was used as a Kubernetes minion node.
Each bare metal node had the following specifications:

HP ProLiant DL380 Gen9
CPU – 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
RAM – 264G
Storage – 3.0T on RAID on HP Smart Array P840 Controller, HDD – 12 x HP EH0600JDYTL
Network – 2x Intel Corporation Ethernet 10G 2P X710

The running OpenStack cluster (as far as Kubernetes is concerned) consists of:

OpenStack control plane services running on close to 150 pods over 6 nodes
Close to 4500 pods spread across all of the remaining nodes, at 5 pods per minion node

One major Prometheus problem
During the experiments we used the Prometheus monitoring tool to verify resource consumption and the load put on the core system, Kubernetes, and the OpenStack services. One note of caution when using Prometheus: deleting old data from Prometheus storage will indeed improve the Prometheus API speed – but it will also delete any previous cluster information, making it unavailable for post-run investigation. So make sure to document any observed issue and its debugging thoroughly!
Thankfully, we had in fact done that documentation, but one thing we've decided to do going forward to prevent this problem is to configure Prometheus to back up its data to one of the persistent time series databases it supports, such as InfluxDB, Cassandra, or OpenTSDB. By default, Prometheus is optimized to be used as a real-time monitoring / alerting system, and there is an official recommendation from the Prometheus developer team to keep monitoring data retention to only about 15 days so that the tool stays quick and responsive. By setting up the backup, we can store old data for an extended amount of time for post-processing needs.
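As an illustration of that kind of setup (not the exact configuration from these experiments; the endpoint, database name and retention value are placeholders), a remote_write section in prometheus.yml can ship samples to InfluxDB while local retention stays short:

# prometheus.yml (sketch)
remote_write:
  - url: "http://influxdb.example.local:8086/api/v1/prom/write?db=prometheus"

# Keep only ~15 days of data locally; the exact flag name depends on the Prometheus version, e.g.:
# prometheus --config.file=prometheus.yml --storage.tsdb.retention.time=15d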
Problems we experienced in our testing
Huge load on kube-apiserver
Symptoms
Initially, we had a setup with all nodes (including the Kubernetes control plane nodes) running on a virtualized environment, but the load was such that the API servers couldn't function at all, so they were moved to bare metal. Still, both API servers running in the Kubernetes cluster were utilising up to 2000% of the available CPU (up to 45% of total node compute performance capacity), even after we migrated them to hardware nodes.
Root cause
All services that are not on the Kubernetes masters (kubelet and kube-proxy on all minions) access kube-apiserver via a local NGINX proxy. Most of those requests are watch requests that lie mostly idle after they are initiated (most timeouts on them are defined to be about 5-10 minutes). NGINX was configured to cut idle connections after 3 seconds, which causes all clients to reconnect and (even worse) restart aborted SSL sessions. On the server side, this makes kube-apiserver consume up to 2000% of the CPU resources, making other requests very slow.
Solution
Set the proxy_timeout parameter to 10 minutes in the nginx.conf configuration file, which should be more than long enough to prevent cutting SSL connections before the requests time out by themselves. After this fix was applied, one api-server consumed only 100% of CPU (about 2% of total node compute performance capacity), while the second one consumed about 200% of CPU (about 4% of total node compute performance capacity), with an average response time of 200-400 ms.
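For illustration only, the relevant part of such a local proxy configuration might look like the following nginx stream-proxy sketch; the addresses, ports and upstream name are placeholders, not the actual Kargo template:

stream {
  upstream kube_apiserver {
    server 10.0.0.1:6443;
    server 10.0.0.2:6443;
  }

  server {
    listen 127.0.0.1:6443;
    proxy_pass kube_apiserver;
    # Keep idle watch connections open long enough that clients are not
    # forced to reconnect and renegotiate SSL every few seconds.
    proxy_timeout 10m;
    proxy_connect_timeout 1s;
  }
}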
Upstream issue status: fixed
Make the Kargo deployment tool set proxy_timeout to 10 minutes: issue fixed with a pull request by the Fuel CCP team.
KubeDNS cannot handle large cluster load with default settings
Symptoms
When deploying an OpenStack cluster at this scale, kubedns becomes unresponsive because of the huge load. This ends up with a slew of errors appearing in the logs of the dnsmasq container in the kubedns pod:
Maximum number of concurrent DNS queries reached.
Also, dnsmasq containers sometimes get restarted due to hitting the high memory limit.
Root cause
First of all, kubedns seems to fail often in this architecture, even without load. During the experiment we observed continuous kubedns container restarts even on an empty (but large enough) Kubernetes cluster. The restarts are caused by the liveness check failing, although nothing notable is observed in any logs.
Second, dnsmasq should have taken the load off kubedns, but it needs some tuning to behave as expected (or, frankly, at all) for large loads.
Solution
Fixing this problem requires several levels of steps:

Set higher limits for the dnsmasq containers: they take on most of the load.
Add more replicas to the kubedns replication controller (we decided to stop at 6 replicas, as that solved the observed issue – for bigger clusters it might be necessary to increase this number even more).
Increase the number of parallel connections dnsmasq should handle (we used --dns-forward-max=1000, which is the recommended setting in the dnsmasq manual; a sketch of these dnsmasq settings follows this list).
Increase the size of the cache in dnsmasq: it has a hard limit of 10,000 cache entries, which seems to be a reasonable amount.
Fix kubedns to handle this behaviour in a proper way.
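To make the dnsmasq-related items concrete, here is a rough sketch of the relevant fragment of a kube-dns Deployment; the image tag, resource limits and replica count are illustrative assumptions rather than the values shipped with Kargo:

# Fragment of the kube-dns Deployment (illustrative values)
spec:
  replicas: 6
  template:
    spec:
      containers:
      - name: dnsmasq
        image: gcr.io/google_containers/k8s-dns-dnsmasq-amd64:1.14.1
        args:
        - --dns-forward-max=1000   # allow more concurrent forwarded queries
        - --cache-size=10000       # dnsmasq's hard upper limit on cache entries
        resources:
          limits:
            cpu: 200m
            memory: 256Mi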

Upstream issue status: partially fixed
Items 1 and 2 are fixed by making them configurable in Kargo by the Kubernetes team: issue, pull request.
Others – work has not yet started.
Kubernetes scheduler needs to be deployed on a separate node
Symptoms
During the huge OpenStack cluster deployment against Kubernetes, the scheduler, controller-manager and kube-apiserver start fighting for CPU cycles, as all of them are under a large load. The scheduler is the most resource-hungry, so we needed a way to deploy it separately.
Solution
We manually moved the Kubernetes scheduler to a separate node; all other schedulers were manually killed to prevent them from moving to other nodes.
Upstream issue status: reported
Issue in Kargo.
Kubernetes scheduler is ineffective with pod antiaffinity
Symptoms
It takes a significant amount of time for the scheduler to process pods with pod antiaffinity rules specified on them. It is spending about 2-3 seconds on each pod, which makes the time needed to deploy an OpenStack cluster of 900 nodes unexpectedly long (about 3h for just scheduling). OpenStack deployment requires the use of antiaffinity rules to prevent several OpenStack compute nodes from being launched on a single Kubernetes minion node.
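For reference, an antiaffinity rule of this kind looks roughly like the following pod template fragment; the labels are illustrative, and the affinity field shown is the Kubernetes 1.6+ form (earlier releases expressed the same thing via an annotation):

spec:
  affinity:
    podAntiAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nova-compute   # never co-locate two of these pods ...
        topologyKey: kubernetes.io/hostname   # ... on the same minion node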
Root cause
According to profiling results, most of the time is spent on creating new Selectors to match existing pods against, which triggers the validation step. Basically we have O(N^2) unnecessary validation steps (where N = the number of pods), even if we have just 5 deployment entities scheduled to most of the nodes.
Solution
In this case, we needed a specific optimization that brings scheduling down to about 300 ms/pod. It's still slow in terms of common sense (about 30 minutes spent just on scheduling pods for a 900-node OpenStack cluster), but it is at least close to reasonable. This solution lowers the number of very expensive operations to O(N), which is better, but still depends on the number of pods instead of the number of deployments, so there is room for future improvement.
Upstream issue status: fixed
The optimization was merged into master (pull request) and backported to the 1.5 branch, and is part of the 1.5.2 release (pull request).
kube-apiserver has low default rate limit
Symptoms
Different services start receiving “429 Rate Limit Exceeded” HTTP errors, even though kube-apiservers can take more load. This problem was discovered through a scheduler bug (see below).
Solution
Raise the rate limit for the kube-apiserver process via the --max-requests-inflight option. It defaults to 400, but in our case it became workable at 2000. This number should be configurable in the Kargo deployment tool, as bigger deployments might require an even bigger increase.
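As a sketch, the flag is passed directly to the API server process; treat the value as a starting point rather than a universal recommendation:

kube-apiserver \
  --max-requests-inflight=2000 \
  ...   # all other existing flags unchanged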
Upstream issue status: reported
Issue in Kargo.
Kubernetes scheduler can schedule incorrectly
Symptoms
When creating a huge number of pods (~4500 in our case) and faced with HTTP 429 errors from kube-apiserver (see above), the scheduler can schedule several pods of the same deployment on one node, in violation of the pod antiaffinity rule on them.
Root cause
See pull request below.
Upstream issue status: pull request
Fix from Mirantis team: pull request (merged, part of Kubernetes 1.6 release).
Docker sometimes becomes unresponsive
Symptoms
The Docker process sometimes hangs on several nodes, which results in timeouts in the kubelet logs. When this happens, pods cannot be spawned or terminated successfully on the affected minion node. Although many similar issues have been fixed in Docker since 1.11, we are still observing these symptoms.
Workaround
The Docker daemon logs do not contain any notable information, so we had to restart the docker service on the affected node. (During the experiments we used Docker 1.12.3, but we have observed similar symptoms in 1.13 release candidates as well.)
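As a rough illustration of the workaround on a systemd-based node (the exact kubelet log text to search for varies between releases):

# look for operation timeouts on the affected minion
journalctl -u kubelet --since "1 hour ago" | grep -i timeout
# restart the hung Docker daemon
sudo systemctl restart docker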
OpenStack services don’t handle PXC pseudo-deadlocks
Symptoms
When run in parallel, create operations for lots of resources were failing with a DBError saying that Percona XtraDB Cluster identified a deadlock and the transaction should be restarted.
Root cause
oslo.db is responsible for wrapping errors received from the DB into proper classes so that services can restart transactions if similar errors occur, but it didn’t expect the error in the format that is being sent by Percona. After we fixed this, however, we still experienced similar errors, because not all transactions that could be restarted were properly decorated in Nova code.
Upstream issue status: fixed
The bug has been fixed by Roman Podolyaka’s CR and backported to Newton. It fixes Percona deadlock error detection, but there’s at least one place in Nova that still needs to be fixed.
Live migration failed with live_migration_uri configuration
Symptoms
With the live_migration_uri configuration, live migration fails because one compute host can't connect to libvirt on another host.
Root cause
We can't specify which IP address to use in the live_migration_uri template, so it was trying to use the address from the first interface, which happened to be in the PXE network, while libvirt listens on the private network. We couldn't use live_migration_inbound_addr, which would have solved this problem, because of a bug in upstream Nova.
Upstream issue status: fixed
A bug in Nova has been fixed and backported to Newton. We switched to using live_migration_inbound_addr after that.
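For illustration, the corresponding nova.conf setting on each compute host looks something like this, with the address being a placeholder for the host's private-network IP:

[libvirt]
live_migration_inbound_addr = 10.10.0.12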
The post We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here's what we found out. appeared first on Mirantis | Pure Play Open Cloud.
Source: Mirantis

How To Safely Send Your Nudes

A guide to sexting best practices for you and your favorite taker-of-nudes.

If you've ever sent or received a sext, you're not alone. In a 2013 study, about 27% of all smartphone users said they receive sexts on a regular basis, and 12% admitted to sending nudes (though the people polled may have been coy). That number may even be higher now, as the study came out just as Snapchat, then an ephemeral multimedia messaging platform built around disappearing photos and video, was taking off.

This is a judgment-free zone. If you want to send a nude (and have a willing participant), then send a nude. There's nothing wrong with nudity! Human bodies are beautiful! But it's also totally normal to want to maintain control of the way your nudes are seen and distributed.

The only way to truly control your nude distribution is to do it yourself. Just follow these simple steps: Take a pic of your goods, download the pic to an encrypted hard drive, drop it in a password-protected folder, confiscate your partner's phone, show them the image, close the file, return their phone, and proceed.

But that's deeply unsexy! And also not how sexting works.

If you decide to send nudes, you assume the risk of those nudes ending up in a public forum, and should prepare yourself for the worst case scenario — but you can significantly lower that risk by following this guide to best practices for ~sensual~ electronic communication. These tips don’t offer a complete guarantee that your nudes won’t be leaked, but they are a good First Line of Defense Against the Dark Interwebs.

One note: If you’re under 18, never, ever, under any circumstances, share a photo of yourself naked. You can be prosecuted as a sex offender, even for sending a picture of yourself consensually.

Reclining Nude by Julien Vallou de Villeneuve / The Metropolitan Museum of Art

Here is the most important sexting advice of all: Only send NSFW content to people you trust. Does the recipient seem like someone who would publish your nudes as revenge or use them as blackmail? Do they seem like they take basic security precautions with their devices (see: tip)? Are they generally …trustworthy?

You can use apps that employ the most secure end-to-end encryption available, but it won’t matter if the person on the other end takes a screenshot, and “accidentally” posts it to Twitter. So make sure that the person you’re sending your Anthony Weiner to is someone who understands the value of the safekeeping of your selfie.

Because, duh! If their (or your) phone is ever stolen and left unlocked, your nudes might end up in the wrong hands.

You won’t always know when someone screenshots your sext. Yes, some services will notify you, but there are many ways to get around this.

Snapchat will display a particular icon (an arrow with spikes) when a screenshot of your Snap has been taken. Instagram will also notify you if the recipient of a “disappearing” Instagram direct message takes a screenshot.

However, neither of these notification features prevents someone from taking the screenshot in the first place, and they could easily take advantage of the app's biggest loophole: taking a photo of the screen with another device.

Nicole Nguyen / BuzzFeed News



Source: BuzzFeed

Announcing Azure SQL Database Threat Detection general availability coming in April 2017

Today we are happy to announce that Azure SQL Database Threat Detection will be generally available in April 2017. Through the course of the preview we optimized our offering, and it has received 90% positive feedback from customers regarding the usefulness of SQL threat alerts. At general availability, SQL Database Threat Detection will cost $15 per server per month. We invite you to try it out for 60 days for free.

What is Azure SQL Database Threat Detection?

Azure SQL Database Threat Detection provides an additional layer of security intelligence built into the Azure SQL Database service. It helps customers using Azure SQL Database to secure their databases within minutes without needing to be an expert in database security. It works around the clock to learn, profile and detect anomalous database activities indicating unusual and potentially harmful attempts to access or exploit databases.

How to use SQL Database Threat Detection

Just turn it ON – SQL Database Threat Detection is incredibly easy to enable. You simply switch on Threat Detection from the Auditing & Threat Detection configuration blade in the Azure portal, select the Azure storage account (where the SQL audit log will be saved) and configure at least one email address for receiving alerts.

Real-time actionable alerts – SQL Database Threat Detection runs multiple sets of algorithms which detect potential vulnerabilities and SQL injection attacks, as well as anomalous database access patterns (such as access from an unusual location or by an unfamiliar principal). Security officers or other designated administrators get email notification once a threat is detected on the database. Each notification provides details of the suspicious activity and recommends how to further investigate and mitigate the threat.

Live SQL security tile – SQL Database Threat Detection integrates its alerts with Azure Security Center. A live SQL security tile within the database blade in Azure portal tracks the status of active threats. Clicking on the SQL security tile launches the Azure Security Center alerts blade and provides an overview of active SQL threats detected on the database. Clicking on a specific alert provides additional details and actions for investigating and preventing similar threats in the future.

Investigate SQL threat – Each SQL Database Threat Detection email notification and Azure Security Center alert includes a direct link to the SQL audit log. Clicking on this link launches the Azure portal and opens the SQL audit records around the time of the event, making it easy to find the SQL statements that were executed (who accessed, what they did and when) and determine whether the event was legitimate or malicious (e.g. an application vulnerability to SQL injection was exploited, someone breached sensitive data, etc.).

Recent customer experiences using SQL Database Threat Detection

During our preview, many customers benefited from the enhanced security SQL Database Threat detection provides.

Case 1: Anomalous access from a new network to production database

Justin Windhorst, Head of IT North America at Archroma

“Archroma runs a custom built ERP/e-Commerce solution, consisting of more than 20 Web servers and 20 Databases using a multi-tier architecture, with Azure SQL Database at its core. I love the built-in features that bring added value such as the enterprise level features: SQL Database Threat Detection (for security) and Geo Replication (for availability). Case in point: With just a few clicks, we successfully enabled SQL Auditing and Threat Detection to ensure continuous monitoring occurred for all activities within our databases. A few weeks later, we received an email alert that "Someone has logged on to our SQL server from an unusual location”. The alert was triggered as a result of unusual access from a new network to our production database for testing purposes. Knowing that we have the power of Microsoft behind us that automatically brings to light anomalous activities such as these gives Archroma incredible peace of mind, and thus allows us to focus on delivering a better service.”

Case 2: Preventing SQL Injection attacks

Fernando Sola, Cloud Technology Consultant at HSI

“Thanks to Azure SQL Database Threat Detection, we were able to detect and fix vulnerabilities to SQL injection attacks and prevent potential threats to our database. I was very impressed with how simple it was to enable threat detection using the Azure portal. A while after enabling Azure SQL Database Threat Detection, we received an email notification about ‘An application generated a faulty SQL statement on our database, which may indicate a vulnerability of the application to SQL injection.’  The notification provided details of the suspicious activity and recommended actions how to observe and fix the faulty SQL statement in our application code using SQL Audit Log. The alert also pointed me to the Microsoft documentation that explained us how to fix an application code that is vulnerable to SQL injection attacks. SQL Database Threat Detection and Auditing help my team to secure our data in Azure SQL Database within minutes and with no need to be an expert in databases or security.”

Summary

We would like to thank all of you that provided feedback and shared experiences during the public preview. Your active participation validated that SQL Database Threat Detection provides an important layer of security built into the Azure SQL Database service to help secure databases without the need to be an expert in database security.

Click the following links for more information:

Learn more about Azure SQL Database Threat Detection

Learn more about Azure SQL Database Auditing
Learn more about Azure SQL Database
Learn more about Azure Security Center

Source: Azure

How To Build Planet Scale Mobile App in Minutes with Xamarin and DocumentDB

Most mobile apps need to store data in the cloud, and DocumentDB is an awesome cloud database for mobile apps. It has everything a mobile developer needs: a fully managed NoSQL database as a service that scales on demand and can bring your data where your users go around the globe, completely transparently to your application. Today we are excited to announce the Azure DocumentDB SDK for the Xamarin mobile platform, enabling mobile apps to interact directly with DocumentDB, without a middle tier.

Here is what mobile developers get out of the box with DocumentDB:

Rich queries over schemaless data. DocumentDB stores data as schemaless JSON documents in heterogeneous collections, and offers rich and fast queries without the need to worry about schema or indexes.
Fast. Guaranteed. It takes only a few milliseconds to read and write documents with DocumentDB. Developers can specify the throughput they need and DocumentDB will honor it with a 99.99% SLA.
Limitless Scale. Your DocumentDB collections will grow as your app grows. You can start with a small data size and hundreds of requests per second and grow to arbitrarily large scale: tens and hundreds of millions of requests per second of throughput, and petabytes of data.
Globally Distributed. Your mobile app users are on the go, often across the world. DocumentDB is a globally distributed database, and with just one click on a map it will bring the data wherever your users are.
Built-in rich authorization. With DocumentDB it is easy to implement popular patterns like per-user data or multi-user shared data without complex custom authorization code.
Geo-spatial queries. Many mobile apps offer geo-contextual experiences today. With first-class support for geo-spatial types, DocumentDB makes these experiences very easy to accomplish.
Binary attachments. Your app data often includes binary blobs. Native support for attachments makes it easier to use DocumentDB as a one-stop shop for your app data.

Let's build an app together!

Step 1. Get Started

It's easy to get started with DocumentDB: just go to the Azure portal, create a new DocumentDB account, go to the Quickstart tab, and download a Xamarin Forms todo list sample that is already connected to your DocumentDB account.

Or if you have an existing Xamarin app, you can just add this DocumentDB NuGet package. Today we support Xamarin.iOS, Xamarin.Android, as well as Xamarin Forms shared libraries.

Step 2. Work with data

Your data records are stored in DocumentDB as schemaless JSON documents in heterogeneous collections. You can store documents with different structures in the same collection.

In your Xamarin projects you can use language-integrated queries (LINQ) over schemaless data.

Step 3. Add Users

Like many get-started samples, the DocumentDB sample you downloaded above authenticates to the service using a master key hardcoded in the app's code. This is of course not a good idea for an app you intend to run anywhere except your local emulator. If an attacker gets a hold of the master key, all the data across your DocumentDB account is compromised.

Instead we want our app to only have access to the records for the logged-in user. DocumentDB allows developers to grant an application read or read/write access to all documents in a collection, a set of documents, or a specific document, depending on the needs.

Here is, for example, how to modify our todo list app into a multi-user todo list app (a complete version of the sample is available here):

Add Login to your app, using Facebook, Active Directory or any other provider.
Create a DocumentDB UserItems collection with /userId as a partition key. Specifying a partition key for your collection allows DocumentDB to scale infinitely as the number of app users grows, while offering fast queries.
Add a DocumentDB Resource Token Broker, a simple Web API that authenticates users and issues short-lived tokens to logged-in users with access only to the documents within the user's partition. In this example we host the Resource Token Broker in App Service.
Modify the app to authenticate to the Resource Token Broker with Facebook and request the resource tokens for the logged-in Facebook user, then access the user's data in the UserItems collection.

This diagram illustrates the solution. We are investigating eliminating the need for the Resource Token Broker by supporting OAuth in DocumentDB as a first-class feature; please upvote this UserVoice item if you think it's a good idea!

Now if we want two users to get access to the same todo list, we just add additional permissions to the access token in the Resource Token Broker. You can find the complete sample here.

Step 4. Scale on demand.

DocumentDB is a managed database as a service. As your user base grows, you don't need to worry about provisioning VMs or increasing cores. All you need to tell DocumentDB is how many operations per second (throughput) your app needs. You can specify the throughput via the Scale tab in the portal, using a measure of throughput called Request Units per second (RUs). For example, a read operation on a 1KB document requires 1 RU. You can also add alerts on the "Throughput" metric to monitor traffic growth and programmatically change the throughput as alerts fire.

  

Step 5. Go Planet Scale!

As your app gains popularity, you may acquire users across the globe. Or maybe you just don't want to be caught off guard if a meteorite strikes the Azure data centers where you created your DocumentDB collection. Go to the Azure portal, open your DocumentDB account, and with a click on a map make your data continuously replicate to any number of regions across the world. This ensures your data is available wherever your users are, and you can add failover policies to be prepared for a rainy day.

We hope you find this blog and the samples useful for taking advantage of DocumentDB in your Xamarin application. A similar pattern can be used in Cordova apps using the DocumentDB JavaScript SDK, as well as in native iOS / Android apps using the DocumentDB REST APIs.

As always, let us know how we are doing and what improvements you'd like to see going forward for DocumentDB through UserVoice, StackOverflow azure-documentdb, or Twitter @DocumentDB.
Source: Azure

Global Mentor Week: Thank you Docker Community!

Danke, рақмет сізге, tak, धन्यवाद, cảm ơn bạn, شكرا, mulțumesc, Gracias, merci, asante, ευχαριστώ, thank you community for an incredible Docker Global Mentor Week! From Tokyo to Sao Paulo, Kisumu to Copenhagen and Ottawa to Manila, it was so awesome to see the energy from the community coming together to celebrate and learn about Docker!

Over 7,500 people registered to attend one of the 110 mentor week events across 5 continents! A huge thank you to all the Docker meetup organizers who worked hard to make these special events happen and offer Docker beginners and intermediate users an opportunity to participate in Docker courses.
None of this would have been possible without the support (and expertise!) of the 500+ advanced Docker users who signed up as mentors to help newcomers.
Whether it was mentors helping attendees, newcomers pushing their first image to Docker Hub or attendees mingling and having a good time, everyone came together to make mentor week a success as you can see on social media and the Facebook photo album.
Here are some of our favorite tweets from the meetups:
 

@Docker LearnDocker at Grenoble France 17Nov2016 @HPE_FR pic.twitter.com/8RSxXUWa4k
— Stephane Bureau (@SBUCloud) November 18, 2016

Awesome turnout at tonight's @DockerNYC learndocker event! We will be hosting more of these – Keep tabs on meetup: https://t.co/dT99EOs4C9 pic.twitter.com/9lZocCjMPb
— Luisa M. Morales (@luisamariethm) November 18, 2016

And finally… Tada! Docker Mentor Week learndocker pic.twitter.com/6kzedIoGyB
— Károly Kass (@karolykassjr) November 17, 2016

 
Learn Docker
In case you weren’t able to attend a local event, the five courses are now available to everyone online here: https://training.docker.com/instructor-led-training
Docker for Developers Courses
Developer – Beginner Linux Containers
This tutorial will guide you through the steps involved in setting up your computer, running your first containers, deploying a web application with Docker and running a multi-container voting app with Docker Compose.
Developer – Beginner Windows Containers
This tutorial will walk you through setting up your environment, running basic containers and creating a Docker Compose multi-container application using Windows containers.
Developer – Intermediate (both Linux and Windows)
This tutorial teaches you how to network your containers, how you can manage data inside and between your containers and how to use Docker Cloud to build your image from source and use developer tools and programming languages with Docker.
Docker for Operations courses
These courses are step-by-step guides where you will build your own Docker cluster and use it to deploy a sample application. We have two solutions for you to create your own cluster.

Using play-with-docker

Play With Docker is a Docker playground that was built by two amazing Docker captains: Marcos Nils and Jonathan Leibiusky during the Docker Distributed Systems Summit in Berlin last October.
Play with Docker (aka PWD) gives you the experience of having a free Alpine Linux Virtual Machine in the cloud where you can build and run Docker containers and even create clusters with Docker features like Swarm Mode.
Under the hood DIND or Docker-in-Docker is used to give the effect of multiple VMs/PCs.
To get started, go to http://play-with-docker.com/ and click on ADD NEW INSTANCE five times. You will get five "docker-in-docker" containers, all on a private network. These are your five nodes for the workshop!
When the instructions in the slides tell you to "SSH on node X", just go to the tab corresponding to that node.
The nodes are not directly reachable from outside, so when the slides tell you to "connect to the IP address of your node on port XYZ", you will have to use a different method.
We suggest using "supergrok", a container offering an NGINX+ngrok combo to expose your services. To use it, just start (on any of your nodes) the jpetazzo/supergrok image. The image will output further instructions:
docker run --name supergrok -d jpetazzo/supergrok
docker logs --follow supergrok
The logs of the container will give you a tunnel address and explain how to connect to exposed services. That's all you need to do!
You can also view this excellent video by Docker Brussels Meetup organizer Nils de Moor who walks you through the steps to build a Docker Swarm cluster in a matter of seconds through the new play-with-docker tool.

 
Note that the instances provided by Play-With-Docker have a short lifespan (a few hours only), so if you want to do the workshop over multiple sessions, you will have to start over each time… or create your own cluster with the option below.

Using Docker Machine to create your own cluster

This method requires a bit more work to get started, but you get a permanent cluster, with fewer limitations.
You will need Docker Machine (if you have Docker for Mac, Docker for Windows, or the Docker Toolbox, you're all set already). You will also need:

credentials for a cloud provider (e.g. API keys or tokens),
or a local install of VirtualBox or VMware (or anything supported by Docker Machine).

Full instructions are in the prepare-machine subdirectory.
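As a rough sketch of that path (the VirtualBox driver, node names and the placeholder token are assumptions, not part of the official instructions):

# create two VMs with Docker Machine
docker-machine create --driver virtualbox node1
docker-machine create --driver virtualbox node2

# turn node1 into a swarm manager
eval $(docker-machine env node1)
docker swarm init --advertise-addr $(docker-machine ip node1)

# join node2 as a worker, using the join token printed by the command above
eval $(docker-machine env node2)
docker swarm join --token <worker-token> $(docker-machine ip node1):2377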
Once you have decided which option to use to create your swarm cluster, you are ready to get started with one of the operations courses below:
Operations – Beginner
The beginner part of the Ops tutorial will teach you how to set up a swarm, how to use it to host your own registry, how to build your app container images and how to deploy and scale a distributed application called Dockercoins.
Operations – Intermediate
From global container scheduling and overlay network troubleshooting to dealing with stateful services and node management, this tutorial will show you how to operate your swarm cluster at scale and take you on a swarm mode deep dive.


The post Global Mentor Week: Thank you Docker Community! appeared first on Docker Blog.
Source: https://blog.docker.com/feed/

How to avoid a self-inflicted DDoS Attack – CRE life lessons

Posted by Dave Rensin, Director of Customer Reliability Engineering, and Adrian Hilton, Software Engineer, Site Reliability Engineering

Editor’s note: Left unchecked, poor software architecture decisions are the most common cause of application downtime. Over the years, Google Site Reliability Engineering has learned to spot code that could lead to outages, and strives to identify it before it goes into production as part of its production readiness review. With the introduction of Customer Reliability Engineering, we’re taking the same best practices we’ve developed for internal systems, and extending them to customers building applications on Google Cloud Platform. This is the first post in a series written by CREs to highlight real-world problems — and the steps we take to avoid them.

Distributed Denial of Service (DDoS) attacks aren’t anything new on the internet, but thanks to a recent high profile event, they’ve been making fresh headlines. We think it’s a convenient moment to remind our readers that the biggest threat to your application isn’t from some shadowy third party, but from your own code!

What follows is a discussion of one of the most common software architecture design fails — the self-inflicted DDoS — and three methods you can use to avoid it in your own application.

Even distributions that aren’t
There’s a famous saying (variously attributed to Mark Twain, Will Rogers, and others) that goes:

“It ain’t what we don’t know that hurts us so much as the things we know that just ain’t so.”
Software developers make all sorts of simplifying assumptions about user interactions, especially about system load. One of the more pernicious (and sometimes fatal) simplifications is “I have lots of users all over the world. For simplicity, I’m going to assume their load will be evenly distributed.”

To be sure, this often turns out to be close enough to true to be useful. The problem is that it’s a steady state or static assumption. It presupposes that things don’t vary much over time. That’s where things start to go off the rails.

Consider this very common pattern: Suppose you’ve written a mobile app that periodically fetches information from your backend. Because the information isn’t super time sensitive, you write the client to sync every 15 minutes. Of course, you don’t want a momentary hiccup in network coverage to force you to wait an extra 15 minutes for the information, so you also write your app to retry every 60 seconds in the event of an error.

Because you're an experienced and careful software developer, your system consistently maintains 99.9% availability. For most systems that's perfectly acceptable performance, but it also means that in any given 30-day month your system can be unavailable for up to 43.2 minutes.

So. Let’s talk about what happens when that’s the case. What happens if your system is unavailable for just one minute?

When your backends come back online you get (a) the traffic you would normally expect for the current minute, plus (b) any traffic from the one-minute retry interval. In other words, you now have 2X your expected traffic. Worse still, your load is no longer evenly distributed because 2/15ths of your users are now locked together into the same sync schedule. Thus, in this state, for any given 15-minute period you’ll experience normal load for 13 minutes, no load for one minute and 2X load for one minute.

Of course, service disruptions usually last longer than just one minute. If you experience a 15-minute error (still well within your 99.9% availability) then all of your load will be locked together until after your backends recover. You’ll need to provision at least 15X of your normal capacity to keep from falling over. Retries will also often “stack” at your load balancers and your backends will respond more slowly to each request as their load increases. As a result, you might easily see 20X your normal traffic (or more) while your backends heal. In the worst case, the increased load might cause your servers to run out of memory or other resources and crash again.

Congratulations, you’ve been DDoS’d by your own app!
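To make the arithmetic concrete, here’s a rough simulation sketch in Python, using made-up numbers (15,000 clients, a 15-minute sync interval, one-minute retries, and a 15-minute outage). It ignores retry stacking at the load balancer, so real incidents look even worse, but it reproduces the roughly 15X spike in the first minute after recovery:

```python
import random

SYNC_INTERVAL = 15        # minutes between regular syncs
RETRY_INTERVAL = 1        # minutes between retries while the backend is down
OUTAGE_START, OUTAGE_END = 100, 115   # a 15-minute outage
CLIENTS = 15_000

# Start every client at a random offset so steady-state load is even:
# about CLIENTS / SYNC_INTERVAL requests per minute.
next_attempt = {c: random.randrange(SYNC_INTERVAL) for c in range(CLIENTS)}
load = [0] * 200

for minute in range(200):
    for client, when in next_attempt.items():
        if when != minute:
            continue
        load[minute] += 1
        if OUTAGE_START <= minute < OUTAGE_END:
            next_attempt[client] = minute + RETRY_INTERVAL   # request failed: retry soon
        else:
            next_attempt[client] = minute + SYNC_INTERVAL    # request succeeded: back to normal

steady = CLIENTS / SYNC_INTERVAL
print(f"steady-state load: ~{steady:.0f} requests/minute")
print(f"first minute after recovery: {load[OUTAGE_END]} requests "
      f"({load[OUTAGE_END] / steady:.0f}x normal)")
```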

The great thing about known problems is that they usually have known solutions. Here are three things you can do to avoid this trap.

Try exponential backoff
When you use a fixed retry interval (in this case, one minute) you pretty well guarantee that you’ll stack retry requests at your load balancer and cause your backends to become overloaded once they come back up. One of the best ways around this is to use exponential backoff.

In its most common form, exponential backoff simply means that you double the retry interval up to a certain limit to lower the number of overall requests queued up for your backends. In our example, after the first one-minute retry fails, wait two minutes. If that fails, wait four minutes and keep doubling that interval until you get to whatever you’ve decided is a reasonable cap (since the normal sync interval is 15 minutes you might decide to cap the retry backoff at 16 minutes).
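As a sketch of what that client logic might look like in Python: `fetch` is just a stand-in for whatever call your app makes to the backend, and the one-minute base and 16-minute cap follow the example above.

```python
import time

def backoff_intervals(base=60, cap=16 * 60):
    """Yield retry delays in seconds: 1, 2, 4, 8 minutes, then 16 from then on."""
    delay = base
    while True:
        yield delay
        delay = min(delay * 2, cap)

def sync_with_retries(fetch):
    """Call `fetch` until it succeeds, sleeping with capped exponential backoff."""
    for delay in backoff_intervals():
        try:
            return fetch()
        except Exception:             # whatever transient error your client raises
            time.sleep(delay)
```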

Of course, backing off of retries will help your overall load at recovery but won’t do much to keep your clients from retrying in sync. To solve that problem, you need jitter.

Add a little jitter

Jitter is the random interval you add (or subtract) to the next retry interval to prevent clients from locking together during a prolonged outage. The usual pattern is to pick a random adjustment within a fixed percentage of the interval, say +/- 30%, and add it to the next retry interval.

In our example, if the next backoff interval is supposed to be 4 minutes, adjust it by up to +/- 30%. Thirty percent of 4 minutes is 1.2 minutes, so select a random wait between 2.8 minutes and 5.2 minutes.
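A minimal sketch of that calculation; it can be combined with the backoff sketch above by sleeping for `with_jitter(delay)` instead of `delay`.

```python
import random

def with_jitter(interval, fraction=0.3):
    """Return `interval` adjusted by a random amount of up to +/- `fraction` of it."""
    return interval + random.uniform(-fraction, fraction) * interval

# A 4-minute (240 s) backoff with 30% jitter becomes a random wait
# between 168 s (2.8 minutes) and 312 s (5.2 minutes):
print(with_jitter(240))
```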

Here at Google we’ve observed the impact of a lack of jitter in our own services. We once built a system where clients started off polling at random times but we later observed that they had a strong tendency to become more synchronized during short service outages or degraded operation.

Eventually we saw very uneven load across a poll interval — with most clients polling the service at the same time — resulting in peak load that was easily 3X the average. Here’s a graph from the postmortem of an outage in that system. In this case the clients were polling at a fixed 5-minute interval, but over many months became synchronized:

Observe how the traffic (red) comes in periodic spikes, correlating with 2x the average backend latency (green) as the servers become overloaded. That was a sure sign that we needed to employ jitter. (This monitoring view also significantly under-counts the traffic peaks because of its sample interval.) Once we added a random factor of +/- 1 minute (20%) to each retry, the traffic flattened out almost immediately, the periodicity disappeared, and the backends were no longer overloaded. Of course, we couldn’t do this immediately; we had to build and push a new code release to our clients with the new behavior, so we had to live with the overload for a while.

At this point, we should also point out that in the real world, usage is almost never evenly distributed — even when the users are. Nearly all systems of any scale experience peaks and troughs corresponding with the work and sleep habits of their users. Lots of people simply turn off their phones or computers when they go to sleep. That means that you’ll see a spike in traffic as those devices come back online when people wake up.

For this reason it’s also a really good idea to add a little jitter (perhaps 10%) to regular sync intervals, in addition to your retries. This is especially important for first syncs after an application starts. This will help to smooth out daily cyclical traffic spikes and keep systems from becoming overloaded.
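Here’s a sketch of what that client loop might look like, with `sync_once` as a placeholder for the app’s actual sync call; the very first sync is also delayed by a random amount so a mass restart doesn’t line every client up.

```python
import random
import time

SYNC_INTERVAL = 15 * 60   # regular sync interval, in seconds

def sync_loop(sync_once):
    """Run regular syncs forever, spread out with roughly 10% jitter."""
    # Delay the very first sync by a random fraction of the interval so that a
    # mass restart (say, after an app update) doesn't sync every client at once.
    time.sleep(random.uniform(0, SYNC_INTERVAL))
    while True:
        sync_once()
        # 15 minutes +/- 10%, so clients drift apart instead of locking together.
        time.sleep(SYNC_INTERVAL * random.uniform(0.9, 1.1))
```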

Implement retry marking
A large fleet of backends doesn’t recover from an outage all at once. That means that as a system begins to come back online, its overall capacity ramps up slowly. You don’t want to jeopardize that recovery by trying to serve all of your waiting clients at once. Even if you implement both exponential backoff and jitter you still need to prioritize your requests as you heal.

An easy and effective technique to do this is to have your clients mark each attempt with a retry number. A value of zero means that the request is a regular sync. A value of one indicates the first retry and so on. With this in place, the backends can prioritize which requests to service and which to ignore as things get back to normal. For example, you might decide that higher retry numbers indicate users who are further out-of-sync and service them first. Another approach is to cap the overall retry load to a fixed percentage, say 10%, and service all the regular syncs and only 10% of the retries.
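Here’s one possible server-side sketch of the “serve all regular syncs plus 10% of retries” variant; the `retry_number` field and the helper name are hypothetical, not part of any particular framework.

```python
import random

ACCEPTED_RETRY_FRACTION = 0.10   # serve every regular sync, but only 10% of retries

def should_serve(retry_number):
    """Admission decision while the fleet is still recovering.

    retry_number is 0 for a regular sync, 1 for the first retry, and so on;
    the client sends it with every attempt (for example, as a request header).
    """
    if retry_number == 0:
        return True
    return random.random() < ACCEPTED_RETRY_FRACTION

# Rejected retries would get a 503 (or similar) and back off with jitter
# before trying again with retry_number + 1.
```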

How you choose to handle retries is entirely up to your business needs. The important thing is that by marking them you have the ability to make intelligent decisions as a service recovers.

You can also monitor the health of your recovery by watching the retry number metrics. If you’re recovering from a six-minute outage, you might see that the oldest retries have a retry sequence number of 3. As you recover, you would expect to see the number of 3s drop sharply, followed by the 2s, and so on. If you don’t see that (or see the retry sequence numbers increase), you know you still have a problem. This would not be obvious by simply watching the overall number of retries.
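As an illustration (not a prescription), here’s a tiny helper that compares two snapshots of request counts keyed by retry number and checks that the deepest retry bucket is draining as expected:

```python
from collections import Counter

def recovery_looks_healthy(start, now):
    """Compare two snapshots of request counts keyed by retry number.

    During a healthy recovery, the deepest retry bucket seen at the start
    (say retry number 3 after a six-minute outage) should be draining; if
    it's flat or growing, the system is still in trouble.
    """
    if not start:
        return True
    deepest = max(start)             # highest retry number observed at the start
    return now[deepest] < start[deepest]

# Hypothetical snapshots taken a minute apart during recovery:
print(recovery_looks_healthy(Counter({0: 900, 1: 120, 2: 40, 3: 15}),
                             Counter({0: 980, 1: 60, 2: 10, 3: 2})))   # True
```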

Parting thoughts
Managing system load and gracefully recovering from errors is a deep topic. Stay tuned for upcoming posts about important subjects like cascading failures and load shedding. In the meantime, if you adopt the techniques in this article you can help keep a one-minute network blip from turning into a much longer DDoS disaster.
Source: Google Cloud Platform

Six DevOps myths and the realities behind them

At OpenStack Days Silicon Valley 2016, Puppet Founder and CEO Luke Kanies dispelled the six most common misconceptions he’s encountered that prevent organizations from adopting and benefiting from DevOps.

Over a five-year period, Puppet surveyed 25,000 people, and the results show that adopting DevOps is critical to building a great software company. Unfortunately, many companies find that the costs of the cultural change are too high. The result is that these firms often fail to become great software companies, sometimes because even when they try to adopt the DevOps lifestyle, the changes don’t go deep enough to deliver real value.

You see, all companies are becoming software companies, Kanies explained, and surveys have shown that success requires optimizing end-to-end software production. Organizations that move past the barriers to change and replace their old processes with DevOps tools and practices will be able to make the people on their team happy, spend more time creating value rather than doing rework, and deliver software faster.

Key points in the 2016 State of DevOps Report survey show that high-performing teams deploy 200 times more frequently than average teams, with over 2,500 times shorter lead times, so the time between idea and production is minimal. Additionally, these teams see failure rates that are three times lower than their non-DevOps counterparts, and they recover 24 times faster. The five-year span of the survey has also shown that the distance between top performers and average performers is growing.

In other words, the cost of not adopting DevOps processes is also growing.

Despite these benefits, however, for every reason to adopt DevOps, there are plenty of myths and cultural obstacles that hold organizations back.
Myth 1: There’s no direct value to DevOps
The first myth Kanies discussed is that there’s no direct customer or business value for adopting DevOps practices. After all, how much good does it do customers to have teams deploying 200 times more frequently?

Quite a lot, as it happens. DevOps allows faster delivery of more reliable products and optimizes processes, which results in developing software faster. That means responding to customer problems more quickly, as well as drastically slashing time to market for new ideas and products. This increased velocity means more value for your business.
Myth 2: There’s no ROI for DevOps in the legacy world
The second myth, that there’s no return on investment in applying DevOps to legacy applications, is based on the idea that DevOps is only useful for new technology. The problem with this view, Kanies says, is that the majority of the world still runs in legacy environments, effectively ruling out most of the existing IT ecosystem.

There are really good reasons not to ignore this reality when planning your DevOps initiatives. DevOps doesn’t have to be all-or-nothing; small changes to your process can make a significant difference by removing manual steps and eliminating slow, painful, error-prone processes.

What’s more, in many cases, you can’t predict where returns will be seen, so there’s value in working across the entire organization. Kanies points out that it makes no sense to only utilize DevOps for the new, shiny stuff that no one is really using yet and neglect the production applications that users care about, thus leaving them operating slowly and poorly.
Myth 3: Only unicorns can wield DevOps
Myth number three is that DevOps only works with “unicorn” companies and not traditional enterprise. Traditional companies want assurances that DevOps solutions and benefits work for their very traditional needs, and not just for new, from-scratch companies.

Kanies points out that DevOps is the new normal, and no matter where organizations are in the maturity cycle, they need to be able to figure out how to optimize the entire end-to-end software production, in order to gain the benefits of DevOps: reduced time to market, lower mean time to recovery, and higher levels of employee engagement.
Myth 4: You don’t have enough time or people
The fourth myth is that improvement via DevOps requires spare time and people the organization doesn’t have. At the root of this myth are two realities: no matter what you do, software must be delivered faster and more often, and costs must be held steady or decreased. Organizations don’t see how to do both, especially if they take time to retool to a new methodology.

But DevOps is about time reclamation. First, it automates many tasks that computers can accomplish faster and more reliably than an overworked IT engineer. That much is obvious.

But there’s a second, less obvious way that DevOps enables you to reclaim time and money. Studies have shown that on average, SREs, sysadmins, and so on get interrupted every fifteen minutes, and that it takes about thirty minutes to fully recover from an interruption. This means many people have no time to spend hours on a single, hard problem because they constantly get interrupted. Recognizing this problem and removing the interruptions can free up time for more value-added activity and free up needed capacity in the organization.
Myth 5: DevOps doesn’t fit with regulations and compliance
Myth number five comes from companies subject to regulation and compliance that believe these requirements preclude adopting DevOps. However, with better software, faster recovery, faster deployments, and lower error rates, you can automate compliance as well. Organizations can integrate all of the elements of software development with auditing, security, and compliance to deliver higher value; in fact, if these aren’t all done together, companies are likely to experience a failure of some sort.
Myth 6: You don’t really need it
Kanies says he hasn’t heard the sixth myth often, but once in a while, a company concludes it doesn’t have any problems that adopting DevOps would fix. But DevOps is really about being good at getting better, moving faster, and eliminating the more frustrating parts of the work, he explains.

The benefits of adopting DevOps are clear from Kanies’ points and from the survey data. As he says, the choice is really about whether to invest in change or to let your competitors do it first. The top performers are pulling ahead faster and faster, and “organizations don’t have a lot of time to make a choice.”

You can hear the entire talk on the OpenStack Days Silicon Valley site.
Source: Mirantis

How does the world consume private clouds?

In my previous blog, Why the World Needs Private Clouds, we looked at ten reasons for considering a private cloud. The next logical question is how a company should go about building a private cloud.
In my view, there are four consumption models for OpenStack. Let’s look at each approach and then compare.

Approach 1: DIY
For the most sophisticated users, where OpenStack is super-strategic to the business, a do-it-yourself approach is appealing. Walmart, PayPal, and so on are examples of this approach.
In this approach, the user has to grab upstream OpenStack bits, package the right projects, fix bugs or add features as needed, then deploy and manage the OpenStack lifecycle. The user also has to “self-support” their internal IT/OPS team.
This approach requires recruiting and retaining a very strong engineering team that is adept at Python, OpenStack, and working with the upstream open-source community. Because of this, I don’t think more than a handful of companies can or would want to pursue this approach. In fact, we know of several users who started out on this path, but had to switch to a different approach because they lost engineers to other companies. Net-net, the DIY approach is not for the faint of heart.
Approach 2: Distro
For large sophisticated users that plan to customize a cloud for their own use and have the skills to manage it, an OpenStack distribution is an attractive approach.
In this approach, no upstream engineering is required. Instead, the company is responsible for deploying a known good distribution from a vendor and managing its lifecycle.
Even though this is simpler than DIY, very few companies can manage a complex, distributed, and fast-moving piece of software such as OpenStack – a point made by Boris Renski in his recent blog, Infrastructure Software is Dead. Therefore, most customers end up utilizing extensive professional services from the distribution vendor.
Approach 3: Managed Services
For customers who don’t want to deal with the hassle of managing OpenStack, but want control over the hardware and datacenter (on-prem or colo), managed services may be a great option.
In this approach, the user is responsible for the hardware, the datacenter, and tenant management; but OpenStack is fully managed by the vendor. Ultimately this may be the most appealing model for a large set of customers.
Approach 4: Hosted Private Cloud
This approach is a variation of the Managed Services approach. In this option, not only is the cloud managed, it is also hosted by the vendor. In other words, the user does not even have to purchase any hardware or manage the datacenter. In terms of look and feel, this approach is analogous to purchasing a public cloud, but without the “noisy neighbor” problems that sometimes arise.
Which approach is best?
Each approach has its pros and cons, of course. For example, each approach has different requirements in terms of engineering resources:

Need upstream OpenStack engineering team – DIY: Yes; Distro: No; Managed Service: No; Hosted Private Cloud: No

Need OpenStack IT architecture team – DIY: Yes; Distro: Yes; Managed Service: No; Hosted Private Cloud: No

Need OpenStack IT/OPS team – DIY: Yes; Distro: Yes; Managed Service: No; Hosted Private Cloud: No

Need hardware & datacenter team – DIY: Yes; Distro: Yes; Managed Service: Yes; Hosted Private Cloud: No

Which approach you choose should also depend on factors such as how important the initiative is to your business and the relative cost:

How important is the private cloud to the company? – DIY: the business depends on the private cloud; Distro: the cloud is extremely strategic to the business; Managed Service: the cloud is very strategic to the business; Hosted Private Cloud: the cloud is somewhat strategic to the business

Ability to impact the community – DIY: Very direct; Distro: Somewhat direct; Managed Service: Indirect; Hosted Private Cloud: Minimal

Cost (relative) – DIY: Depends on skills & scale; Distro: Low; Managed Service: Medium; Hosted Private Cloud: High

Ability to own OpenStack operations – DIY: Yes; Distro: Yes; Managed Service: Depends on whether the vendor offers a transfer option; Hosted Private Cloud: No

So, as a user of an OpenStack private cloud, you have four ways to consume the software.
The cost and convenience of each approach vary, as the simplified comparisons above show, and need to be traded off against your strategy and requirements.
OK, so we know why you need a private cloud, and how you can consume one. But there’s still one burning question: who needs it?
Source: Mirantis