InfraKit and Docker Swarm Mode: A Fault-Tolerant and Self-Healing Cluster

Back in October 2016, we released InfraKit, an open source toolkit for creating and managing declarative, self-healing infrastructure. This is the second in a two-part series that dives more deeply into the internals of InfraKit.
Introduction
In the first installment of this two-part series about the internals of InfraKit, we presented InfraKit's design, architecture, and approach to high availability. We also discussed how it can be combined with other systems to give distributed computing clusters self-healing and self-managing properties. In this installment, we present an example of leveraging Docker Engine in Swarm Mode to achieve high availability for InfraKit, which in turn enhances the Docker Swarm cluster by making it self-healing.
Docker Swarm Mode and InfraKit
One of the key architectural features of Docker in Swarm Mode is the manager quorum powered by SwarmKit. The manager quorum stores information about the cluster, and consistency of that information is achieved through the Raft consensus algorithm, which is also at the heart of other systems like etcd. This guide gives an overview of the architecture of Docker Swarm Mode and of how the manager quorum maintains the state of the cluster.
One aspect of the cluster state maintained by the quorum is node membership: which nodes are in the cluster, which are managers and which are workers, and their statuses. The Raft consensus algorithm gives us guarantees about the cluster's behavior in the face of failure, and the fault tolerance of the cluster is related to the number of manager nodes in the quorum. For example, a Docker Swarm with three managers can tolerate one node outage, planned or unplanned, while a quorum of five managers can tolerate outages of up to two members, possibly one planned and one unplanned.
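If you want to see this membership for yourself, a minimal check from any manager node lists just the managers and their reachability (the MANAGER STATUS column also shows which node is currently the Leader):

# Show only the manager nodes in the quorum
docker node ls --filter "role=manager"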
The Raft quorum makes the Docker Swarm cluster fault tolerant; however, it cannot fix itself. When the quorum experiences an outage of manager nodes, manual steps are needed to troubleshoot and restore the cluster. These procedures require the operator to update or restore the quorum's topology by demoting and removing old nodes from the quorum and joining new manager nodes when replacements are brought online.
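As a rough sketch of that manual procedure (the node name and join token below are placeholders), recovery usually comes down to a few Docker CLI commands run from a healthy manager and the replacement host:

# On a healthy manager: demote and remove the failed manager
docker node demote <failed-node>
docker node rm <failed-node>   # may require --force if the node is unreachable

# Print the manager join token, then run the printed join command on the replacement host
docker swarm join-token manager
docker swarm join --token <manager-token> <healthy-manager-ip>:2377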
While these administration tasks are easy via the Docker command line interface, InfraKit can automate them and make the cluster self-healing. As described in our last post, InfraKit can be deployed in a highly available manner, with multiple replicas running and only one active master. In this configuration, the InfraKit replicas can accept external input to determine which replica is the active master. This makes it easy to integrate InfraKit with Docker in Swarm Mode: by running InfraKit on each manager node of the Swarm and detecting leadership changes in the Raft quorum via the standard Docker API, InfraKit achieves the same fault tolerance as the Swarm cluster. In turn, InfraKit's monitoring and infrastructure orchestration capabilities can automatically restore the quorum when there is an outage, making the cluster self-healing.
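The leadership signal itself is available from the Docker Engine. As a minimal illustration (run on a manager node; InfraKit consumes the equivalent information through the Docker API rather than the CLI), the following prints true only on the current Raft leader:

# Check whether the local engine is the current leader of the manager quorum
docker node inspect self --format "{{ .ManagerStatus.Leader }}"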
Example: A Docker Swarm with InfraKit on AWS
To illustrate this idea, we created a Cloudformation template that will bootstrap and create a cluster of Docker in Swarm Mode managed by InfraKit on AWS. There are a couple of ways to run it: you can clone the InfraKit examples repo and upload the template, or you can use this URL to launch the stack in the Cloudformation console.
Please note that this Cloudformation script is for demonstration only and may not represent best practices. However, technical users are encouraged to experiment and customize it to suit their purposes. A few things to note about this Cloudformation template:

As a demo, only a few regions are supported: us-west-1 (Northern California), us-west-2 (Oregon), us-east-1 (Northern Virginia), and eu-central-1 (Frankfurt).
It takes the cluster size (number of nodes), SSH key, and instance sizes as the primary user input when launching the stack.
There are options for installing the latest Docker Engine on a base Ubuntu 16.04 AMI or using images with Docker pre-installed that we have published for this demonstration.
It bootstraps the networking environment by creating a VPC, a gateway and routes, a subnet, and a security group.
It creates an IAM role for InfraKit’s AWS instance plugin to describe and create EC2 instances.
It creates a single bootstrap EC2 instance and three EBS volumes (more on this later).  The bootstrap instance is attached to one of the volumes and will be the first leader of the Swarm.  The entire Swarm cluster will grow from this seed, as driven by InfraKit.

With the elements above, this Cloudformation script has everything needed to boot up an InfraKit-managed Docker in Swarm Mode cluster of N nodes (with 3 managers and N-3 workers).
About EBS Volumes and Auto-Scaling Groups
The use of EBS volumes in our example demonstrates an alternative approach to managing Docker Swarm Mode managers. Instead of relying on manually updating the quorum topology by removing and then adding new manager nodes to replace crashed instances, we use EBS volumes attached to the manager instances and mounted at /var/lib/docker for durable state that survives past the life of an instance. As soon as the volume of a terminated manager node is attached to a new replacement EC2 instance, we can carry the cluster state forward quickly because there are far fewer state changes to catch up on. This approach is attractive for large clusters running many nodes and services, where the entirety of the cluster state may take a long time to replicate to a brand new manager that has just joined the Swarm.
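To make this concrete, re-attaching a surviving volume to a replacement instance is essentially a single AWS call plus a mount on the new host. Here is a sketch using the AWS CLI, where the volume ID, instance ID, and device name are placeholders; InfraKit performs the equivalent steps through the AWS APIs rather than the CLI:

# Attach the surviving manager's EBS volume to its replacement EC2 instance;
# the new instance then mounts it at /var/lib/docker before starting Docker.
aws ec2 attach-volume --volume-id vol-0123456789abcdef0 \
    --instance-id i-0abcdef1234567890 --device /dev/xvdf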
The use of persistent volumes in this example highlights InfraKit’s philosophy of running stateful services on immutable infrastructure:

Use compute instances for just the processing cores;  they can come and go.
Keep state on persistent volumes that can survive when compute instances don’t.
The orchestrator has the responsibility of maintaining members in a group identified by fixed logical IDs. In this case these are the private IP addresses of the Swarm managers.
The pairing of logical ID (IP address) and state (on the volume) needs to be maintained.

This brings up a related implementation detail: why not use the auto-scaling group implementations that already exist? First, auto-scaling group implementations vary from one cloud provider to the next, if they are available at all. Second, most auto-scalers are designed to manage cattle, where individual instances in a group are identical to one another. This is clearly not the case for the Swarm managers:

The managers have identities as resources (via their IP addresses).
As infrastructure resources, members of a group know about each other via membership in this stable set of IDs.
The managers identified by these IP addresses have state that needs to be detached and reattached across instance lifetimes. The pairing must be maintained.

Current auto-scaling group implementations focus on managing identical instances in a group. New instances are launched with assigned IP addresses that don't match the group's expectations, and volumes from failed instances in an auto-scaling group don't carry over to the new instances. It is possible to work around these limitations with sweat and conviction; InfraKit, through its support for allocation, logical IDs, and attachments, supports this use case natively.
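To give a feel for how allocation, logical IDs, and attachments fit together, here is a simplified, hypothetical sketch of a group specification. The field names and values are illustrative only; the authoritative schema is the groups.json in the examples repo referenced later in this post:

{
  "ID": "swarm-managers",
  "Properties": {
    "Allocation": {
      "LogicalIDs": ["172.31.16.101", "172.31.16.102", "172.31.16.103"]
    },
    "Instance": {
      "Plugin": "instance-aws",
      "Properties": { "note": "EC2 launch parameters and the per-logical-ID EBS attachments would go here" }
    },
    "Flavor": {
      "Plugin": "flavor-swarm",
      "Properties": { "note": "Swarm manager init/join configuration would go here" }
    }
  }
}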
Bootstrapping InfraKit and the Swarm
So far, the Cloudformation template implements what we call 'bootstrapping', or the process of creating the minimal set of resources needed to jumpstart an InfraKit-managed cluster. With the creation of the networking environment and the first "seed" EC2 instance, InfraKit has the requisite resources to take over and complete provisioning of the cluster to match the user's specification of N nodes (with 3 managers and N-3 workers). Here is an outline of the process:
When the single "seed" EC2 instance boots up, a single line of code is executed in the UserData (aka cloud-init), expressed in Cloudformation JSON:
"docker run --rm ", {"Ref":"InfrakitCore"}, " infrakit template --url ",
{"Ref":"InfrakitConfigRoot"}, "/boot.sh",
" --global /cluster/name=", {"Ref":"AWS::StackName"},
" --global /cluster/swarm/size=", {"Ref":"ClusterSize"},
" --global /provider/image/hasDocker=yes",
" --global /infrakit/config/root=", {"Ref":"InfrakitConfigRoot"},
" --global /infrakit/docker/image=", {"Ref":"InfrakitCore"},
" --global /infrakit/instance/docker/image=", {"Ref":"InfrakitInstancePlugin"},
" --global /infrakit/metadata/docker/image=", {"Ref":"InfrakitMetadataPlugin"},
" --global /infrakit/metadata/configURL=", {"Ref":"MetadataExportTemplate"},
" | tee /var/lib/infrakit.boot | sh\n"
Here, we are running InfraKit packaged in a Docker image, and most of this Cloudformation statement references the Parameters (e.g. "InfrakitCore" and "ClusterSize") defined at the beginning of the template. Using the parameter values in the stack, this translates to a single statement like the following that executes during boot-up of the instance:
docker run --rm infrakit/devbundle:0.4.1 infrakit template
--url https://infrakit.github.io/examples/swarm/boot.sh
--global /cluster/name=mystack
--global /cluster/swarm/size=4 # many more …
| tee /var/lib/infrakit.boot | sh # tee just makes a copy on disk

This single statement marks the hand-off from Cloudformation to InfraKit. When the seed instance starts up (and installs Docker, if it is not already part of the AMI), the InfraKit container runs to execute the InfraKit template command. The template command takes a URL as the source of the template (e.g. https://infrakit.github.io/examples/swarm/boot.sh, or a local file with a URL like file://) and a set of pre-conditions (the --global variables) and renders it. Through the --global flags, we are able to pass in the set of parameters entered by the user when launching the Cloudformation stack. This allows InfraKit to use Cloudformation as the authentication and user interface for configuring the cluster.
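If you want to experiment with this step locally, you can point the same command at a template on disk via a file:// URL and inspect the rendered output instead of piping it to sh; this is only a sketch, and the bind-mount path and variable values are placeholders:

# Render a local boot.sh template and print the result for inspection
docker run --rm -v $(pwd):/templates infrakit/devbundle:0.4.1 \
    infrakit template --url file:///templates/boot.sh \
    --global /cluster/name=mystack --global /cluster/swarm/size=4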
InfraKit uses templates to simplify complex scripting and configuration tasks. The templates can be any text that uses {{ }} tags, aka "handlebars" syntax. Here InfraKit is given a set of input parameters from the Cloudformation template and a URL referencing the boot script. It then fetches the template and renders a script that is executed during boot-up of the instance to perform the following:
 

Format the EBS volume if it is not already formatted.
Stop Docker if it is currently running and mount the volume at /var/lib/docker (sketched below).
Configure the Docker Engine with the proper labels and restart it.
Start an InfraKit metadata plugin that can introspect its environment. The AWS instance plugin, in v0.4.1, can introspect an environment created by Cloudformation, as well as use the instance metadata service available on AWS. InfraKit metadata plugins can export important parameters in a read-only namespace that can be referenced in templates as file-system paths.
Start the InfraKit containers such as the manager, group, instance, and Swarm flavor plugins.
Initialize the Swarm via docker swarm init.
Generate a config JSON for InfraKit itself. This JSON is also rendered from a template (https://github.com/infrakit/examples/blob/v0.4.1/swarm/groups.json) that references environment parameters like region, availability zone, subnet IDs, and security group IDs, which are exported by the metadata plugins.
Perform an infrakit manager commit to tell InfraKit to begin managing the cluster.

See https://github.com/infrakit/examples/blob/v0.4.1/swarm/boot.sh for details.
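A heavily simplified sketch of the first few steps appears below; the device path and IP address are placeholders, and the real boot.sh is templated and handles many more cases:

# Format the EBS volume only if it does not already contain a filesystem
if ! sudo blkid /dev/xvdf; then sudo mkfs.ext4 /dev/xvdf; fi

# Stop Docker, mount the volume as the engine's data directory, then restart it
sudo systemctl stop docker
sudo mount /dev/xvdf /var/lib/docker
sudo systemctl start docker

# On the seed manager only: initialize the Swarm
docker swarm init --advertise-addr 172.31.16.101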
When the InfraKit replica begins running, it notices that the current infrastructure state (of only one node) does not match the user's specification of 3 managers and N-3 workers. InfraKit will then drive the infrastructure state toward the user's specification by creating the rest of the managers and workers to complete the Swarm.
The topics of metadata and templating in InfraKit will be the subjects of future blog posts. In a nutshell, metadata is information exposed by compatible plugins, organized and accessible in a cluster-wide namespace. Metadata can be accessed in the InfraKit CLI or in templates with file-like path names. You can think of this as a cluster-wide read-only sysfs. The InfraKit template engine, on the other hand, can make use of this data to render complex configuration scripts or JSON documents. The template engine supports fetching a collection of templates from a local directory or from a remote site, like the example GitHub repo, which has been configured to serve up the templates like a static website or S3 bucket.
 
Running the Example
You can either fork the examples repo or use this URL to launch the stack in the AWS console. Here, we first bootstrap the Swarm with the Cloudformation template, and then InfraKit takes over and provisions the rest of the cluster. Finally, we demonstrate fault tolerance and self-healing by terminating the leader manager node in the Swarm to induce a fault and force failover and recovery.
When you launch the stack, you have to answer a few questions:

The size of the cluster.  This script always starts a Swarm with 3 managers, so use a value greater than 3.

The SSH key.

There’s an option to install Docker or use an AMI with Docker pre-installed.  An AMI with Docker pre-installed gives shorter startup time when InfraKit needs to spin up a replacement instance.

Once you agree and launch the stack, it takes a few minutes for the cluster to come up. In this case, we start a 4-node cluster. In the AWS console we can verify that the cluster is fully provisioned by InfraKit:

Note the private IP addresses 172.31.16.101, 172.31.16.102, and 172.31.16.103 are assigned to the Swarm managers, and they are the values in our configuration. In this example the public IP addresses are dynamically assigned: 35.156.207.156 is bound to the manager instance at 172.31.16.101.  
Also, we see that InfraKit has attached the 3 EBS volumes to the manager nodes:

Because InfraKit is configured with the Swarm Flavor plugin, it also made sure that the manager and worker instances successfully joined the Swarm.  To illustrate this, we can log into the manager instances and run docker node ls. As a means to visualize the Swarm membership in real-time, we log into all three manager instances and run
watch -d docker node ls  
The watch command will by default refresh docker node ls every 2 seconds.  This allows us to not only watch the Swarm membership changes in real-time but also check the availability of the Swarm as a whole.

Note that at this time, the leader of the Swarm is, just as we expected, the bootstrap instance, 172.31.16.101.
Let’s make a note of this instance’s public IP address (35.156.207.156), private IP address (172.31.16.101), and its Swarm Node cryptographic identity (qpglaj6egxvl20vuisdbq8klr).  Now, to test fault tolerance and self-healing, let’s terminate this very leader instance.  As soon as this instance is terminated, we would expect the quorum leadership to go to a new node, and consequently, the InfraKit replica running on that node will become the new master.
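You can terminate the instance from the EC2 console, or with the AWS CLI as in the sketch below (the instance ID is a placeholder for the seed manager's actual ID):

# Terminate the current Swarm leader to induce a failure
aws ec2 terminate-instances --instance-ids i-0123456789abcdef0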

Immediately the screen shows there is an outage:  In the top terminal, the connection to the remote host (172.31.16.101) is lost.  In the second and third terminals below, the Swarm node lists are being updated in real time:

When the 172.31.16.101 instance is terminated, the leadership of the quorum is transferred to another node, at IP address 172.31.16.102. Docker Swarm Mode is able to tolerate this failure and continue to function (as seen by the continued functioning of docker node ls on the remaining managers). However, the Swarm has noticed that the 172.31.16.101 instance is now Down and Unreachable.

As configured, a quorum of 3 managers can tolerate one instance outage. At this point, the cluster continues operation without interruption. All your apps running on the Swarm continue to work and you can deploy services as usual. However, without any automation, the operator needs to intervene at some point and perform some tasks to restore the cluster before another outage of the remaining nodes occurs.
Because this cluster is managed by InfraKit, the replica running on 172.31.16.102 becomes the master when that same instance assumes leadership of the quorum. Because InfraKit is tasked with maintaining the specification of 3 manager instances with IP addresses 172.31.16.101, 172.31.16.102, and 172.31.16.103, it will take action when it notices that 172.31.16.101 is missing. To correct the situation, it will:

Create a new instance with the private IP address 172.31.16.101
Attach the EBS volume that was previously associated with the downed instance
Restore the volume, so that Docker Engine and InfraKit start running on that new instance.
Join the new instance to the Swarm.

As seen above, the new instance at private IP 172.31.16.101 now has an ephemeral public IP address 35.157.163.34, when it was previously 35.156.207.156.  We also see that the EBS volume has been re-attached:

Because the EBS volume is re-attached at /var/lib/docker on the new instance and the same IP address is reused, the new instance appears exactly as though the downed instance were resurrected and had rejoined the cluster. As far as the Swarm is concerned, 172.31.16.101 may as well have been subjected to a temporary network partition and has since recovered and rejoined the cluster:

At this point, the cluster has recovered without any manual intervention.  The managers are now showing as healthy, and the quorum lives on!
Conclusion
While this example is only a proof-of-concept, we hope it demonstrates the potential of InfraKit as an active infrastructure orchestrator which can make a distributed computing cluster both fault-tolerant and self-healing.  As these features and capabilities mature and harden, we will incorporate them into Docker products such as Docker Editions for AWS and Azure.
InfraKit is a young and rapidly evolving project, and we are actively testing and building ways to safeguard and automate the operations of large distributed computing clusters. While this project is being developed in the open, your ideas and feedback can help guide us down the path toward making distributed computing resilient and easy to operate.
Check out the InfraKit repository README for more info, a quick tutorial, and ways to start experimenting: from plain files to Terraform integration to building a ZooKeeper ensemble. Have a look, explore, and join us on GitHub or online at the Docker Community Slack Channel (infrakit). Send us a PR, open an issue, or just say hello. We look forward to hearing from you!
More Resources:

Check out all the Infrastructure Plumbing projects
The InfraKit examples GitHub repo
Sign up for Docker for AWS or Docker for Azure
Try Docker today 


Source: https://blog.docker.com/feed/

No, WikiLeaks Didn't Just Reveal That The Government Has Access To Your Secure Messaging Apps

Reuters File Photo / Reuters

SAN FRANCISCO — A misreading of new WikiLeaks documents published Tuesday morning led to mass panic over whether the CIA and allied intelligence organizations could hack into secure messaging apps trusted by millions of people across the world.

The claims were based on a cache of almost 9,000 documents and files that WikiLeaks said came from the CIA's Center for Cyber Intelligence and which allegedly detail how the CIA hacks into phones, laptops, and other connected devices. A number of news outlets reported that the documents revealed that Signal, WhatsApp, and other messaging apps that use high-level encryption to ensure that messages are sent and received safely had been compromised.

Cybersecurity experts, however, were quick to point out that the documents simply stated that if a phone was compromised — which is to say if the CIA hacked into the phone itself — any apps on that phone would no longer be secure. This is the equivalent of saying that if your house is broken into and bugged, whispering softly on your phone in your bedroom is not going to make that conversation secure.

The leak is the latest to become public by WikiLeaks, which has come under fire for failing to adequately redact certain documents and also for its role in the US election. Last year the group released thousands of emails detailing the communications of top Democratic Party leaders — which were widely believed to originate from a Russian government–sponsored hack. US intelligence agencies accused Russia of trying to meddle in the US elections and said WikiLeaks had assisted in that cause.

Source: BuzzFeed

Google Staffs Up As It Tries To Find A Way Into Trump Administration Circles

Eric Piermont / AFP / Getty Images

After vocally opposing President Trump, Alphabet, Google's parent company, has been making quiet inroads in Republican circles with a series of new hires and administration outreach.

Alphabet was one of several tech companies that led the charge in opposition to President Trump's initial travel ban — it helped mount a legal challenge, hosted rallies on its campuses, and one of its co-founders took part in an airport protest. Its executive chairman, Eric Schmidt, who supported Hillary Clinton's presidential campaign and had close ties to Obama, even told employees that the administration is going to do "evil things." Yet simultaneously with that vocal opposition, the search giant has been working to secure its footing in the new GOP-dominated landscape.

Google recently hired two people to help bolster the company's outreach to conservative groups and the Trump administration. Lee Carosi Dunn, who was previously head of elections sales and a Republican lobbyist for Google, is now the head of White House strategy and outreach. And Max Pappas, formerly a top advisor to Republican Senator Ted Cruz, will now serve as Google's manager of outreach and public policy partnerships, working as Google's point-person to conservative advocacy groups.

The company is also looking for an account team leader to helm Republican political advertising. Posted late last week, the position calls for candidates with “a wealth of experience with Republican campaigns,” and “strong relationships with GOP campaign managers, pollsters and general consultants.” In addition to an advanced degree and five years of management experience, the preferred qualifications include “deep relationships in Republican politics.”

Luntz Global, the corporate and political consulting firm founded by Republican pollster Frank Luntz, has also been tapped by the company to help with messaging and outreach to the administration, according to a person familiar with the partnership. Google has worked with Luntz Global in the past. The company is listed as one of their corporate clients among Uber, HBO, Walt Disney and several others. Neither Google/Alphabet nor Luntz Global responded to a request for comment.

Vincent Harris, CEO of Harris Media, who led digital strategy for Rand Paul’s presidential run and managed digital operations for Ted Cruz, told BuzzFeed News that the company has vastly improved its relationships among Republican operatives since he began working with it eight years ago.

“Google always has to be concerned about looking too liberal as a company from the perspective of the Republicans,” he said.

"Their management's politics are often out of sync with the Republican party but from my personal perspective, the company has bent over backwards to try and work with Republican agencies and campaigns." He added, "They often go out of their way to avoid any appearance of favoritism for Democrats."

Matt Stoller, a fellow at the Open Markets program at New America, whose research focuses on competition policy, described Google/Alphabet's influence during the Obama years as "Wall Street West."

"They weren't going to repeat the mistakes of Microsoft," Stoller told BuzzFeed News, referring to that company's antitrust issues during the Clinton administration. "Microsoft showed disdain for Washington and that's why they got hit with the antitrust suit. That's why Google curried so much political favor."

But courting Republicans in Trump's Washington may come as a challenge. And under the new administration, Stoller thinks the company is in a bind. "There are multiple factions in the Trump world that do not like Google — both corporate competitors who are up against a monopoly, but also some of the nationalists don't trust Silicon Valley."

Another point of tension exists between the company's valuable engineering workforce, which generally opposes President Trump's policies, and Alphabet's corporate leadership, which has to curry favor with the White House.

Still, Alphabet remains a Washington powerhouse. Last year the company spent over $15.4 million lobbying Congress and federal agencies, and hired nearly two dozen outside firms to help push its priorities. It continues to outspend every other technology company in the nation's capital.

Who President Trump appoints to fill top antitrust posts in the federal government may also serve as a sign of Alphabet's influence in the post-Obama era. Trump has yet to nominate a permanent chair of the Federal Trade Commission or the chief antitrust lawyer at the Department of Justice. How these officials might grapple with Alphabet's sprawling businesses and those of other tech titans like Amazon and Facebook will be closely watched. While regulators in Europe have brought several anti-competitive charges against Alphabet, the FTC closed its probe of the company's search practices in 2013, a contentious move that critics point to as a troubling aspect of Obama's tech legacy.

Three people with knowledge of Trump's staffing decisions have told BuzzFeed News that Utah Attorney General Sean Reyes, who has called for the FTC to re-open that antitrust case, is a leading contender for the FTC chair.

Source: BuzzFeed

Protect and recover Hyper-V machines to premium storage with Azure Site Recovery

We are excited to announce support for replication of Hyper-V virtual machines (whether or not they are managed by System Center VMM) to premium storage accounts in Azure.

We recommend that you replicate I/O-intensive enterprise workloads to premium storage, which provides high IOPS and high disk throughput per VM with extremely low latencies for read operations. At the time of a failover to Azure, workloads replicating to premium storage come up on Azure virtual machines running on premium storage and achieve high levels of performance, in terms of both throughput and latency.

To set up replication to premium storage, you will need the following (a sketch of creating both accounts is shown after the list):

A premium storage account: When you replicate your on-premises virtual machines/physical servers to premium storage, all the data residing on the protected machine’s disks is replicated to the premium storage account.
A standard storage account: After the initial phase of replicating disk data is complete, all changes to the on-premises disk data are tracked continuously and stored as replication logs in the standard storage account.
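As a rough illustration of provisioning these two accounts, here is a sketch using the Azure CLI; the account names, resource group, and location are placeholders, and the Azure Site Recovery documentation has the authoritative setup steps:

# Premium storage account that will hold the replicated disk data
az storage account create --name asrpremiumstore --resource-group asr-demo \
    --location westus --sku Premium_LRS

# Standard storage account for the ongoing replication logs
az storage account create --name asrlogstore --resource-group asr-demo \
    --location westus --sku Standard_LRS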

 

Below are a few considerations to keep in mind when using premium storage:

Replication to premium storage is supported for both Classic and Resource Manager storage accounts.
A copy frequency of 5 minutes or 15 minutes (configured as a setting in replication policies) is supported for premium storage. This is based on the limit of 100 snapshots per blob supported by premium storage.

 

Support matrix for premium storage

Scenario                                                                  | Replication to premium storage
Hyper-V virtual machines (managed or not managed by System Center VMM)   | Yes
VMware virtual machines / physical servers                               | Yes

 

To understand more about how premium storage works, including the performance and scalability targets of premium storage, you can refer to the detailed documentation on Premium Storage from the Azure Storage team.

Ready to start using ASR? Check out additional product information to start replicating your workloads to Microsoft Azure using Azure Site Recovery today. You can use the powerful replication capabilities of Site Recovery for 31 days at no charge for every new physical server or virtual machine that you replicate. Visit the Azure Site Recovery forum on MSDN for additional information and to engage with other customers, or use the ASR UserVoice to let us know what features you want us to enable next.

Azure Site Recovery, as part of Microsoft Operations Management Suite, enables you to gain control and manage your workloads no matter where they run (Azure, AWS, Windows Server, Linux, VMware or OpenStack) with a cost-effective, all-in-one cloud IT management solution. Existing System Center customers can take advantage of the Microsoft Operations Management Suite add-on, empowering them to do more by leveraging their current investments. Get access to all the new services that OMS offers, with a convenient step-up price for all existing System Center customers. You can also access only the IT management services that you need, enabling you to on-board quickly and have immediate value, paying only for the features that you use.
Source: Azure