OpenShift for Operators Lab: Sneak Preview!

Back by popular demand, the OpenShift for Operators lab session is coming to Red Hat Summit with fresh new content for 2017! We get a full house every year, so if you want in, you need to get registered today. Read on to learn more about Red Hat Summit’s great lab sessions, and in particular, OpenShift for Operators.
Quelle: OpenShift

Cloud Therapy helps doctors better understand and diagnose rare diseases

For patients who have a rare disease, it can take a long time to get an accurate diagnosis. Some conditions have doctors working months or even years to try to make sense of symptoms and test results. Patients cannot afford to forgo treatment while doctors try to discern clues about their conditions. For many, there’s no time to waste, yet it can take five months to a year or more to reach a diagnosis.
Cloud Therapy was founded to help doctors distill diagnostic insights from enormous amounts of medical literature and research. The solution uses natural language processing (NLP), probabilistic reasoning and deep learning to analyze terabytes of unstructured data to help doctors recognize rare conditions quickly and accurately.
If the solution can help medical professionals access information faster, it can mean the difference between life and death for someone with a rare disease.
A concept imagined
The predecessor to Cloud Therapy was a simple iPad app that contained information for patients to review before the doctor entered the consult room.
Throughout the day, a doctor may lose five minutes with one patient and six minutes with another while repeatedly answering the same questions. Those overruns could add up to three hours, and the doctor would have to put in extra time to finish daily work. The app helped cut that down to about an hour and a half.
Around that time, IBM Watson debuted on Jeopardy!, and soon after, the technology became available for developers. This is when work began on the Cloud Therapy solution.
Cloud Therapy evolves
When the full range of Watson application programming interfaces (APIs) became available on the IBM Bluemix platform, Cloud Therapy took off.
Developing with Bluemix gives Cloud Therapy a competitive advantage because it can incorporate different services and data sources without spending time to develop custom integration logic. The company began experimenting with big data analytics and cognitive computing services. The team realized they had a noteworthy idea when the Global Mobile Innovators Tournament, enabled by IBM, chose Cloud Therapy as a finalist. The company’s prototype idea was later shared at the “4 Years From Now” (4YFN) conference in Barcelona, Spain.
Since then, the product has evolved, incorporating IBM Watson Conversation, Dialog, Natural Language Classifier, Retrieve and Rank, Speech to Text and Text to Speech services along with the IBM Cloudant NoSQL database, IBM Mobile Push Notification and IBM Mobile Quality Assurance services.
Cloud Therapy relies on Cloudant because the company wants people to have the best possible experience with no lag. Delivering the solution in seconds is what makes Cloud Therapy what it is.
With IBM Bluemix Mobile Services, the Cloud Therapy solution supports mobile app capabilities, including the ability for users to offer immediate feedback within the app. It also uses Google Translate to accommodate English and Spanish language content.
Far more than a search engine
Many might wonder whether a regular search engine could understand the relationships within a question the way Watson does. With a search engine, people must cross their fingers that whatever results they get will be meaningful. The Cloud Therapy solution uses Watson’s brain to understand relationships and actually learn.
For each customer implementation, Cloud Therapy cleans, curates and uploads data, including clinical trials, drug research, previous case information, and genotype and phenotype patterns. The data can be both structured and unstructured. Initially, the company creates question-and-answer pairs to train the algorithms. The training process involves providing feedback to the system as it matches appropriate answers to questions. This requires input from both subject matter experts and data scientists.
After learning from the data set, the cognitive engine uses NLP to understand queries and then applies textual analysis and probabilistic reasoning to return appropriate content, ranked by confidence level. The platform can process 1.3 terabytes of data per second.
The solution also learns from user feedback. Doctors can converse with the system in natural language, and it helps provide clues to a patient’s condition.
Faster, more accurate diagnoses
For several pilot implementations, each Cloud Therapy customer brings its own data and desired use case. For one project, a customer will provide 13 years of data on 9,000 rare disease cases that doctors will want to query.
Cloud Therapy tells all its customers that this is experimental technology and they should understand that, especially because expectations in the cognitive era are enormously high. The company tries to help everyone recognize that the technology is brand new, and so far, so good.
Cloud Therapy sees the value, and its partners see the value. Understanding patient symptoms for a faster, more accurate diagnosis can be a matter of life and death.
Read more about Cloud Therapy.
The post Cloud Therapy helps doctors better understand and diagnose rare diseases appeared first on news.
Quelle: Thoughts on Cloud

Let’s Meet At OpenStack Summit In Boston!

The post Let's Meet At OpenStack Summit In Boston! appeared first on Mirantis | Pure Play Open Cloud.

 
The citizens of Cloud City are suffering — Mirantis is here to help!
 
We're planning to have a super time at the Summit, and hope that you can join us in the fight against vendor lock-in. Come to booth C1 to power up on the latest technology and our revolutionary Mirantis Cloud Platform.

If you'd like to talk with our team at the summit, simply contact us and we'll schedule a meeting.

REQUEST A MEETING

 
Free Mirantis Training @ Summit
Take advantage of our special training offers to power up your skills while you're at the Summit! Mirantis Training will be offering an Accelerated Bootcamp session before the big event. Our courses will be conveniently held within walking distance of the Hynes Convention Center.

Additionally, we're offering a discounted Professional-level Certification exam and a free Kubernetes training, both held during the Summit.

 
Mirantis Presentations
Here's where you can find us during the summit…
 
MONDAY MAY 8

Monday, 12:05pm-12:15pm
Level: Intermediate
Turbo Charged VNFs at 40 gbit/s. Approaches to deliver fast, low latency networking using OpenStack.
(Gregory Elkinbard, Mirantis; Nuage)

Monday, 3:40pm-4:20pm
Level: Intermediate
Project Update - Documentation
(Olga Gusarenko, Mirantis)

Monday, 4:40pm-5:20pm
Level: Intermediate
Cinder Stands Alone
(Ivan Kolodyazhny, Mirantis)

Monday, 5:30pm-6:10pm
Level: Intermediate
m1.Boaty.McBoatface: The joys of flavor planning by popular vote
(Craig Anderson, Mirantis)

 

TUESDAY MAY 9

Tuesday, 2:00pm-2:40pm
Level: Intermediate
Proactive support and Customer care
(Anton Tarasov, Mirantis)

Tuesday, 2:30pm-2:40pm
Level: Advanced
OpenStack, Kubernetes and SaltStack for complete deployment automation
(Aleš Komárek and Thomas Lichtenstein, Mirantis)

Tuesday, 2:50pm-3:30pm
Level: Intermediate
OpenStack Journey: from containers to functions
(Ihor Dvoretskyi, Mirantis; Iron.io, BlueBox)

Tuesday, 4:40pm-5:20pm
Level: Advanced
Point and Click ->CI/CD: Real world look at better OpenStack deployment, sustainability, upgrades!
(Bruce Mathews and Ryan Day, Mirantis; AT&T)

Tuesday, 5:05pm-5:45pm
Level: Intermediate
Workload Onboarding and Lifecycle Management with Heat
(Florin Stingaciu and Lance Haig, Mirantis)

 

WEDNESDAY MAY 10

Wednesday, 9:50am-10:30am
Level: Intermediate
Project Update - Neutron
(Kevin Benton, Mirantis)

Wednesday, 11:00am-11:40am
Level: Intermediate
Project Update - Nova
(Jay Pipes, Mirantis)

Wednesday, 1:50pm-2:30pm
Level: Intermediate
Kuryr-Kubernetes: The seamless path to adding Pods to your datacenter networking
(Ilya Chukhnakov, Mirantis)

Wednesday, 1:50pm-2:30pm
Level: Intermediate
OpenStack: pushing to 5000 nodes and beyond
(Dina Belova and Georgy Okrokvertskhov, Mirantis)

Wednesday, 4:30pm-5:10pm
Level: Intermediate
Project Update - Rally
(Andrey Kurilin, Mirantis)

 

THURSDAY MAY 11

Thursday, 9:50am-10:30am
Level: Intermediate
OSprofiler: evaluating OpenStack
(Dina Belova, Mirantis; VMware)

Thursday, 11:00am-11:40am
Level: Intermediate
Scheduler Wars: A New Hope
(Jay Pipes, Mirantis)

Thursday, 11:30am-11:40am
Level: Beginner
Saving one cloud at a time with tenant care
(Bryan Langston, Mirantis; Comcast)

Thursday, 3:10pm-3:50pm
Level: Advanced
Behind the Scenes with Placement and Resource Tracking in Nova
(Jay Pipes, Mirantis)

Thursday, 5:00pm-5:40pm
Level: Intermediate
Terraforming OpenStack Landscape
(Mykyta Gubenko, Mirantis)

 

Notable Presentations By The Community
 
TUESDAY MAY 9

Tuesday, 11:15am-11:55am
Level: Intermediate
AT&T Container Strategy and OpenStack's role in it
(AT&T)

Tuesday, 11:45am-11:55am
Level: Intermediate
AT&T Cloud Evolution: Virtual to Container based (CI/CD)^2
(AT&T)

WEDNESDAY MAY 10

Wednesday, 1:50pm-2:30pm
Level: Intermediate
Event Correlation & Life Cycle Management – How will they coexist in the NFV world?
(Cox Communications)

Wednesday, 5:20pm-6:00pm
Level: Intermediate
Nova Scheduler: Optimizing, Configuring and Deploying NFV VNFs on OpenStack
(Wind River)

THURSDAY MAY 11

Thursday, 9:00am-9:40am
Level: Intermediate
ChatOpsing Your Production Openstack Cloud
(Adobe)

Thursday, 11:00am-11:10am
Level: Intermediate
OpenDaylight Network Virtualization solution (NetVirt) with FD.io VPP data plane
(Ericsson)

Thursday, 1:30pm-2:10pm
Level: Beginner
Participating in translation makes you an internationalized OpenStacker & developer
(Deutsche Telekom AG)

Thursday, 5:00pm-5:40pm
Level: Beginner
Future of Cloud Networking and Policy Automation
(Cox Communications)

The post Let's Meet At OpenStack Summit In Boston! appeared first on Mirantis | Pure Play Open Cloud.
Quelle: Mirantis

Steve Hardy: OpenStack TripleO in Ocata, from the OpenStack PTG in Atlanta

Steve Hardy talks about TripleO in the Ocata release, at the Openstack PTG in Atlanta.

Steve: My name is Steve Hardy. I work primarily on the TripleO project, which is an OpenStack deployment project. What makes TripleO interesting is that it uses OpenStack components primarily in order to deploy a production OpenStack cloud. It uses OpenStack Ironic to do bare metal provisioning. It uses Heat orchestration in order to drive the configuration workflow. And we also recently started using Mistral, which is an OpenStack workflow component.

So it’s kind of different from some of the other deployment initiatives. And it’s a nice feedback loop where we’re making use of the OpenStack services in the deployment story, as well as in the deployed cloud.

This last couple of cycles we’ve been working towards more composability. That basically means allowing operators more flexibility with service placement, and also allowing them to define groups of nodes in a more flexible way, so that you could either specify different configurations – perhaps you have multiple types of hardware for different compute configurations for Nova – or perhaps you want to scale particular services onto particular groups of clusters.

It’s basically about giving more choice and flexibility into how they deploy their architecture.

Rich: Upgrades have long been a pain point. I understand there’s some improvement in this cycle there as well?

Steve: Yes. Having delivered composable services and composable roles for the Newton OpenStack release, the next big challenge was upgrades: once you give operators the flexibility to deploy services on arbitrary nodes in their OpenStack environment, you need some way to upgrade, and you can’t necessarily make assumptions about which service is running on which group of nodes. So we’ve implemented a new feature called composable upgrades. That uses some Heat functionality combined with Ansible tasks, in order to allow very flexible, dynamic definition of what upgrade actions need to take place when you’re upgrading some specific group of nodes within your environment. That’s part of the new Ocata release. It’s hopefully going to provide a better upgrade experience for end-to-end upgrades of all the OpenStack services that TripleO supports.

Rich: It was a very short cycle. Did you get done what you wanted to get done, or are things pushed off to Pike now?

Steve: I think there’s a few remaining improvements around operator-driven upgrades, which we’ll be looking at during the Pike cycle. It certainly has been a bit of a challenge with the short development timeframe during Ocata. But the architecture has landed, and we’ve got composable upgrade support for all the services in Heat upstream, so I feel like we’ve done what we set out to do in this cycle, and there will be further improvements around operator-driven upgrade workflow and also containerization during the Pike timeframe.

Rich: This week we’re at the PTG. Have you already had your team meetings, or are they still to come?

Steve: The TripleO team meetings start tomorrow, which is Wednesday. The previous two days have mostly been cross-project discussion. Some of which related to collaborations which may impact TripleO features, some of which was very interesting. But the TripleO schedule starts tomorrow – Wednesday and Thursday. We’ve got a fairly packed agenda, which is going to focus around – primarily the next steps for upgrades, containerization, and ways that we can potentially collaborate more closely with some of the other deployment projects within the OpenStack community.

Rich: Is Kolla something that TripleO uses to deploy, or is that completely unrelated?

Steve: The two projects are collaborating. Kolla provides a number of components, one of which is container definitions for the OpenStack services themselves, and the containerized TripleO architecture actually consumes those. There are some other pieces which are different between the two projects. We use Heat to orchestrate container deployment, and there’s an emphasis on Ansible and Kubernetes on the Kolla side, where we’re having discussions around future collaboration.

There’s a session planned on our agenda for a meeting between the Kolla Kubernetes folks and TripleO folks to figure out if there’s long-term collaboration there. But at the moment there’s good collaboration around the container definitions, and we just orchestrate deploying those containers.

We’ll see what happens in the next couple of days of sessions, and getting on with the work we have planned for Pike.

Rich: Thank you very much.
Quelle: RDO

Ten Ways a Cloud Management Platform Makes your Virtualization Life Easier

I spent the last decade working with virtualization platforms and the certifications and accreditations that go along with them.  During this time, I thought I understood what it meant to run an efficient data center. After six months of working with Red Hat CloudForms, a Cloud Management Platform (CMP), I now wonder what I was thinking.  I encountered every one of the problems below, and each is preventable with the right solution. Remember, we live in the 21st century; shouldn’t the software that we use act like it?

We filled up a data store and all of the machines on it stopped working. 
It does not matter if it is a development environment or the mission critical database cluster: when storage fills up, everything stops!  More often than not it is due to an excessive number of snapshots. The good news is CloudForms can quickly be set up with a policy to recognize and prevent this from happening. For example, we can check the storage utilization and take action if it is over 90% full, or better yet, when it is within two weeks of being full based on usage trends. That way if manual action is required, there is enough forewarning to do so.  Another good practice is to set up a policy that flags machines with more than a few snapshots. We all love to take snapshots, but there is a real cost to them, and there is no need to let them get out of hand.
I just got thousands of emails telling me that my host is down. The only thing worse than no email alert is receiving thousands of them. In CloudForms it is not only easy to set up alerts, but also to define how often they should be acted upon. For example, check every hour, but only notify once per day.
Your virtual machines (VMs) cannot be migrated because the VM tools updater CD-ROM image was not un-mounted correctly. 
This is a serious issue for a number of reasons.  First, it breaks Disaster Recovery (DR) operations and can cause virtual machines to be out of balance. It also disables the ability to put a node into maintenance mode, potentially causing additional outages and delays. Most solutions involve writing a shell script that runs as root and attempts to periodically unmount the virtual CD-ROM drives. These scripts usually work, but they are both scary from a security standpoint and indiscriminately dangerous: imagine physically ejecting the CD-ROM while the database administrator is in the middle of a database upgrade!  With CloudForms we can set up a simple policy that unmounts drives once a day, but only after sanity checking that it is the correct CD-ROM image and that the system is in a state where it can be safely unmounted.
I have to manually ensure that all of my systems pass an incredibly detailed and painful compliance check (STIGS, PCI, FIPS, etc.) by next week! 
I have lost weeks of my life to this, and if you have not had the pleasure, count yourself lucky.  When the “friendly” auditors show up with a stack of three-ring binders and a mandate to check everything, you might as well clear your calendar for the next few weeks. In addition, since these checks are usually a requirement for continuing operations, expect many of these meetings to involve layers of upper management you did not know existed, and this is definitely not the best time to become acquainted. The good news is CloudForms allows you to run automatic checks on VMs and hosts. If you are not already familiar with its OpenSCAP scanning capability, you owe yourself a look. Not only that, but if someone attempts to bring a VM online that is not compliant, CloudForms can shut it right back down. That is the type of peace of mind that allows for sleep-filled nights.
Someone logged into a production server as root using the virtual console and broke it.  Now you have to physically hunt down and interrogate all the potential culprits, as well as fix the problem. 
Before you pull out your foam bat and roam the halls to apply some “sense” to the person who did this, it is good to know exactly who it was and what they did. With CloudForms you can see a timeline of each machine, who logged into what console, as well as perform a drift analysis to potentially see what changed.  With this knowledge you can now not only fix the problem, but also “educate” the responsible party.
The developers insist that all VMs must have 8 vCPUs and 64GB of RAM. 
The best way to fight flagrant waste of resources is with data.  CloudForms provides the concept of “Right-Sizing”, where it watches VMs operate and determines the ideal resource allocation. With this information in hand, CloudForms can either automatically adjust the allocations or produce a report showing what the excessive resources are costing.
Someone keeps creating 32-bit VMs with more than 4GB of RAM! 
As we know, there is no “good” way that a 32-bit VM can possibly use that much memory, so it is essentially just waste.  A simple CloudForms policy to check for “OS Type = 32bit” and “RAM > 4GB” can produce a very interesting report. Or better yet, put a policy in place to automatically adjust the memory to 4GB and notify the system owner.
I have to buy hardware for next year, but my capacity-planning formula involves a spreadsheet and a dart board. 
Long-term planning in IT is hard, especially with dynamic workloads in a multi-cloud environment.  Once CloudForms is running, it automatically collects performance data and executes trend line analysis to assist with operational management. For example, in 23 days you will be out of storage on your production SAN. If that does not get the system administrator's attention, nothing will. It can also perform simulations to see what your environment would look like if you added resources, so you can see your trend lines and capacity if you added another 100 VMs of a particular type and size.
For some reason two hosts were swapping VMs back and forth, and I only found out when people complained about performance. 
As an administrator there is no worse way to find out that something is wrong than being told by a user. Large scale issues such as this can be hard to see from the logs since they consist of typical output. With CloudForms, a timeline overview of the entire environment highlights issues like this and the root cause can be tracked down.
I spend most of my day pushing buttons, spinning up VMs, manually grouping them into virtual folders and tracking them with spreadsheets. 
Before starting a new administrator role it is always good to ask for the “Point of Truth” system that keeps track of what systems are running, where they are, and who is responsible for them.  More often than not the answer is, “A guy, who keeps track of the list, on his laptop.” This may be how it was always done, but now with tools such as CloudForms, you can automatically tag machines based on location, projects, users, or any other combination of characteristics, and as a bonus, provide usage and costing information back to the user. The guy with the spreadsheet could only dream of providing that much helpful information.

Conclusion
There is never enough time in the day, and the pace of new technologies is accelerating. The only way to keep up is to automate processes. The tools that got you where you are today are not necessarily the same ones that will get you through the next generation of technologies. It will be critical to have tools that work across multiple infrastructure components and provide the visibility and automation required. This is why you need a cloud management platform and where the real power of CloudForms comes into play.
Quelle: CloudForms

We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here’s what we found out.

The post We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here's what we found out. appeared first on Mirantis | Pure Play Open Cloud.
Late last year, we did a number of tests that looked at deploying close to 1000 OpenStack nodes on a pre-installed Kubernetes cluster as a way of finding out what problems you might run into, and fixing them, if at all possible. In all we found several, and though in general we were able to fix them, we thought it would still be good to go over the types of things you need to look for.
Overall we deployed an OpenStack cluster that contained more than 900 nodes using Fuel-CCP on a Kubernetes cluster that had been deployed using Kargo. The Kargo tool is part of the Kubernetes Incubator project and uses the Large Kubernetes Cluster reference architecture as a baseline.
As we worked, we documented issues we found, and contributed fixes to both the deployment tool and reference design document where appropriate.  Here's what we found.
The setup
We started with just over 175 bare metal machines, allocating 3 of them for Kubernetes control plane services (API servers, etcd, the Kubernetes scheduler, and so on); each of the others hosted 5 virtual machines, with every VM used as a Kubernetes minion node.
Each bare metal node had the following specifications:

HP ProLiant DL380 Gen9
CPU - 2x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
RAM - 264G
Storage - 3.0T on RAID on HP Smart Array P840 Controller, HDD - 12 x HP EH0600JDYTL
Network - 2x Intel Corporation Ethernet 10G 2P X710

The running OpenStack cluster (as far as Kubernetes is concerned) consists of:

OpenStack control plane services running on close to 150 pods over 6 nodes
Close to 4500 pods spread across all of the remaining nodes, at 5 pods per minion node

One major Prometheus problem
During the experiments we used the Prometheus monitoring tool to verify resource consumption and the load put on the core system, Kubernetes, and OpenStack services. One note of caution when using Prometheus: deleting old data from Prometheus storage will indeed improve the Prometheus API speed - but it will also delete any previous cluster information, making it unavailable for post-run investigation. So make sure to document any observed issue and its debugging thoroughly!
Thankfully, we had in fact done that documentation, but one thing we've decided to do going forward to prevent this problem is to configure Prometheus to back up data to one of the persistent time series databases it supports, such as InfluxDB, Cassandra, or OpenTSDB. By default, Prometheus is optimized to be used as a real time monitoring / alerting system, and there is an official recommendation from the Prometheus developers team to keep monitoring data retention at only about 15 days to keep the tool working in a quick and responsive manner. By setting up the backup, we can store old data for an extended amount of time for post-processing needs.
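As a rough sketch, on the Prometheus 1.x series of that era such a backup can be set up with the local-retention flag and the experimental InfluxDB remote-storage flags; the InfluxDB URL and database name below are placeholders, not part of the actual setup described here.

# keep ~15 days locally, copy samples to InfluxDB for long-term storage
prometheus \
    -storage.local.retention=360h \
    -storage.remote.influxdb-url=http://influxdb.example:8086/ \
    -storage.remote.influxdb.database=prometheus_longterm \
    -storage.remote.influxdb.retention-policy=default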
Problems we experienced in our testing
Huge load on kube-apiserver
Symptoms
Initially, we had a setup with all nodes (including the Kubernetes control plane nodes) running in a virtualized environment, but the load was such that the API servers couldn't function at all, so they were moved to bare metal.  Still, both API servers running in the Kubernetes cluster were utilising up to 2000% of the available CPU (up to 45% of total node compute performance capacity), even after we migrated them to hardware nodes.
Root cause
All services that are not on the Kubernetes masters (kubelet and kube-proxy on all minions) access kube-apiserver via a local NGINX proxy. Most of those requests are watch requests that lie mostly idle after they are initiated (most timeouts on them are defined to be about 5-10 minutes). NGINX was configured to cut idle connections after 3 seconds, which causes all clients to reconnect and (even worse) restart aborted SSL sessions. On the server side, this makes kube-apiserver consume up to 2000% of the CPU resources, making other requests very slow.
Solution
Set the proxy_timeout parameter to 10 minutes in the nginx.conf configuration file, which should be more than long enough to prevent cutting SSL connections before the requests time out by themselves. After this fix was applied, one api-server consumed only 100% of CPU (about 2% of total node compute performance capacity), while the second one consumed about 200% (about 4% of total node compute performance capacity), with an average response time of 200-400 ms.
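For reference, here is a minimal sketch of the relevant fragment of the local NGINX stream proxy on a minion after the fix; the upstream name, listen address and API server addresses are assumptions for illustration, and only the proxy_timeout value comes from the fix itself.

stream {
    upstream kube_apiservers {
        server 10.0.0.1:6443;    # placeholder API server addresses
        server 10.0.0.2:6443;
    }
    server {
        listen 127.0.0.1:443;
        proxy_pass kube_apiservers;
        proxy_connect_timeout 1s;
        proxy_timeout 10m;       # keep idle watch connections open instead of cutting them after 3 seconds
    }
}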
Upstream issue status: fixed
Make the Kargo deployment tool set proxy_timeout to 10 minutes: issue fixed with a pull request by the Fuel CCP team.
KubeDNS cannot handle large cluster load with default settings
Symptoms
When deploying an OpenStack cluster at this scale, kubedns becomes unresponsive because of the huge load. This ends up with a slew of errors appearing in the logs of the dnsmasq container in the kubedns pod:
Maximum number of concurrent DNS queries reached.
Also, dnsmasq containers sometimes get restarted due to hitting the high memory limit.
Root cause
First of all, kubedns seems to fail often in this architecture, even without load. During the experiment we observed continuous kubedns container restarts even on an empty (but large enough) Kubernetes cluster. Restarts are caused by the liveness check failing, although nothing notable is observed in any logs.
Second, dnsmasq should have taken the load off kubedns, but it needs some tuning to behave as expected (or, frankly, at all) for large loads.
Solution
Fixing this problem requires several steps (a hedged sketch of the resulting settings follows the list):

Set higher limits for the dnsmasq containers: they take on most of the load.
Add more replicas to the kubedns replication controller (we decided to stop at 6 replicas, as that solved the observed issue - for bigger clusters this number might need to be increased even more).
Increase the number of parallel connections dnsmasq should handle (we used --dns-forward-max=1000, which is the setting recommended in the dnsmasq manuals).
Increase the size of the cache in dnsmasq: it has a hard limit of 10000 cache entries, which seems to be a reasonable amount.
Fix kubedns to handle this behaviour in a proper way.
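Here is a hedged sketch of what the first four items translate to in practice. The container image, tag and resource limit values are illustrative assumptions, and the replication controller is assumed to be named kube-dns; only --dns-forward-max=1000, the 10000-entry cache and the 6 replicas come from the list above.

# dnsmasq container inside the kubedns pod template (fragment)
- name: dnsmasq
  image: gcr.io/google_containers/kube-dnsmasq-amd64:1.4   # illustrative image/tag
  args:
    - --dns-forward-max=1000     # allow more parallel forwarded queries
    - --cache-size=10000         # dnsmasq's hard upper limit on cached entries
  resources:
    limits:
      cpu: 200m                  # illustrative limits, raised from the defaults
      memory: 256Mi

# scale out kubedns
kubectl --namespace=kube-system scale rc kube-dns --replicas=6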

Upstream issue status: partially fixed
Items 1 and 2 are fixed by making them configurable in Kargo by the Kubernetes team: issue, pull request.
Others - work has not yet started.
Kubernetes scheduler needs to be deployed on a separate node
Symptoms
During the huge OpenStack cluster deployment against Kubernetes, the scheduler, controller-manager and kube-apiserver start fighting for CPU cycles, as all of them are under a large load. The scheduler is the most resource-hungry, so we need a way to deploy it separately.
Solution
We manually moved the Kubernetes scheduler to a separate node; all other schedulers were manually killed to prevent them from moving to other nodes.
Upstream issue status: reported
Issue in Kargo.
Kubernetes scheduler is ineffective with pod antiaffinity
Symptoms
It takes a significant amount of time for the scheduler to process pods with pod antiaffinity rules specified on them. It spends about 2-3 seconds on each pod, which makes the time needed to deploy an OpenStack cluster of 900 nodes unexpectedly long (about 3 hours for just scheduling). OpenStack deployment requires the use of antiaffinity rules to prevent several OpenStack compute nodes from being launched on a single Kubernetes minion node.
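For context, this is roughly what such a rule looks like. The sketch below uses the affinity field syntax of later Kubernetes releases (clusters of this era expressed the same constraint through the scheduler.alpha.kubernetes.io/affinity annotation), and the pod label is an assumption.

affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: nova-compute                  # hypothetical label on the OpenStack compute pods
        topologyKey: kubernetes.io/hostname    # i.e. at most one such pod per minion node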
Root cause
According to profiling results, most of the time is spent on creating new Selectors to match existing pods against, which triggers the validation step. Basically we have O(N^2) unnecessary validation steps (where N = the number of pods), even if we have just 5 deployment entities scheduled to most of the nodes.
Solution
In this case, we needed a specific optimization that speeds scheduling up to about 300 ms/pod. It’s still slow in terms of common sense (about 30 minutes spent just on pod scheduling for a 900-node OpenStack cluster), but it is at least close to reasonable. This solution lowers the number of very expensive operations to O(N), which is better, but still depends on the number of pods instead of deployments, so there is room for future improvement.
Upstream issue status: fixed
The optimization was merged into master (pull request) and backported to the 1.5 branch, and is part of the 1.5.2 release (pull request).
kube-apiserver has low default rate limit
Symptoms
Different services start receiving “429 Rate Limit Exceeded” HTTP errors, even though kube-apiservers can take more load. This problem was discovered through a scheduler bug (see below).
Solution
Raise the rate limit for the kube-apiserver process via the --max-requests-inflight option. It defaults to 400, but in our case it became workable at 2000. This number should be configurable in the Kargo deployment tool, as bigger deployments might require an even bigger increase.
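As a minimal illustration, assuming the API server runs as a static pod (the manifest path and surrounding flags are assumptions; the flag name and the value 2000 come from the text above):

# /etc/kubernetes/manifests/kube-apiserver.yaml (fragment)
    command:
      - kube-apiserver
      - --max-requests-inflight=2000    # default is 400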
Upstream issue status: reported
Issue in Kargo.
Kubernetes scheduler can schedule incorrectly
Symptoms
When creating a huge amount of pods (~4500 in our case) and faced with HTTP 429 errors from kube-apiserver (see above), the scheduler can schedule several pods of the same deployment on one node, in violation of the pod antiaffinity rule on them.
Root cause
See pull request below.
Upstream issue status: pull request
Fix from Mirantis team: pull request (merged, part of Kubernetes 1.6 release).
Docker sometimes becomes unresponsive
Symptoms
The Docker process sometimes hangs on several nodes, which results in timeouts in the kubelet logs. When this happens, pods cannot be spawned or terminated successfully on the affected minion node. Although many similar issues have been fixed in Docker since 1.11, we are still observing these symptoms.
Workaround
The Docker daemon logs do not contain any notable information, so we had to restart the docker service on the affected node. (During the experiments we used Docker 1.12.3, but we have observed similar symptoms in 1.13 release candidates as well.)
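A hedged example of the kind of manual check one could run on a suspect minion before bouncing the daemon; the 30-second timeout is an arbitrary choice for illustration.

# if the Docker daemon no longer answers a trivial request, restart it
if ! timeout 30 docker ps > /dev/null 2>&1; then
    sudo systemctl restart docker
fi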
OpenStack services don’t handle PXC pseudo-deadlocks
Symptoms
When run in parallel, creation operations for lots of resources were failing with a DBError saying that Percona XtraDB Cluster identified a deadlock and the transaction should be restarted.
Root cause
oslo.db is responsible for wrapping errors received from the DB into proper classes so that services can restart transactions if similar errors occur, but it didn’t expect the error in the format that is being sent by Percona. After we fixed this, however, we still experienced similar errors, because not all transactions that could be restarted were properly decorated in Nova code.
Upstream issue status: fixed
The bug has been fixed by Roman Podolyaka’s CR and backported to Newton. It fixes Percona deadlock error detection, but there’s at least one place in Nova that still needs to be fixed.
Live migration failed with live_migration_uri configuration
Symptoms
With the live_migration_uri configuration, live migration fails because one compute host can’t connect to libvirt on another host.
Root cause
We can’t specify which IP address to use in the live_migration_uri template, so it was trying to use the address from the first interface that happened to be in the PXE network, while libvirt listens on the private network. We couldn’t use the live_migration_inbound_addr, which would solve this problem, because of a problem in upstream Nova.
Upstream issue status: fixed
A bug in Nova has been fixed and backported to Newton. We switched to using live_migration_inbound_addr after that.
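For illustration, the resulting nova.conf fragment might look like the following; the option name comes from the text, while the address value is a placeholder for the compute host's private-network IP.

[libvirt]
# advertise the private-network address for incoming live migrations,
# instead of whatever the first (PXE) interface happens to carry
live_migration_inbound_addr = 10.10.0.15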
The post We installed an OpenStack cluster with close to 1000 nodes on Kubernetes. Here's what we found out. appeared first on Mirantis | Pure Play Open Cloud.
Quelle: Mirantis

Using a standalone Nodepool service to manage cloud instances

Nodepool is a service used by the OpenStack CI team to deploy and manage a pool of devstack images on a cloud server for use in OpenStack project testing.

This article presents how to use Nodepool to manage cloud instances.

Requirements

For the purpose of this demonstration, we’ll use a CentOS system and the
Software Factory
distribution to get all the requirements:

sudo yum install -y --nogpgcheck https://softwarefactory-project.io/repos/sf-release-2.5.rpm
sudo yum install -y nodepoold nodepool-builder gearmand
sudo -u nodepool ssh-keygen -N '' -f /var/lib/nodepool/.ssh/id_rsa

Note that this installs nodepool version 0.4.0, which relies on Gearman and
still supports snapshot based images. More recent versions of Nodepool require
a Zookeeper service and only support diskimage-builder images, though the
usage is similar and easy to adapt.

Configuration

Configure a cloud provider

Nodepool uses os-client-config to define cloud providers and it needs
a clouds.yaml file like this:

cat > /var/lib/nodepool/.config/openstack/clouds.yaml <<EOF
clouds:
  le-cloud:
    auth:
      username: "${OS_USERNAME}"
      password: "${OS_PASSWORD}"
      auth_url: "${OS_AUTH_URL}"
      project_name: "${OS_PROJECT_NAME}"
    regions:
      - "${OS_REGION_NAME}"
EOF

Using the OpenStack client, we can verify that the configuration is correct
and get the available network names:

sudo -u nodepool env OS_CLOUD=le-cloud openstack network list

Diskimage builder elements

Nodepool uses disk-image-builder
to create images locally so that the exact same image can be used across
multiple clouds. For this demonstration we’ll use a minimal element to
set up basic ssh access:

mkdir -p /etc/nodepool/elements/nodepool-minimal/{extra-data.d,install.d}

In extra-data.d, scripts are executed outside of the image, and the one below
is used to authorize ssh access:

cat > /etc/nodepool/elements/nodepool-minimal/extra-data.d/01-user-key <<'EOF'
#!/bin/sh
set -ex
cat /var/lib/nodepool/.ssh/id_rsa.pub > $TMP_HOOKS_PATH/id_rsa.pub
EOF
chmod +x /etc/nodepool/elements/nodepool-minimal/extra-data.d/01-user-key

In install.d, scripts are executed inside the image and the following
is used to create a user and install the authorized_key file:

cat > /etc/nodepool/elements/nodepool-minimal/install.d/50-jenkins <<'EOF'
#!/bin/sh
set -ex
useradd -m -d /home/jenkins jenkins
mkdir /home/jenkins/.ssh
mv /tmp/in_target.d/id_rsa.pub /home/jenkins/.ssh/authorized_keys
chown -R jenkins:jenkins /home/jenkins

# Nodepool expects this dir to exist when it boots slaves.
mkdir /etc/nodepool
chmod 0777 /etc/nodepool
EOF
chmod +x /etc/nodepool/elements/nodepool-minimal/install.d/50-jenkins

Note: all the examples in this article are available in this repository:
sf-elements.
More information on creating elements is available here.

Nodepool configuration

The main Nodepool configuration file is /etc/nodepool/nodepool.yaml:

elements-dir: /etc/nodepool/elements
images-dir: /var/lib/nodepool/dib

cron:
  cleanup: '*/30 * * * *'
  check: '*/15 * * * *'

targets:
  - name: default

gearman-servers:
  - host: localhost

diskimages:
  - name: dib-centos-7
    elements:
      - centos-minimal
      - vm
      - dhcp-all-interfaces
      - growroot
      - openssh-server
      - nodepool-minimal

providers:
  - name: default
    cloud: le-cloud
    images:
      - name: centos-7
        diskimage: dib-centos-7
        username: jenkins
        private-key: /var/lib/nodepool/.ssh/id_rsa
        min-ram: 2048
    networks:
      - name: defaultnet
    max-servers: 10
    boot-timeout: 120
    clean-floating-ips: true
    image-type: raw
    pool: nova
    rate: 10.0

labels:
  - name: centos-7
    image: centos-7
    min-ready: 1
    providers:
      - name: default

Nodepool uses a gearman server to get node requests and to dispatch
image rebuild jobs. We'll use a local gearmand server on localhost.
Thus, Nodepool will only respect the min-ready value and it won't
dynamically start nodes.

Diskimages define images’ names and dib elements. All the elements
provided by dib, such as centos-minimal, are available; here is the
full list.

Providers define specific cloud provider settings such as the network name or
boot timeout. Lastly, labels define generic names for cloud images
to be used by jobs definition.

To sum up, labels reference images in providers that are constructed
with disk-image-builder.

Create the first node

Start the services:

sudo systemctl start gearmand nodepool nodepool-builder

Nodepool will automatically initiate the image build, as shown in
/var/log/nodepool/nodepool.log: WARNING nodepool.NodePool: Missing disk image centos-7.
Image building logs are available in /var/log/nodepool/builder-image.log.

Check the building process:

# nodepool dib-image-list
+----+--------------+-----------------------------------------------+------------+----------+-------------+
| ID | Image        | Filename                                      | Version    | State    | Age         |
+----+--------------+-----------------------------------------------+------------+----------+-------------+
| 1  | dib-centos-7 | /var/lib/nodepool/dib/dib-centos-7-1490688700 | 1490702806 | building | 00:00:00:05 |
+----+--------------+-----------------------------------------------+------------+----------+-------------+

Once the dib image is ready, nodepool will upload the image:
nodepool.NodePool: Missing image centos-7 on default
When the image fails to build, nodepool will try again indefinitely;
look for "after-error" in builder-image.log.

Check the upload process:

# nodepool image-list
+----+----------+----------+----------+------------+----------+-----------+----------+-------------+
| ID | Provider | Image    | Hostname | Version    | Image ID | Server ID | State    | Age         |
+----+----------+----------+----------+------------+----------+-----------+----------+-------------+
| 1  | default  | centos-7 | centos-7 | 1490703207 | None     | None      | building | 00:00:00:43 |
+----+----------+----------+----------+------------+----------+-----------+----------+-------------+

Once the image is ready, nodepool will create an instance, logging
nodepool.NodePool: Need to launch 1 centos-7 nodes for default on default:

# nodepool list
+----+----------+------+----------+---------+---------+--------------------+--------------------+-----------+------+----------+-------------+
| ID | Provider | AZ   | Label    | Target  | Manager | Hostname           | NodeName           | Server ID | IP   | State    | Age         |
+----+----------+------+----------+---------+---------+--------------------+--------------------+-----------+------+----------+-------------+
| 1  | default  | None | centos-7 | default | None    | centos-7-default-1 | centos-7-default-1 | XXX       | None | building | 00:00:01:37 |
+----+----------+------+----------+---------+---------+--------------------+--------------------+-----------+------+----------+-------------+

Once the node is ready, you have completed the first part of the process
described in this article and the Nodepool service should be working properly.
If the node goes directly from the building to the delete state, Nodepool will
try to recreate the node indefinitely. Look for errors in nodepool.log.
One common mistake is to have an incorrect provider network configuration;
you need to set a valid network name in nodepool.yaml.

Nodepool operations

Here is a summary of the most common operations:

Force the rebuild of an image: nodepool image-build image-name
Force the upload of an image: nodepool image-upload provider-name image-name
Delete a node: nodepool delete node-id
Delete a local dib image: nodepool dib-image-delete image-id
Delete a glance image: nodepool image-delete image-id

The Nodepool “check” cron job periodically verifies that nodes are available.
When a node is shut down, Nodepool will automatically recreate it.
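For example, you can verify this behaviour by stopping the instance from the cloud side and watching the check cron replace it (the server name below is the one created earlier, and OS_CLOUD matches the clouds.yaml defined above):

sudo -u nodepool env OS_CLOUD=le-cloud openstack server stop centos-7-default-1
# within the next check interval (15 minutes here), nodepool deletes and relaunches the node
nodepool list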

Ready to use application deployment with Nodepool

As a cloud developer, it is convenient to always have access to a fresh
OpenStack deployment for testing purposes. It’s easy to break things and it
takes time to recreate a test environment, so let’s use Nodepool.

First we’ll add a new element
to pre-install the typical RDO requirements:

diskimages:
  - name: dib-rdo-newton
    elements:
      - centos-minimal
      - nodepool-minimal
      - rdo-requirements
    env-vars:
      RDO_RELEASE: "ocata"

providers:
  - name: default
    images:
      - name: rdo-newton
        diskimage: dib-rdo-newton
        username: jenkins
        min-ram: 8192
        private-key: /var/lib/nodepool/.ssh/id_rsa
        ready-script: run_packstack.sh

Then using a ready-script,
we can execute packstack to deploy services after the node has been created:

labels:
  - name: rdo-ocata
    image: rdo-ocata
    min-ready: 1
    ready-script: run_packstack.sh
    providers:
      - name: default

Once the node is ready, use nodepool list to get the IP address:

# ssh -i /var/lib/nodepool/.ssh/id_rsa jenkins@node
jenkins$ . keystonerc_admin
jenkins (keystone_admin)$ openstack catalog list
+----------+----------+-------------------------------+
| Name     | Type     | Endpoints                     |
+----------+----------+-------------------------------+
| keystone | identity | RegionOne                     |
|          |          |   public: http://node:5000/v3 |

To get a new instance, either terminate the current one, or manually delete it
using nodepool delete node-id.
A few minutes later you will have a fresh and pristine environment!
Quelle: RDO

Help DevOps teams respond faster with smarter alerts

Is your DevOps team drowning in alerts? Are you seeing an increase in escalations to management? Do you have new, tougher service issue response times to meet? Do you need to collaborate across teams to fix issues faster?
As companies scale up DevOps activity to deploy and support more new application services and updates, business leaders need to assure service levels during the next phase of growth. You will need to address three common growing pains:  not getting the right alerts soon enough; getting too many notifications, making it difficult to know what to prioritize among the clutter; and getting the wrong info at the wrong time.
Consider centralizing alert notification management
It may be time to get more efficient at responding to alerts. For a quick first response, you want the right team—and right team member—to be notified. For example, cloud database issues should automatically flow to the database expert – right?
Look at introducing some support to quickly set up on-duty and on-call schedules and shift patterns for weeks in advance. That way there will be no coverage gaps and you can streamline alert notifications to the right people at the right time.
For faster response, individuals can tailor the alert method to their preference, such as email, SMS, voice call or Slack. It’s also an advantage if your operations support team members can acknowledge a critical alert while away from the desk via SMS or a mobile app, like when picking up a well-deserved latte at the café. It goes without saying that both the team and managers benefit from avoiding unnecessary escalations.
Collaborate to resolve alerts with ChatOps
For issues requiring broader input for fast diagnosis, you want to enable real-time collaboration. With ChatOps, the term coined by GitHub, stakeholders from different teams see alerts posted to a channel like Slack and have the conversations needed to diagnose the issue quickly.
Once you are up-and-running, you may want to tailor alert notification policies to help the team achieve more demanding requirements. For example, if you encounter issues that are taking too long to diagnose, you may want to customize a chain of escalations to bring in teams from the broader organization who can help troubleshoot.
Cloud operations can benefit from new management capabilities to get more efficient. Your first easy-to-take step? Adopt a policy-driven approach to delivering the right alerts to the right people at the right time.
You can start a 60-day free trial of IBM Alert Notification here.
The post Help DevOps teams respond faster with smarter alerts appeared first on news.
Quelle: Thoughts on Cloud