Amazon Redshift introduces pause and resume

Amazon Redshift now supports the ability to pause and resume a cluster, allowing customers to easily suspend on-demand billing while the cluster is not in use. For example, a cluster used for development can now have its billing suspended when it is not needed. While the cluster is paused, you are charged only for the cluster's storage. This provides significant flexibility in managing the operating costs of your Amazon Redshift clusters.
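For teams that manage clusters programmatically, the same operations are exposed through the Redshift API. The following is a minimal sketch using boto3, the AWS SDK for Python; the cluster identifier is a placeholder, and error handling and waiting for the pause or resume to complete are omitted.

import boto3

# Placeholder identifier; substitute the name of your own development cluster.
CLUSTER_ID = "my-dev-cluster"

redshift = boto3.client("redshift")

def pause_dev_cluster(cluster_id: str = CLUSTER_ID) -> None:
    """Pause the cluster to suspend on-demand compute billing."""
    redshift.pause_cluster(ClusterIdentifier=cluster_id)

def resume_dev_cluster(cluster_id: str = CLUSTER_ID) -> None:
    """Resume the cluster when it is needed again."""
    redshift.resume_cluster(ClusterIdentifier=cluster_id)

In practice, such calls could be scheduled (for example, pausing development clusters outside working hours) so that compute charges stop whenever the cluster is idle.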
Source: aws.amazon.com

Finding a problem at the bottom of the Google stack

At Google, our teams follow site reliability engineering (SRE) practices to help keep systems healthy and users productive. There is a phrase we often use on our SRE teams: “At Google scale, million-to-one chances happen all the time.” This illustrates the massive complexity of the system that powers Google Search, Gmail, Ads, Cloud, Android, Maps, and many more. That type of scale creates complex, emergent modes of failure that aren’t seen elsewhere. Thus, SREs within Google have become adept at developing systems to track failures deep into the many layers of our infrastructure. Not every failure can be automatically detected, so investigative tools, techniques, and most importantly, attitude are essential. Rare, unexpected chains of events happen often. Some have visible impact, but most don’t.

This was illustrated in a recent incident that Google users would likely not have noticed. We consider these types of failures “within error budget” events. They are expected, accepted, and engineered into the design criteria of our systems. However, they still get tracked down to make sure they aren’t forgotten and accumulated into technical debt; we use them to prevent this class of failures across a range of systems, not just the one that had the problem. This incident serves as a good example of tracking down a problem once initial symptoms were mitigated, finding underlying causes, and preventing it from happening again, without users noticing. This level of rigor and responsibility is what underlies the SRE approach to running systems in production.

Digging deep for a problem’s roots

In this event, an SRE on the traffic and load balancing team was alerted that some GFEs (Google front ends) in Google’s edge network, which statelessly cache frequently accessed content, were producing an abnormally high number of errors. The on-call SRE was paged. He immediately removed (“drained”) the machines from serving, thus eliminating the errors that might result in a degraded state for customers. This ability to rapidly mitigate an incident is a core competency within Google SRE. Because we have confidence in our capacity models, we know that we have redundant resources to allow for this mitigation at any time.

At this point, our SRE had mitigated the issue with the drain, but he wasn’t done yet. Based on previous similar issues, he knew this type of error is often caused by a transient network issue. After finding evidence of packet loss, isolated to a single rack of machines, our SRE got in touch with the edge networking team, which identified correlated BGP flapping on the router in the affected rack. However, the nature of the flaps hinted at a problem with the machines rather than the router. This indicated that the problem revolved around a particular machine or set of machines.

Further investigation uncovered kernel messages in the GFE machines’ base system log. These errors indicated CPU throttling:

MMM DD HH:mm:ss xxxxxxx kernel: [3220998.149713] CPU16: Package temperature above threshold, cpu clock throttled (total events = 1596886)

The process on the machine responsible for BGP announcements showed higher-than-usual CPU usage, which perfectly correlated with both the onset of the errors and the CPU throttling.
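As a generic illustration of how such throttling events can be surfaced, the sketch below counts thermal-throttling lines per CPU in a syslog file. It is only a sketch based on the message quoted above, not Google’s monitoring tooling; the log path and the regular expression are assumptions.

import re
from collections import Counter

# Hypothetical log location; production fleets ship logs to central monitoring.
SYSLOG_PATH = "/var/log/messages"

# Matches lines like:
#   ... kernel: [...] CPU16: Package temperature above threshold, cpu clock throttled ...
THROTTLE_RE = re.compile(r"kernel: .*?(CPU\d+): Package temperature above threshold")

def count_throttle_events(path: str = SYSLOG_PATH) -> Counter:
    """Count thermal-throttling log lines per CPU so an abnormal machine stands out."""
    counts: Counter = Counter()
    with open(path, encoding="utf-8", errors="replace") as log:
        for line in log:
            match = THROTTLE_RE.search(line)
            if match:
                counts[match.group(1)] += 1
    return counts

if __name__ == "__main__":
    for cpu, events in sorted(count_throttle_events().items()):
        print(f"{cpu}: {events} throttling events")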
The correlation confirmed the theory that the throttling was significant enough to be impactful and measurable by Google’s monitoring system.

The SRE then checked on adjacent machines to find out if there were any other similarly failing systems. Notably, the only machines that were affected were on a single rack; machines on adjacent racks were not affected! Why would a single rack be overheating to the point of CPU throttling when its neighbors were totally unaffected? What is it about the physical support for machines that would cause kernel errors? It didn’t add up.

The SRE then sent the machine to repairs, which means that he filed a bug in our company-wide issue tracking system. In this case, the bug was sent to the on-site hardware operations and management team. This bug was clear and to the point:

Please repair the following: Machines in XXXXXX are seeing thermal events in syslog:
MMM DD HH:mm:ss xxxxxxx kernel: [3220998.149713] CPU16: Package temperature above threshold, cpu clock throttled (total events = 1596886)
This throttling is ultimately causing user harm, so I’ve drained user traffic.

This bug, or ticket, clearly specified the machine(s) that were affected and described the symptoms and actions taken up to that point. At this point, the hardware team took over the investigation and determined the physical issue that resulted in this chain of events in the software. Google’s 24×7 team is composed of many teams, working together to ensure problems are well understood at all levels of the stack.

Finding the cause of a chain of events

So what was the problem? The hardware team’s reply came back:

Hello, we have inspected the rack. The casters on the rear wheels have failed and the machines are overheating as a consequence of being tilted.

Problem solved? Not quite. The rack looked alarmingly like a refrigerator about to tip over. The wheels (casters) supporting the rack had been crushed under the weight of the fully loaded rack. The rack had then physically tilted forward, disrupting the flow of liquid coolant and resulting in some CPUs heating up to the point of being throttled.

The caster got fixed and the rack was returned to proper alignment. But the greater issues of “How did this happen?” and “How can we prevent it?” still needed to be addressed.

The hardware teams discussed potential options, ranging from distributing wheel repair kits to all locations, to improving the rack-moving procedures to avoid damaging the wheels, to improving the method of transporting new racks to data centers during initial build-out.

The team also considered how many existing racks were at risk of similar failures. This led to a systematic replacement of all racks with the same issue, while avoiding any customer impact.

Talk about deep analysis! The SRE tracked the problem all the way from an external, front-end system down to the hardware that holds up the machines. This type of deep troubleshooting happens within Google’s production teams due to clear communication, shared goals, and a common expectation to not only fix problems, but prevent all future occurrences. Another phrase we commonly use on SRE teams is “All incidents should be novel”: they should never occur more than once. In this case, the SREs and hardware operations teams worked together to ensure that this class of failure would never happen again.

This level of rigorous analysis and persistence is a great example of incident response using deep and broad monitoring, and of the culture of responsibility that keeps Google running 24×7.
Google Cloud customers often ask how SRE can work in a hybrid, on-prem, or multi-cloud environment. SRE practices can be applied across teams within an organization and across multiple environments, and they help teams work together during incidents like this one, from traffic management to data center hardware operations. Find out more about the SRE approach to running systems and how your team can adopt SRE best practices.
Source: Google Cloud Platform

How EBSCO delivers dynamic research services with Apigee

Editor’s note: Today we hear from Adam Ray, platform product manager at EBSCO Information Services, a leading provider of research databases, e-journals, magazines, and eBooks. Learn how the company is working with Apigee to connect with customers through APIs.

For more than 70 years, EBSCO has supported research at private and public institutions, including libraries, universities, hospitals, and government organizations. One of the reasons that customers have continued to rely upon us over the decades is that we actively innovate and adapt new technologies to give customers access to the growing pool of digital resources in the information age.

Today we offer many product lines, including research databases, e-journals, magazines, eBooks, and other resources, to connect organizations with the right services for their research needs. Product teams typically created different API solutions upon request to improve customers’ experiences. For instance, customers might request APIs to help customize user interfaces or to deliver multiple EBSCO resources and assets to their users in a unified way.

As the number of solutions using APIs at EBSCO grew, it started getting harder to control traffic and performance in our data center. We didn’t have a good way to regulate API calls, so heavy usage could cause dramatic spikes in traffic and degrade performance. Since we expect the number of APIs and calls to keep growing, we needed to find a better way to manage, secure, and even monetize APIs to maintain a high level of performance for customers.

Working with a market leader

Although we have a lot of technical skill in-house, we wanted to get out of the business of building things for ourselves so that we could focus our resources on polishing and expanding core services. That’s why we decided to look at API management solutions that had all of the capabilities we wanted built in, such as a developer portal, monetization features, diverse policies, and analytics that would help us better understand our customers and how they’re using our APIs.

We started by looking at the Gartner Magic Quadrant for Full Life Cycle API Management to learn more about API market leaders. We talked to all of the top vendors, read literature, and explored how the top solutions worked. The Apigee API Management Platform stood out for both its functionality and flexibility. While many competitors wanted to define the service hosting platform for us, we already have a strong digital ecosystem built around service meshes. Apigee was the only option flexible enough to work alongside our existing systems and architecture.

As we worked more with Apigee, we also discovered that the team there has extensive experience with APIs, and they were eager to share their knowledge. They held workshops and even had “office hours” where they would sit down and answer all of our questions. If we weren’t sure about the best way to solve an issue, the Apigee team was always ready with best practices and insights to help us make the best choice.

Advanced functionality out of the box

We are already impressed by how easy it is to get started with the Apigee platform. Turnkey policies and built-in security standards help us evaluate traffic and minimize spikes that could overload our systems. Apigee has a wide range of features that work straight out of the box, so we don’t have to spend time building a developer portal, enforcing policies, or figuring out how to integrate all of the different components. Instead, internal teams can focus on adding APIs and planning new features for our customers.
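To make the traffic-regulation idea concrete, the sketch below shows a generic token-bucket limiter of the kind an API management layer can enforce in front of a backend. This is an illustration only, not Apigee’s implementation; the rate and burst values are arbitrary.

import time

class TokenBucket:
    """A simple token bucket: smooths out traffic spikes by capping the request rate."""

    def __init__(self, rate_per_second: float, burst: int):
        self.rate = rate_per_second   # tokens added per second
        self.capacity = burst         # maximum burst size
        self.tokens = float(burst)
        self.last_refill = time.monotonic()

    def allow(self) -> bool:
        """Return True if a request may proceed, False if it should be rejected."""
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last_refill) * self.rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Example: cap a caller at 50 requests per second with a burst allowance of 100.
limiter = TokenBucket(rate_per_second=50, burst=100)
if not limiter.allow():
    print("429 Too Many Requests")  # what a gateway would return to the caller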
The developer portal is currently open to just a few existing customers, but we expect it to be an essential component of expanding our API program. The developer portal will give partners, vendors, customers, and other developers a single location to find a comprehensive list of API-related documentation. They can easily learn what APIs are available and how to use them to integrate services, adjust interfaces, or support other needed processes.

A solid foundation for growth

Each of our lines of business plans to expose two to four APIs for a total of 15 APIs on our roadmap. Once we have released the APIs and opened up our developer portal, we can start looking at opportunities for adding revenue streams through monetization. Because our central business model involves subscriptions and billing, we have payment infrastructure in place. But the Apigee monetization capabilities will enable us to set up payment for developers who aren’t currently EBSCO customers. We can create a system for developers to sign up through the developer portal and start paying for any API package that they want to use.

Technology changes rapidly, and our customers constantly adopt new devices and platforms. With Apigee helping us to create and manage more APIs, we have a fast, easy way to connect with all sorts of customer touchpoints and provide excellent support now and in the future.
Source: Google Cloud Platform

Video: OpenShift is Kubernetes

Our very own Burr Sutter has produced a video explaining how Kubernetes and OpenShift relate to one another, and why OpenShift is Kubernetes, not a fork thereof.
Source: OpenShift