Coreio's Take on Site Reliability Engineering

What is SRE and where does it come from?

Coreio’s Chief Technology Officer, Stephen Baird

The IT industry has never met a buzzword it didn’t like. That is only natural: in a field where change is constant and rapid, and where there is a certain level of technical complexity, just getting things done requires that we talk about things in the fastest and simplest way possible. The danger is that if you avoid the complexity entirely and focus on the abstract, you can miss the value in specific innovations that can tangibly make your business’s IT environments work better.

One of today’s buzzwords is an acronym: SRE. It stands for Site Reliability Engineering, and it has a very “trendy” pedigree, having emerged from Google. The term was coined by a VP of Engineering there, Ben Treynor Sloss. His observation, back in the 2000s, was that the divide between two groups was suppressing innovation: System Administrators, who were very serious about protecting the stability of IT environments at all costs, and Developers, who thrived on embracing innovation even if it introduced trial and error into those environments. Sure, environments were stable, with virtually 100% reliability and availability, but they were not progressing or taking advantage of new ideas to make them better.

Ben’s solution was to allow engineers to take over the “operations” side, and thus introduce innovation into systems administration. The Site Reliability Engineering team was born. The name itself is very clever: with these “Dev” teams taking over, the natural inclination was to worry that the environment’s reliability and stability would be compromised in the rush to innovate. The name reminds everyone that reliability is still the bedrock of IT operations.

Why do we think it’s important?

At Coreio, we think two very important principles emerged from the move toward Site Reliability Engineering, and my team lives by them:

An engineer’s focus on the health of the whole system means that when a problem is detected, the diagnostic approach involves monitoring the application and the infrastructure components involved, and how they interact, rather than just looking at the components in isolation. Important issues can be missed with the traditional approach, which focuses on “the culprit” component and doesn’t look deeper. Engineers are also typically very curious about root causes, so they are fanatical about getting that final answer.
By introducing engineers with software expertise, SRE teams benefit from their team members’ predisposition to reject repetitive tasks and traditional approaches, and to replace them with software solutions that allow for greater automation. That automation and standardization create more predictability within the system, so that when the engineers implement changes, they encounter fewer unexpected issues. Additionally, automation frees up resources that would otherwise be dedicated to repetitive tasks, so the team can devote more time to innovation.

How does this change the way things are done at Coreio?

Coreio differentiates itself from its competition through a site reliability engineering orientation that focuses on the availability and resiliency of our clients’ applications and services, and that treats their environments as integrated systems, not as individual technology blocks.

Automation is a core tenet of our site reliability engineering discipline. Through our focus on automation, we improve the uptime of our clients’ servers, networks and, most importantly, their business. Our automation reduces unplanned downtime and keeps planned downtime for patching and maintenance within a small, predictable window. Through automation and standardization we enable our clients to be more secure, reduce costs, improve quality and improve “time to market.”

Getting to root cause is part of SRE, and root cause analysis is core to what we do. It ensures that all problems are addressed and properly remediated, so our clients can be confident in the security and reliability of their systems.

Finally, for us, systems management should not be an exercise in “red, yellow, green”: it is an evolution that requires continuous investment with experts engaged post-launch, continuing to provide daily support and strategic advice to make your systems even better. We are focused not just on the availability and resiliency of your applications and services, but also on proactive maintenance and monitoring to ensure problems are uncovered before they can impact your business.

What does that mean to our clients?

With Coreio’s Site Reliability Engineers managing your Data Centre, you can spend less time worrying about IT processes and malfunctions and more time focused on the tasks that make your business stand out.

We understand that when it comes to your core systems, the need for speed and advancement must be balanced against the obligation to remain both stable and secure. Our deep experience gives us the flexibility to do both, and to offer strategic advice on balancing risk while becoming more agile through Cloud opportunities, standardization, virtualization and automation.

Coreio has 30 years of experience in managing complex infrastructure environments with increasingly high availability, reliability and security requirements, but we don’t rely on traditional approaches. We are always looking to further our continuous improvement with innovative approaches to our business, such as SRE.