AUTHOR: CADE METZ
BUSINESS
DATE OF PUBLICATION: 04.06.16 - from wired.com
TIME OF PUBLICATION: 12:00 PM
HERE’S HOW GOOGLE MAKES SURE IT (ALMOST) NEVER
GOES DOWN
WHEN WAS THE last time you needed to Google
something and Google wasn’t there?
Odds are, you don’t remember that ever happening. Sure,
there are times when you can’t reach Google because your internet connection is
down. But Google’s primary online services, from its search engine to Gmail to
Google Docs and more, are nearly always accessible. The company’s Google Apps
suite, including Gmail and Docs, was available about 99.97 percent of the time
in 2015, according to the company’s own numbers. The world pretty much takes
this for granted, but it’s a remarkable reality. The billions who use Google
hardly stop to consider how Google made something so impressive seem so
mundane.
Google explains the feat in three
words: Site Reliability Engineering. OK, they aren’t the best three words. But
that’s the rather unsexy name Google gave to this seminal philosophy more than
a decade ago. It’s a rather nuanced and expansive philosophy, but it really
boils down to one central idea: Don’t get IT people who
specialize in running Internet services to run your Internet services. Have
software coders run them instead. If you do this, the thinking goes, the
software coders will build tools that can help run the operation without the active involvement of real live people.
“We long for the day when nobody runs anything.”
TODD UNDERWOOD, GOOGLE
“The result of our approach,” writes Googler Ben Treynor
Sloss in a new essay, “is that we end up with a team of people who will quickly
become bored by performing tasks by hand and have the skill set necessary to
write software to replace their previously manual work.”
For many in Silicon Valley, that may
seem like a common idea. This kind of thing is now practiced across the tech
world, from Amazon to Box.com. People call it DevOps—“development” plus
“operations”—an effort to combine the ways of the software coder with the aims
of the systems administrator. But the DevOps movement, embodied by tools like
Chef and Puppet, evolved separately from and
largely after the SRE philosophies that arose inside Google
(and similar ideas that took hold at Amazon). It’s just that Google has kept
largely quiet about this over the last decade, as it often did when the topic was the inner workings of its enormously
efficient online operation.
But the company has entered a new
period, one in which it’s more willing to discuss such things (mainly because it wants to promote the cloud services that allow
outside businesses to run their own software atop its vast network of data
centers and machines). Google has even gone so far as to write a
book about Site Reliability Engineering.
The book is called, well, Site Reliability Engineering. It was just published by O’Reilly, and the essay from Sloss serves
as the first chapter. If you’re into DevOps, it’s a must-read. And even if
you’re not, the opening of the book (the preface, the introduction, and the
first chapter) is a fascinating look at the attitudes that drive the world’s
largest online empire.
For many in tech—and almost everyone outside of tech—system
administration (or operations or whatever you want to call it) is an
afterthought, one of the more boring aspects of computer technology. But Sloss,
officially known as Google’s Vice President for 24/7 Operations, turns this
notion upside down, arguing that site reliability is “the most fundamental
feature of any product.” After all: “A system isn’t very useful if nobody can
use it.”
Ground Zero
Sloss is ground zero for the SRE movement. It began when
Google hired him to run its operations, and it was he who coined the term. “SRE
is what happens when you ask a software engineer to design an operations team,”
he says. “I designed and managed the group the way I would want it to work if I
worked as an SRE myself.”
For Todd Underwood, now an SRE director at Google, it’s only
natural that the company would hire a coder like Sloss for the job. “When
Google was in its infancy, there were so many software engineers who had a
better sense of how things broke and a better sense of how engineering could be
done well,” he tells WIRED. “But not one of them wanted to do any of that by
hand.”
That’s a very Googly thing to say. But Adam Jacob, chief
technology officer at Chef, pretty much agrees, explaining that this is the
expected transition for an online operation that’s growing to such a large
size. “It’s natural to have a conversation to combine software development and
the practical pieces of operation—and to have no real divide between the two,”
he says. “When you look at the problem holistically, you get better results.”
The shift is particularly interesting when you consider that
dev and ops were traditionally opposing forces. The devs wanted to build new
software and change it and get the changes out to the public as fast as
possible. But the ops folks wanted to ensure that nothing went wrong, and the
best way to do that was to keep changes to a minimum. “These are incommensurate
goals,” Underwood says. The trick is that, if you combine dev and ops, you can
start to eliminate their competing aims.
Underwood calls it a “Hegelian thesis-antithesis synthesis.”
He then acknowledges that when he says this, no one really buys it. “People
just don’t read Hegel anymore,” he quips. But the description is spot on. And
once this synthesis was in place, Google accelerated the process by adding all
sorts of other Googly ideas to the mix.
The Error Budget
One big idea is that, in an effort to reduce the conflict
between dev and ops, the company doesn’t strive for 100 percent uptime. The
reality, Sloss writes, is that you don’t need an internet service to be 100
percent available. Users can’t really tell the difference between 100 percent
and, say, 99.999 percent (their laptop or WiFi or electricity or ISP are down
far more than 0.001 percent of the time). If you set a reasonable uptime goal
below 100 percent—an “error budget”—you have more room to make changes and roll
out experiments.
“The use of an error budget resolves the structural conflict
of incentives between development and SRE,” Sloss says. “An outage is no
longer a ‘bad’ thing. It is an expected part of the process of innovation, and
an occurrence that both development and SRE teams manage rather than fear.”
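To make the arithmetic behind an error budget concrete, here is a minimal sketch in Python. It is not taken from Google's book; the function names and the 365-day period are assumptions chosen for illustration. It converts an availability target into the downtime that target permits, and reports how much of that allowance is left.

```python
def allowed_downtime_minutes(availability_target: float, period_days: float = 365.0) -> float:
    """Minutes of downtime a target permits over the period (hypothetical helper).

    availability_target is a fraction between 0 and 1, e.g. 0.99999 for 99.999 percent.
    """
    if not 0.0 <= availability_target <= 1.0:
        raise ValueError("availability_target must be between 0 and 1")
    total_minutes = period_days * 24 * 60
    return (1.0 - availability_target) * total_minutes


def budget_remaining_minutes(availability_target: float,
                             downtime_so_far_minutes: float,
                             period_days: float = 365.0) -> float:
    """Error budget left after the downtime already incurred (negative means overspent)."""
    budget = allowed_downtime_minutes(availability_target, period_days)
    return budget - downtime_so_far_minutes


if __name__ == "__main__":
    # 99.999 percent over a year leaves a little over five minutes of downtime.
    print(round(allowed_downtime_minutes(0.99999), 1))
    # 99.97 percent, the Google Apps figure cited for 2015, leaves roughly 158 minutes.
    print(round(allowed_downtime_minutes(0.9997)))
```

Under this reading, a team that still has budget left can afford riskier launches, while a team that has overspent would slow down until the next period begins.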
At the same time, the company put rules in place to ensure
that SREs didn’t end up morphing into good old fashioned sysadmins. Basically,
it decreed that no SRE could spend more than 50 percent of his or her time on
traditional operations as opposed to coding. If ops starts to take precedence
over dev on a particular SRE team, Google shifts some of the ops load onto the
team that typically just builds the software: the regular Google software
engineers. “Consciously maintaining this balance between ops and development
work allows us to ensure that SREs have the bandwidth to engage in creative,
autonomous engineering,” Sloss writes, “while still retaining the wisdom
gleaned from the operations side of running a service.”
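The 50 percent rule can be expressed as a simple check. The sketch below is illustrative only: the function names and the hour-based bookkeeping are assumptions, not a description of how Google actually tracks SRE time.

```python
def ops_fraction(ops_hours: float, engineering_hours: float) -> float:
    """Share of total working time spent on operational work (hypothetical helper)."""
    total = ops_hours + engineering_hours
    if total <= 0:
        raise ValueError("total hours must be positive")
    return ops_hours / total


def over_ops_cap(ops_hours: float, engineering_hours: float, cap: float = 0.5) -> bool:
    """True when ops work exceeds the cap, the point at which load would shift to developers."""
    return ops_fraction(ops_hours, engineering_hours) > cap


if __name__ == "__main__":
    # 25 ops hours against 15 coding hours is 62.5 percent ops: over the cap.
    print(over_ops_cap(25, 15))
    # An even 20/20 split sits exactly at the cap and does not exceed it.
    print(over_ops_cap(20, 20))
```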
Chef’s Jacob says that the ratio here—50 percent—isn’t that
important. But he likes the attitude. “This is just economics,” he says.
“There’s always demand for people to do operational bullshit. There is an
almost infinite amount of bullshit that people will ask an operational person
to do. So the idea that you would put a cap on that is legit.”
Google even created strict guidelines for hiring its SREs.
It hires about 50 to 60 percent through exactly the same process that applies
to all other Google engineers, and the rest have about “85 to 99 percent” of
the same skills—plus a “set of technical skills that is useful to SRE but is
rare for most software engineers,” such as an intimate knowledge of the inside
of the UNIX operating system or hardware networking protocols. This too aims to
ensure that dev and ops maintain the proper balance.
The Moonshot That Keeps Google Online
In many ways, this was a new
philosophy. But in their book, as they seek to describe the philosophy, the
Google team uses a much older example. The spiritual forebear of the Google
SREs is Margaret Hamilton, the MIT programmer who spent
the ’60s building software for Apollo spacecraft that would one day land on the
moon. As explained by Hamilton herself—who was interviewed for the
book—part of the culture on the Apollo program “was to learn from everyone and
everything, including from that which one would least expect.”
Hamilton was a coder. But she played an important role in
operations. To show this, the book recounts the day Hamilton’s young daughter,
Lauren, whom she often brought to the computer lab, happened to hit a button and
feed an Apollo pre-launch program into a computer that was running a post-launch
scenario.
This crashed the scenario, and Hamilton tried to add new
error-checking code to the system that would automatically prevent this during
a real flight. Her superiors rejected the idea, arguing that astronauts would
never do such a thing, but on Apollo 8, the astronauts did such a thing.
Luckily, Hamilton had added a workaround to the system documentation. And for
subsequent missions, she added the error-checking code.
“If you come along and say ‘That’s going to break,’ it’s
really not that useful. But if you say: ‘That’s going to break, and let me tell
you how,’ you’ve done something amazing,” Underwood explains. “Here’s a person
who saw that something was going to break and saw how it was going to break and
devised a way to prevent it from breaking.”
That’s DevOps—or, in Google parlance, Site Reliability
Engineering. As three words, it doesn’t sound like much. But it’s an enormously
powerful idea. It has already produced Google. But particularly philosophical
SREs like Underwood have even bigger ambitions. They envision a world where
operations shift even further towards code. “We long for the day,” Underwood
says, “when nobody runs anything.”