Open Source Data Analytics: Part of Your Standard-Issue Cloud Toolkit


DENNIS HUO: Hi, I’m Dennis Huo. And I just want to show you how
simple, smooth, and effective running open source
big data software can be on Google Cloud Platform. Now, let’s just
begin by considering what we might expect out of any
modern programming language. It seems it’s not quite
enough to be Turing complete. But we probably at least
expect some standard libraries for common things like say,
sorting a one-megabyte list in memory. Well, nowadays, it
seems like everybody has a lot more
data lying around. So what if it could be just as
easy to sort a terabyte of data using hundreds or even
thousands of machines? More recently
programming languages have had to incorporate much
stronger support for things like threads and synchronization
as we all moved on to multi-processor,
multi-core systems. Nowadays, it’s pretty
easy to distribute work onto hundreds of threads. What if it could be just
as easy to distribute work onto hundreds of machines? Nowadays, it’s
also pretty common to find some fairly
sophisticated packages for all sorts of things
like cryptography, number theory, and graphics in
our standard libraries. What if we could also have state
of the art distributed machine learning packages as part
of our standard libraries? Well, if you’ve been watching
the code snippets over here the last few
slides, you may have noticed that none of what I
just said is hypothetical. It’s all already possible
right now, today. Which is fortunate because just
as we spent the last decade finally getting the
hang of programming for multi-threaded systems
on a single machine, nowadays it seems more and more common
to find yourself developing for a massively distributed
system of hundreds of machines. And with these changes
in the landscape, it’s pretty easy to
imagine that big data could be a perfect match for the cloud. All the key strengths of the cloud
like on-demand provisioning, seamless scaling, the separation
of storage from computation, and of course, just
having such easy access to such a wide variety
of services and tools. These all contribute
to opening up a whole new universe
of possibilities with countless ways to
assemble these building blocks into something
new and unique. And through this
shared evolution of big data and
distributed computing, big data has become a
core ingredient of innovation for companies big and
small, old and new alike. Take, for example, how streaming
media companies like Netflix, Pandora, and Last.fm
have contributed to changing the
way that consumers expect content to be presented. They’ve applied large scale
distributed machine learning techniques to power these
personalized recommendation engines. Millions of users
around the world now expect tailor-made
entertainment as the norm. As another example, just taking
a look at our Google’s roots in web search and along side
a multitude of other search engines, news
aggregators, social media, and e-commerce
sites, we’ve all long grappled with applying
big data technologies just to tackle the sheer magnitude
of ever-growing web content. Users now expect
to find a needle in a haystack hundreds
of thousands of times, every second of every day. And amidst this
rise of big data, it’s not that the
data or the tools have suddenly become magical. All these innovations
ultimately come from developers just like you, always
finding new ways to put it all together so
that big data can still mean something completely
different to each person or entrepreneur. On Google Cloud Platform,
through a combination of cloud services and a wealth
of open source technologies like Apache Hadoop
and Apache Spark, you'll have everything
you need to focus on making a real impact instead
of having to worry about all the grungy little details
just to get started. Now for an idea of what
this all might look like, let’s follow a data set in
the cloud on its journey through a series of
processing and analytic steps. And we’ll watch as it
undergoes its transformation from raw data into
crucial insights. Now what I have
here is a set of CSV files from the Centers for
Disease Control containing summary birth data
for the United States between 1969 and 2008. These files are just sitting
here in Google Cloud Storage. And, as you can see, it's pretty
easy to peek here and there just using gsutil cat.
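For reference, peeking at one of these files might look roughly like this; the bucket and file names here are just placeholders, not the actual paths from the demo:

    gsutil ls gs://my-bucket/natality/*.csv
    gsutil cat gs://my-bucket/natality/natality1988.csv | head -n 5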
Now suppose we want to answer some questions by analyzing this data like, for
example, whether there’s a correlation between cigarette
usage and birth weight. At more than 100
million rows, it turns out we have at least
100 times too much data to fit into any normal
spreadsheet program. So that means it’s time to
bring out the heavy machinery. With our command line
tool, bdutil, and just these two simple commands, you
can have your very own 100 VM cluster fully loaded up
with Apache Hadoop, Spark, and Shark, ready to go just five
minutes or so after kicking it off. Once it's done, bdutil will
print out a command for SSHing into your cluster.
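As a rough sketch of that workflow (the exact flag spellings vary between bdutil releases, and the bucket name is a placeholder), the two commands amount to a deploy invocation followed by a way into the cluster:

    # Deploy a 100-worker cluster with the Spark/Shark extension enabled.
    ./bdutil -b my-bucket -n 100 -e extensions/spark/spark_env.sh deploy

    # SSH into the cluster's master node once deployment finishes.
    ./bdutil shell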
Once we're logged into the cluster, one easy way to get
started here is just to spin up a Shark shell. We just type shark and
this prompt will appear. Now, what we’ll be
doing here is creating what’s known as
an external table. So we’ll just provide this
location parameter to Shark and point it at our existing
files in Google Cloud Storage. Shark will go in and
list all the files that match that location. And we can immediately start
querying those files in place, treating them just
like a SQL database.
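In Shark's HiveQL, that external table definition would look something like the following sketch. The column names and bucket path here are stand-ins; the real natality files have many more columns.

    CREATE EXTERNAL TABLE natality (
      year INT,
      weight_pounds DOUBLE,
      cigarettes_per_day INT,
      mother_age INT,
      alcohol_use BOOLEAN
    )
    ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
    LOCATION 'gs://my-bucket/natality/';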
For example, once we have this table loaded, we can select individual columns
and sort on other columns to take a peek at the data. We can also use some
handy built-in functions to calculate some
basic statistics like averages and correlations.
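Against a table like the one sketched above, those queries could look roughly like this (again, the column names are assumptions):

    -- Peek at a few columns, sorted by birth weight.
    SELECT year, weight_pounds, cigarettes_per_day
    FROM natality
    ORDER BY weight_pounds
    LIMIT 20;

    -- Basic statistics: average birth weight, and its correlation with cigarette usage.
    SELECT AVG(weight_pounds), CORR(cigarettes_per_day, weight_pounds)
    FROM natality;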
Now one thing I've noticed about this kind of manual analytics is that while it's a great way
to answer some basic questions, it’s an even better way
to discover new questions. For instance, over here we’ve
found a possible correlation between cigarette usage
and lower birth weights. So could we, perhaps,
apply what we found to build a prediction
engine for underweight births? And what about other factors
like the parent’s age or alcohol consumption? Now, this kind of chain
reaction of answers leading to new questions is part of the
power of big data analytics. With the right tools at hand,
at every step along the way you tend to find new paths
and new possibilities. In our case, let’s
go ahead and try out that idea of
building a prediction engine for underweight births. Now, to do this we’re going
to need some distributed machine learning tools. Traditionally, building a
prediction model on 100 node cluster might have
meant years of study and maybe even getting
a Ph.D. Luckily for us, through the combined efforts
of the open source community and, in this case, especially
UC Berkeley’s AMPLab, Apache Spark already comes
pre-equipped with a state of the art distributed machine
learning library called MLlib. So our Spark cluster
already has everything we need to get started
building our prediction engine. The first thing
we’ll need to do here is just to extract
some of these columns as numerical feature vectors
to use as part of our model. And there’s a few
ways to do this. But since we already
have a Shark prompt open, we’ll just go ahead and
use a select statement to create our new data set. We’ll start out with a rough
rule of thumb here, just defining underweight as being
anything less than 5.5 pounds. And we can just select a
few of these other columns that we want to use as part
of our feature vectors. We’ll use this WHERE clause to
limit and sanitize our data. And we could also
just chop off the data from the year
2008, just for now, to put it in a separate
location, so that we can use that as our test data,
separate from our training data. Now, since we’re doing all
this as a single CREATE TABLE AS SELECT
statement, all we have to do is provide a location
parameter and tell Shark where to put these new files
once it’s created them. We can also go ahead and run
that second query on the data from 2008, so that
we’ll have our test data available in a separate
location, ready to go later.
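Pulling those pieces together, the two CREATE TABLE AS SELECT statements might look something like this sketch. The 5.5-pound cutoff is the rule of thumb from above; the column names, WHERE conditions, and output locations are illustrative stand-ins.

    -- Training data: a numeric label plus a few feature columns, everything before 2008.
    CREATE TABLE natality_features
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'gs://my-bucket/natality_features/'
    AS SELECT
      IF(weight_pounds < 5.5, 1.0, 0.0) AS underweight,
      cigarettes_per_day,
      mother_age,
      IF(alcohol_use, 1, 0) AS alcohol_use
    FROM natality
    WHERE weight_pounds IS NOT NULL
      AND cigarettes_per_day IS NOT NULL
      AND year < 2008;

    -- Test data: the same shape, but only the year 2008, in a separate location.
    CREATE TABLE natality_test_2008
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      LOCATION 'gs://my-bucket/natality_test_2008/'
    AS SELECT
      IF(weight_pounds < 5.5, 1.0, 0.0) AS underweight,
      cigarettes_per_day,
      mother_age,
      IF(alcohol_use, 1, 0) AS alcohol_use
    FROM natality
    WHERE weight_pounds IS NOT NULL
      AND cigarettes_per_day IS NOT NULL
      AND year = 2008;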
Sure enough, after running these queries, we'll find a bunch of new files
have appeared in Google Cloud Storage, which we can look at
just using gsutil or Hadoop FS, for example.
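For instance, either of these would list the newly created files (placeholder path again):

    gsutil ls gs://my-bucket/natality_features/
    hadoop fs -ls gs://my-bucket/natality_features/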
Now, with all our data already prepared, all we have to do is spin up a Spark prompt so
that we have access to MLlib. It just so happens we can
pretty much copy and paste the entire MLlib Getting Started
example for support vector machines, here. And we’ll just make a few
minor modifications to point it at our training data
here, and plug that into Spark. Spark will go ahead and kick
off the distributed job, iterating over the
data set 600 times to train a brand new SVM model.
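As a sketch of what that looks like in the Spark shell, modeled on the MLlib getting-started example (the path and the label-first column order are assumptions carried over from the tables sketched above, and the exact API varies a bit between Spark versions):

    import org.apache.spark.mllib.classification.SVMWithSGD
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.linalg.Vectors

    // Parse the comma-separated files we just wrote out with Shark:
    // the first column is the underweight label, the rest are the features.
    val training = sc.textFile("gs://my-bucket/natality_features/").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }.cache()

    // Kick off the distributed training job, iterating 600 times.
    val model = SVMWithSGD.train(training, 600)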
Now, that will take a couple minutes. And once that's done, we'll
have a fully trained model ready to go to make predictions. Here, for example,
we can just plug in a few examples of underweight
and non-underweight births from
our real data set and see what the model predicts. We can also go in and load
the data from 2008 that we saved separately, and
rerun our model against it to make sure that our error
fraction is still comparable.
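A sketch of that last step, reusing the same assumed parsing and column order as above:

    // Spot-check a couple of hand-made feature vectors
    // (cigarettes per day, mother's age, alcohol use); both are hypothetical examples.
    println(model.predict(Vectors.dense(20.0, 23.0, 1.0)))
    println(model.predict(Vectors.dense(0.0, 30.0, 0.0)))

    // Re-run the model against the held-out 2008 data and check the error fraction.
    val test = sc.textFile("gs://my-bucket/natality_test_2008/").map { line =>
      val parts = line.split(',').map(_.toDouble)
      LabeledPoint(parts.head, Vectors.dense(parts.tail))
    }
    val errorFraction = test
      .map(p => if (model.predict(p.features) == p.label) 0.0 else 1.0)
      .mean()
    println("Error fraction on 2008 data: " + errorFraction)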
Now, taking a look at everything we've done here, I'll bet we've raised more
questions than we’ve answered, and probably planted
more new ideas than we’ve actually implemented. For instance, we could
try to extend this model to predict other
measures of health. Or maybe we can apply
the same principles but to something
other than obstetrics. We could also explore
a whole wealth of other possible
machine learning tools coming out of MLlib. Indeed, as we dive
deeper into any problem we’ll usually find that the
possibilities are pretty much limitless. Now, sadly, since
we don’t quite have limitless time in
this video session, we’ll have to come to an end
of this leg of the journey. Now everything you’ve seen
here is only a tiny peek into the ever ongoing
voyage of data through ever-growing stacks
of big data analytics technologies. Hopefully, with the help
of Google Cloud Platform, your discoveries can reach
farther and spread faster by always having the right
tool for the right job, no matter what you
might come across. Thanks for tuning in. I’m Dennis Huo. And if you want
to find out more, come visit us at
developers.google.com/hadoop.
