Just in case you’ve been living with your head underneath a rock, the world appears to have gone “big data” crazy lately. Your customers, your company, and probably your competition have all started to talk about the problem of big data and just exactly what can be done about it. Somehow you are going to have to find a way to work “big data” into your product development definition. No matter what type of product you manage, it sure seems like you need to understand what this problem is – and how it can be solved.
What’s Wrong With How We Handle Data Today?
Before we go running off trying to solve a problem, let’s first make sure that we really have a problem that needs to be solved. If you and I were going to create a database today, how would we go about doing it?
Let’s say that we wanted to create a database to hold name and address information. The simplest way to think about a database is to picture a table. This table has both rows and columns. In our name and address database, we’ll create a new row to hold your address information and we’ll start out by creating a new column to hold your name. We’ll then create 5 more columns and use each one to store one component of your home address: street, apartment number, city, state, and zip code (assuming that you live in the United States).
That’s it! Now we have a very small database: it contains one record (yours) and that record holds 6 pieces of data: your name and your address. Now if we went one step further and added the names and addresses of everyone who lives in your town to this database it would grow from one record to now contain thousands of records, perhaps even millions of records depending on where you live.
Now imagine that you owned a flower shop in your town. One day you discover that you have too many roses. You’d like to send a postal letter to everyone who lives in the area around your store and remind them that a great way to say “I love you” is by giving someone roses. You don’t want to send this letter to everyone in town because if they live too far away they won’t make the drive to your store and you’d just be wasting the money to send it.
You can now go to our new database and ask it a question: please provide me with a list of all of the names and addresses for people whose address has the same zip code as my store (this means that they live nearby). Once the database provides you with this list, you can address all of your letters and sell your roses.
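To make the table-and-question idea concrete, here is a minimal sketch in Python. It is only an illustration: the `Resident` class, the sample names and addresses, and the `nearby` function are all made up for this example, and a real database would be asked the same question in a query language rather than with a loop. The question is answered by matching on the zip code column of the table described above.

```python
from dataclasses import dataclass


@dataclass
class Resident:
    """One row of the table: a name plus the five address columns."""
    name: str
    street: str
    apartment: str
    city: str
    state: str
    zip_code: str


# Made-up sample rows; a real town would have thousands or millions of these.
residents = [
    Resident("Ann Lee", "12 Oak St", "", "Tampa", "FL", "33602"),
    Resident("Bob Ruiz", "9 Pine Ave", "4B", "Tampa", "FL", "33647"),
    Resident("Cy Park", "77 Elm Rd", "", "Tampa", "FL", "33602"),
]


def nearby(rows, store_zip):
    """Return every row whose zip code matches the store's zip code."""
    return [row for row in rows if row.zip_code == store_zip]


# The flower shop's zip code is also a made-up value.
mailing_list = nearby(residents, "33602")
for row in mailing_list:
    print(row.name, "-", row.street, row.city, row.state, row.zip_code)
```

Notice that the function has to look at every row to answer the question. That is fine for one town, and it is exactly what becomes painful when the table gets very large.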
Say Hello To The Hadoop Distributed File System
The type of database system that we just described has worked very well for the past 40 years. However, in the past 15 years problems have started to show up because of big data. A little company called Google was one of the first to run into this problem. Back in 2002, Google wanted to index the world wide web every day – talk about a lot of data!
Let’s think about a challenging problem. What if we wanted to create a database that contained all of the data that was collected as a part of the last U.S. census? There are roughly 330M people living in the United States. If each answered 100 census questions, then that is a database with 330M rows and 100 columns – one big database!
Even if we were able to fit it onto a storage system that our little database engine from the last example could use, it would take a week or more to generate an answer to a question that we asked it. Don’t even think about having multiple people use it at the same time. If you could figure out a way to solve this problem, then that would be something that you could add to your product manager resume.
A better way to handle big data was needed. A researcher named Doug Cutting stumbled across a couple of papers that Google had published that talked about how they had solved the problem of indexing an ever-growing world wide web in a reasonable amount of time. Doug realized that with some work, he might be able to use these ideas to create a database that could handle very large data sets. With this idea, the Hadoop database system was born.
When it comes to big data, the first problem that has to be solved is how to store all of that data. No matter how you slice it, it’s going to take a lot of hard drives. The Hadoop distributed file system tackles the problem in the following way.
The fundamental unit of a Hadoop computer is the “node”. A node is a cheap processor, some memory, and one or more disk drives (generally a handful of drives per node). Put a bunch of nodes together and you’ve got a “rack”. Put a bunch of racks together and now you’ve got a “cluster”.
First the data is broken up into large, fixed-size “file units” (Hadoop calls them blocks). Each one is 64 MB by default in early versions of Hadoop and 128 MB in later ones. The file units are then stored on disks spread across the cluster. Since any disk in the cluster might fail at any time, multiple copies of each file unit (generally 3 copies) are stored on different nodes at the same time. Although you are going to need to have a lot of disk drives, you have now solved your storage problem for your big data.
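The split-and-copy idea can be sketched in a few lines of Python. This is a toy model, not Hadoop code: the `split_into_blocks` and `place_blocks` functions, the disk names, and the tiny 4-byte unit size are all assumptions chosen so the result is easy to read. The simple round-robin placement is also an assumption; a real cluster makes smarter choices about where copies go.

```python
UNIT_SIZE = 4   # bytes; a toy value, real file units are vastly larger
REPLICAS = 3    # number of copies kept of each file unit


def split_into_blocks(data: bytes, unit_size: int = UNIT_SIZE) -> list[bytes]:
    """Cut the data into fixed-size pieces; the last piece may be shorter."""
    return [data[i:i + unit_size] for i in range(0, len(data), unit_size)]


def place_blocks(blocks, disks, replicas=REPLICAS):
    """Assign each piece to `replicas` different disks, round-robin."""
    if replicas > len(disks):
        raise ValueError("need at least as many disks as replicas")
    placement = {}
    for index in range(len(blocks)):
        # Consecutive disks (wrapping around) are always distinct here
        # because replicas <= len(disks).
        placement[index] = [disks[(index + k) % len(disks)] for k in range(replicas)]
    return placement


blocks = split_into_blocks(b"hello big data")
placement = place_blocks(blocks, ["disk-a", "disk-b", "disk-c", "disk-d"])
for index, disks in placement.items():
    print(index, blocks[index], disks)
```

Because every piece lives on three different disks, losing any single disk never loses data: two other copies of each affected piece are still out there.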
Did Somebody Say MapReduce?
Having all of that data stored will do you no good if you can’t ask the Hadoop database questions and get answers quickly. That’s where the Hadoop MapReduce function comes in.
This function is responsible for taking your question, splitting it up, and sending the pieces to the nodes in the cluster that hold the data. Each node creates an answer for just the data that it holds. MapReduce then collects all of these partial answers and reduces them down into a single answer, which is then returned to you.
What this means is that the problem of searching a very large database has been transformed from a single big problem into a set of smaller, distributed problems. Since the file units are all roughly the same size, each piece of the work takes about the same amount of time, the pieces run side by side, and you’ll have your answer very quickly.
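Here is a minimal sketch of that split, answer, and combine pattern in plain Python. It does not use Hadoop’s own programming interface. The sample data and the `map_phase` and `reduce_phase` function names are made up for illustration, and everything runs on one machine, where a real system would run the first step on many machines at once. The question being asked is “how many people live in each state?”

```python
from collections import Counter
from functools import reduce

# Each inner list stands in for the slice of data held in one place
# (made-up sample data: the state column from a few census rows).
partitions = [
    ["FL", "GA", "FL"],
    ["GA", "FL"],
    ["TX", "FL", "TX"],
]


def map_phase(partition):
    """Runs separately on each slice and produces a partial answer."""
    return Counter(partition)


def reduce_phase(left, right):
    """Merges two partial answers into one."""
    return left + right


# On a real system these partial answers would be computed in parallel.
partial_counts = [map_phase(p) for p in partitions]

# Fold all of the partial answers down into a single final answer.
total = reduce(reduce_phase, partial_counts, Counter())
print(dict(total))
```

No single step ever has to look at all of the data. Each partial answer only sees its own slice, and the final step only sees the small partial answers.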
What All Of This Means For You
Whew! That’s a lot of database talk – why should a product manager care about all of this? Whether your product can make use of a Hadoop database or you are the one who is going to need a Hadoop database in order to process all of the product data that you collect and store, Hadoop is eventually going to be part of your life.
You might not be programming your product’s Hadoop database, but you will be interacting with the people who are. You need to understand how the system works so that you’ll be able to interpret what your database support team is telling you. Consider a working knowledge of Hadoop to be a new addition to your product manager job description.
Take the time to do some studying and find out what situations the Hadoop database is well suited for. Work with your support team to make sure that they design a solution that is going to support your product’s needs both today and tomorrow.
Question For You: Do you think that you should build your own Hadoop computer or use someone else’s in the cloud?
What We’ll Be Talking About Next Time
How many social media ecosystems are out there these days? By my count (if you still include MySpace), there are 9 big ones. As an already overworked product manager working on your product development definition this means that you’ve got an important question that you’re going to have to answer: which ones are you going to use to promote your product and which ones are you going to let fall by the wayside?