
Archive for February, 2012

The Big Deal on Big Data (Part 3)

February 23rd, 2012

Massively Parallel Processing: A Foray into Big Data

As its name implies, Teradata was founded in the late 1970s with the idea of supporting terabyte-sized databases. Coming out of research from Caltech[i], the core idea was to develop a specialized database solution that relied on parallel processing. Essentially, Teradata was looking to horizontally scale the database using a connected set of servers managed by a software layer that would automatically break down database problems. This approach, known as massively parallel processing (MPP)[ii], relieved application and database architects of having to devise the partitioning strategy themselves. Eventually, Teradata introduced a complete end-to-end solution known as a data appliance that packaged the hardware, software and storage into a single system to handle very large databases (now up to petabytes in size)[iii].

Today, there are a number of competitors to Teradata including IBM Netezza, Oracle Exadata, EMC Greenplum, HP Vertica, Microsoft SQL Server Parallel Data Warehouse and ParAccel. All of these solutions are based on the MPP architecture and come in various configurations, from software-only to specialized data appliance hardware. IBM Netezza[iv], for example, provides a database appliance that uses specialized blade hardware and tailored integrated circuits to increase the performance of data queries.

How Merkle Uses Massively Parallel Systems to Go Big

In the evolution of building more robust marketing databases, Merkle has deployed numerous MPP databases ranging from just a few to over a dozen terabytes in size. These systems incorporate numerous data feeds (in some cases well over one hundred) including demographic information, promotional data, sales transactions, web feeds, model scores, event registration & attendance, social profiles, research, 3rd party data (including credit bureaus), media performance and more. The value in using these MPP appliances is that they are able to ingest larger amounts of data, yet still run faster than traditional RDBMS-based systems for analytical tasks such as creating attribute aggregates, performing analytical scoring and supporting business intelligence reporting.

Better Campaign Management and Execution

As an example, one of our major retail clients uses an MPP platform to significantly reduce the end-to-end campaign lifecycle (by 50%-70%) and improve marketing performance. They are able to accomplish this by leveraging the scale and speed of an MPP-based appliance to:

  • Leverage a single instance of the data within the appliance (for analytics, reporting and campaigns) that includes calculated fields needed (as opposed to ad-hoc creation of aggregates across multiple data marts)
  • Use a more comprehensive source of data to drive 20-35% response and revenue lifts across marketing campaigns
  • Store increased loads of data including store / e-commerce purchase history, contact history, web traffic and email
  • Process campaigns faster by better managing campaign cadence to create better control groups for identification of incremental marketing opportunities yielding up to $10 million in operating income

Multitenant Solutions

Aside from processing large data sets for individual clients, Merkle is also using MPP technology to host multitenant solutions for digital and customer data integration. (The concept of multitenancy is core to cloud-based solutions as well to help drive cost efficiencies.) The sharing of space helps to optimize use of the environment and provides savings by not having to procure two sets of systems. Without the scale and performance of these tools, such an arrangement would not be possible. This has especially been true in the collection and management of digital data that often transfers as multiple gigabyte files that can be processed down into more manageable data sets within the data appliance itself.

In-database Analytics

Another big advantage that we’ve seen with MPP vendors is the introduction of in-database analytics. The traditional approach to performing analytics and scoring of a customer segment is to collect the data into a single database, overlay it with 3rd party data, extract a sample customer file, create models / scores on the sample data set, finalize the models, extract the entire data set, run the scoring against the entire data set and re-insert the scores back into the database. All of that back-and-forth data movement is wasted effort, and running those scoring algorithms against millions of customer records outside the database can be slow.

Compare that approach with keeping the customer records within the MPP appliance and, after finalizing models against a sample data set, running all of the scores within the database itself. For one of Merkle’s medical equipment supply clients, we ran a speed comparison between their traditional approach and in-database analytics:

Testing Parameters

  • Customer universe of 49 million records with approximately 350 attributes per record
  • Total customer database size of approximately 35 gigabytes
  • Analytics code included 10 models with up to 50 variables each
  • Tests were conducted against record sizes ranging from 1 million to 49 million incremented by 5 million

Testing Results

  • Overall, the MPP appliance provided sub-linear performance results (that means, as you added in more data, the per record processing time went down!)
  • Traditional scoring time finished in 4 hours but included several days of data transfer resulting in total end-to-end processing time of approximately 4 days
  • The MPP solution finished the scoring in 40 minutes and did not require the extensive data transfers since the scoring was performed within the database
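
To make the difference between the two workflows concrete, here is a minimal sketch in Python. It is not the client's code or any MPP vendor's API: SQLite stands in for the appliance, and the table layout, the three attributes and the model weights are all made up for illustration. The first function pulls every record out, scores it in application code and writes the score back; the second pushes the same formula into a single SQL statement so the records never leave the database.

```python
import sqlite3

# Hypothetical linear model; the weights and attribute names are illustrative only.
INTERCEPT = 2.0
WEIGHTS = {"recency": -0.5, "frequency": 1.5, "spend": 0.01}


def build_demo_db():
    """Create an in-memory database with a tiny, made-up customer table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE customers ("
        "id INTEGER PRIMARY KEY, recency REAL, frequency REAL, "
        "spend REAL, score REAL)"
    )
    conn.executemany(
        "INSERT INTO customers (id, recency, frequency, spend) VALUES (?, ?, ?, ?)",
        [(1, 2.0, 4.0, 100.0), (2, 10.0, 1.0, 20.0)],
    )
    return conn


def score_by_extract(conn):
    """Traditional approach: extract all rows, score outside, re-insert scores."""
    rows = conn.execute(
        "SELECT id, recency, frequency, spend FROM customers"
    ).fetchall()
    scored = [
        (
            INTERCEPT
            + WEIGHTS["recency"] * recency
            + WEIGHTS["frequency"] * frequency
            + WEIGHTS["spend"] * spend,
            customer_id,
        )
        for customer_id, recency, frequency, spend in rows
    ]
    conn.executemany("UPDATE customers SET score = ? WHERE id = ?", scored)


def score_in_database(conn):
    """In-database approach: one statement, no rows leave the database."""
    conn.execute(
        "UPDATE customers "
        "SET score = ? + ? * recency + ? * frequency + ? * spend",
        (INTERCEPT, WEIGHTS["recency"], WEIGHTS["frequency"], WEIGHTS["spend"]),
    )


def scores(conn):
    """Return {customer id: score} for inspection."""
    return dict(conn.execute("SELECT id, score FROM customers"))
```

Both functions produce the same scores; the difference is where the work happens. On an MPP appliance the single statement is also what lets the platform spread the scoring across its nodes, which is where the reduction from 4 hours to 40 minutes (a factor of six, before counting the days of data transfer) would come from.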

Beyond SQL – NoSQL

Aside from MPP solutions, there is also a whole set of technology colloquially known as NoSQL[v] (or “Not Only SQL”). Unlike relational or MPP systems, these solutions primarily use key-value pairs to store data instead of a structured schema. These key-value pairs (for example, <“city”, “Columbia”>, <“city”, “New York”>) are then grouped and queried just using the key structure. This allows for extreme horizontal scaling because keys can be easily partitioned across servers. Additionally, by being designed around the concept of horizontal scale, NoSQL solutions can take advantage of commodity (i.e., cheap) hardware for each of the individual nodes. Where this also gets very interesting is in the use of cloud or virtual machines to scale the solution. That means that the starting costs of a NoSQL database system can be quite low, yet the system can scale to very large sizes by adding server nodes over time to meet performance needs. As an extreme example, Facebook has been able to scale their NoSQL database to over 15 petabytes of data across more than 2,000 server nodes[vi].
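
As a rough illustration of why keys partition so easily, the sketch below hashes each key to pick one of several nodes. It is a toy, not how any particular NoSQL product works: the class name, the key naming scheme and the use of plain in-memory dictionaries as "nodes" are all assumptions made for the example.

```python
import hashlib


def node_for_key(key, node_count):
    """Map a key to a node index using a stable hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % node_count


class TinyKeyValueCluster:
    """Toy key-value store whose 'nodes' are just dictionaries in memory."""

    def __init__(self, node_count):
        self.node_count = node_count
        self.nodes = [dict() for _ in range(node_count)]

    def put(self, key, value):
        # Only the node that owns the key is touched.
        self.nodes[node_for_key(key, self.node_count)][key] = value

    def get(self, key):
        # Reads go straight to the owning node; no other node is consulted.
        return self.nodes[node_for_key(key, self.node_count)].get(key)


cluster = TinyKeyValueCluster(node_count=4)
cluster.put("customer:1001:city", "Columbia")
cluster.put("customer:1002:city", "New York")
print(cluster.get("customer:1001:city"))
```

Because every read and write touches exactly one node, adding nodes adds capacity. One caveat of this simple modulo scheme is that changing the node count remaps most keys; production systems typically use consistent hashing or a similar technique to limit that reshuffling.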

While there are a number of impressive case studies and companies that are using NoSQL-based technologies, the commercial offerings in the space are still quite new and immature. The primary organizations providing these tools are open-source foundations such as Apache. The biggest names in NoSQL are Apache Hadoop (using MapReduce and HBase), Apache Cassandra, Project Voldemort and MongoDB. Though open-source, there are major Internet company players behind these solutions including Facebook, Google, Twitter, Amazon and Yahoo! that provided a lot of the initial research and development of these systems and continue to support their evolution. Additionally, foundations such as Apache are backed by major commercial organizations including IBM and Oracle that provide dedicated developer resources.

 


[i] http://www.teradata.com/history/

[ii] http://en.wikipedia.org/wiki/Massively_Parallel_Processing

[iii] http://www.teradata.com/history/

[iv] http://public.dhe.ibm.com/common/ssi/ecm/en/imd14378usen/IMD14378USEN.PDF

[v] First used by Carlo Strozzi in 1998. Perhaps a better name would have been related to the fact that these databases are decidedly not relational – but the name has stuck. See http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page

[vi] http://dl.acm.org/citation.cfm?id=1807167.1807278

Categories: Digital Entertainment Tags:

The Big Deal on Big Data (Part 2)

February 15th, 2012

The Changing Face of Data

The challenge is not only that the amount of data is growing but that the type of data is changing as well. Traditionally, computer information systems are really good at collecting, processing and analyzing structured data – information that can be described using a structure or schema. The earliest versions of these were known as flat models, originally used by punch-card systems and later mainframe programs, to structure data into fixed-length fields[i]. (These models are still used today in what people call “flat files” that are often used to transfer data between systems.) Today’s modern systems primarily rely on relational models that structure data according to tables of data (made up of rows and columns) and the relations between them. These are stored in major database systems including Microsoft’s SQL Server, Oracle’s RDBMS and IBM’s DB2.

However, the emergence of semi-structured and unstructured data is fueling much of the Internet’s data growth. An example of semi-structured data is an email – it includes structured elements such as from, subject, date and content. However, the message itself can contain anything the user wishes in whatever format they want. Unstructured data examples include pictures, videos, phone conversations, text messages and Tweets. While structured data can be more easily analyzed, semi-structured and unstructured data is more open to interpretation and is more difficult for computer systems to manage. (Just think how hard it would be for you to answer questions about the document you are reading compared to a table of sales statistics.)

This is especially relevant to marketers. The rapid increase in consumer-generated data includes on-line behaviors such as participation in social networks, mobile searching (which now includes location-based data), targeted display ads, data integration across e-commerce / web-sites and digital messaging including email, SMS and texting. As a great example, researchers at Northwestern University used time to respond to emails to glean information about social closeness between users[ii]. The less time it took to respond to an email, the closer their research showed the connection to be. Are you collecting that type of information? There’s a lot of data out there, and there’s a lot of work to fully analyze and understand it.

Today’s Approach to Storing and Processing Data Wasn’t Built for This Explosion

Our heavy reliance on relational data models was built for a different world. In 1970, E.F. Codd wrote his seminal paper[iii] that first described the concepts behind using relational models for data storage. He was primarily concerned about wasted disk space and faster searching of information within larger data sets (larger being relative as Codd only had to deal with kilobytes and megabytes of data). This was at a time when computer resources were expensive and efficiency was extremely valuable. For example, Intel’s first commercial chip (released in 1971) was capable of 92,000 operations per second compared with today’s Quad-core i7 chips that are capable of 177,730,000,000 operations per second[iv]. Storage costs are another area that has seen amazing efficiencies. In 1971, IBM disk drives cost $17,000,000 per gigabyte in today’s dollars[v]. Today, the cost is under $0.10 per gigabyte[vi] and declining quickly.
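
To get a feel for the size of those changes, the short calculation below turns the figures quoted above into improvement factors. The numbers are taken directly from the text; the $0.10 figure is used as an upper bound, so the storage factor is a lower bound.

```python
# Figures quoted in the paragraph above.
OPS_PER_SECOND_1971 = 92_000
OPS_PER_SECOND_TODAY = 177_730_000_000
DOLLARS_PER_GB_1971 = 17_000_000   # in today's dollars
DOLLARS_PER_GB_TODAY = 0.10        # "under $0.10", so treat as an upper bound

# How many times more operations per second a modern chip performs.
compute_factor = OPS_PER_SECOND_TODAY / OPS_PER_SECOND_1971

# How many times cheaper a gigabyte of disk has become (at least).
storage_factor = DOLLARS_PER_GB_1971 / DOLLARS_PER_GB_TODAY

print(f"Compute: about {compute_factor:,.0f} times more operations per second")
print(f"Storage: at least {storage_factor:,.0f} times cheaper per gigabyte")
```

In round terms that is roughly a two-million-fold gain in processing speed and a 170-million-fold drop in storage cost, which is why design trade-offs made to conserve 1970s hardware deserve a second look.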

The building blocks to solving the problems outlined by Codd (matched with the reality of how expensive computers were back then) were to centralize the data store and eliminate as much redundancy of the data set as possible. This helped to speed up searches and ensured data integrity. Yet, forty years later, with vast increases in the amount of data and shrinking costs of computer systems, we still rely on these 1970s innovations to manage our data.

An implication of this is seen in challenges related to how we scale our relational database management systems (RDBMSs). The two major approaches to scaling computer systems are vertical scaling and horizontal scaling:

  • Vertical scaling refers to the ability to add more scale to a single computer node by upgrading things such as the processing power, amount of memory or hard-drive capacity. Think of waiting in line at the grocery store: this approach parallels making the checkout process faster so that people in line behind you wait less.
  • Horizontal scaling refers to the ability to add in additional nodes to manage the workload required of the system. Going back to our grocery store analogy, this approach is similar to adding in additional checkout lines.
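
The grocery store analogy can be put into numbers with a deliberately simple model. The function name and the figures below are made up for illustration, and the model assumes customers spread themselves perfectly evenly across lanes, which real workloads rarely do.

```python
def minutes_to_clear_queue(customers, lanes, customers_per_minute_per_lane):
    """Time to serve everyone, assuming work is split evenly across lanes."""
    return customers / (lanes * customers_per_minute_per_lane)


# One lane serving one customer per minute.
baseline = minutes_to_clear_queue(60, lanes=1, customers_per_minute_per_lane=1.0)

# Vertical scaling: same single lane, but a cashier who is twice as fast.
vertical = minutes_to_clear_queue(60, lanes=1, customers_per_minute_per_lane=2.0)

# Horizontal scaling: the original cashier speed, but a second lane.
horizontal = minutes_to_clear_queue(60, lanes=2, customers_per_minute_per_lane=1.0)

print(baseline, vertical, horizontal)
```

In this idealized model both options halve the wait. The practical difference is in the limits: there is only so fast a single cashier can go, while lanes can keep being added as long as customers can be routed to them.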

Relational database systems have often relied on vertical scaling, which requires expensive hardware and runs into pragmatic limits on what a single computer is capable of processing. Horizontal scaling, while sometimes being more complex, can scale larger and typically costs less. But, since RDBMS-based systems were originally built on a single-computer assumption, they aren’t as amenable to horizontal scale.

To get around this, database architects and administrators have implemented a number of work-arounds to scale horizontally. One approach is to use a master-slave architecture that uses data replication; essentially, data is pushed to other servers that can be used for read-only operations like reporting. Another approach is to use partitioning strategies such as list partitioning, which segregates data across databases (e.g., by country, by first letter of the last name, grouping by zip code, etc.). This allows a degree of horizontal scaling, but there are several significant drawbacks:

  • Often, it’s up to the developer of the database system to make a choice on how to partition the data. While this strategy may work in theory, practice may prove otherwise. By putting the onus of the strategy on the application layer itself, the decision has to be made prior to building the solution, and if production performance demonstrates that the wrong strategy was selected, it will require a fairly significant re-design to mitigate.
  • The partitions themselves are treated as separate data stores. That’s good news in terms of scale and performance, but the challenge is that if you want to combine information across databases, you have to do some pretty computationally expensive joins across system boundaries. That means that performance can suffer in the name of scale.
  • The management and costs can be expensive. Relying on commercial vendors to provide partitioning in more seamless and manageable ways requires an enterprise suite of tools, technologies and expertise. Building and managing those environments can take a lot of resources both in terms of licensing and people costs.
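
The first two drawbacks are easy to see in a small sketch of list partitioning. Everything here is hypothetical: the partition names, the country-to-partition mapping and the record layout are invented for the example, and plain Python lists stand in for separate databases.

```python
# Hypothetical mapping chosen up front by the application developer.
PARTITION_FOR_COUNTRY = {
    "US": "db_americas",
    "CA": "db_americas",
    "DE": "db_europe",
    "FR": "db_europe",
}


class ListPartitionedStore:
    """Toy store that routes each record to a partition by country."""

    def __init__(self, mapping):
        self.mapping = mapping
        self.partitions = {name: [] for name in set(mapping.values())}

    def insert(self, record):
        # The routing rule lives in application code, so changing it later
        # means moving data and rewriting this logic.
        self.partitions[self.mapping[record["country"]]].append(record)

    def find_by_country(self, country):
        # A query that matches the partitioning key touches one partition.
        partition = self.partitions[self.mapping[country]]
        return [r for r in partition if r["country"] == country]

    def total_sales(self):
        # A query that cuts across the key must visit every partition and
        # combine the results in the application layer.
        return sum(
            r["sales"] for rows in self.partitions.values() for r in rows
        )


store = ListPartitionedStore(PARTITION_FOR_COUNTRY)
store.insert({"customer": "A", "country": "US", "sales": 120})
store.insert({"customer": "B", "country": "DE", "sales": 80})
store.insert({"customer": "C", "country": "CA", "sales": 50})
print(store.find_by_country("US"))
print(store.total_sales())
```

Lookups by country stay fast because they hit a single partition, but anything that spans countries has to fan out to every partition. An MPP layer takes over exactly this routing and recombining work so that the application does not have to.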

While these various strategies and work-arounds have allowed RDBMS-based systems to scale to truly impressive levels, the onslaught of data and unnecessary complexity has brought us to an inflection point: managing for “Big Data”.

 


[i] http://en.wikipedia.org/wiki/Flat_file_database#History

[ii] http://news.sciencemag.org/sciencenow/2011/11/e-mail-reveals-your-closest-frie.html

[iii] http://www.seas.upenn.edu/~zives/03f/cis550/codd.pdf

[iv] http://en.wikipedia.org/wiki/Intel_4004

[v] http://www-03.ibm.com/ibm/history/exhibits/system7/system7_press.html. Original was $3,245,000 per GB based on purchase price of $16,225 and capacity of 5 MB (or 0.005 GB). Using inflation calculator at http://www.westegg.com/inflation comparing 1971 dollars to 2010.

[vi] http://www.mkomo.com/cost-per-gigabyte. Alternatively, go to Amazon.com or other supplier and check yourself – the prices continue to go down every day.


The Big Deal on Big Data (Part 1)

February 1st, 2012

In today’s world, making effective decisions depends on having good information at your fingertips. But as our ability to collect and analyze vast amounts of this information has grown over the past decade, our capability to effectively use this information hasn’t sufficiently matured. It’s very likely that the big investments in the collection and storage of this data aren’t paying off in better decision making.

Yet, while we struggle with that gap today, the pace of data growth continues to accelerate. There are now more than 2 billion users of the Internet[i] accessing and generating vast amounts of data. According to Cisco’s most recent Visual Networking Index[ii], Internet traffic increased eightfold over the last five years and will increase another fourfold over the next five. They estimate that by 2015, annual Internet traffic will approach one zettabyte. That’s a staggering amount of data. The gap will continue to grow.

To put that into perspective, let’s start smaller with a petabyte. A petabyte is 10^15 bytes or 1 million gigabytes – capable of storing about 350 million MP3 songs[iii]. Using Gracenote[iv] as an estimate of the total number of songs available (around 97 million) and the release of about 50 albums (or 500 songs) per week, it would take almost ten thousand more years to have a petabyte of professionally recorded music. And a zettabyte is one million petabytes!
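
The arithmetic behind that estimate is sketched below using the inputs quoted above: 2.8 megabytes per song (footnote [iii]), 97 million existing songs and 500 new songs per week. The 52-weeks-per-year conversion and the use of the rounded 350 million figure are assumptions made for the calculation.

```python
PETABYTE_BYTES = 10**15
BYTES_PER_SONG = 2.8 * 10**6       # about 2.8 megabytes per recorded song
SONGS_AVAILABLE = 97_000_000       # estimate of songs available today
NEW_SONGS_PER_WEEK = 500           # about 50 albums of 10 songs each
WEEKS_PER_YEAR = 52                # assumption for converting weeks to years

# How many songs fit in a petabyte (about 357 million, rounded to 350 million).
songs_per_petabyte = PETABYTE_BYTES / BYTES_PER_SONG

# Songs still to be recorded before the catalogue fills a petabyte,
# using the rounded 350 million figure from the text.
songs_still_needed = 350_000_000 - SONGS_AVAILABLE

years_to_fill = songs_still_needed / NEW_SONGS_PER_WEEK / WEEKS_PER_YEAR

print(f"Songs per petabyte: {songs_per_petabyte:,.0f}")
print(f"Years of new releases needed: {years_to_fill:,.0f}")
```

With these inputs the result comes out to roughly 9,700 years of new releases.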

Figure: How Big is Each Byte

Name       Number of Bytes                  Number of Songs   All of Wikipedia[v]
Megabyte   1,000,000                        < 1               < 1
Gigabyte   1,000,000,000                    350               < 1
Terabyte   1,000,000,000,000                350 thousand      One tenth
Petabyte   1,000,000,000,000,000            350 million       100 copies
Exabyte    1,000,000,000,000,000,000        350 billion       100,000 copies
Zettabyte  1,000,000,000,000,000,000,000    350 trillion      100,000,000 copies

 

The reason this is so problematic for on-line marketers is that it underscores the one-to-one marketing “data treadmill” – no matter how much data you collect about a single customer or potential customer, there’s always more to collect as their “digital exhaust”[vi] continues to expand in size and scope. While some of today’s largest marketing databases range into the terabytes of data, future databases will need to expand significantly. But the emphasis will continue to be on applying “big judgment” to “big data”; collecting more data and throwing it at existing processes is just throwing gasoline on the fire.

 


[i] http://www.internetworldstats.com/stats.htm

[ii] http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/VNI_Hyperconnectivity_WP.html

[iii] Assuming about 2.8 megabytes per recorded song.

[iv] http://www.gracenote.com/

[v] http://en.wikipedia.org/wiki/Wikipedia:Database_download. Using 10 terabytes to make the math a bit more straightforward. The size does not include images, just the text.

[vi] http://en.wikipedia.org/wiki/Digital_exhaust#cite_note-digital_exhaust_1-0
