The Big Deal on Big Data (Part 3)
Massively Parallel Processing: A Foray into Big Data
As its name implies, Teradata was founded in the late 1970s with the idea of supporting terabyte-sized databases. Coming out of research at Caltech[i], the core idea was to develop a specialized database solution that relied on parallel processing. Essentially, Teradata was looking to scale the database horizontally using a connected set of servers managed by a software layer that would automatically break database problems down across them. This approach, known as massively parallel processing (MPP)[ii], relieved application and database architects of having to design the partitioning strategy themselves. Eventually, Teradata introduced a complete end-to-end solution known as a data appliance, which packaged the hardware, software and storage into a single product to handle very large databases (now up to petabytes in size)[iii].
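To make the idea concrete, here is a minimal sketch of how a software layer might split a query across nodes and combine the results. It is an illustration only, not how Teradata implements MPP: the node count, the `customer_id` distribution key, the sample rows and every function name are assumptions made for the example, and in-process threads stand in for separate servers.

```python
from collections import defaultdict
from concurrent.futures import ThreadPoolExecutor
from zlib import crc32

NUM_NODES = 4  # hypothetical cluster size


def node_for(customer_id: str) -> int:
    """Pick a node by hashing the distribution key (stable across runs)."""
    return crc32(customer_id.encode("utf-8")) % NUM_NODES


def distribute(rows):
    """Spread rows across nodes by customer_id, the job of the software layer."""
    nodes = [[] for _ in range(NUM_NODES)]
    for row in rows:
        nodes[node_for(row["customer_id"])].append(row)
    return nodes


def local_totals(node_rows):
    """Each node aggregates only the rows it holds."""
    totals = defaultdict(float)
    for row in node_rows:
        totals[row["customer_id"]] += row["amount"]
    return dict(totals)


def total_sales_by_customer(rows):
    """Scatter the work to every node in parallel, then gather the partial results."""
    nodes = distribute(rows)
    with ThreadPoolExecutor(max_workers=NUM_NODES) as pool:
        partials = list(pool.map(local_totals, nodes))
    merged = {}
    for partial in partials:
        # Safe to merge directly: each customer lives on exactly one node.
        merged.update(partial)
    return merged


# Made-up sample data for illustration.
sales = [
    {"customer_id": "C1", "amount": 20.0},
    {"customer_id": "C2", "amount": 7.5},
    {"customer_id": "C1", "amount": 5.0},
    {"customer_id": "C3", "amount": 12.0},
]
result = total_sales_by_customer(sales)
```

The caller asks for totals by customer and never sees how rows were placed. That is the sense in which the partitioning strategy is handled by the platform rather than by the architect.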
Today, there are a number of competitors to Teradata, including IBM Netezza, Oracle Exadata, EMC Greenplum, HP Vertica, Microsoft SQL Server Parallel Data Warehouse and ParAccel. All of these solutions are based on the MPP architecture and come in various configurations, from software only to specialized data appliance hardware. IBM Netezza[iv], for example, provides a database appliance that uses specialized blade hardware and tailored integrated circuits to increase the performance of data queries.
How Merkle Uses Massively Parallel Systems to Go Big
In the evolution of building more robust marketing databases, Merkle has deployed numerous MPP databases ranging from just a few to over a dozen terabytes in size. These systems incorporate numerous data feeds (in some cases well over one hundred), including demographic information, promotional data, sales transactions, web feeds, model scores, event registration and attendance, social profiles, research, third-party data (including credit bureaus), media performance and more. The value of these MPP appliances is that they can ingest larger amounts of data, yet still run faster than traditional RDBMS-based systems for analytical tasks such as creating attribute aggregates, performing analytical scoring and supporting business intelligence reporting.
Better Campaign Management and Execution
As an example, one of our major retail clients uses an MPP platform to significantly reduce the end-to-end campaign lifecycle (by 50%-70%) and improve marketing performance. They are able to accomplish this by leveraging the scale and speed of an MPP-based appliance to:
- Leverage a single instance of the data within the appliance (for analytics, reporting and campaigns) that includes the calculated fields needed, as opposed to creating aggregates ad hoc across multiple data marts
- Use a more comprehensive source of data to drive 20-35% response and revenue lifts across marketing campaigns
- Store increased loads of data, including store and e-commerce purchase history, contact history, web traffic and email
- Process campaigns faster by better managing campaign cadence, creating better control groups for identifying incremental marketing opportunities that yield up to $10 million in operating income
Multitenant Solutions
Aside from processing large data sets for individual clients, Merkle is also using MPP technology to host multitenant solutions for digital and customer data integration. (Multitenancy is also core to cloud-based solutions, where it helps drive cost efficiencies.) Sharing space helps to optimize use of the environment and provides savings by not having to procure two sets of systems. Without the scale and performance of these tools, such an arrangement would not be possible. This has been especially true in the collection and management of digital data, which often arrives as multi-gigabyte files that can be processed down into more manageable data sets within the data appliance itself.
In-database Analytics
Another big advantage that we’ve seen with MPP vendors is the introduction of in-database analytics. The traditional approach to performing analytics and scoring of a customer segment is to collect the data into a single database, overlay it with third-party data, extract a sample customer file, create models and scores on the sample data set, finalize the models, extract the entire data set, run the scoring against the entire data set and re-insert the scored data back into the database. All of that back-and-forth data movement is wasted effort, and running those scoring algorithms against millions of customer records outside the database can be slow.
Compare that approach with keeping the customer records within the MPP appliance and, after finalizing models against a sample data set, running all of the scores within the database itself. For one of Merkle’s medical equipment supply clients, we ran a speed comparison between their traditional approach and in-database analytics:
| Testing Parameters | Testing Results |
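The difference between the two approaches can be sketched in a few lines. This is an illustration under stated assumptions rather than the setup used in the comparison above: an in-memory SQLite database stands in for the MPP appliance, and the `customers` table, its columns, the sample rows and the linear model coefficients are all hypothetical.

```python
import sqlite3

# Hypothetical linear model coefficients, assumed to be finalized on a sample.
INTERCEPT, W_ORDERS, W_SPEND = 0.1, 0.05, 0.002

conn = sqlite3.connect(":memory:")  # SQLite stands in for the MPP appliance
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, orders INTEGER, spend REAL, score REAL)"
)
conn.executemany(
    "INSERT INTO customers (id, orders, spend) VALUES (?, ?, ?)",
    [(1, 2, 100.0), (2, 10, 450.0), (3, 0, 0.0)],  # made-up sample rows
)


def score_by_extract(conn):
    """Traditional approach: pull every row out, score it, push the result back."""
    rows = conn.execute("SELECT id, orders, spend FROM customers").fetchall()
    scored = [
        (INTERCEPT + W_ORDERS * orders + W_SPEND * spend, cid)
        for cid, orders, spend in rows
    ]
    conn.executemany("UPDATE customers SET score = ? WHERE id = ?", scored)


def score_in_database(conn):
    """In-database approach: one statement, and no rows leave the database."""
    conn.execute(
        "UPDATE customers SET score = ? + ? * orders + ? * spend",
        (INTERCEPT, W_ORDERS, W_SPEND),
    )


def read_scores(conn):
    """Return scores ordered by customer id, rounded for comparison."""
    cursor = conn.execute("SELECT score FROM customers ORDER BY id")
    return [round(score, 6) for (score,) in cursor]


score_by_extract(conn)
extract_scores = read_scores(conn)

score_in_database(conn)
in_db_scores = read_scores(conn)
```

Both functions produce the same scores. The first moves every customer record out of the database and back again, while the second sends only three numbers to the database and lets it do the work where the data already lives.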
Beyond SQL – NoSQL
Aside from MPP solutions, there is also a whole set of technologies colloquially known as NoSQL[v] (or “Not Only SQL”). Unlike relational or MPP systems, these solutions primarily store data as key-value pairs instead of in a structured schema. These key-value pairs (for example, <“city”, “Columbia”>, <“city”, “New York”>) are then grouped and queried using just the key structure. This allows for extreme horizontal scaling because keys can be easily partitioned across servers. Additionally, because they are designed around horizontal scale, NoSQL solutions can take advantage of commodity (i.e., cheap) hardware for each of the individual nodes. This gets especially interesting when cloud or virtual machines are used to scale the solution: the starting costs of a NoSQL database system can be quite low, yet the system can grow to very large sizes as server nodes are added over time to meet performance needs. As an extreme example, Facebook has been able to scale their NoSQL database to over 15 petabytes of data across more than 2,000 server nodes[vi].
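A toy version of key partitioning shows why this scales so readily. The sketch below is an assumption-laden illustration and not the design of any particular NoSQL product: the class name, the composite key format (`customer:<id>:city`) and the three-node setup are invented for the example, and plain dictionaries stand in for server nodes.

```python
from zlib import crc32


class PartitionedKeyValueStore:
    """Toy key-value store that spreads keys across several in-memory nodes."""

    def __init__(self, num_nodes: int):
        # Each dict stands in for one commodity server.
        self.nodes = [{} for _ in range(num_nodes)]

    def _node_for(self, key: str) -> dict:
        """Hash the key to decide which node owns it."""
        index = crc32(key.encode("utf-8")) % len(self.nodes)
        return self.nodes[index]

    def put(self, key: str, value: str) -> None:
        self._node_for(key)[key] = value

    def get(self, key: str, default=None):
        # Only the owning node is consulted; no other node does any work.
        return self._node_for(key).get(key, default)


store = PartitionedKeyValueStore(num_nodes=3)
store.put("customer:1001:city", "Columbia")
store.put("customer:1002:city", "New York")

city = store.get("customer:1001:city")
missing = store.get("customer:9999:city")
```

Because a lookup touches only the node that owns the key, capacity grows by adding nodes. One caveat about this sketch: with simple modulo hashing, changing the node count reassigns most keys, so distributed stores commonly use schemes such as consistent hashing to limit how much data has to move.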
While there are a number of impressive case studies and companies using NoSQL-based technologies, the commercial offerings in the space are still quite new and immature. The primary organizations providing these tools are open-source foundations such as Apache. The biggest names in NoSQL are Apache Hadoop (using MapReduce and HBase), Apache Cassandra, Project Voldemort and MongoDB. Though these solutions are open source, major Internet companies including Facebook, Google, Twitter, Amazon and Yahoo! stand behind them, having provided much of the initial research and development and continuing to support their evolution. Additionally, foundations such as Apache are backed by major commercial organizations, including IBM and Oracle, that provide dedicated developer resources.
[i] http://www.teradata.com/history/
[ii] http://en.wikipedia.org/wiki/Massively_Parallel_Processing
[iii] http://www.teradata.com/history/
[iv] http://public.dhe.ibm.com/common/ssi/ecm/en/imd14378usen/IMD14378USEN.PDF
[v] First used by Carlo Strozzi in 1998. Perhaps a better name would have been related to the fact that these databases are decidedly not relational – but the name has stuck. See http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
[vi] http://dl.acm.org/citation.cfm?id=1807167.1807278

In today’s world, making effective decisions depends on having good information at your fingertips. But as our ability to collect and analyze vast amounts of this information has grown over the past decade, our capability to use it effectively hasn’t sufficiently matured. It’s very likely that the big investments in the collection and storage of this data aren’t paying off in better decision making.